Monday, August 19, 2013

Mapping between EMR APIs and Yarn APIs

Before we get started, we need to remember that there are some equivalent concepts between Yarn and EMR. One application in Yarn corresponds to a job flow in EMR, and one job in Yarn corresponds to a job step in EMR.
  1. RunJobFlow
    → ApplicationId  submitApplication(ApplicationSubmissionContext appContext) 
    The RunJobFlow API also includes the cluster instantiation process, while submitApplication assumes that a Yarn cluster is already running.
  2. TerminateJobFlows
    →  void  killApplication(ApplicationId applicationId)  
  3. DescribeJobFlows
    →   ApplicationReport  getApplicationReport(ApplicationId appId)  
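
To make the mapping concrete, here is a minimal sketch of the Yarn side of these three calls using the YarnClient API. This is my own illustration, assuming a recent Hadoop 2.x client library; a real submission would also need an ApplicationMaster container spec, resources, and a queue set on the context before it could actually run.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.api.records.ApplicationId;
    import org.apache.hadoop.yarn.api.records.ApplicationReport;
    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;

    public class EmrToYarnSketch {
        public static void main(String[] args) throws Exception {
            YarnClient yarnClient = YarnClient.createYarnClient();
            yarnClient.init(new Configuration());
            yarnClient.start();

            // RunJobFlow ~ submitApplication (no cluster instantiation here;
            // the Yarn cluster must already be up)
            YarnClientApplication app = yarnClient.createApplication();
            ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
            ApplicationId appId = appContext.getApplicationId();
            yarnClient.submitApplication(appContext);

            // DescribeJobFlows ~ getApplicationReport
            ApplicationReport report = yarnClient.getApplicationReport(appId);
            System.out.println("State: " + report.getYarnApplicationState());

            // TerminateJobFlows ~ killApplication
            yarnClient.killApplication(appId);
            yarnClient.stop();
        }
    }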

Thursday, August 8, 2013

One-Click Hadoop Cluster Launching and Expansion on Nimbus

This tutorial will guide you through the steps of launching and expanding a Hadoop cluster on Nimbus in one click. Being able to launch a Hadoop cluster in one click allows researchers and developers to analyze data easily and time-effectively. In addition, one-click Hadoop cluster expansion makes it possible for data analysts to flexibly add nodes to their clusters whenever more capacity is needed. The steps are as follows.
  1. Get the source files from github.
    git clone https://github.com/kyrameng/OneClickHadoopClusterOnNimbus.git
    
  2. Copy the commands to bin directory and the cluster definition files to sample directory.
    cp launch-hadoop-cluster.sh  your_nimbus_client/bin/
    cp expand-hadoop-cluster.sh  your_nimbus_client/bin/
    cp hadoop-cluster-template.xml  your_nimbus_client/samples/
    cp hadoop-add-nodes.xml  your_nimbus_client/samples/
    
  3. Launch a cluster using the following command.
    bin/launch-hadoop-cluster.sh --cluster samples/hadoop-cluster-template.xml --nodes 1 --conf conf/hotel.conf --hours 1
    
    1. --nodes specifies how many slave nodes you want to have in this cluster. This command will also launch a standalone master node for you.
    2. --hours specifies how long this cluster will run.
    3. --cluster specifies which cluster definition file to use.
    4. --conf specifies the site where the cluster will be launched.
    Output of the above command.
    SSH known_hosts contained tilde:
      - '~/.ssh/known_hosts' --> '/home/meng/.ssh/known_hosts'
    
    Requesting cluster.
      - master-node: image 'hadoop-50GB-scheduler.gz', 1 instance
      - slave-nodes: image 'hadoop-50GB-scheduler.gz', 1 instance
    
    Context Broker:
        https://svc.uc.futuregrid.org:8443/wsrf/services/NimbusContextBroker
    
    Created new context with broker.
    
    Workspace Factory Service:
        https://svc.uc.futuregrid.org:8443/wsrf/services/WorkspaceFactoryService
    
    Creating workspace "master-node"... done.
      - 149.165.148.157 [ vm-148-157.uc.futuregrid.org ]
    
    Creating workspace "slave-nodes"... done.
      - 149.165.148.158 [ vm-148-158.uc.futuregrid.org ]
    
    Launching cluster-042... done.
    
    Waiting for launch updates.
      - cluster-042: all members are Running
      - wrote reports to '/home/meng/futuregrid/history/cluster-042/reports-vm'
    
    Waiting for context broker updates.
      - cluster-042: contextualized
      - wrote ctx summary to '/home/meng/futuregrid/history/cluster-042/reports-ctx/CTX-OK.txt'
      - wrote reports to '/home/meng/futuregrid/history/cluster-042/reports-ctx'
    
    SSH trusts new key for vm-148-157.uc.futuregrid.org  [[ master-node ]]
    
    SSH trusts new key for vm-148-158.uc.futuregrid.org  [[ slave-nodes ]]
    cluster-042
    Hadoop-Cluster-Handle cluster99
    
    Go to the Hadoop web UI to check your cluster status, e.g. 149.165.148.157:50030. This command also creates a unique directory for every launched Hadoop cluster to store its cluster definition files. Check your_nimbus_client/Hadoop-Cluster to explore your clusters.
  4. The last line of the output is the Hadoop cluster handle of the newly launched cluster. We will use this handle to expand the cluster later.
  5. Use the following command to expand the cluster. Specify which cluster you want to add nodes to with the --handle option; the --nodes option specifies how many slave nodes you want to add to that cluster.
    bin/expand-hadoop-cluster.sh --conf conf/hotel.conf --nodes 1 --hours 1 --cluster samples/hadoop-add-nodes.xml --handle cluster99
    
    Its output is as follows.
    SSH known_hosts contained tilde:
      - '~/.ssh/known_hosts' --> '/home/meng/.ssh/known_hosts'
    
    Requesting cluster.
      - newly-added-slave-nodes: image 'ubuntujaunty-hadoop-ctx-pub_v8.gz', 1 instance
    
    Context Broker:
        https://svc.uc.futuregrid.org:8443/wsrf/services/NimbusContextBroker
    
    Created new context with broker.
    
    Workspace Factory Service:
        https://svc.uc.futuregrid.org:8443/wsrf/services/WorkspaceFactoryService
    
    Creating workspace "newly-added-slave-nodes"... done.
      - 149.165.148.159 [ vm-148-159.uc.futuregrid.org ]
    
    
    Launching cluster-043... done.
    
    Waiting for launch updates.
      - cluster-043: all members are Running
      - wrote reports to '/home/meng/futuregrid/history/cluster-043/reports-vm'
    
    Waiting for context broker updates.
      - cluster-043: contextualized
      - wrote ctx summary to '/home/meng/futuregrid/history/cluster-043/reports-ctx/CTX-OK.txt'
      - wrote reports to '/home/meng/futuregrid/history/cluster-043/reports-ctx'
    
    SSH trusts new key for vm-148-159.uc.futuregrid.org  [[ newly-added-slave-nodes ]]
    

Wednesday, August 7, 2013

CentOS Commands Notes

  1. Check routing information
    route
    
  2. Configure a service to start automatically on boot.
    chkconfig service_name on
    
  3. Check the status of a particular port
    netstat -an | grep port_num
    
  4. nfs related commands. Export all directories listed in /etc/exports.
    exportfs -a
    
    Unexport a directory.
    exportfs -u client:/directory
    
  5. Check routing information with the newer ip tool.
    ip route
    
  6. Shut down a bridge
    ifconfig bridge_name down
    
    Delete a bridge.
    brctl delbr bridge_name
    

Wednesday, July 31, 2013

Starting a Yarn Cluster on EC2 via Whirr

In this post, I will show you how to start a Yarn cluster on EC2, again using Whirr! Yes, with Whirr, you can provision a cluster with ONE-CLICK!
  1. Install Whirr; check my other post for details on how to install Whirr from source.
  2. Create your Yarn cluster definition file.
    1. Copy a template file from the recipes.
      cd whirr
      cp recipes/hadoop-yarn-ec2.properties my-yarn-cluster.properties
      
    2. Set your AWS credentials.
      vi ~/.bashrc
      export AWS_ACCESS_KEY_ID=
      export AWS_SECRET_ACCESS_KEY= #Go to your AWS management console to obtain these keys.
      source ~/.bashrc
      
    3. Use this AMI locator to find an image for your instance and set it in my-yarn-cluster.properties accordingly. The following is an example.
      whirr.image-id=us-east-1/ami-1ab3ce73
      whirr.location-id=us-east-1
      
      ♥ If you choose a different location, make sure whirr.image-id is updated too ♥
    4. Comment out the following line:
      #whirr.template=osFamily=UBUNTU,osVersionMatches=10.04,os64Bit=true,minRam=2048
      
  3. Now you are ready to launch the cluster.
    whirr launch-cluster --config my-yarn-cluster.properties
    
    Its output is as follows:
    Running on provider aws-ec2 using identity XXXXXXXXX
    createClientSideYarnProperties yarn.nodemanager.log-dirs:/tmp/nm-logs
    createClientSideYarnProperties yarn.nodemanager.remote-app-log-dir:/tmp/nm-remote-app-logs
    createClientSideYarnProperties yarn.nodemanager.aux-services:mapreduce.shuffle
    createClientSideYarnProperties yarn.nodemanager.aux-services.mapreduce.shuffle.class:org.apache.hadoop.mapred.ShuffleHandler
    createClientSideYarnProperties yarn.nodemanager.delete.debug-delay-sec:6000
    createClientSideYarnProperties yarn.app.mapreduce.am.staging-dir:/user
    createClientSideYarnProperties yarn.nodemanager.local-dirs:/data/tmp/hadoop-${user.name}
    createClientSideYarnProperties yarn.nodemanager.resource.memory-mb:4096
    Started cluster of 2 instances
    Cluster{instances=[Instance{roles=[hadoop-namenode, yarn-resourcemanager, mapreduce-historyserver], publicIp=204.236.250.181, privateIp=10.166.45.20, id=us-east-1/i-a9c6f5cb, nodeMetadata={id=us-east-1/i-a9c6f5cb, providerId=i-a9c6f5cb, name=hadoop-yarn-a9c6f5cb, location={scope=ZONE, id=us-east-1a, description=us-east-1a, parent=us-east-1, iso3166Codes=[US-VA]}, group=hadoop-yarn, imageId=us-east-1/ami-1ab3ce73, os={family=ubuntu, arch=paravirtual, version=10.04, description=ubuntu-us-east-1/images/ubuntu-lucid-10.04-amd64-server-20130704.manifest.xml, is64Bit=true}, status=RUNNING[running], loginPort=22, hostname=ip-10-166-45-20, privateAddresses=[10.166.45.20], publicAddresses=[204.236.250.181], hardware={id=m1.large, providerId=m1.large, processors=[{cores=2.0, speed=2.0}], ram=7680, volumes=[{type=LOCAL, size=10.0, device=/dev/sda1, bootDevice=true, durable=false}, {type=LOCAL, size=420.0, device=/dev/sdb, bootDevice=false, durable=false}, {type=LOCAL, size=420.0, device=/dev/sdc, bootDevice=false, durable=false}], hypervisor=xen, supportsImage=And(ALWAYS_TRUE,Or(isWindows(),requiresVirtualizationType(paravirtual)),ALWAYS_TRUE,is64Bit())}, loginUser=ubuntu, userMetadata={Name=hadoop-yarn-a9c6f5cb}}}, Instance{roles=[hadoop-datanode, yarn-nodemanager], publicIp=54.225.52.2, privateIp=10.164.60.16, id=us-east-1/i-c6cd99ae, nodeMetadata={id=us-east-1/i-c6cd99ae, providerId=i-c6cd99ae, name=hadoop-yarn-c6cd99ae, location={scope=ZONE, id=us-east-1a, description=us-east-1a, parent=us-east-1, iso3166Codes=[US-VA]}, group=hadoop-yarn, imageId=us-east-1/ami-1ab3ce73, os={family=ubuntu, arch=paravirtual, version=10.04, description=ubuntu-us-east-1/images/ubuntu-lucid-10.04-amd64-server-20130704.manifest.xml, is64Bit=true}, status=RUNNING[running], loginPort=22, hostname=ip-10-164-60-16, privateAddresses=[10.164.60.16], publicAddresses=[54.225.52.2], hardware={id=m1.large, providerId=m1.large, processors=[{cores=2.0, speed=2.0}], ram=7680, volumes=[{type=LOCAL, size=10.0, device=/dev/sda1, bootDevice=true, durable=false}, {type=LOCAL, size=420.0, device=/dev/sdb, bootDevice=false, durable=false}, {type=LOCAL, size=420.0, device=/dev/sdc, bootDevice=false, durable=false}], hypervisor=xen, supportsImage=And(ALWAYS_TRUE,Or(isWindows(),requiresVirtualizationType(paravirtual)),ALWAYS_TRUE,is64Bit())}, loginUser=ubuntu, userMetadata={Name=hadoop-yarn-c6cd99ae}}}]}
    
    You can log into instances using the following ssh commands:
    [hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver]: ssh -i /home/meng/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no meng@204.236.250.181
    [hadoop-datanode+yarn-nodemanager]: ssh -i /home/meng/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no meng@54.225.52.2
    To destroy cluster, run 'whirr destroy-cluster' with the same options used to launch it.
    
    
  4. Don't forget to destroy the cluster after you are done using it. Your instances are running on EC2, which only offers limited free usage time. Otherwise you might receive a huge bill after a while, like I did a month ago lol... That's another story...
    whirr destroy-cluster --config my-yarn-cluster.properties
    
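If you would rather drive Whirr from Java than from the shell, the following sketch shows roughly the same launch/destroy cycle through Whirr's programmatic API. This is my own sketch, assuming Whirr 0.8.x and commons-configuration are on the classpath, and that my-yarn-cluster.properties is the file created above.

    import org.apache.commons.configuration.PropertiesConfiguration;
    import org.apache.whirr.Cluster;
    import org.apache.whirr.ClusterController;
    import org.apache.whirr.ClusterControllerFactory;
    import org.apache.whirr.ClusterSpec;

    public class LaunchYarnClusterSketch {
        public static void main(String[] args) throws Exception {
            // Load the same cluster definition used on the command line
            ClusterSpec spec = new ClusterSpec(
                    new PropertiesConfiguration("my-yarn-cluster.properties"));

            ClusterController controller =
                    new ClusterControllerFactory().create(spec.getServiceName());

            // Equivalent of 'whirr launch-cluster --config ...'
            Cluster cluster = controller.launchCluster(spec);
            System.out.println("Instances: " + cluster.getInstances());

            // Equivalent of 'whirr destroy-cluster --config ...'
            controller.destroyCluster(spec);
        }
    }
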
Good luck playing with Whirr :)

Sunday, July 28, 2013

Git Note

Create a branch, stage and commit changes, export a patch, and build the docs with publican.
git checkout -b mybranch
git add filename/directory
git commit -m "comments"
git format-patch master --stdout > your_patch.patch
publican build --format=html --lang=en-US --common_contents=./Common_Contents --config=your_publican.cfg
Configure username and email, and check the configuration.
git config --global user.name "John Doe"
git config --global user.email johndoe@example.com
git config --list
Clone a specific branch.
git clone -b branch_name  repository
The following command will clone the master branch and put it in a directory named your_directory.
git clone repository your_directory
Revert to an earlier version.
git checkout which_commit
Push local changes to remote repositories.
git push https://username:password@github.com/name/something.git

Sunday, July 21, 2013

CloudStack EMR API development series, Episode 1: how to add a basic launchHadoopCluster API to CloudStack.

The aim of my GSoC project is to add EMR-like APIs to CloudStack, so that users can take advantage of Whirr to provision Hadoop clusters on CloudStack. Instead of jumping straight into writing an EMR-compatible API, I created a simple API, launchHadoopCluster, just to get a sense of how an API in CloudStack is developed and how to pass parameters to, and get responses from, the CloudStack web service.

The launchHadoopCluster API has the following structure:
Request Parameters

Parameter Name    Description                                           Required
config            The config file used by Whirr to define a cluster    true

Response Tags

Response Name     Description
whirroutput       The output of running Whirr on CloudStack

  1. Check out the latest CloudStack source code.
    git clone https://git-wip-us.apache.org/repos/asf/cloudstack.git
  2. Create a directory for the plugin under the plugins folder. Under this new directory, create a code hierarchy like the following tree.
    |-- src
    |   `-- org
    |       `-- apache
    |           `-- cloudstack
    |               |-- api
    |               |   `-- command
    |               |       `-- user
    |               |           `-- emr
    |               `-- emr
    |-- target
    `-- test
    
  3. Create a pom.xml for the emr module. The contents of the file are as follows:
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                                 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <artifactId>cloud-plugin-api-emr</artifactId>
      <name>Apache CloudStack Plugin - Elastic Map Reduce Plugin</name>
      <parent>
        <groupId>org.apache.cloudstack</groupId>
        <artifactId>cloudstack-plugins</artifactId>
        <version>4.2.0-SNAPSHOT</version>
        <relativePath>../../pom.xml</relativePath>
      </parent>
      <dependencies>
        <dependency>
          <groupId>org.apache.cloudstack</groupId>
          <artifactId>cloud-api</artifactId>
          <version>${project.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.cloudstack</groupId>
          <artifactId>cloud-utils</artifactId>
          <version>${project.version}</version>
        </dependency>
      </dependencies>
      <build>
        <defaultGoal>install</defaultGoal>
        <sourceDirectory>src</sourceDirectory>
        <testSourceDirectory>test</testSourceDirectory>
      </build>
    </project>
    
  4. Now I can open this project in NetBeans and begin to generate the source files. Navigate to plugins/api/emr/src/org/apache/cloudstack/emr and create an interface ElasticMapReduce.java that extends PluggableService.
    package org.apache.cloudstack.emr;
    
    import com.cloud.utils.component.PluggableService;
    
    public interface ElasticMapReduce extends PluggableService { }
    
  5. Create an implementation of the interface. Name it ElasticMapReduceImpl.java.
    package org.apache.cloudstack.emr;
    
    import java.util.ArrayList;
    import java.util.List;
    import javax.ejb.Local;
    import org.apache.cloudstack.api.command.user.emr.LaunchHadoopClusterCmd;
    import org.apache.log4j.Logger;
    import org.springframework.stereotype.Component;
    
    @Component
    @Local(value = ElasticMapReduce.class)
    
    public class ElasticMapReduceImpl implements ElasticMapReduce{
        private static final Logger s_logger = Logger.getLogger(ElasticMapReduceImpl.class);
        
        public ElasticMapReduceImpl(){
            super();
        }
        
        @Override
        public List<Class<?>> getCommands() {
            List<Class<?>> cmdList = new ArrayList<Class<?>>();
            cmdList.add(LaunchHadoopClusterCmd.class);
            return cmdList;
        }
    }
    
  6. Navigate to plugins/api/emr/src/org/apache/cloudstack/api/command/user/emr and create the source files for the command and its response.
    LaunchHadoopClusterCmd.java
    
    package org.apache.cloudstack.api.command.user.emr;
    
    
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import org.apache.cloudstack.api.APICommand;
    import org.apache.cloudstack.api.BaseCmd;
    import org.apache.cloudstack.api.Parameter;
    
    @APICommand(name = "launchHadoopCluster", responseObject = LaunchHadoopClusterCmdResponse.class, description = "Launch a hadoop cluster using whirr on CloudStack", since ="4.2.0")
    public class LaunchHadoopClusterCmd extends BaseCmd{
        @Parameter(name="config", type=CommandType.STRING, required=true, description="the configuation file to define a cluster")
        
        private String config;
        private String cmdName = "launchHadoopCluster";
        private String output;
        
        @Override
        public void execute()  {
            LaunchHadoopClusterCmdResponse response = new LaunchHadoopClusterCmdResponse();
            response.setObjectName("launchHadoopCluster");
            response.setResponseName(getCommandName());
            
            String cmdToExec = "whirr launch-cluster --config " + config;
            try {
                // Read the process's stdout; getOutputStream() would return the
                // stream feeding the process's stdin, not the command output.
                Process p = Runtime.getRuntime().exec(cmdToExec);
                BufferedReader reader =
                        new BufferedReader(new InputStreamReader(p.getInputStream()));
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    sb.append(line).append('\n');
                }
                output = sb.toString();
            } catch (IOException ex) {
                Logger.getLogger(LaunchHadoopClusterCmd.class.getName()).log(Level.SEVERE, null, ex);
            }
            response.setOutPut(output);
            this.setResponseObject(response);
        }
    
        @Override
        public String getCommandName() {
            return cmdName;
        }
    
        @Override
        public long getEntityOwnerId() {
            return 0;
        }
        
    }
    
    LaunchHadoopClusterCmdResponse.java
    
    package org.apache.cloudstack.api.command.user.emr;
    
    import com.cloud.serializer.Param;
    import com.google.gson.annotations.SerializedName;
    import org.apache.cloudstack.api.ApiConstants;
    import org.apache.cloudstack.api.BaseResponse;
    
    public class LaunchHadoopClusterCmdResponse extends BaseResponse {
        @SerializedName(ApiConstants.IS_ASYNC) @Param(description = "true if api is asynchronous")
        private Boolean isAsync;
        @SerializedName("output") @Param(description = "whirr output")
        private String output;
        
        public LaunchHadoopClusterCmdResponse(){
            
        }
        public void setAsync(Boolean isAsync) {
            this.isAsync = isAsync;
        }
     
        public boolean getAsync() {
            return isAsync;
        }
        public void setOutPut(String output) {
            this.output = output;
        }
    }
    
    
  7. Add the following dependency to cloudstack/client/pom.xml.
    <dependency>
    <groupId>org.apache.cloudstack</groupId>
    <artifactId>cloud-plugin-api-emr</artifactId>
    <version>${project.version}</version>
    </dependency>
    
    Once you have added the emr plugin to the client pom file, Maven will download and link the emr plugin for you during compilation and other goals that require it.
  8. Update client/tomcatconf/componentContext.xml.in and add the following bean:
     <bean id="elasticMapReduceImpl" class="org.apache.cloudstack.emr.ElasticMapReduceImpl" />
    
  9. Update plugins/pom.xml to add the following module.
    <module>api/emr</module>
    
  10. Add the command to client/tomcatconf/commands.properties.in
    launchHadoopCluster=15
    
  11. Now let's compile the code and test it!
    1. Navigate to plugins/api/emr and run:
      mvn  clean install
      
    2. In the cloudstack base directory, run:
      mvn -pl client clean install
      
    3. Start the Management server UI.
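
Once the management server is up, you can exercise the new command over HTTP. Below is a minimal sketch of such a smoke test, assuming the unauthenticated integration API port is enabled (global setting integration.api.port, commonly set to 8096); the config path here is a made-up example.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    public class LaunchHadoopClusterSmokeTest {
        public static void main(String[] args) throws Exception {
            // Hypothetical cluster definition path; replace with your own.
            String config = URLEncoder.encode("/home/meng/my-cluster.properties", "UTF-8");
            URL url = new URL("http://localhost:8096/client/api"
                    + "?command=launchHadoopCluster&config=" + config);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);  // prints the whirroutput response
            }
            in.close();
        }
    }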

Whirr Troubleshooting

This post contains some troubleshooting tips that will hopefully help you get out of troubled waters when you install or use Whirr on CloudStack.
  • Case 1
    java.util.concurrent.ExecutionException: java.io.IOException: Too many instance failed while bootstrapping! 0 successfully started instances while 2 instances failed.

    If your instances go through the states (starting, running, destroying, expunging) quickly, it is highly probable that Whirr cannot log in to the instances to bootstrap them. If the template you use has a hardcoded username/password, add the following line to your cluster definition file to make Whirr aware of that.
    whirr.bootstrap-user=your_user_name:your_password
    
    The following exceptions also imply this type of issue.
    org.jclouds.rest.AuthorizationException: (root:pw[63d90337b21005ea9a4bb6d617d4e54e]@10.244.18.65:22) (root:pw[63d90337b21005ea9a4bb6d617d4e54e]@10.244.18.65:22) error acquiring {hostAndPort=10.244.18.65:22, loginUser=root, ssh=null, connectTimeout=60000, sessionTimeout=60000} (out of retries - max 7): Exhausted available authentication methods
    java.lang.NullPointerException: no credential found for root on node 206224c5-756a-43a4-9925-b4916a1cd585
  • Case 2
    java.util.concurrent.ExecutionException: org.jclouds.cloudstack.AsyncJobException: job AsyncJob{accountId=789c4968-e835-11e2-890d-0023ae94f722, cmd=org.apache.cloudstack.api.command.user.vm.DeployVMCmd, created=Wed Jul 10 11:03:50 EDT 2013, id=f6c6ebde-2694-4c55-82c2-8b6305861d1b, instanceId=null, instanceType=null, progress=0, result=null, resultCode=FAIL, resultType=object, status=FAILED, userId=78a3f050-e835-11e2-890d-0023ae94f722, error=AsyncJobError{errorCode=INSUFFICIENT_CAPACITY_ERROR, errorText=Unable to create a deployment for VM[User|you-makes-me-stronger-e6d]}} failed with exception AsyncJobError{errorCode=INSUFFICIENT_CAPACITY_ERROR, errorText=Unable to create a deployment for VM[User|you-makes-me-stronger-e6d]}
    This implies that the host does not have enough capacity to launch the VM. Go to CloudStack management server UI → Dashboard → System Capacity → Fetch latest to check which resources are running low. Maybe you are running out of assignable IPs or have little secondary storage left. Also, this type of error happens frequently for users who deploy CloudStack on a small scale and already have many failed launch attempts. By default, if you fail to launch a VM, CloudStack will wait 24 hours before reclaiming the resources assigned to the failed VM. Go to CloudStack management server UI → Global Settings and change expunge.delay and expunge.interval to smaller values to ease the situation.
  • Case 3
    com.cloud.exception.ResourceUnavailableException: Resource [Pod:1] is unreachable: Unable to apply dhcp entry on router 
    
    This error occurs when there is an unknown DHCP server on the same subnet as the system VM, but that unknown DHCP server is not included in the CSMS IP address space for management devices. If there is no other DHCP server, simply restart the virtual router.
  • Last but not least, the cloudstack management-server.log and whirr.log are the main logs to help you track errors.