Wednesday, July 31, 2013

Starting a Yarn Cluster on EC2 via Whirr

In this post, I will show you how to start a Yarn cluster on EC2, again using Whirr! Yes, with Whirr, you can provision a cluster with one command!
  1. Install Whirr; see my other post for details on how to install Whirr from source.
  2. Create your Yarn cluster definition file.
    1. Copy a template file from the recipes.
      cd whirr
      cp recipes/hadoop-yarn-ec2.properties my-yarn-cluster.properties
      
    2. Set your AWS credentials.
      vi ~/.bashrc
      export AWS_ACCESS_KEY_ID=
      export AWS_SECRET_ACCESS_KEY= #Go to your AWS management console to obtain these keys.
      source ~/.bashrc
      
    3. Use this AMI locator to find an image for your instance and set it in my-yarn-cluster.properties accordingly. The following is an example.
      whirr.image-id=us-east-1/ami-1ab3ce73
      whirr.location-id=us-east-1
      
      ♥ If you choose a different location, make sure whirr.image-id is updated too ♥
    4. Comment out the following line:
      #whirr.template=osFamily=UBUNTU,osVersionMatches=10.04,os64Bit=true,minRam=2048
      
  3. Now you are ready to launch the cluster.
    whirr launch-cluster --config my-yarn-cluster.properties
    
    Its output is as follows:
    Running on provider aws-ec2 using identity XXXXXXXXX
    createClientSideYarnProperties yarn.nodemanager.log-dirs:/tmp/nm-logs
    createClientSideYarnProperties yarn.nodemanager.remote-app-log-dir:/tmp/nm-remote-app-logs
    createClientSideYarnProperties yarn.nodemanager.aux-services:mapreduce.shuffle
    createClientSideYarnProperties yarn.nodemanager.aux-services.mapreduce.shuffle.class:org.apache.hadoop.mapred.ShuffleHandler
    createClientSideYarnProperties yarn.nodemanager.delete.debug-delay-sec:6000
    createClientSideYarnProperties yarn.app.mapreduce.am.staging-dir:/user
    createClientSideYarnProperties yarn.nodemanager.local-dirs:/data/tmp/hadoop-${user.name}
    createClientSideYarnProperties yarn.nodemanager.resource.memory-mb:4096
    Started cluster of 2 instances
    Cluster{instances=[Instance{roles=[hadoop-namenode, yarn-resourcemanager, mapreduce-historyserver], publicIp=204.236.250.181, privateIp=10.166.45.20, id=us-east-1/i-a9c6f5cb, nodeMetadata={id=us-east-1/i-a9c6f5cb, providerId=i-a9c6f5cb, name=hadoop-yarn-a9c6f5cb, location={scope=ZONE, id=us-east-1a, description=us-east-1a, parent=us-east-1, iso3166Codes=[US-VA]}, group=hadoop-yarn, imageId=us-east-1/ami-1ab3ce73, os={family=ubuntu, arch=paravirtual, version=10.04, description=ubuntu-us-east-1/images/ubuntu-lucid-10.04-amd64-server-20130704.manifest.xml, is64Bit=true}, status=RUNNING[running], loginPort=22, hostname=ip-10-166-45-20, privateAddresses=[10.166.45.20], publicAddresses=[204.236.250.181], hardware={id=m1.large, providerId=m1.large, processors=[{cores=2.0, speed=2.0}], ram=7680, volumes=[{type=LOCAL, size=10.0, device=/dev/sda1, bootDevice=true, durable=false}, {type=LOCAL, size=420.0, device=/dev/sdb, bootDevice=false, durable=false}, {type=LOCAL, size=420.0, device=/dev/sdc, bootDevice=false, durable=false}], hypervisor=xen, supportsImage=And(ALWAYS_TRUE,Or(isWindows(),requiresVirtualizationType(paravirtual)),ALWAYS_TRUE,is64Bit())}, loginUser=ubuntu, userMetadata={Name=hadoop-yarn-a9c6f5cb}}}, Instance{roles=[hadoop-datanode, yarn-nodemanager], publicIp=54.225.52.2, privateIp=10.164.60.16, id=us-east-1/i-c6cd99ae, nodeMetadata={id=us-east-1/i-c6cd99ae, providerId=i-c6cd99ae, name=hadoop-yarn-c6cd99ae, location={scope=ZONE, id=us-east-1a, description=us-east-1a, parent=us-east-1, iso3166Codes=[US-VA]}, group=hadoop-yarn, imageId=us-east-1/ami-1ab3ce73, os={family=ubuntu, arch=paravirtual, version=10.04, description=ubuntu-us-east-1/images/ubuntu-lucid-10.04-amd64-server-20130704.manifest.xml, is64Bit=true}, status=RUNNING[running], loginPort=22, hostname=ip-10-164-60-16, privateAddresses=[10.164.60.16], publicAddresses=[54.225.52.2], hardware={id=m1.large, providerId=m1.large, processors=[{cores=2.0, speed=2.0}], ram=7680, volumes=[{type=LOCAL, 
size=10.0, device=/dev/sda1, bootDevice=true, durable=false}, {type=LOCAL, size=420.0, device=/dev/sdb, bootDevice=false, durable=false}, {type=LOCAL, size=420.0, device=/dev/sdc, bootDevice=false, durable=false}], hypervisor=xen, supportsImage=And(ALWAYS_TRUE,Or(isWindows(),requiresVirtualizationType(paravirtual)),ALWAYS_TRUE,is64Bit())}, loginUser=ubuntu, userMetadata={Name=hadoop-yarn-c6cd99ae}}}]}
    
    You can log into instances using the following ssh commands:
    [hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver]: ssh -i /home/meng/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no meng@204.236.250.181
    [hadoop-datanode+yarn-nodemanager]: ssh -i /home/meng/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no meng@54.225.52.2
    To destroy cluster, run 'whirr destroy-cluster' with the same options used to launch it.
    
    
  4. Don't forget to destroy the cluster after you are done using it. Your instances are running on EC2, which only offers limited-time free usage; otherwise you might receive a huge bill after a while, like I did a month ago lol... That's another story...
    whirr destroy-cluster --config my-yarn-cluster.properties
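The launch and destroy steps above can be wrapped in a small script so the cluster is always torn down when you are finished, even if the session is interrupted. This is only a sketch: the `run` dry-run wrapper is my own addition, and with `DRY_RUN=1` (the default here) it just prints each command instead of executing it.

```shell
# Sketch: launch/destroy lifecycle with guaranteed teardown.
# The run() dry-run wrapper is illustrative; set DRY_RUN=0 to
# actually execute whirr instead of printing the commands.
CONFIG=my-yarn-cluster.properties
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"          # dry run: show the command only
    else
        "$@"                 # real run: execute it
    fi
}

# Destroy the cluster with the same config used to launch it,
# on any exit path (normal exit, Ctrl-C, error).
cleanup() {
    run whirr destroy-cluster --config "$CONFIG"
}
trap cleanup EXIT

run whirr launch-cluster --config "$CONFIG"
# ... run your MapReduce jobs here ...
```

The `trap ... EXIT` is what guarantees the destroy step runs, matching Whirr's requirement that destroy-cluster be invoked with the same options as launch-cluster.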
    
Good luck playing with Whirr :)

Sunday, July 28, 2013

Git Note

git checkout -b mybranch
git add filename/directory
git commit -m "comments"
git format-patch master --stdout > your_patch.patch
publican build --format=html --lang=en-US --common_contents=./Common_Contents --config=your_publican.cfg
Configure username and email and check the configuration.
git config --global user.name "John Doe"
git config --global user.email johndoe@example.com
git config --list
Clone a specific branch
git clone -b branch_name  repository
The following command will clone the master branch and put it in a directory named your_directory
git clone repository your_directory
Revert to an earlier version.
git checkout which_commit
Push local changes to remote repositories.
git push https://username:password@github.com/name/something.git
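The individual commands above chain into a complete branch-and-patch workflow; here is a runnable sketch in a scratch repository (repository, file names, and commit messages are illustrative):

```shell
# Scratch repo demonstrating the branch -> commit -> patch flow.
git init demo-repo && cd demo-repo
git config user.name "John Doe"
git config user.email johndoe@example.com

echo "initial" > notes.txt
git add notes.txt
git commit -m "initial commit"
git branch -M master              # make sure the base branch is named master

git checkout -b mybranch          # create and switch to a topic branch
echo "feature work" >> notes.txt
git add notes.txt
git commit -m "add feature work"

# Export every commit on mybranch that is not on master as one patch file.
git format-patch master --stdout > your_patch.patch
```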

Sunday, July 21, 2013

CloudStack EMR API development series, Episode 1: how to add a basic launchHadoopCluster API to CloudStack.

The aim of my GSoC project is to add EMR-like APIs to CloudStack, so that users can take advantage of Whirr to provision hadoop clusters on CloudStack. Instead of jumping in to write an EMR-compatible API directly, I created a simple API, launchHadoopCluster, just to get a sense of how APIs in CloudStack are developed and how to pass parameters to, and get responses from, the CloudStack web service.

The launchHadoopCluster API has the following structure:
Request parameters

Parameter Name   Description                                          Required
config           The config file used by Whirr to define a cluster   true

Response Tags

Response Name    Description
whirroutput      The output of running whirr on CloudStack

  1. Checkout the latest CloudStack source code.
    git clone https://git-wip-us.apache.org/repos/asf/cloudstack.git
  2. Create a directory for the plugin off of the plugins folder. Under this new folder, create a code hierarchy like the following tree.
    |-- src
    |   `-- org
    |       `-- apache
    |           `-- cloudstack
    |               |-- api
    |               |   `-- command
    |               |       `-- user
    |               |           `-- emr
    |               `-- emr
    |-- target
    `-- test
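The tree above can be created in one go with mkdir -p. Note that the plugin root plugins/api/emr is an assumption on my part, consistent with the paths used later in this post:

```shell
# Create the plugin skeleton shown above, run from the top of the
# cloudstack checkout. The plugin root plugins/api/emr is an
# assumption based on the paths used elsewhere in this post.
mkdir -p plugins/api/emr/src/org/apache/cloudstack/api/command/user/emr
mkdir -p plugins/api/emr/src/org/apache/cloudstack/emr
mkdir -p plugins/api/emr/test
mkdir -p plugins/api/emr/target
```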
    
  3. Create a pom.xml for the emr module. The contents of the file are as follows:
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
             http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <artifactId>cloud-plugin-api-emr</artifactId>
      <name>Apache CloudStack Plugin - Elastic Map Reduce Plugin</name>
      <parent>
        <groupId>org.apache.cloudstack</groupId>
        <artifactId>cloudstack-plugins</artifactId>
        <version>4.2.0-SNAPSHOT</version>
        <relativePath>../../pom.xml</relativePath>
      </parent>
      <dependencies>
        <dependency>
          <groupId>org.apache.cloudstack</groupId>
          <artifactId>cloud-api</artifactId>
          <version>${project.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.cloudstack</groupId>
          <artifactId>cloud-utils</artifactId>
          <version>${project.version}</version>
        </dependency>
      </dependencies>
      <build>
        <defaultGoal>install</defaultGoal>
        <sourceDirectory>src</sourceDirectory>
        <testSourceDirectory>test</testSourceDirectory>
      </build>
    </project>
    
  4. Now I can open this project in NetBeans and begin to generate the source files. Navigate to plugins/api/emr/src/org/apache/cloudstack/emr and create an interface ElasticMapReduce.java that extends PluggableService.
    package org.apache.cloudstack.emr;
    
    import com.cloud.utils.component.PluggableService;
    
    public interface ElasticMapReduce extends PluggableService { }
    
  5. Create an implementation of the interface. Name it ElasticMapReduceImpl.java.
    package org.apache.cloudstack.emr;
    
    import java.util.ArrayList;
    import java.util.List;
    import javax.ejb.Local;
    import org.apache.cloudstack.api.command.user.emr.LaunchHadoopClusterCmd;
    import org.apache.log4j.Logger;
    import org.springframework.stereotype.Component;
    
    @Component
    @Local(value = ElasticMapReduce.class)
    public class ElasticMapReduceImpl implements ElasticMapReduce {
        private static final Logger s_logger = Logger.getLogger(ElasticMapReduceImpl.class);

        public ElasticMapReduceImpl() {
            super();
        }

        @Override
        public List<Class<?>> getCommands() {
            List<Class<?>> cmdList = new ArrayList<Class<?>>();
            cmdList.add(LaunchHadoopClusterCmd.class);
            return cmdList;
        }
    }
    
  6. Navigate to plugins/api/emr/src/org/apache/cloudstack/api/command/user/emr and create the source files for the command and its response.
    LaunchHadoopClusterCmd.java
    
    package org.apache.cloudstack.api.command.user.emr;
    
    
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.logging.Level;
    import java.util.logging.Logger;
    import org.apache.cloudstack.api.APICommand;
    import org.apache.cloudstack.api.BaseCmd;
    import org.apache.cloudstack.api.Parameter;
    
    @APICommand(name = "launchHadoopCluster", responseObject = LaunchHadoopClusterCmdResponse.class, description = "Launch a hadoop cluster using whirr on CloudStack", since = "4.2.0")
    public class LaunchHadoopClusterCmd extends BaseCmd {
        @Parameter(name = "config", type = CommandType.STRING, required = true, description = "the configuration file to define a cluster")
        private String config;
    
        private String cmdName = "launchHadoopCluster";
        private String output;
    
        @Override
        public void execute() {
            LaunchHadoopClusterCmdResponse response = new LaunchHadoopClusterCmdResponse();
            response.setObjectName("launchHadoopCluster");
            response.setResponseName(getCommandName());
    
            String cmdToExec = "whirr launch-cluster --config " + config;
            try {
                // Read the process's stdout. Note that getOutputStream()
                // would return the process's stdin, and calling toString()
                // on it does not yield the command output.
                Process p = Runtime.getRuntime().exec(cmdToExec);
                BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()));
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = reader.readLine()) != null) {
                    sb.append(line).append('\n');
                }
                output = sb.toString();
            } catch (IOException ex) {
                Logger.getLogger(LaunchHadoopClusterCmd.class.getName()).log(Level.SEVERE, null, ex);
            }
            response.setOutPut(output);
            this.setResponseObject(response);
        }
    
        @Override
        public String getCommandName() {
            return cmdName;
        }
    
        @Override
        public long getEntityOwnerId() {
            return 0;
        }
        
    }
    
    LaunchHadoopClusterCmdResponse.java
    
    package org.apache.cloudstack.api.command.user.emr;
    
    import com.cloud.serializer.Param;
    import com.google.gson.annotations.SerializedName;
    import org.apache.cloudstack.api.ApiConstants;
    import org.apache.cloudstack.api.BaseResponse;
    
    public class LaunchHadoopClusterCmdResponse extends BaseResponse {
        @SerializedName(ApiConstants.IS_ASYNC) @Param(description = "true if api is asynchronous")
        private Boolean isAsync;
        @SerializedName("output") @Param(description = "whirr output")
        private String output;
        
        public LaunchHadoopClusterCmdResponse(){
            
        }
        public void setAsync(Boolean isAsync) {
            this.isAsync = isAsync;
        }
     
        public boolean getAsync() {
            return isAsync;
        }
        public void setOutPut(String output) {
            this.output = output;
        }
    }
    
    
  7. Add the following dependency to cloudstack/client/pom.xml.
    <dependency>
    <groupId>org.apache.cloudstack</groupId>
    <artifactId>cloud-plugin-api-emr</artifactId>
    <version>${project.version}</version>
    </dependency>
    
    Once the emr plugin is added to the client pom file, Maven will download and link it during compilation and other goals that require it.
  8. Update client/tomcatconf/componentContext.xml.in and add the following bean:
     <bean id="elasticMapReduceImpl" class="org.apache.cloudstack.emr.ElasticMapReduceImpl" />
    
  9. Update plugins/pom.xml to add the following module.
    <module>api/emr</module>
    
  10. Add the command to client/tomcatconf/commands.properties.in
    launchHadoopCluster=15
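    As I understand it, the value 15 is a role bitmask: each bit enables the command for one CloudStack role, and ORing them all together exposes launchHadoopCluster to every role:

```shell
# commands.properties values are role bitmasks:
#   1 = admin, 2 = resource domain admin, 4 = domain admin, 8 = user
mask=$((1 + 2 + 4 + 8))
echo "launchHadoopCluster=$mask"    # all four roles may call the command
```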
    
  11. Now let's compile the code and test it!
    1. Navigate to plugins/api/emr and run:
      mvn  clean install
      
    2. In cloudstack base directory run:
      mvn -pl client clean install
      
    3. Start the Management server UI.
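With the management server running, the new command can be exercised like any other CloudStack API. Here is a sketch of a request through the unauthenticated integration port (8096, which must be enabled in global settings); the host and config path are examples:

```shell
# Build a request URL for the new API. Host, port, and config path
# are examples; port 8096 is CloudStack's unauthenticated integration
# port and must be enabled in global settings.
BASE="http://localhost:8096/client/api"
URL="${BASE}?command=launchHadoopCluster&config=/path/to/myhadoop.properties&response=json"
echo "$URL"
# Issue it with, e.g.:  curl "$URL"
```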

Whirr Troubleshooting

This post contains some troubleshooting tips that should help you get out of troubled waters when you install or use Whirr on CloudStack.
  • Case 1
    java.util.concurrent.ExecutionException: java.io.IOException: Too many instance failed while bootstrapping! 0 successfully started instances while 2 instances failed.

    If your instances go through these states (starting, running, destroying, expunging) quickly, it is highly probable that Whirr cannot log in to the instances to bootstrap them. If the template you use has a hardcoded username/password, add the following line to your cluster definition file to make Whirr aware of it.
    whirr.bootstrap-user=your_user_name:your_password
    
    The following exceptions also imply this type of issue.
    org.jclouds.rest.AuthorizationException: (root:pw[63d90337b21005ea9a4bb6d617d4e54e]@10.244.18.65:22) (root:pw[63d90337b21005ea9a4bb6d617d4e54e]@10.244.18.65:22) error acquiring {hostAndPort=10.244.18.65:22, loginUser=root, ssh=null, connectTimeout=60000, sessionTimeout=60000} (out of retries - max 7): Exhausted available authentication methods
    java.lang.NullPointerException: no credential found for root on node 206224c5-756a-43a4-9925-b4916a1cd585
  • Case 2
    java.util.concurrent.ExecutionException: org.jclouds.cloudstack.AsyncJobException: job AsyncJob{accountId=789c4968-e835-11e2-890d-0023ae94f722, cmd=org.apache.cloudstack.api.command.user.vm.DeployVMCmd, created=Wed Jul 10 11:03:50 EDT 2013, id=f6c6ebde-2694-4c55-82c2-8b6305861d1b, instanceId=null, instanceType=null, progress=0, result=null, resultCode=FAIL, resultType=object, status=FAILED, userId=78a3f050-e835-11e2-890d-0023ae94f722, error=AsyncJobError{errorCode=INSUFFICIENT_CAPACITY_ERROR, errorText=Unable to create a deployment for VM[User|you-makes-me-stronger-e6d]}} failed with exception AsyncJobError{errorCode=INSUFFICIENT_CAPACITY_ERROR, errorText=Unable to create a deployment for VM[User|you-makes-me-stronger-e6d]}
    This implies that the host does not have enough capacity to launch the VM. Go to the CloudStack management server UI → Dashboard → System Capacity → Fetch latest to check which resources are running low. Maybe you are running out of assignable IPs or secondary storage. This type of error also happens frequently for users who deploy CloudStack on a small scale and already have many failed launch attempts. By default, when a VM fails to launch, CloudStack waits 24 hours before reclaiming the resources assigned to the failed VM. Go to the CloudStack management server UI → Global Settings and lower expunge.delay and expunge.interval to ease the problem.
  • Case 3
    com.cloud.exception.ResourceUnavailableException: Resource [Pod:1] is unreachable: Unable to apply dhcp entry on router 
    
    These errors occurred because there was an unknown DHCP server on the same subnet as the system VM, but that DHCP server was not included in the CSMS IP address space for management devices. If there is no other DHCP server, simply restart the virtual router.
  • Last but not least, the CloudStack management-server.log and whirr.log are the main logs to help you track down errors.

How to use Whirr to start a hadoop cluster on CloudStack

Apache Whirr is a very powerful tool for provisioning clusters on multiple cloud platforms. This post shows an example of how to provision a hadoop cluster on CloudStack via Whirr. Let's start!

  1. Install whirr.
    $git clone git://git.apache.org/whirr.git
    $cd whirr
    $mvn clean install
    
    Also, if you are using an Ubuntu or Debian-based system, you can install the Whirr package from the Cloudera repository by following this link.
  2. Set environment variables and test that Whirr is installed correctly.
    • Add the following line to ~/.bashrc and source it.
      export PATH=$PATH:/path/to/whirr/bin
      
      $source ~/.bashrc
    • Test if whirr is successfully installed.
      $whirr version
      Apache Whirr 0.9.0-SNAPSHOT
      jclouds 1.5.8
  3. Configure Whirr to use CloudStack.
    • First check that CloudStack management server is running.
    • Edit ~/.whirr/credentials. Set cloud provider connection details.
      PROVIDER=cloudstack
      IDENTITY=KvlfWOQsWAw061MAwoHwVu05P6zo4Nutd4mrf6g8Rv6UBEkTxA4pzpyV6DjK-ZBRsm09bDmGau8iLWEAGhUX_w
      CREDENTIAL=AIlNC5cgt2e2G_7Q8AvDS9wtpr8Xpk6lSQGCdP3U148XN1SOcN5y1oEAnFPl93c8FlJquvmkxfMOpZu7VlBd3Q
      ENDPOINT=:8080/client/api
      
      Replace the above values with your own CloudStack configuration details. The identity and credential are obtained from the management server. In the CloudStack context, the identity refers to the API key and the credential refers to the secret key. If you have a fresh CloudStack setup, this post shows you how to generate this key pair for a particular user.
  4. Prepare a properties file to define your hadoop cluster. The name of the file doesn't matter; let's say it's called myhadoop.properties. Add the following key-value pairs to myhadoop.properties.
    whirr.cluster-name=test3
    whirr.provider=cloudstack
    whirr.cluster-user=meng
    whirr.store-cluster-in-etc-hosts=true
    whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker
    whirr.image-id=41595e8c-e835-11e2-890d-0023ae94f722
    whirr.hardware-id=61f55f10-9fa1-45d7-a7cb-131f7313ce83
    whirr.private-key-file=/home/meng/Desktop/whirr-key
    whirr.public-key-file=/home/meng/Desktop/whirr-key.pub
    whirr.bootstrap-user=root:password
    whirr.env.repo=cdh4
    whirr.hadoop.install-function=install_cdh_hadoop
    whirr.hadoop.configure-function=configure_cdh_hadoop
    
    1. whirr.cluster-name: a name for your hadoop cluster.
    2. whirr.store-cluster-in-etc-hosts: store all cluster IPs and hostnames in /etc/hosts on each node.
    3. whirr.instance-templates: specifies your cluster layout. One node acts as the jobtracker and namenode (the hadoop master); another two slave nodes act as both datanode and tasktracker.
    4. whirr.image-id: tells CloudStack which template to use to start the cluster. Go to your management server UI, choose Templates on the left panel, choose the template you want, and locate its ID.
    5. whirr.hardware-id: the type of hardware to use for the cluster instances. Go to your management server UI, choose Service Offerings, choose an instance type, and locate its ID.
    6. whirr.private-key-file/whirr.public-key-file: the key pair used to log in to each instance. Use only RSA SSH keys; DSA keys are not accepted yet.
    7. whirr.cluster-user: the name of the cluster admin user.
    8. whirr.bootstrap-user: tells jclouds (a cloud-neutral library used by Whirr to manipulate different cloud infrastructures) which username and password to use to log in to each instance so that Whirr can bootstrap and customize it. You must specify this property if the image you run on each node has a hardcoded username/password (e.g. the default template CentOS 5.5 (64-bit) no GUI (KVM) that comes with CloudStack has the credentials root:password); otherwise you don't need to specify it.
    9. whirr.env.repo: tells Whirr which repository to use to download packages.
    10. whirr.hadoop.install-function/whirr.hadoop.configure-function: self-explanatory.
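    For item 6 above, the key pair can be generated with ssh-keygen; Whirr expects a passwordless RSA key. The output path here is an example:

```shell
# Generate a passwordless RSA key pair for Whirr (DSA is not accepted).
# The output path is an example; point whirr.private-key-file and
# whirr.public-key-file at the resulting files.
ssh-keygen -t rsa -P '' -f ./whirr-key -q
```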
  5. Launch a hadoop cluster.
    $whirr launch-cluster --config myhadoop.properties
    
    Output:
    Running on provider cloudstack using identity KvlfWOQsWAw061MAwoHwVu05P6zo4Nutd4mrf6g8Rv6UBEkTxA4pzpyV6DjK-ZBRsm09bDmGau8iLWEAGhUX_w
    Bootstrapping cluster
    Configuring template for bootstrap-hadoop-datanode_hadoop-tasktracker
    Configuring template for bootstrap-hadoop-namenode_hadoop-jobtracker
    Starting 1 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
    Starting 1 node(s) with roles [hadoop-namenode, hadoop-jobtracker]
    >> running InitScript{INSTANCE_NAME=bootstrap-hadoop-namenode_hadoop-jobtracker} on node(2ccf8f4b-64d7-45f3-acea-3794e3406ba9)
    >> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(4f5f7ffa-d7bd-4027-977a-726b2cc44932)
    ...
    
  6. I also have a troubleshooting post that lists some errors and exceptions you might encounter while deploying the cluster with Whirr.