Sunday, July 21, 2013

How to use Whirr to start a Hadoop cluster on CloudStack

Apache Whirr is a very powerful tool for provisioning clusters on multiple cloud platforms. This post walks through an example of provisioning a Hadoop cluster on CloudStack with Whirr. Let's start!

  1. Install Whirr.
    $git clone git://git.apache.org/whirr.git
    $cd whirr
    $mvn clean install
    
    Alternatively, on Ubuntu or other Debian-based systems, you can install the Whirr package from the Cloudera repository by following this link.
  2. Set environment variables and check that Whirr is correctly installed.
    • Add the following line to ~/.bashrc and source it.
      export PATH=$PATH:/path/to/whirr/bin
      
      $source ~/.bashrc
    • Check that Whirr is installed successfully.
      $whirr version
      Apache Whirr 0.9.0-SNAPSHOT
      jclouds 1.5.8
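
If `whirr version` reports "command not found", the PATH change has probably not taken effect in the current shell. The following is a minimal sketch of that check; `check_on_path` is a hypothetical helper name, not part of Whirr.

```shell
# Hypothetical helper: report whether a command can be found through PATH.
check_on_path() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1 found at $(command -v "$1")"
  else
    echo "$1 not found on PATH" >&2
    return 1
  fi
}

# If the lookup fails, the export line in ~/.bashrc is the first thing to recheck.
check_on_path whirr || echo "Check the PATH export in ~/.bashrc, then source it again"
```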
  3. Configure Whirr to use CloudStack.
    • First, check that the CloudStack management server is running.
    • Edit ~/.whirr/credentials and set the cloud provider connection details.
      PROVIDER=cloudstack
      IDENTITY=KvlfWOQsWAw061MAwoHwVu05P6zo4Nutd4mrf6g8Rv6UBEkTxA4pzpyV6DjK-ZBRsm09bDmGau8iLWEAGhUX_w
      CREDENTIAL=AIlNC5cgt2e2G_7Q8AvDS9wtpr8Xpk6lSQGCdP3U148XN1SOcN5y1oEAnFPl93c8FlJquvmkxfMOpZu7VlBd3Q
      ENDPOINT=:8080/client/api
      
      Replace the above values with your own CloudStack configuration details. The identity and credential are obtained from the management server. In the CloudStack context, the identity is the API key and the credential is the secret key. If you have a fresh CloudStack setup, this post shows you how to generate this key pair for a particular user.
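
Since the credentials file holds a secret key, it is worth keeping it readable only by your own user. Here is a minimal sketch of one way to write it; `write_whirr_credentials` is a hypothetical helper, and the values in the usage comment are placeholders to be replaced with your own.

```shell
# Hypothetical helper: write a Whirr credentials file readable only by its owner.
# Arguments: target file, API key, secret key, API endpoint.
write_whirr_credentials() {
  target=$1
  mkdir -p "$(dirname "$target")"
  (
    umask 077  # files created in this subshell are owner-only
    printf 'PROVIDER=cloudstack\nIDENTITY=%s\nCREDENTIAL=%s\nENDPOINT=%s\n' \
      "$2" "$3" "$4" > "$target"
  )
  chmod 600 "$target"  # also tighten a file that already existed
}

# Example call with placeholder values (not real keys):
# write_whirr_credentials "$HOME/.whirr/credentials" "<api-key>" "<secret-key>" "<endpoint>"
```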
  4. Prepare a properties file to define your Hadoop cluster. The name of the file doesn't matter; let's say it's called myhadoop.properties. Add the following key-value pairs to myhadoop.properties.
    whirr.cluster-name=test3
    whirr.provider=cloudstack
    whirr.cluster-user=meng
    whirr.store-cluster-in-etc-hosts=true
    whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,2 hadoop-datanode+hadoop-tasktracker
    whirr.image-id=41595e8c-e835-11e2-890d-0023ae94f722
    whirr.hardware-id=61f55f10-9fa1-45d7-a7cb-131f7313ce83
    whirr.private-key-file=/home/meng/Desktop/whirr-key
    whirr.public-key-file=/home/meng/Desktop/whirr-key.pub
    whirr.bootstrap-user=root:password
    whirr.env.repo=cdh4
    whirr.hadoop.install-function=install_cdh_hadoop
    whirr.hadoop.configure-function=configure_cdh_hadoop
    
    1. whirr.cluster-name: the name of your Hadoop cluster.
    2. whirr.store-cluster-in-etc-hosts: store all cluster IPs and hostnames in /etc/hosts on each node.
    3. whirr.instance-templates: specifies your cluster layout. One node acts as the jobtracker and namenode (the Hadoop master). Two slave nodes each act as both datanode and tasktracker.
    4. whirr.image-id: tells CloudStack which template to use to start the cluster. Go to your management server UI, choose Template on the left panel, pick the template you want and locate its ID.
    5. whirr.hardware-id: the type of hardware to use for the cluster instances. Go to your management server UI, choose Service Offerings, pick an instance type and locate its ID.
    6. whirr.private-key-file/whirr.public-key-file: the key pair used to log in to each instance. Use only RSA SSH keys; DSA keys are not accepted yet.
    7. whirr.cluster-user: the name of the cluster admin user.
    8. whirr.bootstrap-user: tells jclouds (a cloud-neutral library that Whirr uses to work with different cloud infrastructures) which username and password to use to log in to each instance so that Whirr can bootstrap and customize it. You must specify this property if the image you run on each node has a hardwired username/password (e.g. the default template CentOS 5.5(64-bit) no GUI (KVM) that comes with CloudStack has the hardwired credential root:password); otherwise you don't need to specify it.
    9. whirr.env.repo: tells Whirr which repository to download packages from.
    10. whirr.hadoop.install-function/whirr.hadoop.configure-function: self-explanatory; the functions used to install and configure Hadoop.
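
A properties file like this is easy to get subtly wrong, for example by defining a key twice, leaving one out, or pointing at a non-RSA key. The sketch below shows one possible pre-flight check in plain shell. The three function names are hypothetical helpers, not Whirr commands, and the RSA check only assumes the standard OpenSSH public key format, where the first field of the .pub file is the key type.

```shell
# Hypothetical helper: print every key that is defined more than once.
duplicate_keys() {
  grep -v '^[[:space:]]*#' "$1" | grep '=' | cut -d= -f1 | tr -d ' \t' | sort | uniq -d
}

# Hypothetical helper: print each required key that the file does not define.
# Usage: missing_keys FILE KEY...
missing_keys() {
  file=$1
  shift
  for key in "$@"; do
    awk -F= -v k="$key" '
      { gsub(/[ \t]/, "", $1) }
      $1 == k { found = 1 }
      END { exit !found }
    ' "$file" || echo "$key"
  done
}

# Hypothetical helper: succeed only if the public key file holds an RSA key.
is_rsa_public_key() {
  [ "$(cut -d' ' -f1 "$1" | head -n 1)" = "ssh-rsa" ]
}

# Example calls, assuming the file and key names used in this post:
# duplicate_keys myhadoop.properties
# missing_keys myhadoop.properties whirr.cluster-name whirr.provider whirr.image-id whirr.hardware-id
# is_rsa_public_key /home/meng/Desktop/whirr-key.pub || echo "public key is not RSA"
```

All three helpers print nothing when the file is fine, so any output points at something to fix before launching.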
  5. Launch the Hadoop cluster.
    $whirr launch-cluster --config myhadoop.properties
    
    Output:
    Running on provider cloudstack using identity KvlfWOQsWAw061MAwoHwVu05P6zo4Nutd4mrf6g8Rv6UBEkTxA4pzpyV6DjK-ZBRsm09bDmGau8iLWEAGhUX_w
    Bootstrapping cluster
    Configuring template for bootstrap-hadoop-datanode_hadoop-tasktracker
    Configuring template for bootstrap-hadoop-namenode_hadoop-jobtracker
    Starting 1 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
    Starting 1 node(s) with roles [hadoop-namenode, hadoop-jobtracker]
    >> running InitScript{INSTANCE_NAME=bootstrap-hadoop-namenode_hadoop-jobtracker} on node(2ccf8f4b-64d7-45f3-acea-3794e3406ba9)
    >> running InitScript{INSTANCE_NAME=bootstrap-hadoop-datanode_hadoop-tasktracker} on node(4f5f7ffa-d7bd-4027-977a-726b2cc44932)
    ...
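
Passing the wrong file name to `--config` is an easy mistake to make. A small wrapper, sketched here, can refuse to start when the properties file is not readable; `launch_cluster` is a hypothetical name, and it assumes `whirr` is already on your PATH.

```shell
# Hypothetical wrapper: launch only if the properties file can be read.
launch_cluster() {
  config=$1
  if [ ! -r "$config" ]; then
    echo "Cannot read properties file: $config" >&2
    return 1
  fi
  whirr launch-cluster --config "$config"
}

# Example call, using the file name from step 4:
# launch_cluster myhadoop.properties
```

When the cluster is no longer needed, `whirr destroy-cluster --config myhadoop.properties` is the matching Whirr command to shut it down.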
    
  6. I also have a troubleshooting post that lists some errors and exceptions you might encounter while deploying the cluster with Whirr.
