Wednesday, July 31, 2013

Starting a YARN Cluster on EC2 via Whirr

In this post, I will show you how to start a YARN cluster on EC2, again using Whirr. Yes, with Whirr you can provision a whole cluster with a single command!
  1. Install Whirr. Check my other post for details on how to install Whirr from source.
  2. Create your Yarn cluster definition file.
    1. Copy a template file from the recipes.
      cd whirr
      cp recipes/hadoop-yarn-ec2.properties my-yarn-cluster.properties
      
    2. Set your AWS credentials.
      vi ~/.bashrc
      export AWS_ACCESS_KEY_ID=
      export AWS_SECRET_ACCESS_KEY= #Go to your AWS management console to obtain these keys.
      source ~/.bashrc
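      Before launching anything, it is worth checking that both keys are really exported. A small sanity-check sketch (the variable names match the exports above; nothing else is assumed):

      ```shell
      #!/bin/bash
      # Fail fast if either AWS key is missing from the environment,
      # so Whirr does not die halfway through provisioning.
      for var in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
        if [ -z "${!var}" ]; then     # ${!var} = indirect expansion (bash)
          echo "ERROR: $var is not set" >&2
        else
          echo "$var is set"
        fi
      done
      ```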
      
    3. Use this AMI locator to find an image for your instance and set it in my-yarn-cluster.properties accordingly. The following is an example.
      whirr.image-id=us-east-1/ami-1ab3ce73
      whirr.location-id=us-east-1
      
      ♥ If you choose a different location, make sure whirr.image-id is updated too ♥
    4. Comment out the following line:
      #whirr.template=osFamily=UBUNTU,osVersionMatches=10.04,os64Bit=true,minRam=2048
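      After these edits, the relevant part of my-yarn-cluster.properties looks roughly like this (a hedged sketch: the cluster name, instance-template roles, and hardware id here follow the stock recipe's defaults, so adjust them to your own setup):

      ```properties
      # Sketch of a completed my-yarn-cluster.properties (recipe defaults)
      whirr.cluster-name=hadoop-yarn
      whirr.instance-templates=1 hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver,1 hadoop-datanode+yarn-nodemanager
      whirr.provider=aws-ec2
      # Credentials are read from the environment variables exported earlier.
      whirr.identity=${env:AWS_ACCESS_KEY_ID}
      whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
      whirr.hardware-id=m1.large
      whirr.image-id=us-east-1/ami-1ab3ce73
      whirr.location-id=us-east-1
      ```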
      
  3. Now you are ready to launch the cluster.
    whirr launch-cluster --config my-yarn-cluster.properties
    
    Its output is as follows:
    Running on provider aws-ec2 using identity XXXXXXXXX
    createClientSideYarnProperties yarn.nodemanager.log-dirs:/tmp/nm-logs
    createClientSideYarnProperties yarn.nodemanager.remote-app-log-dir:/tmp/nm-remote-app-logs
    createClientSideYarnProperties yarn.nodemanager.aux-services:mapreduce.shuffle
    createClientSideYarnProperties yarn.nodemanager.aux-services.mapreduce.shuffle.class:org.apache.hadoop.mapred.ShuffleHandler
    createClientSideYarnProperties yarn.nodemanager.delete.debug-delay-sec:6000
    createClientSideYarnProperties yarn.app.mapreduce.am.staging-dir:/user
    createClientSideYarnProperties yarn.nodemanager.local-dirs:/data/tmp/hadoop-${user.name}
    createClientSideYarnProperties yarn.nodemanager.resource.memory-mb:4096
    Started cluster of 2 instances
    Cluster{instances=[Instance{roles=[hadoop-namenode, yarn-resourcemanager, mapreduce-historyserver], publicIp=204.236.250.181, privateIp=10.166.45.20, id=us-east-1/i-a9c6f5cb, nodeMetadata={id=us-east-1/i-a9c6f5cb, providerId=i-a9c6f5cb, name=hadoop-yarn-a9c6f5cb, location={scope=ZONE, id=us-east-1a, description=us-east-1a, parent=us-east-1, iso3166Codes=[US-VA]}, group=hadoop-yarn, imageId=us-east-1/ami-1ab3ce73, os={family=ubuntu, arch=paravirtual, version=10.04, description=ubuntu-us-east-1/images/ubuntu-lucid-10.04-amd64-server-20130704.manifest.xml, is64Bit=true}, status=RUNNING[running], loginPort=22, hostname=ip-10-166-45-20, privateAddresses=[10.166.45.20], publicAddresses=[204.236.250.181], hardware={id=m1.large, providerId=m1.large, processors=[{cores=2.0, speed=2.0}], ram=7680, volumes=[{type=LOCAL, size=10.0, device=/dev/sda1, bootDevice=true, durable=false}, {type=LOCAL, size=420.0, device=/dev/sdb, bootDevice=false, durable=false}, {type=LOCAL, size=420.0, device=/dev/sdc, bootDevice=false, durable=false}], hypervisor=xen, supportsImage=And(ALWAYS_TRUE,Or(isWindows(),requiresVirtualizationType(paravirtual)),ALWAYS_TRUE,is64Bit())}, loginUser=ubuntu, userMetadata={Name=hadoop-yarn-a9c6f5cb}}}, Instance{roles=[hadoop-datanode, yarn-nodemanager], publicIp=54.225.52.2, privateIp=10.164.60.16, id=us-east-1/i-c6cd99ae, nodeMetadata={id=us-east-1/i-c6cd99ae, providerId=i-c6cd99ae, name=hadoop-yarn-c6cd99ae, location={scope=ZONE, id=us-east-1a, description=us-east-1a, parent=us-east-1, iso3166Codes=[US-VA]}, group=hadoop-yarn, imageId=us-east-1/ami-1ab3ce73, os={family=ubuntu, arch=paravirtual, version=10.04, description=ubuntu-us-east-1/images/ubuntu-lucid-10.04-amd64-server-20130704.manifest.xml, is64Bit=true}, status=RUNNING[running], loginPort=22, hostname=ip-10-164-60-16, privateAddresses=[10.164.60.16], publicAddresses=[54.225.52.2], hardware={id=m1.large, providerId=m1.large, processors=[{cores=2.0, speed=2.0}], ram=7680, volumes=[{type=LOCAL, 
size=10.0, device=/dev/sda1, bootDevice=true, durable=false}, {type=LOCAL, size=420.0, device=/dev/sdb, bootDevice=false, durable=false}, {type=LOCAL, size=420.0, device=/dev/sdc, bootDevice=false, durable=false}], hypervisor=xen, supportsImage=And(ALWAYS_TRUE,Or(isWindows(),requiresVirtualizationType(paravirtual)),ALWAYS_TRUE,is64Bit())}, loginUser=ubuntu, userMetadata={Name=hadoop-yarn-c6cd99ae}}}]}
    
    You can log into instances using the following ssh commands:
    [hadoop-namenode+yarn-resourcemanager+mapreduce-historyserver]: ssh -i /home/meng/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no meng@204.236.250.181
    [hadoop-datanode+yarn-nodemanager]: ssh -i /home/meng/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no meng@54.225.52.2
    To destroy cluster, run 'whirr destroy-cluster' with the same options used to launch it.
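    If you want to script against this output (say, to feed the addresses into other tools), the public IPs can be pulled out with a one-liner. A small sketch, assuming you saved the launch output to a file called launch.log (a name I made up):

    ```shell
    # Extract every publicIp=... value from the saved Whirr launch output,
    # keeping one line per unique address.
    grep -o 'publicIp=[0-9.]*' launch.log | cut -d= -f2 | sort -u
    ```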
    
    
  4. Don't forget to destroy the cluster once you are done with it. Your instances keep running (and billing) on EC2, which only offers limited free usage. Otherwise you might receive a huge bill after a while, like I did a month ago... but that's another story.
    whirr destroy-cluster --config my-yarn-cluster.properties
    
Good luck playing with Whirr :)
