Hadoop/MapReduce Lab
This lab will get you up and running with the Carleton CS Hadoop cluster. There are miscellaneous tips and resources here.
1 Setting up your account in our labs
In the department labs, you'll need to configure your account so that it knows where to find the executables and libraries that Hadoop needs. On the department computers, you should have a file in your home directory named .bash_profile. It contains a series of commands that are run automatically whenever you log in to a desktop or open a terminal window. Because its name starts with a ., it is treated as a "hidden" file: a plain ls at the command line won't show it, and you may not see it in Finder windows either. To find it, issue the command ls -a in a terminal window (the "a" is for "all"). You should see the file. Open .bash_profile in your favorite editor, and add the following four lines to the bottom of it:
(Fourth line below added on Monday, 2/29)
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export PATH=$PATH:/usr/local/hadoop/bin
If you have significantly modified your login setup on the department machines, you'll have to make changes to this approach. Ask for help. If you wish to do this on your own computer, you'll need to first install Hadoop. That's a multistep procedure that you'll have to spend some time on.
To make sure that the above worked, close your terminal window, open up a new one, and type
echo $HADOOP_CLASSPATH
If things are configured correctly, you should see:
/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/lib/tools.jar
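As an additional check, the hadoop command itself should now be on your PATH. Assuming the standard department install location used above, which hadoop should print something like:
$ which hadoop
/usr/local/hadoop/bin/hadoop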
2 A first MapReduce program
Download and unzip HadoopLab.zip.
The file Trivial.java contains a very simple MapReduce job that simply copies its input verbatim. However, even this trivial task requires quite a bit of setup. Look over the code and make note of the following pieces (a minimal sketch of the two nested classes follows this list):
- main() creates an instance of Trivial and runs it.
- The nested class IdentityMapper inherits from the Mapper class. Note that the map() method simply emits its input key/value pair as its output.
- The actual type parameters used in the declaration of IdentityMapper are really important, since they represent which types you expect your input and output elements to be. In this case we expect the input keys to have type LongWritable (that's writable as in "can be written to disk," not as in "modifiable," although they are modifiable) and the input values to have type Text. The output types of the Mapper can be anything as long as (1) the keys implement WritableComparable, (2) the values implement Writable, and (3) the types match the input types for the Reducer task.
- The nested class IdentityReducer inherits from the Reducer class. Note that its input types (LongWritable and Text) match the output types of the mapper. The reduce() method receives an input key and a set of values for that key; it simply outputs the keys and values to the result file.
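For orientation, here is a minimal sketch of what those two nested classes typically look like when written against Hadoop's org.apache.hadoop.mapreduce API; the code in the provided Trivial.java may differ in small details, so treat this as an illustration rather than a copy of that file.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// These classes would be nested inside the outer job class (Trivial).
public static class IdentityMapper
    extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Pass the input key/value pair through unchanged.
    context.write(key, value);
  }
}

public static class IdentityReducer
    extends Reducer<LongWritable, Text, LongWritable, Text> {
  @Override
  public void reduce(LongWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // Write every value for this key back out, paired with the key.
    for (Text value : values) {
      context.write(key, value);
    }
  }
}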
3 Compiling a MapReduce program
In order to compile your program, issue the following command. (Don't type the $ or anything to the left of it; it's there to show that you're typing at a terminal prompt.)
$ hadoop com.sun.tools.javac.Main Trivial.java
You'll then need to create a jar file containing your class files. Issue the following command to do this:
$ jar cf trivial.jar Trivial*.class
You should now have a jar file trivial.jar
in your directory.
4 Running Hadoop jobs locally
For the most part, you should test your code on your local machine, and only run it on the Amazon cluster once you're pretty sure it's correct. You can run a Hadoop job on the lab machines with a command somewhat like the following:
$ hadoop jar [jarfile] [mainClass] [arg1] [arg2] ...
In our example, [jarfile] is trivial.jar (which you just created) and [mainClass] is Trivial. The program takes two arguments: the input folder and the output folder. The input folder should have some input files to be processed, and the output folder should not exist - it will be created by the job. The folder trivialInput has some text that you can use as an example input.
So, here we go! Run the MapReduce job as follows:
$ hadoop jar trivial.jar Trivial trivialInput output
You should see a lot of output scroll past the screen. When it's done, it should hopefully give you a bunch of summary statistics, indicating how many bytes were read and written, and so on.
If something went wrong, it might be because you already have an output
directory, especially if you've tried to do this multiple times. Hadoop will
fail if the output directory already exists, so make sure to remove it before
running the above command.
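For example, to remove a stale output directory on your local machine before re-running the job:
$ rm -r output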
If all went as hoped, there should also be a folder output
with two files:
$ ls output
_SUCCESS    part-r-00000
part-r-00000
contains the job's results. Take a look! Opening big files in
editors is sometimes cumbersome, so you may wish to use the less
command to do
so:
$ less output/part-r-00000
5 Using a Hadoop cluster on Amazon
Hopefully by now you have created an AWS (Amazon Web Services) account, and gotten approved for AWS Educate credits. Without an AWS account you won't be able to do this portion of the lab; without the AWS Educate credits, you'll have to pay for the services yourself. Luckily, the work that we're doing shouldn't cost too much money in any case.
5.1 Setting up the cluster
Setting up a cluster manually on AWS is doable, but involves a lot of clicking around in the web console. Instead, we're going to create Hadoop clusters on AWS using a program called StarCluster, a set of Python scripts designed by some folks at MIT to make this more automatic.
StarCluster has already been installed on the department lab computers. If you want to use it on your own computer, you'll need to install it yourself.
The instructions below are based heavily on this StarCluster quick-start guide. Feel free to refer to it if you'd like to see the same steps in different wording.
5.1.1 Configure StarCluster with Amazon security credentials
To get started, log into one of the department lab machines, open a terminal window, and issue the command:
$ starcluster help
This will give you an error message indicating that the StarCluster config file does not exist yet, and then give you three options. Choose option 2, which is the one to write a config to your home directory.
Open up the file ~/.starcluster/config in your favorite editor. Within, you'll find a section that looks something like this:
[aws info]
aws_access_key_id = #your aws access key id here
aws_secret_access_key = #your secret aws access key here
aws_user_id = #your 12-digit aws user id here
You'll need to update this section with information from your Amazon account. First, create an access key: visit aws.amazon.com, and near the top right corner choose "My Account," then "Security Credentials." You may see a popup offering to get you started with IAM Users; if so, dismiss it by clicking "Continue to Security Credentials." Then select "Access Keys (Access Key ID and Secret Access Key)" and click the button labeled "Create New Access Key." A popup will show your access key id and your secret access key. I recommend clicking "Download Key File" to save a file containing both keys; this is your one and only chance to see the secret key without deleting it and creating another one, so the key file lets you look this information up later if you need it.
The access key id and secret access key are the codes you need for the first two lines of your StarCluster config file. Copy each of them into the appropriate place.
The StarCluster config file also needs your 12-digit AWS user id. To find it, click your name at the top of the Amazon web page you're on and choose "My Account" from the dropdown menu. Your 12-digit Account Id is what you want; copy and paste it into the StarCluster config file as well.
You'll next need to create ssh keys for connection with Amazon. After completing the above steps and saving your config file, issue the following command at the terminal:
$ starcluster createkey mykey -o ~/.ssh/mykey.rsa
(Technically, you can name your key something other than mykey
, but if you do,
you'll also need to change the [key]
section of the StarCluster config file.)
5.1.2 Further StarCluster configuration: get your cluster ready
In your StarCluster config file, there is a section labeled
[cluster smallcluster]
Shortly below that section header is a variable named CLUSTER_SIZE
that is set
by default to 2. This is where you configure the number of nodes you wish to
have in your cluster. 2 is boring. Change it to 4:
CLUSTER_SIZE = 4
You can actually make CLUSTER_SIZE
as big as you like, though it eventually
starts costing more money if you overdo it.
Near the bottom of this section of the config file, past a bunch of comments, is a line that lets you specify PLUGINS. Change this line so that it reads:
PLUGINS = hadoop
(Added on Mon 2/29) Next, find the section that looks like this, and uncomment both lines:
# [plugin hadoop]
# SETUP_CLASS = starcluster.plugins.hadoop.Hadoop
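Putting the edits from this subsection together, the relevant pieces of your config file should now look roughly like this (a sketch only; all other settings in the file stay as they were):

[cluster smallcluster]
# ... other settings unchanged ...
CLUSTER_SIZE = 4
# ...
PLUGINS = hadoop

[plugin hadoop]
SETUP_CLASS = starcluster.plugins.hadoop.Hadoop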
Phew, did you get all of that?
5.1.3 Starting up the cluster
Finally, let's start up your virtual cluster! At the terminal prompt, enter:
starcluster start mycluster
If this works, it will take a while: typically a few minutes. You'll see a long list of status messages as StarCluster does its work in setting up your Amazon cluster. When done, you should see a message saying "The cluster is now ready to use," followed by a summary of some of the commands you can use.
Congratulations! If you had trouble making this work, ask for help.
5.2 Logging into the cluster
To log in to your cluster, issue this command:
$ starcluster sshmaster mycluster
The sshmaster
option indicates that you are connecting to the master
node of the cluster, and it's the machine from which you'll launch your Hadoop
jobs on the cluster. If this works, you may need to answer "yes" to a question
about an RSA key fingerprint, and then you should see a welcome screen with some
ASCII art for StarCluster, and find yourself at a UNIX prompt.
If this login does not work, ask for help.
5.3 Transfer files to the cluster
Type exit
(or ctrl-D) to go back to your local machine. Make sure you are in
the directory where your Trivial.java
file is. The version of Hadoop running
on the Amazon cluster is not the same one running in our labs, so you'll need to
transfer the Java file over to the cluster, and compile it again over
there. Issue the commands below:
$ starcluster put mycluster Trivial.java /root
While you're at it, also transfer over the directory of data that you've been working with:
$ starcluster put mycluster trivialInput /root
SSH into your cluster again:
$ starcluster sshmaster mycluster
Once you have connected, issue an ls
command to verify that your files are
there on the cluster as expected.
5.4 Compile your Hadoop program
As before, you'll need to compile your Hadoop program. Hopefully it has no errors; it's the same program you ran locally. We are using a different version of Hadoop on the cluster than in the labs, but the differences from a programming perspective are exceedingly small, and shouldn't affect anything we do. Since the versions of Hadoop are different, transferring the jar file that we created locally won't work.
You'll again need to set up some configuration settings on the cluster for the locations of Java and Hadoop. As before, you'll need to add a couple of lines to a file. On the cluster, that file is .bashrc (not .bash_profile as it is on the department Macs). Since we're in a command-line situation, you'll need to use an editor that works in a terminal window. This is a great occasion to learn how to use vim or emacs! All of that said, there is a less capable editor known as pico that's pretty easy to use if you're just looking to make a quick edit. Issue the command
root@master:~# pico .bashrc
and then scroll down with the cursor keys to the bottom of the file. Copy and paste in the following two lines:
export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
See if you can figure out how to save the file from there.
Once you have saved the above changes, log out of the cluster (type exit
),
then ssh back in again. If all worked out, the following compile commands should
now work:
root@master:~# hadoop com.sun.tools.javac.Main Trivial.java
If it works, then create a jar file again as you did earlier in this lab.
root@master:~# jar cf trivial.jar Trivial*.class
5.5 Get your input data into HDFS
You'll next need to transfer any input files onto HDFS, which is the cluster file system. You've already transferred trivialInput from your department computer to the cluster, but that directory is currently sitting in the root home directory on the master node. That's just the normal local hard disk of the master computer; it is not the same thing as putting the files into HDFS, which is the distributed file system with replication that Hadoop uses. (Hadoop automatically distributes your data over the multiple computers in your cluster as it best sees fit, and it also replicates it over multiple locations.)
You can access the cluster filesystem using the command hadoop fs, which has a bunch of UNIX-like options you can use with it. (Don't type the # with any of the commands below; it's there to indicate that you're typing at a terminal prompt on the cluster.)
# hadoop fs -help
Put the input data directory into HDFS using -put. The command below copies the whole trivialInput directory into /user/root on HDFS, so the files end up inside /user/root/trivialInput rather than loose in your HDFS home directory.
(below modified on 2/29)
# hadoop fs -put trivialInput /user/root
Now you should be able to see the input files in HDFS:
# hadoop fs -ls
Found 1 items
drwxr-xr-x   - root supergroup          0 2016-02-26 11:09 /user/root/trivialInput
root@master:~# hadoop fs -ls trivialInput
Found 2 items
-rw-r--r--   3 root supergroup     183773 2016-02-26 11:09 /user/root/trivialInput/pg19778.txt
-rw-r--r--   3 root supergroup     178845 2016-02-26 11:09 /user/root/trivialInput/pg2265.txt
Note that you DO NOT need to copy trivial.jar
into HDFS. It should stay in
your home directory on the cluster.
5.6 Running a job on the cluster
We now have all the pieces in place to run a job on the cluster! First, though, there are some nifty web-based tools you should take a look at.
You'll first need to find the public DNS name for your cluster master
computer. If you are still logged into the cluster, type exit
(or ctrl-D) so
you are back at your lab machine again. Then type:
starcluster listclusters
This will show you some information for your cluster. For me, it looks like:
-----------------------------------------
mycluster (security group: @sc-mycluster)
-----------------------------------------
Launch time: 2016-02-26 04:57:57
Uptime: 0 days, 00:51:58
Zone: us-east-1c
Keypair: mykey
EBS volumes: N/A
Cluster nodes:
     master running i-86608d05 ec2-54-163-131-240.compute-1.amazonaws.com
    node001 running i-85608d06 ec2-174-129-164-12.compute-1.amazonaws.com
    node002 running i-84608d07 ec2-54-145-159-21.compute-1.amazonaws.com
    node003 running i-9b608d18 ec2-54-204-211-159.compute-1.amazonaws.com
Total nodes: 4
You want to find the row associated with the master node. The rightmost string
there (for me, ec2-54-163-131-240.compute-1.amazonaws.com
) is the public DNS
that you can use to construct a URL for monitoring the status of your Hadoop
cluster in a web browser. You can copy and paste that DNS name into a browser
window, and add a port number. Different port numbers get you different
information:
- Port 50030 is the JobTracker interface for the cluster. It displays information on all of the jobs that are currently running.
- Port 50070 displays information about the cluster's distributed filesystem. This is probably of less immediate interest to you, but still interesting.
So for me, based on the above output from StarCluster, I would visit the following two URLs:
http://ec2-54-163-131-240.compute-1.amazonaws.com:50030
http://ec2-54-163-131-240.compute-1.amazonaws.com:50070
Make sure you can view those web pages, and ask for help if you need it.
Now, while you've got the first web page above visible (the 50030 one), go back to your terminal window, ssh to your cluster again, and start up your job just as you did on the lab machine. Note that the input and output paths now represent folders on HDFS.
$ starcluster sshmaster mycluster
# hadoop jar trivial.jar Trivial trivialInput output
Again, you should get a bunch of output, hopefully ending with reports of success. If you keep refreshing the job tracker page, you should be able to see your job as either in progress or complete. Click on the job ID and explore the information available.
(Note that if you already have an output directory in HDFS, possibly from a previous run, the above hadoop command will give you an error saying so. You'll need to remove that output directory or move it somewhere else; use hadoop fs -help to look for commands to remove or move directories.)
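For example, depending on which version of Hadoop the cluster is running, one of the following should remove a leftover output directory (check hadoop fs -help to confirm which form your version supports):
# hadoop fs -rmr output
# hadoop fs -rm -r output
Or, to keep the old results, move the directory aside instead:
# hadoop fs -mv output output-old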
Congrats! You've officially executed a MapReduce job on a real Hadoop cluster!
If you want to look at the job output back on your own lab computer, you'll first need to get it out of HDFS, then transfer it back to the lab:
# hadoop fs -get output/part-r-00000 .
# ls
# exit
$ starcluster get mycluster part-r-00000 .
$ head part-r-00000
5.7 Amazon billing info
Amazon charges money for these clusters. How much? Amazon's pricing model is a bit confusing, but here's what I know. Unless you've changed the type of node in your config file, it will create each node of type m1.small. Here is the webpage with pricing for so-called "previous generation" instances, which includes m1.small. (We're using previous generation instances instead of current ones because StarCluster doesn't play nicely (yet) with the new ones.) The prices fluctuate all the time. At the time I'm writing this lab, the node type m1.small is selling for $0.044 per hour. If you started four of them, that means you're paying approximately eighteen cents per hour. You can do the math yourself on how much it will then cost you based on how long you leave it running. Remember that if you successfully got your AWS Educate credits, that gets you $100 that you can use.
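To make that concrete (at the prices quoted above, which will likely have changed by the time you read this): 4 nodes × $0.044/hour ≈ $0.18/hour, which works out to roughly $4.25 for a full 24-hour day, or about $30 if the cluster is left running for a week.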
(Someone will notice that Amazon offers an "AWS Free Tier" with t2.micro
instances. Sadly, t2.micro
doesn't seem to play nicely with
StarCluster. If you want to go this route, you'll have to manually configure the
cluster and Hadoop via Amazon's web interface instead of with StarCluster.)
StarCluster also creates an 8 GB EBS (Elastic Block Storage) volume for each node. Pricing on that is here. Again, prices change all the time for this. StarCluster creates so-called standard magnetic volumes, which, at the moment I'm writing this lab, cost $0.05 per GB-month. In other words, you pay $0.05 for each GB that survives a month. Again, you can do the math to figure out what that means for your cluster, but this amounts to a small amount of money that should easily be covered by your AWS Educate credits.
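As an illustration (assuming one 8 GB volume per node, at the price quoted above): 4 nodes × 8 GB × $0.05/GB-month = $1.60 per month for the cluster's volumes.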
You should regularly check your Amazon billing status. Amazon can show you how much you have been charged, though the numbers are not updated in real time; they can be delayed by a few hours or up to a day. Until your clusters are shut down for good, check this regularly to make sure that you are not being charged more than you expect.

To check your billing status, visit aws.amazon.com, and near the top right of the browser window select "My Account," then "Billing & Cost Management." You'll see your current charges there, but that number can be deceiving: at least sometimes for me, it shows the amount you will be paying out of pocket, which often appears as $0 so long as your AWS Educate credits are covering it. That's fine, but it doesn't tell you whether you're quickly blowing through your AWS Educate money. The place I've found that's accurate is the "Bill Details" button at the top right of the billing screen; click it, then click the "Expand All" that appears below the summary. If you scroll down, you'll see your actual charges as well as what is being credited. Again, this is not always refreshed right away, so come back often.
5.8 How to terminate your cluster
Don't miss this section! Leaving your cluster running forever will eventually cost you money!
At this point, you've got a choice: you can leave your cluster running for the
next week or so as you'll be working on assignments, or you can terminate the
cluster and create a new one later. If you terminate it and make a new one
later, you won't need to update your .starcluster/config
(that's done), but
you would need to update your .bashrc
on the cluster again once you've made a
new one. If you leave your cluster running, Amazon will continue to charge you
money for it. How to proceed is up to you.
When you are ready to shut down your cluster for good, issue this command, which will take a bit of time to run:
$ starcluster terminate mycluster
Answer y
when it asks if you want to terminate the cluster.
To make sure everything is really shut down, you might issue the following commands after termination is complete:
$ starcluster listclusters
$ starcluster listinstances
$ starcluster listvolumes
Make sure to continue to check your AWS billing for the next few days after you've terminated your last cluster. If you accidentally left something running, you'll continue to get billed.
Starting up clusters again is pretty easy, so I've been choosing to shut down my clusters when done with a particular session, and start them up again later. If I'm running a long Hadoop job, of course, I need to keep it running.
6 Do it all again for your partner
If two of you are working side-by-side in the lab, I want to make sure that
everyone is able to get an AWS cluster up and running. Repeat this lab for your
partner. It should go much faster once you know what you're doing! One change,
when you do it a second time: use the WordCount.java
program instead. See if
you can make the appropriate changes in commands throughout this lab in order to
run that program.
If you are working alone, repeat this lab again for WordCount.java
yourself. It's worthwhile to make sure you understand how to adjust all of the
commands for running a different program.
7 What to turn in
Submit screenshots of your web browser showing the job tracker information on
the AWS cluster, as well as the file system information. (These are the two web
pages you were able to link to via the 50030 and 50070 ports.) Take the
screenshots after you have run the WordCount.java
program.
Make sure you do this before terminating your cluster.
8 Next steps
The next assignment will be to create an inverted index on a set of documents. I'll have another assignment page with more info on that, but if you want to get started, here's the basic idea:
An inverted index is a mapping of words to their locations in a set of documents. Most modern search engines use some form of an inverted index to process user-submitted queries. It is also one of the most popular MapReduce examples. In its most basic form, an inverted index is a simple hash table that maps each word in the documents to some sort of document identifier. For example, given the following 2 documents:
Doc1: Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
Doc2: Buffalo are mammals.
we could construct the following inverted file index:
Buffalo -> Doc1, Doc2
buffalo -> Doc1
buffalo. -> Doc1
are -> Doc2
mammals. -> Doc2
Your goal is to build an inverted index of words to the documents which contain
them. Your end result should be something of the form: (word, docid[])
.
Again, there will be another assignment page officially explaining what to do, but if you can get this inverted index working or close to working, it will be a big head start.
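If you want something concrete to start from, here is a rough sketch of the MapReduce shape of the problem, written against the same org.apache.hadoop.mapreduce API as above. The class names and output format here are hypothetical illustrations, not the official assignment specification, which may ask for something different.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// These classes would be nested inside a driver class set up like Trivial's.
// Map: for every word in a line, emit (word, name of the file the line came from).
public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
    for (String word : value.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), new Text(docId));
      }
    }
  }
}

// Reduce: gather the distinct document ids seen for each word.
public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  public void reduce(Text word, Iterable<Text> docIds, Context context)
      throws IOException, InterruptedException {
    Set<String> seen = new HashSet<String>();
    StringBuilder docList = new StringBuilder();
    for (Text d : docIds) {
      String doc = d.toString();
      if (seen.add(doc)) {               // only list each document once
        if (docList.length() > 0) {
          docList.append(", ");
        }
        docList.append(doc);
      }
    }
    context.write(word, new Text(docList.toString()));
  }
}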