Hadoop/MapReduce Lab
This lab will get you up and running with the Carleton CS Hadoop cluster. There are miscellaneous tips and resources here.
1 Setting up your account in our labs
In the department labs, you'll need to configure your account so that it knows where to find the executables and libraries that Hadoop needs. On the department computers, you should have a file in your home directory named .bash_profile. It contains a series of commands that are run automatically whenever you log in to a desktop or open a terminal window. Because its name starts with a ., it is treated as a "hidden" file: a plain ls at the command line won't show it, and you may not see it in Finder windows either. To find it, issue the command ls -a in a terminal window (the "a" is for "all"). You should see the file. Open .bash_profile in your favorite editor, and add the following four lines to the bottom of it:
(Fourth line below added on Monday, 2/29)
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
export PATH=$PATH:/usr/local/hadoop/bin
If you have significantly modified your login setup on the department machines, you'll have to make changes to this approach. Ask for help. If you wish to do this on your own computer, you'll need to first install Hadoop. That's a multistep procedure that you'll have to spend some time on.
To make sure that the above worked, close your terminal window, open up a new one, and type
echo $HADOOP_CLASSPATH
If things are configured correctly, you should see:
/Library/Java/JavaVirtualMachines/jdk1.8.0_66.jdk/Contents/Home/lib/tools.jar
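As an additional check, the hadoop command itself should now be on your PATH. Assuming the standard department install location used above, which hadoop should print something like:
$ which hadoop
/usr/local/hadoop/bin/hadoop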
2 A first MapReduce program
Download and unzip HadoopLab.zip.
The file Trivial.java contains a very simple MapReduce job that simply copies its input verbatim. However, even this trivial task requires quite a bit of setup. Look over the code and make note of the following pieces (a minimal sketch of the two nested classes follows this list):
- main() creates an instance of Trivial and runs it.
- The nested class IdentityMapper inherits from the Mapper class. Note that the map() method simply emits its input key/value pair as its output.
- The actual type parameters used in the declaration of IdentityMapper are really important, since they represent which types you expect your input and output elements to be. In this case we expect the input keys to have type LongWritable (that's writable as in "can be written to disk," not as in "modifiable," although they are modifiable) and the input values to have type Text. The output types of the Mapper can be anything as long as (1) the keys implement WritableComparable, (2) the values implement Writable, and (3) the types match the input types for the Reducer task.
- The nested class IdentityReducer inherits from the Reducer class. Note that its input types (LongWritable and Text) match the output types of the mapper. The reduce() method receives an input key and a set of values for that key; it simply outputs the keys and values to the result file.
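For orientation, here is a minimal sketch of what those two nested classes typically look like when written against Hadoop's org.apache.hadoop.mapreduce API; the code in the provided Trivial.java may differ in small details, so treat this as an illustration rather than a copy of that file.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// These classes would be nested inside the outer job class (Trivial).
public static class IdentityMapper
    extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Pass the input key/value pair through unchanged.
    context.write(key, value);
  }
}

public static class IdentityReducer
    extends Reducer<LongWritable, Text, LongWritable, Text> {
  @Override
  public void reduce(LongWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // Write every value for this key back out, paired with the key.
    for (Text value : values) {
      context.write(key, value);
    }
  }
}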
3 Compiling a MapReduce program
In order to compile your program, issue the following command. (Don't type the $ or anything to the left of it; it's there to show that you're typing at a terminal prompt.)
$ hadoop com.sun.tools.javac.Main Trivial.java
You'll then need to create a jar file containing your class files. Issue the following command to do this:
$ jar cf trivial.jar Trivial*.class
You should now have a jar file trivial.jar
in your directory.
4 Running Hadoop jobs locally
For the most part, you should test your code on your local machine, and only run it on the Amazon cluster once you're pretty sure it's correct. You can run a Hadoop job on the lab machines with a command somewhat like the following:
$ hadoop jar [jarfile] [mainClass] [arg1] [arg2] ...
In our example, [jarfile] is trivial.jar (which you just created) and [mainClass] is Trivial. The program takes two arguments: the input folder and the output folder. The input folder should have some input files to be processed, and the output folder should not exist - it will be created by the job. The folder trivialInput has some text that you can use as an example input.
So, here we go! Run the MapReduce job as follows:
$ hadoop jar trivial.jar Trivial trivialInput output
You should see a lot of output scroll past the screen. When it's done, it should hopefully give you a bunch of summary statistics, indicating how many bytes were read and written, and so on.
If something went wrong, it might be because you already have an output
directory, especially if you've tried to do this multiple times. Hadoop will
fail if the output directory already exists, so make sure to remove it before
running the above command.
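For example, to remove a stale output directory on your local machine before re-running the job:
$ rm -r output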
If all went as hoped, there should also be a folder output
with two files:
$ ls output
_SUCCESS    part-r-00000
part-r-00000
contains the job's results. Take a look! Opening big files in
editors is sometimes cumbersome, so you may wish to use the less
command to do
so:
$ less output/part-r-00000
5 Using a Hadoop cluster on Amazon
Hopefully by now you have created an AWS (Amazon Web Services) account, and gotten approved for AWS Educate credits. Without an AWS account you won't be able to do this portion of the lab; without the AWS Educate credits, you'll have to pay for the services yourself. Luckily, the work that we're doing shouldn't cost too much money in any case.
5.1 Setting up the cluster
Setting up a cluster manually on AWS is doable, but involves a lot of clicking around in the web console. Instead, we're going to create Hadoop clusters on AWS using a program called StarCluster, a set of Python scripts designed by some folks at MIT to make this more automatic.
StarCluster has already been installed on the department lab computers. If you want to use it on your own computer, you'll need to install it yourself.
The instructions below are based heavily on this StarCluster quick-start guide. Feel free to refer to it if you'd like to see the same steps in different wording.
5.1.1 Configure StarCluster with Amazon security credentials
To get started, log into one of the department lab machines, open a terminal window, and issue the command:
$ starcluster help
This will give you an error message indicating that the StarCluster config file does not exist yet, and then give you three options. Choose option 2, which is the one to write a config to your home directory.
Open up the file ~/.starcluster/config in your favorite editor. Within, you'll find a section that looks something like this:
[aws info]
aws_access_key_id = #your aws access key id here
aws_secret_access_key = #your secret aws access key here
aws_user_id = #your 12-digit aws user id here
You'll need to update this section with information from your Amazon account. First, create an access key: visit aws.amazon.com, and near the top right corner choose "My Account," then "Security Credentials." You may see a popup offering to get you started with IAM Users; if so, dismiss it by clicking "Continue to Security Credentials." Then select "Access Keys (Access Key ID and Secret Access Key)" and click the button labeled "Create New Access Key." A popup will show your access key id and your secret access key. I recommend clicking "Download Key File" to save a file containing both keys; this is your one and only chance to see the secret key without deleting it and creating another one, so the key file lets you look this information up later if you need it.
The access key id and secret access key are the codes you need for the first two lines of your StarCluster config file. Copy each of them into the appropriate place.
The StarCluster config file also needs your 12-digit AWS user id. To find it, click your name at the top of the Amazon web page you're on and choose "My Account" from the dropdown menu. Your 12-digit Account Id is what you want; copy and paste it into the StarCluster config file as well.
You'll next need to create ssh keys for connection with Amazon. After completing the above steps and saving your config file, issue the following command at the terminal:
$ starcluster createkey mykey -o ~/.ssh/mykey.rsa
(Technically, you can name your key something other than mykey
, but if you do,
you'll also need to change the [key]
section of the StarCluster config file.)
5.1.2 Further StarCluster configuration: get your cluster ready
In your StarCluster config file, there is a section labeled
[cluster smallcluster]
Shortly below that section header is a variable named CLUSTER_SIZE
that is set
by default to 2. This is where you configure the number of nodes you wish to
have in your cluster. 2 is boring. Change it to 4:
CLUSTER_SIZE = 4
You can actually make CLUSTER_SIZE
as big as you like, though it eventually
starts costing more money if you overdo it.
Near the bottom of this section of the config file, past a bunch of comments, is a line that lets you specify PLUGINS. Change this line so that it reads:
PLUGINS = hadoop
(Added on Mon 2/29) Next, find the section that looks like this, and uncomment both lines:
# [plugin hadoop]
# SETUP_CLASS = starcluster.plugins.hadoop.Hadoop
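Putting the edits from this subsection together, the relevant pieces of your config file should now look roughly like this (a sketch only; all other settings in the file stay as they were):

[cluster smallcluster]
# ... other settings unchanged ...
CLUSTER_SIZE = 4
# ...
PLUGINS = hadoop

[plugin hadoop]
SETUP_CLASS = starcluster.plugins.hadoop.Hadoop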
Phew, did you get all of that?
5.1.3 Starting up the cluster
Finally, let's start up your virtual cluster! At the terminal prompt, enter:
starcluster start mycluster
If this works, it will take a while: typically a few minutes. You'll see a long list of status messages as StarCluster does its work in setting up your Amazon cluster. When done, you should see a message saying "The cluster is now ready to use," followed by a summary of some of the commands you can use.
Congratulations! If you had trouble making this work, ask for help.
5.2 Logging into the cluster
To log in to your cluster, issue this command:
$ starcluster sshmaster mycluster
The sshmaster
option indicates that you are connecting to the master
node of the cluster, and it's the machine from which you'll launch your Hadoop
jobs on the cluster. If this works, you may need to answer "yes" to a question
about an RSA key fingerprint, and then you should see a welcome screen with some
ASCII art for StarCluster, and find yourself at a UNIX prompt.
If this login does not work, ask for help.
5.3 Transfer files to the cluster
Type exit
(or ctrl-D) to go back to your local machine. Make sure you are in
the directory where your Trivial.java
file is. The version of Hadoop running
on the Amazon cluster is not the same one running in our labs, so you'll need to
transfer the Java file over to the cluster, and compile it again over
there. Issue the commands below:
$ starcluster put mycluster Trivial.java /root
While you're at it, also transfer over the directory of data that you've been working with:
$ starcluster put mycluster trivialInput /root
SSH into your cluster again:
$ starcluster sshmaster mycluster
Once you have connected, issue an ls
command to verify that your files are
there on the cluster as expected.
5.4 Compile your Hadoop program
As before, you'll need to compile your Hadoop program. Hopefully it has no errors; it's the same program you ran locally. We are using a different version of Hadoop on the cluster than in the labs, but the differences from a programming perspective are exceedingly small, and shouldn't affect anything we do. Since the versions of Hadoop are different, transferring the jar file that we created locally won't work.
You'll again need to set up some configuration settings on the cluster for the locations of Java and Hadoop. As before, you'll need to add a couple of lines to a file. On the cluster, that file is .bashrc (not .bash_profile as it is on the department Macs). Since we're in a command-line situation, you'll need to use an editor that works in a terminal window. This is a great occasion to learn how to use vim or emacs! All of that said, there is a less capable editor known as pico that's pretty easy to use if you're just looking to make a quick edit. Issue the command
root@master:~# pico .bashrc
and then scroll down with the cursor keys to the bottom of the file. Copy and paste in the following two lines:
export JAVA_HOME=/usr/lib/jvm/default-java
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar
See if you can figure out how to save the file from there.
Once you have saved the above changes, log out of the cluster (type exit
),
then ssh back in again. If all worked out, the following compile commands should
now work:
root@master:~# hadoop com.sun.tools.javac.Main Trivial.java
If it works, then create a jar file again as you did earlier in this lab.
root@master:~# jar cf trivial.jar Trivial*.class
5.5 Get your input data into HDFS
You'll next need to transfer any input files onto HDFS, which is the cluster file system. You've already transferred trivialInput from your department computer to the cluster, but that directory is currently sitting in the root home directory on the master node. That's just the normal local hard disk of the master computer; it is not the same thing as putting the files into HDFS, which is the distributed file system with replication that Hadoop uses. (Hadoop automatically distributes your data over the multiple computers in your cluster as it best sees fit, and it also replicates it over multiple locations.)
You can access the cluster filesystem using the command hadoop fs, which has a bunch of UNIX-like options you can use with it. (Don't type the # with any of the commands below; it's there to indicate that you're typing at a terminal prompt on the cluster.)
# hadoop fs -help
Put the input data directory into HDFS using -put. The command below copies the whole trivialInput directory into /user/root on HDFS, so the files end up inside /user/root/trivialInput rather than loose in your HDFS home directory.
(below modified on 2/29)
# hadoop fs -put trivialInput /user/root
Now you should be able to see the input files in HDFS:
# hadoop fs -ls
Found 1 items
drwxr-xr-x   - root supergroup          0 2016-02-26 11:09 /user/root/trivialInput
root@master:~# hadoop fs -ls trivialInput
Found 2 items
-rw-r--r--   3 root supergroup     183773 2016-02-26 11:09 /user/root/trivialInput/pg19778.txt
-rw-r--r--   3 root supergroup     178845 2016-02-26 11:09 /user/root/trivialInput/pg2265.txt
Note that you DO NOT need to copy trivial.jar
into HDFS. It should stay in
your home directory on the cluster.
5.6 Running a job on the cluster
We now have all the pieces in place to run a job on the cluster! First, though, there are some nifty web-based tools you should take a look at.
You'll first need to find the public DNS name for your cluster master
computer. If you are still logged into the cluster, type exit
(or ctrl-D) so
you are back at your lab machine again. Then type:
starcluster listclusters
This will show you some information for your cluster. For me, it looks like:
-----------------------------------------
mycluster (security group: @sc-mycluster)
-----------------------------------------
Launch time: 2016-02-26 04:57:57
Uptime: 0 days, 00:51:58
Zone: us-east-1c
Keypair: mykey
EBS volumes: N/A
Cluster nodes:
     master running i-86608d05 ec2-54-163-131-240.compute-1.amazonaws.com
    node001 running i-85608d06 ec2-174-129-164-12.compute-1.amazonaws.com
    node002 running i-84608d07 ec2-54-145-159-21.compute-1.amazonaws.com
    node003 running i-9b608d18 ec2-54-204-211-159.compute-1.amazonaws.com
Total nodes: 4
You want to find the row associated with the master node. The rightmost string
there (for me, ec2-54-163-131-240.compute-1.amazonaws.com
) is the public DNS
that you can use to construct a URL for monitoring the status of your Hadoop
cluster in a web browser. You can copy and paste that DNS name into a browser
window, and add a port number. Different port numbers get you different
information:
- Port 50030 is the JobTracker interface for the cluster. It displays information on all of the jobs that are currently running.
- Port 50070 displays information about the cluster's distributed filesystem. This is probably of less immediate interest to you, but still interesting.
So for me, based on the above output from StarCluster, I would visit the following two URLs:
http://ec2-54-163-131-240.compute-1.amazonaws.com:50030
http://ec2-54-163-131-240.compute-1.amazonaws.com:50070
Make sure you can view those web pages, and ask for help if you need it.
Now, while you've got the first web page above visible (the 50030 one), go back to your terminal window, ssh to your cluster again, and start up your job just as you did on the lab machine. Note that the input and output paths now represent folders on HDFS.
$ starcluster sshmaster mycluster
# hadoop jar trivial.jar Trivial trivialInput output
Again, you should get a bunch of output, hopefully ending with reports of success. If you keep refreshing the job tracker page, you should be able to see your job as either in progress or complete. Click on the job ID and explore the information available.
(Note that if you already have an output directory in HDFS, possibly from a previous run, the above hadoop command will give you an error saying so. You'll need to remove that output directory or move it somewhere else; use hadoop fs -help to look for commands to remove or move directories.)
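For example, depending on which version of Hadoop the cluster is running, one of the following should remove a leftover output directory (check hadoop fs -help to confirm which form your version supports):
# hadoop fs -rmr output
# hadoop fs -rm -r output
Or, to keep the old results, move the directory aside instead:
# hadoop fs -mv output output-old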
Congrats! You've officially executed a MapReduce job on a real Hadoop cluster!
If you want to look at the job output back on your own lab computer, you'll first need to get it out of HDFS, then transfer it back to the lab:
# hadoop fs -get output/part-r-00000 .
# ls
# exit
$ starcluster get mycluster part-r-00000 .
$ head part-r-00000
5.7 Amazon billing info
Amazon charges money for these clusters. How much? Amazon's pricing model is a bit confusing, but here's what I know. Unless you've changed the type of node in your config file, it will create each node of type m1.small. Here is the webpage with pricing for so-called "previous generation" instances, which includes m1.small. (We're using previous generation instances instead of current ones because StarCluster doesn't play nicely (yet) with the new ones.) The prices fluctuate all the time. At the time I'm writing this lab, the node type m1.small is selling for $0.044 per hour. If you started four of them, that means you're paying approximately eighteen cents per hour. You can do the math yourself on how much it will then cost you based on how long you leave it running. Remember that if you successfully got your AWS Educate credits, that gets you $100 that you can use.
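To make that concrete (at the prices quoted above, which will likely have changed by the time you read this): 4 nodes × $0.044/hour ≈ $0.18/hour, which works out to roughly $4.25 for a full 24-hour day, or about $30 if the cluster is left running for a week.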
(Someone will notice that Amazon offers an "AWS Free Tier" with t2.micro
instances. Sadly, t2.micro
doesn't seem to play nicely with
StarCluster. If you want to go this route, you'll have to manually configure the
cluster and Hadoop via Amazon's web interface instead of with StarCluster.)
StarCluster also creates an 8 GB EBS (Elastic Block Storage) volume for each node. Pricing on that is here. Again, prices change all the time for this. StarCluster creates so-called standard magnetic volumes, which, at the moment I'm writing this lab, cost $0.05 per GB-month. In other words, you pay $0.05 for each GB that survives a month. Again, you can do the math to figure out what that means for your cluster, but this amounts to a small amount of money that should easily be covered by your AWS Educate credits.
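As an illustration (assuming one 8 GB volume per node, at the price quoted above): 4 nodes × 8 GB × $0.05/GB-month = $1.60 per month for the cluster's volumes.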
You should regularly check your Amazon billing status. Amazon can show you how much you have been charged, though the numbers are not updated in real time; they can be delayed by a few hours or up to a day. Until your clusters are shut down for good, check this regularly to make sure that you are not being charged more than you expect.

To check your billing status, visit aws.amazon.com, and near the top right of the browser window select "My Account," then "Billing & Cost Management." You'll see your current charges there, but that number can be deceiving: at least sometimes for me, it shows the amount you will be paying out of pocket, which often appears as $0 so long as your AWS Educate credits are covering it. That's fine, but it doesn't tell you whether you're quickly blowing through your AWS Educate money. The place I've found that's accurate is the "Bill Details" button at the top right of the billing screen; click it, then click the "Expand All" that appears below the summary. If you scroll down, you'll see your actual charges as well as what is being credited. Again, this is not always refreshed right away, so come back often.
5.8 How to terminate your cluster
Don't miss this section! Leaving your cluster running forever will eventually cost you money!
At this point, you've got a choice: you can leave your cluster running for the
next week or so as you'll be working on assignments, or you can terminate the
cluster and create a new one later. If you terminate it and make a new one
later, you won't need to update your .starcluster/config
(that's done), but
you would need to update your .bashrc
on the cluster again once you've made a
new one. If you leave your cluster running, Amazon will continue to charge you
money for it. How to proceed is up to you.
When you are ready to shut down your cluster for good, issue this command, which will take a bit of time to run:
$ starcluster terminate mycluster
Answer y
when it asks if you want to terminate the cluster.
To make sure everything is really shut down, you might issue the following commands after termination is complete:
$ starcluster listclusters
$ starcluster listinstances
$ starcluster listvolumes
Make sure to continue to check your AWS billing for the next few days after you've terminated your last cluster. If you accidentally left something running, you'll continue to get billed.
Starting up clusters again is pretty easy, so I've been choosing to shut down my clusters when done with a particular session, and start them up again later. If I'm running a long Hadoop job, of course, I need to keep it running.
6 Do it all again for your partner
If two of you are working side-by-side in the lab, I want to make sure that
everyone is able to get an AWS cluster up and running. Repeat this lab for your
partner. It should go much faster once you know what you're doing! One change,
when you do it a second time: use the WordCount.java
program instead. See if
you can make the appropriate changes in commands throughout this lab in order to
run that program.
If you are working alone, repeat this lab again for WordCount.java
yourself. It's worthwhile to make sure you understand how to adjust all of the
commands for running a different program.
7 What to turn in
Submit screenshots of your web browser showing the job tracker information on
the AWS cluster, as well as the file system information. (These are the two web
pages you were able to link to via the 50030 and 50070 ports.) Take the
screenshots after you have run the WordCount.java
program.
Make sure you do this before terminating your cluster.
8 Next steps
The next assignment will be to create an inverted index on a set of documents. I'll have another assignment page with more info on that, but if you want to get started, here's the basic idea:
An inverted index is a mapping of words to their locations in a set of documents. Most modern search engines use some form of an inverted index to process user-submitted queries. It is also one of the most popular MapReduce examples. In its most basic form, an inverted index is a simple hash table that maps each word in the documents to some sort of document identifier. For example, given the following 2 documents:
Doc1: Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
Doc2: Buffalo are mammals.
we could construct the following inverted file index:
Buffalo -> Doc1, Doc2
buffalo -> Doc1
buffalo. -> Doc1
are -> Doc2
mammals. -> Doc2
Your goal is to build an inverted index of words to the documents which contain
them. Your end result should be something of the form: (word, docid[])
.
Again, there will be another assignment page officially explaining what to do, but if you can get this inverted index working or close to working, it will be a big head start.
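If you want something concrete to start from, here is a rough sketch of the MapReduce shape of the problem, written against the same org.apache.hadoop.mapreduce API as above. The class names and output format here are hypothetical illustrations, not the official assignment specification, which may ask for something different.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// These classes would be nested inside a driver class set up like Trivial's.
// Map: for every word in a line, emit (word, name of the file the line came from).
public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
    for (String word : value.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), new Text(docId));
      }
    }
  }
}

// Reduce: gather the distinct document ids seen for each word.
public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  public void reduce(Text word, Iterable<Text> docIds, Context context)
      throws IOException, InterruptedException {
    Set<String> seen = new HashSet<String>();
    StringBuilder docList = new StringBuilder();
    for (Text d : docIds) {
      String doc = d.toString();
      if (seen.add(doc)) {               // only list each document once
        if (docList.length() > 0) {
          docList.append(", ");
        }
        docList.append(doc);
      }
    }
    context.write(word, new Text(docList.toString()));
  }
}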