Tips & Resources for Hadoop/MapReduce

Testing

The whole point of MapReduce is to work on massive data sets. You should not do all of your development and testing using the real data set; instead, slice out a small subset of the data to test on, and only switch to the real data set when you have the smaller data set working.
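One simple way to slice out a test subset is to copy the first few thousand lines of the input into a new file. The sketch below assumes a line-oriented text input; the class name SampleLines and its method are hypothetical names chosen for illustration, not part of Hadoop. It streams the file one line at a time, so it does not need to hold the full data set in memory.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of Hadoop): copies the first N lines of a text file.
public class SampleLines {

    /** Copies at most maxLines lines from input to output and returns the number copied. */
    public static long copyFirstLines(Path input, Path output, long maxLines) throws IOException {
        long copied = 0;
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String line;
            // Stop as soon as we have enough lines or the input runs out.
            while (copied < maxLines && (line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: java SampleLines <input> <output> <maxLines>");
            System.exit(1);
        }
        long n = copyFirstLines(Paths.get(args[0]), Paths.get(args[1]), Long.parseLong(args[2]));
        System.out.println("Copied " + n + " lines");
    }
}
```

Keep in mind that the first lines of a file are not always representative of the whole data set (for example, if the file is sorted), so it can be worth checking that the subset still exercises the cases your job needs to handle.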

Job Output

When running on the cluster, you should use a folder in your home directory as the output folder. Make sure that you delete this folder when you no longer need it. The easiest approach is to use the same output folder for every run and delete it between runs, since Hadoop will refuse to start a job whose output folder already exists.

Also, please be careful not to copy very large files into your home directory on the lab machines (it's OK to do so on the cluster).

@Override annotations

A really easy mistake to make is to modify the type parameters for your Mapper or Reducer class, then forget to update the types in the signature of map() or reduce(). Because Java allows method overloading, the compiler will assume you meant to add a new overload of the method. Hadoop then calls the inherited default implementation, and your code is silently ignored. You can preemptively avoid this issue by putting @Override annotations on your method declarations. The compiler will then report an error if the method signature does not match the overridden method's signature.
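The pitfall can be reproduced in plain Java without Hadoop. In this minimal sketch, BaseMapper is a hypothetical stand-in for a framework base class with a default map() method; it is not Hadoop's real Mapper, and its signature is simplified for illustration. BrokenMapper has had its type parameters changed to <Long, String> while its map() method still takes an Integer key, so it overloads rather than overrides.

```java
public class OverrideDemo {

    /** Hypothetical stand-in for a framework base class; NOT Hadoop's real Mapper. */
    static class BaseMapper<K, V> {
        // Default implementation, used whenever a subclass does not override it.
        public String map(K key, V value) {
            return "default";
        }
    }

    /** Type parameters were updated to <Long, String>, but map() was not. */
    static class BrokenMapper extends BaseMapper<Long, String> {
        // This is an overload, not an override. Adding @Override here
        // would turn the mistake into a compile-time error.
        public String map(Integer key, String value) {
            return "custom";
        }
    }

    /** Signature matches the type parameters, so @Override compiles. */
    static class FixedMapper extends BaseMapper<Long, String> {
        @Override
        public String map(Long key, String value) {
            return "custom";
        }
    }

    public static void main(String[] args) {
        // The "framework" only knows the base type, as Hadoop does.
        BaseMapper<Long, String> broken = new BrokenMapper();
        BaseMapper<Long, String> fixed = new FixedMapper();
        System.out.println(broken.map(1L, "x")); // prints "default"
        System.out.println(fixed.map(1L, "x"));  // prints "custom"
    }
}
```

Because the caller invokes map() through the base type, BrokenMapper's method is never selected, which mirrors how a mismatched map() or reduce() goes unused in a real job.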

Installing Hadoop on your personal machine

If you would prefer to test your code on your own computer, you can download a Hadoop distribution and run Hadoop code on it. I successfully installed Hadoop on both my Mac and my Linux machine using these standard install instructions. There is a link on that page with instructions for installing under Windows, though that process appears to be more complicated. I'd be curious to hear if you try it and get it working. Alternatively, the department offers a Linux virtual machine that you can run inside VirtualBox; you can install that, and then install Hadoop within it. Here are instructions for downloading and installing the department Linux VM. (Note that the above link goes to Carleton's wiki, which requires you to log in the first time you visit. After you log in, the wiki will not take you to the right page, so come back here and click the link again.)

If you want help getting Hadoop working on your computer, feel free to talk to me or Mike Tie.

Author: Written and modified by Laura Effinger-Dean, Jeannie Albrecht, and Dave Musicant

Created: 2016-02-26 Fri 09:17
