Tips & Resources for Hadoop/MapReduce
Testing
The whole point of MapReduce is to work on massive data sets. You should not do all of your development and testing using the real data set; instead, slice out a small subset of the data to test on, and only switch to the real data set when you have the smaller data set working.
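For example, you can slice off a small test set with standard shell tools. This is a minimal sketch with placeholder file names (and a generated stand-in for the "real" data set):

```shell
# Stand-in for the real data set (substitute your actual input file):
seq 1 100000 > full_dataset.txt

# Slice out the first 1,000 records to develop and test against:
head -n 1000 full_dataset.txt > test_dataset.txt
wc -l test_dataset.txt
```

Once your job produces correct output on `test_dataset.txt`, switch the job's input path back to the full data set.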
Job Output
When running on the cluster, you should use a folder in your home directory as the output folder. Make sure that you delete this folder if you don't need it anymore. The easiest thing to do is just use the same output folder for every run, deleting it in between runs.
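A sketch of that workflow, using a placeholder folder name (`wc_output`). On the cluster's HDFS the delete step is `hadoop fs -rm -r -f wc_output`; in local standalone mode the output folder is just a directory, simulated here:

```shell
# Clear last run's output folder so the same path can be reused
# ("wc_output" is a placeholder for your output path):
OUT=wc_output
rm -rf "$OUT"

# Re-running the job recreates the folder; simulated here:
mkdir -p "$OUT" && echo "hello 1" > "$OUT/part-r-00000"
```

Deleting up front matters because Hadoop refuses to start a job whose output folder already exists.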
Also, please be careful not to copy massively large files into your home directory on the lab machines (it's ok to do so on the cluster).
@Override annotations
A really easy mistake to make is to modify the type parameters of your Mapper or Reducer class but forget to update the types in the signature of map() or reduce(). Because Java allows method overloading, the compiler assumes you meant to overload the method, so your class inherits the default implementation and your code is silently ignored. You can preemptively avoid this issue by putting @Override annotations on your method declarations: you will then get a compile-time error if the method signature does not match the overridden method's signature.
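The pitfall can be sketched without the Hadoop jars. The classes below (Base, Buggy, Fixed) are hypothetical stand-ins, but Hadoop's real Mapper and Reducer behave the same way:

```java
// Stand-in for Mapper/Reducer: a generic base class with a default method.
class Base<K> {
    String handle(K key) { return "default"; }  // the "do nothing useful" version
}

class Buggy extends Base<String> {
    // Wrong parameter type (Integer, not String): this OVERLOADS handle()
    // instead of overriding it, so Base's default still runs for String keys.
    String handle(Integer key) { return "buggy"; }
}

class Fixed extends Base<String> {
    @Override  // with a mismatched signature, this line becomes a compile-time error
    String handle(String key) { return "fixed"; }
}

public class OverrideDemo {
    public static void main(String[] args) {
        System.out.println(new Buggy().handle("key"));  // prints "default"
        System.out.println(new Fixed().handle("key"));  // prints "fixed"
    }
}
```

Calling `new Buggy().handle("key")` silently invokes the inherited default, which is exactly what happens to a mistyped map() or reduce() in a real job.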
Installing Hadoop on your personal machine
If you would prefer to test your code on your own computer, you can download a Hadoop distribution and run Hadoop code on it. I successfully installed Hadoop on both my Mac and my Linux machine using these standard install instructions. That page also links to instructions for installing under Windows, though that appears to be more complicated; I'd be curious to hear if you try it and get it working. Alternatively, the department offers a Linux virtual machine that you can run inside VirtualBox; you can install that, and then install Hadoop within it. Here are instructions for downloading and installing the department Linux VM. (Note that this link goes to Carleton's wiki, which will require you to log in the first time you visit; after you log in, it won't take you to the right page, so come back here and click the link again.)
If you want help getting Hadoop working on your computer, feel free to talk to me or Mike Tie.