Contextual Advertising with Hadoop/MapReduce
This is a pair programming assignment. If you are on a team, this means that you and your partner should be doing the entirety of this assignment side-by-side, on a single computer, where one person is "driving" and the other is "navigating." Take turns driving every so often; you should each spend approximately 50% of the time driving.
(Update on 3/3: I fixed the filenames below.)
We will run the click rate projects in the following way:
hadoop com.sun.tools.javac.Main ClickRate.java
jar cf clickrate.jar ClickRate*.class
hadoop jar clickrate.jar ClickRate impressions clicks output
The first two lines should create the file clickrate.jar. The last line runs your program with input folders impressions and clicks and output folder output.
Introduction
The goal of this project is to continue to become familiar with Hadoop/MapReduce, a popular model for programming distributed systems that was developed by Google; Apache's Hadoop is its widely used open-source implementation. MapReduce is designed to solve a large class of problems in situations where the amount of data to be processed is extremely large. The idea is that a problem is boiled down to a map step, which turns each element of the data into a key/value pair, and a reduce step, which combines elements with the same key. For example, we can count the number of times a given word appears in a text by (1) mapping each word to the key/value pair (word, 1) and (2) summing the values for each word.
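As a concrete illustration (a sketch, not assignment code), the classic Hadoop word count looks roughly like this; all class and variable names here are my own:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce step: sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}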
Using the Cluster
Here is a link to the in-class lab that explains how to run jobs on the lab machines and on the Hadoop cluster: Hadoop Lab.
Part 1: Warmup Exercise - Inverted Index
An inverted index is a mapping of words to their location in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries. It is also one of the most popular MapReduce examples. In its most basic form, an inverted index is a simple hash table which maps words in the documents to some sort of document identifier. For example, if given the following two documents:
Doc1: Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.
Doc2: Buffalo are mammals.
we could construct the following inverted file index:
Buffalo -> Doc1, Doc2
buffalo -> Doc1
buffalo. -> Doc1
are -> Doc2
mammals. -> Doc2
Your goal is to build an inverted index of words to the documents which contain them. You can try this on the files in the dataset located in /Accounts/courses/cs348w16/gutenberg on the lab machines. You can also download this data set here. Your end result should be something of the form: (word, docid[]).
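If it helps to see the shape of a solution, here is a minimal sketch of an inverted index mapper and reducer, under the assumption that each document is a single input file whose name serves as the document id; the driver setup would mirror the word count example above.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

    // Map step: emit (word, document name) for every token in the line.
    public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Use the name of the file this split came from as the doc id.
            String docId = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                context.write(new Text(itr.nextToken()), new Text(docId));
            }
        }
    }

    // Reduce step: collect the distinct documents containing each word.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Set<String> docs = new HashSet<>();
            for (Text val : values) {
                docs.add(val.toString());
            }
            context.write(key, new Text(String.join(", ", docs)));
        }
    }
}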
Part 2: Building a Summary Table
Suppose we work for an internet advertising company. We want to better target our ads to users, based on prior data. That is, given an advertising context, we would like to predict which of our available advertisements is most likely to result in a click.
The ad serving machines produce two types of log files: impression logs and click logs. Every time we display an advertisement to a customer, we add an entry to the impression log. Every time a customer clicks on an advertisement, we add an entry to the click log.
For this assignment, we will look at one particular feature: the page URL on which we will be showing the ad. This is called the "referrer" in the impression logs. DO NOT use the hostname! Given a page URL and an ad id, we wish to determine the click-through rate, which is the fraction of impressions with the desired page URL and ad id that were clicked. For consistency's sake, you should output a double between 0 and 1.
Your goal is to build a summary table of click-through rates, which could later be queried by ad serving machines to determine the best ad to display. Logically, this table is a sparse matrix with the axes page URL and ad id. The value represents the fraction of times an ad was clicked.
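For example, if a particular (page URL, ad id) pair appears in 200 impression log entries and 5 of those impressions were clicked, its click-through rate is 5/200 = 0.025.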
Please do this on the files in the dataset located here. These files were generated by merging a large number of small files. If you want to work with the set of smaller files (e.g., to create a smaller data set for testing), they are here. Do not download either of these tarballs into your lab home directory. The files are available on the lab machines in /Accounts/courses/cs348, in the folders impressions, clicks, impressions_merged, and clicks_merged. I am posting them here so you can download them onto your personal machine if you wish.
If you wish to download these files to an AWS cluster that you have created with StarCluster, first ssh into the master node of the cluster. Once you have done so, you can grab a file you want with the wget command. For example:
wget http://www.cs.carleton.edu/faculty/dmusican/cs348w16/hadoop/click_through_data_merged.tar.gz
If you do choose to download them to your own computer or to the AWS cluster (not your lab account), make sure to put the file in a directory of its own. On any UNIX system, you can then use the tar command to expand the file, for example:
tar xvfz click_through_data_merged.tar.gz
You'll have to do a little more than simply tokenize the text on whitespace. The log files are stored in JSON format, with one JSON object per line. You should use a JSON library, such as json-simple. You'll have to add the jar file for your JSON library to your HADOOP_CLASSPATH before compiling your program. (Go back and look at the initial Hadoop lab; configuring HADOOP_CLASSPATH was one of the first things we did.) For example, after downloading json_simple-1.1.1.jar from the json-simple website, I could update my HADOOP_CLASSPATH as follows:
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/Accounts/dmusican/json-simple-1.1.1.jar
You would adjust the path above to wherever you placed the jar file.
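Once the library is on the classpath, parsing a log line inside a mapper might look like the following sketch. The "referrer" field is named above; "adId" is my guess at the other field name, so check the actual log files to confirm it.

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

// Helper you might call from your mapper: each input line is one JSON object.
static String[] extractFields(String line) {
    JSONParser parser = new JSONParser();
    try {
        JSONObject record = (JSONObject) parser.parse(line);
        String pageUrl = (String) record.get("referrer"); // field named in the assignment
        String adId = (String) record.get("adId");        // hypothetical field name; check the logs
        return new String[] { pageUrl, adId };
    } catch (ParseException e) {
        return null; // skip malformed lines rather than failing the whole job
    }
}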
You should produce output in the following format:
[page_url, ad_id] click_rate
This can be achieved by making the key the string [page_url, ad_id], and the value click_rate.
Submission of part 2: two portions
Part 2a: Some form of analysis on the click/impressions data
This is an intermediate submission to demonstrate that you are making progress on this project. Submit some form of working Hadoop code that performs some analysis on the click/impressions stream data; precisely what sort of analysis is up to you. Do something. The key requirements are that it uses the correct data for the assignment, and that it runs successfully and produces some sort of reasonable output.
Your submission should follow the guidelines in the submission section below. In your README.txt, describe briefly what your current version of the code does.
Part 2b: Final submission
This is where you submit the code that actually works as specified above.
Bonus Work
If you've finished the project, try implementing these extensions for a bonus point each:
- Write a query program on top of your feature index, which will accept a user-specified page url and return the ID of the ad to show.
- There are other features besides page URL which could help predict a click. Include both user agent and IP address as features in your table. You might consider adding prefixes to the features to distinguish the different types. (If you implement this feature, please include it as a separate file for grading purposes.)
Submission
- A zip file containing your code, including Java files and any necessary libraries.
- A README.txt file, containing detailed instructions on how to compile and run your code. Also indicate which bonus work, if any, you attempted.
- Citations in a text file credits.txt.
Some last tips
You might find this sample code useful; it shows how to apply different mappers to different inputs.
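In case it helps, here is an unofficial sketch of that wiring, under the assumption that each mapper tags its records so the reducer can tell impressions from clicks; the class names and the buildKey placeholder are mine, not required structure.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ClickRateSketch {

    // Emits ([page_url, ad_id], "impression") for each impression record.
    public static class ImpressionMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(buildKey(value.toString())), new Text("impression"));
        }
    }

    // Emits ([page_url, ad_id], "click") for each click record.
    public static class ClickMapper extends Mapper<Object, Text, Text, Text> {
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(new Text(buildKey(value.toString())), new Text("click"));
        }
    }

    // Placeholder: parse the JSON line and return "[page_url, ad_id]",
    // as sketched in the json-simple example earlier.
    static String buildKey(String jsonLine) {
        return jsonLine; // replace with real JSON field extraction
    }

    // Counts impressions and clicks per key, then divides.
    public static class ClickRateReducer extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            long impressions = 0;
            long clicks = 0;
            for (Text val : values) {
                if (val.toString().equals("click")) {
                    clicks++;
                } else {
                    impressions++;
                }
            }
            if (impressions > 0) {
                double rate = (double) clicks / impressions;
                context.write(key, new Text(Double.toString(rate)));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "click rate");
        job.setJarByClass(ClickRateSketch.class);
        job.setReducerClass(ClickRateReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // A different mapper for each input folder.
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, ImpressionMapper.class);
        MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, ClickMapper.class);
        FileOutputFormat.setOutputPath(job, new Path(args[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}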