Distributed Computing via Apache Spark

This is a pair programming assignment. If you are on a team, this means that you and your partner should be doing the entirety of this assignment side-by-side, on a single computer, where one person is "driving" and the other is "navigating." Take turns every 15 minutes and use a timer to help you manage this.

Introduction

The goal of this project is to continue to become familiar with Apache Spark, a popular tool for doing data analysis on distributed systems.

For this assignment, my recommendation is to do all of your development on one of our lab machines, rather than on the Google Cloud Platform. After you've got it working, you can try it on a GCP cluster and vary the number of processors to see what kind of speed increases you can get.

Part 1: Word-to-file index

One kind of index that web search engines need to make is one which connects words with documents. For example, if given the following 2 documents:

Doc1: Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.

Doc2: Buffalo are mammals.

we could construct the following word-to-file index:

Buffalo -> Doc1, Doc2
buffalo -> Doc1
buffalo. -> Doc1
are -> Doc2
mammals. -> Doc2

Your goal is to build a word-to-file index of words contained in a series of files that I will supply. Specifically, you can try this on the files in the dataset located in /Accounts/courses/cs348/gutenberg on the lab machines. You can also download this data set here.

Each row of your end result should be something of the form: (word, [docid1, docid2, ...]). Write your output to a directory called part1out.txt.
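
In case it helps you get started, here is a minimal PySpark sketch of one way to approach this. It assumes simple whitespace tokenization (matching the example above) and uses each file's full path as its document ID, which you may want to trim down to a bare file name. Treat it as a starting point, not the required solution.

# word_index.py -- a minimal PySpark sketch (illustrative, not the required solution).
# Assumes whitespace tokenization; uses the file path as the document ID.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordToFileIndex").getOrCreate()
sc = spark.sparkContext

# wholeTextFiles yields (filePath, fileContents) pairs, one per file.
files = sc.wholeTextFiles("/Accounts/courses/cs348/gutenberg")

# Emit (word, docid) for every whitespace-separated token, drop duplicate
# pairs, then collect the document IDs for each word into a list.
index = (files
         .flatMap(lambda pair: [(word, pair[0]) for word in pair[1].split()])
         .distinct()
         .groupByKey()
         .mapValues(list))

# Note that Spark writes a directory of part files, not a single file.
index.saveAsTextFile("part1out.txt")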

Part 2: Click-through rates

Suppose you're doing some freelance work for an internet advertising company. (Someone has to make sure StackOverflow can pay its bills…) You've been asked to help the company determine the click-through rate for all of its ads that appear on webpages. Every time an ad is shown to someone on a particular website, that's called an impression. Yes, you might just roll your eyes when the ad appears, or ignore it entirely, but that's an impression nonetheless. The click-through rate is the number of clicks on a particular ad divided by the number of impressions. A high click-through rate presumably means the ad is working.

(Food for thought: even if an ad isn't clicked on, it might nonetheless be having effects such as enhancing brand recognition. It also might be that some clicks are accidental, especially if it's one of those really annoying ads that pretends to be a button of some sort. But on with the assignment…)

You've been provided with two types of log files: impression logs and click logs. Every time an advertisement is displayed on a website, an entry is added to the impression log. Every time a customer clicks on an advertisement, an entry is added to the click log.

Your goal is to build a summary table of click-through rates. For example: if a particular advertisement appeared 10 times on reddit.com and was clicked on twice, the click-through rate for that ad would be 0.2. Note that the "same" ad (i.e., same text and graphics) could appear on a different website, but we would consider that to be a different ad for purposes of this calculation. Said differently, for every combination of URL and adId, you are looking for the ratio of the number of clicks to the number of impressions. Specifically, this should be a number ranging from 0 to 1.

Please do this on the files in the dataset located here. Do not download either of these tarballs into your lab home directory. The files are available on the lab machines in /Accounts/courses/cs348, in the folders impressions and clicks. I am posting them here so you can download them onto your personal machine if you wish.

If you do choose to download them to your own computer or to GCP (not your lab account), make sure to put each tarball in a directory of its own. On any UNIX system, you can then use the tar command to expand it, for example:

tar xvfz click_through_data.tar.gz

You should produce output in a format something like the following:

[page_url, ad_id] click_rate
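
To make the overall shape concrete, here is a hedged PySpark sketch of the join-and-divide approach: count impressions and clicks per (page_url, ad_id) pair, join the two counts, and divide. The parse_record helper and its assumed JSON field names (referrer, adId) are purely illustrative; you will need to replace them with parsing that matches the actual log format.

# ctr.py -- a PySpark sketch of the join-and-divide approach (illustrative).
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ClickThroughRates").getOrCreate()
sc = spark.sparkContext

def parse_record(line):
    # ASSUMPTION (illustration only): each log record is a JSON object with
    # the page URL under "referrer" and the ad ID under "adId". Inspect the
    # real log files and adjust this parsing to their actual format.
    record = json.loads(line)
    return (record["referrer"], record["adId"])

impressions = (sc.textFile("/Accounts/courses/cs348/impressions")
                 .map(parse_record)
                 .map(lambda key: (key, 1))
                 .reduceByKey(lambda a, b: a + b))

clicks = (sc.textFile("/Accounts/courses/cs348/clicks")
            .map(parse_record)
            .map(lambda key: (key, 1))
            .reduceByKey(lambda a, b: a + b))

# A left outer join keeps ads that were shown but never clicked; their click
# count comes back as None, which we treat as zero.
rates = (impressions.leftOuterJoin(clicks)
                    .mapValues(lambda c: (c[1] or 0) / c[0]))

rates.saveAsTextFile("part2out")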

Submission of part 2: two portions

Part 2a: Some form of analysis on the click/impressions data

This is an intermediate submission to demonstrate that you are making progress on this project. Submit some form of working Spark code that does some analysis on the click/impressions data; precisely what sort of analysis is up to you. Do something. The key requirements are that your code uses the correct data for the assignment, and that it runs successfully and produces some sort of reasonable output.

Your submission should follow the guidelines in the submission section below. In your README.txt, describe briefly what your current version of the code does.

Part 2b: Final submission

This is where you submit code that actually works as specified above and runs in a reasonable period of time. What's reasonable? For starters, verify that your CPUs are all being used; if you've goofed and written serial Python code, you'll see only one processor in use.
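
Besides watching top or htop on the machine, one quick sanity check from inside PySpark is to ask Spark how much parallelism it thinks it has (my_rdd below is a stand-in for whichever RDD you're working with):

print(sc.defaultParallelism)       # how many cores Spark believes it can use
print(my_rdd.getNumPartitions())   # how many parallel tasks this RDD will spawn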

Submission

  • A zip file containing your code, including Java files and any necessary libraries.
  • A README.txt file, containing detailed instructions on how to compile and run your code. Also indicate which bonus work, if any, you attempted.
  • Citations in a text file credits.txt.