Apache Spark Lab

This lab will get you up and running with Apache Spark.

1 Setting up your account in our labs

In the department labs, you'll need to configure your account so it knows where to find the executables and libraries that Spark needs. On the department computers, you should have a file in your home directory named .bash_profile. It contains a series of commands that are automatically run whenever you log in to a desktop or open a terminal window. Because the file's name starts with a ., it is treated as a "hidden" file: if you just issue a regular ls command at the command line, you won't see it, and you may not see it in Finder windows either. To find the file, in a terminal window, issue the command ls -a (the "a" is for "all"). You should see the file. Open up .bash_profile in your favorite editor, and add the following to the bottom of it:

export PYSPARK_PYTHON=python3
export SPARK_HOME=/usr/local/spark-2.3.2-bin-hadoop2.7
export PATH=$PATH:/usr/local/spark-2.3.2-bin-hadoop2.7/bin

If you have significantly modified your login setup on the department machines, you'll have to make changes to this approach. Ask for help. If you wish to do this on your own computer, you'll need to first install Spark. That's a multistep procedure that you'll have to spend some time on.

To make sure that the above worked, close your terminal window, open up a new one, and type

echo $SPARK_HOME

If things are configured correctly, you should see:

/usr/local/spark-2.3.2-bin-hadoop2.7
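
You can also double-check that the Spark bin directory made it onto your PATH; if it did, the following command should print the full path to the spark-submit script:

which spark-submit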

2 A first program

Download and unzip sparklab.zip.

The file spark1.py contains a simple Spark program that counts the number of lines that have the letter a in them, and the number that have the letter b in them. This program is fairly short, but there's a lot going on. Read through it and see what you can make sense of.
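If you'd like a preview of the general shape before you open the file, a PySpark program that does this kind of counting might look roughly like the sketch below. This is only an illustrative sketch, not necessarily the exact contents of spark1.py; the variable names and the use of SparkSession here are my own choices.

import sys
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; the input file name comes from the command line.
spark = SparkSession.builder.appName("LetterCount").getOrCreate()
lines = spark.read.text(sys.argv[1]).rdd.map(lambda row: row[0])

# Each filter/count pair is a distributed computation over the lines of the file.
num_a = lines.filter(lambda line: "a" in line).count()
num_b = lines.filter(lambda line: "b" in line).count()

print("Lines with a: %d, lines with b: %d" % (num_a, num_b))
spark.stop()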

To run your program, issue the following command. (Don't type the $ or anything to the left of it; it's there to show you that you're typing at a terminal prompt.)

$ spark-submit spark1.py almamater.txt

You should see a lot of logging output scroll past the screen. At the very bottom, though, should be the actual output of the program, telling you the number of lines where the letters a and b appear. If that works, try again with a bigger file:

$ spark-submit spark1.py pg84.txt

If you failed to get reasonable output, ask for help.

3 Using a Spark cluster on Google Cloud

The key advantage of using Spark is that it will automatically distribute work in parallel across a cluster of computers. For example, for the test program in the previous section, it will automatically split the file into multiple pieces, do the line counts separately on each piece, and then aggregate the results. Running Spark as we just did on the lab machines is fabulous for testing code, but we don't actually have a Spark cluster set up locally. Instead, we're going to use Google Cloud Platform. Google has granted everyone in our course a $50 credit to use in setting up their own Spark cluster, so we'll walk through setting it up.

3.1 Redeem your coupon

Your first step is to follow this link from our Moodle site to redeem your coupon. You'll end up at a screen asking for your email address; this must be your Carleton email address. In other words, the Google Cloud Platform work you'll be doing must be linked to your Carleton Google account in order for the coupon to work. I did this and it worked for me, though I don't have an easy way to repeat the process to provide step-by-step instructions. Hopefully it's reasonably clear; ask for help if you need it.

3.2 Get to the dashboard and set things up

In your browser, visit the Google Cloud Platform (GCP) console at https://console.cloud.google.com/. Make sure that you are logged in to your Carleton Google account. If it works for you like it did for me, you should hopefully have a project called "My First Project" automatically created and visible in the Project info box. If not, there is a project dropdown at the top which you can select and then use to create a project.

In the tab on the left, click "Billing." Verify that you can see your $50 in credits.

Set up a default region for yourself. This way all of your clusters will be built nearby instead of across the world, which should improve communication speeds. In the GCP console, from the menu on the left choose "Compute Engine" (it's in the second section down), then "Settings". For "Region", select us-central1. For zone, choose us-central1-c. (It doesn't really matter which one we pick, but it will be easier for me to manage if we all make the same choice.) Don't forget to click the "Save" button at the bottom.
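
If you end up doing more of this from the command line later on, gcloud can also remember client-side defaults for its own commands. This is separate from the project-level setting above and purely optional; the commands below are just for reference, and the exact behavior may vary with your gcloud version.

$ gcloud config set compute/region us-central1
$ gcloud config set compute/zone us-central1-c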

Finally, you'll need to enable the Cloud Dataproc API. This is a one-time only setting. To do so, visit the Cloud Dataproc API registration page, select your project, then click "Continue." A step will follow inviting you to set up credentials; you can skip this.

As usual, ask for help if you are having trouble finding this information. You can also ask the people around you for help if they've found the information you're having trouble finding.

3.3 Set up and test the Spark cluster

Visit the GCP console at https://console.cloud.google.com. In the menu on the left, scroll down to "Dataproc" (it's pretty far down, under "Big Data") and click on "Clusters." (The first time you do this, you will need to click "Enable API" before you create a cluster.) Click on the "Create cluster" button. The "Name" box is where a name for your cluster should go. Use the name example-cluster so that the rest of my examples work. Stick with the rest of the defaults, then click the "Create" button at the bottom. You'll then see a screen that shows your cluster listed, with a status on the right of "Provisioning." You'll then need to wait for your cluster to come up. This may literally take a few minutes, so be patient.
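
For reference, roughly the same thing can be done from the Cloud Shell command line with a command like the one below; the exact flags can vary with your gcloud version, and the console route above is the recommended one for this lab.

$ gcloud dataproc clusters create example-cluster --region us-central1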

Once this is complete, you should open up the Cloud Shell terminal window. To do so: in your GCP dashboard, there is a blue bar at the top, with a set of white icons on the top-right. The leftmost(?) icon looks like a terminal window, with ">_" symbols on it. Click that icon. Then you will likely need to click the text "start cloud shell" in the middle of your window. After a perhaps slightly uncomfortable amount of time, a Cloud Shell terminal window in your browser should pop up.

In the Cloud Shell terminal window, run some built-in Spark code to compute pi. Copy and paste the following into your web terminal prompt:

gcloud dataproc jobs submit spark --cluster example-cluster \
  --class org.apache.spark.examples.SparkPi \
  --jars file:///usr/lib/spark/examples/jars/spark-examples.jar -- 1000

If this runs successfully, you should see a bunch of logging info; but buried in the middle of it should be the program output with an approximation of pi. Look for it at the bottom of the "INFO" log statements shown.
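
If you'd rather not dig through the log output, you can also list the jobs you've submitted to the cluster; a command along these lines should show the SparkPi job and whether it succeeded (the exact flags may vary with your gcloud version):

$ gcloud dataproc jobs list --cluster example-cluster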

3.4 Upload your program

The much more interesting thing to do, of course, is to run your Python program on your data. To do this, you'll want to upload your program, and upload your data. They each need to go to a different place.

First, let's upload your program, which is the easier part. In the Cloud Shell window in your browser, find the menu in the top right denoted with three vertical dots. Just to be confusing, Google has added two menus that look like this on the screen. You want the one that's on the top right of your Cloud Shell terminal window. Click that menu, then click "Upload File." Navigate to find your spark1.py program, and upload it to GCP. Once you've done so, type ls in the GCP terminal window to verify that your program is there. Ask for help if it isn't.

3.5 Get your input data into GS

You'll next need to transfer any input files onto GS, which is the Google Storage file system. You've already transferred spark1.py from your department computer, but that file is currently sitting in your home directory on the Cloud Shell machine. This is not the same thing as putting the file into GS, which is a distributed file system with replication that Spark will use. (GS automatically distributes your data and replicates it over multiple locations, and all of the computers in your cluster can read it in parallel.)

To access your GS buckets, visit your GCP storage browser. You might or might not have a bucket created by default with an extremely long name that's hard to manage. If you do, I recommend creating a new one with an easier name. If you don't have a bucket created by default, you'll want to create one anyway. So in either case, click the "create bucket" button to create one. GCP requires that all buckets across all of Google have unique names. My recommendation is to use something like username-carleton-edu (where username is your username). Type in your bucket name, and then in the radio buttons below, choose "Regional" (this is slightly cheaper than multi-regional). Make sure that the Location dropdown shows us-central1.

If this works, you'll be taken to a bucket details screen where you can transfer files. Upload the almamater.txt file by clicking on the "Upload files" button.
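
Alternatively, if you also have a copy of almamater.txt in your Cloud Shell home directory, you could copy it into the bucket from the command line with gsutil instead (substitute your own bucket name):

$ gsutil cp almamater.txt gs://username-carleton-edu/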

Switch back to the Cloud Shell terminal window. You should now be able to see the file in GS (substitute your own bucket name for mine in the command below):

$ gsutil ls gs://dmusicant-carleton-edu

3.6 View your job history

It can be useful to see all the Spark jobs that you've run, as well as the jobs that are running. To do this, in the Cloud Shell, issue the following command to set up an SSH tunnel for web interfaces.

$ gcloud compute ssh example-cluster-m -- -4 -N -L 8080:example-cluster-m:8088
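
In case you're wondering what this command does: the -L flag forwards port 8080 in your Cloud Shell to port 8088 on example-cluster-m (the cluster's master node), which is where the Hadoop/YARN web interface runs, and the -N flag tells ssh to just forward traffic rather than open a shell on the master node.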

The first time you run this command, you will be prompted to generate ssh keys. You'll see text like this:

Do you want to continue (Y/n)?  Y
Generating public/private rsa key pair.
Enter passphrase (empty for no passphrase):

You should answer "Y" to that, and then press Enter for the next two questions so as to make a blank passphrase.

You may get asked whether you meant a different zone; just answer no (n) if you see that question. Otherwise, after issuing this command, nothing should return; your terminal will just appear to hang. This means that it is now forwarding all web requests appropriately.

Next, click the icon in the top right of your Cloud Shell terminal window that looks like a window with a big diamond in it (this is the web preview button). When you click it (if you've found the right button), it will say "Preview on port 8080". Click on that. If all goes well, a new tab will open in your browser with a big "hadoop" in the top left, showing a history of all the Hadoop and Spark jobs that you run. This is tricky to get to work, so ask for help if you need it.

Once you've got this working, go back to your terminal window. Now you've got a challenge: you want the ssh command above to keep running, but you also want to be able to run other commands. So we're going to put that job into the background. To do this, in the Cloud Shell, hit the key combination ctrl-z. You should see a message indicating that the command has stopped. Then type bg, which resumes the stopped job in the background. Switch back to the tab showing the Hadoop history, and reload it to verify that it's still working.
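
If you want to reassure yourself that the tunnel is still alive, the shell's built-in jobs command will list it among your background jobs:

$ jobs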

Finally, to run your program on the cluster, in the Cloud Shell, enter the following command (make sure to put your actual username in for the name of your GS bucket):

$ gcloud dataproc jobs submit pyspark --cluster=example-cluster spark1.py -- gs://username-carleton-edu/almamater.txt

Again, you should get a bunch of output, hopefully ending with reports of success. If you keep refreshing the job tracker page, you should be able to see your job as either in progress or complete. Click on the job ID and explore the information available.

Hopefully this worked! Let me know how I can help.

Finally, note that I am asking you to turn in a screenshot of the Hadoop job tracker webpage to verify you've gotten this working. You can use Command-Shift-3 to capture the whole screen, or Command-Shift-4 to select a region (press the space bar right after Command-Shift-4 to capture a single window).

3.7 GCP billing info

Google charges money for these clusters. How much? First, there's the cost for the Dataproc cluster that we created. At the time I'm looking at this, it's something like $0.05 an hour for the cluster because we picked a small cheap one, but the prices can go way up to $1.00 an hour for megaclusters of 96 machines and terabytes of memory. Likewise, we're also paying for storage. The cost for the GS storage that we're using appears to be something like $0.026 per GB, per month. So presumably, you shouldn't have any troubles making things work with the $50 coupon that you received from Google. That said, it's worth keeping an eye on your balance.
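
To put those numbers in perspective: at roughly $0.05 an hour, leaving a small cluster like ours running around the clock for an entire week would use about $0.05 × 24 × 7 ≈ $8.40 of your credit, while a cluster that is only up for a few hours of lab work costs well under a dollar.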

You should regularly check your Google billing status. It does not update in real time; charges can show up a few hours late. Until your clusters are shut down for good, keep checking it to make sure that you are not being charged more than you expect. To check your billing status, visit https://console.cloud.google.com and choose Billing from the menu on the left.

3.8 How to terminate your cluster

Don't miss this section! Leaving your cluster running forever will use up all of your credits!

At this point, you've got a choice: you can leave your cluster running for the next week or so while you work on assignments, or you can terminate the cluster and create a new one later, which is not too hard. (Your Cloud Shell home directory should persist either way.) If you leave your cluster running, Google will continue to charge you credits for it. How to proceed is up to you; I'd recommend shutting your cluster down and starting it up again when you need it.

Likewise, you have the choice of whether to delete your GS buckets or to leave them. If your data is small, the storage charges are so small that you may as well leave the buckets for convenience. But it's up to you.

When you are ready to shut down your cluster, visit the GCP console, and in the menu on the left, choose "Dataproc," then "Clusters." Click the checkbox next to the name of your cluster, then click "Delete" which appears in blue text in a strip higher up on the screen. It will take some time to delete, but it eventually will.
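
If you happen to have the Cloud Shell open anyway, the cluster can also be deleted from the command line with something like the following (again, the exact flags may depend on your gcloud version):

$ gcloud dataproc clusters delete example-cluster --region us-central1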

Likewise, to delete your GS bucket(s), in the GCP console menu, find "Storage", then "Browser". GCP might have made more than one bucket for you. You can delete them all.
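
A bucket and everything in it can also be removed with gsutil from the Cloud Shell. Be careful: this deletes the bucket's entire contents, and you'll want to substitute your own bucket name.

$ gsutil rm -r gs://username-carleton-edu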

Make sure to continue to check your GCP billing for the next few days after you've terminated your last cluster. If you accidentally left something running, you'll continue to get billed.

Starting up clusters again is pretty easy, so I've been choosing to shut down my clusters when I'm done with a particular session and start them up again later. If I'm running a long job, of course, I need to keep the cluster running.

4 Do it all again for your partner

If two of you are working side-by-side in the lab, I want to make sure that everyone is able to get a Spark cluster on Google Cloud up and running. Repeat this lab for your partner. It should go much faster once you know what you're doing! One change when you do it a second time: use the wordcount.py program instead. See if you can make the appropriate changes to the commands throughout this lab in order to run that program.

If you are working alone, repeat the lab with wordcount.py yourself. It's worthwhile to make sure you understand how to adjust all of the commands for running a different program.

5 What to turn in

Submit screenshots of your web browser showing the Hadoop/Spark job tracker information on the cluster.

6 Done early?

If you get through this quickly and are looking for more, look at this sample k-means code written with Spark. After you are done screaming about how short it is, look through it carefully to try to understand what it's doing. You will likely find the Spark Python API useful. This RDD programming guide is also pretty helpful.