CS 324: Data summarization

For this assignment, you will explore a dataset to see what you can uncover.

I have placed a copy of the dataset, titled "census-income", at /Accounts/courses/cs324/census-income/census-income.data. Note that the directory /Accounts may be hidden in Finder under OS X, but you can get there via a terminal window: just do a "cd /Accounts" and you'll be there. If you're determined to open it up in Finder, once you've navigated to the /Accounts directory, you can type

open .
(that's "open", followed by a period), which will pop that directory up in Finder.

A dictionary for this dataset can be found in the file census-income.names, in the same directory. Your goal is to learn everything that you can about the dataset. Answer the following questions as a starting point, but you should dig further. What more can you discover? What might come up by looking at subgroups of people, rather than at the entire dataset?

Starting questions:

Your pain is understood

This dataset is something of a mess. However, it is no more of a mess than any other dataset I work with, and it is better than some! Data "cleansing" is typically the first step in working with a new dataset. You'll discover a number of aspects of how the data is stored that you may find frustrating. Take it in stride -- it's all part of the experience of dealing with data that someone created for some other purpose. The documentation can be "interesting" in the same way. Feel free to chat with each other about the layout, either in person or on Piazza.

Programming environment

For this particular assignment, I will leave it entirely up to you what kind of tool you want to use to calculate your answers, so long as it is something that is installed on our departmental machines. You can do most of this work in a spreadsheet if you like, though you may find it cumbersome to deal with some of the idiosyncrasies in the data: spreadsheets don't deal with wackiness and exceptions very well.

R is a fantastic tool that will get the job done with much less coding than "general purpose" programming languages; if you want to learn some R programming, there are many great tutorials out there.

Python, Java, or other general-purpose programming languages work too, and you're welcome to use them. Students have sometimes found in the past that Python runs too slowly for some of the assignments, and they have used PyPy instead. PyPy is an alternative Python implementation with a just-in-time compiler, and it often runs faster. It is installed on the department machines, and you are welcome to use it.

Whatever you choose, though, you should use R for the page of scatterplots. It's actually quite common to use one programming language (such as Python) for data manipulation, and another (such as R) for doing analysis.
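As one illustration of that split, here is a minimal Python sketch that pulls a random sample of lines out of the big file in a single pass (reservoir sampling) and writes them to a smaller file that R could then read with read.csv. The function name sample_lines, the seed parameter, and the output file name in the usage note are my own choices for illustration, not part of the assignment.

```python
import random


def sample_lines(in_path, out_path, k, seed=None):
    """Write a uniform random sample of up to k lines from in_path to out_path.

    Uses reservoir sampling, so only k lines are ever held in memory.
    Returns the number of lines written.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(in_path) as src:
        for i, line in enumerate(src):
            line = line.rstrip("\n") + "\n"   # make sure every line ends in a newline
            if i < k:
                reservoir.append(line)        # fill the reservoir first
            else:
                j = rng.randint(0, i)         # randint is inclusive on both ends
                if j < k:
                    reservoir[j] = line       # replace with probability k / (i + 1)
    with open(out_path, "w") as dst:
        dst.writelines(reservoir)
    return len(reservoir)
```

Something like sample_lines('/tmp/census-income.data', '/tmp/census-sample.data', 1000) would produce a file small enough to load in R almost instantly. The second path is a hypothetical name.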

You should submit an electronic document describing what you discover. You can use whatever word processing / document publishing environment you like, but the results you produce should be easily readable by the grader. This does not need to be longer than a page, but should be appropriately formatted and edited as if it were a writing assignment (which it is).

Important Warnings

DO NOT COPY THIS DATA TO YOUR DEPARTMENT HOME DIRECTORY.

All of your home directories live on a single department server. That server is somewhat short on space, and is also short on network bandwidth. This dataset is 100 megabytes. If all of you copy it to your home directory, we may blow out the server. If you're using the department computers to do this assignment, you should copy the file to the directory /tmp and work from there. The directory /tmp may be deleted regularly (you can't count on the file staying there), but you can always copy a new version back there whenever you get onto a lab machine. If you wish to copy this data for use at home, you can install a secure copy program onto your machine and connect via skittles, one of the department servers.
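If you would rather have your program handle the copy to /tmp for you, a sketch along the following lines could run at startup. The helper name ensure_local_copy is hypothetical, and the size comparison is just a cheap guard I have added against a partial copy. It only copies when /tmp does not already hold a complete copy, so it will not hammer the network each time you run it.

```python
import os
import shutil

# Path of the shared dataset, as given in this assignment.
SOURCE = "/Accounts/courses/cs324/census-income/census-income.data"


def ensure_local_copy(source=SOURCE, scratch_dir="/tmp"):
    """Copy source into scratch_dir unless a same-sized copy is already there.

    Returns the path of the local copy.
    """
    target = os.path.join(scratch_dir, os.path.basename(source))
    missing = not os.path.exists(target)
    # Re-copy if the file is absent or its size differs (e.g. an interrupted copy).
    if missing or os.path.getsize(target) != os.path.getsize(source):
        shutil.copy(source, target)
    return target
```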

Do not run your code on skittles, prism, or any other department servers. Your program will require a lot of memory, and you'll be a "bad department citizen" if you use a department server's resources for this purpose. Use one of the lab machines or your own computer.

Be smart about how you write your code. In the past, some students have tried to do this assignment without paying attention to how memory is allocated, how lists are implemented, etc. If you load your data into memory structures that grow dynamically as the program runs, you may find that it runs terribly slowly. You should be particularly careful about this if you are using a language that makes list processing "brain-dead" easy: make sure you understand how those lists are working under the hood. Make your program read all the data first, and make sure that it does so efficiently, before coding up the rest of the things that you want to do.
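One way to sidestep the growing-list problem, sketched here in Python, is to stream through the file once and keep only small running tallies rather than storing every row. The function names are hypothetical, column indices are 0-based, and the skipinitialspace setting assumes that fields may be separated by a comma followed by a space. Check the file and the dictionary yourself before relying on any of this.

```python
import csv
from collections import Counter


def count_values(path, column):
    """Tally how often each value appears in one column (0-based index)."""
    counts = Counter()
    with open(path, newline="") as handle:
        # csv.reader yields one row at a time, so the file is never fully in memory.
        for row in csv.reader(handle, skipinitialspace=True):
            if len(row) > column:              # skip blank or short lines
                counts[row[column].strip()] += 1
    return counts


def running_mean(path, column):
    """Mean of a numeric column, computed without storing the values."""
    total = 0.0
    n = 0
    with open(path, newline="") as handle:
        for row in csv.reader(handle, skipinitialspace=True):
            try:
                total += float(row[column])
            except (IndexError, ValueError):
                continue                       # ignore missing or non-numeric entries
            n += 1
    return total / n if n else float("nan")
```

Memory use here depends on the number of distinct values in a column, not on the number of rows. Timing a call such as count_values('/tmp/census-income.data', 0) is a quick way to confirm that your read loop is fast before you build anything on top of it.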

Sample R code

Here is some R code I wrote to generate some scatterplots. R is installed on our department systems, and you are welcome to install it on your own machine if you wish. If you use this code or some variation thereof, you should turn in with your assignment an annotated version of this code where you add a comment for each line describing what it does. (That's why I've intentionally left comments out!)

census.data <- read.csv('/tmp/census-income.data',header=FALSE)
contcols <- c(1,31,40)
rows.to.plot <- census.data[sample(1:nrow(census.data),1000),]

quartz()
pairs(rows.to.plot[,contcols])

png("plotfile.png")
pairs(rows.to.plot[,contcols])
dev.off()

Grading

The important part of this assignment is answering the bulleted questions above correctly in your writeup, along with turning in your code / spreadsheet / whatever that shows how you did it. You also need to indicate at least one sort of relationship you find in the data that isn't immediately obvious from the above questions. The readability and efficiency of the code or spreadsheet that you submit matters as well.