Machine Learning and Data Mining: Getting Started

Assignment #1: Getting Started

For this assignment, you will explore a dataset to see what you can uncover.

I have placed a copy of a dataset titled "census-income" at /Accounts/courses/cs377/census-income/census-income.data. A data dictionary for this dataset is in the file census-income.names, in the same directory. Your goal is to learn everything that you can about the dataset. Answer the following questions as a starting point, but you should dig further. What more can you discover? What is the most interesting and surprising thing that you can dig up?

Starting questions:

How many records are there?
How many features are there?
How many features are continuous, and how many are nominal?
For the continuous features, what are the average, median, maximum, and minimum values? What is the standard deviation?
For the continuous features, use Octave or some other plotting tool to make 2-dimensional scatter plots of two features at a time. (There is another file in the same directory called censusdata.mat, which is an Octave version of the continuous data.) What relationships can you find?
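As a rough illustration of the summary statistics asked for above, here is a minimal Octave sketch on a tiny made-up matrix. The matrix X and the variable names are assumptions chosen for illustration; they are not the census data, and the name of the variable that censusdata.mat defines when loaded is not assumed here.

```octave
% Toy matrix: one record per row, one continuous feature per column.
% (Made-up values for illustration only, not the census data.)
X = [1 1; 2 3; 3 4; 5 6; 2 9];

[nrecords, nfeatures] = size(X);

% Each function below works column by column on a multi-row matrix.
col_mean   = mean(X);
col_median = median(X);
col_max    = max(X);
col_min    = min(X);
col_std    = std(X);    % sample standard deviation (divides by n - 1)

printf("%d records, %d features\n", nrecords, nfeatures);
for j = 1:nfeatures
  printf("feature %d: mean %.2f, median %.2f, max %g, min %g, std %.2f\n", ...
         j, col_mean(j), col_median(j), col_max(j), col_min(j), col_std(j));
end
```

The same calls should apply unchanged to a larger matrix of continuous features once it has been loaded, since each one summarizes every column at once.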

You should submit a paper document describing what you discover.

Important Warnings

Do not run your code on prism. Your program will require a lot of memory, and you'll be a "bad department citizen" if you use prism's resources for this purpose. Use one of the other lab machines. You can also ssh into gray, which is my data mining workstation. Bear in mind, though, that gray has only four processors, so if more than four of you use it at a time you will slow each other down. Use the top command to see how heavily gray is being used before you start running something big.
Be smart about how you write your code. In the past, students have tried to do this assignment in Perl (which is fine!) without paying attention to how Perl allocates memory, etc. If you load your data into memory structures that grow dynamically as the program runs, you may find that it runs terribly slowly. You should make your program read all the data first and make sure that you can do it efficiently before coding up the rest of the things that you want to do.
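One way to avoid dynamically growing structures is to count the records first and then allocate storage once. The following Octave sketch shows the idea under loudly stated assumptions: it writes its own small temporary file with purely numeric, comma-separated fields, which is a hypothetical stand-in and not the format of the real data file (which also contains nominal features). The variable names are illustrative.

```octave
% Write a tiny comma-separated file so the sketch is self-contained.
% (Hypothetical stand-in for the real data; every field is numeric here.)
fname = [tempname() ".csv"];
fid = fopen(fname, "w");
fprintf(fid, "1,1\n2,3\n3,4\n5,6\n2,9\n");
fclose(fid);

% Pass 1: count the records without storing them.
fid = fopen(fname, "r");
n = 0;
txt = fgetl(fid);
while ischar(txt)          % fgetl returns -1 (not a string) at end of file
  n = n + 1;
  txt = fgetl(fid);
end
fclose(fid);

% Pass 2: allocate the whole matrix once, then fill it in place.
ncols = 2;                 % assumed known in advance for this toy file
data = zeros(n, ncols);
fid = fopen(fname, "r");
for i = 1:n
  txt = fgetl(fid);
  data(i, :) = str2double(strsplit(txt, ","));
end
fclose(fid);
delete(fname);

printf("read %d records with %d fields each\n", rows(data), columns(data));
```

The alternative of appending one row at a time (for example, data = [data; newrow]) can force the whole matrix to be copied on every append, which is the kind of slowdown the warning above describes. Note that strsplit is only available in reasonably recent Octave releases.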

Sample Octave session

The following is an example of a session with Octave to make a scatterplot.

Here is my data file, called "example.mat":

# name: example
# type: matrix
# rows: 5
# columns: 2
1 1
2 3
3 4
5 6
2 9

Here is my Octave session:

prism> octave
GNU Octave, version 2.1.35 (i386-redhat-linux-gnu).
Copyright (C) 1996, 1997, 1998, 1999, 2000, 2001 John W. Eaton.
This is free software with ABSOLUTELY NO WARRANTY.
For details, type `warranty'.

*** This is a development version of Octave. Development releases
*** are provided for people who want to help test, debug, and improve
*** Octave.
***
*** If you want a stable, well-tested version of Octave, you should be
*** using one of the stable releases (when this development release
*** was made, the latest stable version was 2.0.16).

octave:1> load -force -ascii "example.mat"
octave:2> plot(example(:,1),example(:,2),'k@')

To print your plot:

octave:3> gset term postscript
octave:4> gset output "output.ps"
octave:5> replot

This replots your plot, but dumps it to a postscript file called "output.ps". You can then go out to a Linux prompt and type "lpr output.ps" to dump it to the printer.

To send Octave's plots back to the screen again:

octave:6> gset term X11
octave:7> replot

Alternatively, if your postscript files are coming out too large, you can instead dump your plot to a png file:

octave:8> gset term png
octave:9> gset output "output.png"
octave:10> replot

You can then open output.png in Mozilla, and print it out from there.