Assignment #1: Getting Started
For this assignment, you will explore a dataset to see what you can
uncover.
I have placed a copy of the dataset, titled "census-income", at
/Accounts/courses/cs377/census-income/census-income.data. A
dictionary for this dataset can be found in the file
census-income.names, in the same directory. Your goal is to learn
everything that you can about the dataset. Answer the following
questions as a starting point, but you should dig further. What more
can you discover? What is the most interesting and surprising thing
that you can dig up?
Starting questions:
- How many records are there?
- How many features are there?
- How many features are continuous, and how many are nominal?
- For the continuous features, what are the average, median, maximum, and
minimum values? What is the standard deviation?
- For the continuous features, use Octave or some other plotting
tool to make 2-dimensional scatter plots of two features at a
time. (There is another file in the same directory called
censusdata.mat, which is an Octave version of the continuous data.)
What relationships can you find?
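As a starting point for the summary-statistics question, here is a minimal sketch in Python (any language is fine; Python is simply my choice for illustration). It assumes the data file is comma-separated with one record per line, and that you have already read the data dictionary to learn which column positions hold continuous features. The function name summarize_columns and the tiny sample below are made up for the example; they are not real census records.

```python
import csv
import io
import statistics

def summarize_columns(lines, continuous_columns):
    """Return {column index: stats dict} for the given numeric columns.

    `lines` is any iterable of comma-separated text lines (an open file works).
    `continuous_columns` lists the 0-based positions of the continuous features;
    which positions those are must come from the data dictionary.
    """
    values = {col: [] for col in continuous_columns}
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:  # skip blank lines
            continue
        for col in continuous_columns:
            values[col].append(float(row[col]))

    summary = {}
    for col, column_values in values.items():
        summary[col] = {
            "count": len(column_values),
            "mean": statistics.fmean(column_values),
            "median": statistics.median(column_values),
            "min": min(column_values),
            "max": max(column_values),
            # Population standard deviation; use statistics.stdev for the sample version.
            "stdev": statistics.pstdev(column_values),
        }
    return summary

# Made-up sample: columns 0 and 2 are numeric, column 1 is nominal.
sample = io.StringIO("1, a, 1\n2, b, 3\n3, a, 4\n5, c, 6\n2, b, 9\n")
stats = summarize_columns(sample, continuous_columns=[0, 2])
for col, col_stats in stats.items():
    print(col, col_stats)
```

To run it on the real data, pass an open file object instead of the in-memory sample. Whether you report the population or the sample standard deviation is up to you, but say which one you used.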
You should submit a paper document describing what you
discover.
Important Warnings
- Do not run your code on prism. Your program will require
a lot of memory, and you'll be a "bad department citizen" if you use
prism's resources for this purpose. Use one of the other lab
machines. You can also ssh into gray, which is my
data mining workstation. Bear in mind, though, that gray has only
four processors, so if more than four of you use it at a time you will
slow each other down. Use the top command to see how heavily
gray is being used before you start running something
big.
- Be smart about how you write your code. In the past,
students have tried to do this assignment in Perl (which is fine!)
without paying attention to how Perl allocates memory, etc. If you
load your data into memory structures that grow dynamically as the
program runs, you may find that it runs terribly slowly. You should make
your program read all the data first and make sure that you can do it
efficiently before coding up the rest of the things that you want to
do.
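One way to follow this advice is to read the file in two passes: count the records first, then fill a structure whose size is fixed up front. The sketch below shows the idea in Python rather than Perl. It assumes a comma-separated file with one record per line; the function names count_records and load_column are hypothetical, and the sample text is made up. Treat it as an illustration of the approach, not as the required design.

```python
from array import array
import io

def count_records(lines):
    """First pass: count non-blank lines without storing them."""
    return sum(1 for line in lines if line.strip())

def load_column(lines, column, n_records):
    """Second pass: fill a preallocated array of doubles with one numeric column."""
    values = array("d", [0.0]) * n_records  # fixed size, allocated once
    i = 0
    for line in lines:
        if not line.strip():  # skip blank lines
            continue
        fields = line.split(",")
        values[i] = float(fields[column])  # float() ignores surrounding whitespace
        i += 1
    return values

# Made-up sample standing in for the real file: 5 records, 3 features.
sample_text = "1, a, 1\n2, b, 3\n3, a, 4\n5, c, 6\n2, b, 9\n"

n = count_records(io.StringIO(sample_text))
n_features = len(sample_text.splitlines()[0].split(","))
first_column = load_column(io.StringIO(sample_text), column=0, n_records=n)
print(n, n_features, list(first_column))
```

With a real file you would open it once per pass (or seek back to the start between passes). Time the loading step on its own before adding any analysis, so you know it is not the bottleneck.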
Sample Octave session
The following is an example of a session with Octave to make a
scatterplot.
Here is my data file, called "example.mat":
# name: example
# type: matrix
# rows: 5
# columns: 2
1 1
2 3
3 4
5 6
2 9
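If you want to plot a subset of the data that you extracted yourself, you will need to write it out in this same text format. The following is a minimal sketch in Python that reproduces the four header lines shown above; the function name to_octave_text is hypothetical, and I am assuming that these four header lines followed by space-separated rows are all that Octave needs for a plain real-valued matrix.

```python
def to_octave_text(name, rows):
    """Format a list of equal-length numeric rows as an Octave text matrix."""
    n_columns = len(rows[0]) if rows else 0
    lines = [
        f"# name: {name}",
        "# type: matrix",
        f"# rows: {len(rows)}",
        f"# columns: {n_columns}",
    ]
    for row in rows:
        lines.append(" ".join(str(value) for value in row))
    return "\n".join(lines) + "\n"

# The same five points as in example.mat above.
example_rows = [[1, 1], [2, 3], [3, 4], [5, 6], [2, 9]]
text = to_octave_text("example", example_rows)
print(text, end="")
```

Writing the returned string to a file named "example.mat" should give you a file like the one shown above.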
Here is my Octave session:
prism> octave
GNU Octave, version 2.1.35 (i386-redhat-linux-gnu).
Copyright (C) 1996, 1997, 1998, 1999, 2000, 2001 John W. Eaton.
This is free software with ABSOLUTELY NO WARRANTY.
For details, type `warranty'.
*** This is a development version of Octave. Development releases
*** are provided for people who want to help test, debug, and improve
*** Octave.
***
*** If you want a stable, well-tested version of Octave, you should be
*** using one of the stable releases (when this development release
*** was made, the latest stable version was 2.0.16).
octave:1> load -force -ascii "example.mat"
octave:2> plot(example(:,1),example(:,2),'k@')
To print your plot:
octave:3> gset term postscript
octave:4> gset output "output.ps"
octave:5> replot
This replots your plot, but dumps it to a postscript file called
"output.ps". You can then go out to a Linux prompt and type "lpr
output.ps" to dump it to the printer.
To send Octave's plots to the screen again:
octave:6> gset term X11
octave:7> replot
Alternatively, if your postscript files are coming out too large, you can
instead dump your plot to a png file:
octave:8> gset term png
octave:9> gset output "output.png"
octave:10> replot
You can then open output.png in Mozilla, and print it out from there.