CS 204: Software Design
Analyzing on-line dictionary logs
Hand in by 9:50 AM Monday, Sep 27
The Ultralingua on-line dictionary
is free to use for about half a dozen word look-ups per day. Its underlying services
(conjugation of verbs, number translation, and several kinds of word look-up) also
support the on-line services of several other companies. The current incarnation of
this collection of tools has been in continuous service for about two years, and has
generated about five and a half million log entries in that time. That's about 5 log
entries per minute--hardly Google, but still quite a bit of data.
Each log entry looks something like this:
('2010-09-15 14:24:50', 'Ulod', 'Ultralingua', 'Onlinedictionary', '1453080969', 'english', 'french', 'define', None, 'crazy')
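The sample entry above happens to be a valid Python tuple literal. Assuming every line of the log file has that shape (worth checking on a small subset first), one possible way to turn a line into a tuple is ast.literal_eval, which evaluates literals only and never runs code. The function name parse_entry is just an illustration, not part of the assignment.

```python
import ast


def parse_entry(line):
    """Parse one log line into a tuple of fields, or return None if malformed.

    Assumes each line is a Python tuple literal like the sample entry.
    """
    try:
        entry = ast.literal_eval(line.strip())
    except (ValueError, SyntaxError):
        return None
    # Anything that parses but is not a tuple is treated as malformed too.
    return entry if isinstance(entry, tuple) else None


sample = ("('2010-09-15 14:24:50', 'Ulod', 'Ultralingua', 'Onlinedictionary', "
          "'1453080969', 'english', 'french', 'define', None, 'crazy')")
fields = parse_entry(sample)
print(fields[0], fields[-1])
```

Returning None for malformed lines lets the caller decide whether to skip or count them.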
In order, the fields of the log entry are:
- Date: the date and time of the request
- Tool: the tool that made the request (in this case, the "Ultra Lingua On-line
Dictionary", or "ULOD")
- Client: the company or organization whose tool made the request
- Referrer: the URL of the web page from which the request was launched. As you
can see from this single example, sometimes the Referrer is not a URL, but a
symbol the logging function substituted for one of the most common referrers (in
this case, Ultralingua's own "Onlinedictionary" page).
- IP address: this is the IP address from which the request was made, and will
generally be an IP address with which the person who made the request is associated.
Note that "1453080969" is the address written as a single 32-bit integer, from
which the more traditional w.x.y.z IP address can be extracted. (How can you
convert from one to the other?)
- Source language: when the requested service is "define" (that is, looking up a
word in the dictionary), this is the language of the word the user is entering
(or at least the software assumes it is). If the service is "conjugate" or some
other one-language service, then this is the language of the requested word.
- Destination language: for "define" requests, this is the language into which the
user wants the requested word to be translated.
- Type: this is the type of service being requested: "define", "conjugate", etc.
- URL: if the Type field is "enable page" or something to that effect, this is
the URL of the page the user wants "enabled". I'll describe page-enabling
in class.
- Word: this is the word the user wants defined or conjugated or whatever.
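One possible answer to the conversion question in the IP address bullet above, as a minimal sketch. It assumes the integer packs the four octets most-significant first (network byte order); the log format does not say so, so treat that as something to verify. The function names are illustrative only.

```python
def int_to_dotted(ip_field):
    """Convert the integer IP field (string or int) to w.x.y.z notation.

    Assumes the most significant byte is w (network byte order).
    """
    value = int(ip_field)
    # Shift each octet down to the low 8 bits and mask off the rest.
    octets = [(value >> shift) & 0xFF for shift in (24, 16, 8, 0)]
    return ".".join(str(octet) for octet in octets)


def dotted_to_int(dotted):
    """Convert w.x.y.z notation back to a single 32-bit integer."""
    w, x, y, z = (int(part) for part in dotted.split("."))
    return (w << 24) | (x << 16) | (y << 8) | z


print(int_to_dotted("1453080969"))
```

The standard library's ipaddress.ip_address(int(field)) performs the same integer-to-address conversion under the same byte-order assumption.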
Here is the complete log file. Note that it is
nearly 90MB zipped, and over 850MB unzipped. So make sure you have enough space on
your computer to handle it. Also, I have the system set up so you can only grab
this data file if your IP address is one of Carleton's.
The goal
For this project, you will write a command-line program that filters the log
file in various ways to produce useful reports on aspects of the data. The
command-line syntax of your program will be:
python loganalyzer.py [options] [logfile]
Your program will print all its output to standard output (via print or
sys.stdout.write). If the logfile command-line argument is present, then your
program should take input from the specified logfile. Otherwise, your program
should take input from sys.stdin.
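The choice between a named logfile and standard input could be sketched with argparse as follows. This shows only the optional logfile argument; the options themselves are left out, and the helper name open_input and the UTF-8 encoding are assumptions rather than requirements.

```python
import argparse
import sys


def open_input(argv):
    """Return a readable text stream: the named logfile, or sys.stdin."""
    parser = argparse.ArgumentParser(prog="loganalyzer.py")
    # nargs="?" makes the positional argument optional.
    parser.add_argument("logfile", nargs="?",
                        help="log file to read (default: standard input)")
    args = parser.parse_args(argv)
    if args.logfile:
        # Encoding is a guess; errors="replace" avoids crashing on odd bytes.
        return open(args.logfile, encoding="utf-8", errors="replace")
    return sys.stdin
```

A main program might call open_input(sys.argv[1:]) and then loop over the returned stream line by line, so the 850MB file is never loaded into memory all at once.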
The required command-line options are:
A good job on these required elements will be worth a B for this assignment.
To move into the A range, you will need to implement at least one non-trivial
additional feature. Some possibilities include:
- --country=ISOCODE -- this will restrict reports to log entries whose
IP addresses come from the specified country. (Search for "ISO country codes"
to get a list of 2-letter or 3-letter country codes.)
- --report=??? -- find instances of attempts to extract all the data from
the on-line dictionary
- --report=??? -- find instances of attempts at injection attacks on the system
- --report=??? -- report patterns in the number of searches per day performed
by individual users
- Something to reflect the average happiness of users, as we discussed in class.
- Other ideas
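As a starting point for the searches-per-day idea above, here is a rough sketch that counts entries per (IP address, day) pair. It assumes entries are already parsed into tuples in the field order described earlier, and it treats one IP address as one user, which is only an approximation. The function name is hypothetical.

```python
from collections import Counter


def searches_per_user_per_day(entries):
    """Count log entries for each (ip, 'YYYY-MM-DD') pair."""
    counts = Counter()
    for entry in entries:
        timestamp, ip = entry[0], entry[4]
        day = timestamp.split(" ")[0]  # keep the date, drop the time
        counts[(ip, day)] += 1
    return counts


entries = [
    ("2010-09-15 14:24:50", "Ulod", "Ultralingua", "Onlinedictionary",
     "1453080969", "english", "french", "define", None, "crazy"),
    ("2010-09-15 14:25:10", "Ulod", "Ultralingua", "Onlinedictionary",
     "1453080969", "english", "french", "define", None, "happy"),
]
print(searches_per_user_per_day(entries))
```

The second entry is made up for illustration. From counts like these, a report could look at the distribution of daily totals per address.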
What to hand in
Hand in via the Courses folder (Courses/f10/cs/cs204-00-f10/Student Work/youraccount/hand-in/)
a folder called "loganalyzer". In this folder, include:
- Your source code. It can be in multiple Python files if you wish, but the
main program should be in a file called loganalyzer.py.
- A file called readme.txt that includes a brief description of the status
of your program: what works, what doesn't, whether you added any features, etc.
Also describe any additional syntax required to use your program.
- Any other files (not including the giant log file itself) your
program needs.
Things to keep in mind
Use subsets of the data to test your program. It will be too slow
to do every little test on the full data set.
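One simple way to work with a subset, sketched with itertools.islice: take the first few lines, or every Nth line, without reading the whole file. The helper name sample_lines and its parameters are illustrative assumptions.

```python
from itertools import islice


def sample_lines(lines, limit, step=1):
    """Yield at most `limit` items from `lines`, taking every `step`-th one."""
    return islice(lines, 0, limit * step, step)


# Example with a stand-in for a file object: any iterable of lines works.
fake_log = ("entry %d\n" % i for i in range(1000))
for line in sample_lines(fake_log, 3, step=100):
    print(line, end="")
```

Passing an open file as lines works the same way, and the result can be written to a small test file. Taking every Nth line spreads the sample across the two years of data rather than only the earliest entries.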
Don't keep lots of copies of the 850MB file lying around.
Please don't share this data widely, and please delete it when you're all done.