CS127 Assignment: Searches

Due Monday, 2/19/01. Submit your analysis and pictures on paper, and your code via HSP.

Background: Search Structures

If you have an array or linked list or file full of items of data, and you want to search for a particular item in that array or linked list or file, you can use linear search. Linear search places no organizational requirements on the data beyond that they be stored in some sort of sequentially accessible fashion. As a result, it is very easy to add new items to the database. Furthermore, the code for performing linear search is simple and clear.

Unfortunately, searching for an item via linear search can require examining every item in the database. Binary search improves dramatically on linear search by examining no more than about log2(N) + 1 items during a search. But binary search of an array requires that the array remain sorted, which makes both addition of new items and deletion of old items slow. And binary search can't be done efficiently on a sorted linked list, since it depends on jumping straight to the middle of the remaining search range.
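As a quick sketch (the function name and style here are mine, not from the textbook), binary search over a sorted array looks like this. Each iteration halves the remaining range, which is where the log2(N) + 1 bound comes from:

```cpp
#include <vector>

// Binary search over a sorted vector. Returns the index of key,
// or -1 if key is absent. Each pass through the loop halves the
// search range, so at most about log2(N) + 1 items are examined.
int binarySearch(const std::vector<int>& a, int key) {
    int lo = 0;
    int hi = (int)a.size() - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;   // middle of the current range
        if (a[mid] == key)
            return mid;
        else if (a[mid] < key)
            lo = mid + 1;               // discard the left half
        else
            hi = mid - 1;               // discard the right half
    }
    return -1;
}
```

Note the `a[mid]` step: it needs constant-time access to the middle of the range, which arrays provide and linked lists don't.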

That's why binary search trees were devised--to retain the search-time benefits of binary search, while facilitating fast addition and deletion of items. Once again, there's a downside--if you're unlucky, your binary search tree may become badly unbalanced, leaving you with a data structure whose searching/adding/deleting performance is no better than that of an unsorted array using linear search. So balanced trees (AVL trees, red-black trees, 2-3 trees, B-trees,...) were devised. All of these data structures have associated algorithms that guarantee O(log N) searching, addition, and deletion.

When N gets big enough and the searches, additions, and deletions frequent enough, however, even a balanced binary search tree's performance can dominate a computation. For example, while compiling a large project, a compiler maintains a "symbol table" of thousands of variable and function names. If you use a balanced tree for a compiler's symbol table, the run time of the compiler will be dominated by the time taken to repeatedly search the symbol table. This led people to try to beat O(log N) searching, which led to hash tables. (Actually, I'm not sure of the historical order of these search structures--I am guessing that hash tables predated many of the balanced search trees, but I don't know.)
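The idea that lets hash tables beat O(log N) is to compute a table slot directly from the item itself, rather than walking a tree to find it. Here is one simple string hash (a polynomial hash with multiplier 31 -- one common choice, not necessarily the one your book uses):

```cpp
#include <string>

// Map a word to a slot in a table of the given size. A good hash
// spreads words roughly uniformly over the slots, so that a lookup
// usually lands at (or near) the right slot in O(1) time.
unsigned int hashWord(const std::string& word, unsigned int tableSize) {
    unsigned int h = 0;
    for (char c : word)
        h = h * 31 + (unsigned char)c;   // mix each character in
    return h % tableSize;
}
```

Both of your hash tables below can share a function like this; only the collision-resolution strategy differs.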

The Goal

For this assignment, you will collect empirical data comparing the search times of a balanced search tree and hash tables of various sizes and collision-resolution techniques.

The Task

  1. Get a large text file, where large is at least several hundred kilobytes, and preferably over a megabyte. Project Gutenberg (www.gutenberg.net) is a wonderful source of such files. Eliot's "Middlemarch" (1766KB), Dante's Divine Comedy (668KB), or the Bible (4240KB) would be suitable. Want to try a non-English text? Try Proust's "A la recherche du temps perdu" at about 400KB for each of its three volumes (what is that, about 10KB per minute of remembrance of things past?).

  2. Compile and run mapwordcounter.cpp with your big file as the input file. (Just run the program without arguments to get a usage statement.) This will give you an idea of how long your file will take to process, and will also give you timing data for a balanced search tree, since the standard library's map is typically implemented as a balanced tree (a red-black tree, in most implementations).
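The heart of a map-based word counter (a sketch of the idea, not mapwordcounter.cpp itself -- the function name is mine) is just a few lines:

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Count word frequencies from a stream using std::map. Each
// ++counts[word] searches the underlying balanced tree for word,
// inserting it with count 0 first if it isn't there -- an O(log N)
// operation per word, where N is the number of distinct words so far.
std::map<std::string, int> countWords(std::istream& in) {
    std::map<std::string, int> counts;
    std::string word;
    while (in >> word)
        ++counts[word];
    return counts;
}
```

Timing a loop like this over your big file is exactly what gives you the balanced-tree baseline for the comparison.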

  3. Implement a hash table with linear probing and use it to collect word counts.
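One minimal shape for the linear-probing table, specialized to word counts (all names are mine, and this sketch omits resizing and deletion -- your real table will need to handle a nearly full table, at minimum):

```cpp
#include <string>
#include <vector>

// Hash table with linear probing, storing (word, count) pairs.
// On a collision, linear probing tries slots h, h+1, h+2, ...
// (wrapping around) until it finds the word or an empty slot.
class LinearProbeTable {
public:
    explicit LinearProbeTable(int size) : slots(size) {}

    // Add one occurrence of word, inserting it if it's new.
    void increment(const std::string& word) {
        int i = hash(word);
        while (!slots[i].word.empty() && slots[i].word != word)
            i = (i + 1) % (int)slots.size();   // linear probe step
        slots[i].word = word;
        ++slots[i].count;
    }

    // Return the count for word, or 0 if it was never inserted.
    int count(const std::string& word) const {
        int i = hash(word);
        while (!slots[i].word.empty()) {
            if (slots[i].word == word)
                return slots[i].count;
            i = (i + 1) % (int)slots.size();
        }
        return 0;   // hit an empty slot: word is absent
    }

private:
    struct Slot { std::string word; int count = 0; };
    std::vector<Slot> slots;

    int hash(const std::string& word) const {
        unsigned int h = 0;
        for (char c : word)
            h = h * 31 + (unsigned char)c;
        return (int)(h % slots.size());
    }
};
```

Note that the probe sequence must stop at the first empty slot during a search; that invariant is also why supporting deletion takes extra care.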

  4. Implement a hash table with quadratic probing and use it to collect word counts.
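The quadratic-probing table can mirror the linear one; only the probe sequence changes. Instead of stepping one slot at a time, the i-th probe jumps i*i slots from the home position (this sketch shows just that index computation, with a function name of my own choosing):

```cpp
// Quadratic probing visits slots h, h+1, h+4, h+9, ... -- that is,
// (h + i*i) mod tableSize on the i-th probe. Spreading probes out
// this way reduces the "primary clustering" that linear probing
// suffers from; a prime table size helps the sequence reach enough
// distinct slots before repeating.
int quadraticProbe(int home, int i, int tableSize) {
    return (home + i * i) % tableSize;
}
```

In your increment/search loops, replace the `(i + 1) % size` step with a probe counter passed through a function like this.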

  5. Now that you have three word-counting programs, collect running times for all three on your text file. For the two hash tables, collect times for several different table sizes, so you can see how performance changes as the table fills up.

  6. Display your timing data graphically, and write a report on your observations. In your report you should describe what you did, describe the text file you used (i.e. the identity of the file, where you got it, the size of the file, and the number of distinct words in the file), summarize your data, compare your results to your expectations based on the book's mathematical analysis, and make any further observations you consider interesting and relevant.
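One way to collect the running times (a sketch; the function name is mine): read std::clock() before and after the word-counting loop, and convert the difference to seconds. clock() measures CPU time, which is usually what you want when comparing data structures rather than disks.

```cpp
#include <ctime>

// Convert a pair of std::clock() readings to elapsed CPU seconds.
// Typical use: call std::clock() just before and just after the
// word-counting loop, then pass both readings here.
double elapsedSeconds(std::clock_t start, std::clock_t stop) {
    return double(stop - start) / CLOCKS_PER_SEC;
}
```

Time only the counting loop itself, not the program startup or the final printing, so the three programs are compared on the same work.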

What to hand in

Hand in your written discussion, timing data, and graphs on paper. Hand in your hash table programs via HSP.

Advice

Have fun, start early, and keep in touch.



Jeff Ondich, Department of Mathematics and Computer Science, Carleton College, Northfield, MN 55057, (507) 646-4364, jondich@carleton.edu