CS127 Assignment: Searches
Due Monday, 2/19/01.
Submit your analysis and pictures on paper, and your
code via HSP.
Background: Search Structures
If you have an array or linked list or file full of items
of data, and you want to search for a particular item
in that array or linked list or file, you can use linear
search. Linear search places no organizational requirements
on the data beyond that they be stored in some sort of
sequentially accessible fashion. As a result, it is very
easy to add new items to the database. Furthermore, the
code for performing linear search is simple and clear.
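To make the simplicity concrete, here is a minimal linear search sketch (the function name is mine, not part of the assignment):

```cpp
#include <string>
#include <vector>

// Scan the items one by one; the worst case examines all N of them.
// Returns the index of target in v, or -1 if it is absent.
int linearSearch(const std::vector<std::string>& v, const std::string& target) {
    for (int i = 0; i < (int)v.size(); i++) {
        if (v[i] == target) {
            return i;
        }
    }
    return -1;
}
```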
Unfortunately, searching for an item via linear search can require
examining every item in the database. Binary search improves
dramatically on linear search by requiring an examination of no
more than about log2(N)+1 items during a search. But binary search
of an array requires that the array remain sorted, which makes
both addition of new items and deletion of old items slow.
And binary search gives no speedup on a sorted linked list, since just
reaching the middle of the list requires a linear traversal.
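For reference, a minimal binary search on a sorted vector looks like this (names are mine):

```cpp
#include <string>
#include <vector>

// Return the index of target in the sorted vector v, or -1 if absent.
// Each iteration halves the remaining range, so at most about
// log2(N)+1 items are examined.
int binarySearch(const std::vector<std::string>& v, const std::string& target) {
    int low = 0;
    int high = (int)v.size() - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;
        if (v[mid] == target) {
            return mid;
        } else if (v[mid] < target) {
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }
    return -1;
}
```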
That's why binary search trees were devised--to retain the search-time
benefits of binary search, while facilitating fast addition and
deletion of items. Once again, there's a downside--if you're unlucky,
your binary search tree may become badly unbalanced, leaving you with
a data structure whose searching/adding/deleting performance is
no better than that of an unsorted array using linear search.
So balanced trees (AVL trees, red-black trees, 2-3 trees,
B-trees,...) were devised. All of these data structures have associated
algorithms that guarantee O(log N) searching, addition, and deletion.
When N gets big enough and the searches, additions, and deletions frequent
enough, however, even a balanced binary search tree's performance can dominate
a computation. For example, while compiling a large project,
a compiler maintains a "symbol table" of thousands of variable and function
names. If you use a balanced tree for a compiler's symbol table, the run time
of the compiler will be dominated by the time taken to repeatedly
search the symbol table.
This led people to try to beat O(log N) searching, which led to hash
tables. (Actually, I'm not sure of the historical order of these search
structures--I am guessing that hash tables predated many of the balanced
search trees, but I don't know.)
The Goal
For this assignment, you will collect empirical data
comparing the search times of
a balanced search tree and hash tables of various sizes and
collision-resolution techniques.
The Task
- Get a large text file, where large is at least several
hundred kilobytes, and preferably over a megabyte.
Project Gutenberg (www.gutenberg.net) is a wonderful source
of such files. Eliot's "Middlemarch" (1766KB), Dante's
Divine Comedy (668KB), or the Bible (4240KB) would be suitable.
Want to try a non-English text? Try Proust's
"A la recherche du temps perdu" at about 400KB for each of
its three volumes (what is that, about 10KB per minute of
remembrance of things past?).
- Compile and run
mapwordcounter.cpp
with your big file as the input file. (Just run the program
without arguments to get a usage statement.)
This will give you an idea of how long your file will take
to process, and will also give you timing data for a balanced
search tree, since map is implemented using a balanced
tree.
- Implement a hash table with linear probing and use it
to collect word counts.
- Implement a hash table with quadratic probing and use it
to collect word counts.
- Now that you have three word-counting programs, collect
running times for:
- The balanced tree program that uses map.
- Your hashing program using quadratic probing.
Time this for several sizes of hash table and two or three
different hash functions. Note that for quadratic probing,
you should use prime table sizes.
Also, you should try to include table sizes that are
big enough to hold the data, but too small to support
fast probing--that is, sizes that leave the table nearly
full, so you can see what happens at high load factors.
- Your hashing program using linear probing.
Time this for several table sizes and hash functions.
- Display your timing data graphically, and write a report
on your observations. In your report you should describe what you did,
describe the text file you used (i.e. the identity of the
file, where you got it, the size of the file, and the number
of distinct words in the file), summarize your data,
compare your results to your expectations based on the book's
mathematical analysis, and make any further observations you
consider interesting and relevant.
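For orientation: I can't show mapwordcounter.cpp here, but the heart of a map-based word counter looks roughly like this (the function name and details are my own sketch, not required structure):

```cpp
#include <istream>
#include <map>
#include <string>

// Count word frequencies using std::map, which is typically implemented
// as a red-black tree, so each lookup/insertion costs O(log N).
std::map<std::string, int> countWords(std::istream& in) {
    std::map<std::string, int> counts;
    std::string word;
    while (in >> word) {
        counts[word]++;  // operator[] inserts with count 0, then we increment
    }
    return counts;
}
```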
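One possible shape for the probing hash tables, with the probing step isolated in a single small function so that switching between linear and quadratic probing changes only that function, might be the following (all names are mine; this is a sketch under my own assumptions, not a required design):

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Open-addressing hash table mapping words to counts.
// Pass quadratic=true for probe offsets 1, 4, 9, ...; false for 1, 2, 3, ...
// Quadratic probing behaves best with prime table sizes.
class HashCounter {
public:
    HashCounter(int tableSize, bool quadratic)
        : keys_(tableSize), counts_(tableSize, 0),
          used_(tableSize, false), quadratic_(quadratic) {}

    void add(const std::string& word) {
        int home = hash(word);
        for (int i = 0; i < (int)keys_.size(); i++) {
            int pos = (home + offset(i)) % (int)keys_.size();
            if (!used_[pos]) {            // empty slot: insert here
                used_[pos] = true;
                keys_[pos] = word;
                counts_[pos] = 1;
                return;
            }
            if (keys_[pos] == word) {     // already present: bump the count
                counts_[pos]++;
                return;
            }
        }
        throw std::runtime_error("hash table full");
    }

    int count(const std::string& word) const {
        int home = hash(word);
        for (int i = 0; i < (int)keys_.size(); i++) {
            int pos = (home + offset(i)) % (int)keys_.size();
            if (!used_[pos]) return 0;    // empty slot: word is absent
            if (keys_[pos] == word) return counts_[pos];
        }
        return 0;
    }

private:
    // One simple polynomial hash; trying two or three different hash
    // functions is part of the assignment.
    int hash(const std::string& s) const {
        unsigned long h = 0;
        for (char c : s) h = 37 * h + (unsigned char)c;
        return (int)(h % keys_.size());
    }

    // The entire difference between linear and quadratic probing lives here.
    int offset(int i) const { return quadratic_ ? i * i : i; }

    std::vector<std::string> keys_;
    std::vector<int> counts_;
    std::vector<bool> used_;
    bool quadratic_;
};
```

Keeping hash() and offset() separate is exactly the modularity the Advice section asks for: you can swap in a new hash function or probing strategy without touching the insertion and lookup loops.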
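For collecting the running times, one simple approach is wall-clock timing with the standard chrono library; this helper is a sketch, not part of the assignment:

```cpp
#include <chrono>

// Run the given piece of work and return the elapsed wall-clock time
// in seconds.  steady_clock is monotonic, so it is safe for intervals.
template <typename F>
double secondsTaken(F work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

You would wrap the word-counting loop of each program in a call like secondsTaken([&]{ ... }) and record the result for each table size and hash function.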
What to hand in
Hand in your written discussion, timing data, and graphs on paper.
Hand in your hash table programs via HSP.
Advice
- Work with somebody.
- Before you approach the computer, take an hour to make a
step-by-step list of small tasks you will perform, the order
in which you will perform them, and how you plan to test the
results of those tasks. Try to make each task concrete, manageable,
and testable (e.g. "type in Weiss's code from page N, and test it
using the following tiny main program").
- Write your code modularly, with separate functions for the
hash function and the probing strategy, so you can easily modify
just one or the other without modifying the rest of your program.
- As you develop your programs, test them on small text files
until they are working. Then apply them to the big file.
- Having trouble? Talk to me or to Jeremy Carr.
Have fun, start early, and keep in touch.
Jeff Ondich,
Department of Mathematics and Computer Science,
Carleton College, Northfield, MN
55057,
(507) 646-4364,
jondich@carleton.edu