CS 127, Data Structures

Hashing and Searching Benchmarking

Assigned on Wednesday, 11/6.
Program due electronically on Wednesday, 11/13, at 4 PM.
Summary document due on paper on Wednesday, 11/13, at 4 PM, in Justin's mailbox.

Overview

For this assignment, you will run an experiment to compare the effectiveness of hashing, vs. the effectiveness of sorting and binary searching. Does the theoretical analysis hold up? We'll see!

Details

1. Create a file called file1 containing n random integers, one on each line.

2. Create a second file, file2, of random integers that contains n/2 integers contained in file1 and n/2 new random integers.

3. The idea for this assignment is to search file1 for each of the integers in file2. Because of the way you created the files, half of the key integers in file2 will be found and half will not. You will implement this search in three ways:

merge sorting and binary searching
hashing in an array
hashing via chaining.

The goal is to see if there is a time difference between the methods.

4. You will need to time the two algorithms. Look at timing.cpp for some sample code on how to do this. Additionally, you should keep track of and display the following information.

For sorting and binary search:

Number of comparisons of keys to sort data
Average and maximum number of comparisons of keys for a successful searches
Average and maximum number of comparisons of keys for unsuccessful searches

For hashing in an array:

Total number of times times hash function is called while creating hash table
Total number of comparisons of keys while creating hash table
Total number of times that the hashing scheme has to move to another slot while creating hash table
Average and maximum number of comparisons of keys in a successful search
Average and maximum number of comparisons of keys in an unsuccessful search

For hashing via chaining:

Total number of times hash function is called while creating hash table
Total number of comparisons of keys while creating hash table
Average and maximum length of chains in complete hash table
Average and maximum number of comparisons of keys in a successful search
Average and maximum number of comparisons of keys in an unsuccessful search

Make sure that your experiments take long enough that your experimental results are meaningful. Comparing two programs that run in 0.05 seconds and 0.07 seconds and concluding that the first one is faster is not good experimental technique - anything could have contributed to the difference! This DOES NOT MEAN that you should make n as large as possible. A moderately sized n with repeated trials (on the same and different datasets) can be revealing.

Prism (our department server) will keel over and die if all of you run large experiments on it at once. Therefore, heed the following advice.

Keep n very small when you are writing and debugging your code. This is good practice anyway, as it will save you time in testing your program.
Only run experiments on prism with n <= 1000.
Use one of the machines in the labs or your own machine for running larger experiments.

If you overload prism while running these experiments, I have authorized Mike Tie (the system administrator) to do whatever draconian measures are necessary. This could mean automatically logging you off, etc... so please be a good citizen!

Submissions

Hand in electronically your search code (in a program called benchmark.cpp) and a version of file1 and file2 that each contain 100 integers. The version of benchmark.cpp that you turn in should work with these two files, so that we can test your code on a small dataset. Also turn in a paper document (due in the grader's mailbox at 5 PM) reporting on the results you got for various n. This report should describe the details of the hashing schemes you used, the times, and an analysis of your counts and times. This document will be graded as a written work, and you should therefore pay attention to matters of spelling, grammar, style, and clarity.

Optional Bonus:

Do you think that you might have the fastest lookup code around? Can you improve on the methods you tried above? If so, you can optionally submit a second program, called lookup.cpp. This code will take two files of integers, as above, and use whatever algorithm you like to count how many of the integers in file2 are in file1. The program should simply output the number of successful searches. We will run and time all the submissions of lookup.cpp against our own input files to see which one runs the fastest.