CS 127, Data Structures

Note: Changes have been made to this assignment to clean it up for re-use later (11/8/01).

Experimental CS: Benchmarking Hashing vs. Sorting & Binary Searching

Due Monday, 11/5/01, at 5 PM.

Overview

For this project, you will run an experiment to compare the effectiveness of hashing, vs. the effectiveness of sorting and binary searching. Does the theoretical analysis hold up? We'll see!

Details

1. Create a file, file1, of n random integers.

2. Create a second file, file2 of random integers that contains n/2 integers contained in file1 and n/2 new random integers.

3. The idea for this assignment is to search file1 for each of the integers in file2. Because of the way you created the files half of the key integers in file2 will be found and half will not. You will implement this search in three ways:

sorting and binary searching
hashing in an array
hashing via chaining.

The goal is to see if there is a time difference between the methods.

4. You will need to time the two algorithms. Look at timing.cpp for some sample code on how to do this. Additionally, you will need to keep track of various counts that reflect the amount of work that the scheme had to do. For example, for the binary search::

sort_comp // keeps track of total number of comparisons of keys needed to sort data
bs_comp_success // keeps track of total number of comparisons of keys for all successful searches
bs_comp_fail // keeps track of total number of comparisons of keys for all failed searches

Likewise, for hashing in an array, you need to keep track of

hash_create_count // keeps track of total number of times hash function is called while creating hash table
hash_create_comp // keeps track of total number of comparison of keys needed while creating hash table
hash_create_next // keeps track of total number of times that the hashing scheme has to move to another slot while creating hash table
hash_count // keeps track of total number of times hash function is called while searching hash table
hash_comp // keeps track of total number of comparison of keys needed while searching hash table
hash_next_success // keeps track of total number of times that the hashing scheme has to move to another slot during successful searches
hash_next_fail // keeps track of total number of times that the hashing scheme has to move to another slot during failed searches

You should modify these as you see fit, as well as keep track of the appropriate information for hashing by chaining.

You can use these counts to compute various averages. For example,

(sort_comp + bs_comp_success + bs_comp_fail)/n

measures the average cost (in comparisons) needed to sort and binary search the data.

(sum of last 7 counts)/n

measures the average cost (in comparisons and function calls) needed to create and search the hash table. Think of some other useful averages and report them. For example, you might want to compute separate averages for success and failure, or you might estimate how much more expensive a function call is than a comparison and use this to compute a weighted average. The rough outline of your code should be:

input file1;

start clock;
    create sorted list S from file1;   // use efficient mergesort on linked list
                                       // and then put result into an array
    for each i in file2, use binary search to find i in S or discover i not in S
stop clock;

start clock;
create array based hash table T from file1;
for each i in file2, use hash table search to find i in T or discover i not in T
stop clock;

start clock;
create chaining based hash table T from file1;
for each i in file2, use hash table search to find i in T or discover i not in T
stop clock;

report times;
report all counts and averages

You should make sure in each of your examples that you choose n large enough so that you can get good experimental results. Comparing two programs that run in 0.05 seconds and 0.07 seconds and concluding that the first one is faster is not good experimental technique - anything could have contributed to the difference! This does not necessarily mean that you should make n as large as possible. A moderately sized n with multiple trials on more than one random data set can be revealing.

Keep n moderate in size so that we don't overburden the servers as you run your experiments. Make sure to keep n small while you're testing your program.

Submissions

Hand in electronically your search code (in a program called benchmark.cpp) and a version of file1 and file2 that each contain 100 integers. The version of benchmark.cpp that you turn in should work with these two files, so that we can test your code on a small dataset. Also turn in a paper document (due in the grader's mailbox at 5 PM) reporting on the results you got for various n. This report should describe the details of the hashing schemes you used, the times, and an analysis of your average counts and times. This document will be graded as a written work, and you should therefore pay attention to matters of spelling, grammar, style, and clarity.