CS 201: Data Structures (Spring 2018)

HW12: Hashing

Due: Wednesday, 05/30 at 17:00 (Note unusual due time by college regulation.)

1. Goals

To get a better understanding of hashing.

2. Setup

This is a solo written assignment.

This assignment is primarily a problem set, and should be completed individually. You are welcome to talk with others about the problems, but you must write up your assignment by yourself. If you talk with others about the problems, please note their names at the top of your homework submission.

The problems reference work you did during an in-class lab. You may use any notes from that session, or even turn in parts of your notes from the lab verbatim. You will also need your code from the lab as you will turn it in. You should note your and your partner's name at the top of the lab; if you missed class, you should do the lab on your own. Other than that, you should not use any notes from talks with others while writing up your assignment. If you need notes, it likely means you still don't quite understand how to do the problem. Go back over your notes, ask more questions, or try to solve a similar problem on your own. Then come back a day or so later and try to write up your solution without your notes.

As always, you will submit your work on Moodle. This means that you will need to create a single PDF of your solutions to the written exercises. You have a few options for this:

Remember, the goal of your writeup is to communicate your understanding to us, so we can't accept writeups that we cannot read (whether that's due to handwriting, image quality, or file format issues).

3. The assignment

In class, you worked on a lab about hash codes. Make sure you have your notes/code from that lab before starting these questions.

  1. If you didn't finish the lab in class, finish it now. Give results for the number of empty buckets and the average number of items per non-empty bucket when using 196613 and 200000 buckets with each of the five hash functions (including Java's built-in one) from both enable1.txt and pride.txt, and explain which hash function you would prefer to use (and why) based on these results. You're welcome to supplement your results with any thing else you measured about collisions in the lab: use your evidence-based reasoning skills!
  2. Briefly, explain why hashCode2(String) from the in-class lab isn't a very good hash function for English words. Give two (or more) distinct reasons.
  3. Imagine I decide to have a hash table with 264 buckets. My hash code function maps to each bucket based on the first four letters of the word (e.g., aaaa has its own bucket, aaab has its own bucket, and so on). Explain why this hash function may be inefficient for storing English words.
  4. I have a ZooAnimal class that has instance variables for the animal's name, species, year of birth, and a path to an image of the animal. Propose a way to implement hashCode() for ZooAnimals such that two ZooAnimals that are equal (based on homework 7) will return the same hash code and ZooAnimals that are not equal will tend not to return the same hash code (i.e., just returning the same code for all ZooAnimals is not a correct answer). You may use any methods you want.
  5. Imagine we have seven buckets in our table (i.e., our array is of length 7). We've decided to handle collision resolution via linear probing. Our hash code takes the first letter of a word and turns it into an integer based on its position in the alphabet such that a=1,b=2,c=3,...,z=26. For compression, we just use modulo the array length.
    1. Draw the array after adding the following items, in the order below:
         group
         set
         ring
         lattice
         field
      
    2. What indices in the array do I examine if I'm trying to find out if the hash table contains the word magma? I examine an index if I access that index (e.g., if I say array[i] at some point, then I've examined index i).
  6. Now imagine we have ten buckets in our hash table, and we do collision resolution using quadratic probing. We use the standard compression function of the modulus of the array length. We'll be storing integers in our array, and the hash code function returns the integer mod 10: that is, the remainder when the integer is divided by 10. Draw the array after adding the following items, in the order below; assume the hash table does not resize during these additions:
       38
       52
       47
       19
       78
       67
       71
       24
    
  7. When implementing a Map using hashing, we usually assume that the object to be stored has a hashCode function and that the compression function is part of the Map implementation. Why don't we have both these functions in the same class: i.e., why wouldn't it make sense for the Map implementation to include code for both the hash code and compression functions, or for each object to implement a hash code function and make sure that hash code returns a value in the correct range? (Make sure you address both cases.)
  8. Extra practice: You do not need to turn these in, but I encourage you to try the self test questions in your textbook related to hashing, especially those related to collision resolution using separate chaining and the different open addressing methods (linear probing, quadratic probing, and double hashing).

4. Submission and grading

Submit to Moodle a zip archive containing:

Remember that you must write up your answers alone, without looking at notes from collaborations with others, and you should include the names of any collaborators at the top of your writeup. If you didn't talk with anyone about the problem set, please note that (e.g., "Worked alone").

Grading

This assignment modified from one designed by Anna Rafferty.

Start early, have fun, and discuss questions on Moodle.