CS 201: Data Structures

Searching in Java: how fast are Java's collection classes?

Partners: you may work alone or with a partner of your choosing

Note to future users of this assignment. The basic idea is solid. However, the students ran into a consistent problem analyzing dependency on N (the size of the dictionary). It was common to see T decreasing as a function of N for binary search, HashMap, and TreeMap. This was because (1) their timing loops included both the reading of the file to be spell-checked and the printing of misspelled words, and (2) they used subsets of a larger dictionary file. Thus, a smaller dictionary meant lots more printing of misspellings, and the printing time swamped the search times for the faster algorithms. When I tried it myself, I used the full words.txt as my smallest dictionary, adding "words" like goat0001, goat0002, goat0003, etc. to increase the dictionary size from there without changing the list of misspellings. So my printing time remained constant, and I was also using much larger N's than they were. My results looked like they came right out of the textbook. I would definitely run this assignment again, but I would tell them how to handle the dictionary (or just tell them not to print results) to avoid this problem.

In computer science, searching refers to the process of determining whether a given collection of objects contains an object with specified properties. The collection often consists of a identically typed objects, each containing a key that identifies the object, and a search consists of the attempt to find the particular object that matches a specified search key. For example, if you search a database of Person objects by full name, the particular full name you're looking for (e.g. "Elmer Fudd") is your search key.

Often, searching is even simpler than that: you have a list of strings, and you want to know whether your search string is in the list. That's the case where the object and its key are one and the same, and it's also the case we're going to look at now.

Java provides several collection classes for storing objects. For this assignment, you will perform timing exeriments to compare the search speeds of four Java search structures:

ArrayList<String>, unsorted, using linear search
ArrayList<String>, sorted, using binary search
TreeMap<String, Object>, where the Object associated with each String is always null
HashMap<String, Object>, where the Object associated with each String is always null

The question at hand is this. Given a collection of N strings stored in one of these four ways, and a collection of K search strings, what can you say about the relationship between N, K, and the time it takes to search for K strings?

What to hand in

Hand in via Moodle:

A report describing the timing experiments you performed, your timing data, and your interpretation of what your results say about the relationship between N, K, and search time. Your report should be a PDF file no longer than 5 pages including charts and graphs, if any.
Source code you used to collect your timing data.

Suggestions and offerings

One sensible approach to this task would be to start with the code I demonstrated in class on Friday: WordGameDictionary.java, LinearSearchDictionary.java, and SpellChecker.java. WordGameDictionary provides the interfaces required to add words from a dictionary file into a collection and to search in the collection once it's assembled. LinearSearchDictionary provides an implementation of the first bullet point above (ArrayList<String>, unsorted, with linear search). To handle the other three collection types, you will need to create three more implementations analogous to LinearSearchDictionary, all of which can be very short. (All four of my implementations came to 161 total lines, of which 50 are comments, and many more are identical in all four files.)
Note that I have included simple timing code in SpellChecker, so you can use that to help you get started. Note also that I am timing only the searching. For this experiment, I don't care how long it takes to load the dictionary. I only care how long it takes to search the dictionary.
To study the relationship between dictionary size N and total search time, you'll need to use a variety of sizes of dictionary. You could get several orders of magnitude by just using various-sized subsets of the 267,000-word words file I used when we spell-checked Hamlet. If you want even larger N's for comparison, you could add some random string to the end of every word in words.txt to get an additional 267,000 "words" (e.g. use "XXXX" to get a "dictionary" containing "AAXXXX", "AAHXXXX", "AAHEDXXXX", etc.).
To study the relationship between the number of searches K and total search time, you'll need a variety of sizes of files to be spell-checked. One option here is to find a really big file and try subsets of it. Another is to find a modest-sized file (e.g. the text of Hamlet) and use as many copies of it as you need to get the values for K that you want.
You'll need pretty big values of K to see noticeable distinctions between the fastest of the four collection classes.
HashMap<K, V> and TreeMap<K, V> are designed to allow you to map some key (of class K) to some value (of class V). For example, you might use String as K and Person as V to enable you to search a collection of Person objects by full name. But in our situation, we don't care about what V is--we just care about how long it takes to find it. Thus, you should just set the values in your HashMap or TreeMap to null.
There's nothing mysterious about this assignment. We want to understand the relationships, if any, between N, K, and running time. Your job is to use your excellent experimental and analytical abilities to discover what you can about those relationships.
Sure, you could work alone. But it's pretty cool when you say "I wonder what happens if we do this?" and your partner says "Let's try it!"

As always...

Start early, ask questions, and have fun!