CS 127, Assignment #5: Words

The first part of this assignment is due by class time on Wednesday, May 14, 1997, and the second part is due by class time on Monday, May 19, 1997. You may work with a partner on this assignment.

Background

The general problem of searching--that is, given a search key and a database of items with keys, determine whether the key of any item is the same as the search key and if so, retrieve that item--consumes a great deal of computing power. It is also very important to do well. A poorly designed searching mechanism will make library cataloguing software, Web search engines, compilers, and innumerable other programs intolerably slow. On the other hand, a well-designed searching mechanism can provide startlingly fast response, given the vast amount of information some search engines are asked to manage.

One application of searching is spell-checking. Here, the database is a dictionary, and one search is performed for each word in the text being checked. For this assignment, you will use a rudimentary spell-checker (spell.c) to test various implementations of a dictionary. That is, your job will be to write several versions of a dictionary data type, and collect performance data and memory usage information about each version in the context of spell-checking.

I have provided you with one implementation of a dictionary in dictionary.c and dictionary.h. My dictionary stores the words in an array of strings (that is, a 2-dimensional array of char's), sorted in alphabetical order. To search this dictionary, I use binary search. One of the first things you should do is test my program by compiling dictionary.c and spell.c together into an executable called, say, spell. To run this program, type spell dictionaryfile textfile, where dictionaryfile is the name of a file containing the words in the dictionary, and textfile is the name of the file you want checked for spelling errors. Two dictionary files you could use are words.txt and littlewords.txt (the latter of which is named for the lengths of its words, not the number of words).

The Assignment

By Wednesday, 5/14/97, hand in .h and .c files for two dictionary implementations: one using sequential search on a sorted array of strings, and the other using an AVL tree.

By Monday, 5/19/97, hand in .h and .c files for a dictionary implementation based on a hash table. Also, hand in ON PAPER a discussion of the relative performance of the four implementations (my binary search plus your linear search, AVL tree, and hash table). To test performance, you should first choose a dictionary file and a text file to test on all four implementations. Then, run and time each of your implementations. Also, compute how much memory each implementation uses to store the dictionary (printing out values of "sizeof(...)" can help here). You may, of course, try many dictionary/textfile pairs and report your results. In any case, the report you hand in should include not just your data, but also your interpretations of the data, with an emphasis on the relative merits of the various dictionary implementations.

Other stuff

You should time the Initialize & Load Dictionary phase without the Spell-Checking phase, and then time the whole program. That way, you can tell whether slow performance is due to loading or searching or both.

I have set MAX_WORD_LENGTH to 24 in my implementation of the dictionary. You can set it smaller for littlewords.txt.

You should not change spell.c or the prototypes for the four Dictionary functions. You will need to change dictionary.h to introduce your new types, and perhaps new functions called by the four essential functions, but don't change the prototypes for InitializeDictionary, AddToDictionary, IsInDictionary, or PrintDictionary.

The dictionary files I have provided you are both alphabetized, with one word per line. words.txt contains 24,259 words of length up to 22 (that's electroencephalography, by the way), and includes each letter as a word (not just "I" and "a") as well as a lot of proper nouns and acronyms ("AAA", for example). Unfortunately, although it does include "jump", it doesn't include "jumped," "jumping," "jumper," or "jumps." The other dictionary file, littlewords.txt, consists of 79,339 words of length 8 and under. It loses out on the long words, but it does a very good job of getting all the forms of short words (including not only the forms of "jump" listed above, but also "jumpier," "jumpiest," and my favorite, "jumpily").

I don't know about you, but I'd be inclined to create a small dictionary file and a small text file for testing my code at intermediate stages.

Start early, keep in touch, and have fun.



Jeff Ondich, Department of Mathematics and Computer Science, Carleton College, Northfield, MN 55057
(507) 646-4364, jondich@carleton.edu