Spelling

This assignment is due by 11:10 Wednesday, March 4, 1998. You may work with a partner. Submit your code and any relevant explanations using HSP. Submit your timing data and analysis on paper.

The Assignment

For this assignment, you will write two spell-checkers and compare their speeds. One will store its dictionary as a sorted list of words, and the other will use a hash table.

Each spell-checker should use more-or-less the same main program. Its structure, in pseudo-code, might look like this:


	Open the dictionary file.
	For each word in the dictionary
		add the word to the list or hash table.
	Close the dictionary file.

	Open the big monster text file.
	For each word in the BMTF
		make the word lower case, and
		* search the list or hash table for the word.
		(For debugging, maybe print misspelled words
			or at least count them.)
	Close the BMTF.

Note that many of these pseudo-code lines will be handled by function calls in the real code.

The central purpose of this assignment is to introduce you to the performance differences for binary search and hash tables. Both are wonderful search techniques, but how do they compare in theory and in practice? That's the big question. To answer this question, you'll need to time the searching operations. To do this, first comment out the line with the *. That is, do absolutely everything except the searching, and time the program using the UNIX "time" command. Then uncomment the * line and run the program again, timing the result. The difference should be a decent measure of the total searching time. Note that the BMTF really needs to be big so that the reported search time is not 0.

The Binary Search Version

For this version, you can use a very simple structure to store the dictionary: a sorted array of Strings. While you're debugging and testing your code at first, you'll want to write a small dictionary file and a small text file. But later, you'll need to try it on big files. Here is a big dictionary. Try the UNIX command "wc" on it, and you will discover that this dictionary has 79,339 words in it. Fortunately, as "more bigdictionary.txt" will demonstrate, they are already sorted, lower case, and one per line, so all you'll have to do is read them one at a time in order into an enormous array of Strings.

You will need to declare an array of Strings, but you won't be able to declare such a big one inside any function. The reason for this is that the system stack, where all local variables live, has a maximum size that cannot accomodate 80,000 String objects. So I'm going to do what all CS teachers take a solemn oath never to do--I'm going to tell you to use a global variable. If the CS Police find out, I'm history, so try to keep it quiet. Anyway, try this:


	#include	
	#include	

	String	dictionary[80000];

	int main( void )
	{
	etc.

That's about all I'm going to tell you about this one. You need to be able to build the dictionary, one word at an alphabetical time, and you need to be able to search for a word in the dictionary using binary search. That's not binary search tree--just binary search.

The Hash Table Version

For this version of the program, use a 160,000-item hash table and open hashing (also known as "chaining"), as described in section 5.3 of Weiss. Your code will start like this:

	#include	
	#include	
	#include	"hashtable.h"

	HashNode	*dictionary[160000];

	int main( void )
	{
	etc.
Here, "HashNode" is a linked-list node type defined by you in the header file hashtable.h. I'm using Figure 5.5 on page 185 as my model here. Weiss is a bit sneakier, putting some old code (his "List" type) to use in his implementation of an open hash table. You may use your judgement on this.

The big decision you'll need to make for this spell-checker is what your hash function will be. It should be chosen to give you a good spread of words through the hash table. Note that adding up the ASCII values of the characters will be no good--no word would come even close to a hash value of 10,000, let alone 160,000.

That's about all I have to say about the hash table version.

Advice

Design each program away from the computer. Think about the stages in which you'll code your programs. You want to write a small portion of the code, compile it, and test it, and then start that process again, gradually building up your fully functioning program out of partial versions. It takes care to decide on an order of coding that allows you to test little pieces along the way.

Try your code on a little dictionary before you attack the huge one.

Watch your account's quota. The big dictionary takes up a pretty big chunk of it. The UNIX command "ls -l" shows you the sizes in bytes of all your files. Note that you should also remove old executables, because they take up an obscene amount of space.

You may take code from Weiss, but you should cite it if you do.

As always, start early, keep in touch, and have fun.



Jeff Ondich, Department of Mathematics and Computer Science, Carleton College, Northfield, MN 55057, (507) 646-4364, jondich@carleton.edu