CS 127, Assignment #4: Words

This assignment is due by class time on Monday, April 28, 1997. You may work with a partner on this assignment.

The Assignment

For this assignment, you will write a program that will count the number of times each word occurs in a given text file. To do this, you should maintain a binary search tree of nodes, each of which contains a word as its key, and a counter to keep track of the number of times the word in question has been encountered. The output of your program should be an alphabetized list of words and counts.

I will make no recommendation here as to how to structure your program internally. The only interface I insist on is the command line interface. If your program is called with no command line arguments, it should take its input from stdin. If your program is called with one command line argument, it should interpret that argument as a file name, and take input from that file (if it exists). In all cases, your program's output should go to stdout, with error messages sent to stderr. See here for a little help processing command line arguments.

When you get your program working, try it on some big files (if you look around the NeXT directory structure, you might find Hamlet or something similar to use--the Web is also full of enormous text files waiting to be downloaded and counted). Just for fun, time your program using the Unix utility time. Now try your program on the small dictionary file /usr/dict/words (about 25000 words, one per line). How does your program perform on this one (by the way, you might want to redirect the output to a file in this case).

Program modularly. You may be able to use much of the code from this program on your next assignment (a simple spell-checker), but only if you separate this program into reasonable pieces. The functions that get words from the input should be separate from the BST functions, etc.

What do you think should constitute a word? I'll leave that one up to you, but it seems to me that "word?" is not a word, though a simplistic approach to word reading might have found this very "word" at the end of the previous sentence.

Start early, keep in touch, and have fun. What is the most common word in Hamlet, anyway?

Jeff Ondich, Department of Mathematics and Computer Science, Carleton College, Northfield, MN 55057
(507) 646-4364, jondich@carleton.edu