Counting words

This assignment is due by noon Friday, February 20. You may work with a partner. Submit your code and any relevant explanations using HSP.

The Assignment

For this assignment, you will write a program that counts the number of times each word occurs in a given text file. To do this, you should maintain a binary search tree of nodes. Each node contains a word, which serves as its key, and a counter that records how many times that word has been encountered. The output of your program should be an alphabetized list of words and their counts.

When you get your program working, try it on some big files. If you look around the NeXT directory structure (try /LocalLibrary/Literature), you might find Hamlet or something similar to use. The Web is also full of enormous text files waiting to be downloaded and counted. Just for fun, time your program using the UNIX utility time. Now try your program on the small dictionary file /usr/dict/words (about 25000 words, one per line). How does your program perform on this one?

I will make no recommendation here as to how to structure your program internally, but you should plan your code's design away from the computer and try to program modularly. You may be able to reuse much of this program's code on later assignments, but only if you separate it into reasonable pieces. For example, the functions that get words from the input should be separate from the BST functions.

What do you think should constitute a word? I'll leave that one up to you, but it seems to me that "word?" is not a word, though a simplistic approach to word reading might have found this very "word" at the end of the previous sentence. On the other hand, you should definitely use the most convenient-to-code definition of word for the early stages of your program. Get the binary search tree and word counts working before you start worrying about the difference between "word", "Word", and "word!".
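Once the tree and counts work, one possible refinement (a sketch under one particular definition of "word", not the required one) is to clean up each token before it goes into the tree. The hypothetical function normalizeWord below strips non-letters from both ends of a token and lowercases what is left, so "word?", "Word", and "word!" all become "word". Characters in the middle, such as the apostrophe in "don't", are left alone.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: trim leading and trailing non-letter characters
// from token and convert the remainder to lowercase. Returns an empty
// string if the token contains no letters at all.
std::string normalizeWord(const std::string& token) {
    std::size_t begin = 0;
    std::size_t end = token.size();
    // The casts to unsigned char keep isalpha/tolower well defined
    // for characters with negative char values.
    while (begin < end &&
           !std::isalpha(static_cast<unsigned char>(token[begin]))) {
        ++begin;
    }
    while (end > begin &&
           !std::isalpha(static_cast<unsigned char>(token[end - 1]))) {
        --end;
    }
    std::string word = token.substr(begin, end - begin);
    for (char& c : word) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return word;
}
```

A caller would skip any token for which normalizeWord returns an empty string, since a token like "--" contains no word at all.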

A couple of programs that might help you are inputFiles.cpp and gnustrings.cpp.

Start early, keep in touch, and have fun. What is the most common word in Hamlet, anyway? (Ooh! How would you sort these babies by count instead of key?)
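If you want a hint on that last question, here is one possible approach among several. It assumes you have already copied the (word, count) pairs out of your tree into a std::vector, for instance during an in-order traversal. The names WordCount, moreFrequent, and sortByCount are invented for this sketch.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// A (word, count) pair copied out of the tree.
typedef std::pair<std::string, int> WordCount;

// Ordering for sorting: larger counts come first, and words with equal
// counts fall back to alphabetical order so the result is deterministic.
bool moreFrequent(const WordCount& a, const WordCount& b) {
    if (a.second != b.second) return a.second > b.second;
    return a.first < b.first;
}

// Sort the entries in place, most frequent word first.
void sortByCount(std::vector<WordCount>& entries) {
    std::sort(entries.begin(), entries.end(), moreFrequent);
}
```

After sortByCount runs, entries.front() holds the most common word. Building a second binary search tree keyed on count would also work, though you would then have to decide what to do about words with equal counts.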

Jeff Ondich, Department of Mathematics and Computer Science, Carleton College, Northfield, MN 55057, (507) 646-4364, jondich@carleton.edu