You may work with a partner for this assignment. In fact, I encourage you to do so, since I think you will benefit from discussing the project with somebody else. Submit your source code via HSP.
Last time I asked you to write a program to manipulate words from a text
file, you didn't know about arrays or ArrayList
s. This left you in the position
of doing only computations that could be completed while storing only one word at
a time. Computing the average word length, for example, can be done by keeping a
running total of the lengths of the words seen so far: get a word, add its length
to the total, repeat.
As it happens, the most interesting computations involving words can't be done by storing only one word at a time. A simple example is the preparation of a list of all the words in a document, along with the number of times each word appears in the document. You might argue that counting words isn't especially interesting, but in fact word counts have many remarkable applications. Word counts have been used, for example, to identify the authors of those Federalist Papers whose authors were previously unknown (Frederick Mosteller and David L. Wallace. Inference in an Authorship Problem. A Comparative Study of Discrimination Methods Applied to the Federalist Papers. Journal of the American Statistical Association 58 (1963): 275-309). For another, most automatic speech recognition (computer dictation) systems use counts of words, word pairs, and word triples to help figure out what sentence you have spoken into a microphone.
For this assignment, you will write a program that takes a file as input and produces a list of the words found in the file along with the number of times each word appears in the file. Your list of words and their counts will be displayed twice--once sorted by count, largest count first, and once sorted alphabetically. Your program's output will look something like this:
==== Alphabetical ===== aardvark 2 and 27 are 14 as 8 asp 1 ... ==== Sorted by count ==== the 45 it 28 and 27 ...
Take a look at WordCountTester.java.
The main
method in WordCountTester
instantiates a
Scanner
and a WordCounter
, and then calls a few
methods on its way to displaying the required lists.
Your job is to create the WordCounter
class and all the public methods
required by WordCountTester
. I will leave the organization of the
WordCounter
class to you, but you should make sure that whatever you write,
it works with WordCountTester
without changing any of the
WordCountTester
code.
How would I approach this program? Like so:
First, I would create a WordCounter.java
with
stubs for all the methods WordCountTester
expects
WordCounter
to have. Once you have created these stubs correctly,
you should be able to compile WordCounter.java
and
WordCountTester.java
and run the resulting program. Since
your methods will be stubs, the program won't really do anything yet,
but you'll at least have a shell that compiles and runs.
What methods will you need?
new WordCounter()
" statement in main
will work. countWord
should accept a word and either
add it to the list or increase the count if the word was already on the list. printReportAlphabetized
should print the list of words and their
counts, sorted in alphabetical order. printReportByCount
should print the list of words and their counts,
sorted with the highest-count words first. Second, I would decide on instance variables for WordCounter
. Here, I
have a little suggestion. I would create a class called WordWithCount
or some such thing. This class would have two public instance variables:
a String
called word
and an int
called count
. The whole point of the WordWithCount
class would to be to represent an ordered pair of a word and its count--like ("the",47)
or ("aardvark",2). This class wouldn't need any methods--maybe a constructor, but
nothing else.
Once I had a WordWithCount
class, I would give
WordCounter
exactly one instance variable: an
ArrayList<WordWithCount>
. That is, the data contained
by our WordCounter
instance would be a list of (word,count) pairs.
Next, I would write a very simple version of countWord
. Have it
instantiate a new WordWithCount
object containing the specified word, and
add this object to the ArrayList
of words. This won't solve the problem
of increasing the word count when you encounter a word for the second or third or
fourth time, but it will be a start.
Then write a simple version of printReportAlphabetized
to just print
out the contents of the ArrayList
in whatever order it happens to be in.
Neither of these methods will be in their final forms at this point, but if you do as I have suggested, you will now have a running program that reads words from a file, adds them to a list, and then prints out the list. This isn't the end goal, but it's an important intermediate goal, and not terribly difficult to achieve.
I think I would next write a private method called
sortAlphabetical
that would alphabetize the ArrayList
of (word,count) pairs. Then, adding a single sortAlphabetical()
call to the beginning of printReportAlphabetized
, I would have
a program that prints out the words from the file in alphabetical order. Again,
not the final goal (the counts will all be 1 and repeated words will appear
multiple times), but it's inching closer to the end.
I would continue to ignore printReportByCount
, and
now return to repair countWord
. The thing to do here is to scan
through the ArrayList
, looking for the word being counted. If you
find it, then you can increase its count. Otherwise, instantiate a
new WordWithCount
as you did a couple steps ago and add it to
the list.
Success? Now you have a program that prints out an alphabetized list of the words in the file, with counts and no word repetitions. Just like you want.
All that's left is to produce the list sorted by count. Again, I would
add a private method to do the sorting (sortByCount()
), call
it at the start of printReportByCount
, and then use the same
code to print the ArrayList
as I wrote for printReportAlphabetical
.
And that, my friends, is that. Not easy, not a short process. But if you do it in small steps like this, with a functioning (though not complete) program at the end of each step, you'll make your way to a completed program in an orderly way. This approach takes a certain amount of discipline, but it pays back in lots of ways, both technical (better design) and emotional (less chaos and panic).
Here's my version of WordWithCount.java, which you may use or not use, as you see fit.
Once you have this program, it's fun to count words in all sorts of stuff. Most common word in Hamlet? Number of times "Elizabeth" appears in Pride and Prejudice? No problem.
Start early, have fun, and keep in touch.