[Thanks to Sherri Goings for the idea and supporting code for this assignment.]
Word clouds, also known as "tag clouds," provide an interesting view of the words used in a document. Here, for example, is a word cloud based on the text of Alice in Wonderland:
The key feature of a word cloud is that words are displayed in a size proportional to the number of times they are used in the text on which the cloud is based. (Note that very common words, also known as stopwords, are typically not included in the cloud. Otherwise, all English word clouds would be dominated by "the", "and", "a", "in", etc.)
For this assignment, you will implement a binary search tree, with which you will count the non-stopwords that appear in the text file of your choice. The top (word, count) pairs can then be fed into a word cloud generator to yield pretty pictures.
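The filtering step might be sketched as follows. This is only an illustration under stated assumptions: the class name StopwordFilterSketch and the method nonStopwords are hypothetical, and the tokenization rule (lowercase everything, split on anything that is not a letter) is one reasonable choice, not a requirement of the assignment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the required interface.
class StopwordFilterSketch {
    // Returns the words of text, lowercased and in order, minus any stopwords.
    // Assumption: a "word" is a maximal run of the letters a-z after lowercasing.
    static List<String> nonStopwords(String text, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            // split can yield an empty first token when text starts with punctuation
            if (!token.isEmpty() && !stopwords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

In a full program, the stopword set would presumably be loaded once from stopwords.txt, and each surviving word would be handed to the binary search tree for counting.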
Your program will come in three main pieces:
Your WordCounter's main method will parse the command-line arguments to support the following three ways of running the program:
java WordCounter alphabetical textFileName
This will print out a list of words and their occurrence counts, one word per line, each line consisting of a word, a colon, and the word's count. This list will be sorted alphabetically by word. For example:
java WordCounter byCount textFileName
This is the same as the alphabetical case, except the words will be sorted in decreasing order by count:
java WordCounter cloud textFileName numberOfWordsToInclude
This one will print HTML to standard output, containing a word cloud based on Sherri's code. A typical invocation of the cloud generator would be: java WordCounter cloud alice.txt 40, which would generate the word cloud based on the 40 most common non-stopwords in alice.txt. (If alice.txt contains fewer than 40 non-stopwords, then the cloud will just use all the words.)
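One way to check the three command lines before doing any real work is sketched below. The class name CommandLineSketch, the validate method, and the wording of the messages are all assumptions made for illustration. The assignment only fixes the three invocation forms themselves.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; WordCounter's main could call something like this first.
class CommandLineSketch {
    static final List<String> MODES = Arrays.asList("alphabetical", "byCount", "cloud");

    // Returns null when args match one of the three supported forms,
    // otherwise a message describing the problem.
    static String validate(String[] args) {
        if (args.length < 2 || !MODES.contains(args[0])) {
            return "Usage: java WordCounter alphabetical|byCount|cloud textFileName [numberOfWordsToInclude]";
        }
        boolean isCloud = args[0].equals("cloud");
        if (isCloud && args.length != 3) {
            return "cloud requires textFileName and numberOfWordsToInclude";
        }
        if (!isCloud && args.length != 2) {
            return args[0] + " takes only textFileName";
        }
        if (isCloud) {
            try {
                if (Integer.parseInt(args[2]) < 0) {
                    return "numberOfWordsToInclude must not be negative";
                }
            } catch (NumberFormatException e) {
                return "numberOfWordsToInclude must be an integer";
            }
        }
        return null;
    }
}
```

A main method could print the returned message and exit when it is non-null, then branch on args[0] to pick the output mode.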
WordCountMap will make use of two small (and nearly identical) classes for storing (word, count) pairs. For internal storage inside your binary search tree, you'll need something like this:
The exact details of Node are up to you, since it's a class that will be invisible to users of WordCountMap. On the other hand, you'll also need to produce lists of (word, count) pairs as return values from some of your WordCountMap's public methods, so we also need a class for that:
Because I'm specifying a few of the WordCountMap methods strictly, you'll need to make sure WordCount has word and count as public instance variables as shown above, and is stored in a separate WordCount.java file. You may add to WordCount if you find it helpful to do so.
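For concreteness, here is one possible shape for the two pair classes. Treat it as a sketch: only the public word and count fields of WordCount come from the requirements above. The Node layout, the constructors, and the toString method are assumptions added for illustration.

```java
// Internal storage for the tree. Hypothetical layout: Node's details are up to you.
class Node {
    String word;
    int count;
    Node left;
    Node right;

    Node(String word) {
        this.word = word;
        this.count = 1; // assumed: a node is created on a word's first occurrence
    }
}

// (word, count) pair handed back to callers. In a submission this would be
// declared "public class WordCount" in its own WordCount.java file.
class WordCount {
    public String word;
    public int count;

    // Optional additions, not required by the specification.
    WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public String toString() {
        return word + ":" + count; // matches the word:count output format
    }
}
```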
Finally, your WordCountMap class must include the following methods:
Note that the grader and I might test your code using our own main programs, and thus you must adhere to the specifications for WordCount, put, getWordCountsByCount, and getWordCountsByWord.
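To make the overall structure concrete, here is a minimal sketch of a binary search tree keyed on the word. The method names put, getWordCountsByCount, and getWordCountsByWord come from the assignment, but the signatures and return types shown here are assumptions, as is the behavior of put (insert with count 1, or increment an existing count). Check them against the required specifications before relying on them.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in so the sketch compiles on its own; the real class is
// public and lives in WordCount.java.
class WordCount {
    public String word;
    public int count;

    WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }
}

class WordCountMap {
    // Internal BST node (hypothetical layout).
    private static class Node {
        String word;
        int count = 1;
        Node left, right;

        Node(String word) { this.word = word; }
    }

    private Node root;

    // Assumed behavior: add word with count 1, or increment its count if present.
    public void put(String word) {
        if (root == null) { root = new Node(word); return; }
        Node current = root;
        while (true) {
            int cmp = word.compareTo(current.word);
            if (cmp == 0) { current.count++; return; }
            if (cmp < 0) {
                if (current.left == null) { current.left = new Node(word); return; }
                current = current.left;
            } else {
                if (current.right == null) { current.right = new Node(word); return; }
                current = current.right;
            }
        }
    }

    // An in-order traversal of a BST visits keys in sorted order.
    public List<WordCount> getWordCountsByWord() {
        List<WordCount> result = new ArrayList<>();
        inOrder(root, result);
        return result;
    }

    // Decreasing by count; the sort is stable, so ties stay alphabetical.
    public List<WordCount> getWordCountsByCount() {
        List<WordCount> result = getWordCountsByWord();
        result.sort((a, b) -> Integer.compare(b.count, a.count));
        return result;
    }

    private void inOrder(Node node, List<WordCount> out) {
        if (node == null) return;
        inOrder(node.left, out);
        out.add(new WordCount(node.word, node.count));
        inOrder(node.right, out);
    }
}
```

Sorting a copy of the in-order list is just one simple way to get the by-count ordering. The tree itself stays ordered by word, which is what keeps it an ordinary binary search tree.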
Hand in WordCountMap.java, WordCounter.java, WordCount.java, WordCloudMaker.java, stopwords.txt, and any other files required to run your program.
Make sure you adhere to the specifications for the command-line interface, WordCount, and the three WordCountMap methods described above. Also, make sure that WordCountMap implements an ordinary binary search tree--not some other word-counting data structure.
Want to mess with WordCloudMaker.java or stopwords.txt? Go right ahead.
I chose the "word:count" output structure for a reason. If you go to the "advanced" page at wordle.net, you can paste your "word:count" lines into their text box and get a great word cloud, customizable for many features. It's fun to count the words in your own writing and see what your personal word cloud looks like.
Start early, ask questions, and have fun!