CS 201: Data Structures

Word Clouds and Binary Search Trees

Partners: you may work alone or with a partner of your choosing

[Thanks to Sherri Goings for the idea and supporting code for this assignment.]

Word clouds, also known as "tag clouds," provide an interesting view of the words used in a document. Here, for example, is a word cloud based on the text of Alice in Wonderland:

The key feature of a word cloud is that words are displayed in a size proportional to the number of times they are used in the text on which the cloud is based. (Note that very common words, also known as stopwords, are typically not included in the cloud. Otherwise, all English word clouds would be dominated by "the", "and", "a", "in", etc.)

For this assignment, you will implement a binary search tree, with which you will count the non-stopwords that appear in the text file of your choice. The top (word, count) pairs can then be fed into a word cloud generator to yield pretty pictures.

Overall structure and command-line syntax

Your program will come in three main pieces:

A class called WordCountMap consisting of a binary search tree that maintains a list of words and their counts. (WordCountMap will make use of two very small classes called Node and WordCount.)
A class called WordCounter containing your main program and any support methods it needs. The main program will be in charge of opening a text file, counting all of the words it contains, and producing one of three types of output.
A class in WordCloudMaker.java, written by Sherri Goings with a few modifications by me, that will transform a list of word counts into an HTML file showing a simple word cloud.

Your WordCounter's main method will parse the command-line arguments to support the following three ways of running the program:

java WordCounter alphabetical textFileName

This will print out a list of words and their occurrence counts, one word per line, each line consisting of a word, a colon, and the word's count. This list will be sorted alphabetically by word. For example:

abide:1 above:3 absence:1 absurd:2 acceptance:1 ...
java WordCounter byCount textFileName

This is the same as the the alphabetical case, except the words will be sorted in decreasing order by count:

alice:391 very:146 little:133 out:119 one:108 down:107 ...
java WordCounter cloud textFileName numberOfWordsToInclude

This one will print HTML to standard output, containing a word cloud based on Sherri's code. A typical invocation of the cloud generator would be: java WordCounter cloud alice.txt 40, which would generate the word cloud based on the 40 most common non-stopwords in alice.txt. (If alice.txt contains fewer than 40 non-stopwords, then the cloud will just use all the words.)

WordCountMap specifics

WordCountMap will make use of two small (and nearly identical) classes for storing (word, count) pairs. For internal storage inside your binary search tree, you'll need something like this:

private class Node { public String word; public int count; public Node leftChild; public Node rightChild; }

The exact details of Node are up to you, since it's a class that will be invisible to users of WordCountMap. On the other hand, you'll also need to produce lists of (word, count) pairs as return values from some of your WordCountMap's public methods, so we also need a class for that:

public class WordCount { public String word; public int count; }

Because I'm specifying a few of the WordCountMap methods strictly, you'll need to make sure WordCount has word and count as public instance variables as shown above, and is stored in a separate WordCount.java file. You may add to WordCount if you find it helpful to do so.

Finally, your WordCountMap class must include the following methods:

/** * If the specified word is already in this WordCountMap, then its * count is increased by one. Otherwise, the word is added to this map * with a count of 1. */ public void put(String word) /** * Returns a list of WordCount objects, one per word stored in this * WordCountMap, sorted in decreasing order by count. */ public ArrayList<WordCount> getWordCountsByCount() /** * Returns a list of WordCount objects, one per word stored in this * WordCountMap, sorted alphabetically by word. */ public ArrayList<WordCount> getWordCountsByWord()

Note that the grader and I might test your code using our own main programs, and thus you must adhere to the specifications for WordCount, put, getWordCountsByCount, and getWordCountsByWord.

A few last notes

Hand in WordCountMap.java, WordCounter.java, WordCount.java, WordCloudMaker.java, stopwords.txt, and any other files required to run your program.

Make sure you adhere to the specifications for the command-line interface, WordCount, and the three WordCountMap methods described above. Also, make sure that WordCountMap implements an ordinary binary search tree--not some other word-counting data structure.

Want to mess with WordCloudMaker.java or stopwords.txt? Go right ahead.

I chose the "word:count" output structure for a reason. If you go to the "advanced" page at wordle.net, you can paste your "word:count" lines into their text box and get a great word cloud, customizable for many features. It's fun to count your own writing and seeing what your personal word cloud looks like.

As my grandma always used to say...

Start early, ask questions, and have fun!