CS 201: Data Structures

Word Clouds and Binary Search Trees

Partners: you may work alone or with a partner of your choosing

[Thanks to Sherri Goings for the idea and supporting code for this assignment.]

Word clouds, also known as "tag clouds," provide an interesting view of the words used in a document. Here, for example, is a word cloud based on the text of Alice in Wonderland:

The key feature of a word cloud is that words are displayed in a size proportional to the number of times they are used in the text on which the cloud is based. (Note that very common words, also known as stopwords, are typically not included in the cloud. Otherwise, all English word clouds would be dominated by "the", "and", "a", "in", etc.)

For this assignment, you will implement a binary search tree, with which you will count the non-stopwords that appear in the text file of your choice. The top (word, count) pairs can then be fed into a word cloud generator to yield pretty pictures.

Overall structure and command-line syntax

Your program will come in three main pieces:

Your WordCounter's main method will parse the command-line arguments to support the following three ways of running the program:

WordCountMap specifics

WordCountMap will make use of two small (and nearly identical) classes for storing (word, count) pairs. For internal storage inside your binary search tree, you'll need something like this:

private class Node { public String word; public int count; public Node leftChild; public Node rightChild; }

The exact details of Node are up to you, since it's a class that will be invisible to users of WordCountMap. On the other hand, you'll also need to produce lists of (word, count) pairs as return values from some of your WordCountMap's public methods, so we also need a class for that:

public class WordCount { public String word; public int count; }

Because I'm specifying a few of the WordCountMap methods strictly, you'll need to make sure WordCount has word and count as public instance variables as shown above, and is stored in a separate WordCount.java file. You may add to WordCount if you find it helpful to do so.

Finally, your WordCountMap class must include the following methods:

/** * If the specified word is already in this WordCountMap, then its * count is increased by one. Otherwise, the word is added to this map * with a count of 1. */ public void put(String word) /** * Returns a list of WordCount objects, one per word stored in this * WordCountMap, sorted in decreasing order by count. */ public ArrayList<WordCount> getWordCountsByCount() /** * Returns a list of WordCount objects, one per word stored in this * WordCountMap, sorted alphabetically by word. */ public ArrayList<WordCount> getWordCountsByWord()

Note that the grader and I might test your code using our own main programs, and thus you must adhere to the specifications for WordCount, put, getWordCountsByCount, and getWordCountsByWord.

A few last notes

Hand in WordCountMap.java, WordCounter.java, WordCount.java, WordCloudMaker.java, stopwords.txt, and any other files required to run your program.

Make sure you adhere to the specifications for the command-line interface, WordCount, and the three WordCountMap methods described above. Also, make sure that WordCountMap implements an ordinary binary search tree--not some other word-counting data structure.

Want to mess with WordCloudMaker.java or stopwords.txt? Go right ahead.

I chose the "word:count" output structure for a reason. If you go to the "advanced" page at wordle.net, you can paste your "word:count" lines into their text box and get a great word cloud, customizable for many features. It's fun to count your own writing and seeing what your personal word cloud looks like.

As my grandma always used to say...

Start early, ask questions, and have fun!