CS 201: Data Structures (Winter 2018)
HW11: Cloud dreamer
Due: Monday, 03/05 at 22:00
1. Goals
To implement a tree-based structure that can function similarly to a map.
2. Setup
This is a pair programming assignment. You will continue with your same partner (unless I have told you otherwise).
Download the starter files and unzip.
3. Introduction
Word clouds, also known as "tag clouds," provide an interesting view of the words used in a document. Here, for example, is a word cloud based on the text of Pride and Prejudice:
The key feature of a word cloud is that words are displayed in a size proportional to the number of times they are used in the text on which the cloud is based. For example, Elizabeth Bennet and Mr. Darcy are quite central to Pride and Prejudice, so these words appear large, as expected.
Very common words, also known as stopwords, are typically not included in the cloud. Otherwise, all English word clouds would be dominated by "the", "and", "a", "in", etc.
For this assignment, you will implement a search tree, with which you will count the non-stopwords that appear in the text file of your choice.
4. Specification
Your program will come in three main pieces:
-
A class called
TreeWordCountMap
consisting of a search tree that maintains a list of words and their counts. (TreeWordCountMap
will contain two very small nested classes calledNode
andWordCount
.) -
A class called
CloudDreamer
containing your main program and any support methods it needs. The main program will be in charge of opening a text file, counting all of the words it contains, and producing one of three types of output. -
A class called
WordCloudMaker
, which takes words and frequencies as input and prints word cloud in HTML as output like this one. This class, originally written by Sherri Goings, is supplied to you, and you do not need to make any edits toWordCloudMaker
.
CloudDreamer specifics
Your CloudDreamer
's main method will parse the command-line
arguments to support the following three ways of running the program:
java CloudDreamer textFileName alphabetical
This will open the file
textFileName
and count the number of times each word occurs. The output format is one word per line, each line consisting of a word, a colon, and the word's count. This list will be sorted alphabetically by word. For example:abatement:1 abhorrence:6 abhorrent:1 abide:2 abiding:1 abilities:6 ...
java CloudDreamer textFileName frequency
This is the same as the alphabetical case, except the words will be sorted in decreasing order by count:
Mr:783 Elizabeth:594 very:471 such:376 Darcy:370 Mrs:343 ...
java CloudDreamer textFileName cloud numberOfWordsToInclude
This one will print HTML code of a word cloud using the included
WordCloudMaker
.If I run
java CloudDreamer pride.txt cloud 23
, it would generate a word cloud based on the 23 most common non-stopwords inpride.txt
. (Ifpride.txt
contains fewer than 23 non-stopwords, then the cloud will use all the words.)You may assume that
stopwords.txt
is present in the same directory.
TreeWordCountMap specifics
Inside your TreeWordCountMap
class, you will have two small
nested classes to store (word, count) pairs.
-
First, write a
WordCount
class that starts like this:public static class WordCount { private String word; private int count; public String getWord(); public int getCount(); }
We are making it public because it will be used as a way to giveWordCloudMaker
a list of (word, count) pairs.WordCloudMaker
refers to this nested class asTreeWordCountMap.WordCount
. It also expectsWordCount
to have public getter methodsgetWord
andgetCount
, which you should implement. You do NOT need setter methods sinceTreeWordCountMap
can freely accessWordCount
's private instance variables. This is whyWordCount
should be nested insideTreeWordCountMap
. Next, write a
Node
class that starts like this:private static class Node { private WordCount leftWord; private WordCount rightWord; private Node leftChild; private Node middleChild; private Node rightChild; }
We are making this private because nodes are implementation details internal to
TreeWordCountMap
: users of this class do not need to know the kind of nodes you are using. As such, you are free to change this class as you see fit. However, for ease of grading, please keep the names (and types) of the five instance variables listed above.
The weird Node
above might give you a clue that, instead of
implementing a binary search tree, whose code is in your textbook, you will
implement a slight variant.
- Each node of your tree either contains one word or two words.
- If a node contains one word (i.e.,
rightWord
isnull
) then it is a leaf (has no children). - If a node contains two words, it has up to three children.
- If it exists,
leftChild
and all its children contain words beforeleftWord
. - If it exists,
middleChild
and all its children contain words afterleftWord
but beforerightWord
. - If it exists,
rightChild
and all its children contain words afterrightWord
.
You may recognize that this is similar to a 2-3 tree. However, making the tree balanced is optional.
Finally, your TreeWordCountMap
class must include the following
public methods:
/** If the specified word is already in this map, then its count is increased by one. Otherwise, the word is added to this map with a count of 1. */ public void increment(String word); /** Returns true if word is contained in this map. */ public boolean contains(String word); /** Returns a list of WordCount objects, one per word in this map, sorted alphabetically by word. */ public List<WordCount> getWordCountsByWord(); /** Returns a list of WordCount objects, one per word in this map, sorted in decreasing order by count. */ public List<WordCount> getWordCountsByCount();
Make one of your getWordCountsBy*
methods O(n). The
other can be slower.
Note that the grader and I might test your code using our own main programs,
and thus you must adhere to the specifications for
WordCount
, increment
, contains
,
getWordCountsByWord
, and getWordCountsByCount
.
(Bail out) If you have trouble getting all this to work, implement a
standard binary search tree instead so you can get points for the
CloudDreamer
part of the assignment (and partial credit for
TreeWordCountMap
). If you do this, take special care to not be
copying code from any source, including the textbook.
5. Code notes
-
Make sure that
TreeWordCountMap
implements the specified tree structure. You should not use other data structures such asTreeMap
orHashMap
. You must implement the tree structure directly yourself. -
I do not care if you treat "alacrity" or "Alacrity" as the same word or
different words for the purpose of this assignment. However, note that
stopwords.txt
only contain words in lower case. As such, when determining whether a word is a stopword, case should be ignored. -
You should strip punctuations from words. One way to do this is to use the
String
methodreplaceAll
. The regular expression[^a-zA-Z]
(including the square brackets) matches any character that is NOT a letter. -
You can save the HTML output (or any output from command-line programs) by
using shell redirect with the
>
syntax:java CloudDreamer textFileName cloud 23 > myCloudOutput.html
-
You do NOT need to make any changes to
WordCloudMaker.java
. But if you want to make changes; go ahead. The same applies tostopwords.txt
. - I included Jane Austen's Pride and Prejudice from Project Gutenberg. You may want to download other books to test (and read!).
- I chose the "word:count" output structure for a reason. If you go to the "advanced" page at wordle.net, you can paste your "word:count" lines into their text box and get a great word cloud, customizable for many features. It's fun to count your own writing and see what your personal word cloud looks like.
- Optional extra challenge: If you are feeling very ambitious, make your tree a balanced 2-3 tree. Note: this is hard, and should be pursued only if you desire the challenge. If you do this, you would (of course) need to allow 2-nodes to have children.
-
Easier extra challenges:
Implement
get
,remove
, or other methods ofjava.util.Map
.
6. Submission and grading
Submit to Moodle azip
archive containing:
CloudDreamer.java
andTreeWordCountMap.java
.
WordCloudMaker.java
, submit that, too.
Follow the previously established guidelines.
Assignment requirements
This is a partial list of the things that we'll be looking for when evaluating your work (in addition to style):
-
Strict adherence to the specifications for the command-line interface,
WordCount
, and the requiredTreeWordCountMap
methods. -
(At least) one of
getWordCountsByWord
andgetWordCountsByCount
is O(n). - Implement tree structure directly, without using any Java classes for trees or maps.
- Nodes with one word are leaves (unless you are attempting the optional extension of making the tree balanced).
Grading
- Assignment requirements [20 points]
- Comments, style, and design [5 points]
This assignment modified from one designed by Sherri Goings and Jeff Ondich.
Start early, have fun, and discuss questions on Moodle.