HW9: Random Sentence Generator : CS 111

The goal of this assignment is to learn a bit about probabilistic models of sequential data like language, and also to get practice with dictionaries and string processing.

Create a directory named random-sentences that will contain your code for this project. Download the following three text files and put them in that directory:

(I originally downloaded these files from Project Gutenberg, and then modified them slightly to make them easier to work with for this assignment.)

It is strongly recommended that you work on this assignment with a partner. Don't be shy; find somebody new to work with! If you'd like to work in a pair, please figure out who you'd like to work with via Piazza or email, and then email me once your pair is decided. Make sure to put both people's names at the top of each file. Write your code by pair-programming: code together, at the same computer, and switch off who's at the keyboard. Only one of you should ultimately submit your team's code via Moodle.

A “Markov chain” is a mathematical model of a sequence of events: one event after another, unfolding through time. At any given point in time, the choice of event that happens next depends only on the event that is currently happening; there is no “memory” of previous events. For example, imagine a frog jumping among lily pads that has no memory of where it's been before. Its choice of the next lily pad it will jump to is influenced only by its current position: it can only jump to nearby pads, but it might very well jump right back to the one it just came from. The sequence of lily pads the frog visits could be modeled by a Markov chain. (Technically, this memorylessness only applies to “stationary Markov chains”; you can read more about Markov chains on Wikipedia.)

Here's a picture of a Markov chain:

This chain has three “states” — Rainy, Snowy, and Sunny — that correspond to events we might observe: a rainy day, a snowy day, or a sunny day. Between each pair of states is a transition: an arrow from one state to another (or to the same state), labeled with the probability of moving between those states. So, for example, this Markov chain says that if it's sunny on some particular day, there's a 75% chance it'll be sunny the next day too, a 15% chance it'll be rainy the next day instead, and a 10% chance it'll be snowy. Notice that the probabilities on the transitions out of each state all sum to 1 (or 100%).

Clearly a Markov chain isn't a perfect representation of the world; there's a lot more nuance to the weather than just fixed probabilities like in the picture above. But simplified mathematical models like these can still be useful (or, in our case, goofy and fun).

We can construct Markov chains by hand (for example, by looking at a picture of all the lily pads in the frog's pond), but we can also derive them directly from observations of sequences of states (for example, by counting how many snowy days follow sunny days in a whole year). This is called “training” a Markov chain.

In this assignment, you'll train a Markov chain on a body of text chosen by your user. The states will be words, and the transitions will correspond to probabilities of having a given word followed by some other word. You'll compute the transitions from the source text. Then you'll generate new sentences by hopping through the Markov chain (this is a technique called Markov chain Monte Carlo). The result is a set of nonsensical but reasonably well-structured sentences that carry some flavor of the source text.

In a file called random-sentences.py, create an interactive program that asks the user for a filename, trains a Markov chain on the text contained in the file, and then prints randomly-generated sentences from this chain until the user asks to quit.

An example run of the program would look like this:

> python random-sentences.py 
Please enter a filename: alice_test.txt
Press <Enter> to generate a random sentence; type "quit" to quit. 
  I think me the bank, and pictures or a mouse, you know.
Press <Enter> to generate a random sentence; type "quit" to quit. 
  Australia?
Press <Enter> to generate a random sentence; type "quit" to quit. 
  Dinah, tell me see: that she jumped up and picking the air!
Press <Enter> to generate a random sentence; type "quit" to quit. exit
  [Sorry, I didn't understand that.]
Press <Enter> to generate a random sentence; type "quit" to quit. quit
  [Okay, bye!]

Divide your program up into several functions, each of which has a single task to accomplish. Follow the design guidelines demonstrated in the previous homework assignment:

Plan out the structure of your program in advance, as best as you can. Think through how to break down the one big problem you have (write the program) into a bunch of separate subproblems (write just a function to get input from the user, for example). Then make empty functions for each of these subproblems.
Functions should be separated into those that interact with the user (receiving typed input or printing output to the screen), those that do sophisticated computations, and those that tie together the other pieces.
Provide a docstring for every function that you write. In the docstring, specify the function's inputs (arguments) and output (return value).
Put all of the logic of your program in functions. The top-level code of your program (the stuff that isn't indented at all) should consist entirely of function definitions and basic commands to call a top-level function (usually called main()).
As you develop a sophisticated program, always try to keep it in a state where it runs (maybe incompletely) without producing errors. Add one feature at a time, always bringing your code back to a non-error-producing state before you try to add a new feature.
Mark unimplemented features with a “TODO” comment, and delete the comment once you've implemented the feature.

Detailed Description

Representing the Markov Chain

The states in your Markov chain will be words. Use a dictionary to represent the Markov chain, with each unique word as a separate key. The value associated with each key represents the full set of transitions away from that word. The value is a list of all the words that followed the key in the source text. For example, say our source text is:

The quick brown fox jumps over the brown log and trips over the brown log and lands on its face.

Then here is a selection of key-value pairs for the dictionary representing the Markov chain trained on this text:

chain['The'] = ['quick']
chain['quick'] = ['brown']
chain['brown'] = ['fox', 'log', 'log']
chain['fox'] = ['jumps']
chain['and'] = ['trips', 'lands']
chain['over'] = ['the', 'the']
chain['the'] = ['brown', 'brown']
chain['on'] = ['its']
chain['its'] = ['face.']

This is only a selection of the key-value pairs for this Markov chain; as stated above, this dictionary would actually have one key for each unique word in the sentence (except the last word, if that one is unique, since it has no successors).

Notice that we don't explicitly represent transition probabilities here. Instead, to transition out of a state, we will just pick a random entry from the list of possible next states; states may be repeated in this list (like “log” in the list for “brown”) and thereby be more likely to be chosen.

Notice also that the value associated with each key is always a list, even if that list contains only one item. This is important to make your code clean.

Training the Markov Chain

The states in your Markov chain will be all the “words” in the file. A “word”, for our purposes, is just a sequence of non-whitespace characters, separated from other words by one or more whitespace characters. (Whitespace characters are spaces, newlines, tabs, etc.) Conveniently enough, there is a string method that gives you a list of all the words (as we just defined them) in a given string. You may have to do some additional work to make sure that your list of words doesn't contain any empty strings; sometimes those can crop up.

(Notice that in the example dictionary above, capitalization and punctuation were included in words. So "The" and "the" are considered different words, as would be "face." and "face" and "Face". This is intentional; it adds to the quality of the sentences you generate. So make sure that you don't remove capitalization or strip punctuation!)

So, once you've got this list of words, step through it one-by-one and build the transition lists. When adding a transition (say, from word A to word B) to the dictionary, you must first check if the key (A) is already in the dictionary. If so, you simply append B to the list already defined for that key. If not, though, you need to add the key to the dictionary, and associate it with a new list. (What should the contents of that list be?)

You'll probably need to refer to the documentation of Python dictionary methods for this task. Remember that the method has_key() is defined for dictionaries (so is the operator in, which does the same thing); this is precisely what you need in order to decide what to do.

Choosing Starting Places

Before we can start bouncing around the Markov chain, we need to know where to start. This start state will be the first word of our sentence. Naturally, we need to choose one of the states whose word is capitalized. To make this choice easy, you will first generate a list of all the keys that begin with a capital letter. We'll call these “start words”.

The following code snippet prints out all the keys in the dictionary called “chain”; you may find it helpful to construct a similar for-loop:

for key in chain.iterkeys():
    print key

Generating Sentences

To generate a sentence from the Markov chain, we first pick a random start word, then pick the next word by choosing randomly from the list associated with that start word. Then we pick the next word after that by choosing from the list associated with the current word. We keep doing this until we reach an “end word” — one whose last character is ., ?, or !. At this point, we can construct a single string by joining together, in order, all the states that we visited.

Note that the example text files I gave you have all the quotation marks, etc. stripped from them, so that this simple rule for identifying end words will work. If you'd like to use other text samples (which you should! It's fun!), you'll either need to manually or automatically strip out the quotation marks, etc., or build a more sophisticated function for testing whether a given word is an end word.

Python has a module called random that you should use to make these random selections. Check out the documentation of the random module. In particular, the functions random.randrange() and random.sample() might be useful. Remember that you need to have the line import random at the top of your source code in order to have access to the functions defined in that module.

Suggestions

Here's how I would break up the problem:

main() — Runs the whole show.
getUserFile() — Gets the filename from the user and return the file's contents.
wordList() — Returns a list of strings: all the words in a text sample passed as input to the function.
trainMarkovChain() — Trains the Markov chain on a given list of words, and returns the dictionary representing the Markov chain.
getStartWords() — Identifies all the start words among the keys of a given dictionary (and returns them as a list).
isEndWord() — Returns True or False, depending on whether a word ends with sentence-ending punctuation.
randomSentenceLoop() — Repeatedly asks the user whether to generate a new random sentence or quit, and prints out the sentence if requested.
makeRandomSentence() — Returns a string produced by a random walk through a given Markov chain.

Submit your program via Moodle.

Interactivity — 4 points
Markov-chain training — 7 points
Random sentence generation — 7 points
Comments, style, and design — 7 points

HW9: Random Sentence Generator

Goals

Setup and Requirements

Background