CS 223 Assignment due at class time, 5/23/01

You may work with a partner on this assignment.

For this assignment, you will write a computer program to count bigrams, and to apply your bigram data to the problem of identifying the author of a given text passage.

Background

A bigram is an ordered pair of words found in some sort of input text. For example, the bigrams found in the first sentence on this page are "you may", "may work", "work with", etc.

Bigrams and their more general pals, n-grams, are used in many natural language processing contexts to provide information about how frequently a given pair of words will occur together in text. The usual strategy for using bigrams is to start with a large text file, called the training corpus, and count up the bigrams that appear in the corpus. Using these counts, you can estimate the conditional probabilities P( word2 | word1 ), that is, the probability of seeing word2 next, given that the previous word is word1.

Consider for example the following training corpus:

	humpty dumpty sat on a wall
	humpty dumpty had a great fall
	all the king's horses and all the king's men
	couldn't put humpty together again

This corpus gives us P( dumpty | humpty ) = 2/3, because "humpty" appears three times in the corpus, and "dumpty" is the word immediately following "humpty" two of those times.

Suppose you are given a sentence and asked to state its probability. Sentences like "together put humpty and and and" seem intuitively less likely than "the king's horses sat on humpty dumpty," but in what sense can that be said to be true?

Consider "the king's horses sat". We could say:

	P( the king's horses sat ) =
			P( the ) * P( king's | the ) * P( horses | king's ) * P( sat | horses )

Based on our tiny corpus, this probability would come out to be (2/26) * (2/2) * (1/2) * (0/1), that is to say, 0. That's not very helpful, but it will be common.

The assignment

First, your program should take two files as input (e.g. the text of Hamlet, and the complete poems of Emily Dickinson--see www.gutenberg.net for a supply of interesting text files), and count up the bigrams for each file separately.

Once your program is done counting, it should open a third file that contains example sentences. For each sentence, your program should print out the sentence and its two probabilities, one computed using (for example) the Shakespeare bigram data, and the other using the Dickinson bigram data. That is, your program will perform computations sufficient to choose which of the two authors is more likely to have written each example sentence.

Please have your program either prompt the user for the file names, or take the file names as command-line arguments.

Hand in your program using HSP.

Start early, have fun, and keep in touch.

Jeff Ondich, Department of Mathematics and Computer Science, Carleton College, Northfield, MN 55057, (507) 646-4364, jondich@carleton.edu