CS 322: Natural Language Processing

Document Classification

Presentations Wednesday Oct 6, code and write-ups due Friday, Oct 8

Suppose you are given 50,000 text documents and charged with the task of separating them into three piles: office memos, legal opinions, and press releases. (Who wrote them? How did they get all mixed up? The hypothetical world contains so many mysteries! How about this scenario: hard-drive water damage has left a data recovery team with thousands of file fragments but no file name or path information.) This is a document classification problem.

You could hire a staff of temporary workers to go through the pile manually, but that would be slow and expensive, and wouldn't help you solve the next document classification problem somebody sends your way. How might you automate this process?

For this project, we'll use an approach based on n-gram language models. Roughly, here's how you should proceed.

  1. Choose some classes of text to use to develop your system.
  2. Get a bunch of data from each of your chosen classes. One easy source of some long documents is gutenberg.org. This would be a good choice if you decide, for example, that you want to develop a system to distinguish between Jane Austen, Mark Twain, and whoever translated Voltaire into English. But of course, the Internet is full of documents from bloggers, courts, census bureaus, scientific labs, cookbooks, etc.
  3. For each of your document classes, select some of the data for training and some of the data for testing. You might, for example, let Pride and Prejudice be the training set for the Jane Austen language model, and let the pages or paragraphs of Sense and Sensibility go into the test set. Or put half of each book into training and the other halves into testing. (Why might you choose the latter?)
  4. Obtain the SRI Language Modeling Toolkit and get to know it (especially the ngram-count utility).
  5. Decide whether to clean your data. Should you separate the data into one sentence per line? Should you tokenize it so "line?" at the end of the previous sentence becomes two separate tokens ("line" and "?")? Should you make it all lower case? If yes to any of these, then either find a utility that will do it for you, or write one yourself. If no, don't. Either way, provide justification for your choice.
  6. For each training set (cleaned or not), use the SRI Toolkit to create a language model file. You'll need to choose a smoothing method.
  7. For each item in the test set (e.g. a collection of paragraphs from Sense and Sensibility, Life on the Mississippi, and Candide), compute the log probability of the test item against each of your language models. Use this information to classify the test item.
  8. Collect data on the effectiveness of your classification system.
  9. Put everything (code, test results, rationale, discussion of results, etc.) into a report, and prepare to summarize your experiences in class. (You'll have about five minutes to discuss your choices and results.)
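For the split in step 3, one simple approach is to divide a document's paragraphs into alternating halves, so both sets sample the whole book rather than just its beginning or end. A minimal sketch (the paragraph-per-blank-line convention is an assumption about your input format):

```python
def split_paragraphs(text):
    """Split blank-line-separated paragraphs into alternating halves.

    Interleaving (rather than cutting the book at one point) spreads
    both halves across the whole text, so vocabulary drift between
    early and late chapters affects training and testing equally.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    train = paragraphs[0::2]   # even-indexed paragraphs -> training
    test = paragraphs[1::2]    # odd-indexed paragraphs  -> testing
    return train, test
```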
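If you decide to clean your data in step 5, a minimal normalizer might lowercase each line and split punctuation off into standalone tokens. This is only a sketch of one possible convention, not a requirement of any toolkit:

```python
import re

def clean_line(line):
    """Lowercase a line and separate punctuation into its own tokens.

    For example, 'line?' becomes the two tokens 'line' and '?',
    matching the whitespace-delimited, one-sentence-per-line format
    that n-gram tools typically consume.
    """
    line = line.lower()
    # Put whitespace around any run of non-alphanumeric, non-space chars.
    line = re.sub(r"([^\w\s]+)", r" \1 ", line)
    return " ".join(line.split())
```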
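The decision rule in step 7 is simple once each model has scored the test item: pick the class whose language model assigns the highest log probability. A sketch, assuming you have already extracted one log probability per model (the class names below are placeholders):

```python
def classify(logprobs):
    """Return the class whose model best predicts the test item.

    `logprobs` maps class name -> log probability of the item under
    that class's language model (a negative number; closer to zero
    means the model found the text more predictable). Comparing raw
    log probabilities is only fair when the models have comparable
    vocabularies; per-word normalization (perplexity) is a common
    alternative worth considering.
    """
    return max(logprobs, key=logprobs.get)
```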
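For step 8, overall accuracy plus a confusion matrix are the usual summaries: the matrix shows *which* classes your system confuses, not just how often it errs. A sketch, assuming you have collected (true label, predicted label) pairs from your test runs:

```python
from collections import Counter

def evaluate(pairs):
    """Compute accuracy and a confusion matrix from
    (true_label, predicted_label) pairs.

    The confusion matrix counts how often items of each true class
    were assigned to each predicted class.
    """
    confusion = Counter(pairs)            # (true, predicted) -> count
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    total = sum(confusion.values())
    accuracy = correct / total if total else 0.0
    return accuracy, confusion
```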

Have fun!