CS 322: Natural Language Processing
Document Classification
Suppose you are given 50,000 text documents and charged with the task of
separating them into three piles: office memos, legal opinions, and press
releases. (Who wrote them? How did they get all mixed up? The hypothetical
world contains so many mysteries! How about this scenario: hard-drive water
damage has left a data recovery team with thousands of file fragments but
no file name or path information.) This is a document classification
problem.
You could hire a staff of temporary workers to go through the pile
manually, but that would be slow and expensive, and wouldn't help you
solve the next document classification problem somebody sends your way.
How might you automate this process?
For this project, we'll use an approach based on n-gram language models.
Roughly, here's how you should proceed.
- Choose some classes of text to use to develop your system.
- Get a bunch of data from each of your chosen classes.
One easy source of some long documents is
gutenberg.org. This would be
a good choice if you decide, for example, that you want to develop a system to
distinguish between Jane Austen, Mark Twain, and whoever translated Voltaire
into English.
- For each of your document classes, select some of the data
for training and some of the data for testing. You might, for
example, let Pride and Prejudice be the training set for
the Jane Austen language model, and let the pages or paragraphs of
Sense and Sensibility go into the test set. Or put half of
each book into the training set and the other halves into the test set
(why might you choose the latter?)
- Obtain the
SRI Language Modeling Toolkit
and get to know it (especially the ngram-count utility).
- Decide whether to clean your data. Should you separate the data
into one sentence per line? Should you tokenize it so "line?" at
the end of the previous sentence becomes two separate tokens ("line" and "?")?
Should you make it all lower case? If yes to any of these, then either
find a utility that will do it for you, or write one yourself. If no, don't.
Either way, provide justification for your choice.
- For each training set (cleaned or not), use the SRI Toolkit to create a language
model file. You'll need to choose a smoothing method.
- For each item in the test set (e.g. a collection of paragraphs from
Sense and Sensibility, Life on the Mississippi, and
Candide), compute the log probability of the test item against
each of your language models. Use this information to classify the
test item.
- Collect data on the effectiveness of your classification system.
- Put everything (code, test results, rationale, discussion of results, etc.)
into a report, and prepare to summarize your experiences in class. (You'll
have about five minutes to discuss your choices and results.)
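To make the train/test split step concrete, here is one way to interleave a
document's paragraphs into the two sets (a minimal sketch; the function name
and the blank-line paragraph convention are my assumptions, not requirements):

```python
# Illustrative sketch of a paragraph-level train/test split.
# Assumes paragraphs are separated by blank lines.

def split_paragraphs(text, test_fraction=0.5):
    """Split a document's paragraphs into (train, test) lists.

    Every k-th paragraph goes to the test set, so both sets
    sample the whole document rather than just its two halves.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    step = max(1, round(1 / test_fraction))
    test = paragraphs[::step]
    train = [p for i, p in enumerate(paragraphs) if i % step != 0]
    return train, test
```

Interleaving (rather than taking the first half of the book for training)
keeps both sets stylistically representative of the whole text.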
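The cleaning step (lowercasing, splitting punctuation off words, one sentence
per line) might look roughly like the sketch below; a real tokenizer has to
handle abbreviations, quotations, hyphens, and many other cases this ignores:

```python
import re

def clean(text):
    """Rough cleaning sketch: lowercase, one sentence per line,
    punctuation split into separate tokens."""
    text = text.lower()
    # Break into sentences after sentence-final punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    lines = []
    for s in sentences:
        # A word is a run of word characters; anything else
        # (punctuation) becomes its own token.
        tokens = re.findall(r"\w+|[^\w\s]", s)
        lines.append(" ".join(tokens))
    return "\n".join(lines)
```

For example, `clean("Is this a line? Yes.")` turns "line?" into the two
tokens "line" and "?".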
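Assuming SRILM is installed and on your path, the training and scoring
invocations look roughly like the argument lists assembled below. The file
names are placeholders, and you should confirm the flags against the
ngram-count and ngram man pages for your version of the toolkit:

```python
# Sketch of the SRILM command lines for training and scoring.
# File names (train.txt, austen.lm, ...) are placeholders.

def train_command(train_file, lm_file, order=3):
    # Count n-grams up to the given order and write a model file,
    # using modified Kneser-Ney smoothing with interpolation.
    return ["ngram-count", "-order", str(order),
            "-kndiscount", "-interpolate",
            "-text", train_file, "-lm", lm_file]

def score_command(lm_file, test_file, order=3):
    # Report the log probability and perplexity of the test file
    # under the given model.
    return ["ngram", "-order", str(order),
            "-lm", lm_file, "-ppl", test_file]

# To actually run one: subprocess.run(train_command("train.txt",
# "austen.lm"), check=True)
```

The smoothing choice lives in the `-kndiscount -interpolate` flags; SRILM
supports several other discounting methods you could substitute here.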
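The classification rule itself (score the test item under each class's
language model, pick the class with the highest log probability) can be
illustrated with a toy add-one-smoothed bigram model. In the actual project
SRILM computes these scores for you; everything here is just a demonstration
of the decision rule:

```python
import math
from collections import Counter

class BigramModel:
    """Toy bigram model with add-one smoothing (illustration only)."""

    def __init__(self, tokens):
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        # Vocabulary size for smoothing, +1 slot for unseen words.
        self.vocab = len(self.unigrams) + 1

    def logprob(self, tokens):
        """Sum of smoothed log P(word | previous word)."""
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, word)] + 1
            den = self.unigrams[prev] + self.vocab
            total += math.log(num / den)
        return total

def classify(item_tokens, models):
    """Return the class name whose model assigns the highest log prob."""
    return max(models, key=lambda name: models[name].logprob(item_tokens))
```

Note that you compare log probabilities directly; there is no need to
exponentiate (which would underflow on any realistically long test item).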
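For collecting data on classifier effectiveness, accuracy plus a confusion
matrix is a reasonable minimum. A small sketch (the function name and the
pair-list input format are my own choices):

```python
from collections import Counter

def evaluate(pairs):
    """pairs: list of (true_class, predicted_class) tuples.

    Returns overall accuracy and a confusion matrix stored as a
    Counter keyed by (true, predicted).
    """
    confusion = Counter(pairs)
    correct = sum(n for (t, p), n in confusion.items() if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    return accuracy, confusion
```

The confusion matrix is worth reporting alongside accuracy: it shows which
pairs of classes your models confuse, which is exactly the kind of thing to
discuss in your report.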
I'll spend class time between now and the presentation day filling in
some missing background for you, answering questions, and trying to provide
general guidance.
Have fun!