CS 322: Natural Language Processing

Part-of-speech tagging

Due 5:00PM Monday, November 22. Hand in your code, data, and report to your Courses hand-in folder.

You may work with a partner on this assignment.

For this assignment, you will implement and test a Hidden Markov Model-based part-of-speech tagger, as discussed in class and in Section 5.5 of the textbook. Follow this outline:

  1. Familiarize yourself with the Penn Treebank Project's part-of-speech tag set. See p. 131 of your textbook. If you want a more detailed discussion of this tag set, visit the Penn Treebank Project site and grab the detailed description of the tag set.
  2. Choose some tagged data available online to train a Hidden Markov Model for part-of-speech tagging. You can find likely candidates by looking at the Brown Corpus links at the NLTK website. Set aside some of the data for testing your model, and do not train on that portion.
  3. Run your HMM as a part-of-speech tagger on your test data.
  4. Put your report in a readme.txt file. Your report should include:
    • A description of how you initialized your HMM.
    • A brief description of the code you used to build your part-of-speech tagger and how to use it.
    • A list of your test sentences, showing their correct taggings and the taggings produced by your tagger.
    • The accuracy of your tagger. That is, the ratio of the number of words your tagger tagged correctly to the total number of words in your test set.
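
As an illustration of the accuracy figure asked for in the report, here is a minimal Python sketch. It assumes each sentence is stored as a list of (word, tag) pairs; the function name tagging_accuracy and the toy sentences are made up for this example and are not part of the assignment.

```python
def tagging_accuracy(gold_sents, predicted_sents):
    """Return the fraction of words whose predicted tag matches the gold tag.

    Both arguments are lists of sentences; each sentence is a list of
    (word, tag) pairs. This data layout is an assumption of the sketch.
    """
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        if len(gold) != len(predicted):
            raise ValueError("gold and predicted sentences differ in length")
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            total += 1
            if gold_tag == predicted_tag:
                correct += 1
    # Guard against an empty test set.
    return correct / total if total else 0.0


# Toy data: five words, one of which is tagged incorrectly.
gold = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
        [("dogs", "NNS"), ("bark", "VBP")]]
predicted = [[("the", "DT"), ("dog", "NN"), ("barks", "NNS")],
             [("dogs", "NNS"), ("bark", "VBP")]]

accuracy = tagging_accuracy(gold, predicted)
print(f"accuracy = {accuracy:.2f}")
```

Note that the ratio is computed over words, not sentences: a sentence with one wrong tag still contributes its correctly tagged words to the numerator.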

You might find this spreadsheet HMM useful, or maybe not.
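
For orientation only, the training and tagging steps in items 2 and 3 might be sketched in Python as below. This is one possible arrangement, not a required design. It assumes a bigram HMM, add-one smoothing (chosen here only to keep the example short), and a three-sentence toy corpus standing in for real tagged data. All names (train_hmm, log_prob, viterbi, START, END) are hypothetical.

```python
import math
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # pseudo-tags marking sentence boundaries


def train_hmm(tagged_sents):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)  # transitions[prev_tag][tag]
    emissions = defaultdict(Counter)    # emissions[tag][word]
    vocab = set()
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            transitions[prev][tag] += 1
            emissions[tag][word.lower()] += 1
            vocab.add(word.lower())
            prev = tag
        transitions[prev][END] += 1
    return transitions, emissions, vocab


def log_prob(counts, key, n_outcomes, k=1.0):
    """Add-k smoothed log probability of key under a Counter of counts."""
    return math.log((counts[key] + k) / (sum(counts.values()) + k * n_outcomes))


def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence as a list of (word, tag) pairs."""
    if not words:
        return []
    tags = sorted(emissions)
    n_next = len(tags) + 1    # possible next states: every tag plus END
    n_words = len(vocab) + 1  # every known word plus one unknown-word slot

    def trans(prev, tag):
        return log_prob(transitions[prev], tag, n_next)

    def emit(tag, word):
        return log_prob(emissions[tag], word.lower(), n_words)

    # Each column maps tag -> (best log probability so far, backpointer).
    best = [{t: (trans(START, t) + emit(t, words[0]), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + trans(p, t), p) for p in tags)
            column[t] = (score + emit(t, word), prev)
        best.append(column)

    # Choose the best final tag, including the transition to END,
    # then follow the backpointers to recover the whole path.
    last = max(tags, key=lambda t: best[-1][t][0] + trans(t, END))
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return list(zip(words, path))


# Toy training data; a real run would use held-in sentences from a corpus.
train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]
transitions, emissions, vocab = train_hmm(train)
tagged = viterbi(["the", "cat", "barks"], transitions, emissions, vocab)
print(tagged)
```

Working in log probabilities avoids numerical underflow on long sentences. The smoothing method and the handling of unknown words are design choices you should make and describe yourself in your report.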