Implement the Earley algorithm. Your implementation should accept two text files and a string as input. The text files will contain a grammar and a lexicon, respectively. The string will be a sentence to be parsed. For each valid interpretation of the sentence in the given grammar, your program will print out a parse tree, and the sentence tagged by part of speech (e.g. "They [n] study [vt] fish [n] in [prep] cans [n]").
Use the handout I gave you in class as the basis for your algorithm. Note that the handout tells you how to build the parse chart, but does not tell you how to extract parse trees from the chart--that job will be for you.
The grammar file will be a text file each of whose lines is either a comment (starting with a # sign), a blank line, or a rule. For example, the following would be a simple grammar file:
# My little grammar
S : NP VP
# Noun phrases
NP : det n
NP : n
# Verb phrases
VP : v
VP: v NP
I have included some quirkly formatting here (inconsistent spacing before the colon, extra spaces in some places). This is to emphasize that whitespace acts as a delimiter between symbols, but should be ignored otherwise. Also, non-terminal constituents are all caps and lexical constituents are all lower case.
The lexicon file should have the same format, like so:
# My li'l lexicon
v : run
v : study
n : fish
n:cans
n : arrow
[etc.]
If you are careful in your design, you should be able to write a common set of functions to read both the lexicon file and the grammar file.
Printing out the parse tree is tricky in text, but there are ways to do it. Here, for example, is one approach:
S
NP
n - I
VP
vt - like
NP
n - candy
Go ahead and be creative. Sometimes a clever use of brackets or line spaces can greatly enhance the readability of your output.
Have fun, start early, and keep in touch.