CS395 Assignment: Parser

CS395: Parser

Due: 1:10PM Friday, February 6

Your job for this assignment is to write a parser, based on Algorithm 4.7 on page 218. This parser will be designed to interact with your lexical analyzer, so when you're done, you should be able to feed a small Java fragment and a grammar (e.g. your boolean expression grammar) to your combination of programs, and receive a parse as output.

First steps

To get started, we need to adjust your lexical analyzer (possibly very slightly) to make it provide an appropriate interface for the parser, and to provide shared behavior to make it easier for us to discuss the assignment as a group.

Define your token types to be integers starting with 0, in this order: delimiter=0, operator=1, reserved word=2, identifier=3, character literal=4, string literal=5, integer=6, real number=7, end-of-input=8, unknown=9.

Some of you have fewer token types than this. If you have only a NUMBER type rather than separate INTEGER and REAL types, then make NUMBER=6. If you have no UNKNOWN type, then you'll never generate a token type of 9--no problem.
Similarly, number your delimiters starting at 0. Their order should be ; ( ) [ ] { }.
Operators should also be numbered from 0, in the order + - * / % = == < <= > >= != ++ -- && || . -> << >>.
Make your getnexttoken function return the token type, and place the token data in a global variable to which your parser has access.

Input: the grammar file format

For this assignment, the grammar file will consist of a list of productions, one per line. The goal of the file format is to make it easy to parse--we're trying to parse the token stream, and I don't want to get bogged down in a different parsing problem along the way. Thus, the format is going to be simple, low on features, and slightly ugly.

Here is an example. Consider our usual left-recursive grammar for arithmetic expressions.

E -> E + T
E -> T
T -> T * F
T -> F
F -> (E)
F -> id

We will represent this grammar like so:


E: E [2,0] T
E: T
T: T [2,2] F
T: F
F: [1,1] E [1,2]
F: [4,0]

What's going on here? First of all, nonterminals are all going to be single upper-case letters. (You may certainly make your code more flexible than that if you wish, but don't burn time on doing so until you have gotten the parsing completed.) Second, the colon is there as a nod to readability, and you may assume if you wish that it is always the second character in the line. Third, you should ignore whitespace.

Finally, terminals are represented using a "[type,subtype]" notation. In this example, [1,2] represents a delimiter (type=1) of subtype "open parenthesis" (subtype=2). For a type like "identifier" that doesn't have subtypes, just use 0 for the subtype ([4,0] in this example).

Again, I tried to design this to be easy to parse. The brackets notation is definitely a hassle to read, but it should be relatively easy for your program to parse. If you see ways to make this format easier to deal with, let me know.

Input: the token stream

Just call getnexttoken when you need to.

Output

For output, your program should print a reasonably readable list of productions, as generated by Algorithm 4.7.

That's all

Start early, have fun, and keep in touch.