CS 251: Parser

This is an individual assignment. Since the whole interpreter project is otherwise done in teams, I do want to ensure that you can do a piece of it on your own. This does build on previous code that you have written as a team. You are welcome to use that team code that you have written, or you can alternatively use the binaries below that I have provided. Likewise, when you move on to the next project (which will be team-based again), you may choose the solution that one of you has written, or you may use a binary solution that I will provide.

Implement a parser for Scheme, in C. (You must implement the parser from scratch - don't use Yacc or Bison or any similar tool, even if you know what it is and how to use it.)

Getting started

After you check out your blank repository, download this zip file and add the Makefile and the .h and .c files to your git project. The files you get include:

linkedlist.h: no changes from last time
talloc.h: no changes from last time
value.h: no changes from last time
tokenizer.h: no changes from last time
parser.h: The header file for the functions you'll need to write for this assignment. You'll also create a number of helper functions for that, but those don't need to appear in the header since they will be "private".
main.c: Short main function that drives the whole program.

At this point, you have a choice regarding how to proceed linkedlist.c, talloc.c, and tokenizer.c. If you want to continue to build your own interpreter that is truly yours, you can copy in these files from the last project, and move forward. Alternatively, if you would prefer to use my versions instead, you can do so. In the lib directory you'll see linkedlist.o, talloc.o, and tokenizer.o files. These are compiled binary versions of my linkedlist.c and talloc.c. If you would prefer to use one or many of these in your project instead of your own, you can replace them in the Makefile. Specifically, in the Makefile, on the line starting with SRCS, you can replace talloc.c with lib/talloc.o, etc. I suspect you are likely best off taking all of the files together, as opposed to mixing and matching yours with mine, but you may try if you like.

Note that compiled binaries are operating system specific (sometimes version specific). I have compiled these .o files to work with the course virtual machine.

To compile your code, issue the command "make" at the command prompt. (This will give an error if you haven't yet created the .c files you need, or have changed the Makefile to refer to the necessary .o files.)

To run your code, first create a file with some Scheme code in it, and give it a filename (e.g., input.txt). Then issue this command at a prompt:

      ./parser < input.txt

Your code should read data from standard input. In that case, the above command will redirect the text in the input.txt file into your program.

To run your code through valgrind, I have provided a shell script to set all the appropriate options. You can do it this way:

      ./valgrind.sh ./parser < input.txt

The assignment

You should create a parser.c file that implements the functions specified in parser.h. Particularly, you will need to write a function parse that uses the token list from the last assignment to construct a list of parse trees.

For example, here is a syntactically correct Scheme program:

      (define x (quote a))
      (define y x)
      (car y)

Any Scheme program consists of a list of S-expressions. An S-expression always has parentheses around it, unless it is an atom. Note that a Scheme program itself, then, is not a single-S-expression; it is a list of S-expressions.

Your parse function should therefore return a list, using the linked list structure that we have been using, of parse trees. A parse tree, in turn, uses the linked list structure again to represent a tree. For example, the expression (define x (quote a)) would become the following parse tree, where each box is a Value:

The above parse tree stucture would be the first item in a 3-item linked list that represents the above Scheme program.

Do not include parentheses in your parse tree. Their purpose is to tell you the structure of the tree, so once you have an actual tree you don't need them anymore - they'll only get in the way. (Technically, this means that you're creating an abstract syntax tree, not a parse tree, since not all of the input is preserved.)

Parsing languages can be pretty complicated. The good news is that parsing Scheme is really easy, at least in a relative sense. Since your Scheme code IS a tree, you just need to be able to read it as such and turn it into a tree in memory. You can do it with a stack of tokens:

Initialize an empty stack.
While there are more tokens:
- Get the next token.
- If the token is anything other than a close paren, push it onto the stack.
- If the token is a close paren, start popping items from your stack until you pop off an open paren (and form a list of them as you go). Push that list back on the stack.

So you'll need a stack... your linked list is a fine candidate for it.

You'll also need to write a printTree function. The output format for a parse tree of valid Scheme code is very simple: it looks exactly like Scheme code! To output an internal node in the parse tree, output ( followed by the outputs of the children of that internal node from left to right followed by ). To output a leaf node (a non-parenthetical token), just output its value as you did in the tokenizer, sans type.

Syntax errors

As with the tokenizer, your parser should never crash with a segmentation fault, bus error, or the like. Here are the different types of syntax errors you should handle:

If the input code ever has too many close parentheses (in other words, if you ever encounter a close paren that doesn't match a previous open paren), print Syntax error: too many close parentheses.
If the parse function returns an incomplete parse tree and the end of the input is reached and there are too few close parentheses, print Syntax error: not enough close parentheses.

If you ever encounter a syntax error, your program should exit after printing the error - don't do any more parsing. This is why we wrote the function texit. Also feel free to print Scheme code around the error to be helpful if you can, though this optional.

Most production parsers continue to try to parse after an error to provide additional error messages. Your efforts here may help you to gain some sympathy as to why all error messages after the first one are questionable!

Sample output

$ cat test1
(if 2 3)
(+ ))

$ cat test2
(define map
  (lambda (f L)
    (if (null? L)
        L
        (cons (f (car L))
              (map f (cdr L))))))

$ cat test3
1 2 (3)

$ ./parser < test1
Syntax error: too many close parentheses

$ ./parser < test2
(define map (lambda (f L) (if (null? L) L (cons (f (car L)) (map f (cdr L))))))

$ ./parser < test3
1 2 (3)

Formatting your output

You may find that it is easier to produce output similar to the above but with extraneous white space. For me, I had to hack some extra code to make sure that the last item of a list didn't have a space between it and the closing paren that followed it. You may have extraneous whitespace in your output if you wish, but I would optionally encourage you to make your output match the above. Your final product will look a lot more "real" if you put this effort in. You will not lose points for extra spaces, assuming that they aren't in the middle of strings or something like that.

Sample code

Here is a rough approximation of part of my parse function. My addToParseTree function takes in a pre-existing tree, a token to add to it, and a pointer to an integer depth. depth is updated to represent the number of unclosed open parentheses in the parse tree. The parse tree is complete if (and only if) depth == 0 when the parse function returns.


Value *parse(Value *tokens) {

    Value *tree = makeNull();
    int depth = 0;

    Value *current = tokens;
    assert(current != NULL && "Error (parse): null pointer");
    while (current->type != NULL_TYPE) {
        Value *token = car(current);
        tree = addToParseTree(tree,&depth,token);
        current = cdr(current);
    }
    if (depth != 0) {
        syntaxError(); // error case 4
    }
}

Memory errors

Your code should have no memory leaks or memory errors when run on any input (correct or incorrect) using valgrind. We will be checking this during grading.

What to submit

Push everything (c files, h files, Makefile, etc.) to your git repository.

You should also include in your submission a collection of at least 4 pairs of test files. (More is better!) Each pair of files should be named parser-test.input.XX and parser-test.output.XX, where XX is a test number. It should be the case that if I run the following command then I get no output:

./parser < parser-test.input.XX | diff parser-test.output.XX -

When we test your code for grading purposes, we will use our own set of test files which we will run using exactly the same command as described above. Test carefully.

Your code should have no memory leaks or memory errors when run on any input (correct or incorrect) using valgrind. We will be checking this as well during grading.

Remember, you can test your code in valgrind this way:

./valgrind.sh ./parser < parser-test.input.XX

Have fun parsing!

This assignment was originally created by David Liben-Nowell and has since been updated by Dave Musicant and Laura Effinger-Dean.