COS 371: Programming Languages

Spring 2022

proj4: Parser

Due: Monday, 04/11 at 22:00

1. Goals

To implement a parser for Scheme in C.

2. Introduction

Before tokenized Scheme code can be evaluated, the tokens need to be parsed into a syntax tree. This is harder said than done: i.e., despite the lengthy description below, this second phase is actually much easier than the first phase! (You must implement the parser from scratch: don't use Yacc or Bison or any similar tool, even if you happen to know what they are and how to use them.)

3. Tasks

  1. Create a header file parser.h (with #include guard, as always) with the following two function prototypes:
    Value *parse(Value *tokens);
    void printTree(Value *tree);
    
    The parse function should take a linked list of tokens from a Scheme program (as in the output of your tokenizer), and it should return a pointer to a parse tree representing that program. The printTree function should display a parse tree to the screen, using parentheses to denote tree structure. (In other words, the output of printTree should visually look identical to legal Scheme code.)
  2. Implement parse and printTree in parser.c. As usual, you may want to write additional helper functions, but they should not be usable/visible outside of parser.c and thus they should not appear in the header file.
  3. Create main_parse.c that, after including appropriate header files, simply does this:
    int main(void) {
       Value *list = tokenize();
       Value *tree = parse(list);
       printTree(tree);
       printf("\n");
       tfree();
       return 0;
    }
    
  4. Modify your Makefile to compile your new parser code. When I'm in your directory and I execute the command make parser, I should get an executable file called parser in that directory.
  5. Create a collection of at least 4 pairs of test files. (More is better! The sooner you get your parser totally right, the sooner you will get the rest of your project right.) Each pair of files should be named test.parser.input.XX and test.parser.output.XX, where XX is a test number. It should be the case that if I run the following command then I get no output:
    ./parser < test.parser.input.XX | diff - test.parser.output.XX
    
    (Your code may be tested against my test files or other teams' test files.)

4. Specification

Here is the spec for printTree(parse(tokenize())): read Scheme code from standard input and write to standard output a sequence of parse trees of that Scheme code, or indicate syntax error.

For example, here is a syntactically correct Scheme program:

(define x (quote a))
(define y x)
(car y)

An S-expression is either

  1. an atom (boolean, integer, float, string, or symbol) or
  2. a sequence of zero or more S-expressions enclosed in a pair of parentheses.
Note that a Scheme program itself, then, is not a single S-expression; it is a sequence of S-expressions.

Here is the grammar:

<program>  ->  <S-expr>*
<S-expr>   ->  <atom> | "(" <S-expr>* ")"
<atom>     ->  boolean | integer | float | string | symbol

Your parse function should therefore return a list, using the linked list structure that we have been using, of parse trees. A parse tree, in turn, uses the linked list structure again to represent a tree.

For example, the expression (define x (quote a)) would become the following parse tree, where each box is a Value:

parse tree

The above parse tree structure would be the first item in a 3-item linked list that represents the above Scheme program.

Do not include parentheses in your parse tree. The purpose of the parentheses is to denote the structure of the tree, so once we have an actual tree we don't need them anymore: they'll only get in the way. (Technically, this means that you're creating an abstract syntax tree, not a parse tree, since not all of the input is preserved.)

Parsing languages can be pretty complicated. The good news is that parsing Scheme is really easy, at least in a relative sense. Since Scheme code really is just a tree, you just need to be able to read it as such and convert it into a tree in memory. You can do this with a stack of tokens and shift-reduce parsing. In fact, it is so simple that I don't want to show you the outline to spoil the fun.

(Spoiler alert) But you can click here to see it.

So you'll need a stack... your linked list is a fine candidate for it.

You'll also need to write a printTree function. The output format for a parse tree of valid Scheme code is very simple: it looks exactly like Scheme code! To output an internal node in the parse tree, output ( followed by the outputs of the children of that internal node from left to right followed by ). To output a leaf node (a non-parenthetical token), just output its value as you did in the tokenizer, sans type.

Syntax errors

As with the tokenizer, your parser should never crash with a segmentation fault, bus error, or the like. Here are the different types of syntax errors you should handle:

  1. If the input code ever has too many close parentheses (in other words, if parse ever encounters a ) that doesn't match a previous (), print Syntax error: too many close parentheses.
  2. If the input code has too few close parentheses (in other words, if parse reaches the end of the list of tokens while waiting for a )), print Syntax error: not enough close parentheses.

If you ever encounter a syntax error, your program should exit after printing the error: don't do any more parsing. This is why we wrote the function texit. Also feel free to print Scheme code around the error to be helpful if you can, though this is optional.

Most production parsers continue to try to parse after an error to provide additional error messages. Your efforts here may help you to gain some sympathy as to why all error messages after the first one are questionable!

Sample output

For example, here are a few sample runs of my code:
$ cat test1
(if 2 3)
(+ ))

$ ./parser < test1
Syntax error: too many close parentheses

$ cat test2
(define map
  (lambda (f lst)
    (if (null? lst)
        lst
        (cons (f (car lst))
              (map f (cdr lst))))))

$ ./parser < test2
(define map (lambda (f lst) (if (null? lst) lst (cons (f (car lst)) (map f (cdr lst))))))

$ cat test3
1 2 (3)

$ ./parser < test3
1 2 (3)

In your first attempt, you may find that it is easier to produce output similar to the above but with extraneous white space. For me, I had to hack some extra code to make sure that the last item of a list didn't have a space between it and the closing parenthesis that followed it. After you get your parser working properly, please go back and make your output match the above. Your final product will look a lot more "real" if you put this effort in.

Sample code

Here is a rough approximation of a working parse function. My addToParseTree function takes in a pre-existing tree, a token to add to it, and a pointer to an integer depth, which is updated to represent the number of unclosed open parentheses in the parse tree. The parse tree is complete if (and only if) depth == 0 when the parse function returns.

Value *parse(Value *tokens) {
   Value *tree = makeNull();
   int depth = 0;

   Value *current = tokens;
   assert(current != NULL && "Error (parse): null pointer");
   while (current->type != NULL_TYPE) {
      Value *token = car(current);
      tree = addToParseTree(tree, &depth, token);
      current = cdr(current);
   }
   if (depth != 0) {
      syntaxError();
   }
}

5. Submission

Your code should have no memory leaks or memory errors when run on any input (correct or incorrect) using Valgrind. We will be checking this during grading, so you should test your code via Valgrind yourself.

Follow the instructions from the previous assignment. In particular:


This assignment was originally created by David Liben-Nowell and has since been updated by Dave Musicant, Laura Effinger-Dean, and Jed Yang.

Start early, have fun, and discuss questions on Moodle.