CS 251: Tokenizer

Implement a tokenizer for Racket, in C. (You must implement the tokenizer from scratch -- don't use Lex or any similar tool, even if you happen to know what it is and how to use it.)

Getting started

After you check out your blank repository, download this zip file and add the .h and .c files to your git project. The files you get include:

At this point, you have a choice regarding how to proceed for linkedlist.c and talloc.c. If you want to continue to build your own interpreter that is truly yours, you can copy in these files from the last project, and move forward. Alternatively, if you would prefer to use my versions instead, you can do so. In the lib directory you'll see linkedlist.o and talloc.o files. These are compiled binary versions of my linkedlist.c and talloc.c. If you would prefer to use one or both of these in your project instead of your own, you can replace them in the Makefile. Specifically, in the Makefile, on the line starting with SRCS, you can replace talloc.c with lib/talloc.o, and/or replace linkedlist.c with lib/linkedlist.o. If you go this route, make sure to copy those .o files into a subdirectory called lib.

Note that compiled binaries are operating system specific (sometimes version specific). I have compiled these .o files to work with the course virtual machine.

To compile your code, issue the command "make" at the command prompt. (This will give an error if you haven't yet created the .c files you need, or have changed the Makefile to refer to the necessary .o files.)

To run your code, first create a file with some Racket code in it, and give it a filename (e.g., input.txt). Then issue this command at a prompt:

./tokenizer < input.txt

Your tokenizer code should read data from "stdin." In that case, the above command will redirect the text in the input.txt file into your program.

To run your code through valgrind, I have provided a shell script to set all the appropriate options. You can do it this way:

./valgrind.sh < input.txt

Assignment details

You should create a tokenizer.c file that implements the functions specified in tokenizer.h. More generally, your tokenizer should read Racket code from standard input and write to standard output a list of tokens accompanied by the type of each token. You must include the following token types:

     boolean, integer, float, string, symbol, open, close

(the last two are for ( and )). The output format is, per line, token:type. You should strip out comments as you tokenize, so that anything after a ; on a line is ignored. (You'll also have to think about how to handle whitespace -- sometimes it matters! sometimes it doesn't!)

For example, here's a sample run of my code:

$ cat token-test.input.1
(+ x2 ( + ( quote x ) "foo;; still a string" 323) ;; now it's a comment!

$ cat token-test.input.1 | ./tokenizer
(:open
+:symbol
x2:symbol
(:open
+:symbol
(:open
quote:symbol
x:symbol
):close
"foo;; still a string":string
323:integer
):close

Your tokenizer should handle bad input gracefully; your program should never crash with a segfault, bus error, or the like. If the input code is untokenizable, print an error message. You may print something generic such as "Syntax error: untokenizeable", or you can provide more detailed information in the error message depending on what the problem is. After encountering an error, your program should exit after printing the error - don't read any more input. This is why we wrote the function texit. Also feel free to print Racket code around the error to be helpful if you can, though this optional.

Syntax details

Racket is a complex language. The the official Racket syntax documentation is enough to make your head spin. In the interest of sanity, here is a simplified syntax for numbers that I expect your tokenizer to handle (adapted from here):

<number>   ::= <sign> <ureal> | <ureal>
<sign>     ::= + | -
<ureal>    ::= <uinteger> | <udecimal>
<uinteger> ::= <digit>+
<udecimal> ::= . <digit>+ | <digit>+ . <digit>*
<digit>    ::= 0 | 1 | ... | 9

Note that * is shorthand for zero or more repetitions of the symbol, and + is shorthand for one or more repetitions of the symbol. Tip: if you want to convert strings to numbers, you can use the functions atoi and atof in the standard library.

Similarly, you should recognize symbols (identifiers) with the following syntax (again adapted from here):

<identifier> ::= <initial> <subsequent>* | + | -
<initial>    ::= <letter> | ! | $ | % | & | * | / | : | < | = | > | ? | ~ | _ | ^
<subsequent> ::= <initial> | <digit> | . | + | -
<letter>     ::= a | b | ... | z | A | B | ... | Z
<digit>      ::= 0 | 1 | ... | 9 

This may be inconsistent with the behavior of Racket. If you wish to instead implement DrRacket's behavior in its entirety, that can be an optional extension at the end of the project.

Symbols are case-sensitive.

For string literals, you should be able to handle escaped characters including (at least) \n (newline), \t (tab), \\ (backslash), \' (single quote), and \" (double quote). You do not need to handle multi-line strings; any newlines in a string must be explicitly escaped with \n. We will not test your program with multiline strings.

You may also find Section 1.1 of this reference book on Scheme to be helpful. The dialect described here is a little different than Racket, but it's considerably more readable. The BNF above is considered to be the correct specification for this assignment.

Don't worry about block and datum comments or vectors, or for that matter any wacky syntax, unless you want to push this really far. Start from the basics and build up. By the way, if it helps, my code generates no tokens given input like "error (an unterminated string). Try to match the behavior of Racket whenever you're in doubt.

When testing your code, we will intend to test on reasonable programs that follow well-understood and documented Racket conventions. We may try some corner cases that will attempt to see if you have coded carefully, but we do not intend to find obscure details of the Racket language standard to see if you have carefully deciphered every bit of it.

Racket provides many different ways of enclosing lists (parentheses, square brackets, etc.) We will only test code that uses parentheses. None of our test code will use square brackets. You can write your tokenizer and later parts of the project to only work on parentheses. That said, of you wish your project to also work on square brackets, it isn't that hard of an extension. If you think you want to go that way, you'll need to make sure that you tokenize parentheses and brackets differently.

Some parting hints

I suppose it is theoretically possible to code this assignment up so that it produces precisely the right output, but without actually storing the data in a list. Don't do that; perhaps that might be easier to get results on this portion of the project, but it will leave you in a useless position to continue for the next portion.

There are many different ways of reading input files. I found it most useful to use the functions fgetc and occasionally ungetc. Look those up. They're certainly more useful than fscanf.

The heart of your code will be your tokenize function, which reads the file and returns a list of Values. Here's a rough sketch of what that function might look like:

Value *tokenize() {
    char charRead;
    Value *list = makeNull();
    charRead = fgetc(stdin);

    while (charRead != EOF) {

        if (charRead == .....) {
            ...
        } else if (charRead == ......) {
            ...
        } else {
            ...
        }
        charRead = fgetc(stdin);
    }

    Value *revList = reverse(list);
    return revList;
}

What to submit

Push all of your code into your git repository. Your submitted files should be exactly what you checked out, but also with a file named tokenizer.c. Your Makefile may have changed if you modified it to work with my binaries. If you did this, add a readme.txt file explaining what you did.

You should also include in your submission a collection of at least 4 pairs of test files. (More is better! The sooner you get your tokenizer totally right, the sooner you'll get the rest of your project right!) Each pair of files should be named tokenizer-test.input.XX and tokenizer-test.output.XX, where XX is a test number. It should be the case that if I run the following command then I get no output:

./tokenizer < tokenizer-test.input.XX | diff tokenizer-test.output.XX -

When we test your code for grading purposes, we will use our own set of test files which we will run using exactly the same command as described above. Test carefully.

Your code should have no memory leaks or memory errors when run on any input (correct or incorrect) using valgrind. We will be checking this as well during grading.

Remember, you can test your code in valgrind this way:

./valgrind.sh < tokenizer-test.input.XX

Have fun tokenizing!