CS257 Assignment: HTML Filter

Due Wednesday, 4/7/04. Submit via HSP.

Fans of well-designed standards are pleased that HTML has a worthy successor, XHTML (both are described in detail at the W3C web site). The argument in favor of XHTML over HTML centers on the many advantages you gain when you separate content from its presentation. A very simple example of the kind of separation we're talking about is the difference between the <i> tag and the <em> tag. Both indicate text that deserves visual emphasis, but the former explicitly defines the nature of that emphasis (italics), while the latter leaves the visual choice open (and allows it to be specified in a stylesheet). I'm not going to get into the details of the argument in favor of the XHTML way, but I do think that argument is quite persuasive, and so I am gradually moving my own web pages towards XHTML.

Your program

For this assignment, you will write a program that takes one or more HTML files as input and produces various kinds of output. Some of the program's features will help you transform your existing HTML into XHTML, while others are just handy tools for looking at either HTML or XHTML.

Your program should be able to do any appropriate combination of the following, controlled by command-line arguments.

There are, of course, many other tasks to be performed to automatically transform HTML into XHTML, and there exists a tool (HTML Tidy) to perform those tasks. However, the list above is sufficient for our purposes.

What to hand in

Other information

I want you to write this program in C, C++, or Java. If you are familiar with Perl, you may rightly consider it to be an appropriate language for this assignment. But part of what I want you to confront here is the task of recognizing patterns in a language that is not specifically designed for doing exactly that. We will discuss Perl later this term.
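To make the idea of hand-written pattern recognition concrete, here is a minimal sketch in Java that lists the tag names in a string by scanning it one character at a time, with no regular expressions. The class name TagScanner and the method tagNames are made up for this illustration and are not part of the assignment. The sketch deliberately ignores comments, attribute values that contain '<', and other real-world complications, so treat it as a starting point rather than a finished recognizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, named for this illustration only.
public class TagScanner {

    // Returns the lowercased names of the tags found in html, in order of
    // appearance. Closing tags are reported with a leading '/'.
    // Simplification: comments and quoted attribute values are not treated
    // specially, so a '<' inside them is scanned like any other '<'.
    public static List<String> tagNames(String html) {
        List<String> names = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            if (html.charAt(i) != '<') {
                i++;
                continue;
            }
            int j = i + 1;
            boolean closing = j < html.length() && html.charAt(j) == '/';
            if (closing) {
                j++;
            }
            int start = j;
            // A tag name is taken to be a run of letters and digits.
            while (j < html.length() && Character.isLetterOrDigit(html.charAt(j))) {
                j++;
            }
            // Accept the name only if it is non-empty and starts with a letter.
            if (j > start && Character.isLetter(html.charAt(start))) {
                String name = html.substring(start, j).toLowerCase(Locale.ROOT);
                names.add(closing ? "/" + name : name);
            }
            i = j; // j is always greater than i here, so the loop advances
        }
        return names;
    }

    public static void main(String[] args) {
        // Prints [p, i, /i, /p]
        System.out.println(tagNames("<P>Some <I>text</I> here.</P>"));
    }
}
```

Keeping the scanner in a function that takes a string and returns a list means it can be exercised on small inputs long before any file handling exists.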

As you design and write this program, I want you to focus on a couple of important ideas. The first is the careful design of the command-line interface. You will want your program to work effectively in Unix pipelines where appropriate, and to allow your users to combine features that can reasonably be combined. The second important idea is good testing. I have intentionally designed this project to consist of a collection of related features that can be implemented separately, but that might share code, too. This gives you an opportunity to do incremental development, and to develop tests that can be applied at any stage after the feature they are testing is complete.
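One possible shape for a pipeline-friendly program is sketched below in Java. Everything specific in it is an assumption made for illustration: the class name FilterSkeleton, the convention that arguments beginning with "--" are options and all others are file names, and the placeholder transform method, which simply returns its input unchanged. The sketch reads standard input when no files are named, writes results to standard output, and sends error messages to standard error so they do not pollute the next stage of a pipeline.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical skeleton, named for this illustration only.
public class FilterSkeleton {

    // Placeholder for a real feature. Because it maps text to text and
    // touches no files, it can be tested directly on small strings.
    public static String transform(String html, List<String> options) {
        return html; // identity; a real feature would rewrite the text here
    }

    // Assumed convention: arguments starting with "--" are options.
    public static List<String> optionsOf(String[] args) {
        List<String> options = new ArrayList<>();
        for (String arg : args) {
            if (arg.startsWith("--")) {
                options.add(arg);
            }
        }
        return options;
    }

    // Every argument that is not an option is treated as an input file name.
    public static List<String> filesOf(String[] args) {
        List<String> files = new ArrayList<>();
        for (String arg : args) {
            if (!arg.startsWith("--")) {
                files.add(arg);
            }
        }
        return files;
    }

    static String readAll(InputStream in) throws IOException {
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        List<String> options = optionsOf(args);
        List<String> files = filesOf(args);
        try {
            if (files.isEmpty()) {
                // No files named: act as a filter in a pipeline.
                System.out.print(transform(readAll(System.in), options));
            }
            for (String file : files) {
                byte[] bytes = Files.readAllBytes(Paths.get(file));
                String text = new String(bytes, StandardCharsets.UTF_8);
                System.out.print(transform(text, options));
            }
        } catch (IOException e) {
            // Errors go to stderr, and a non-zero exit status signals failure.
            System.err.println("FilterSkeleton: " + e.getMessage());
            System.exit(1);
        }
    }
}
```

With this layout, a command such as "cat page.html | java FilterSkeleton | less" and a command such as "java FilterSkeleton page.html > out.html" both work, and each new feature can be added and tested as a change to transform without revisiting the input and output code.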

Have fun.