Due 11:59PM Wednesday, 4/6/05. Submit via HSP.
Fans of well-designed standards are pleased that HTML has a worthy successor, XHTML (both are described in detail at the W3C web site). The argument in favor of XHTML over HTML centers on the many advantages you gain when you separate content from its presentation. A very simple example of the kind of separation we're talking about is the difference between the <i> tag and the <em> tag. Both indicate text that deserves visual emphasis, but the former explicitly defines the nature of that emphasis (italics), while the latter leaves the visual choice open (and allows it to be specified in a stylesheet). I'm not going to get into the details of the argument in favor of the XHTML way, but I do think that argument is quite persuasive, and so I am gradually moving my own web pages towards XHTML.
For this assignment, you will write a program that takes one or more HTML files as input, and produces various kinds of output. Some of the features of your program will assist you in transforming your existing HTML to XHTML, while some of the features are just handy tools for looking at either HTML or XHTML.
Your program should be able to perform any appropriate combination of the following operations, each controllable by command-line arguments.
Change <i> tags to <em> tags.
Change <b> tags to <strong> tags.
Change all tags to lower case.
Change <br> tags to <br />.
Detect and report on mismatched tags. For example, if there is an <h3> tag with no matching </h3>, your program should report the line number of the existing tag, along with an appropriate error message. Another problem is bad nesting, like this: <em><p>some text</em></p>.
Print the title of the document (which appears between <title> and </title> tags), followed by a list of all the headings that appear in the document. Each heading should appear on its own line, with <h1> headings flush left, <h2> headings indented by 4 spaces, <h3> headings indented by 8 spaces, and so on. The intent of this feature is to give its user a quick overview of the outline structure (if any) of the document.
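One common way to approach the mismatched-tag feature is to keep a stack of the tags that are currently open. What follows is only a minimal sketch in C under several assumptions, not a reference solution: the function name check_nesting, the message wording, the fixed limits, and the short list of elements treated as never needing a close tag are all hypothetical choices made for illustration. It also simplifies heavily: it does not skip the contents of comments, and it does not handle a '<' or '>' inside an attribute value.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 256   /* hypothetical limit on nesting depth */
#define MAX_NAME  32    /* hypothetical limit on tag-name length */

struct open_tag {
    char name[MAX_NAME];
    int line;
};

/* Elements that never take a closing tag (a partial, illustrative list). */
static int is_void_element(const char *name)
{
    static const char *const names[] = { "br", "hr", "img", "input", "link", "meta" };
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++)
        if (strcmp(name, names[i]) == 0)
            return 1;
    return 0;
}

/* Scan text, write one message per problem to report, return the problem count. */
int check_nesting(const char *text, FILE *report)
{
    struct open_tag stack[MAX_DEPTH];
    int depth = 0, line = 1, problems = 0;

    assert(text != NULL && report != NULL);

    for (const char *p = text; *p != '\0'; p++) {
        if (*p == '\n')
            line++;
        if (*p != '<')
            continue;

        int closing = (p[1] == '/');
        const char *q = p + 1 + closing;
        if (!isalpha((unsigned char)*q))
            continue;                       /* "<!...", "<?...", or a stray '<' */

        /* Copy the tag name, folded to lower case so <P> matches </p>. */
        char name[MAX_NAME];
        size_t n = 0;
        while (isalnum((unsigned char)*q) && n < MAX_NAME - 1)
            name[n++] = (char)tolower((unsigned char)*q++);
        name[n] = '\0';

        if (!closing) {
            const char *end = strchr(q, '>');
            if ((end != NULL && end[-1] == '/') || is_void_element(name))
                continue;                   /* <br />, <br>, <img ...>: nothing to match */
            if (depth == MAX_DEPTH) {
                fprintf(report, "line %d: nesting too deep, <%s> ignored\n", line, name);
                problems++;
                continue;
            }
            strcpy(stack[depth].name, name);
            stack[depth].line = line;
            depth++;
            continue;
        }

        /* Closing tag: find the nearest open tag with the same name. */
        int match = depth - 1;
        while (match >= 0 && strcmp(stack[match].name, name) != 0)
            match--;
        if (match < 0) {
            fprintf(report, "line %d: </%s> has no matching <%s>\n", line, name, name);
            problems++;
            continue;
        }
        /* Anything opened after the match was never closed: bad nesting. */
        while (depth - 1 > match) {
            depth--;
            fprintf(report, "line %d: <%s> is not closed before </%s> on line %d\n",
                    stack[depth].line, stack[depth].name, name, line);
            problems++;
        }
        depth--;                            /* pop the matching tag itself */
    }

    /* Whatever is still on the stack at end of input was never closed. */
    while (depth > 0) {
        depth--;
        fprintf(report, "line %d: <%s> has no matching </%s>\n",
                stack[depth].line, stack[depth].name, stack[depth].name);
        problems++;
    }
    return problems;
}
```

With this sketch, the bad-nesting example above produces two messages: one for the <p> that is still open when </em> arrives, and one for the </p> that is then left without a partner. Note that it treats every omitted close tag as an error, including cases such as an unclosed <p> that HTML permits; whether to report those is a design decision.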
There are, of course, many other tasks to be performed to automatically transform HTML into XHTML, and there exists a tool (HTML Tidy) to perform those tasks. However, the list above is sufficient for our purposes.
Hand in the following items.
A readme.txt file describing the command-line syntax of your program, the status of the program (what works, what doesn't, etc.), and the structure of your test cases.
Your testing structure. This may consist of only test data (and a description of how to perform and evaluate the tests, in the readme.txt), or it may include shell scripts or other programs that run the tests.
Your source code.
I want you to write this program in C, C++, or Java. If you are familiar with Perl, you may rightly consider it to be an appropriate language for this assignment. But part of what I want you to confront here is the task of recognizing patterns using a language that is not specifically designed for that purpose. We will discuss Perl later this term.
As you design and write this program, I want you to focus on a couple of important ideas. The first is the careful design of the command-line interface. You will want your program to work effectively in Unix pipelines where appropriate, and to allow your users to combine features that can be reasonably combined. The second important idea is good testing. I have intentionally designed this project to consist of a collection of related features that can be implemented separately, but that might share code, too. This gives you an opportunity to do incremental development, and to develop tests that can be applied at any stage after the feature they are testing is complete.
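To make the command-line idea concrete, here is one possible shape for the option handling, sketched in C. Everything specific in it is an assumption made for illustration: the assignment does not prescribe any flag letters, so -e, -s, -l, -b, -c, and -o are hypothetical, as are the struct options type and the parse_options function. The sketch lets single-letter flags be combined (-el), stops at the first file name, and leaves a lone "-" as an operand so that a program could treat it as standard input.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One field per feature; all flag letters below are hypothetical. */
struct options {
    int em;          /* -e: <i> to <em> */
    int strong;      /* -s: <b> to <strong> */
    int lower;       /* -l: tags to lower case */
    int br;          /* -b: <br> to <br /> */
    int check;       /* -c: report mismatched tags */
    int outline;     /* -o: print title and headings */
    int first_file;  /* index in argv of the first file operand (argc if none) */
};

/* Fill in *opts from argv. Returns 0 on success, -1 on an unknown flag. */
int parse_options(int argc, char **argv, struct options *opts)
{
    int i;

    assert(argv != NULL && opts != NULL);
    memset(opts, 0, sizeof *opts);

    for (i = 1; i < argc; i++) {
        const char *arg = argv[i];

        if (strcmp(arg, "--") == 0) {       /* explicit end of options */
            i++;
            break;
        }
        if (arg[0] != '-' || arg[1] == '\0')
            break;                          /* a file name, or "-" by itself */

        /* Allow flags to be bundled, as in -el. */
        for (const char *c = arg + 1; *c != '\0'; c++) {
            switch (*c) {
            case 'e': opts->em = 1;      break;
            case 's': opts->strong = 1;  break;
            case 'l': opts->lower = 1;   break;
            case 'b': opts->br = 1;      break;
            case 'c': opts->check = 1;   break;
            case 'o': opts->outline = 1; break;
            default:
                fprintf(stderr, "%s: unknown option -%c\n", argv[0], *c);
                return -1;
            }
        }
    }
    opts->first_file = i;
    return 0;
}
```

A program built this way could read standard input whenever first_file equals argc, which is what lets it sit in the middle of a pipeline. On Unix systems the POSIX getopt function is an alternative to hand-written parsing like this. Deciding which flags may be combined (for example, whether the reporting features should suppress the transformed output) is left open here on purpose.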
Have fun.