CS257 Assignment: Profiling

Due Wednesday, 4/21/04. Submit on paper at class time.

You may work with a partner on this assignment, if you wish.

The notorious LZW algorithm

In the June 1984 issue of IEEE's Computer magazine, Terry Welch of Unisys published an article entitled A Technique for High-Performance Data Compression. Welch's article described a simple way to implement the well-known Lempel-Ziv compression algorithms, developed during the seventies (see J. Ziv and A. Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, May 1977). By 1987, Unisys obtained patents on Welch's work in the US, Canada, the UK, and Japan.

The existence of these patents was not widely known among software developers. Indeed, it is likely that anybody familiar with the work of Lempel and Ziv would have considered Welch's paper to be a good piece of work, but also a very natural implementation-oriented extension the LZ ideas--the kind of thing any smart programmer would have come up with after a careful study of LZ. Like Quicksort and AVL trees, the LZW algorithm quickly took its place as a standard algorithmic tool. LZW was incorporated into the GIF and TIFF image format standards, and in the October 1989 issue of Dr. Dobb's Journal, Mark Nelson published a practical article entitled LZW Data Compression, complete with a clean and easy-to-understand implementation in C.

But in 1994, Unisys started to enforce its patents. Rather than repeat the story here, I will point you to The GIF Controversy: A Software Developer's Perspective, by Michael C. Battilana, in case you are interested in learning more. You might also find the Unisys policy on LZW interesting to read.

All of this left many software developers confused about how to use compression legally, without license fees, in their software. Between LZW-derived algorithms (of which there are many) and GPL'ed algorithms, I have been unable to find an algorithm I could use with confidence in my own commercial software. Until now, that is. Last summer, the US patent on LZW expired, and this coming summer, the patents in Canada, the UK, and Japan will also expire. So I have begun experimenting with Nelson's code.

Your job

A couple years ago, I made a few modifications to Nelson's code to get it to compile with g++. This week, I added a main program to enable me to test LZW on a specific compression task. The sources are lzw.h, lzw.cpp, and lzw_main.cpp. Read the comment at the top of lzw_main.cpp, construct some appropriate test data (you'll want a few kilobytes, at least), compile the program, and give it a test run.

In his Dr. Dobb's article, Nelson says:

The code accompanying this article works. However, it was written with the goal of being illuminating, not efficient. Some parts of the code are relatively inefficient. For example, the variable length input and output routines are short and easy to understand, but require a lot of overhead. An LZW program using fixed length 12 bit codes could experience real improvements in speed just by recoding these two routines.

For this assignment, you will use the gprof profiler to find out where our LZW code is spending most of its time, and then try to make modifications to the code to make it run faster. Here's a list of steps for you to take.

  1. Use the gprof documentation to learn how to generate a profile of the LZW program.

  2. Generate one or more profiles of runs of the code, and identify the portions of code that are most deserving of attention if you wish to improve the speed of the program.

  3. Now, try to speed up the program. First, try speeding it up while keeping the flexibility of the variable-length output codes (that is, allowing the BITS constant to be set to various values at compile time).

  4. Next try making modifications on the assumption that BITS is 12. That is, the output codes will have length 1.5 bytes, making the shifting and masking of various quantities easier to do than when you use 10, 11, 13, or 14-bit output codes.

  5. Write a report about your experiences. Include the essential information from your original profiles and your post-modification profiles, show your modifications, and discuss the quality of your results. I would like the report to be no more than three pages long, including data, but I want a clear and reasonably thorough explanation of what you did, and how effective your efforts were. Tell me the story of your attempt to speed up Nelson's code.

Start early, have fun, and keep in touch.