Hacking Data Compression  Lesson 12  HardPressed(tm)
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1993 GEnie

After a brief six-month hiatus, I'm back with the final lesson.  This
week we'll be discussing a program that uses the techniques we've been
studying: HardPressed(tm).  Since there are files on the HardPressed
install disk that describe both HP and HP modules in great detail, I
have chosen to keep this lesson at a fairly high level.

-=O=-

Quick Review

- Differential: stores each byte as a difference from the previous
  one.  Sometimes used with adaptive Huffman on sound files.  Cf.
  ADPCM.

- RLE: Run Length Encoding.  "xyaaaaabbbb" becomes "xy5a4b" (more or
  less).

- Huffman encoding: builds a tree of output symbols by frequency, so
  that more frequent symbols get shorter codes.  Can be static,
  semi-adaptive, or adaptive.

- Splay Tree encoding: a rather simple encoding scheme that adapts
  rapidly to changing input.

- Arithmetic encoding: a theoretically optimal encoding method that
  stores the input as one large floating-point number.

- MTF: Move To Front.  Keeps a list of recently seen words, moving
  each to the front of the list when it appears.  Outputs the word's
  index, which is usually encoded with arithmetic encoding.

- Digram: computes frequencies on pairs or groups of letters,
  resulting in more accurate predictions.  A sophisticated form of
  this (context modeling) plus an arithmetic encoder represents the
  current state of the art in text compression.

- LZ77: Ziv-Lempel, 1977.  Input strings are matched against a
  sliding window, and output as <len, posn, char> triples.

- LZSS: improved LZ77 scheme that got rid of the extra character.

- LZH: improved LZSS algorithm that uses Huffman encoding on the
  output values to improve compression efficiency.  Variations of
  this are used in almost all general-purpose file archivers (Zip,
  Zoo, etc.).

- LZARI: like LZH, but with arithmetic encoding.

- LZ78: Ziv-Lempel, 1978.  Builds a tree with unique prefixes,
  outputting <prefix, char> for each match.

- LZW: improved LZ78 scheme that got rid of the extra character.

-=O=-

To Stream or Not to Stream

Almost all of the code presented during this course took its input
one character at a time and produced output one character at a time.
Except for stuff like semi-adaptive Huffman (which read its input
twice, once to generate the frequency table and once to do the
compression), all of the implementations read and wrote data
serially.  No block transfers, no backing up at the end.

This is commonly referred to as working on a "stream" of data, as
opposed to working on a "block" of data.  From an abstract point of
view, the data arrives and leaves in a continuous stream.  Some
programs (e.g. ShrinkIt) instead read in (say) 4K of data, compress
it, and output it as a distinct chunk before going on to the next
block.

I chose the "streaming" approach for a couple of reasons.  First,
most of the algorithms lend themselves to this sort of
implementation.  Some of them were designed to compress data as it
was written to a disk or over a network, where you can't hold onto
the data for long and you can't undo what you've done.

Second, it allows you to do some neat things without having to change
the core of the implementation.  For example, suppose you wanted to
"chain" RLE and LZW together.  As I mentioned a few lessons ago,
ShrinkIt does this to compress disk images efficiently.  With the
course approach, it's easy to hook them together: the input to the
LZW compressor can come straight from the RLE compressor.  It just
requires a little fiddling with the Getc() and Putc() routines, as
the sketch below shows.
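Here's a minimal sketch in C of what that fiddling might look like.
The names (rle_getc() and so on) are mine, invented for illustration;
they aren't from the course code or from HardPressed.  Stage one is a
real RLE packer that emits <count, byte> pairs (so "xyaaaaabbbb"
comes out as the bytes 1 'x' 1 'y' 5 'a' 4 'b'), and stage two is
just a stand-in for LZW that copies bytes through.  The whole trick
is that the second stage's "Getc" is the first stage's output:

/*
 * Minimal chaining sketch (invented names, not HP's or the course's).
 * Stage 1 is an RLE packer emitting <count, byte> pairs; stage 2
 * stands in for LZW and just copies bytes through.  Chaining is done
 * by pointing the second stage's input routine at the first stage.
 */
#include <stdio.h>

static FILE *src;               /* real input file */
static int saved = -1;          /* byte pushed back when a run ends */
static int out_byte = -1;       /* second half of a <count,byte> pair */

/* Stage 1: the LZW stand-in calls this instead of getc(). */
static int rle_getc(void)
{
    int c, next, count;

    if (out_byte >= 0) {        /* finish the pair started last call */
        c = out_byte;
        out_byte = -1;
        return c;
    }
    c = (saved >= 0) ? saved : getc(src);
    saved = -1;
    if (c == EOF)
        return EOF;
    count = 1;
    while ((next = getc(src)) == c && count < 255)
        count++;
    if (next != EOF)
        saved = next;           /* push back the run-breaking byte */
    out_byte = c;               /* emit the count now, the byte next call */
    return count;
}

/* Stage 2: a stand-in for the LZW encoder.  The only change needed
 * to chain it behind RLE is that its input routine is rle_getc(). */
static void lzw_encode(FILE *dst)
{
    int c;
    while ((c = rle_getc()) != EOF)
        putc(c, dst);           /* a real LZW stage would compress here */
}

int main(int argc, char **argv)
{
    src = (argc > 1) ? fopen(argv[1], "rb") : stdin;
    if (src == NULL)
        return 1;
    lzw_encode(stdout);
    return 0;
}

Notice that this pulls single bytes through a function call, which is
exactly the per-character overhead HardPressed goes out of its way to
avoid, as described below.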
By some strange twist of fate, HardPressed's compression has the same
basic design.

-=O=-

Modulus

HardPressed uses compression "modules", executable files that are
loaded when the system boots.  They are very much like the
compression code that I've written for this course, in that they
don't do any file manipulation themselves.  It's all handled through
calls to the main part of HardPressed, much as the course code used
Getc() and Putc() to receive and send data.

Unlike the course code, the modules are designed for speed, not
legibility.  The source code tends to be rather knotted, with lots of
shortcuts taken for efficiency's sake.  For this reason (and some
copyright stuff) I haven't included source code for any of the actual
modules used in HP.  You can find a few examples, along with a
reasonably complete description of the modules, on the HardPressed
disk (see the "Goodies" folder).

-=*=-

To a distant observer, the course code and a HardPressed module would
appear the same.  The RLE source code and the HardPressed RLE module
take a file as input, do more or less the same stuff to it, and
produce output.  The length of the file is irrelevant; each just
keeps sucking data in and spitting it out until it's all done.

On the inside, however, things are much different.  An important
issue that had to be considered is that dealing with input a
character at a time is slow.  The program has to do a function call
or JSR for every single character read or written.  To solve this,
all of the information is handled in blocks.

This complicates the program logic, because instead of just saying,
"give me the next character", the module has to check whether more
data is available, whether there's room left in the output buffer,
and so on.  A lot more is expected of the module, but by doing it all
inline instead of with procedure calls, a great deal of overhead is
shaved off.

The fun part is that it still APPEARS to be handling data in a
continuous stream.  The output file produced is, in most cases,
identical to that produced by the lessons in this course.
HardPressed maintains the abstraction of a stream while using a
block-oriented implementation for speed.

-=*=-

As I hinted at earlier, HardPressed modules can be chained together
in any order.  (The current implementation limits the chain length to
8 modules, but that limit is arbitrarily imposed.)

Chaining the modules together is actually almost trivial, given the
design of the system.  Because all data transfers are done in large
blocks, all that is necessary to chain modules together is to use one
module's output buffer as the next module's input buffer.  As long as
a module is able to read its data from anywhere in memory, there's no
need to copy the data between buffers (an activity that is
frightfully expensive on a IIgs).  A sketch of the idea closes this
section.

Fortunately, all of this switching between compressors is hidden from
the module.  The sole concerns of an HP module are compressing data
and managing its input and output buffers.  This allows module
authors to concentrate on manipulating data as quickly as possible,
without having to worry about where it's coming from or going to.

For gruesomely detailed information on the modules, see
"Writing.Modules" and "Module.Reference" in the Goodies folder of the
HardPressed disk.
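To show why chaining comes almost for free, here's a sketch of the
buffer-swapping idea.  The interface (encode_fn, run_chain) is one I
made up for illustration; HardPressed's real module interface is the
one described in the Goodies documents, not this one:

/*
 * Buffer-swapping sketch with an invented interface; see the Goodies
 * docs for HP's real module interface.  Each module consumes
 * in[0..in_len) and fills out[], returning the number of bytes it
 * produced.  Chaining is just pointer swapping: one module's output
 * buffer becomes the next module's input buffer, so no data is ever
 * copied between stages.
 */
#include <stddef.h>

typedef size_t (*encode_fn)(const unsigned char *in, size_t in_len,
                            unsigned char *out, size_t out_cap);

#define MAX_CHAIN 8             /* HardPressed's (arbitrary) limit */

/* Returns the final length; *result points into buf_a or buf_b,
 * whichever holds the last stage's output. */
size_t run_chain(encode_fn chain[], int nmods,
                 unsigned char *buf_a, unsigned char *buf_b,
                 size_t len, size_t cap, unsigned char **result)
{
    unsigned char *in = buf_a, *out = buf_b, *tmp;
    int i;

    for (i = 0; i < nmods && i < MAX_CHAIN; i++) {
        len = chain[i](in, len, out, cap);
        tmp = in;               /* swap: this stage's output buffer */
        in = out;               /* becomes the next stage's input   */
        out = tmp;
    }
    *result = in;
    return len;
}

The swap is the whole trick: since a module can read its input from
anywhere in memory, trading buffer roles between stages replaces what
would otherwise be a frightfully expensive copy.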
-=O=-

The Mote in GS/OS's Eye

HardPressed is (in my biased opinion) a fine example of an
application of data compression.  Until a few years ago, compression
usually came in the form of archiving programs like ShrinkIt, Zip, or
UNIX compress: things that you, Joe User, do to a file to make it
smaller.  When you want to access the file, you, Joe User, execute a
related command to uncompress it.

Now there are "automatic" compression utilities for most major
platforms.  Compressed files are expanded automatically when
accessed, and recompressed when they're no longer needed.
HardPressed lives between GS/OS and user applications, working
transparently as well as automatically.

When choosing what kind of compression to use for this sort of
endeavor, there are a number of factors that must be taken into
account:

- How fast is it?  Since this is going to be happening all the time,
  it's got to be fairly fast, or nobody will want to use it.

- How well does it compress?  No point in dealing with the hassle if
  the files don't get any smaller.

- How small is it?  Nobody wants to kiss 2MB of memory goodbye just
  to make their files smaller.

Suppose you had to choose from the compression techniques we've
studied.  Which ones would YOU choose?

(time's up)

The modules currently supported by HardPressed are:

- RLE.  This is easily the fastest of the bunch, and it does fairly
  well on things like applications and sparse files.

- LZW.  If you played with this for any length of time, you know that
  when you take compression time, expansion time, and amount of
  compression into consideration, this one is tough to beat.

- LZSS.  This one is deathly slow when compressing, but it can expand
  in half the time that LZW takes, because it's doing little more
  than copying strings out of the buffer.  Converting it to LZH or
  LZARI is impractical because of the speed loss.

- Differential & adaptive Huffman.  Alone, these are mostly
  worthless.  Together, they could provide an effective way to
  compress digitized sound files (losslessly).

Most of the rest are either too slow or merely comparable to the
above.  Stuff like semi-adaptive Huffman (which generates the
frequency table in one pass and does the compression in another)
can't be used, because you only get to look at the data once.

-=*=-

I bring this up to make a simple point: you have descriptions of, and
source code for, all the compression techniques needed for a
commercial product.

-=O=-

PROGRAM

Ain't none.  Well, there's one slightly silly sample, but that's it.

My first thought was to write a general library of routines, allowing
you to assemble it with (say) "rle.asm" to get a working HardPressed
module.  This turned out to be harder than it seemed, because there's
a lot of really unpleasant stuff that has to be done to make it work
correctly.  It wouldn't have been particularly enlightening to
examine.

My second thought was to just convert rle.asm and lzw.asm into
HardPressed modules.  Didn't seem all that hard.  However, it started
getting all tangled up, because the calling sequences are very
different.  For the course stuff, Encode is called once and is
expected to handle the whole file, calling Getc() whenever it wants
more data.  For HP modules, StartUp is called once, and then Encode
is called several times with blocks of data until the whole thing is
done.  Instead of calling DOWN for more data, Encode returns UP and
waits for more.  (A rough sketch of the difference appears at the end
of this section.)

Modifying the course stuff to work with HP would either require some
major logic shifts or some gritty code in the new Getc and Putc
routines.  It would've been easier to write it over again, but that
wouldn't show you anything that the sample code on the HP disks
doesn't.
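Just to make the shape of that calling sequence concrete, here's a
rough sketch with invented names (StartUp, Encode, last_block); the
real interface is the one documented in "Writing.Modules", not this
one.  The "compressor" is a trivial RLE packer whose run state
survives across calls, so a run can span block boundaries:

/*
 * Sketch of the HP-style calling sequence (invented names; see
 * Writing.Modules for the real interface).  StartUp() runs once,
 * then Encode() is called with one block at a time.  Instead of
 * calling down for more input, Encode() returns when its block is
 * used up and resumes from saved state on the next call -- here, an
 * RLE run that may span blocks.
 */
#include <stddef.h>

static int    run_c;            /* byte of the run in progress */
static size_t run_n;            /* length of the run so far    */

void StartUp(void)
{
    run_c = -1;                 /* no run in progress */
    run_n = 0;
}

/* Consume in[0..in_len), append <count, byte> pairs to out[], and
 * return the number of output bytes.  The caller guarantees room for
 * 2 * in_len + 2 bytes.  Pass last_block nonzero on the final call
 * so the pending run gets flushed. */
size_t Encode(const unsigned char *in, size_t in_len,
              unsigned char *out, int last_block)
{
    size_t i, o = 0;

    for (i = 0; i < in_len; i++) {
        if (in[i] == run_c && run_n < 255) {
            run_n++;            /* extend the current run */
            continue;
        }
        if (run_n > 0) {        /* run broken: emit <count, byte> */
            out[o++] = (unsigned char)run_n;
            out[o++] = (unsigned char)run_c;
        }
        run_c = in[i];
        run_n = 1;
    }
    if (last_block && run_n > 0) {
        out[o++] = (unsigned char)run_n;
        out[o++] = (unsigned char)run_c;
        run_n = 0;
    }
    return o;                   /* return up; wait for the next block */
}

A driver calls StartUp() once, then Encode() on each block as it
arrives, setting last_block on the final one.  That's the logic shift
that makes converting the course code so gritty.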
So, aside from a few rough sketches, there's nothing really useful
here, which is probably not a bad thing considering how huge some of
the previous lessons have been.

-=O=-

That's All, Folks...

Well, this is the end of the line.  Maybe this should be called
Gilligan's Course... instead of a three-hour tour that lasted years,
we had a 12-week course that took nine months.  :-)

Hope some of these lessons have been useful.  I'd like to think that
my explanations were more helpful than those in textbooks, because
*I* learned some of this stuff for the first time while writing the
course (splay trees?  what's that?), so I was very aware of the gaps
in other authors' explanations.

I hope that having commented and tested source code proves useful.  A
lot of this stuff is either hard to find or buried inside the code
for some other program.  The code in the lessons isn't the most
efficient implementation possible, but it should be relatively easy
to convert it for your own use.

Hasta.

-=!=-
This document is Copyright by GEnie and Andy McFadden