Hacking Data Compression  Lesson 12  HardPressed(tm)
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1993 GEnie


After a brief six-month hiatus, I'm back with the final lesson.  This
week we'll be discussing a program that uses the techniques we've been
studying: HardPressed(tm).

Since there are files on the HardPressed install disk that describe
both HP itself and HP modules in great detail, I have chosen to keep
this lesson at a fairly high level.


-=O=-     Quick Review

- Differential: Stores each byte as a difference from the previous.
Sometimes used with adaptive Huffman on sound files.  Cf. ADPCM.
- RLE: Run Length Encoding.  xyaaaaabbbb becomes xy5a4b (more or
less; a sketch follows this group).
- Huffman encoding: Builds a tree of output symbols by frequency, so
that more frequent symbols get shorter codes.  Can be static,
semi-adaptive, or adaptive.
- Splay Tree encoding: A rather simple encoding scheme that adapts
rapidly to changing input.
- Arithmetic encoding: A theoretically optimal encoding method that
stores the entire input as a single, very high-precision number.
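
Since it's been a while, here's a minimal sketch of the RLE idea in
C.  This is not the exact format from the RLE lesson; it just emits
<count, byte> pairs for everything, which is easy to decode but
wasteful on non-runs.

#include <stdio.h>

/* Minimal RLE encoder sketch: emits <count, byte> pairs on stdout.
   Runs longer than 255 are simply split into multiple pairs. */
int main(void)
{
    int c, prev, count;

    prev = getchar();
    while (prev != EOF) {
        count = 1;
        while ((c = getchar()) == prev && count < 255)
            count++;
        putchar(count);            /* run length, 1-255 */
        putchar(prev);             /* the byte that repeats */
        prev = c;
    }
    return 0;
}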

- MTF: Move To Front.  Keeps a list of the most recently seen words,
moving each to the front of the list when it appears.  Outputs the
word's index, which is usually encoded with Arithmetic encoding.  (A
byte-level sketch follows this group.)
- Digram: Computes frequencies on groups of letters, resulting in more
accurate predictions.  A sophisticated form of this (context modeling)
plus an arithmetic encoder represent the current state of the art in
text compression.
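
Here's a sketch of MTF in C, done on single bytes rather than words
to keep it short; the output indices would normally go to a back-end
encoder instead of straight to stdout.

#include <stdio.h>

/* Minimal MTF sketch: emit each byte's position in a recency list,
   then move that byte to the front.  Recently-seen bytes get small
   indices, which a back-end encoder can store in fewer bits. */
int main(void)
{
    unsigned char list[256];
    int i, c;

    for (i = 0; i < 256; i++)          /* initial order: 0..255 */
        list[i] = (unsigned char)i;

    while ((c = getchar()) != EOF) {
        for (i = 0; list[i] != c; i++)
            ;                          /* find the byte's index */
        putchar(i);                    /* emit the index */
        for (; i > 0; i--)             /* slide the others down */
            list[i] = list[i - 1];
        list[0] = (unsigned char)c;    /* and move it to the front */
    }
    return 0;
}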

- LZ77: Ziv-Lempel, 1977.  Input strings are gathered with a sliding
window, and output as <len, posn, char> triples (a sketch follows
this group).
- LZSS: Improved LZ77 scheme that got rid of the extra character.
- LZH: Improved LZSS algorithm that used Huffman encoding on the
output values to improve compression efficiency.  Variations of this
are used in almost all general-purpose file archivers (Zip, Zoo, etc.).
- LZARI: Like LZH, but with arithmetic encoding.
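
The heart of the LZ77 family is the match search.  Here's a
deliberately naive sketch in C; real implementations find matches
with hash chains or trees and pack the triples into bits, not text.

#include <stdio.h>
#include <string.h>

#define WINDOW 4096   /* how far back we're allowed to point */
#define MAXLEN  255   /* longest match we bother looking for */

/* Emit <len, posn, char> triples for buf[]; posn is the distance
   back to the start of the match, and len 0 means "literal only". */
static void encode(const unsigned char *buf, int n)
{
    int pos = 0;

    while (pos < n) {
        int best_len = 0, best_posn = 0, dist;

        for (dist = 1; dist <= pos && dist <= WINDOW; dist++) {
            int len = 0;
            while (len < MAXLEN && pos + len < n - 1 &&
                   buf[pos + len - dist] == buf[pos + len])
                len++;
            if (len > best_len) {
                best_len = len;
                best_posn = dist;
            }
        }
        printf("<%d, %d, %c>\n", best_len, best_posn,
               buf[pos + best_len]);
        pos += best_len + 1;     /* skip the match plus the literal */
    }
}

int main(void)
{
    const char *s = "xyaaaaabbbb";
    encode((const unsigned char *)s, (int)strlen(s));
    return 0;
}

Run on the RLE example above, this prints <0, 0, x>, <0, 0, y>,
<0, 0, a>, <4, 1, b>, <2, 1, b> -- note how most of the run of a's
collapses into a single triple.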

- LZ78: Ziv-Lempel, 1978.  Builds a tree with unique prefixes,
outputting <prefix, char> for each match.
- LZW: Improved LZ78 scheme that got rid of the extra character.

-=O=-     To Stream or Not to Stream

Almost all of the code presented during this course took its input one
character at a time and produced output one character at a time.
Except for stuff like semi-adaptive Huffman (which read its input
twice, once to generate the frequency table, once to do the
compression), all of the implementations read and wrote data serially.
No block transfers, no backing up at the end.

This is commonly referred to as working on a "stream" of data, as
opposed to working on a "block" of data.  From an abstract point of
view, the data arrives and leaves in a continuous stream.  Some
programs (e.g. ShrinkIt) read in (say) 4K of data, compress it, and
then output it as a distinct chunk before going on to the next block.

I chose to take the "streaming" approach for a couple of reasons.
First, most of the algorithms lend themselves to this sort of
implementation.  Some of them were designed to compress data as it was
written to a disk or over a network, where you can't hold onto the
data for long and you can't undo what you've done.

Second, it allows you to do some neat things without having to change
the core of the implementation.  For example, suppose you wanted to
"chain" RLE and LZW together.  As I mentioned a few lessons ago,
ShrinkIt does this to compress disk images efficiently.  With the
course approach, it's easy to hook them together: the input to the LZW
compressor can come straight from the RLE compressor.  It just
requires a little fiddling with the Getc() and Putc() routines.
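
Here's a sketch of that plumbing in C (the names -- putc_fn,
rle_stream, stage2_putc -- are invented for illustration, not the
course code's actual hooks):

#include <stdio.h>

typedef void (*putc_fn)(int c);

/* Stand-in for the second stage.  In the real chain this would be
   the LZW compressor's per-character entry point; here it just
   passes bytes through so the plumbing stays visible. */
static void stage2_putc(int c)
{
    putchar(c);
}

/* First stage: the RLE encoder again, rewired so that its Putc()
   is whatever routine it was handed.  Pointing "out" at the next
   compressor's input is all it takes to chain the two. */
static void rle_stream(FILE *in, putc_fn out)
{
    int c, prev, count;

    prev = getc(in);
    while (prev != EOF) {
        count = 1;
        while ((c = getc(in)) == prev && count < 255)
            count++;
        out(count);
        out(prev);
        prev = c;
    }
}

int main(void)
{
    rle_stream(stdin, stage2_putc);
    return 0;
}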

By some strange twist of fate, HardPressed's compression has the same
basic design.

-=O=-     Modulus

HardPressed uses compression "modules", executable files that are
loaded when the system boots.  They are very much like the compression
code that I've written for this course, in that they don't do any file
manipulation themselves.  It's all handled through calls to the main
part of HardPressed, much like the course code used Getc() and Putc()
to receive and send data.

Unlike the course code, the modules are designed for speed, not
legibility.  The source code tends to be rather knotted, with lots of
shortcuts taken for efficiency's sake.  For this reason (and some
copyright stuff) I haven't included source code for any of the actual
modules used in HP.  You can find a few examples, along with a
reasonably complete description of the modules, on the HardPressed
disk (see the "Goodies" folder).

-=*=-

To a distant observer, they would appear the same.  The RLE source
code and the HardPressed RLE module take a file as input, do more or
less the same stuff to it, and produce output.  The length of the file
is irrelevant; it just keeps sucking data in and spitting it out until
it's all done.  On the inside, however, things are much different.

An important design issue is that dealing with input a character at
a time is slow: the program has to make a function call or JSR for
every single character read or written.

To solve this, all of the information is handled in blocks.  This
complicates the program logic, because instead of just saying, "give
me the next character", it has to check to see if more data is
available, if there's room left in the output buffer, and so on.  A
lot more is expected of the module, but by doing it all inline instead
of with procedure calls, a great deal of overhead is shaved off.
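
The shape of such an encode call might look something like this in C
-- the names, signature, and return codes here are all invented for
illustration (the real interface is documented on the HP disk):

#include <stdio.h>

#define NEED_INPUT   1   /* input exhausted; call again with more */
#define OUTPUT_FULL  2   /* output buffer filled; drain it first */

/* Hypothetical block-oriented encode step.  Instead of asking "give
   me the next character", it checks buffer bounds inline per byte. */
static int encode_block(const unsigned char *in, long in_len,
                        unsigned char *out, long out_max,
                        long *in_used, long *out_used)
{
    long ip = 0, op = 0;

    while (ip < in_len) {
        if (op >= out_max) {      /* no room left in the output */
            *in_used = ip;
            *out_used = op;
            return OUTPUT_FULL;
        }
        out[op++] = in[ip++];     /* a real module does its RLE/LZW
                                     work inline right here */
    }
    *in_used = ip;
    *out_used = op;
    return NEED_INPUT;            /* ate the whole input block */
}

int main(void)
{
    unsigned char out[4];
    long used, made;
    int r = encode_block((const unsigned char *)"hello", 5,
                         out, (long)sizeof out, &used, &made);

    printf("status %d, consumed %ld, produced %ld\n", r, used, made);
    return 0;
}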

The fun part is that it still APPEARS to be handling data in a
continuous stream.  The output file produced is, in most cases,
identical to that produced by the lessons in this course.  HardPressed
maintains the abstraction of a stream while using a block-oriented
implementation for speed.

-=*=-

As I hinted at earlier, HardPressed modules can be chained together in
any order.  (The current implementation limits the chain length to 8
modules, but that's arbitrarily imposed.)  Chaining the modules
together is actually almost trivial given the design of the system.

Because all data transfers are done in large blocks, all that is
necessary to chain modules together is to use one module's output
buffer as the next module's input buffer.  As long as a module is able
to read its data from anywhere in memory, there's no need to copy the
data between buffers (an activity that is frightfully expensive on a
IIgs).
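
In outline, the dispatcher's job might look like this (hypothetical
names again; the two do-nothing stages stand in for real modules
like RLE and LZW):

#include <stdio.h>

#define BUFSZ 4096

typedef long (*encode_fn)(const unsigned char *in, long n,
                          unsigned char *out);

/* Do-nothing stand-ins so the chain has something to call. */
static long stage_a(const unsigned char *in, long n, unsigned char *out)
{
    long i;
    for (i = 0; i < n; i++)
        out[i] = in[i];
    return n;
}

static long stage_b(const unsigned char *in, long n, unsigned char *out)
{
    long i;
    for (i = 0; i < n; i++)
        out[i] = in[i];
    return n;
}

int main(void)
{
    static unsigned char buf[2][BUFSZ];    /* ping-pong buffers */
    encode_fn chain[] = { stage_a, stage_b };
    long len;
    int i;

    len = (long)fread(buf[0], 1, BUFSZ, stdin);
    for (i = 0; i < 2; i++)                /* one module's output  */
        len = chain[i](buf[i & 1], len,    /* buffer is simply the */
                       buf[(i + 1) & 1]);  /* next module's input  */
    fwrite(buf[i & 1], 1, (size_t)len, stdout);
    return 0;
}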

Fortunately, all of this switching between compressors is hidden from
the module.  The sole concerns of an HP module are compressing data
and managing its input and output buffers.  This allows module authors
to concentrate on manipulating data as quickly as possible, without
having to worry about where it's coming from or going to.

For gruesomely detailed information on the modules, see
"Writing.Modules" and "Module.Reference" in the Goodies folder of the
HardPressed disk.

-=O=-     The Mote in GS/OS's Eye

HardPressed is (in my biased opinion) a fine example of an application
of data compression.  Until a few years ago, compression was usually
in the form of archiving programs like ShrinkIt, Zip, or UNIX
compress: things that you, Joe User, do to a file to make it smaller.
When you want to access the file, you, Joe User, execute a related
command to uncompress it.

Now there are "automatic" compression utilities for most major
platforms.  Compressed files are expanded automatically when accessed,
and recompressed when they're no longer needed.  HardPressed lives
between GS/OS and user applications, working transparently as well as
automatically.

When choosing what kind of compression to use for this sort of
endeavor, there are a number of factors that must be taken into
account:

     - how fast is it?  Since this is going to be happening all the
time, it's got to be fairly fast, or nobody will want to use it.

     - how well does it compress?  No point in dealing with the hassle
if the files don't get any smaller.

     - how small is it?  Nobody wants to kiss 2MB of memory goodbye
just to make their files smaller.

Suppose you had to choose from the compression techniques we've
studied.  Which ones would YOU choose?

(time's up)

The modules currently supported by HardPressed are:

     - RLE.  This is easily the fastest of the bunch, and it does
fairly well on things like applications and sparse files.

     - LZW.  If you played with this for any length of time, you know
that when you take compression time, expansion time, and amount of
compression into consideration, this one is tough to beat.

     - LZSS.  This one is deathly slow when compressing, but it can
expand in half the time that LZW takes because it's doing little more
than copying the strings out of the buffer.  Converting it to LZH or
LZARI is impractical because of the speed loss.

     - Differential & adaptive Huffman.  Alone, these are mostly
worthless.  Together, they could provide an effective way to compress
digitized sound files (lossless).

Most of the rest are either too slow or merely comparable to the
above.  Stuff like semi-adaptive Huffman (which generates the
frequency table in one pass and does the compression in another) can't
be used because you only get to look at the data once.

-=*=-

I bring this up to make a simple point: you have descriptions of and
source code for all the compression techniques needed for a commercial
product.

-=O=-     PROGRAM

Ain't none.  Well, there's one slightly silly sample, but that's it.

First thought was to write a general library of routines, allowing you
to assemble it with (say) "rle.asm" to get a working HardPressed
module.  This turned out to be harder than it seemed, because there's
a lot of really unpleasant stuff that has to be done to make it work
correctly.  It wouldn't have been particularly enlightening to
examine.

Second thought was to just convert rle.asm and lzw.asm into
HardPressed modules.  Didn't seem all that hard.  However, it started
getting all tangled up because the calling sequences are very
different.  For the course stuff, Encode is called once, and expected
to handle the whole file.  It calls Getc() to get more data.  For HP
modules, StartUp is called once, and then Encode is called several
times with blocks of data until the whole thing is done.  Instead of
calling DOWN for more data, Encode returns UP and waits for more.
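
The HP-style block entry point was sketched back in the "Modulus"
section; for contrast, the course-style entry point is shaped like
this (Getc/Putc here are just stdio stand-ins for the course
versions):

#include <stdio.h>

static int  Getc(void)  { return getchar(); }
static void Putc(int c) { putchar(c); }

/* Course-style calling sequence: Encode() is called exactly once
   and drives the whole file itself, pulling data DOWN via Getc()
   and pushing results out via Putc(). */
static void Encode(void)
{
    int c;

    while ((c = Getc()) != EOF)
        Putc(c);        /* the compression work itself goes here */
}

int main(void)
{
    Encode();           /* one call handles the entire file */
    return 0;
}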

Modifying course stuff to work with HP would either require some major
logic shifts or some gritty code in the new Getc and Putc routines.
It would've been easier to write it over again, but that wouldn't show
you anything that the sample code on the HP disks doesn't.

So, there's nothing useful here, which is probably not a bad thing
considering how huge some of the previous lessons have been.

-=O=-     That's All, Folks...

Well, this is the end of the line.  Maybe this should be called
Gilligan's Course... instead of a 3 hour tour that lasted years, we
had a 12 week course that took 9 months. :-)

Hope some of these lessons have been useful.  I'd like to think that
my explanations were more helpful than those in textbooks, because *I*
learned some of this stuff for the first time while writing the course
(splay trees? what's that?), so I was very aware of gaps in other
authors' explanations.

I hope that having commented and tested source code proves useful.  A
lot of this stuff is either hard to find or buried inside code for
some other program.  The code in the lessons isn't the most efficient
possible implementation, but it should be relatively easy to convert
it for your own use.

Hasta.

-=!=-

This document is Copyright by GEnie and Andy McFadden