Hacking Data Compression
Lesson 1: Introduction
By Andy McFadden

There's something almost mystic about data compression. The ability to take a file full of data, reduce it to half its original size, and then restore it to its former glory has a magical quality to it. Even after the various algorithms have been read and digested, the question remains: why is this even possible?

-=*=-

I've been fascinated with data compression for the last few years, beginning when I watched an early version of ShrinkIt. It compressed files more tightly than BLU/SQ3 (the previous standard, for those of you who don't remember Life Without ShrinkIt), but worked its magic in half the time. The key was the wondrous LZW algorithm, which I'd never heard of before.

I did what you'd expect: I located a copy of the sources for UNIX compress, which also used LZW. It was written in C, so I figured it'd be easy going. Well, compress v4.0 was a single file 20 pages long which violated a few programming conventions, had lousy structure, and what comments it had were directed toward people who already knew how LZW worked. Being new to C, I didn't get very far.

I imagine a lot of people have ended up in a similar position, unable to figure out WHY a compression program is doing what it's doing. Compression textbooks are generally unhelpful, since they are usually filled with information theory and funny symbols which don't appear on a standard keyboard. It's hard to write code from a description like, "uses a dictionary scheme to encode previously seen substrings as variable-width codes."

As it turns out, there's a reason why so little is said about the implementation of the more common algorithms. They're utterly simple.

-=O=-

OVERVIEW

I have two goals for this course: (1) provide a nice fat library of algorithms to code-hungry hacker types, and (2) provide a moderately deep understanding of how and WHY the algorithms work. I will try to break sections cleanly, so that people out to snarf code can grab it and run without having to wade through long discussions about Why Andy Thinks This Stuff Is Really Cool.

Most lessons will begin with some theoretical musings, have a middle section discussing the algorithm du jour, and end with more Deep Thoughts and ideas on extensions, improvements, enhancements, and possibly lead-ins to next week's topic. This week's material is mostly fluff, so algorithm-only types should skip to the sample code. (There are no quizzes or exams. Ignore all you want, I'll type more.)

I will judge this class a success if, having read all of the material I've written, a student walks away thinking, "how trivial." If it still looks like voodoo and mirrors when it's all over, you get a refund (all $0.00).

I'm trying to refrain from posting a big "PROGRAMMERS ONLY" sign on this course, but let's face it, what I'm telling you won't do you much good if you don't have some basic programming skills. The deeper knowledge of compression may be good for your immortal soul, but be forewarned that much of this course is concerned with application rather than theory, and is described in a way that programmers familiar with bit-twiddling will understand. For pure theory, see the list of references at the end of this lesson.

That said, don't be intimidated. There's a real-time conference for discussing the confusing bits, and usually things which don't make sense become clearer after you read them twice.

Live working code will be provided in C and/or assembly.
I will try to provide both - C for understanding and assembly for usefulness - but I make no guarantees. You WILL need an assembler or a C compiler to try the examples, and while I will try to make the C examples portable, you will probably want to run them on an Apple II. I've been using APW C for a few years and I'm perfectly happy with it, so C programs will be written in K&R C. I'll throw in ANSI prototypes for Orca/C fans. Assembly will be for the Orca/M assembler.

A list of what this course covers is given in the syllabus. There are a number of interesting algorithms which this course does NOT cover, largely because I don't know much about them. This includes specialized algorithms for graphics and sound (JPEG, JBIG, MPEG), and some routines which take forever (PPMC). I'm not going to spend much time on "lossy" compression, which employs schemes that achieve better compression by ignoring selected portions of the input data.

-=*=-

How does data compression work? It's easy to explain in terms of specific algorithms. RLE combines long runs of characters into a single code (so "aaaabccc" becomes "a4bc3"). Huffman replaces more frequent characters with short codes and less frequent characters with longer ones (so "e" would take up two bits instead of 8, while "q" would take 12). LZW replaces entire strings with codes representing previously seen strings (so "face the faces" becomes "face the 1s"). However, there doesn't seem to be a common idea running through all of these schemes.

There is, and it's a simple one. All of them try to guess what comes next, and then output how surprised they were.

Huh? Look at each one in turn:

- RLE sees a character. It's rather simple-minded, so it says to itself, "hey, I bet *ALL* of the rest are just like this one!" If the next character is identical, it stores nothing, because it wasn't surprised. If the next one is different, it just stores the character without compressing it at all.

- Huffman assigns probabilities to every possible character. If it sees a common one, like 'e', it's only mildly surprised, so it outputs a few bits. If it sees a rare one, like '^', it's flat out flabbergasted, and outputs a whole lot of bits.

- LZW wants to believe that the input is composed of strings it has already seen. If it is, LZW is mildly (but pleasantly) surprised, and outputs a code which can represent dozens of bytes. If it encounters a string it hasn't seen before, it becomes rather shocked, and the code it outputs may be longer than the string it represents.

[ Quick aside: this is why random data doesn't compress (impossible to predict what comes next), and why already compressed data rarely compresses further (most of the surprise has been removed). ]

What makes one algorithm different from another is the way it tries to predict what comes next. Like an astrologer or a fortune-teller, a data compression algorithm is judged by how well it can predict the future.

You may have a question nagging you at this point: we're compressing files, not playing the ponies; why can't we just scan the whole thing and find the optimum way to pack it? First of all, finding an optimal way to pack something is an NP-complete problem (if you don't know what that means, substitute "it takes a bloody long time"). Second, the information you gathered while scanning the file must be stored somewhere. After all, you won't have the uncompressed file sitting there when it's time to uncompress.
If you gather a lot of information, it must be encoded in the output somehow, and this can occupy so much space that the algorithm performs poorly. The bottom line is, compression by blind prediction usually does about as well as scanning the whole thing first, so the extra pass usually isn't worth the trouble. There will be some examples of this sort of thing later in the course.

Anyway. What's important is to view data compression as a way of reducing data to just those things which are unexpected. It's a little spooky at first, because a really, REALLY good compression algorithm would be able to write my next sentence for me, given enough information about the words I use and what I've said so far. In fact, by storing collections of word pairs, it could type every other word for me, and only store data when it guessed wrong. Of course, I can always snarfugle it deliberately.

-=*=-

One of the textbooks I really like (_Text Compression_, by Bell/Cleary/Witten) makes a very clear distinction between the "model" and the "encoder". The model figures out a set of probabilities for what comes next, and hands that to the encoder. The encoder takes that information, reads the next piece of data, and then outputs a value that reflects how accurate the model was (i.e. how much surprise was contained in the data).

Huffman is a good example of an encoder. Typically, a program using Huffman takes a set of input symbols (usually the 256 characters you can represent with a single byte) and a set of probabilities (which is just the number of times each character has appeared), and produces an output code for each. In this case, the "model" is simply "how many times have I seen this character before", and the "encoder" is Huffman's algorithm.

However, it's possible to use Huffman as the encoder for a more powerful model, gaining better compression. For example, the model could more accurately predict what the next character will be based on what the LAST character was. The letter 'q' is most often followed by the letter 'u', so if you just saw the letter 'q', then 'u' should have a higher probability than 'e', even though 'e' appears more often in English text.

Models don't have to be simple things like character probabilities. Suppose my model is "the full text of _Neuromancer_". If I am told to compress an exact copy of the book, I'm 0% surprised, and output nothing. If even one character is off, I'm 100% surprised, and must output the entire input stream plus one bit. Clearly this is an efficient way to distribute the book, but it isn't good for much else.

(Going back a few paragraphs, we can find the optimum compression for ANY book by sending the whole thing first - thus defining the model for both sender and receiver - and then sending a transmission 0 bits long. The first step is a bit rough, though. This is a somewhat exaggerated justification for processing the data as it comes in rather than analyzing the whole file at once: you don't need to send a whole bunch of stuff before you can send useful information.)

-=*=-

One very interesting thought process for assessing the value of a model is to stop it at some point and run it *backwards*. The idea is that a really good model will generate output which looks like the data it's trying to model. You can do this in practice by feeding random data to a decoder, but for now just try to picture how the output "feels." Let's check some examples against English text:

- RLE will generate long strings of identical characters. This doesn't really look like English.
Not surprisingly, RLE does poorly.

- Huffman will generate characters at random, but correctly proportioned for English. It'll look like garbage, but there'll be lots of 'e's, 'a's, and 't's. Huffman does reasonably well.

- LZW will spit out pieces of words, whole words, parts of sentences, whole sentences, and possibly even entire paragraphs. It may not make much sense, but it sure looks like English. Thus, LZW compresses very well after it reads a lot of input.

-=*=-

In the beginning, there was nothing. Then the stars and planets formed. Then we showed up. Bummer.

I'm trying to avoid excessive quantities of fancy terminology and assorted nonsense in this course, but there are a few terms which come up often and need to be explained. The first paragraph is an example of negative entropy, order from chaos. In the physical sciences, the "entropy" of a system is a measure of the disorder within it. This term was borrowed for information theory, and has essentially the same meaning: the entropy of a collection of data is a measure of the amount of disorder it contains.

Turned around, entropy is a measure of how much information is contained in a message. If we're expecting _Neuromancer_ and we get it, there's no information there - we were expecting it. If the sender passes along a copy of _Ringworld_ instead, there's a lot of information we didn't have. Going back to an earlier section, entropy is a measure of how surprising the input data is.

The most interesting result we can derive from this is that, if we can compute the entropy of the source data, we know the smallest possible size of the compressed output. However, the amount of surprise contained in the input data depends on who is looking at it (LZW is surprised less often than RLE when handling English text), so the entropy of a message must be computed relative to a particular model. (Note that there IS an absolute entropy for a given set of data. For example, there is a point at which English text cannot be compressed further. Unfortunately, nobody knows what it is yet.)

In case I haven't confused you enough, I'll throw in one more tidbit: just because the MODEL can produce appropriate symbols and probabilities for a message doesn't mean that the ENCODER can produce output which matches the entropy of the message with respect to the model. We'll get back to this when we discuss Huffman and arithmetic coding.

Well, back to reality. Some compression programs will say something like "entropy was 2.85 bits/symbol", meaning that, for the model used, each 8-bit character in the file carried on average 2.85 bits of surprising information. The number of bits per symbol is often given when comparing algorithms. For English text, LZW gets around 4.71 bits/char, LZH gets 3.30, and PPMC (which requires large amounts of time and memory) gets 2.48 bits/char [information comes from appendix B of _Text Compression_]. If you can come up with an algorithm which drops below 2 bits/symbol, you will become very popular in a short period of time.

-=O=-

PROGRAM

This section will discuss the algorithm(s) for each week's programs. Code-only types can start here.

This week we're just getting started, so I decided to do something kind of silly. It's a compressor which takes a text file and reduces it by about 12%. The trick is to take an 8-bit byte and store it in only 7 bits, which is all ASCII really needs. Of course, if it finds a byte that actually uses all 8 bits (i.e. one with the high bit set), it has to abort. Thus it will only work on pure text files.
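To make the bit-shifting concrete, here is a rough sketch of the packing loop in C. This is NOT the cp7.c from the course archive (the function name encode7 and the layout are just for illustration); the real program is split into the encode() and decode() routines described below. The idea is simply to collect 7 bits from each input character in an accumulator and write out a byte whenever 8 or more bits have piled up:

  /* Sketch of 8->7 bit packing.  Returns -1 if a byte with the high
   * bit set shows up, 0 on success.
   */
  #include <stdio.h>

  int encode7(FILE *infp, FILE *outfp)
  {
      unsigned long bits = 0;       /* bit accumulator */
      int count = 0;                /* number of valid bits in accumulator */
      int ic;

      while ((ic = getc(infp)) != EOF) {
          if (ic & 0x80)
              return -1;            /* found 8-bit data; give up */

          bits = (bits << 7) | (unsigned long) ic;   /* append low 7 bits */
          count += 7;

          while (count >= 8) {      /* flush full bytes, high bits first */
              putc((int) ((bits >> (count - 8)) & 0xff), outfp);
              count -= 8;
          }
          bits &= (1UL << count) - 1;   /* drop the bits already written */
      }

      if (count > 0)                /* left-justify and zero-pad last byte */
          putc((int) ((bits << (8 - count)) & 0xff), outfp);

      return 0;
  }

The decoding side just reverses the shifts (accumulate 8 bits per input byte, then peel off 7 at a time), but it needs some way to know exactly where the original data ends.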
The program stores the 4-byte length of the input file at the start of the output, because otherwise we could be off by one if the last byte has only 1 bit in it (the other 7 might be used or they might not).

This isn't supposed to be an incredibly useful tool, but rather something to get everybody warmed up. It uses the compression "shell" and bit-shift operators in C and assembly, so that problems relating to your assembler or compiler will be identified early, and everyone who isn't familiar with C can get a look at a simple example.

There are two copies of the program, one in C, one in assembly. The C program has two source files, "cmain.c" and "cp7.c". "cmain.c" provides a simple user interface and will be used throughout the course. "cp7.c" has two functions, "encode()" and "decode()", which do pretty much what you'd expect. Both operate on a stream of data, using the standard I/O functions getc() and putc().

The assembly version has "main.asm" and "p7.asm", plus the usual assortment of macro files and other junk. There's some stuff for GSBug users there too. It works in an almost identical fashion to the C version, except that getc() and putc() are defined in "main.asm".

There's one slight difference in the way they handle arguments. Under Orca, ".." must be expanded using the shell wildcard expansion calls. This works when you specify an input file, but not when you specify an output file which doesn't exist. The C code (pulled from NuLib) correctly handles this, but the assembly code (pulled from YankIt) doesn't.

The way to use either version is:

  program e infile outfile     to compress (encode) a file
  program d infile outfile     to extract (decode) a file

These are rather simple-minded, and don't do nice things like checking to see if the output file already exists (I've found that such a feature just tends to get in the way more often than not). If you want to spruce them up, feel free to do so... as far ahead as I can think, all the code will use the same "cmain.c" and "main.asm" files.

When they finish, both versions will print the size of the file before and after compression. For encoding, they will also print the percent compression (where no compression == 0%, reduction by half == 50%, and expansion by half == -50%) and the number of bits per input character. For this week you would expect that to be exactly 7, but because we have to store the length of the original file it will be slightly more. Another contributing factor is that the last byte in the output may not have all 8 bits filled.

-=O=-

DEEP THOUGHTS

Fun stuff to think about:

- for this week's example (the 8->7 bit compressor), what is the model? What does it expect? What surprises it? What would it generate if you ran it backward?

- an encoder is given the probabilities for a set of symbols. What happens if the input symbol isn't in the set? i.e. we're told there's a 50% chance of 'a' (so we'd output a 0) and a 50% chance of 'b' (so we'd output a 1), but we get 'c' instead.

- suppose something really surprising happens. Is there a way to avoid storing the full measure of surprise? What happens to the data when we uncompress it?

- in the case where the model was the full text of _Neuromancer_ and the input didn't match, I had to output the entire input, plus one bit. Why the extra bit?

- running models backward suggested that LZW outperformed Huffman on English text. For what sort of input will Huffman outperform LZW?
(Recall: Huffman generates short codes for frequent characters and long codes for infrequent characters, while LZW replaces repeated strings, e.g. all instances of " the " would be replaced with a single code.)

- an interesting result from information theory is that if a compressor can make some set of input data smaller, then there exists a set of input data which it will make larger. Convince yourself that this is true.

- suppose we're in a situation where we CAN'T abort the compressor. How can we change the 7-bit compressor to handle short bursts of 8-bit data?

-=O=-

REFERENCES

First, here's the bibliography from the Frequently Asked Questions List for the Usenet newsgroup "comp.compression" (reformatted slightly):

-----
Subject: [7] Which books should I read?

[BWC 1989] Bell, T.C., Witten, I.H., and Cleary, J.G.
    "Text Compression", Prentice-Hall, 1989. ISBN 0-13-911991-4.
    Price: approx. US$40.
    The reference on text data compression.

[Nel 1991] Mark Nelson
    "The Data Compression Book", M&T Books, Redwood City, CA, 1991.
    ISBN 1-55851-216-0. Price $36.95, including two 5" PC-compatible
    disks bearing all the source code printed in the book.
    A practical introduction to data compression. The book is targeted
    at a person who is comfortable reading C code but doesn't know
    anything about data compression. Its stated goal is to get you up
    to the point where you are competent to program standard
    compression algorithms. {which is more or less what this course is
    all about!}

[Will 1990] Williams, R.
    "Adaptive Data Compression", Kluwer Books, 1990.
    ISBN 0-7923-9085-7. Price: US$75.
    Reviews the field of text data compression and then addresses the
    problem of compressing rapidly changing data streams.

[Stor 1988] Storer, J.A.
    "Data Compression: Methods and Theory", Computer Science Press,
    Rockville, MD. ISBN 0-88175-161-8.
    A survey of various compression techniques, mainly statistical
    non-arithmetic compression and LZSS compression. Includes complete
    Pascal code for a series of LZ78 variants.

[ACG 1991] "Advances in Speech Coding", edited by Atal, Cuperman, and
    Gersho, Kluwer Academic Press, 1991.

[GG 1991] "Vector Quantization and Signal Compression", by Gersho and
    Gray, Kluwer Academic Press, 1991.

[CT 1991] "Elements of Information Theory", by T.M. Cover and
    J.A. Thomas, John Wiley & Sons, 1991.

Review papers:

[BWC 1989] Bell, T.C., Witten, I.H., and Cleary, J.G.
    "Modeling for Text Compression", ACM Computing Surveys, Vol.21,
    No.4 (December 1989), p.557.
    A good general overview of compression techniques (as well as
    modeling for text compression); the condensed version of "Text
    Compression".

[Lele 1987] Lelewer, D.A., and Hirschberg, D.S.
    "Data Compression", ACM Computing Surveys, Vol.19, No.3
    (September 1987), p.261.
    A survey of data compression techniques which concentrates on
    Huffman compression and makes only passing mention of other
    techniques.
-----

Now, a few comments about the books I've seen...

The first data compression book I bought was _Text Compression_, by Bell, Cleary, and Witten. If you are interested in theory stuff, chapter 2 is a *wonderful* discussion of entropy, models vs. encoders, etc. I had to force myself to leave the book on the shelf out of fear of inadvertently plagiarizing the whole thing. All of the material is presented in a way which is mathematically precise yet decipherable by people who don't have Ph.D.s in the field. The definitive paper on arithmetic coding was written by them, and the chapter which discusses it is very thorough (and does in fact contain source code).
The second book, _The Data Compression Book_ by Mark Nelson, was purchased after I had finished most of this first lesson. I flipped through the table of contents and was surprised to see that it was nearly identical to my proposed syllabus! The book is very similar to this course, in that it presents algorithms and gives actual code for them.

So the question is, should you buy this book? As always, "it depends." Most of what's in the book will be covered in this course. His examples were written for an IBM PC, but should be portable to other platforms. However, they are generally written in a heavily structured way which matches pseudocode closely but is less efficient. I actually found them harder to read, because once you understand what is going on you don't want to be flipping between a dozen separate routines. On the other hand, I believe his examples are easier to decipher the first time through (ASCII art illustrations can only go so far), and his descriptions will be easier to understand if you're relatively new to programming.

As an added bonus, the book covers lossy image compression (JPEG), which will not be covered in this course. It has one of the most readable descriptions of DCT stuff that I've seen (about 60 pages long), so if you're interested in lossy compression, you may want to check it out.

Before you sign off GEnie and never come back: there WILL be things in this course which he doesn't cover, not to mention source code in assembly language. I will also try to spend more time discussing both theory and efficient implementation, the former because I find it fascinating, the latter because it's absolutely vital on an Apple II. I may also throw in a few interesting code tidbits. ;-)

-=!=-
This document is Copyright by GEnie and Andy McFadden