Hacking Data Compression  Lesson 5  Adaptive Encoding
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie

Today's lesson is on adaptive coding techniques.  We begin with a justification of why it's a Good Thing, and drop into a discussion of adaptive Huffman.  A description of the elusive splay tree compression is also presented.

-=O=-  Greetings and Adaptations

If you followed the discussion in last week's lesson on Huffman encoding, you should be familiar with some of the issues involving adaptive, semi-adaptive, and static models and encoders.  To summarize briefly:

- static schemes use a rigid set of probabilities, determined before examining the data to be compressed.

- semi-adaptive schemes choose a set of probabilities and use them throughout the entire file.  The choice can be made by scanning the file in advance or by using some sort of file type mechanism to determine what the contents should be.  The probability set or dictionary is either stored in the program or transmitted before the data.

- adaptive schemes start out in a particular state (they could start with no knowledge at all or from a basic set of probabilities), and then change the probabilities as the input is processed.

We've seen examples of all of these.  Differential encoding, the static-dictionary digram compressor, and the MacWrite encoder are all static; last week's Huffman encoder is driven by a semi-adaptive model based on character frequencies; and the MTF scheme described in lesson 3 is adaptive.

-=*=-

We have already seen that adaptive schemes can lead to cleaner implementations (no tables to shuffle around, no extra scanning passes, etc).  However, there are some other advantages.  In section 3.2 of _Text Compression_, Bell/Witten/Cleary state:

    Given a set of conditioning classes with K members, an alphabet
    with q symbols, and a message of length n, there are adaptive
    models that will take at most Kq log n bits more than the best
    possible nonadaptive model.

This is probably not very meaningful, especially since we never really went into what "conditioning classes" are.  I've presented it because there are two interesting conclusions which can be drawn: first, adaptive models will do only slightly worse than static models; and second, Kq log n is approximately the number of bits needed to represent a model of the message, so adaptive models will be about as efficient as semi-adaptive models.

It can also be shown that adaptive models are "robust", meaning that at worst an adaptive model will expand the input only slightly.  It should be intuitive that you can make a completely static model do a really lousy job (e.g. by feeding a file full of '%'s to a static model of English text).

-=*=-

If I'm doing my job correctly, you should now be convinced that, at worst, an adaptive model will do only slightly worse than a static or semi-adaptive model.  However, there are cases where an adaptive model will do considerably better than either.

Suppose I'm using some sort of multi-media format with text, graphics, and sound all in one whopping big file.  If we suppose that none of the data within the file is compressed (which is a silly assumption, but hey, this is a contrived example anyway), then we have three sets of stuff to deal with.

Suppose we run a static or semi-adaptive Huffman compressor on it.  A static set of probabilities is going to gag on the sound and graphics, but will do a nice job on the text.
A semi-adaptive compressor will end up with a rather flat set of probabilities, because it will have merged the frequencies from the text (assume English), the graphics (which will likely tend toward a small set of values), and the sound (which looks almost random).

Since an adaptive Huffman compressor changes the probabilities as it reads in characters, it will adapt to each section of the file in turn.  This allows it to compress each section with the best possible values for the character probabilities.

This sort of adaptation can also be useful in just plain text files.  If you break away from normal writing to display ASCII cow art, the distribution leans toward spaces and characters like '|', '/', '\', and 'o'.  Static models just get bogged down, while adaptive models can take advantage of the local changes.

-=*=-

One important question is, how quickly can (or should) the model adapt?  If it adapts too slowly, it will miss out on local phenomena like cow drawings.  Too quickly, and the probabilities will be inaccurate, leading to poor compression.  The correct answer is, as usual, "it depends."  For natural languages like English, you usually want to base your predictions on the last few thousand bytes of data.

A very common way to increase the value of recently-seen data is to "forget" old data.  This can be done with a "sliding window" (discussed in the upcoming LZSS lesson), by explicitly weighting recent events, or by "forgetting" old information.

In a model based on character frequencies, we get a certain amount of forgetfulness if we restrict the size of the frequencies to prevent overflow.  For example, if we periodically halve the frequency counts, we are essentially cutting the weight of previous data by 50%.  The next time we cut it, the original statistics are reduced to 25%, and so on.  This isn't always a good thing - local changes might be temporary - but for most kinds of data, recent events say more about what to expect next than events which happened near the start.

-=*=-

All is not sunshine and roses in the adaptive compressor world.  There is one major down side to adaptive schemes: the cost of updating the model at every step.  The exact cost varies - Huffman is fairly expensive, LZW is almost free - but it's never zero.  There may be additional speed penalties depending on the implementation; for example, last week's Huffman encoder stored the bit patterns in an array while it was encoding.  An adaptive encoder is continuously changing the bit patterns, so this method is no longer feasible.

As we saw last week, it is wise to limit the frequencies to some value, so that the frequency table and output codes can be implemented efficiently.  However, we also saw that this meant slashing all of the frequencies in half whenever the total frequency reached a certain value.  At first, this doesn't seem significant; ALL of the frequencies have been cut in half, so the tree doesn't need to change.  Life, unfortunately, is not so simple.  Because of round-off errors, it's possible for the balance to shift, requiring nodes to exchange positions.  Because there's no easy way to adjust the tree to compensate, it's necessary to rebuild the entire tree with the new frequencies.

-=O=-  Splay trees

Let me explain something up front.  Very little has been written about splay trees in compression texts.
The only place I know of that discusses them is "Application of Splay Trees to Data Compression", by Douglas W. Jones, Communications of the ACM, August 1988, and I've never even read THAT.  What I DO have is a piece of code which I pulled off the net about four years ago and managed not to lose.  Everything I know about splay trees comes from deciphering that code.

Why bother?  Why are splay trees interesting?  Because they can approximate an adaptive Huffman encoder without ANY frequency counts or other garbage.  All of the information is contained in the shape of the tree itself.

It turns out that the cost of updating the tree is about equal to adaptive Huffman's, and the compression performance is slightly worse, so there's really nothing to gain from using them (consequently, there's no assembly code for them in this week's lesson).  There is one big loss, however: since there are no frequency counts, there is no way to control the maximum height of the tree (at least, not without maintaining some expensive auxiliary data structures).  This means that the maximum code length is equal to N bits, where N is the number of symbols.  Outputting 256 bits for a rare symbol might be efficient coding, but it's a real drag to implement.

-=*=-

How do splay trees work?  Fine question.  It's actually quite a bit like Huffman, in that you have a binary tree, and output 0 if you go left and 1 if you go right.  The only real difference is in the tree update routine, which is remarkably simple.

Suppose you have a tree which looks like this:

       a
      / \
     /   \
   b       c
  / \     / \
 /   \   /   \
w     x y     z

If you received the character corresponding to node w, you would swap w and everything below it (i.e. w's "subtree") with node c's subtree.  Basically, you go one up and one over (to the brother of the parent), and swap.  So the result would look like this:

         a
        / \
       /   \
      b     w
     / \
    /   \
   c     x
  / \
 /   \
y     z

(If the next character you got corresponded to node c, you'd end up switching things right back.)

Changes to the other side of the tree are just reflected across... if node z got updated in the first tree, it would be swapped with node b.  If node z got updated in the second tree, it would be swapped with node x.

If the tree were taller (and proportionately harder for me to draw), we would propagate the changes up the tree by updating the "local root" node.  For example, when node z got updated in the second example, it was swapped with x; b was the "local root", the parent of the parent.  If node a had a parent, then b would next be swapped with a's brother, and so on up the tree.  We stop when we hit the root or the level below the root.

-=*=-

Why does this work?  Why does it even come CLOSE to Huffman?  Fine question.

From this example, it should be clear that when a character appears, it bubbles toward the top rather quickly.  Each swap moves the node one level closer, and there can be more than one swap if the tree is tall enough.  Because it's possible for nodes to "oscillate", it seems likely that frequently seen characters will tend to "hover" in the same region of the tree.  This means that frequently seen characters will be given short codes, which is what we want.

I suspect that the reason this doesn't do as well as adaptive Huffman is that it adapts too quickly.  Infrequently seen characters will end up at the bottom, but will rise quickly when encountered.  This may be useful in certain situations, but it doesn't model English text very well.  The problem is that you can only capture so much state in the shape of a tree with 256 nodes.

Anyway.
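In C, the whole update can be written as a short loop.  The sketch below is just an illustration of the "one up and one over" swap described above - it is NOT the code distributed with this lesson, and the up[]/left[]/right[] arrays, the node count, and the ROOT constant are assumptions about one plausible way to store the tree.

/*
 * A minimal sketch of the splay-style update, assuming the tree is
 * stored in three flat arrays (names and sizes are illustrative):
 *   up[n]              parent of node n
 *   left[n], right[n]  children of internal node n
 */
#define NODES 513                    /* e.g. 257 leaves + 256 internal nodes */
#define ROOT  0                      /* assume the root lives in entry zero  */

static int up[NODES], left[NODES], right[NODES];

void splay_update(int node)          /* 'node' is the leaf for the symbol    */
{
    while (node != ROOT && up[node] != ROOT) {
        int parent = up[node];
        int grand  = up[parent];

        /* "one up and one over": find the brother of the parent */
        int uncle = (left[grand] == parent) ? right[grand] : left[grand];

        /* swap the subtree rooted at 'node' with the uncle's subtree */
        if (left[grand] == parent)
            right[grand] = node;
        else
            left[grand] = node;
        if (left[parent] == node)
            left[parent] = uncle;
        else
            right[parent] = uncle;
        up[node]  = grand;
        up[uncle] = parent;

        /* continue from the "local root" (the old grandparent) */
        node = grand;
    }
}

In a compressor you would call this with the leaf node for each symbol right after coding it.  Note that there are no counts anywhere; the swap itself is the entire model update, and the shape of the tree is the only state carried forward.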
I haven't played with this enough to give a solid analysis.  Read the CACM paper if you're really interested.

-=O=-  PROGRAM (Adaptive Huffman)

The model used is the same as last week's: an order-0 character frequency model.  The data structures we will use look very similar, but with some interesting twists.  The binary tree is again stored as a collection of flat arrays.

First, we have the table of frequencies, with one entry for each symbol and internal node.  Our set of symbols includes all 256 characters plus $100 for the stop symbol.  Recall from lesson 4 that the frequency value of an internal node is the sum of the frequencies of its children.  All of the leaf frequencies are initialized to 1 to avoid the zero-frequency problem.

Next, we have the "parent" array.  At first, it has three distinct levels.  The first level, from 0 to 256, represents the leaves; the pointers all point to internal nodes.  The second level, from 257 to 512, represents the internal nodes (256 internal nodes for 257 symbols, remember?).  The root of the tree is at the very end, in node 512.  The third level is actually BELOW the leaves despite being above them in the array.  It has one entry for each character, and points to the corresponding leaf.  When we start updating the tree, the leaves and internal nodes will shift around the first two levels.  The pointers in the third level will continue to track the leaves around, so that when we are encoding symbols we know where to start, and when decoding we know when to stop.  Remember, encoding walks up the tree, decoding walks down.

The "son" array holds pointers to the child nodes for every entry in the first two levels.  The pointers in the leaf nodes point to the third level, so that you can tell when you've hit the bottom of the tree by looking at the size of the index in "son" (if it's larger than the root, then you are about to reach the third level, and thus are in a leaf).  Besides acting as a flag, it also tells us which character the leaf represents.  Note that each node has two children; they are found at son[i] and son[i]+1 (so the brothers are always adjacent).

This may all seem a bit confusing, but it leads to a remarkably compact and efficient implementation.  Take a careful look at StartHuff(), and figure out what is going where.  Make a note of the two sentinel values, in the frequency table and in "parent".

-=*=-

encode() and decode() are pretty dull.  encode() just keeps calling EncodeChar() with the character it gets, and decode() keeps calling DecodeChar() and outputting what it gets back.

EncodeChar() is mildly interesting.  It starts at the leaf and walks up the tree, pushing bits on as it goes.  The bits are pushed on at the left edge and move to the right.  When it reaches the root, it transmits the bits.  This is just doing what the static Huffman coder did last week, but here we have to do it every time instead of computing it all at once.

DecodeChar() is also only mildly interesting.  Since we're decoding, we take the input one bit at a time, and walk from the root down until we find a leaf.  The character we return is determined by the tree position.  As you look at the code, remember that 'T' is the size of the tree, so c+T is the position on the third level (the one which is effectively below the leaves).

-=*=-

The update() routine does a moderate amount of work for each character.  However, after the first 32K (or so) characters, and roughly every 16K thereafter, the tree will be entirely rebuilt.  I'll explain that part in a bit.
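First, though, here is a rough C sketch of the two walks just described.  This is not the source distributed with the lesson - it just follows the freq/prnt/son layout from above, and PutBits()/GetBit() are hypothetical stand-ins for whatever bit I/O routines are handy.

/*
 * A rough sketch (not the distributed source) of the leaf-to-root and
 * root-to-leaf walks, using the array layout described above.
 */
#define N_SYMBOLS 257                /* 256 characters plus the stop symbol    */
#define T (N_SYMBOLS * 2 - 1)        /* 513 nodes in the first two levels      */
#define R (T - 1)                    /* the root lives at the very end         */

unsigned freq[T + 1];                /* frequencies (one extra for a sentinel) */
int prnt[T + N_SYMBOLS];             /* parents; the third level starts at T   */
int son[T];                          /* first-child pointers                   */

extern void PutBits(int len, unsigned long bits);   /* hypothetical bit output */
extern int GetBit(void);                            /* hypothetical bit input  */

void EncodeChar(int c)
{
    unsigned long bits = 0;          /* frequency capping keeps the codes
                                        well within 32 bits in practice       */
    int len = 0;
    int k = prnt[c + T];             /* start at the leaf for character c     */

    do {
        bits >>= 1;                  /* push each bit on at the left edge     */
        if (k != son[prnt[k]])       /* second child of its parent? emit a 1  */
            bits |= 0x80000000UL;
        len++;
        k = prnt[k];                 /* climb toward the root                 */
    } while (k != R);

    PutBits(len, bits);              /* send the top 'len' bits, MSB first    */
}

int DecodeChar(void)
{
    int c = son[R];                  /* start just below the root             */

    while (c < T)                    /* still within the first two levels     */
        c = son[c + GetBit()];       /* 0 = first child, 1 = second child     */

    return c - T;                    /* we landed on the third level          */
}

In the real program, both sides also kick off the tree update for the symbol just handled, so the encoder and decoder stay in sync.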
(Note that the frequency of an internal node is equal to the sum of the frequencies of its children.  If you carry this up the tree, you will see that the frequency of the root is equal to the sum of the frequencies of the leaves.  Thus, we don't need to maintain an extra "total frequency" counter; the value is in the root.)

In the average case, the frequency of the leaf for the character we just got must be increased.  However, since the parent's frequency is the sum of the frequencies of its children, we need to propagate the change all the way up the tree.  Also, since we need to change the shape of the tree adaptively, we have to swap the node if its frequency exceeds that of its neighbors.  The code is sort of strung out, but it's basically this:

 (1) increment the node's frequency
 (2) if we're now larger than our right neighbor, then
    (2a) find the largest node that we are still larger than, and
    (2b) swap with it
 (3) continue up the tree

When the node moves, it doesn't take its children with it.  The two nodes are detached, swapped, and reattached.  The reason for doing the search instead of just swapping is that there could be a span of nodes with the same frequency.

-=*=-

The reconstruction routine is also pretty simple, though the code is a bit messy.  It starts by taking all of the leaf nodes and collecting them on the first level of the tree.  While we do this, we cut the frequencies in half, taking care that none of them reach zero (this uses the same (val+1)/2 method as last week).  So now we have the entire first level of "freq" and "son" set, and the "prnt" array is pretty much invalid.

The next step is to set the internal nodes while rebuilding the tree according to the frequencies.  The loop uses a trick from StartHuff(), where we step through the kids by two and the internal nodes by one, ending up at the root at the same time.  This uses the fact that the sons of node i are at i/2 and i/2+1.  Anyway.  At each step, we look backward through the tree to see if we already created nodes with higher frequencies.  If we find some (which can be leaves or internal nodes), then we slide all intervening nodes forward, and put the new node into position below them.  This shapes the tree.  Is it the most efficient way to go about it?  Maybe, maybe not; it depends on the hardware.  It's really very simple once you see the algorithm through all the i's, j's, and k's.

The final step in the reconstruction is to set the "prnt" array to reflect the new order.  This just steps through the tree, and sets the parent pointer for the son of each node.  Note that internal nodes have two kids, and leaves have one child (a pointer to the node on the third level).

-=*=-

The assembly code supplied this week began its life as a direct port from the C code, but got rearranged slightly for optimization.  I had to "dumb it down" to make it readable for the class - the file was more than twice as large as it is now because of unrolled loops.  I did leave some of the fancy stuff in, along with some pointers for optimizations.  Still, it'll be easier to understand the C code, so even if you don't know C it might be best to have both in front of you.

One thing I had tried but gave up on was pre-loading the tree with a set of frequencies.  The idea was that, for English text, it would improve compression and reduce the number of adjustments that would have to be made to the tree.  It turns out that this implementation adapts quickly to the input, so very little was gained in terms of compression ratio.
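Going back to update() for a moment, here is a stripped-down C sketch of the increment-and-swap step described above, since it's easier to follow in C.  It reuses the layout from the earlier EncodeChar() sketch and is an illustration of the algorithm, not the distributed source; MAX_FREQ is an assumed name for the "32K or so" cutoff, reconst() stands for the rebuild routine just described, and the two sentinels are assumed to be 0 (in "prnt", marking the root) and a huge value (in freq[T]).

#define T 513                        /* same layout as the earlier sketch     */
#define R (T - 1)
#define MAX_FREQ 0x8000              /* assumed value for the "32K or so" cap */

extern unsigned freq[];              /* same arrays as the earlier sketch     */
extern int prnt[], son[];
extern void reconst(void);           /* the rebuild routine described above   */

void update(int c)
{
    if (freq[R] == MAX_FREQ)         /* the root holds the total frequency    */
        reconst();                   /* halve everything and rebuild          */

    c = prnt[c + T];                 /* start at the leaf for this character  */
    do {
        unsigned k = ++freq[c];      /* (1) bump this node's frequency        */

        if (k > freq[c + 1]) {       /* (2) bigger than our right neighbor?   */
            int l = c + 1;           /* (2a) find the farthest node we now    */
            while (k > freq[l + 1])  /*      exceed; the huge sentinel        */
                l++;                 /*      assumed in freq[T] stops us      */

            /* (2b) swap nodes c and l: frequencies, children, parent links */
            int i = son[c], j = son[l];
            freq[c] = freq[l];
            freq[l] = k;
            prnt[i] = l;
            if (i < T) prnt[i + 1] = l;   /* internal nodes have two kids,    */
            prnt[j] = c;                  /* leaves have only one             */
            if (j < T) prnt[j + 1] = c;
            son[l] = i;
            son[c] = j;
            c = l;                   /* the node now lives at position l      */
        }
    } while ((c = prnt[c]) != 0);    /* (3) continue up the tree; stop at the */
}                                    /*     root's sentinel parent (0 here)   */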
As far as the speed benefit of pre-loading goes, the initial sorting takes about as long as shifting the tree around does, so there was little gain there either.  (This could be fixed by pre-sorting the static frequency table, but you're still doing a reconst() at the start, at 16K, and at 32K, whereas in the normal case you do the first reconstruction at 32K.  That's two extra reconst() calls, which are not trivial.)  In case you want to play with it, I left the code and a table for English in the source file, #ifdef'ed out.

-=*=-

There's a file in the "Test" subdirectory called "Zippy1.1.b".  It's a ShrinkIt archive packed in BinSCII format (which is similar to uuencode or Mac BinHex format).  Basically, BinSCII takes 8-bit binary data and expands it so that the data occupies only 6 bits of every output byte.  This is useful for posting to Usenet or sending through mail systems which don't like 8-bit data (stuff like control characters or even stray spaces may get eaten, destroying the contents of the file).

If you try to pack the file with ShrinkIt, you won't accomplish much.  LZW rarely compresses BinSCII data by more than 10%.  On the other hand, Huffman does fairly well.  Try it and see.  Be sure to check how many bits per symbol the output requires, keeping in mind that BinSCII uses 6 bits out of every byte.

-=O=-  PROGRAM (Splay Tree)

As stated earlier, only C code has been provided; I did, however, include the splay() routine itself in assembly, as "splay.fragment".

There are three arrays, "left", "right", and "up", which contain the left child, right child, and parent pointer for each node.  We have 257 leaves and 256 parents, but the tree is upside down compared to the Huffman trees we've been building: the root is in entry zero, and the leaves begin at 257.  The initialize() routine just sets it up as a fully balanced tree.

The encode() and decode() routines work just like they did for Huffman, except that in this case encode() puts the bits into an actual heap array, since the code size isn't limited to something nice like 16 or 32 bits.  The heap is declared with 514 entries (which corresponds to the size of the "up" array), but I don't think it can go past 257.  The encode loop stops when it hits 0 (the root), and the decode loop stops when it finds a child pointer larger than 257 (a leaf).  This routine also uses an explicit stop character.

-=*=-

The splay() routine itself is a straightforward implementation of the algorithm described earlier.  It's kind of icky and uses 1-character variable names, but it really isn't doing anything fancy.  There may be a more efficient way to implement this, but it translates fairly well to assembly as it is.  The "a = c" at the end is just a polite way to terminate the loop.  Dunno why the author did it that way, but there it is.  The "handle odd node at end" comment doesn't make sense to me, either.

-=O=-  DEEP THOUGHTS

- Last week's semi-adaptive Huffman compressor halved the probabilities whenever the total reached about 4K.  If you have a large file with character distributions which change periodically (like the text/graphics/sound example in today's lesson), will the probabilities accurately reflect the distribution in the file as a whole?

- The loss in speed comes from two problems: (1) updating the tree at each step, and (2) the inability to pre-compute bit patterns for the encoder.  Is it possible to reduce the effects of either or both of these?

- Would the speed be improved by limiting the frequency to 4181 as we did for the static Huffman encoder?  (HINT: what happens when we hit 4181?)
- For what kinds of input would Splay encoding do well?  Will it ever do better than adaptive Huffman?

-=!=-
This document is Copyright by GEnie and Andy McFadden