[fadden-logo]

Back to index


Hacking Data Compression  Lesson 4  Shannon, Fano, and Huffman
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie


Today's lesson babbles on a bit about some theory stuff, describes the
algorithm that Shannon and Fano came up with, and then explains
Huffman's improved algorithm.  It closes with a discussion of whether
or not Huffman encoding is optimal or not.

The code du jour is a semi-adaptive Huffman encoder, which requires a
slight change to the way we've been compressing stuff (need to make
two passes through the file).  Most of this lesson is spent explaining
how the Huffman code works.

(If you've been following the theory side of the course, you may feel
like you've been involved in a ping-pong match, because I keep
bouncing around between models and encoders.  Most texts will start
with basic theory, then describe encoding, and then go on to
higher-order models.  I decided to lay out the lessons according to
algorithmic simplicity instead, which makes sense given that this is
about "hacking" data compression.)

-=O=-     From Probabilities to Bits

The section header (stolen from _Text Compression_ chapter 5) is a
concise description of what "encoding" is all about: taking a set of
probabilities and converting symbols to bits in the output stream.

That's all well and good, but how to go about it?  From what we've
seen so far, we know that we want to output very few bits for probable
symbols and correspondingly more bits for improbable ones.  The
encoding "challenge" is to assign bit sequences which exactly match
the probabilities.

Way back in the '40s, Claude Shannon (often regarded as the father of
modern information theory) began exploring the problem.  He reached
the following conclusion: given a source with entropy H bits/symbol
running through a channel with capacity C bits/second, (1) the
transmission rate cannot exceed C/H symbols/second, and (2) you can
create an encoder which gets arbitrarily close to this limit.

Part 1 should not be a revelation at this point.  If the entropy of a
text file is 7 bits/symbol (using the 7 bit packer from lesson 1), and
you can send at 700bps, you can send at most 100 characters every
second.  Sounds pretty obvious.  What's neat is that it can be proven
mathematically, which means that it is at least POSSIBLE for an
encoder to output data which comes arbitrarily close to the entropy
(i.e. the encoder is optimal).

-=*=-

This nice result doesn't leave us with an algorithm, however.  So,
Shannon proposed the following.  Since it was discovered at roughly
the same time by R.M. Fano, it's known as Shannon-Fano encoding.

     1. List all possible messages, with their probabilities, in order
        of decreasing probability.
     2. Divide the list into two parts of roughly equal probability.
     3. Start the code for those messages in the first part with a 0
        and for those in the second part with a 1.
     4. Continue recursively until each subdivision contains just one
        message.

This is sort of murky, so let's go through an example.  To simplify
things I'm going to use character frequencies instead of symbol
probabilities.

Suppose we have the string "aabbbcddef".  If we did a frequency scan,
we would get:

     'a' - 2
     'b' - 3
     'c' - 1
     'd' - 2
     'e' - 1
     'f' - 1

For step 1, we sort the list:

     'b' - 3
     'a' - 2
     'd' - 2
     'c' - 1
     'e' - 1
     'f' - 1

For step 2, we cut it into two pieces with equal probabilities (here,
equal frequencies):

     'b' - 3
     'a' - 2        total = 5

     'd' - 2
     'c' - 1
     'e' - 1
     'f' - 1        total = 5

For step 3, we assign the first bit of the code:

     'b' - 3   0
     'a' - 2   0

     'd' - 2   1
     'c' - 1   1
     'e' - 1   1
     'f' - 1   1

For step 4, we repeat what we just did, first on the top list, then on
the bottom.  The top list splits once more, but the bottom list splits
into two lists of two items, which will each have to be split.
Eventually we will end up with lists with only one element in them,
like this (note that the binary codes are in ascending order):

     'b' - 3   00
     'a' - 2   01
     'd' - 2   100
     'c' - 1   101
     'e' - 1   110
     'f' - 1   111

As you can see, items with high probabilities tend to end up in a list
by themselves faster than items with low probabilities, because there
are a larger number of low-probability items in the bottom list.
However, it's not perfect, and occasionally it will assign a longer
code to a character with a higher frequency.

You should note that these codes have the property that no code is
found in the prefix of another.  For example, the codes "000" and
"010" would be illegal, because codes "00" and "01" exist.  This is
important so that we know where a code ends when reading a continuous
stream of data... the string "aabbbcddef" would be encoded as
"0101000000101100100110111" (check it and see).  If we allowed both
"00" and "000", we wouldn't know if we should stop after the second
zero or the third.

-=*=-

This scheme does fairly well, but it doesn't get as close to the
entropy of the data as we would like (for that matter, neither does
Huffman).  Still, Shannon was able to prove that you COULD get
arbitrarily close to the entropy with this scheme, by defining things
carefully.  If you've taken enough physics classes, this should sound
familiar.

If you look back at step 1, you'll see that it says to list all
possible "messages".  What's a message?  In our case, it was a single
character.  If you allow a message to be arbitrarily large, then
Shannon-Fano coding will become very close to the entropy.  (Recall
the example where we sent _Neuromancer_ with a single bit.)  In fact,
as the size of a "message" approaches infinity, Shannon-Fano codes
approach the entropy of the source input.

Of course, enumerating an infinite number of messages could take a
while, so the theoretical result is difficult to implement.

-=O=-     Huffman

A little while after Shannon produced these results, D.A. Huffman came
up with an algorithm which didn't have the occasional incorrect code
length problem seen in Shannon-Fano.  The idea is to start with a
couple of codes and expand stuff instead of starting with all the
codes and dividing them up.

The algorithm is:

     1. List all possible messages with their probabilities.
     2. Locate the two messages with the SMALLEST probabilities.
     3. Replace these by a single set containing them both, with a
        new probability equal to the sum of the two probabilities.
     4. Repeat until the list contains only one member.

This looks a lot like the Shannon-Fano method.  Again, let's run
through an example.  Using the same set as last time, the sorted list
after step 1 is:

     'b' - 3
     'a' - 2
     'd' - 2
     'c' - 1
     'e' - 1
     'f' - 1

For step 2, we just grab the two at the bottom of the list, 'e' and
'f'.  For step 3, we combine them and add their probabilities:

     'b' - 3
     'a' - 2
     'd' - 2
     (e f) - 2
     'c' - 1

Note that I'm throwing in step 1 (sorting) as well.  Repeating step 2
and 3, we merge 'c' with (e f) and get:

     'b' - 3
     ((e f) c) - 3
     'a' - 2
     'd' - 2

Note that we are retaining the set boundaries as we go, and that we
aren't generating any codes yet.  By the way, this is easy to do in
Lisp.  Anyway.  Repeating steps 2 and 3, we merge 'a' and 'd':

     (a d) - 4
     'b' - 3
     ((e f) c) - 3

Again we merge the two with the lowest frequencies, 'b' and (c (e f)):

     (b ((e f) c)) - 6
     (a d) - 4

And finally, we merge the two big honking sets:

     ((b ((e f) c)) (a d))

Now we need to assign codes to these.  The way to do this is to set up
a binary tree, with the left element in each set taking the left
branch, and the right element in each set taking the right branch.
Then we just define a left branch as a 0 and a right branch as a 1.
Here's what the tree for this set looks like:


                              (root)
                                 |
                         0       |       1
                +----------------+----------------+
                |                                 |
          0     |     1                        0  |  1
     +----------+----------+                 +----+----+
     |                     |                 |         |
     |                 0   |   1             |         |
     b              +------+-------+         a         d
                    |              |
                 0  |  1           |
               +----+----+         c
               |         |
               |         |
               e         f

Hope that's reasonably clear.  (I tried it with diagonal lines but it
got ugly.)  By traversing the tree, we see that the codes are:

     'b' - 3   00             Shannon-Fano:  00
     'a' - 2   10                            01
     'd' - 2   11                            100
     'c' - 1   011                           101
     'e' - 1   0100                          110
     'f' - 1   0101                          111

Comparing the two sets doesn't tell us all that much, but it's
interesting to note the differences.  It turns out in this particular
case that the results will be identical, but in general Huffman will
do better.

-=*=-

Huffman's algorithm generates "minimum-redundancy" codes, which
produce the shortest average code length for a given probability
distribution.  This suggests that Huffman coding is optimal.  However,
it's only optimal in the same conditions as Shannon-Fano: as the size
of the messages approaches infinity, Huffman encoding approaches the
entropy of the source.

At a less mathematical plane of reasoning, you can see that there is
one basic flaw in both of these schemes: they always output an
integral number of bits.  This means that, no matter how accurate our
probabilities are, we are limited by the fact that the output must
either be n bits or n+1 bits.  In effect, we are truncating the
probabilities to the nearest power of 2.

A related problem is that these schemes ALWAYS output at least 1 bit
per symbol (because we round UP at 0).  No matter how probable an
event, it will always get rounded to 0.5 (0 or 1 is a 50/50 chance)
unless it is the ONLY event.  Our intuitive notion of encoding
suggests that there is no way around this... after all, how can you
encode something in less than one bit?

Arithmetic coding can.  One semi-amazing demonstration shows how you
can compress several bytes into a single bit.  Stay tuned for
lesson 6.

-=O=-     PROGRAM

There are a remarkable variety of ways to implement Huffman encoding.
The one I'm going to present is from Sedgewick's _Algorithms in C_ (a
book which holds most every algorithm I've ever heard of).  There are
different ways of doing the data structures, different ways of
searching and building trees, and a variety of methods for storing the
tree in the output.  Before we get into a description of it, there are
some implementation issues which need to be considered.

First of all, this is a practical compressor, not simply an encoder,
so we need a model to drive it.  For our purposes, we will use a
character-level order-0 model (i.e. a character frequency count) to
define the probabilities.  (Aside: think about how much sense this
paragraph would have made to you four weeks ago!)

This presents a problem when fitting the model in with the code from
previous lessons.  Before we can compress the file, we need to know
the probability distribution of the characters in that file.  Which
means we have to read it once to get the distribution, and then a
second time to do the compression.  This requires a minor modification
to main.asm and cmain.c, and a new routine (called "scan") in the
compression source file.

This will be a semi-adaptive compressor, meaning that the compression
statistics are generated from the file to be compressed, and are sent
along to the decompressor.  Thus we will have to send the Huffman
table before the data itself, or enough data to reconstruct it.

We need some way of knowing when to stop.  We can use the length of
the source file or a distinguished stop symbol (e.g. $100).  I chose
to use a stop symbol because it illustrates the fact that Huffman is
not restricted to encoding bytes.

Since we want this to be fairly efficient, we would prefer to compute
the bit patterns for each symbol ahead of time.  We'd like to be able
to store these in a 16-bit byte for efficiency, but the codes can be
much longer (consider a fully degenerate Huffman tree... the length of
the code is equal to the depth in the tree).

Finally, for clarity in the assembly program and efficiency in both,
I'm representing the binary trees in a flat array.  The children of
node n are n*2 and n*2 +1.  The frequencies will be stored in one
array, the pointers to parents in another, and the pointers to
children in a third.

-=*=-

The first order of business is getting the frequency counts.  scan()
does the obvious and maintains a table of frequencies with one entry
for each character.  To prevent the values from overflowing, when a
frequency reaches MAX_FREQ we cut all the frequencies in half.

It should be noted that, since we have full knowledge of the input
file, we don't suffer from the zero-frequency problem.  If a character
never appears, we assign it a probability of zero and don't have to
worry about it showing up.  Of course, we have to make sure that every
symbol which appears has a frequency of at least one.  For this
reason, we increment each frequency before we cut it in half.

Note that there is a relationship between the frequencies and the
height of the tree.  Picture for a moment what a fully degenerate
Huffman tree would look like, starting from the bottom.  Give the two
leaves a frequency of 1 each.  This means the parent has frequency=2,
and the parent's brother (which is a leaf) must have a frequency of at
least 1.  The next parent up the tree has frequency 2+1=3, and HIS
brother has a frequency of at least 2.  (If it were 1, then it
would've been swapped with the node below with frequency=2).  So far
we have a tree which looks like this:

           5
          / \
         3   2
        / \
       2   1
      / \
     1   1

If you look at the pattern of the internal nodes, you will see that it
looks like this: 1 2 3 5.  This doesn't look like much, but if you
continue it you will get this:

     1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584   4181

If you've been out of college for too long (or never been), you might
not recognize it.  It's called a Fibonacci sequence.  Each entry n is
equal to (n-1) + (n-2), usually starting with "1 1".  This means that
a tree of height 17 - which requires 17 bits per output code - cannot
be built unless the sum of all frequencies is greater than or equal to
4181.  (When computing the number of bits, count the TRANSITIONS
between nodes, not the nodes themselves.  Bits are output when you
move between two nodes in the tree.)

So we have two choices: we can scale the frequencies so that their sum
is less than 4181, or we can just output the codes on a bit-by-bit
basis by traversing the tree.  Scaling the frequencies is annoying and
results in a less accurate model, but the whole encoding process is
greatly simplified if there is SOME limit.  For simplicity's sake, I
decided to scale it whenever the sum reaches 4181.

-=*=-

Before scan() returns, it has one more task to do.  The decoder will
need to have an exact replica of the tree we will be building, so we
need to put some information at the head of the compressed output.
There are two basic approaches.

The first is the simpler, which I'm going to use.  Just output the
table of frequencies, and have the decoder build the tree using the
same routines that the encoder uses.  This is very simple and
straightforward, but always uses 512 bytes (assuming 16-bit
frequencies).  You can reduce this with an idea from _The Data
Compression Book_, by doing an RLE-like encoding on it to remove
entries with a frequency of zero.  Of course, if the data uses all 256
values, this gains you nothing.

The second is to output some representation of the tree we're about to
build (note that it helps to build it first and output it afterward).
One technique is to output a list of the characters followed by a
count of the number of nodes in the tree at each level.  It takes a
bit of fancy footwork, but you can reconstruct the exact same tree if
it's done correctly.  Another is to actually dump the tree (like the
"child" array), which can save space if there are lots of
zero-frequency characters (like in text files).

Implementing a more efficient tree storage technique is left as an
exercise for the reader.  The code is complicated enough as it is.

-=*=-

Once we have the frequency table, the main program will rewind the
input file and call the encode() routine.  Our first task is to
construct a Huffman tree and then compute the bit patterns to use for
each character.

If you aren't looking at the code already, I suggest you read through
it while reading this.  It's also interesting to compare this to Mark
Nelson's code in _The Data Compression Book_; he uses some algorithms
which are easier to understand but considerably less efficient.

For example, we need to pull the two smallest elements out of the
array at each step.  Nelson uses a linear search, which has a worst
case of 128 * 256 operations (O(N^2)).  I use Sedgewick's heap-based
algorithm, which runs in O(N logN) time, or 8*256.  On the other hand,
each step is slightly more expensive in my implementation, so his may
be faster when the number of distinct characters is small.

Anyway.  If you don't know or don't remember, a "heap" is a data
structure with a number of interesting properties.  The property we're
interested in is that the smallest (or largest) element in the heap is
always at the top.  We're actually going to use an "indirect heap"
which works by maintaining indices to each element, so instead of
shuffling elements around we just shuffle the pointers.  The routine
"pqdownheap()" from Sedgewick re-heaps an array of elements after a
single change; we can also use it to build the heap by considering the
construction as a series of one-item changes.

The first step in build_tree() is to initialize the "heap" array to
contain all the non-zero-frequency characters.  Then we repeatedly
call pqdownheap() on each entry, starting from the end of the array
and working toward the front.  (This comes straight out of _Algorithms
in C_.)

Then comes the amusing part.  We pull the item off the top of the heap
(element 1), and then re-heapify the heap by taking an item from the
end and letting it fall down.  If you're not familiar with heaps, just
take my word for it, it works.  We then take that node and the node
which moved to the top of the heap, combine their frequencies, and
store it in the internal node section of the "freq" array.  Then we
set the "dad" pointers of the two nodes to point at the new node, and
set the child pointers of the new node to point back.  The left child
gets a positive value in the dad array, and the right child gets a
negative value; this allows us to determine if a node is a left child
or a right child just by checking the sign on its "dad" pointer.  Once
this is done, we replace the second node (the one currently at the top
of the heap) with the new internal node, and reheapify.

So, after all this mess, we have replaced the two smallest nodes with
a single node which points at them.

Note that the "dad" pointers are set to "(value) + NUM_SYM-1".  The
idea here is that the leaves run from 0 to NUM_SYM-1, and the internal
nodes run from NUM_SYM to 2*NUM_SYM-1.  Since Sedgewick does the heap
stuff starting from 1 instead of 0, we have to add NUM_SYM-1 instead
of NUM_SYM.

There will be 255 internal nodes for 256 leaves.  This can be seen by
remembering that we are creating a binary tree, so the 256 leaves have
128 parents, which have 64 parents, and so on.  (Since we need a
distinguished stop symbol, we actually have 257 leaves.)

When we are encoding, we need to start at a leaf (the one
corresponding to the input character) and work toward the root.  While
decoding, we need to do the opposite; since we don't know what's in
the input, we need to start at the root and work our way down.  This
means that the "dad" pointers are only used when encoding, and the
"child" pointers are only used when decoding.  To save space,
build_tree() creates both.

-=*=-

Okay, now we have the tree.  If you want to see what it looks like,
remove the comments from the #ifdef DEBUG statements in the C program.
(Doesn't help you if you don't have a C compiler, but fear not, it
isn't all that interesting anyway.)

(In case you missed it, the form of the tree is entirely represented
by either "dad" or "child".  The internal nodes point at the leaves,
and the leaves point at the frequency table.  Since the offset in the
frequency table is equal to the character, we could also say that the
leaves hold the character itself.)

Since we're encoding, we want to compute the bit representations of
every character.  We need two values, a bit pattern and a length.  We
get them by calling compute_bits(), which walks up the tree once for
every character.  It should be fairly clear what the routine is doing
by looking at the C version.  What is important to know is that we are
essentially "stacking" the bits in last in/first out order.

We have to stack the bits because we are traversing the tree from
bottom to top, but the decoder will run from top to bottom.  So, at
each step, we shift the bits to the right, and then set the high bit
(which is on the left).  When we output the codes, we will output them
from left to right, so it is useful to have them lined up with the
left edge of the 16-bit word.

Anyway.  The code[] and len[] arrays get filled in here.  If going
from the leaf to the root requires moving left twice and then right
twice, then going from the root to the leaf would require moving right
twice and then left twice, so the output would be 1100, or $c000, with
a length of 4 bits.

(If you have _The Data Compression Book_, take a look at how he
computes the bit patterns.  Continue with the lesson after you
recover.  The interesting thing about his solution is that it computes
the bit patterns in the opposite order by descending the tree;
unfortunately this requires a recursive search.)

-=*=-

Now we're ready to go.  The main loop of encode() is absurdly simple
by comparison; just read a character and call encode_char().  When EOF
is reached, it encodes the distinguished stop symbol (END_CHAR), and
then tells encode_char() to flush any bits it has left over.

encode_char()'s job is to output the bits for the character passed to
it.  Since the code for a single character can be up to 16 bits long,
and it has to wait until it has 8 bits to output a byte, it will have
at most 23 bits to deal with at a time.  This fits nicely in a long
(4-byte) integer.  This routine is nothing more than a slightly
expanded version of the 8->7 encoder, so if it looks confusing, take a
look back at lesson 1.

The comment about using an array of characters in the C code means
doing something like:

     union {
          struct { unsigned char b0, b1, b2, b3 } bytes;
          unsigned long longword;
     } outbuf;

Then we could output the leftmost byte of "outbuf" by referencing
outbuf.bytes.b3.  However, this is byte-order dependent; most
architectures would use byte 0 instead of byte 3.

-=*=-

So that's it for encoding.  We send the frequency table, unleash a
flood of bits, send the stop symbol, and we're done.  Note that small
files will show negative compression because of the 512-byte frequency
table.  (And yes, I did fix main.asm to handle it correctly.)

-=*=-

Decoding starts out in a similar way.  We fill the frequency table,
but this time by reading from the compressed file.  decode() calls
build_tree() just like encode() does, but it doesn't call
compute_bits().  (Note that if you are drowning in free memory, you
could create a table with 65536 entries - one for each possible bit
pattern - and just do the decoding as a table lookup.)

Once the tree is built, we go right into it.  We grab the input 8 bits
at a time, but logically we are just considering one bit at a time.
At each step, we use the 0 or 1 to decide if we should go left or
right.

The entries in the child[] array are arranged so that the children for
node 10 are found in entries 20 and 21 (10*2 for even (0), 10*2+1 for
odd (1)).  If the entry is less than NUM_SYM, then that child is a
leaf, and the character to output is the value.  If it's greater than
or equal to NUM_SYM, then the child is another internal node.  Since
child[]'s internal nodes start from zero instead of NUM_SYM, we have
to subtract NUM_SYM from the value.

If we find a child, we output the character and reset the pointer to
the top of the tree.  If we find an internal node, we move the pointer
to that node.  In either case, we shift the byte of input one place to
the left, and get another when it's empty.

We continue in this manner until we find the stop symbol.  As soon as
we get it, we know that we are completely done, and can return.  If we
reach the end of the file before seeing the stop symbol, then the file
must be damaged, so we return an error.

-=*=-

There is one MAJOR difference between the C and assembly versions.
There is a great deal of manipulation of arrays with pointers
implemented as 16-bit array indices.  The heap points at element 5 in
the frequency array by storing "5".  Since the offset to the 5th
element of an array is 10 bytes from the start, it's necessary to
double the pointer to get the offset.

The C compiler automatically generates this for you.  When writing
optimized assembly, it's nice to avoid all the shifts.  So, the
"heap", "dad", and "child" arrays all use OFFSETS instead of INDICES.
Since it's done everywhere, it shouldn't be confusing.

While the representation of the arrays is different, the algorithms
themselves are not, and you should get identical results with either.

-=*=-

Depending on the implementation, encoding can be faster than decoding,
because decoding works one bit at a time while encoding outputs
several bits at each step.  Using a decode table with 64K entries (or
even a large hash table with probing) would accelerate the process for
a full tree.

Implementing Huffman is a remarkably complicated undertaking
considering how simple the algorithm is.  Most of the code is used for
generating the tree; the actual encoding and decoding process is
relatively straightforward.  I hope the heap stuff isn't too
bizarre... it's very common, so you might as well get used to it.

This can easily be converted to a static Huffman encoder by removing
the scan pass and replacing the freq[] array with a set of static
values.  If you know what kind of data you will be compressing (e.g.
English text), you can compute the frequency table once and store it.
So long as the encoder and decoder have the same thing, you can skip
not only the scan pass but the transmission of the table as well.

There's no need to limit the routine to a single static table, either.
You could have a collection of tables, perhaps selecting them based on
the type of file to be compressed.  The table index could be
transmitted instead of the whole table.

Perhaps the best choice when the type of data is not known is an
adaptive Huffman encoder, which adjusts its tree after each character.
It tends to be much slower, but it avoids all the hassles involved
with scanning the file and transmitting the tree.  Adaptive Huffman
will be presented next week.

-=O=-     DEEP THOUGHTS

- What causes Shannon-Fano to occasionally use the "wrong" code?

- The UNIX pack(1) command implements semi-adaptive Huffman like we
did in this lesson.  Probabilities are stored as 32-bit integers, and
the output codes are limited to 24 bits.  A comment in the source code
claims that the algorithm will work on any file up to 2^24 bytes long.
Was the author correct?  If not, how large is the smallest file which
will break the UNIX pack(1) command?

- I used $100 as the distinguished stop symbol.  This is also what the
assembly getc() routine returns for EOF.  Is there a conflict?

- Given that we have 257 symbols in the input, why is the frequency
table only 512 bytes?  Shouldn't it be 514?

- I mentioned that you could implement the Huffman decoder as a lookup
table.  How is this possible considering that you don't know the
length of the next code?

- Would a Shannon-Fano encoder be less complicated than the Huffman
encoder?

- The old BLU utility used SQueeze compression (SQ3/USQ2), which was a
simultaneous RLE and Huffman pass (i.e. it did the RLE and passed the
results straight to Huffman; the cute part was doing the RLE during
the frequency scan).  If you have SQ3, pack a file with RLE and then
with Huffman, and compare the results.  If I understand the standard
correctly, this is pretty much what MNP5 does to your modem
transmissions.

-=!=-

This document is Copyright by Genie and Andy McFadden