Hacking Data Compression  Lesson 4  Shannon, Fano, and Huffman
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie

Today's lesson babbles on a bit about some theory stuff, describes the algorithm that Shannon and Fano came up with, and then explains Huffman's improved algorithm. It closes with a discussion of whether or not Huffman encoding is optimal.

The code du jour is a semi-adaptive Huffman encoder, which requires a slight change to the way we've been compressing stuff (need to make two passes through the file). Most of this lesson is spent explaining how the Huffman code works.

(If you've been following the theory side of the course, you may feel like you've been involved in a ping-pong match, because I keep bouncing around between models and encoders. Most texts will start with basic theory, then describe encoding, and then go on to higher-order models. I decided to lay out the lessons according to algorithmic simplicity instead, which makes sense given that this is about "hacking" data compression.)

-=O=-

From Probabilities to Bits

The section header (stolen from _Text Compression_ chapter 5) is a concise description of what "encoding" is all about: taking a set of probabilities and converting symbols to bits in the output stream. That's all well and good, but how to go about it?

From what we've seen so far, we know that we want to output very few bits for probable symbols and correspondingly more bits for improbable ones. The encoding "challenge" is to assign bit sequences which exactly match the probabilities.

Way back in the '40s, Claude Shannon (often regarded as the father of modern information theory) began exploring the problem. He reached the following conclusion: given a source with entropy H bits/symbol running through a channel with capacity C bits/second, (1) the transmission rate cannot exceed C/H symbols/second, and (2) you can create an encoder which gets arbitrarily close to this limit.
Part 1 should not be a revelation at this point. If the entropy of a text file is 7 bits/symbol (using the 7-bit packer from lesson 1), and you can send at 700bps, you can send at most 100 characters every second. Sounds pretty obvious. What's neat is that it can be proven mathematically, which means that it is at least POSSIBLE for an encoder to output data which comes arbitrarily close to the entropy (i.e. the encoder is optimal).

-=*=-

This nice result doesn't leave us with an algorithm, however. So, Shannon proposed the following. Since it was discovered at roughly the same time by R.M. Fano, it's known as Shannon-Fano encoding.

1. List all possible messages, with their probabilities, in order of decreasing probability.

2. Divide the list into two parts of roughly equal probability.

3. Start the code for those messages in the first part with a 0 and for those in the second part with a 1.

4. Continue recursively until each subdivision contains just one message.

This is sort of murky, so let's go through an example. To simplify things I'm going to use character frequencies instead of symbol probabilities. Suppose we have the string "aabbbcddef". If we did a frequency scan, we would get:

  'a' - 2
  'b' - 3
  'c' - 1
  'd' - 2
  'e' - 1
  'f' - 1

For step 1, we sort the list:

  'b' - 3
  'a' - 2
  'd' - 2
  'c' - 1
  'e' - 1
  'f' - 1

For step 2, we cut it into two pieces with equal probabilities (here, equal frequencies):

  'b' - 3
  'a' - 2     total = 5

  'd' - 2
  'c' - 1
  'e' - 1
  'f' - 1     total = 5

For step 3, we assign the first bit of the code:

  'b' - 3     0
  'a' - 2     0

  'd' - 2     1
  'c' - 1     1
  'e' - 1     1
  'f' - 1     1

For step 4, we repeat what we just did, first on the top list, then on the bottom. The top list splits once more, but the bottom list splits into two lists of two items, which will each have to be split.
Eventually we will end up with lists with only one element in them, like this (note that the binary codes are in ascending order):

  'b' - 3     00
  'a' - 2     01
  'd' - 2     100
  'c' - 1     101
  'e' - 1     110
  'f' - 1     111

As you can see, items with high probabilities tend to end up in a list by themselves faster than items with low probabilities, because there are a larger number of low-probability items in the bottom list. However, it's not perfect, and occasionally it will assign a longer code to a character with a higher frequency.

You should note that these codes have the property that no code is a prefix of another. For example, the codes "000" and "010" would be illegal, because codes "00" and "01" exist. This is important so that we know where a code ends when reading a continuous stream of data... the string "aabbbcddef" would be encoded as "0101000000101100100110111" (check it and see). If we allowed both "00" and "000", we wouldn't know if we should stop after the second zero or the third.

-=*=-

This scheme does fairly well, but it doesn't get as close to the entropy of the data as we would like (for that matter, neither does Huffman). Still, Shannon was able to prove that you COULD get arbitrarily close to the entropy with this scheme, by defining things carefully. If you've taken enough physics classes, this should sound familiar.

If you look back at step 1, you'll see that it says to list all possible "messages". What's a message? In our case, it was a single character. If you allow a message to be arbitrarily large, then Shannon-Fano coding will come very close to the entropy. (Recall the example where we sent _Neuromancer_ with a single bit.) In fact, as the size of a "message" approaches infinity, Shannon-Fano codes approach the entropy of the source input. Of course, enumerating an infinite number of messages could take a while, so the theoretical result is difficult to implement.

-=O=-

Huffman

A little while after Shannon produced these results, D.A.
Huffman came up with an algorithm which didn't have the occasional incorrect code length problem seen in Shannon-Fano. The idea is to start with a couple of codes and expand stuff instead of starting with all the codes and dividing them up. The algorithm is:

1. List all possible messages with their probabilities.

2. Locate the two messages with the SMALLEST probabilities.

3. Replace these by a single set containing them both, with a new probability equal to the sum of the two probabilities.

4. Repeat until the list contains only one member.

This looks a lot like the Shannon-Fano method. Again, let's run through an example. Using the same set as last time, the sorted list after step 1 is:

  'b' - 3
  'a' - 2
  'd' - 2
  'c' - 1
  'e' - 1
  'f' - 1

For step 2, we just grab the two at the bottom of the list, 'e' and 'f'. For step 3, we combine them and add their probabilities:

  'b' - 3
  'a' - 2
  'd' - 2
  (e f) - 2
  'c' - 1

Note that I'm throwing in step 1 (sorting) as well. Repeating steps 2 and 3, we merge 'c' with (e f) and get:

  'b' - 3
  ((e f) c) - 3
  'a' - 2
  'd' - 2

Note that we are retaining the set boundaries as we go, and that we aren't generating any codes yet. By the way, this is easy to do in Lisp. Anyway. Repeating steps 2 and 3, we merge 'a' and 'd':

  (a d) - 4
  'b' - 3
  ((e f) c) - 3

Again we merge the two with the lowest frequencies, 'b' and ((e f) c):

  (b ((e f) c)) - 6
  (a d) - 4

And finally, we merge the two big honking sets:

  ((b ((e f) c)) (a d))

Now we need to assign codes to these. The way to do this is to set up a binary tree, with the left element in each set taking the left branch, and the right element in each set taking the right branch. Then we just define a left branch as a 0 and a right branch as a 1. Here's what the tree for this set looks like:

                      (root)
                         |
                0        |        1
          +--------------+--------------+
          |                             |
       0  |  1                       0  |  1
    +-----+-----+                    +--+--+
    |           |                    |     |
    b        0  |  1                 a     d
            +---+---+
            |       |
         0  |  1    c
         +--+--+
         |     |
         e     f

Hope that's reasonably clear.
(I tried it with diagonal lines but it got ugly.) By traversing the tree, we see that the codes are:

              Huffman     Shannon-Fano
  'b' - 3     00          00
  'a' - 2     10          01
  'd' - 2     11          100
  'c' - 1     011         101
  'e' - 1     0100        110
  'f' - 1     0101        111

Comparing the two sets doesn't tell us all that much, but it's interesting to note the differences. It turns out that in this particular case the total output size will be identical (25 bits either way), but in general Huffman will do better.

-=*=-

Huffman's algorithm generates "minimum-redundancy" codes, which produce the shortest average code length for a given probability distribution. This suggests that Huffman coding is optimal. However, it's only optimal under the same conditions as Shannon-Fano: as the size of the messages approaches infinity, Huffman encoding approaches the entropy of the source.

On a less mathematical plane, you can see that there is one basic flaw in both of these schemes: they always output an integral number of bits. This means that, no matter how accurate our probabilities are, we are limited by the fact that the output must either be n bits or n+1 bits. In effect, we are rounding each symbol's probability to the nearest power of 1/2.

A related problem is that these schemes ALWAYS output at least 1 bit per symbol (because we round UP at 0). No matter how probable an event, its probability gets treated as 0.5 (a 0 or a 1 is a 50/50 chance) unless it is the ONLY event. Our intuitive notion of encoding suggests that there is no way around this... after all, how can you encode something in less than one bit?

Arithmetic coding can. One semi-amazing demonstration shows how you can compress several bytes into a single bit. Stay tuned for lesson 6.

-=O=-

PROGRAM

There is a remarkable variety of ways to implement Huffman encoding. The one I'm going to present is from Sedgewick's _Algorithms in C_ (a book which holds most every algorithm I've ever heard of).
There are different ways of doing the data structures, different ways of searching and building trees, and a variety of methods for storing the tree in the output. Before we get into a description of it, there are some implementation issues which need to be considered.

First of all, this is a practical compressor, not simply an encoder, so we need a model to drive it. For our purposes, we will use a character-level order-0 model (i.e. a character frequency count) to define the probabilities. (Aside: think about how much sense this paragraph would have made to you four weeks ago!)

This presents a problem when fitting the model in with the code from previous lessons. Before we can compress the file, we need to know the probability distribution of the characters in that file. Which means we have to read it once to get the distribution, and then a second time to do the compression. This requires a minor modification to main.asm and cmain.c, and a new routine (called "scan") in the compression source file.

This will be a semi-adaptive compressor, meaning that the compression statistics are generated from the file to be compressed, and are sent along to the decompressor. Thus we will have to send the Huffman table before the data itself, or enough data to reconstruct it.

We need some way of knowing when to stop. We can use the length of the source file or a distinguished stop symbol (e.g. $100). I chose to use a stop symbol because it illustrates the fact that Huffman is not restricted to encoding bytes.

Since we want this to be fairly efficient, we would prefer to compute the bit patterns for each symbol ahead of time. We'd like to be able to store these in a 16-bit word for efficiency, but the codes can be much longer (consider a fully degenerate Huffman tree... the length of the code is equal to the depth in the tree).

Finally, for clarity in the assembly program and efficiency in both, I'm representing the binary trees in a flat array.
The children of node n are n*2 and n*2 + 1. The frequencies will be stored in one array, the pointers to parents in another, and the pointers to children in a third.

-=*=-

The first order of business is getting the frequency counts. scan() does the obvious and maintains a table of frequencies with one entry for each character. To prevent the values from overflowing, when a frequency reaches MAX_FREQ we cut all the frequencies in half.

It should be noted that, since we have full knowledge of the input file, we don't suffer from the zero-frequency problem. If a character never appears, we assign it a probability of zero and don't have to worry about it showing up. Of course, we have to make sure that every symbol which appears has a frequency of at least one. For this reason, we increment each frequency before we cut it in half.

Note that there is a relationship between the frequencies and the height of the tree. Picture for a moment what a fully degenerate Huffman tree would look like, starting from the bottom. Give the two leaves a frequency of 1 each. This means the parent has frequency=2, and the parent's brother (which is a leaf) must have a frequency of at least 1. The next parent up the tree has frequency 2+1=3, and HIS brother has a frequency of at least 2. (If it were 1, then it would've been swapped with the node below with frequency=2). So far we have a tree which looks like this:

        5
       / \
      3   2
     / \
    2   1
   / \
  1   1

If you look at the pattern of the internal nodes, you will see that it looks like this: 1 2 3 5. This doesn't look like much, but if you continue it you will get this:

  1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181

If you've been out of college for too long (or never been), you might not recognize it. It's called a Fibonacci sequence. Each entry is the sum of the two entries before it, usually starting with "1 1".
This means that a tree of height 17 - which requires 17 bits per output code - cannot be built unless the sum of all frequencies is greater than or equal to 4181. (When computing the number of bits, count the TRANSITIONS between nodes, not the nodes themselves. Bits are output when you move between two nodes in the tree.)

So we have two choices: we can scale the frequencies so that their sum is less than 4181, or we can just output the codes on a bit-by-bit basis by traversing the tree. Scaling the frequencies is annoying and results in a less accurate model, but the whole encoding process is greatly simplified if there is SOME limit. For simplicity's sake, I decided to scale it whenever the sum reaches 4181.

-=*=-

Before scan() returns, it has one more task to do. The decoder will need to have an exact replica of the tree we will be building, so we need to put some information at the head of the compressed output. There are two basic approaches.

The first is the simpler, which I'm going to use. Just output the table of frequencies, and have the decoder build the tree using the same routines that the encoder uses. This is very simple and straightforward, but always uses 512 bytes (assuming 16-bit frequencies). You can reduce this with an idea from _The Data Compression Book_, by doing an RLE-like encoding on it to remove entries with a frequency of zero. Of course, if the data uses all 256 values, this gains you nothing.

The second is to output some representation of the tree we're about to build (note that it helps to build it first and output it afterward). One technique is to output a list of the characters followed by a count of the number of nodes in the tree at each level. It takes a bit of fancy footwork, but you can reconstruct the exact same tree if it's done correctly. Another is to actually dump the tree (like the "child" array), which can save space if there are lots of zero-frequency characters (like in text files).
Implementing a more efficient tree storage technique is left as an exercise for the reader. The code is complicated enough as it is.

-=*=-

Once we have the frequency table, the main program will rewind the input file and call the encode() routine. Our first task is to construct a Huffman tree and then compute the bit patterns to use for each character. If you aren't looking at the code already, I suggest you read through it while reading this.

It's also interesting to compare this to Mark Nelson's code in _The Data Compression Book_; he uses some algorithms which are easier to understand but considerably less efficient. For example, we need to pull the two smallest elements out of the array at each step. Nelson uses a linear search, which has a worst case of 128 * 256 operations (O(N^2)). I use Sedgewick's heap-based algorithm, which runs in O(N logN) time, or 8*256. On the other hand, each step is slightly more expensive in my implementation, so his may be faster when the number of distinct characters is small.

Anyway. If you don't know or don't remember, a "heap" is a data structure with a number of interesting properties. The property we're interested in is that the smallest (or largest) element in the heap is always at the top. We're actually going to use an "indirect heap" which works by maintaining indices to each element, so instead of shuffling elements around we just shuffle the pointers.

The routine "pqdownheap()" from Sedgewick re-heaps an array of elements after a single change; we can also use it to build the heap by considering the construction as a series of one-item changes. The first step in build_tree() is to initialize the "heap" array to contain all the non-zero-frequency characters. Then we repeatedly call pqdownheap() on each entry, starting from the end of the array and working toward the front. (This comes straight out of _Algorithms in C_.)

Then comes the amusing part.
We pull the item off the top of the heap (element 1), and then re-heapify the heap by taking an item from the end and letting it fall down. If you're not familiar with heaps, just take my word for it, it works.

We then take that node and the node which moved to the top of the heap, combine their frequencies, and store it in the internal node section of the "freq" array. Then we set the "dad" pointers of the two nodes to point at the new node, and set the child pointers of the new node to point back. The left child gets a positive value in the dad array, and the right child gets a negative value; this allows us to determine if a node is a left child or a right child just by checking the sign on its "dad" pointer. Once this is done, we replace the second node (the one currently at the top of the heap) with the new internal node, and reheapify. So, after all this mess, we have replaced the two smallest nodes with a single node which points at them.

Note that the "dad" pointers are set to "(value) + NUM_SYM-1". The idea here is that the leaves run from 0 to NUM_SYM-1, and the internal nodes run from NUM_SYM to 2*NUM_SYM-1. Since Sedgewick does the heap stuff starting from 1 instead of 0, we have to add NUM_SYM-1 instead of NUM_SYM.

There will be 255 internal nodes for 256 leaves. This can be seen by remembering that we are creating a binary tree, so the 256 leaves have 128 parents, which have 64 parents, and so on. (Since we need a distinguished stop symbol, we actually have 257 leaves.)

When we are encoding, we need to start at a leaf (the one corresponding to the input character) and work toward the root. While decoding, we need to do the opposite; since we don't know what's in the input, we need to start at the root and work our way down. This means that the "dad" pointers are only used when encoding, and the "child" pointers are only used when decoding. To save space, build_tree() creates both.

-=*=-

Okay, now we have the tree.
If you want to see what it looks like, remove the comments from the #ifdef DEBUG statements in the C program. (Doesn't help you if you don't have a C compiler, but fear not, it isn't all that interesting anyway.)

(In case you missed it, the form of the tree is entirely represented by either "dad" or "child". The internal nodes point at the leaves, and the leaves point at the frequency table. Since the offset in the frequency table is equal to the character, we could also say that the leaves hold the character itself.)

Since we're encoding, we want to compute the bit representations of every character. We need two values, a bit pattern and a length. We get them by calling compute_bits(), which walks up the tree once for every character. It should be fairly clear what the routine is doing by looking at the C version. What is important to know is that we are essentially "stacking" the bits in last in/first out order. We have to stack the bits because we are traversing the tree from bottom to top, but the decoder will run from top to bottom. So, at each step, we shift the bits to the right, and then set the high bit (which is on the left) if the node is a right child. When we output the codes, we will output them from left to right, so it is useful to have them lined up with the left edge of the 16-bit word. Anyway. The code[] and len[] arrays get filled in here.

If going from the leaf to the root requires moving left twice and then right twice, then going from the root to the leaf would require moving right twice and then left twice, so the output would be 1100, or $c000, with a length of 4 bits.

(If you have _The Data Compression Book_, take a look at how he computes the bit patterns. Continue with the lesson after you recover. The interesting thing about his solution is that it computes the bit patterns in the opposite order by descending the tree; unfortunately this requires a recursive search.)

-=*=-

Now we're ready to go.
The main loop of encode() is absurdly simple by comparison; just read a character and call encode_char(). When EOF is reached, it encodes the distinguished stop symbol (END_CHAR), and then tells encode_char() to flush any bits it has left over.

encode_char()'s job is to output the bits for the character passed to it. Since the code for a single character can be up to 16 bits long, and it has to wait until it has 8 bits to output a byte, it will have at most 23 bits to deal with at a time. This fits nicely in a long (4-byte) integer. This routine is nothing more than a slightly expanded version of the 8->7 encoder, so if it looks confusing, take a look back at lesson 1.

The comment about using an array of characters in the C code means doing something like:

  union {
      struct {
          unsigned char b0, b1, b2, b3;
      } bytes;
      unsigned long longword;
  } outbuf;

Then we could output the leftmost byte of "outbuf" by referencing outbuf.bytes.b3. However, this is byte-order dependent; a big-endian architecture would use byte 0 instead of byte 3.

-=*=-

So that's it for encoding. We send the frequency table, unleash a flood of bits, send the stop symbol, and we're done. Note that small files will show negative compression because of the 512-byte frequency table. (And yes, I did fix main.asm to handle it correctly.)

-=*=-

Decoding starts out in a similar way. We fill the frequency table, but this time by reading from the compressed file. decode() calls build_tree() just like encode() does, but it doesn't call compute_bits(). (Note that if you are drowning in free memory, you could create a table with 65536 entries - one for each possible bit pattern - and just do the decoding as a table lookup.)

Once the tree is built, we go right into it. We grab the input 8 bits at a time, but logically we are just considering one bit at a time. At each step, we use the 0 or 1 to decide if we should go left or right.
The entries in the child[] array are arranged so that the children for node 10 are found in entries 20 and 21 (10*2 for even (0), 10*2+1 for odd (1)). If the entry is less than NUM_SYM, then that child is a leaf, and the character to output is the value. If it's greater than or equal to NUM_SYM, then the child is another internal node. Since child[]'s internal nodes start from zero instead of NUM_SYM, we have to subtract NUM_SYM from the value.

If we find a leaf, we output the character and reset the pointer to the top of the tree. If we find an internal node, we move the pointer to that node. In either case, we shift the byte of input one place to the left, and get another when it's empty.

We continue in this manner until we find the stop symbol. As soon as we get it, we know that we are completely done, and can return. If we reach the end of the file before seeing the stop symbol, then the file must be damaged, so we return an error.

-=*=-

There is one MAJOR difference between the C and assembly versions. There is a great deal of manipulation of arrays with pointers implemented as 16-bit array indices. The heap points at element 5 in the frequency array by storing "5". Since the offset to the 5th element of an array is 10 bytes from the start, it's necessary to double the pointer to get the offset. The C compiler automatically generates this for you. When writing optimized assembly, it's nice to avoid all the shifts. So, the "heap", "dad", and "child" arrays all use OFFSETS instead of INDICES. Since it's done everywhere, it shouldn't be confusing.

While the representation of the arrays is different, the algorithms themselves are not, and you should get identical results with either.

-=*=-

Depending on the implementation, encoding can be faster than decoding, because decoding works one bit at a time while encoding outputs several bits at each step.
Using a decode table with 64K entries (or even a large hash table with probing) would accelerate the process for a full tree.

Implementing Huffman is a remarkably complicated undertaking considering how simple the algorithm is. Most of the code is used for generating the tree; the actual encoding and decoding process is relatively straightforward. I hope the heap stuff isn't too bizarre... it's very common, so you might as well get used to it.

This can easily be converted to a static Huffman encoder by removing the scan pass and replacing the freq[] array with a set of static values. If you know what kind of data you will be compressing (e.g. English text), you can compute the frequency table once and store it. So long as the encoder and decoder have the same thing, you can skip not only the scan pass but the transmission of the table as well.

There's no need to limit the routine to a single static table, either. You could have a collection of tables, perhaps selecting them based on the type of file to be compressed. The table index could be transmitted instead of the whole table.

Perhaps the best choice when the type of data is not known is an adaptive Huffman encoder, which adjusts its tree after each character. It tends to be much slower, but it avoids all the hassles involved with scanning the file and transmitting the tree. Adaptive Huffman will be presented next week.

-=O=-

DEEP THOUGHTS

- What causes Shannon-Fano to occasionally use the "wrong" code?

- The UNIX pack(1) command implements semi-adaptive Huffman like we did in this lesson. Probabilities are stored as 32-bit integers, and the output codes are limited to 24 bits. A comment in the source code claims that the algorithm will work on any file up to 2^24 bytes long. Was the author correct? If not, how large is the smallest file which will break the UNIX pack(1) command?

- I used $100 as the distinguished stop symbol. This is also what the assembly getc() routine returns for EOF. Is there a conflict?
- Given that we have 257 symbols in the input, why is the frequency table only 512 bytes? Shouldn't it be 514?

- I mentioned that you could implement the Huffman decoder as a lookup table. How is this possible considering that you don't know the length of the next code?

- Would a Shannon-Fano encoder be less complicated than the Huffman encoder?

- The old BLU utility used SQueeze compression (SQ3/USQ2), which was a simultaneous RLE and Huffman pass (i.e. it did the RLE and passed the results straight to Huffman; the cute part was doing the RLE during the frequency scan). If you have SQ3, pack a file with RLE and then with Huffman, and compare the results. If I understand the standard correctly, this is pretty much what MNP5 does to your modem transmissions.

-=!=-

This document is Copyright by Genie and Andy McFadden