Hacking Data Compression   Lesson 5   Adaptive Encoding
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie

Today's lesson is on adaptive coding techniques.  We begin with a
justification of why it's a Good Thing, and drop into a discussion of
adaptive Huffman.  A description of the elusive splay tree compression
is also presented.

-=O=-

Greetings and Adaptations

If you followed the discussion in last week's lesson on Huffman
encoding, you should be familiar with some of the issues involving
adaptive, semi-adaptive, and static models and encoders.  To summarize
briefly:

- static schemes use a rigid set of probabilities, determined before
  examining the data to be compressed.

- semi-adaptive schemes choose a set of probabilities, and use them
  throughout the entire file.  The choice can be made by scanning the
  file in advance or using some sort of file type mechanism to
  determine what the contents should be.  The probability set or
  dictionary is either stored in the program or transmitted before
  the data.

- adaptive schemes start out in a particular state (they could start
  with no knowledge at all or from a basic set of probabilities), and
  then change the probabilities as the input is processed.

We've seen examples of all of these.  Differential encoding, the
static-dictionary digram compressor, and the MacWrite encoder are all
static; last week's Huffman encoder is driven by a semi-adaptive model
based on character frequencies; and the MTF scheme described in lesson
3 is adaptive.

-=*=-

We have already seen that adaptive schemes can lead to cleaner
implementations (no tables to shuffle around, no extra scanning
passes, etc).  However, there are some other advantages.  In section
3.2 of _Text Compression_, Bell/Witten/Cleary state:

    Given a set of conditioning classes with K members, an alphabet
    with q symbols, and a message of length n, there are adaptive
    models that will take at most Kq log n bits more than the best
    possible nonadaptive model.
This is probably not very meaningful, especially since we never really
went into what "conditioning classes" are.  I've presented it because
there are two interesting conclusions which can be drawn: first,
adaptive models will do only slightly worse than static models; and
second, Kq log n is approximately the number of bits needed to
represent a model of the message, so adaptive models will be about as
efficient as semi-adaptive models.

It can also be shown that adaptive models are "robust", meaning that
at worst an adaptive model will expand the input only slightly.  It
should be intuitive that you can make a completely static model do a
really lousy job (e.g. by feeding a file full of '%'s to a static
model of English text).

-=*=-

If I'm doing my job correctly, you should now be convinced that, at
worst, an adaptive model will do only slightly worse than a static or
semi-adaptive model.  However, there are cases where an adaptive model
will do considerably better than either.

Suppose I'm using some sort of multi-media format with text, graphics,
and sound all in one whopping big file.  If we suppose that none of
the data within the file is compressed (which is a silly assumption,
but hey, this is a contrived example anyway), then we have three sets
of stuff to deal with.

Suppose we run a static or semi-adaptive Huffman compressor on it.  A
static set of probabilities is going to gag on the sound and graphics,
but will do a nice job on the text.  A semi-adaptive compressor will
end up with a rather flat set of probabilities, because it will have
merged the frequencies from the text (assume English), the graphics
(which will likely tend toward a small set of values), and sound
(which looks almost random).

Since an adaptive Huffman compressor changes the probabilities as it
reads in characters, it will adapt to each section of the file in
turn.  This will allow it to compress each section with the best
possible values for the character probabilities.
This sort of adaptation can also be useful in just plain text files.
If you break away from normal writing to display ASCII cow art, the
distribution leans toward spaces and characters like '|', '/', '\',
and 'o'.  Static models just get bogged down, while adaptive models
can take advantage of the local changes.

-=*=-

One important question is, how quickly can (or should) the model
adapt?  If it adapts too slowly, it will miss out on local phenomena
like cow drawings.  Too quickly, and the probabilities will be
inaccurate, leading to poor compression.  The correct answer is, as
usual, "it depends."  For natural languages like English, you usually
want to base your predictions on the last few thousand bytes of data.

A very common way to increase the value of recently-seen data is to
"forget" old data.  This can be done with a "sliding window"
(discussed in the upcoming LZSS lesson), by explicitly weighting
recent events, or by "forgetting" old information.

In a model based on character frequencies, we get a certain amount of
forgetfulness if we restrict the size of the frequencies to prevent
overflow.  For example, if we periodically halve the frequency counts,
we are essentially cutting the weight of previous data by 50%.  The
next time we cut it, the original statistics are reduced to 25%, and
so on.  This isn't always a good thing - local changes might be
temporary - but for most kinds of data, recent events say more about
what to expect next than events which happened near the start.

-=*=-

All is not sunshine and roses in the adaptive compressor world.  There
is one major down side to adaptive schemes: the cost of updating the
model at every step.  The exact cost varies - Huffman is fairly
expensive, LZW is almost free - but it's never zero.
There may be additional speed penalties depending on the
implementation; for example, last week's Huffman encoder stored the
bit patterns in an array while it was encoding.  An adaptive encoder
is continuously changing the bit patterns, so this method is no longer
feasible.

As we saw last week, it is wise to limit the frequencies to some
value, so that the frequency table and output codes can be implemented
efficiently.  However, we also saw that this meant slashing all of the
frequencies in half whenever the total frequency reached a certain
value.

At first, this doesn't seem significant; ALL of the frequencies have
been cut in half, so the tree doesn't need to change.  Life, however,
is unfortunately not so simple.  Because of round-off errors, it's
possible for the balance to shift, requiring nodes to exchange
positions.  Because there's no easy way to adjust the tree to
compensate, it's necessary to rebuild the entire tree with the new
frequencies.

-=O=-

Splay trees

Let me explain something up front.  Very little has been written about
splay trees in compression texts.  The only place I know of that
discusses them is "Application of Splay Trees to Data Compression", by
Douglas W. Jones, Communications of the ACM, August 1988, and I've
never even read THAT.  What I DO have is a piece of code which I
pulled off the net about four years ago and managed not to lose.
Everything I know about splay trees comes from deciphering that code.

Why bother?  Why are splay trees interesting?  Because they can
approximate an adaptive Huffman encoder without ANY frequency counts
or other garbage.  All of the information is contained in the shape of
the tree itself.  It turns out that the cost of updating the tree is
about equal to adaptive Huffman, and the compression performance is
slightly worse, so there's really nothing to gain from using them
(consequently, there's no assembly code for them in this week's
lesson).
There is one big loss, however: since there are no frequency counts,
there is no way to control the maximum height of the tree (at least,
not without maintaining some expensive auxiliary data structures).
This means that the maximum code length can approach N bits, where N
is the number of symbols.  Outputting 256 bits for a rare symbol might
be efficient coding, but it's a real drag to implement.

-=*=-

How do splay trees work?  Fine question.  It's actually quite a bit
like Huffman, in that you have a binary tree, and output 0 if you go
left and 1 if you go right.  The only real difference is in the tree
update routine, which is remarkably simple.

Suppose you have a tree which looks like this:

             a
            / \
           /   \
          b     c
         / \   / \
        /   \ /   \
       w    x y    z

If you received the character corresponding to node w, you would swap
w and everything below it (i.e. w's "subtree") with node c's subtree.
Basically, you go one up and one over (the brother of the parent), and
swap.  So the result would look like this:

             a
            / \
           /   \
          b     w
         / \
        /   \
       c     x
      / \
     /   \
    y     z

(If the next character you got corresponded to node c, you'd end up
switching things right back.)

Changes to the other side of the tree are just reflected across... if
node z got updated in the first tree, it would be swapped with node b.
If node z got updated in the second tree, it would be swapped with
node x.

If the tree were taller (and proportionately harder for me to draw),
we would propagate the changes up the tree by updating the "local
root" node.  For example, when node z got updated in the second
example, it was swapped with x; b was the "local root", the parent of
the parent.  If node a had a parent, then b would've been swapped with
a's brother, and so on up the tree.  We stop when we hit the root or
the level below the root.

-=*=-

Why does this work?  Why does it even come CLOSE to Huffman?  Fine
question.  From this example, it should be clear that when a character
appears, it bubbles to the top rather quickly.
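The swap-and-propagate step described above can be sketched in C.
This is a sketch, not the lesson's code: the left/right/up arrays
follow the layout the splay program uses (described later in the
lesson), and the seven-node tree mirrors the a/b/c/w/x/y/z diagrams,
with a = 0, b = 1, c = 2, w = 3, x = 4, y = 5, z = 6.

```c
#define NODES 7

/* parent and child pointers for the example tree; -1 = none */
static int up[NODES]    = { -1, 0, 0, 1, 1, 2, 2 };
static int left[NODES]  = {  1, 3, 5, -1, -1, -1, -1 };
static int right[NODES] = {  2, 4, 6, -1, -1, -1, -1 };

/* Exchange two nodes; each keeps its own subtree attached. */
static void swap_nodes(int a, int b)
{
    int pa = up[a], pb = up[b];

    if (left[pa] == a) left[pa] = b; else right[pa] = b;
    if (left[pb] == b) left[pb] = a; else right[pb] = a;
    up[a] = pb;
    up[b] = pa;
}

/* After a symbol is coded, walk from its node toward the root,
 * swapping with the parent's brother ("one up and one over"), then
 * continue from the old grandparent - the "local root".  Stop when
 * we reach the root or the level just below it. */
static void splay(int node)
{
    while (up[node] != -1 && up[up[node]] != -1) {
        int parent = up[node];
        int grand  = up[parent];
        int uncle  = (left[grand] == parent) ? right[grand]
                                             : left[grand];

        swap_nodes(node, uncle);
        node = grand;          /* propagate the change up the tree */
    }
}
```

Calling splay(3) - that is, updating w - turns the first diagram into
the second one: w comes up to be a child of a, and c drops down into
w's old spot with its subtree intact.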
Each swap moves the node one level closer, and there can be more than
one swap if the tree is tall enough.  Because it's possible for nodes
to "oscillate", it seems likely that frequently seen characters will
tend to "hover" in the same region of the tree.  This means that
frequently seen characters will be given short codes, which is what we
want.

I suspect that the reason this doesn't do as well as adaptive Huffman
is that it adapts too quickly.  Infrequently seen characters will end
up at the bottom, but will rise quickly when encountered.  This may be
useful in certain situations, but doesn't model English text very
well.  The problem is that you can only capture so much state in the
shape of a tree with 256 nodes.

Anyway.  I haven't played with this enough to give a solid analysis.
Read the CACM paper if you're really interested.

-=O=-

PROGRAM (Adaptive Huffman)

The model used is the same as last week's: an order-0 character
frequency model.  The data structures we will use look very similar,
but with some interesting twists.

The binary tree is again stored as a collection of flat arrays.
First, we have the table of frequencies, with one entry for each
symbol and internal node.  Our set of symbols includes all 256
characters plus $100 for the stop symbol.  Recall from lesson 4 that
the frequency value of an internal node is the sum of the frequencies
of the children.  All of the leaf frequencies are initialized to 1 to
avoid the zero-frequency problem.

Next, we have the "parent" array.  At first, it has three distinct
levels.  The first level, from 0 to 256, represents the leaves; the
pointers all point to internal nodes.  The second level, from 257 to
512, represents the internal nodes (256 internal nodes for 257
characters, remember?)  The root of the tree is at the very end, in
node 512.  The third level is actually BELOW the leaves despite being
above them in the array.  It has one entry for each character, and
points to the corresponding leaf.
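A sketch of that setup, patterned after the well-known LZHUF
StartHuff() routine, which uses the same freq/prnt/son array names
that appear in the discussion below.  Treat the details as
illustration rather than the actual class source:

```c
#define N_CHAR 257              /* 256 characters plus the stop symbol */
#define T (N_CHAR * 2 - 1)      /* total number of tree nodes: 513 */
#define R (T - 1)               /* the root lives in node 512 */

static unsigned short freq[T + 1];   /* frequencies (+ one sentinel) */
static short prnt[T + N_CHAR];       /* parents; third level starts at T */
static short son[T];                 /* left child; right is son[i] + 1 */

static void StartHuff(void)
{
    int i, j;

    /* Leaves: frequency 1 each (the zero-frequency dodge); their
     * "children" point below the tree, and the third level points
     * back at each leaf. */
    for (i = 0; i < N_CHAR; i++) {
        freq[i] = 1;
        son[i]  = (short)(i + T);
        prnt[i + T] = (short)i;
    }

    /* Internal nodes: pair up children two at a time, summing the
     * frequencies, until the root is built. */
    i = 0;
    j = N_CHAR;
    while (j <= R) {
        freq[j] = (unsigned short)(freq[i] + freq[i + 1]);
        son[j]  = (short)i;
        prnt[i] = prnt[i + 1] = (short)j;
        i += 2;
        j++;
    }

    freq[T] = 0xffff;   /* sentinel: bigger than any real frequency */
    prnt[R] = 0;        /* sentinel: the root has no parent */
}
```

The last two lines are the two sentinel values the text tells you to
make a note of.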
When we start updating the tree, the leaves and internal nodes will
shift around the first two levels.  The pointers in the third level
will continue to track the leaves around, so that when we are encoding
symbols we know where to start, and when decoding we know when to
stop.  Remember, encoding walks up the tree, decoding walks down.

The "son" array holds pointers to the child nodes for every entry in
the first two levels.  The pointers in the leaf nodes point to the
third level, so that you can tell when you've hit the bottom of the
tree by looking at the size of the index in "son" (if it's larger than
the root, then you are about to reach the third level, and thus are in
a leaf).  Besides acting as a flag, it also tells us which character
the leaf represents.  Note that each node has two children; they are
found at son[i] and son[i] + 1 (so the brothers are always adjacent).

This may all seem a bit confusing, but it leads to a remarkably
compact and efficient implementation.  Take a careful look at
StartHuff(), and figure out what is going where.  Make a note of the
two sentinel values in the frequency table and in parent.

-=*=-

encode() and decode() are pretty dull.  encode() just keeps calling
EncodeChar() with the character it gets, and decode() keeps calling
DecodeChar() and outputting what it gets back.

EncodeChar() is mildly interesting.  It starts at the leaf and walks
up the tree, pushing bits on as it goes.  The bits are pushed on at
the left edge and move to the right.  When it reaches the root, it
transmits the bits.  This is just doing what the static Huffman coder
did last week, but here we have to do it every time instead of
computing it all at once.

DecodeChar() is also only mildly interesting.  Since we're decoding,
we take the input one bit at a time, and walk from the root down until
we find a leaf.  The character we return is determined by the tree
position.
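Here are sketches of the two routines, with the bit I/O simplified so
the code stands alone: the real EncodeChar() transmits the bits as it
reaches the root (and then updates the tree), while this version hands
the code back as an integer for inspection.  The freq/prnt/son layout
is the LZHUF-style one described above, shown here on the freshly
built starting tree:

```c
#define N_CHAR 257
#define T (N_CHAR * 2 - 1)
#define R (T - 1)

static unsigned short freq[T + 1];
static short prnt[T + N_CHAR], son[T];

static void StartHuff(void)     /* LZHUF-style setup, as in the text */
{
    int i, j;
    for (i = 0; i < N_CHAR; i++) {
        freq[i] = 1;
        son[i] = (short)(i + T);
        prnt[i + T] = (short)i;
    }
    i = 0;
    j = N_CHAR;
    while (j <= R) {
        freq[j] = (unsigned short)(freq[i] + freq[i + 1]);
        son[j] = (short)i;
        prnt[i] = prnt[i + 1] = (short)j;
        i += 2;
        j++;
    }
    freq[T] = 0xffff;
    prnt[R] = 0;
}

/* Walk leaf-to-root, pushing bits on at the left edge.  Brothers are
 * adjacent, so an odd index means "right child" - a 1 bit.  Returns
 * the code length; the right-justified, root-first code lands in
 * *code. */
static int EncodeChar(int c, unsigned long *code)
{
    unsigned long bits = 0;
    int len = 0;
    int k = prnt[c + T];          /* the third level finds our leaf */

    do {
        bits >>= 1;
        if (k & 1)
            bits |= 0x80000000UL;
        len++;
    } while ((k = prnt[k]) != R); /* climb until we hit the root */
    *code = bits >> (32 - len);
    return len;
}

/* Walk root-to-leaf, one bit at a time.  A child pointer at or past
 * T means we've reached the third level, i.e. a leaf. */
static int DecodeChar(unsigned long code, int len)
{
    int c = son[R];

    while (c < T) {
        c += (int)((code >> --len) & 1);
        c = son[c];
    }
    return c - T;
}
```

On the starting tree, character 0 gets the 9-bit code 111111110 -
every leaf starts out near the bottom because all the frequencies
are 1.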
As you look at the code, remember that 'T' is the size of the tree, so
c+T is the position on the third level (the one which is effectively
below the leaves).

-=*=-

The update() routine does a moderate amount of work for each
character.  However, after the first 32K (or so) characters, and
roughly every 16K thereafter, the tree will be entirely rebuilt.  I'll
explain that part in a bit.

(Note that the frequency of an internal node is equal to the sum of
the frequencies of its children.  If you carry this up the tree, you
will see that the frequency of the root is equal to the sum of the
frequencies of the leaves.  Thus, we don't need to maintain an extra
"total frequency" counter; the value is in the root.)

In the average case, the frequency of the leaf for the character we
just got must be increased.  However, since the parent's frequency is
the sum of the frequencies of its children, we need to propagate the
change all the way up the tree.  Also, since we need to change the
shape of the tree adaptively, we have to swap the node if its
frequency exceeds that of its neighbors.  The code is sort of strung
out, but it's basically this:

  (1) increment the node's frequency
  (2) if we're now larger than our right neighbor, then
      (2a) find the largest node that we are still larger than, and
      (2b) swap with it
  (3) continue up the tree

When the node moves, it doesn't take its children with it.  The two
nodes are detached, swapped, and reattached.  The reason for doing the
search instead of just swapping is that there could be a span of nodes
with the same frequency.

-=*=-

The reconstruction routine is also pretty simple, though the code is a
bit messy.  It starts by taking all of the leaf nodes and collecting
them on the first level of the tree.  While we do this, we cut the
frequencies in half, taking care that none of them reach zero (this
uses the same (val+1)/2 method as last week).
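That first pass might be sketched like this (a sketch only - the real
reconst() goes on to rebuild the internal nodes and the prnt array):

```c
#define N_CHAR 257
#define T (N_CHAR * 2 - 1)
#define R (T - 1)

static unsigned short freq[T + 1];
static short prnt[T + N_CHAR], son[T];

static void StartHuff(void)     /* LZHUF-style setup, as in the text */
{
    int i, j;
    for (i = 0; i < N_CHAR; i++) {
        freq[i] = 1;
        son[i] = (short)(i + T);
        prnt[i + T] = (short)i;
    }
    i = 0;
    j = N_CHAR;
    while (j <= R) {
        freq[j] = (unsigned short)(freq[i] + freq[i + 1]);
        son[j] = (short)i;
        prnt[i] = prnt[i + 1] = (short)j;
        i += 2;
        j++;
    }
    freq[T] = 0xffff;
    prnt[R] = 0;
}

/* Sweep all the nodes; a son[] pointer at or past T marks a leaf.
 * Pack the leaves onto the first level, halving each frequency with
 * (val + 1) / 2, which rounds up so no count can reach zero.
 * Returns the number of leaves collected. */
static int CollectLeaves(void)
{
    int i, j = 0;

    for (i = 0; i < T; i++) {
        if (son[i] >= T) {
            freq[j] = (unsigned short)((freq[i] + 1) / 2);
            son[j]  = son[i];
            j++;
        }
    }
    return j;
}
```

Note how a count of 1 stays at 1 ((1+1)/2), while a count of 100
drops to 50 - old statistics fade by half at every rebuild.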
So now we have the entire first level of "freq" and "son" set, and the
"prnt" array is pretty much invalid.  The next step is to set the
internal nodes while rebuilding the tree according to the frequencies.
The loop uses a trick from StartHuff(), where we step through the kids
by two and the internal nodes by one, ending up at the root at the
same time.  This works because the two sons of each new internal node
are adjacent: the pair of children starting at position i gets its
parent at a fixed offset of i/2 into the internal nodes.

Anyway.  At each step, we look backward through the tree to see if we
already created nodes with higher frequencies.  If we find some (which
can be in a leaf or in an internal node), then we slide all
intervening nodes forward, and put the new node into position below
them.  This shapes the tree.  Is it the most efficient way to go about
it?  Maybe, maybe not; it depends on the hardware.  It's really very
simple once you see the algorithm through all the i's, j's, and k's.

The final step in the reconstruction is to set the "prnt" array to
reflect the new order.  This just steps through the tree, and sets the
parent pointer for the son of each node.  Note that internal nodes
have two kids, and leaves have one child (a pointer to the node on the
third level).

-=*=-

The assembly code supplied this week began its life as a direct port
from the C code, but got rearranged slightly for optimization.  I had
to "dumb it down" to make it readable for the class - the file was
more than twice as large as it is now because of unrolled loops.  I
did leave some of the fancy stuff in, along with some pointers for
optimizations.  Still, it'll be easier to understand the C code, so
even if you don't know C it might be best to have both in front of
you.

One thing I had tried but gave up on was pre-loading the tree with a
set of frequencies.  The idea was that, for English text, it would
improve compression and reduce the number of adjustments that would
have to be made to the tree.
It turns out that this implementation adapts quickly to the input, so
very little was gained in terms of compression ratio.  As far as speed
goes, the initial sorting takes about as long as it does to shift the
tree around, so there was little gain there.  (This could be fixed by
pre-sorting the static frequency table, but you're still doing a
reconst() at the start, at 16K, and at 32K, whereas in the normal case
you do the first reconstruction at 32K.  That's two extra reconst()
calls, which are not trivial.)  In case you want to play with it, I
left the code and a table for English in the source file, AIFed out.

-=*=-

There's a file in the "Test" subdirectory called "Zippy1.1.b".  It's a
ShrinkIt archive packed in BinSCII format (which is similar to
uuencode or Mac BinHex format).  Basically, BinSCII takes 8-bit binary
data and expands it to occupy 6 bits of every byte.  This is useful
for posting to Usenet or sending through mail systems which don't like
8-bit data (stuff like control characters or even stray spaces may get
eaten, destroying the contents of the file).

If you try to pack the file with ShrinkIt, you won't accomplish much.
LZW rarely compresses BinSCII data by more than 10%.  On the other
hand, Huffman does fairly well.  Try it and see.  Be sure to see how
many bits per symbol the output requires, keeping in mind that BinSCII
uses 6 bits out of every byte.

-=O=-

PROGRAM (Splay Tree)

As stated earlier, only C code has been provided.  I did provide the
splay() routine itself in assembly, as "splay.fragment".

There are three arrays, "left", "right", and "up", which contain the
left child, right child, and parent pointer for each node.  We have
257 leaves and 256 parents, but the tree is upside down compared to
the Huffman trees we've been building: the root is in entry zero, and
the leaves begin at 257.  The initialize() routine just sets it up as
a fully balanced tree.
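A sketch of such an initialize().  The index arithmetic here is a
plain 0-based heap layout (children of node i at 2i+1 and 2i+2, so
the 256 internal nodes come first and the 257 leaves follow); the
lesson's actual code numbers its nodes slightly differently, but the
balanced starting shape is the same idea:

```c
#define LEAVES   257                  /* 256 characters + stop symbol */
#define INTERNAL (LEAVES - 1)         /* 256 internal nodes */
#define NODES    (LEAVES + INTERNAL)  /* 513 nodes; root is node 0 */

static short left[NODES], right[NODES], up[NODES];

/* Build a fully balanced starting tree: every internal node gets two
 * children, every node but the root gets a parent. */
static void initialize(void)
{
    int i;

    for (i = 0; i < INTERNAL; i++) {
        left[i]  = (short)(2 * i + 1);
        right[i] = (short)(2 * i + 2);
    }
    for (i = 1; i < NODES; i++)
        up[i] = (short)((i - 1) / 2);
    up[0] = -1;                       /* the root has no parent */
}
```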
The encode() and decode() routines work just like they did for
Huffman, except that in this case encode() puts the bits into an
actual heap array, since the size isn't limited to something nice like
16 or 32 bits.  The heap is declared with 514 entries (which
corresponds to the size of the "up" array), but I don't think it can
go past 257.  The encode loop stops when it hits 0 (the root), and the
decode loop stops when it finds a child pointer larger than 257 (a
leaf).  This routine also uses an explicit stop character.

-=*=-

The splay() routine itself is a straightforward implementation of the
algorithm described earlier.  It's kind of icky and uses 1-character
variable names, but it really isn't doing anything fancy.  There may
be a more efficient way to implement this, but it translates fairly
well to assembly as it is.

The "a = c" at the end is just a polite way to terminate the loop.
Dunno why the author did it that way, but there it is.  The "handle
odd node at end" comment doesn't make sense to me, either.

-=O=-

DEEP THOUGHTS

- Last week's semi-adaptive Huffman compressor halved the
  probabilities whenever the total reached about 4K.  If you have a
  large file with character distributions which change periodically
  (like the text/graphics/sound example in today's lesson), will the
  probabilities accurately reflect the distribution in the file as a
  whole?

- The loss in speed comes from two problems: (1) updating the tree at
  each step, and (2) the inability to pre-compute bit patterns for the
  encoder.  Is it possible to reduce the effects of either or both of
  these?

- Would the speed be improved by limiting the frequency to 4181 as we
  did for the static Huffman encoder?  (HINT: what happens when we hit
  4181?)

- For what kinds of input would Splay encoding do well?  Will it ever
  do better than adaptive Huffman?

-=!=-

This document is Copyright by GEnie and Andy McFadden