Hacking Data Compression  Lesson 3  Somewhat Clever Ad-Hoc Schemes
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie

This week's topic covers three schemes which were designed to do a
reasonable job of compressing English text.  I classify them as "ad
hoc" because they were created specifically to take advantage of the
properties of English text, and will do poorly on almost anything
else.

First up is a quick description of an algorithm that MacWrite used,
then comes MTF (Move To Front) presented with an introduction to
dictionary models, and finally digram coding.  The last is the most
interesting from a theoretical viewpoint, because the discussion of it
leads into the ideas behind the most powerful compression schemes yet
devised.  But I'm getting ahead of myself.

The only one I've implemented is a simple digram encoder; the other
two didn't seem worth the effort.

-=O=-

MacWrite

Let's get this out of the way: Apple, Macintosh, and MacWrite are
trademarks of Apple Computer, Inc.  I don't know if MacWrite is or
isn't, but I feel better (and safer) now that I've said it.

Back in '84, the Macintosh was released to the world with 128K of
memory and a single 400K disk drive built in.  The designers of the
original Mac word processor decided that 400K wasn't much room for a
whole pile of data files, so they built a simple form of text
compression into MacWrite.

The basic idea is to take the 15 most common characters in the English
language and store them in 4 bits (say, values 0-14).  The 16th value
(15) is used as an "escape" character, indicating that the next 8 bits
represent a single character.  This means that there are 15 characters
which will be stored in 4 bits, and (256-15 =) 241 characters which
will be stored in 12 bits (4-bit escape + 8-bit character).

(Aside: the use of an explicit escape character is fairly common in
data compression schemes which focus on a particular type of data.  It
lets the author concentrate on the common cases, with the tradeoff
that some nasty things may happen on the rare ones.  An example of
this appeared last week in the differential encoding scheme - you can
make the lossy sound compression schemes lossless by using an escape
character - and it can actually be thought of as the foundation of
RLE.)

Clearly, the choice of those 15 characters is crucial to the success
of this scheme.  For English text, the traditional wisdom has been
that the letters in "ETAOIN SHRDLU" are the most frequent, in that
order.  What's nice about that string is that it can be pronounced,
serving as its own mnemonic device.  More recent studies have shown
that "ETAONI RSHDLC" is probably closer to American English.  The
characters used by MacWrite are " etnroaisdlhcfp" (the space is
significant here, and in fact is the most common *character* in
English text).

How well does this do?  If the set of 15 characters accounts for more
than 50% of the text, the scheme will win (one frequent character plus
one infrequent character requires two bytes, which is exactly how much
space the pair would occupy before packing).  It turns out that they
account for well over 50%, and this scheme typically reduces text by
20%.

-=*=-

It's interesting to compare this scheme with the 7-bit packing from
lesson 1.  Recall that the packing method reduced text by exactly
12.5%, but was incapable of coping with 8-bit data.  Here we have a
scheme which does much better, and isn't much more difficult to
implement.
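To make the mechanics concrete, here is a minimal sketch of the
packing side in C.  It is not MacWrite's actual code - just the nibble
scheme described above, using the " etnroaisdlhcfp" table.  Unpacking
reads the stream back 4 bits at a time; a lone escape nibble at the
very end of the file is treated as padding.

    /* Minimal sketch of the 4-bit packing idea (not MacWrite's actual
     * code).  Characters in the 15-entry table become one nibble;
     * everything else becomes the escape nibble (15) plus the two
     * nibbles of the original byte. */
    #include <stdio.h>
    #include <string.h>

    static const char common[] = " etnroaisdlhcfp";  /* 15 chars */

    static int have_half = 0;   /* is a nibble waiting to be paired? */
    static int half;

    static void put_nibble(int n, FILE *out)
    {
        if (!have_half) {
            half = n << 4;          /* hold the high half of a byte */
            have_half = 1;
        } else {
            putc(half | n, out);    /* pair it up and write the byte */
            have_half = 0;
        }
    }

    void pack(FILE *in, FILE *out)
    {
        int c;
        const char *p;

        while ((c = getc(in)) != EOF) {
            if (c != '\0' && (p = strchr(common, c)) != NULL) {
                put_nibble((int)(p - common), out);   /* codes 0-14 */
            } else {
                put_nibble(15, out);                  /* escape */
                put_nibble((c >> 4) & 0x0f, out);     /* high nibble */
                put_nibble(c & 0x0f, out);            /* low nibble */
            }
        }
        if (have_half)              /* flush a trailing half-byte; the */
            put_nibble(15, out);    /* unpacker treats it as padding   */
    }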
However, it isn't consistent: a file full of the letter 'e' will be
maximally compressed (50% of original size), but a file full of the
letter 'q' will be expanded to 150% of the original size.

This scheme can be improved by turning the compression off for parts
of the text which don't compress well.  One approach is to do what
PackBytes does and chop the file into fixed-size chunks, each of which
is compressed individually.  Another approach is to insert a different
kind of escape code into the data (i.e. different from the binary
"1111" used to indicate that an 8-bit value follows), perhaps by
reserving one of the 8-bit values (say 255) as a toggle.  When it is
encountered, the compression status (on or off) is toggled.  This
would cause dramatic expansion if the data contained a run of bytes
with the toggle's value (do you see why?), but that problem could be
avoided by using the escape character rotation scheme discussed in the
RLE chapter of lesson 2.

Another way to look at this algorithm is as a dramatically simplified
static Huffman encoder.  The advantages and flaws are identical.  (If
you're not familiar with Huffman, fret not: it's introduced next
week.)

-=O=-

Dictionary Schemes

Before I present MTF, a brief description of a "dictionary" model is
in order.

Dictionary-based schemes work by replacing a series of characters
(say, an English word) with an index into a list of words.  If your
dictionary contains "and", "the", "that", "which", and so on, then
occurrences of those words would be replaced by the numbers 0, 1, 2,
and 3, respectively.  If you replaced all occurrences of " the " with
a single 8-bit value, you would get a small but non-negligible
reduction in size.  If you increase the size of your dictionary,
adding the 256 most common words in the English language, you will do
even better.  One universal property of dictionary schemes is:

    bigger dictionary == better compression

Returning to the lessons from the first class, note that the
dictionary represents the MODEL used for compression.  It says nothing
about how the dictionary indices are ENCODED in the output.  Some
algorithms (like LZW and LZSS) simply store the value, while others
(like LZH) use a more sophisticated encoder to get better results.

-=*=-

There are many different ways to implement dictionary models.  Static
dictionaries (such as one filled with common English words) are easy
to program but not very effective.  Consider Spanish: it has a
slightly different "feel" to it, but the sizes of words and the
distributions of characters are fairly similar to English.  Even so, a
static dictionary designed for English will die horribly on _Don
Quixote_.

Semi-adaptive dictionaries are built up by scanning the input, and are
sent along with the compressed data.  These fare better than static
dictionaries, because they use the most common words actually found in
the input.  Unfortunately, the cost of sending the dictionary grows as
the size of the dictionary grows.  Since better compression is
achieved with larger dictionaries, there's a break-even point past
which a larger dictionary starts to make the output larger.

All of this can be avoided with an adaptive dictionary model, which
builds the dictionary from the input as it works.  The first time it
sees the word "hoser", it sends it through uncompressed, and adds it
to the dictionary.  The next time, it sends a dictionary offset, just
as a static or semi-adaptive scheme would.  The uncompression routine
just does the operation in reverse: when it sees "hoser" in the
compressed data, it adds the word to its dictionary, and can recognize
the dictionary offset when it appears from then on.
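In code, the compressor side of such an adaptive scheme can be
sketched like this.  The details are my own assumptions, not part of
this week's program: words arrive already split out of the input, the
dictionary is a flat array searched linearly, and the "output" is
printed as readable text so the behavior is easy to watch.

    /* Adaptive word dictionary, compressor side (sketch only).
     * First occurrence of a word: emit the word itself and add it to
     * the dictionary.  Later occurrences: emit its index. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_WORDS 4096          /* arbitrary cap for the sketch */

    static char *dict[MAX_WORDS];
    static int dict_size = 0;

    void compress_word(const char *word, FILE *out)
    {
        int i;

        for (i = 0; i < dict_size; i++) {
            if (strcmp(dict[i], word) == 0) {
                fprintf(out, "[%d] ", i);   /* seen before: send index */
                return;
            }
        }

        fprintf(out, "%s ", word);          /* new word: send it raw... */
        if (dict_size < MAX_WORDS) {        /* ...and remember it */
            dict[dict_size] = malloc(strlen(word) + 1);
            if (dict[dict_size] != NULL)
                strcpy(dict[dict_size++], word);
        }
    }

The decompressor keeps an identical array and adds each raw word as it
arrives, so the two dictionaries stay in sync without ever being
transmitted.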
All of the algorithms in the LZ family (LZW, LZSS, LZH, etc) use
adaptive dictionary models.  What makes them remarkable compared to
other algorithms is that not only do they compress well, they are also
extremely fast.

-=O=-

MTF (Move To Front)

MTF compression is interesting because it's a very intuitive way to
compress data.  It uses a simple adaptive dictionary approach on whole
words, where a word is defined as everything between spaces or
newlines.  Note that it doesn't operate on the whitespace itself, and
therefore lends itself to a lossy implementation (where " " and "  "
would be indistinguishable in the output).  This can be worked around,
though we won't deal with it here.

The algorithm looks like this:

    for every word in the input
        if the word is one we have seen before, then
            output its position in the list
            move it to the front of the list
        else
            output the word itself, uncompressed
            add it to the list of words, at the front

This gradually builds up a list of words it has seen.  The most
recently seen word is always moved to the head of the list (an MRU -
Most Recently Used - ordering), so that commonly used words will tend
to cluster near the front.  By encoding small dictionary indices in
fewer bits than large ones, we get better compression than we would if
we had simply substituted an integer for a word.

That's all there is to it.  In theory, at least.

-=*=-

MTF as presented has one fatal flaw: the dictionary grows without
bound.  If the input has more words than can fit in memory, we're in
trouble.  An easy solution is to cut the list off at an arbitrary
size, say 4096 entries.  Because we're moving the most recently used
word to the front, the entries which get dropped will be rare ones, so
transmitting them again will not be a major burden.  This is not a
complete solution - if each word is 8000 characters long, we may still
run out of memory - but you get the idea.

One interesting property is that, at any time, the dictionary will
contain the 4096 most *recently* seen words, NOT the 4096 most
*frequently* seen words.  If I go insane and write "All work and no
play makes Jack a dull boy" for several hundred pages, those 10 words
will still top the list of most frequently used words even after being
followed by a couple hundred pages of newspaper clippings.  However,
MTF will have forgotten all about them 4096 words later unless they
appear again.

Having the adaptation weigh recently seen data more heavily than older
data can be a big win sometimes.  For example, I've used "4096"
repeatedly in the last few paragraphs, but didn't before this section
and won't again.  The danger is that, if adaptation moves too slowly,
the model will spend all of its time adapting to the changes and never
"lock on" to what is occurring frequently and what isn't.

-=*=-

A few notes about MTF.  I didn't implement it because it gets kind of
"messy".  First, there's the dictionary.  You want a data structure
which can be inserted into at the head and deleted from in the middle,
like a doubly linked list.  Second, unless you want to impose a
minimum word length (good) and a maximum word length (bad), you have
to deal with variable-sized words.  Doing this efficiently (i.e. not
calling malloc() every time) can become a chore - especially if memory
begins to fragment.  Third, you need a way to efficiently encode
integers of various magnitudes.  The appendix to _Text Compression_
contains a list of good schemes.
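For the third point, here is one simple possibility (not one of the
schemes from the book's appendix): a byte-oriented code in which each
byte carries 7 bits of the value and the high bit says "more follows",
so indices near the front of the list cost a single byte.

    /* Variable-length integer coding (sketch).  Small values - the
     * common case when recently used words cluster near the front of
     * an MTF list - take one byte; larger values take more. */
    #include <stdio.h>

    void put_index(unsigned long val, FILE *out)
    {
        while (val >= 0x80) {
            putc((int)(val & 0x7f) | 0x80, out);  /* 7 bits + "more" flag */
            val >>= 7;
        }
        putc((int)val, out);                      /* final byte, flag clear */
    }

    unsigned long get_index(FILE *in)
    {
        unsigned long val = 0;
        int shift = 0;
        int c;

        while ((c = getc(in)) != EOF) {
            val |= (unsigned long)(c & 0x7f) << shift;
            if ((c & 0x80) == 0)    /* high bit clear: last byte */
                break;
            shift += 7;
        }
        return val;
    }

Codes that work at the bit level (like the ones listed in the book's
appendix) can squeeze small indices even harder, but the shape is the
same: smaller numbers cost fewer bits.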
Since this is an encoding problem, it is also possible to use one of
the generic encoders (Huffman or arithmetic coding), and in fact MTF
does surprisingly well when combined with one: Bell/Cleary/Witten
found that it did significantly better than LZW and LZSS on English
text.  Where this gets messy is in figuring out, during decompression,
what is an uncompressed word and what is a dictionary offset.

-=O=-

Order-N models

Most algorithms which model natural languages like English try to
predict the next character based on the input seen so far.  Simple
schemes, like the typical Huffman compressor, compress each character
according to its frequency.  This works well enough, but you can do
better: as I pointed out in lesson 1, you will get more accurate
probabilities if you take into account that 'q' is almost always
followed by 'u', so 'u' should have a higher probability than 'e' if
the previous character was a 'q'.

The simple Huffman encoder is an example of an "order-0" model.  It
has a set of probabilities which do not take into account any of the
characters immediately preceding the current one, and do not reflect
the order in which the characters appeared.  This means that "the cat
in the hat" will be compressed in exactly the same way as "aate hte
ctn htih", even though the former is English while the latter is
gibberish.  The problem is that the order-0 model is too general, and
is missing out on characteristics specific to English text.

Using the previous character to determine probabilities is known as an
"order-1" model, because it looks one character back.  Instead of a
single set of probabilities, there are 256 sets, one for every
possible previous character.  The set of probabilities is larger and
more difficult to maintain, but is considerably more accurate.

How much better does it get?  I don't actually have a quantitative
measure.  However, if you run it backward (generating text from the
model), you will get the most frequent character pairs instead of the
most frequent characters.  Since there are some combinations which
never occur (find me an "xq" in Shakespeare), the resemblance to
English will increase.

Does the modeling get better as the order gets larger?  Yes.  You
begin to get fragments of words, whole words, and parts of sentences.
After a point, however, the models become large and unwieldy, and are
unusable in practice.  Also, after a point you end up with sequences
which are unlikely to repeat... for example, there are no
100-character sequences in this file which are repeated.

You can also have an order-(-1) model, in which all probabilities are
equal (e.g. 1/256 for every element in the set of 8-bit characters).

Now take a big step back.  All of the above applies to arbitrary
chunks of data.  You can have order-N models for words, sentences,
8-byte blocks, or anything else.  The order just tells you how far
back the model looks in terms of the unit data element.  It also works
just fine for input other than English text.

There are some examples of running English order-N models backward in
_Text Compression_, pages 80-81 (by character) and 84 (by word).  Some
are kind of funny, like, 'She stammered, not bodily into water at then
kissed here and in color; bright red, with local assessing units".'

-=O=-

A simple digram dictionary model

A "digram" is a pair of symbols.  For this discussion, we will use it
to mean a pair of characters in a text file.  A digram encoder is an
order-1 model because it considers two characters at a time, the
current one and the previous one.
A "trigram" encoder would be order-2, "tetragram" would be order-3, and so on. The simplest use of digrams is a dictionary scheme which replaces pairs of characters with a single-byte dictionary offset. It takes the input two characters at a time, and looks the value up in its dictionary. If it finds it, it outputs the dictionary offset. If it doesn't, it outputs the character. The sample program for this week does exactly this. Details are in the "PROGRAM" section. -=*=- When we did the 7-bit ASCII packer (the one that packed it down to exactly 7 bits), we had to abort if we encountered a character with the high bit set, because we had no way of dealing with it. We can view this scheme as an order-(-1) model, in which the first 128 characters have an equal probability (1/128), and the last 128 have a probability of zero. This illustrates what is know as the "zero frequency problem." To make it possible to encode a character, we must assign it a non-zero probability, no matter how unlikely it may be. If we're just dealing with 256 probabilities (one for each character), it isn't a big deal. Now suppose we have an order-10 model. If we are modeling English, the string " etaoinshrd" is unlikely to appear. However, it must be given a non-zero probability or we will be unable to encode it. In fact, we must assign and maintain probabilities for ALL 11-character strings. This is prohibitively expensive. Even if we only allow the 96 or so printable characters, that's 96^11 probabilities to keep track of. You might be tempted to suggest that we should just add the new entries as we go along. Well, guess what, we can, but it's not as easy as you might think. Keep in mind that this is NOT a dictionary scheme; the code output for each character depends on the probability of that character occurring, which in turn depends on what the last 10 characters were. We're encoding this stuff one character at a time, not 11 characters at a time. -=*=- The solution is called "blending." There are all sorts of ways to go about it (see chapter 6 in _Text Compression_), but the basic idea is to combine an order-(-1) model with order-0, order-1, order-2, ..., order-9, and order-10 models. Suppose a given sequence of 10 characters has never been seen before, so the order-10 model has no way of guessing what the next character is. All characters have probability 0. If the order-9 model HAS seen it, then it will have a nonzero probability. If we combine all of the probabilities together (including the order-(-1) model, which gives equal probabilities to all characters), we are guaranteed to have a nonzero probability. At first this appears to make the order-10 model less efficient, because it gets "shouted down" by the other models. This can be sidestepped by weighting the probabilities from the higher-order models more heavily, reducing the effect that the lower-order models have. If the probability for a model is zero, then the model has no effect and the lower-order models control the outcome. Another way to solve the problem is by "escaping" down to a lower model when a series of characters is seen for the first time. This is done by transmitting an explicit escape code to the decoder. Note that the escape code must have a nonzero probability or it will be impossible to transmit! Thus we need to be able to estimate the probability that we will be clueless as to what the next character will be. 
This type of algorithm - which is really just a digram model on
steroids - is known as "context modeling", because the model depends
on the current "context" (i.e. the last few characters).  There are
related ideas called "state modeling" or "Markov modeling".  We won't
be going into any of these in detail, because they use LOTS of memory
and are VERY slow.

-=*=-

The static dictionary digram compressor in today's sample code can be
expressed as an order-1 context model with a static set of equal
probabilities.  The characters not in the dictionary have a
probability of zero, so we escape to the order-(-1) model and output
them that way.  The escape is signalled by the high bit of the output
byte.

It turns out that ALL dictionary encoders, including fancy ones like
LZW and LZSS, can be expressed as an equivalent predictive model.  A
mathematical algorithm exists which will do the conversion.

-=O=-

PROGRAM

This week's sample program has a static dictionary of 128 character
pairs.  It reads the input, and if it sees a pair of characters that
appears in the dictionary, it outputs the dictionary offset with the
hi bit set.  If not, it outputs the character with the hi bit clear.
You may be wondering, where does the dictionary come from?  Then
again, you might not.

As before, it only accepts 7-bit ASCII data.  It could be made to
accept 8-bit data, but then every character in the output file would
have to be 9 bits, and that tends to obfuscate a relatively simple
algorithm.  It's also instructive to do a hex/ASCII dump on the
encoded output and see what's left - notably all the upper case
letters (why?).  If the output codes are always 8 bits, it's much
easier to interpret.

When I say the programs are simple, I mean simple.  I use a linear
search through the 128-entry dictionary, and I use arrays instead of
pointers in the C version, so the encoder is a tad slow.  The decoder
is as fast as anything else we've seen so far, because it either
outputs the character it gets (if the hi bit is clear) or outputs the
two characters in the dictionary at the index specified by the
character (if the hi bit is set).

Oh yeah, the dictionary.  The C version has a separate include file
called "freq.h", and the assembly version just has a bunch of DC
constants in the data segment up front.  The two are identical.  Both
were generated by "scanpair.c", which takes as an argument a file to
scan, and produces on stdout (read, "it dumps on your screen") a
ready-to-go "freq.h" with values for the input.  There is no assembly
source code equivalent for it.  However, there are two different
versions of the C code - one generates output for "freq.h", the other
generates it for an assembly DATA segment.  They're identical except
for the output formatting routine.  Both executables are included.  To
use them, do something like this:

    scanpair cmain.c > myfreq.h         for C stuff
    ascanpair main.asm > mydata.asm     for asm stuff

The output of scanpair is more enlightening than the output of
ascanpair, so even if you don't plan to use the C code, be sure to
look at C/freq.h.
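In code terms, the decode loop described above amounts to something
like the following sketch.  It is not the source shipped with the
lesson; the function name and the handful of sample pairs are
placeholders, with the real 128 entries coming from freq.h.

    /* Digram decoder (sketch).  Codes with the hi bit clear are
     * literal characters; codes with the hi bit set are indices into
     * a 128-entry table of character pairs. */
    #include <stdio.h>

    static const char pairs[128][2] = {
        { 'e', ' ' }, { ' ', 't' }, { 't', 'h' },
        /* ...the remaining entries would come from freq.h... */
    };

    void expand(FILE *in, FILE *out)
    {
        int c;

        while ((c = getc(in)) != EOF) {
            if (c & 0x80) {                     /* hi bit set: dictionary */
                putc(pairs[c & 0x7f][0], out);
                putc(pairs[c & 0x7f][1], out);
            } else {                            /* hi bit clear: literal */
                putc(c, out);
            }
        }
    }

The encoder is the same idea run in the other direction, plus the
linear search through the table.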
The freq.h I've included was generated from a large blob of English
text, and should be reasonably representative of English (compare it
to the results on page 79 of _Text Compression_).  Compressing the
source file from which the statistics were gathered gave me about 37%
compression.  Since the maximum possible is 50% (all input characters
reduced to digrams), I'd say it worked fairly well.  Compressing the
assembly or C source files gives about 27% compression, because they
have a different distribution of characters than English does.

Some ideas for improvements are in the next section.  Implementing
them is left as an exercise for the reader.

Administrative note: nothing changed in main.asm, so I omitted it from
the lesson.  Instead, I included MAIN.ROOT and MAIN.A, which are
smaller.  (This was done to reduce the size of the archive, which is
going to be bloated with the two C programs I included...)

-=O=-

DEEP THOUGHTS

- Why is a minimum word size useful for MTF?  Why is a maximum size
  bad?  Is it bad?  What role does it play when compressing binary
  data?

- How can whitespace be handled correctly for MTF compression?

- I compared the MacWrite character substitution to Huffman.  However,
  it also looks a lot like a dictionary scheme.  Which is it?

- What would you get if you ran each of these backward?  Can you see
  why MTF does so well when compressing English?

- Computing the frequency tables ahead of time is a pain, because the
  values are compiled into the compressor.  There are two alternatives
  to static dictionaries, adaptive and semi-adaptive.  Semi-adaptive
  would compute the table right before compressing, and put it in the
  file before any compressed data (resulting in a 256-byte header).
  Adaptive starts with an empty dictionary and fills it in as it goes
  along.  What would be an easy way to implement an adaptive digram
  encoder?  (HINT: it's in this lesson!)

- Was the freq.h file included in this lesson generated from Apple II
  text files?

- Why doesn't MTF suffer from the zero-frequency problem?  Or does it?

-=!=-
This document is Copyright by GEnie and Andy McFadden