Hacking Data Compression
Lesson 2: Basics
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie

This lesson covers some really basic stuff which everybody knows how to do. Why bother with it? For one thing, to be complete. For another, not everybody has copies of RLE code lying around, and sample code is worth its weight in silicon.

Two basic ideas will be presented here: RLE and differential coding. We also take a brief detour into the world of lossy compression. If you already know everything there is to know about this stuff, go off and twiddle your thumbs for another week, when we'll pick up with some more interesting (but still conceptually simple) algorithms.

-=O=-

RLE

RLE stands for "Run Length Encoding". It works by taking a series of identical bytes (a "run") and replacing them with a code which specifies what the character was and how many of them there were (the "length").

The usual way to form codes is with a distinguished "escape" character. Escape characters don't have anything to do with the Esc key (ASCII 27); that's just what they're called. Think about how hitting the Esc key in most programs makes the program stop its current activities and do something else. The usage here is similar.

When a run of characters is encountered, say:

    lda   #$1234        ;load the secret number

an RLE encoder would compress the "run" of spaces between the "1234" and the ";" by outputting the escape character, then a space, and then the number of spaces. If your escape character were ']', your output would look like:

    lda   #$1234] 8;load the secret number
                ^^^
                ||+--- count == 8 spaces
                |+---- character is a space
                +----- escape character

Decoding is just the opposite of encoding. The input data is copied to the output until an escape character is hit. At that point the decoder reads the character and the count, outputs that many characters, and then resumes copying data. That's basically it.

However, there is one slight complication: what happens if the escape character appears in the input? If we just copied it to the output, the decoder would see it as the start of a run of characters, and would output garbage. The solution is to output it as a one-byte run of characters, so you'd output an escape character (indicating the start of a run), another escape character (indicating the character to use), and then a one (the number of characters). Note that we can still encode runs of escape characters in the usual way.

Since we use 3 bytes to output a run, we only lose ground if we encounter a solo escape character or two in a row. If we encounter three escape characters in a row, we would output "]]3", which is exactly the same size. The worst case for RLE is a file full of solo escape characters, something like "]a]a]a]a]a]a]a": every ']' costs two extra bytes, so the file can grow to as much as double its original size.

From this we can see that the choice of escape character is fairly important (the value $db - high-ASCII '[' - is a popular choice). It should be the least frequently seen character in the input.

We can make it less important by CHANGING it every time. For example, we could add some number relatively prime to 256 (say 51) to it after every run. That way we only repeat escape characters once every 256 runs, so a burst of '$db's won't screw us up. So long as both the encoder and the decoder adjust the escape value after each run, there's no chance of confusion.

Note that this does make the data more susceptible to transmission errors. Text compressed with plain RLE is still more or less readable, but if some error (say, modem line noise) caused the decoder to become out of sync with the encoder, it would look for the wrong escape character and would start spitting out garbage periodically. (NOTE: I don't know if 51 is a good choice or not. Seems nice enough.)
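To make the decoding side concrete, here is a minimal stream-based decoder for the scheme just described, written in C. It's only a sketch, and it bakes in a few assumptions from the discussion above: the escape value starts at $db, the count byte is the literal run length, and the escape value is bumped by 51 after every run (drop that line if you want a fixed escape character).

#include <stdio.h>

#define ESCAPE_INIT  0xdb    /* high-ASCII '[', a popular choice */
#define ESCAPE_BUMP  51      /* relatively prime to 256 */

/* Decode escape-character RLE from 'in' to 'out'.
 * A run is stored as: escape byte, character, count.
 */
void rle_decode(FILE *in, FILE *out)
{
    int c, ch, count;
    int escape = ESCAPE_INIT;

    while ((c = getc(in)) != EOF) {
        if (c != escape) {
            putc(c, out);                 /* ordinary character, just copy it */
            continue;
        }
        ch = getc(in);                    /* character to repeat */
        count = getc(in);                 /* how many of them */
        if (ch == EOF || count == EOF)
            break;                        /* truncated input */
        while (count-- > 0)
            putc(ch, out);
        escape = (escape + ESCAPE_BUMP) & 0xff;   /* rotate the escape value */
    }
}

The matching encoder has to make exactly the same adjustment after each run it emits, or the two will wander out of sync.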
-=*=-

Note that it isn't necessary to do the encoding exactly as described. Some routines reverse the order of the character and the escape code; it may make the encoder or decoder simpler in certain situations.

It's also possible to avoid using an escape character altogether by assuming that two identical characters will be followed by a length byte. For example, if the input were:

    abcDDDDefGGhi

the output would be:

    abcDD2efGG0hi
         ^    ^
         |    +--- zero Gs follow
         +-------- two more Ds follow

indicating that there are two more 'D's, for a total of four. If there are only two characters, as with the 'G's, an extra 0 must be added.

So the worst case for this scheme is a series of double characters (like "aabbccdd"), where each pair must be encoded in three bytes ("aa0bb0cc0dd0"). This gives a maximum increase of +50% - actually a friendlier worst case than the escape scheme's - but we aren't necessarily gaining anything: if the input text contains more occurrences of double characters than occurrences of the escape character, we will do worse.

-=*=-

A few efficiency notes... Since we represent runs with three bytes, there is no reason to represent runs of two or three bytes specially (in fact, specifying a run of two bytes would be a loss). In terms of decompression speed, the overhead involved in outputting a run of three characters is higher than that for just copying three characters to the output, so RLE encoders should only encode runs of four or more.

Whenever we decode a run, we can assume that there will be at least one copy of that character in the output (why else would we be doing it?). Thus the sequence "escape|char|0" has no meaning; we just told the decompressor that there was a run of 0 bytes, so please get on the ball and output all zero of those bytes right this very minute. We ought to change the meaning slightly, so that the length byte is one LESS than the length of the run. Now a length of zero represents one byte, and 255 represents a run of 256. In fact, if it weren't for the special handling needed for the escape character, we could treat the length byte as length-4.

(And if you want to get really confusing, you could handle the length byte for runs of escape characters specially. But the extra overhead involved is probably not worth the effort.)

-=*=-

It's a simple-minded scheme and, not surprisingly, the results aren't very good for most files. It does well when the data being compressed has large sections of the same character, such as large sparse //gs executables, disk images, or simple graphics images.

(ASIDE: ShrinkIt does an RLE pass before the LZW. This was done because ShrinkIt began its existence as a disk packer, not a file packer. Since most disks will have large sections of unused space filled with zeros, RLE was helpful. However, it does very little when packing normal files, and can sometimes get in the way.)

The primary advantage of RLE over other methods is speed. It's hard to compress data faster than RLE does, because it spends most of its time doing little more than copying data.

-=O=-

PROGRAM (RLE)

There are two basic ways to implement RLE: operate on a stream or on a buffer of data. The stream approach is more widely useful, and fits in with the way other compressors work, so it will be used here. Operating on a buffer will usually be faster because you don't have to deal with the overhead of character I/O, but you end up having to compare pointers against the end of the buffer, so the routine itself becomes more complex. At any rate, the sample code takes a stream-based approach, so that is how the example is written. I threw in the RLE decoder from YankIt as well.
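Since the sample files themselves aren't reproduced in this text, here is a rough C sketch of what a stream-based encoder can look like. It is written to match the decoder sketch shown earlier (rotating escape value, literal count byte) and only bothers encoding runs of four or more; treat it as an illustration, not as the code that ships with the lesson.

#include <stdio.h>

#define ESCAPE_INIT  0xdb    /* must match the decoder */
#define ESCAPE_BUMP  51
#define MIN_RUN      4       /* shorter runs aren't worth encoding */
#define MAX_RUN      255     /* the count has to fit in one byte */

/* Emit one run.  Escape characters always get encoded, even solo ones;
 * anything else shorter than MIN_RUN is cheaper to copy verbatim.
 */
static void emit(FILE *out, int ch, int count, int *escape)
{
    if (ch == *escape || count >= MIN_RUN) {
        putc(*escape, out);
        putc(ch, out);
        putc(count, out);
        *escape = (*escape + ESCAPE_BUMP) & 0xff;   /* stay in step with the decoder */
    } else {
        while (count-- > 0)
            putc(ch, out);
    }
}

void rle_encode(FILE *in, FILE *out)
{
    int c, prev, count;
    int escape = ESCAPE_INIT;

    prev = getc(in);
    if (prev == EOF)
        return;                          /* empty input */
    count = 1;
    while ((c = getc(in)) != EOF) {
        if (c == prev && count < MAX_RUN) {
            count++;                     /* still in the same run */
        } else {
            emit(out, prev, count, &escape);
            prev = c;
            count = 1;
        }
    }
    emit(out, prev, count, &escape);     /* flush whatever is left */
}

Even in this stripped-down form you can see that the fiddly part is deciding when a run ends, including the final flush at end of input.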
It's interesting to note that the encoder can be surprisingly complicated, especially when compared to the decoder. The reason for this is clear when you represent RLE as a finite state machine, with states representing the number of matches seen so far. The machine needs 256 states to count the run length, and the first three treat the end of a run differently than the rest. It's also complicated by the presence of escape codes, which need three entirely different states, and by an extra transition needed for when the end of data is reached.

The decoder also has 256 states, but 255 of them are identical. There's basically one state for "not handling a run" and one for "handling a run"; the rest are all just there to count the number of characters.

The C and assembly code function identically, but there is a slight difference in the way the encoders act at the end of input. For the C version I just copied the code (about 8 lines); for the assembly version I did some fancy skipping around.

Both work pretty much the way you'd expect. The encoders get a character and compare it with the last one. If it matches, look for more. If it doesn't, output what we have. The decoder is almost trivial.

That about wraps it up for generic RLE. Try it on some files and see how it does. One good example is "main.asm" in the assembly sample files directory... Orca assembly files (well, the ones without tabs in them) squash extremely well with RLE. There is also a "test" subdirectory which contains some RLE test files.

-=O=-

PackBytes

PackBytes is in the Miscellaneous tool set of the //gs toolbox (call number $2603, described on page 14-39 of Volume I of the Toolbox Reference). It's most commonly used when compressing Super Hi-Res images into Apple Preferred format (type $c0/$0002). It bears a very strong resemblance to RLE - the basic idea is the same - but it goes about it in a different way.

The output is a series of chunks which begin with a one-byte header. The header byte has two flag bits and 6 length bits, defined as follows:

    00xxxxxx - 1 to 64 bytes follow - all different
    01xxxxxx - 3, 5, 6, or 7 repeats of next byte
    10xxxxxx - 1 to 64 repeats of next 4 bytes
    11xxxxxx - 1 to 64 repeats of next byte taken as 4 bytes

Let's take them one at a time.

For 00xxxxxx, you have the two flag bits (00) and 6 length bits (xxxxxx). With 6 bits you can represent a number between 0 and 63; they used the "length-1" trick and treat it as a length between 1 and 64. If the flag bits are 00, then the next 1 to 64 bytes of data are to be copied without change.

If the flag bits are 01, then it only accepts 2, 4, 5, or 6 and adds one, giving 3, 5, 6, or 7. Yes, it's weird, but it'll make more sense in a minute. It takes the next byte in the input and repeats it that many times. This is similar to what we were doing earlier, except that the escape char is in the first two bits, the length is in the next 6 bits, and the character to repeat comes last. I don't know what PackBytes would do if you gave it a length other than 2, 4, 5, or 6. Be afraid. Be very afraid.

If the flag bits are 10, then it repeats the next 4 bytes 1 to 64 times. So if you have the sequence "ABCDABCDABCD" it would code "10 000010 ABCD", meaning 'output "ABCD" 3 times'. This is a bit more sophisticated than generic RLE because it can take advantage of things like repeated two-byte integers stored in a data file.

If the flag bits are 11, then it takes the next byte, expands it into 4, and then repeats it 1 to 64 times. This is like generic RLE but with a repeat count of 4, 8, 12, etc. Note that by combining this with the repeat counts of 3/5/6/7 above you can represent any run of 3 or more characters (now does it make sense?).
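To make the chunk format concrete, here is a rough C sketch of an unpacker for data laid out this way. It works from the format description above, not from Apple's code, and it is NOT the actual UnPackBytes tool call - the real routine has its own interface (see the tech note mentioned below) - so treat it purely as an illustration.

#include <stddef.h>

/* Expand a buffer of PackBytes-style chunks.  'srclen' is the length of
 * the packed data; returns the number of bytes written to 'dst'.
 * (Sketch only: no overflow checking on the destination buffer.)
 */
size_t unpack_bytes(const unsigned char *src, size_t srclen, unsigned char *dst)
{
    size_t in = 0, out = 0;

    while (in < srclen) {
        int flags = src[in] >> 6;            /* top two bits */
        int count = (src[in] & 0x3f) + 1;    /* low six bits, "length-1" trick */
        int i, j;
        in++;

        switch (flags) {
        case 0:                              /* 00: 1-64 literal bytes */
            for (i = 0; i < count; i++)
                dst[out++] = src[in++];
            break;
        case 1:                              /* 01: 3, 5, 6, or 7 repeats of next byte */
            for (i = 0; i < count; i++)
                dst[out++] = src[in];
            in++;
            break;
        case 2:                              /* 10: 1-64 repeats of next 4 bytes */
            for (i = 0; i < count; i++)
                for (j = 0; j < 4; j++)
                    dst[out++] = src[in + j];
            in += 4;
            break;
        case 3:                              /* 11: next byte taken as 4, repeated 1-64 times */
            for (i = 0; i < count; i++)
                for (j = 0; j < 4; j++)
                    dst[out++] = src[in];
            in++;
            break;
        }
    }
    return out;
}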
I'm not going to describe the interface. I will say that, ON PAIN OF DEATH, you MUST read IIgs Tech Note #094, "Packing It In (and Out)", if you plan to use these routines. It explains some common mistakes and some bugs in the routines.

-=*=-

Does PackBytes do better than generic RLE? Sure looks that way. By merging the length into the header byte and all but doing away with the escape codes, it reduces the amount of excess garbage we need to output. However, shaving one byte here and there doesn't really amount to much, and the added complexity of the algorithm makes for a rather unpleasant speed trade-off.

One place where it DOES do well is in handling patterns of two or four bytes. If your goal is to compress English text, you will not see an improvement. If you are compressing a big pile of identical two or four-byte integers, this will help quite a bit.

How much can it expand a file? By one byte for every 64. This is far better than RLE's worst case.

Note that I don't think I've ever actually used PackBytes, because the interface is kind of annoying unless you know how big the data will get before you start to uncompress it. There's no sample code which uses it. A brief (but sort of icky) sample program appears in the toolbox ref; it's probably worth staring at just to see how it works.

-=O=-

Differential

And now for something completely different. Differential encoding does not make the data any smaller. Thus, it isn't really a "compression" scheme in itself; rather, it is a way of modifying the data so that a compressor can make more sense of it.

Switching back to theory mode for a moment, differential encoding techniques try to change the nature of the data so that the entropy with respect to a particular model is lower. The techniques described here are very simple, but the ideas behind them are powerful and crop up now and again.

-=*=-

Let's start with a simple example: digitized sound files. Your basic system beep is a digital representation of analog data which tends to have a wave-like nature. This means that it is rare to see a long series of identical values; more likely you will see a series of increasing values followed by a series of decreasing values.

To most compressors, this looks like random garbage. There are few identifiable patterns, and the distribution of values is relatively flat when compared to something like English text. However, a series of ascending values may be part of a straight line in the wave, so each is higher than the last by a fixed increment. This suggests that if we could compress the slope instead of the points, we might get somewhere.

The easiest way to do this is by storing the DIFFERENCE between the current value and the last one. So if the input were:

    5  3  5  8 10 12 13 15

the output would be:

    5 -2  2  3  2  2  1  2

Because each point in the digitized waveform is related to its neighbors (assuming a continuous wave), the difference between adjacent points will tend to be small.
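The transformation itself is tiny. Here is a sketch of a byte-oriented differential pass in C (mine, not the lesson's source file), done the way the sample code described later behaves: each output byte is the difference between the current byte and its predecessor, modulo 256, with the "previous" value starting at zero.

#include <stdio.h>

/* Forward pass: each output byte is (current - previous) modulo 256.
 * The "previous" value starts at zero.
 */
void diff_encode(FILE *in, FILE *out)
{
    int c, prev = 0;

    while ((c = getc(in)) != EOF) {
        putc((c - prev) & 0xff, out);
        prev = c;
    }
}

/* Inverse pass: add each difference back onto the running value. */
void diff_decode(FILE *in, FILE *out)
{
    int c, prev = 0;

    while ((c = getc(in)) != EOF) {
        prev = (prev + c) & 0xff;
        putc(prev, out);
    }
}

Run a sound file through diff_encode and look at a histogram of the output: the pile-up of values around zero is what the next step takes advantage of.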
We can run the new data through a Huffman encoding pass (which encodes frequent symbols in fewer bits, infrequent symbols in more bits) and get significantly better compression than we could have achieved without the differential pass. Since the values tend to cluster near zero, Huffman will see a probability distribution concentrated in a small area instead of one spread almost evenly over all 256 values.

However, life is not a bed of roses. Or even geraniums. Digitized sound tends to have a lot of noise, which produces discontinuities in the waves. A noisy sample will end up having large, sudden shifts in values (audible "pops"), and the output of the differential encoder will have a wider range of values, causing Huffman to do poorly.

-=*=-

If you take a look at the sample differential output above, you will note that the values are all rather small. Suppose you had this bright idea that you could represent the values in exactly 4 bits. This has the dramatic effect of cutting your sound file in half. You are guaranteed to get exactly 50% compression; why let Huffman have all the fun? However, there's one small problem: what do you do when the next value exceeds the possible range?

Enter lossy compression. Lossy compression differs from lossless compression in that the compressed data does not need to be an exact representation of the source data. This sort of thing would be unacceptable for English text (although the Cliff's Notes people haven't suffered noticeably), but for digitized sound samples it's perfectly acceptable. You can drastically alter a sound file without making changes detectable by the casual observer - slight changes in pitch, the addition or removal of a faint click, etc. Putting a limit on the difference might actually remove some of the excess noise from a sample.

Still, a range of -8 to +7 or so isn't very much for 8-bit sound. Most likely we would end up flattening the waves, ending up with a rather muffled, lifeless noise. We can do better.

-=*=-

Suppose that, instead of storing the difference between this byte and the next, we stored the difference between where we PREDICT the next byte will be and where it ACTUALLY is? Suppose we've seen the sequence:

    10 14 21 25 30

The simple-minded approach would record the difference between 30 and whatever came next. If the next value were 40, we'd be out of luck, and would have to store a "7" and be content with the loss. In fact, if the stream of data increased by 10 each time, we'd never catch up until it hit 255 and came back down (muffled, lifeless...).

By examining the stream, we see that the most likely value for the next byte is 35. If the value IS 35, we store 0. If it's 40, we store 5. We're still out of luck if it's 45, but by analyzing the stream we can guess that the next value will be higher still - say 55 - and will eventually catch up.

Another possibility is to use non-linear differences. If our sample starts to do something like:

    1 2 4 8 16 32 64 128

we're never going to be quite in step. What we can do instead is double the size of each differential step, so that instead of -8 to +7 we would have -16 to +14 (using -16, -14, -12, -10 instead of -8, -7, -6, -5). This makes our differential value inexact, but it will be off by at most 1. Such a change is probably inaudible, and is certainly less noticeable than being off by 32 or so.

Reverting to theory mode once again, we can see that the entropy of a sound file is directly related to how good our MODEL of sound is. If we have the original brain-dead model, which assumed that the next value was going to be identical to the current value, we don't do very well. A simple model which just looks at the last two values and interpolates a third will do considerably better. A fancy model which watches for trends or takes into account the sine-wave nature of some sounds can do even better, possibly accomplishing in 3 bits what the earlier models can do in 4 or more. Other techniques take into account the nature of sound or of the human ear; for example, faint noises or high-pitched sounds may be inaudible when played at the same time as loud, low-frequency sound.
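Here is a toy C sketch of the predict-and-store-the-error idea with a clamped 4-bit error, so you can see how the encoder and decoder stay in step. The two-point linear predictor and the -8..+7 range are choices made just for this example; this is not ADPCM, and it is nothing like as clever as the ACE tools discussed next.

/* Toy lossy coder: predict the next 8-bit sample by extending the line
 * through the last two reconstructed samples, then store the prediction
 * error clamped to a 4-bit range (-8..+7).  The errors would fit two to
 * a byte; packing them is left as an exercise, and here each error is
 * stored in a whole byte for clarity.
 */
static int clamp(int v, int lo, int hi)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

void predictive_encode(const unsigned char *in, signed char *out, int n)
{
    int prev1 = 0, prev2 = 0;       /* last two RECONSTRUCTED samples */
    int i;

    for (i = 0; i < n; i++) {
        int predicted = clamp(prev1 + (prev1 - prev2), 0, 255);
        int err = clamp(in[i] - predicted, -8, 7);
        int reconstructed = clamp(predicted + err, 0, 255);

        out[i] = (signed char)err;
        prev2 = prev1;
        prev1 = reconstructed;      /* track what the decoder will see */
    }
}

void predictive_decode(const signed char *in, unsigned char *out, int n)
{
    int prev1 = 0, prev2 = 0;
    int i;

    for (i = 0; i < n; i++) {
        int predicted = clamp(prev1 + (prev1 - prev2), 0, 255);
        int sample = clamp(predicted + in[i], 0, 255);

        out[i] = (unsigned char)sample;
        prev2 = prev1;
        prev1 = sample;
    }
}

The important detail is that the encoder predicts from the RECONSTRUCTED samples (what the decoder will actually have), not from the original input; otherwise the two drift apart and the error accumulates.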
One popular model is Adaptive Differential Pulse Code Modulation, or ADPCM (the CCITT G.721 standard is a well-known example). This is what the ACE tool set in the //gs toolbox uses. A fairly simple description can be found on pages 27-4 and 27-5 of volume 3 of the Apple //gs Toolbox Reference. From what I've been able to sort out, it appears to use predictive differential coding with a range of values which varies depending on the rate of change in the input. In other words, it does most of the stuff I just described.

-=*=-

Well, I got pretty lost in the implementation. The important idea in the previous section is that we PREDICT what comes next and then store the amount of ERROR in our prediction. In other words, we record how surprised we were. Sound familiar?

-=O=-

PROGRAM

The source code included here does the simple differential coding, intended to be used as a preprocessor for something like a Huffman encoder. It transforms each byte into the difference between it and its predecessor (modulo 256). The algorithm starts with an initial value of zero. It is so utterly trivial I'm not going to bother describing how it works. Just look at the code.

It seems to help a bit when compressing sound files with ShrinkIt, so if you don't happen to have a Huffman encoder lying around, try that. But don't worry... in a couple of weeks, you'll have one.

You will get the best results on samples which are fairly smooth and continuous; digitized musical instruments come to mind. Speech pulled from movies may not do so well.

-=*=-

I have also included the C source to a fast ADPCM routine which was intended to work on 16-bit sound samples (it does 16:4, which is 75% compression). I haven't bothered converting it to assembly or adjusting it for 8-bit sound samples because we have the ACE toolset available. However, the code is easy to understand (it has comments in all the right places), and it's really short. The neat thing about it is that you can tell it not to use any multiply instructions, so a //gs version would be pretty quick. It'd be interesting to do the conversion and see how it compares to the ACE toolset; any takers?

-=O=-

DEEP THOUGHTS

- would an advantage be gained by using a two-byte count for RLE? What are some alternative ways of encoding the length for data with really long sections of zeros? More interestingly, what effect would considering the data a (2-byte) word at a time have? (i.e., you compare the bytes at 0 and 1 with those at 2 and 3, then 4 and 5, and so on.)

- would data generated by PackBytes look significantly different from that generated by RLE? From this, would you conclude that PackBytes compresses English text better, worse, or about as well as RLE?

- suppose you wanted to compress data first with RLE and then with the 7-bit compression from last week. What changes would have to be made to RLE to accommodate the 7-bit scheme?
  Does the compression have to be done in two separate passes (i.e., do the first pass to a file, then execute the second pass and send the output to another file)?

- the examples for differential compression were done on bytes in a sound file. What are some other kinds of data which might benefit? Ideas: words in the dictionary, pictures in games like The Bard's Tale. Note that for large data items like graphics images you can get away with doing a differential pass and then a simple RLE scheme which just removes zeroes from the data.

-=!=-
This document is Copyright by GEnie and Andy McFadden