Hacking Data Compression
Lesson 1: Introduction
By Andy McFadden

There's something almost mystic about data compression. The ability to take a file full of data, reduce it to half its original size, and then restore it to its former glory has a magical quality to it. Even after the various algorithms have been read and digested, the question remains: why is this even possible?

-=*=-

I've been fascinated with data compression for the last few years, beginning when I watched an early version of ShrinkIt. It compressed files more tightly than BLU/SQ3 (the previous standard, for those of you who don't remember Life Without ShrinkIt), but worked its magic in half the time. The key was the wondrous LZW algorithm, which I'd never heard of before.

I did what you'd expect: I located a copy of the sources for UNIX compress, which also used LZW. It was written in C, so I figured it'd be easy going. Well, compress v4.0 was a single file 20 pages long which violated a few programming conventions, had lousy structure, and what comments it had were directed toward people who already knew how LZW worked. Being new to C, I didn't get very far.

I imagine a lot of people have ended up in a similar position, unable to figure out WHY a compression program is doing what it's doing. Compression textbooks are generally unhelpful, since they are usually filled with information theory and funny symbols which don't appear on a standard keyboard. It's hard to write code from a description like, "uses a dictionary scheme to encode previously seen substrings as variable-width codes."

As it turns out, there's a reason why so little is said about the implementation of the more common algorithms. They're utterly simple.

-=O=-

OVERVIEW

I have two goals for this course: (1) provide a nice fat library of algorithms to code-hungry hacker types, and (2) provide a moderately deep understanding of how and WHY the algorithms work. I will try to break sections cleanly, so that people out to snarf code can grab it and run without having to wade through long discussions about Why Andy Thinks This Stuff Is Really Cool.

Most lessons will begin with some theoretical musings, have a middle section discussing the algorithm du jour, and end with more Deep Thoughts and ideas on extensions, improvements, enhancements, and possibly lead-ins to next week's topic. This week's material is mostly fluff, so algorithm-only types should skip to the sample code. (There are no quizzes or exams. Ignore all you want, I'll type more.)

I will judge this class a success if, having read all of the material I've written, a student walks away thinking, "how trivial." If it still looks like voodoo and mirrors when it's all over, you get a refund (all $0.00).

I'm trying to refrain from posting a big "PROGRAMMERS ONLY" sign on this course, but let's face it, what I'm telling you won't do you much good if you don't have some basic programming skills. The deeper knowledge of compression may be good for your immortal soul, but be forewarned that much of this course is concerned with application rather than theory, and is described in a way that programmers familiar with bit-twiddling will understand. For pure theory, see the list of references at the end of this lesson.

That said, don't be intimidated. There's a real-time conference for discussing the confusing bits, and usually things which don't make sense become clearer after you read them twice.

Live working code will be provided in C and/or assembly.
I will try to provide both - C for understanding and assembly for usefulness - but I make no guarantees. You WILL need an assembler or a C compiler to try the examples, and while I will try to make the C examples portable, you will probably want to run them on an Apple II. I've been using APW C for a few years and I'm perfectly happy with it, so C programs will be written in K&R C. I'll throw in ANSI prototypes for Orca/C fans. Assembly will be for the Orca/M assembler.

A list of what this course covers is given in the syllabus. There are a number of interesting algorithms which this course does NOT cover, largely because I don't know much about them. This includes specialized algorithms for graphics and sound (JPEG, JBIG, MPEG), and some routines which take forever (PPMC). I'm not going to spend much time on "lossy" compression, which employs schemes that achieve better compression by ignoring selected portions of the input data.

-=*=-

How does data compression work? It's easy to explain in terms of specific algorithms. RLE combines long runs of characters into a single code (so "aaaabccc" becomes "a4bc3"). Huffman replaces more frequent characters with short codes and less frequent characters with longer ones (so "e" would take up two bits instead of 8, while "q" would take 12). LZW replaces entire strings with codes representing previously seen strings (so "face the faces" becomes "face the 1s"). However, there doesn't seem to be a common idea running through all of these schemes.

There is, and it's a simple one. All of them try to guess what comes next, and then output how surprised they were.

Huh? Look at each one in turn:

- RLE sees a character. It's rather simple-minded, so it says to itself, "hey, I bet *ALL* of the rest are just like this one!" If the next character is identical, it stores nothing, because it wasn't surprised. If the next one is different, it just stores the character without compressing it at all.

- Huffman assigns probabilities to every possible character. If it sees a common one, like 'e', it's only mildly surprised, so it outputs a few bits. If it sees a rare one, like '^', it's flat out flabbergasted, and outputs a whole lot of bits.

- LZW wants to believe that the input is composed of strings it has already seen. If it is, LZW is mildly (but pleasantly) surprised, and outputs a code which can represent dozens of bytes. If it encounters a string it hasn't seen before, it becomes rather shocked, and the code it outputs may be longer than the string it represents.

[ Quick aside: this is why random data doesn't compress (impossible to predict what comes next), and why already compressed data rarely compresses further (most of the surprise has been removed). ]

What makes one algorithm different from another is the way it tries to predict what comes next. Like an astrologer or a fortune-teller, a data compression algorithm is judged by how well it can predict the future.

You may have a question nagging you at this point: we're compressing files, not playing the ponies; why can't we just scan the whole thing and find the optimum way to pack it? First of all, finding an optimal way to pack something is an NP-complete problem (if you don't know what that means, substitute "it takes a bloody long time"). Second, the information you gathered while scanning the file must be stored somewhere. After all, you won't have the uncompressed file sitting there when it's time to uncompress.
If you gather a lot of information, it must be encoded in the output somehow, and this can occupy so much space that the algorithm performs poorly. The bottom line is, compression by blind prediction usually does about as well as scanning the whole thing first, so the extra pass usually isn't worth the trouble. There will be some examples of this sort of thing later in the course.

Anyway. What's important is to view data compression as a way of reducing data to just those things which are unexpected. It's a little spooky at first, because a really, REALLY good compression algorithm would be able to write my next sentence for me, given enough information about the words I use and what I've said so far. In fact, by storing collections of word pairs, it could type every other word for me, and only store data when it guessed wrong. Of course, I can always snarfugle it deliberately.

-=*=-

One of the textbooks I really like (_Text Compression_, by Bell/Cleary/Witten) makes a very clear distinction between the "model" and the "encoder". The model figures out a set of probabilities for what comes next, and hands that to the encoder. The encoder takes that information, reads the next piece of data, and then outputs a value that reflects how accurate the model was (i.e. how much surprise was contained in the data).

Huffman is a good example of an encoder. Typically, a program using Huffman takes a set of input symbols (usually the 256 characters you can represent with a single byte) and a set of probabilities (which is just the number of times each character has appeared), and produces an output code for each. In this case, the "model" is simply "how many times have I seen this character before", and the "encoder" is Huffman's algorithm.

However, it's possible to use Huffman as the encoder for a more powerful model, gaining better compression. For example, the model could more accurately predict what the next character will be based on what the LAST character was. The letter 'q' is most often followed by the letter 'u', so if you just saw the letter 'q', then 'u' should have a higher probability than 'e', even though 'e' appears more often in English text.

Models don't have to be simple things like character probabilities. Suppose my model is "the full text of _Neuromancer_". If I am told to compress an exact copy of the book, I'm 0% surprised, and output nothing. If even one character is off, I'm 100% surprised, and must output the entire input stream plus one bit. Clearly this is an efficient way to distribute the book, but it isn't good for much else.

(Going back a few paragraphs, we can find the optimum compression for ANY book by sending the whole thing first - thus defining the model for both sender and receiver - and then sending a transmission 0 bits long. The first step is a bit rough, though. This is a somewhat exaggerated justification for processing the data as it comes in rather than analyzing the whole file at once: you don't need to send a whole bunch of stuff before you can send useful information.)

-=*=-

One very interesting thought process for assessing the value of a model is to stop it at some point and run it *backwards*. The idea is that a really good model will generate output which looks like the data it's trying to model. You can do this in practice by feeding random data to a decoder, but for now just try to picture how the output "feels." Let's check some examples against English text:

- RLE will generate long strings of identical characters. This doesn't really look like English.
Not surprisingly, RLE does poorly.

- Huffman will generate characters at random, but correctly proportioned for English. It'll look like garbage, but there'll be lots of 'e's, 'a's, and 't's. Huffman does reasonably well.

- LZW will spit out pieces of words, whole words, parts of sentences, whole sentences, and possibly even entire paragraphs. It may not make much sense, but it sure looks like English. Thus, LZW compresses very well after it reads a lot of input.

-=*=-

In the beginning, there was nothing. Then the stars and planets formed. Then we showed up. Bummer.

I'm trying to avoid excessive quantities of fancy terminology and assorted nonsense in this course, but there are a few terms which come up often and need to be explained. The first paragraph is an example of negative entropy, order from chaos. In the physical sciences, the "entropy" of a system is a measure of the disorder within it. This term was borrowed for information theory, and has essentially the same meaning: the entropy of a collection of data is a measure of the amount of disorder it contains.

Turned around, entropy is a measure of how much information is contained in a message. If we're expecting _Neuromancer_ and we get it, there's no information there - we were expecting it. If the sender passes along a copy of _Ringworld_ instead, there's a lot of information we didn't have. Going back to an earlier section, entropy is a measure of how surprising the input data is.

The most interesting result we can derive from this is that, if we can compute the entropy of the source data, we know the smallest possible size of the compressed output. However, the amount of surprise contained in the input data depends on who is looking at it (LZW is surprised less often than RLE when handling English text), so the entropy of a message must be computed relative to a particular model. (Note that there IS an absolute entropy for a given set of data. For example, there is a point at which English text cannot be compressed further. Unfortunately, nobody knows what it is yet.)

In case I haven't confused you enough, I'll throw in one more tidbit: just because the MODEL can produce appropriate symbols and probabilities for a message doesn't mean that the ENCODER can produce output which matches the entropy of the message with respect to the model. We'll get back to this when we discuss Huffman and arithmetic coding.

Well, back to reality. Some compression programs will say something like "entropy was 2.85 bits/symbol", meaning that, for the model used, each 8-bit character in the file carried on average 2.85 bits of surprising information. The number of bits per symbol is often given when comparing algorithms. For English text, LZW gets around 4.71 bits/char, LZH gets 3.30, and PPMC (which requires large amounts of time and memory) gets 2.48 bits/char [information comes from appendix B of _Text Compression_]. If you can come up with an algorithm which drops below 2 bits/symbol, you will become very popular in a short period of time.

-=O=-

PROGRAM

This section will discuss the algorithm(s) for each week's programs. Code-only types can start here.

This week we're just getting started, so I decided to do something kind of silly. It's a compressor which takes a text file and reduces it by about 12%. The trick is to take an 8-bit byte and store it in only 7 bits, which is all ASCII really needs. Of course, if it finds a byte that actually uses all 8 bits (i.e. one with the high bit set), it has to abort. Thus it will only work on pure text files.
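To make the bit-shifting concrete, here is a rough sketch of the packing loop in C. This is NOT the cp7.c from the course archive (the function name encode7 and the layout are just for illustration); the real program is split into the encode() and decode() routines described below. The idea is simply to collect 7 bits from each input character in an accumulator and write out a byte whenever 8 or more bits have piled up:

  /* Sketch of 8->7 bit packing.  Returns -1 if a byte with the high
   * bit set shows up, 0 on success.
   */
  #include <stdio.h>

  int encode7(FILE *infp, FILE *outfp)
  {
      unsigned long bits = 0;       /* bit accumulator */
      int count = 0;                /* number of valid bits in accumulator */
      int ic;

      while ((ic = getc(infp)) != EOF) {
          if (ic & 0x80)
              return -1;            /* found 8-bit data; give up */

          bits = (bits << 7) | (unsigned long) ic;   /* append low 7 bits */
          count += 7;

          while (count >= 8) {      /* flush full bytes, high bits first */
              putc((int) ((bits >> (count - 8)) & 0xff), outfp);
              count -= 8;
          }
          bits &= (1UL << count) - 1;   /* drop the bits already written */
      }

      if (count > 0)                /* left-justify and zero-pad last byte */
          putc((int) ((bits << (8 - count)) & 0xff), outfp);

      return 0;
  }

The decoding side just reverses the shifts (accumulate 8 bits per input byte, then peel off 7 at a time), but it needs some way to know exactly where the original data ends.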
The program stores the 4-byte length of the input file at the start of the output, because otherwise we could be off by one if the last byte has only 1 bit in it (the other 7 might be used or they might not).

This isn't supposed to be an incredibly useful tool, but rather something to get everybody warmed up. It uses the compression "shell" and bit-shift operators in C and assembly, so that problems relating to your assembler or compiler will be identified early, and everyone who isn't familiar with C can get a look at a simple example.

There are two copies of the program, one in C, one in assembly. The C program has two source files, "cmain.c" and "cp7.c". "cmain.c" provides a simple user interface and will be used throughout the course. "cp7.c" has two functions, "encode()" and "decode()", which do pretty much what you'd expect. Both operate on a stream of data, using the standard I/O functions getc() and putc().

The assembly version has "main.asm" and "p7.asm", plus the usual assortment of macro files and other junk. There's some stuff for GSBug users there too. It works in an almost identical fashion to the C version, except that getc() and putc() are defined in "main.asm".

There's one slight difference in the way they handle arguments. Under Orca, ".." must be expanded using the shell wildcard expansion calls. This works when you specify an input file, but not when you specify an output file which doesn't exist. The C code (pulled from NuLib) correctly handles this, but the assembly code (pulled from YankIt) doesn't.

The way to use either version is:

  program e infile outfile     to compress (encode) a file
  program d infile outfile     to extract (decode) a file

These are rather simple-minded, and don't do nice things like checking to see if the output file already exists (I've found that such a feature just tends to get in the way more often than not). If you want to spruce them up, feel free to do so... as far ahead as I can think, all the code will use the same "cmain.c" and "main.asm" files.

When they finish, both versions will print the size of the file before and after compression. For encoding, they will also print the percent compression (where no compression == 0%, reduction by half == 50%, and expansion by half == -50%) and the number of bits per input character. For this week you would expect that to be exactly 7, but because we have to store the length of the original file it will be slightly more. Another contributing factor is that the last byte in the output may not have all 8 bits filled.

-=O=-

DEEP THOUGHTS

Fun stuff to think about:

- for this week's example (the 8->7 bit compressor), what is the model? What does it expect? What surprises it? What would it generate if you ran it backward?

- an encoder is given the probabilities for a set of symbols. What happens if the input symbol isn't in the set? i.e. we're told there's a 50% chance of 'a' (so we'd output a 0) and a 50% chance of 'b' (so we'd output a 1), but we get 'c' instead.

- suppose something really surprising happens. Is there a way to avoid storing the full measure of surprise? What happens to the data when we uncompress it?

- in the case where the model was the full text of _Neuromancer_ and the input didn't match, I had to output the entire input, plus one bit. Why the extra bit?

- running models backward suggested that LZW outperformed Huffman on English text. For what sort of input will Huffman outperform LZW?
(Recall: Huffman generates short codes for frequent characters and long codes for infrequent characters, while LZW replaces repeated strings, e.g. all instances of " the " would be replaced with a single code.)

- an interesting result from information theory is that if a compressor can make some set of input data smaller, then there exists a set of input data which it will make larger. Convince yourself that this is true.

- suppose we're in a situation where we CAN'T abort the compressor. How can we change the 7-bit compressor to handle short bursts of 8-bit data?

-=O=-

REFERENCES

First, here's the bibliography from the Frequently Asked Questions List for the Usenet newsgroup "comp.compression" (reformatted slightly):

-----
Subject: [7] Which books should I read?

[BWC 1989] Bell, T.C., Witten, I.H., and Cleary, J.G.
    "Text Compression", Prentice-Hall, 1989. ISBN 0-13-911991-4.
    Price: approx. US$40.
    The reference on text data compression.

[Nel 1991] Mark Nelson
    "The Data Compression Book", M&T Books, Redwood City, CA, 1991.
    ISBN 1-55851-216-0. Price $36.95, including two 5" PC-compatible
    disks bearing all the source code printed in the book.
    A practical introduction to data compression. The book is targeted
    at a person who is comfortable reading C code but doesn't know
    anything about data compression. Its stated goal is to get you up
    to the point where you are competent to program standard
    compression algorithms. {which is more or less what this course is
    all about!}

[Will 1990] Williams, R.
    "Adaptive Data Compression", Kluwer Books, 1990.
    ISBN 0-7923-9085-7. Price: US$75.
    Reviews the field of text data compression and then addresses the
    problem of compressing rapidly changing data streams.

[Stor 1988] Storer, J.A.
    "Data Compression: Methods and Theory", Computer Science Press,
    Rockville, MD. ISBN 0-88175-161-8.
    A survey of various compression techniques, mainly statistical
    non-arithmetic compression and LZSS compression. Includes complete
    Pascal code for a series of LZ78 variants.

[ACG 1991] "Advances in Speech Coding", edited by Atal, Cuperman, and
    Gersho, Kluwer Academic Press, 1991.

[GG 1991] "Vector Quantization and Signal Compression", by Gersho and
    Gray, Kluwer Academic Press, 1991.

[CT 1991] "Elements of Information Theory", by T.M. Cover and
    J.A. Thomas, John Wiley & Sons, 1991.

Review papers:

[BWC 1989] Bell, T.C., Witten, I.H., and Cleary, J.G.
    "Modeling for Text Compression", ACM Computing Surveys, Vol.21,
    No.4 (December 1989), p.557.
    A good general overview of compression techniques (as well as
    modeling for text compression); the condensed version of "Text
    Compression".

[Lele 1987] Lelewer, D.A., and Hirschberg, D.S.
    "Data Compression", ACM Computing Surveys, Vol.19, No.3
    (September 1987), p.261.
    A survey of data compression techniques which concentrates on
    Huffman compression and makes only passing mention of other
    techniques.
-----

Now, a few comments about the books I've seen...

The first data compression book I bought was _Text Compression_, by Bell, Cleary, and Witten. If you are interested in theory stuff, chapter 2 is a *wonderful* discussion of entropy, models vs. encoders, etc. I had to force myself to leave the book on the shelf out of fear of inadvertently plagiarizing the whole thing. All of the material is presented in a way which is mathematically precise yet decipherable by people who don't have Ph.D.s in the field. The definitive paper on arithmetic coding was written by them, and the chapter which discusses it is very thorough (and does in fact contain source code).
The second book, _The Data Compression Book_ by Mark Nelson, was purchased after I had finished most of this first lesson. I flipped through the table of contents and was surprised to see that it was nearly identical to my proposed syllabus! The book is very similar to this course, in that it presents algorithms and gives actual code for them.

So the question is, should you buy this book? As always, "it depends." Most of what's in the book will be covered in this course. His examples were written for an IBM PC, but should be portable to other platforms. However, they are generally written in a heavily structured way which matches pseudocode closely but is less efficient. I actually found them harder to read, because once you understand what is going on you don't want to be flipping between a dozen separate routines. On the other hand, I believe his examples are easier to decipher the first time through (ASCII art illustrations can only go so far), and his descriptions will be easier to understand if you're relatively new to programming.

As an added bonus, the book covers lossy image compression (JPEG), which will not be covered in this course. It has one of the most readable descriptions of DCT stuff that I've seen (about 60 pages long), so if you're interested in lossy compression, you may want to check it out.

Before you sign off GEnie and never come back: there WILL be things in this course which he doesn't cover, not to mention source code in assembly language. I will also try to spend more time discussing both theory and efficient implementation, the former because I find it fascinating, the latter because it's absolutely vital on an Apple II. I may also throw in a few interesting code tidbits. ;-)

-=!=-
This document is Copyright by GEnie and Andy McFadden