Hacking Data Compression  Lesson 11  LZH, LZARI, and LZB
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1993 GEnie

This is the last of the lessons dealing with specific compression methods. The twelfth and final lesson will be on writing HardPressed(tm) compression modules.

All three of the methods described this week are nothing more than LZSS with improved encoding. Thus, lesson 7 (LZ77/LZ78) and lesson 10 (LZSS) are prerequisites.

-=O=-

A brief history of time

I've included a file called "History" in the samples directory. It's an explanation of the history behind LZH, and it includes a brief description of several of the algorithms we've explored during this course. You should be able to read and understand all of it by now. If you're having trouble following one of the sections, go back and reread the lesson that covered it. Sometimes this stuff takes a while to soak in, and this file may help to point out the weak spots.

Perhaps the most interesting part about it is the stuff that ISN'T said. The LZSS implementation from last week began as a simple compressor (no binary trees) back in 1988. Recall that LZ77 was introduced in 1977, and LZSS was published in 1982. Everybody was so hung up on LZW that LZSS was all but ignored until Haruhiko Okumura wrote some code and started blowing the doors off existing software.

Anyway. It all began with LZSS.

-=O=-

LZARI and LZH

The first attempt at improving compression involved arithmetic coding. This was a natural choice, given that it's (theoretically) the best possible encoding method. The problem was how to add it.

Encoding the free characters and the pointers separately made sense, because they really have no relation to each other. If a simple frequency encoder can reduce the size of English text by 20-30%, then the amount of space used by the individual characters should be reduced by a similar amount. (Well, okay, the results would vary, because you are no longer compressing English text, but rather the stuff that didn't match a previous string. Words with rare letters would likely end up as individual characters more often than common words, resulting in a flatter probability distribution and less compression. I don't have exact figures.)

Attempting to adaptively compress all 65536 possible two-byte pointers would be insane (imagine how small the probability intervals would be!). In fact, trying to adaptively compress all 4096 buffer positions would be pretty crazy.

The solution used in LZARI is somewhere between brilliant and deranged. The free characters and the match lengths were encoded adaptively with a single frequency table, by using "character" 253+length to represent the match length (where length goes from 3 to F). Okumura used F=60 to get longer matches (remember, we no longer have to worry about things being an integral number of bits, so we don't need to stop at 18 characters). The position was encoded with a static-frequency model, which assumed that matches are most often found in the most recently seen data. The program comments point out that this is probably sub-optimal. But it worked.

So. When you want to output a lone character, you call EncodeChar() with a value between 0 and 255. When you want to output a pointer, you call EncodeChar() with a value between 256 and 256+F-THRESHOLD, and then EncodePosition() with a value between 0 and N-1. When decoding, you ask for the next character and check the value. If it's >= 256, then what follows is a pointer; otherwise you just output the value as a character.
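To make that concrete, here is a minimal C sketch of the decode loop this implies. DecodeChar() and DecodePosition() stand for the arithmetic decoder's two entry points, the ring-buffer handling follows the LZSS code from lesson 10, and the position is treated as an absolute buffer offset; the provided lzari.c may differ in details, so treat this as an illustration rather than a copy of it.

#include <stdio.h>

#define N         4096      /* ring buffer size, as in lesson 10 */
#define THRESHOLD 3         /* shortest match coded as a pointer */

extern int DecodeChar(void);      /* arithmetic decoder: 0..255, or 256 and up for a match length */
extern int DecodePosition(void);  /* arithmetic decoder: buffer offset, 0..N-1 */

void decode(FILE *out, unsigned long textsize)
{
    static unsigned char text_buf[N];
    unsigned long count = 0;
    int c, pos, len, i, r = 0;

    while (count < textsize) {
        c = DecodeChar();
        if (c < 256) {                       /* a free character */
            putc(c, out);
            text_buf[r++] = (unsigned char) c;
            r &= (N - 1);
            count++;
        } else {                             /* a pointer: length, then position */
            len = c - 256 + THRESHOLD;       /* "character" 253+length */
            pos = DecodePosition();
            for (i = 0; i < len; i++) {      /* copy len bytes out of the ring buffer */
                c = text_buf[(pos + i) & (N - 1)];
                putc(c, out);
                text_buf[r++] = (unsigned char) c;
                r &= (N - 1);
                count++;
            }
        }
    }
}

The encoder is the mirror image: it just makes the EncodeChar()/EncodePosition() calls described above.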
The neat part here is that the extra bit we had to use to tag characters and positions got swallowed up by the arithmetic encoding. If there are more pointers than free characters, the extra information needed to tell the difference will be less than one bit for pointers and more than one bit for characters. (This is the beauty of a good encoder.)

-=*=-

As we saw back in lesson 6, arithmetic encoding can be *very* slow. So, a guy named Haruyasu Yoshizaki changed the LZARI sources to use Huffman encoding instead. The result, which he called LZHUF but which is more commonly known as LZH, compressed almost as well as LZARI but in a reasonable amount of time.

The method used to encode characters and match lengths is identical, but uses adaptive Huffman. For the buffer offset, the upper 6 bits are encoded with a static Huffman tree. It's the same general idea, but it can be encoded and decoded with a simple table lookup. The lower 6 bits are sent without any encoding. The adaptive Huffman routine should look familiar; it was the basis of the C code presented back in lesson 5.

-=O=-

LZB and binary tricks

And now for something completely different. If you turn to page 222 in the Good Book (i.e. Bell/Cleary/Witten) you will see a description of something called "LZB". Basically, instead of Huffman codes, you use simple variable-width integers.

What's a variable-width integer? Glad you asked. The variable-width part just means that the value is expressed in a variable number of bits. Huffman codes fall into this category. Recall that Huffman codes have a "unique prefix" property, so that you can always tell where the code ends because it follows a unique path through a binary tree. What we need is a way to encode values such that we can tell where the end of the integer is without needing auxiliary data structures.

The simplest example is unary coding:

    value   binary coding
      1     01
      2     001
      3     0001
      4     00001
      5     000001
     16     00000000000000001
     64     [...]
    256     [...]

A bunch of zeros equal to the value, followed by a '1' stop bit. However, expressing an arbitrary value between 0 and 4095 would tend to be wasteful. So, we consider some other ideas. These come from Appendix A in Bell/Cleary/Witten, which also includes some other neat ideas. (Check out the start-step-stop codes.)

First off, consider binary coding with the leading zeros removed:

    value   binary coding
      1     1
      2     10
      3     11
      4     100
      5     101
     16     10000
     64     1000000
    256     100000000

Note that they all have leading '1's. Now we remove those:

    value   binary coding
      1     ---
      2     0
      3     1
      4     00
      5     01
     16     0000
     64     000000
    256     00000000

Problem is, we don't know how long these are. So, we encode the length with the unary coding we looked at first (colons added for clarity):

    value   binary coding
      1     1:
      2     01:0
      3     01:1
      4     001:00
      5     001:01
     16     00001:0000
     64     0000001:000000
    256     000000001:00000000

This method is actually floor(log i) zeros, followed by a 1, followed by the binary code without the leading 1. Since the first part is a unary count, there is one '0' for every bit in the second part. This means we can intersperse the data bits with the zeros, like this:

    value   binary coding
      1     1
      2     001
      3     011
      4     00001
      5     00011
     16     000000001
     64     0000000000001
    256     00000000000000001

This has the same length, but can be implemented with a simple shift routine (assembly: shift left; if the carry is set then exit, else shift the next bit into the integer we're forming).
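Here's a small C sketch of that last scheme, matching the final table above. The bit I/O is a toy in-memory buffer just to keep the example self-contained, and the names put_varint()/get_varint() are mine, not from any of the provided sources; get_varint() is exactly the shift-and-test loop just described.

#include <stdio.h>

static unsigned char buf[64];
static int wpos = 0, rpos = 0;          /* write/read positions, in bits */

static void put_bit(int b)
{
    if (b) buf[wpos >> 3] |= 0x80 >> (wpos & 7);
    wpos++;
}

static int get_bit(void)
{
    int b = (buf[rpos >> 3] >> (7 - (rpos & 7))) & 1;
    rpos++;
    return b;
}

/* Emit v (v >= 1): for each bit below the leading 1, most significant
   first, emit a '0' and then the data bit; finish with a '1' stop bit. */
void put_varint(unsigned v)
{
    int k = 0;
    while ((v >> (k + 1)) != 0) k++;    /* k = floor(log2 v) */
    while (k-- > 0) {
        put_bit(0);
        put_bit((v >> k) & 1);
    }
    put_bit(1);
}

/* The "shift left, exit on carry set" loop from the text. */
unsigned get_varint(void)
{
    unsigned v = 1;
    while (get_bit() == 0)
        v = (v << 1) | get_bit();
    return v;
}

int main(void)
{
    unsigned i;
    for (i = 1; i <= 16; i++) put_varint(i);
    for (i = 1; i <= 16; i++) printf("%u ", get_varint());
    putchar('\n');                      /* prints 1 2 3 ... 16 */
    return 0;
}

put_varint(5) produces 00011 and put_varint(16) produces 000000001, just as in the table.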
There are other variations, which do things like use the above codes to store the length, or have several stages of codes.

-=*=-

LZSS starts out with an empty buffer and gradually adds characters. If we've only got 16 characters in the buffer, then the match position must occur at offsets 0-15, so we only need 4 bits to store the value. The codes grow longer as the dictionary grows.

LZB uses this method to store the buffer offset. Clearly this only has an effect for the first N bytes of data (where N is the size of the ring buffer), but we can now choose N=8192 or even N=32768 and pay no penalty on files smaller than 8K or 32K.

To store the match length, LZB uses the last variable-width integer scheme examined in the previous section. Since there is no fixed size, there is no longer a need for a parameter F. The match can be as large as the buffer (or larger, if you want to play games with the overlap feature of the decoder).

The absence of F and the changing size of the buffer lead to an interesting problem: what is the value of "THRESHOLD", i.e. when is a match long enough to be worth coding as a pointer instead of a series of characters? Before you shout "three", keep in mind that an LZSS pointer for a two-byte match is 16 bits + 1 tag bit, whereas two characters would take 16 bits + 2 tag bits... we're actually wasting one bit each time we ignore a two-byte match in LZSS. (It does make things go faster, though.)

Since this method still uses the tag bit from LZSS to tell pointers and characters apart, the minimum size of a pointer is 2 + k bits, where k is ceiling(log n). Back up a step. We need 1 bit to tell that it's a pointer. The minimum match length is 1, and we represent that with 1 bit. "n" is the number of characters currently in the buffer; if there are 7 characters in the buffer, log(n) is 2.x, so we let k=3. Thus, the minimum size of a pointer is 2+k bits.

Bell/Cleary/Witten found that the best value for p (the minimum useful match length) is around floor((3+k)/8)+1. So if n <= 4096, the value would be p=2, meaning that matches of length 2 or more should be encoded as a pointer. (Consider: 12 bits of position + 5 bits for the length + 1 bit for the tag = 18 bits, and two characters + 2 tag bits = 18 bits. You start losing ground when matches reach 10 characters (8+p), but for English most matches are short anyway.) There isn't a definite "best value" for p, but it can be stored in a table with log(N) entries and adjusted based on experimentation.

As with LZSS, the threshold is subtracted from the match length, so that the lengths start at zero. Whoops, can't do that: we have no way of expressing zero with our number scheme (how very Roman). Instead, we start at 1, so lengths are stored as (length - p + 1).

-=*=-

The neat part about LZB is that it gets compression roughly equal to LZH and LZARI, but you don't have to have all the icky Huffman and arithmetic encoding junk around. In fact, the results in the back of Bell/Cleary/Witten indicate that LZB is the best performer of the LZ77 variants.

In terms of encoding speed, LZB will be considerably slower than LZSS, because it has to do a lot of bit shifting. Also, it will be comparing potentially much longer strings, and probably more of them if we increase N as well. On the other hand, it doesn't have to do all of the fancy adaptive stuff that LZH and LZARI do, so it will likely be faster than those. For decoding, LZB will be doing the same things LZSS does, but will have to do a lot of bit-shifting as well. The reduction in speed will certainly be worth the improvement in compression. Of course, LZB should decode considerably faster than either LZH or LZARI.
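Before moving on to the practical details, here is a rough sketch of how an LZB encoder might emit its two kinds of tokens, reusing put_bit() and put_varint() from the earlier sketch. The function and variable names are illustrative, not taken from Bell/Cleary/Witten or any particular implementation.

/* 'n' is how many characters have entered the ring buffer so far, so a
   position fits in k = ceiling(log2 n) bits; p is the minimum match
   length worth coding as a pointer, per the formula above. */

static int pos_bits(unsigned n)          /* bits needed for offsets 0..n-1 */
{
    int k = 1;
    while ((1u << k) < n) k++;
    return k;
}

static void put_literal(unsigned char c)
{
    int i;
    put_bit(0);                          /* tag: free character */
    for (i = 7; i >= 0; i--)             /* raw 8-bit character */
        put_bit((c >> i) & 1);
}

static void put_pointer(unsigned pos, unsigned len, unsigned n)
{
    int i, k = pos_bits(n);
    int p = (3 + k) / 8 + 1;             /* minimum useful match length */

    put_bit(1);                          /* tag: pointer */
    for (i = k - 1; i >= 0; i--)         /* position in k bits */
        put_bit((pos >> i) & 1);
    put_varint(len - p + 1);             /* length, biased so length p maps to 1 */
}

The decoder mirrors this: read the tag bit, then either 8 raw bits or k position bits followed by get_varint(), with both sides tracking n as characters arrive.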
From a practical standpoint, it should be obvious that reserving part of the ring buffer as a lookahead buffer is no longer practical... the size of the lookahead buffer was equal to F. Instead, we have to use a separate buffer to hold the data as it is being read, or just ungetch() the first character which doesn't match.

It's interesting that so few people use this. Probably because they don't see how it could be better than LZH. After all, it isn't using "optimal" coding on the match lengths, it still has the extra tag bits, and it makes no attempt to compress the free characters.

-=O=-

PROGRAM(S)

There is no assembly code. As far as I know, there are no (working) assembly-language implementations of LZH, LZB, LZARI, or even plain arithmetic encoding for the IIgs. The reason is that they are a lot of work, especially if you want to make them fast.

I have provided C code with minimal changes made for LZH and LZARI; I don't know of any PD sources for LZB. I believe LZB is used in the MS-DOS archiver "ARJ", but the author only provides source for the decompressor (with the explicit limitation that the code cannot be used in another archiver program... whatever happened to the hacker ethic?)

The original LZH code had some implementation-specific code in it (it only worked on 16-bit architectures); I fixed it up a bit. Other than that, the code is pretty much unchanged.

It's all rather straightforward, especially since you've already seen all of the pieces before. This stuff just puts it all together.

NOTE: these programs can be a bit slow. Compressing MoriaGS with LZH (from 577K to 227K!) took 43 minutes on an accelerated IIgs. It only took 5 minutes to uncompress.

-=O=-

DEEP THOUGHTS

- LZARI uses arithmetic encoding, which should do better than a Huffman encoder and much better than simple unary encoding. This does not appear to be the case here; the differences are slight, possibly in LZB's favor. Why?

- How could LZB's variable-width integer scheme be applied to RLE? To MTF?

-=!=-
This document is Copyright by GEnie and Andy McFadden