Hacking Data Compression   Lesson 7   Ziv and Lempel
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie

There is NO source code for this week's lesson. If you're only after
sample code, you can stop reading now.

This lesson is an introduction to the algorithms on which most recent
compressors are based. They are known as LZ77 and LZ78, referring to the
papers published by Ziv and Lempel in 1977 and 1978, respectively (the
transposition of 'L' and 'Z' was made early and propagated widely). They
form the basis of almost all modern adaptive dictionary compression
techniques.

They are also patented to death. All LZ78-class algorithms and many
LZ77-class algorithms are the victims of software patents, meaning that if
you use them you will have to pay the patent holders or face a court
challenge. Many of the patents are built on very rocky soil, but few of us
can put up enough money to fight them in court.

Most of the patent holders will not go after non-commercial uses of
patented algorithms, which is why we have programs like ShrinkIt and UNIX
compress. However, if you want to use them in a commercial product (e.g.
HardPressed), you will have to pay.

-=O=- The story so far

Just to make sure we're all up to speed...

- Dictionary compressors replace a symbol or collection of symbols with an
  index into a dictionary. How the symbols are chosen and how the index is
  represented in the output are implementation-dependent. See Lesson 3 for
  a more in-depth discussion of dictionary models.

- Adaptive dictionary compressors start out with no knowledge or with a
  small static dictionary, and then expand the dictionary with pieces of
  the input as they work. See Lesson 5 for a discussion of the merits of
  adaptive models.

We've already seen one example of an adaptive dictionary compressor, the
MTF (Move To Front) compressor from Lesson 3. However, that only worked
with whole words, and had some complications.

-=*=-

What makes these algorithms valuable is that they combine the advantages
of an adaptive model with the speed and efficiency of dictionary models.
Not only will LZ78 compress better than your average semi-adaptive Huffman
compressor, it will do it much more quickly. What is surprising about
these schemes is that they are simple in both concept and implementation.

-=O=- LZ77

Jacob Ziv and Abraham Lempel, "A universal algorithm for sequential data
compression", IEEE Transactions on Information Theory, Volume 23, Number
3, May 1977, pages 337-343.

LZ77 works by treating the input file as the dictionary. Instead of
encoding an index referring to a dictionary full of strings, it encodes a
triple <off, len, char>, where "off" is the file offset at which the
string starts, "len" is the length of the match, and "char" is the first
character that didn't match.

For efficiency reasons, most LZ77-based schemes use a "sliding window"
approach. The compressor keeps track of the last N bytes of data (usually
4K or 8K) in a buffer, sliding it along one character at a time, adding
the next character at the front of the buffer and removing the character
at the end. The buffer looks something like this:

     0 1 2 3 4 5 6 || 7 8 9
     a b b a c a c || b a c

The characters on the left have already been compressed. The characters on
the right do double duty as a lookahead buffer (which can be useful, as
explained later) and as the string to compress next. The number of bytes
in the lookahead buffer is referred to as 'F', so the first N-F characters
in the buffer have already been encoded.
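To make the bookkeeping concrete, here's a rough sketch (in C) of the
state an LZ77-style compressor might keep around. The names and sizes
(WINDOW_SIZE, LOOKAHEAD, struct triple) are made up for illustration, not
taken from any particular implementation:

    /* Illustrative LZ77 state; names and sizes are arbitrary. */
    #define WINDOW_SIZE  4096     /* N: bytes in the sliding window       */
    #define LOOKAHEAD      16     /* F: bytes at the front, not yet coded */

    /* One output token: "the match starts at off, runs for len bytes,
       and the first byte that didn't match is ch". */
    struct triple {
        unsigned int  off;        /* offset of the match in the window    */
        unsigned int  len;        /* length of the match, 0..LOOKAHEAD    */
        unsigned char ch;         /* first non-matching character         */
    };

    unsigned char window[WINDOW_SIZE];  /* last N bytes seen; the final F
                                           of them are the lookahead      */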
When compressing, LZ77 searches through the buffer to find the best match.
This is just a simple string comparison, so a simple-minded approach would
roll through and compare every string in turn (i.e., compare F bytes
starting from offset 0 with the bytes in the lookahead buffer, then
compare F bytes starting from offset 1, and so on). When it finds the best
match or runs out of buffer, it outputs the buffer offset at which the
match was found and how many characters matched. After compressing a
string, we shift everything left by "len" characters, and grab another
"len" characters from the input.

It should be clear that the longest string you can match is equal to F,
the size of the lookahead buffer. (This is not a limitation of the family
of algorithms - you can do away with the lookahead buffer entirely if you
want - but it's the way LZ77 works.)

To correctly deal with the case where NOTHING matched (perhaps a character
seen for the first time), the first character that didn't match is also
output. There are ways to avoid doing this, which we will see in a later
lesson.

All the decompression routines need to do is copy "len" bytes of data at
the specified offset to the output, and tack on the extra character. This
makes extraction very fast, since it's really nothing more than a memory
copy. (I'm being somewhat imprecise here; what's important is to see that
the dictionary is nothing more than the last N-F bytes of input.)

-=*=-

The neat thing about the lookahead buffer is that you can do something
reminiscent of an old Apple II monitor memory fill technique. Suppose you
have this situation:

     1 2 3 4 || 5 6 7 8 9
     a b c x || x x x x y

The string comparisons are allowed to overlap into the lookahead buffer,
so long as at least the first character is outside. So to compress the
string "xxxxy", you would output <4, 4, 'y'>, because the match was found
at position 4, it ran for four bytes, and the first mismatch was 'y'.

When this is being extracted, the decompressor will see something like:

     <?, ?, 'x'>  <4, 4, 'y'>

So it will already have the 'x' at position 4. It then copies the 'x' from
position 4 to position 5, then copies position 5 (which is now an 'x') to
position 6, and then copies position 6 (which is also now an 'x') to
position 7. Position 7 gets copied to position 8, and finally it sticks
the 'y' into position 9.

It's not immediately obvious why this works in general; after all, we're
matching part of the string with itself! The trick is that we are matching
the LATER pieces of the string with the EARLIER pieces of the string, and
the EARLIER pieces of the string with stuff that we've already compressed
(or extracted).
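Here's a rough sketch, in C, of what the extraction step might look like
for a single triple, including the byte-at-a-time copy that makes the
overlapping case work. The names are made up, and it writes into a flat
output buffer rather than the circular window discussed below:

    /* Extract one <off, len, char> triple. "out" is the decoded data so
       far, "*outpos" is how many bytes have been written. Copying one
       byte at a time is what lets a match read bytes it just produced. */
    void extract_triple(unsigned char *out, long *outpos,
                        long off, int len, unsigned char ch)
    {
        int i;

        for (i = 0; i < len; i++) {
            out[*outpos] = out[off + i];   /* may copy a byte just written */
            (*outpos)++;
        }
        out[(*outpos)++] = ch;             /* tack on the extra character  */
    }

A real extractor would write into the circular window and reduce offsets
mod N, but the copy loop itself is the same.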
-=*=-

There are some other implementation details to consider.

First, the sliding window. Shifting characters over by 1 every time we
need more input is rather slow. The usual solution is to use a circular
buffer, and then treat all offsets mod N. For this reason, the buffer size
is usually a power of 2 (so you can do a bitwise AND to do the modulus).

Second, we could really use some auxiliary data structures to speed up the
searching. This doesn't have to be terribly complicated. One commonly used
technique is to build a binary tree with N-F entries in it (one for every
encoded string), each of which is F bytes long. This sounds like it uses a
ton of memory, but it actually doesn't, because each entry in the tree is
just an offset into the sliding window. Nodes are added and deleted as the
window shifts.

Third, we haven't said what's in the buffer at the outset. It actually
doesn't matter so long as the encoder and decoder have the same picture.
Traditionally it is filled with either spaces or zeros.

-=*=-

This is one of the more intuitive models we've seen. Variations on this
allow the string to start at any position in the file and go on for any
length, but this requires some fancy encoding techniques and may require
unbounded amounts of memory.

The neat thing about this is that it will grab whole words or even groups
of words that have been used in the recent past. The penalty for
encountering new words is slight, since many words can be broken down into
smaller words that have been seen previously. This, plus the slick
RLE-like nature of the lookahead buffer (which works for pairs or groups
of symbols, making it almost as effective as PackBytes), contributes to a
very effective model.

The main drawback to LZ77 is the time it takes to compress. This can be
reduced with fancy hashing techniques, but there's no way to avoid doing a
lot of string comparisons.

From a theoretical viewpoint, the failing of LZ77 is that it forgets about
data older than 4K or 8K. Circumventing this problem not only requires
large amounts of memory, it also implies doing string comparisons against
the entire file, so the time required increases linearly with the size of
the input. Also, the fixed maximum value of "len" gets in the way, because
it limits the maximum compression attainable to "len" divided by the size
of the <off, len, char> triple.

From a practical standpoint, the method of encoding the dictionary index
is lame. If your first five characters are "abcde", then you will have a
series of triples that look like <0, 0, 'a'>, <0, 0, 'b'>, and so on. LZSS
removes the painful waste from the encoding, and LZH and LZB take things a
step further by encoding the offset and length values more efficiently.
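Before moving on, here's a brute-force sketch in C that ties the
plain-LZ77 pieces together: walk through the already-encoded part of the
buffer, remember the longest match (which is allowed to run into the
lookahead), and let the caller emit the triple. It ignores the circular
buffer and the search tree, and all of the names are invented; it's meant
to show the shape of the loop, not to be production code:

    /* Brute-force longest-match search over a flat buffer. "pos" is where
       the lookahead starts; positions first..pos-1 have been encoded. */
    void find_match(const unsigned char *buf, long first, long pos,
                    int lookahead, long *best_off, int *best_len)
    {
        long i;
        int len;

        *best_off = 0;
        *best_len = 0;
        for (i = first; i < pos; i++) {
            len = 0;
            /* comparisons may overlap into the lookahead buffer */
            while (len < lookahead - 1 && buf[i + len] == buf[pos + len])
                len++;
            if (len > *best_len) {
                *best_len = len;
                *best_off = i;
            }
        }
        /* the caller outputs <*best_off, *best_len, buf[pos + *best_len]>
           and then advances pos by *best_len + 1 */
    }

This inner loop is where all the string comparisons go, which is why the
binary tree mentioned above is worth the trouble.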
-=O=- LZ78

Jacob Ziv and Abraham Lempel, "Compression of individual sequences via
variable-rate coding", IEEE Transactions on Information Theory, Volume 24,
Number 5, September 1978, pages 530-536.

The LZ78 family of algorithms represents a completely different approach
to the problem. It is interesting from a practical standpoint (because it
compresses quickly), and from a theoretical standpoint (because the
compression gradually becomes optimal).

Instead of treating the input as a big wad of text, LZ78 breaks the input
into "phrases", where each phrase is the longest matching previously seen
phrase plus the first character that didn't match.

That didn't make much sense. Let's try an example (I'm going to pull the
one out of Bell/Cleary/Witten so I don't blow it and confuse you worse).
Suppose you have the input string "aaabbabaabaaabab". Kinda boring, but
hey.

The LZ78 compressor gets the string one character at a time. First, it
gets 'a'. It doesn't have any phrases in its dictionary, so it says
"1 = a" and outputs <0, 'a'>. This means, "when decoding, take phrase 0
(the empty phrase), and tack on an 'a'." The decoder automatically assigns
the result of this to phrase 1, which keeps the decoder in sync with the
encoder.

Next, it gets another 'a'. It has a phrase in its dictionary which matches
it, so it goes and grabs the next character, another 'a'. It assigns
"2 = 1+a", and outputs <1, 'a'>, since we got phrase #1 followed by an
'a'.

Next character is a 'b'. We haven't seen this before, so set "3 = b" and
output <0, 'b'>.

Next up is another 'b'. Matches phrase 3, so grab the next character, an
'a'. We don't have "ba" in the dictionary, so set "4 = 3+a" and output
<3, 'a'>.

Now we're starting to roll. So far we have:

     Input:          a      aa     b      ba
     Phrase number:  1      2      3      4
     Output:         <0,a>  <1,a>  <0,b>  <3,a>

(Note that "ab" is NOT a known phrase, even though it has appeared in the
input. Thus LZ78 doesn't catch ALL substrings like LZ77 does.)

Next character is a 'b', phrase 3. Next one after that is an 'a', phrase 4
(NOT phrase 1... we're looking for phrase 3 followed by an 'a', not phrase
0 followed by an 'a'). Next is another 'a', but we haven't seen "baa"
before, so set "5 = 4+a" and output <4, 'a'>. (The decoder will know that
4 = 3+a, and that 3 = 0+b, so it will output "baa".)

Next character is a 'b' (phrase 3), followed by an 'a' (phrase 4),
followed by another 'a' (phrase 5), and then another 'a'. So we set
"6 = 5+a" and output <5, 'a'>. This means we just replaced the string
"baaa" with the pair <5, 'a'>, so our compression rate is improving.

Finally, we get a 'b' (phrase 3), an 'a' (phrase 4), and a 'b' (not seen
yet), so we set "7 = 4+b" and output <4, 'b'>. The final results look like
this:

     Input:          a      aa     b      ba     baa    baaa   bab
     Phrase number:  1      2      3      4      5      6      7
     Output:         <0,a>  <1,a>  <0,b>  <3,a>  <4,a>  <5,a>  <4,b>

A couple of things should be pointed out. First, this is cutting the input
up into little pieces, so if your next input string is "baabaaa", you will
end up with a new phrase for "baab" (5+b), whereas LZ77 would have been
able to match the whole thing. On the other hand, there is no limit on the
length of the phrases. As you can see, they slowly get longer as we
process more and more input.

There is no limit on how far back a phrase can reference. When we fill up
the memory we have available, we can simply clear the table and start
compressing again with only the null phrase (phrase 0).

From a practical standpoint, LZ78 has MUCH faster compression, because the
phrases can be represented by a "trie" data structure. Each node in the
trie represents a particular phrase. There is one child for every
character which can follow that phrase. (Note that it is NOT necessary to
have one child for every POSSIBLE character which can follow; only those
we have already seen.) The trie for the above sequence would look like
this:

               0
            a / \ b
             /   \
            /     \
           1       3
        a /     a /
         /       /
        /       /
       2       4
            a / \ b
             /   \
            /     \
           5       7
        a /
         /
        /
       6

(This is a binary tree because there were only two different characters in
the input. In practice there can be up to 256 children of a given node.
Note that we start out with just node 0, and gradually add paths and nodes
as we go. This can be done efficiently, as we will see later on in the LZW
lesson.)

When considering the input string, we start at node 0, and traverse the
appropriate branch at each step. So if we got "baab" next, we would start
at 0, take the 'b' branch to 3, take the 'a' branch to 4, take the 'a'
branch to 5, and then try to take the nonexistent 'b' branch.

When we reach a node that doesn't have a branch for the next character, we
create a new node and a branch to it, then output the number of the node
we stopped at and the character we stopped with. This enables the decoder
to reconstruct an identical tree. For the example in the preceding
paragraph, we would take the last node we went through (5) and add a 'b',
so we'd output <5,b>.

When decoding (recall that the output consists of <phrase, char> pairs),
we start at the node specified, and work our way up toward the root,
pushing characters onto a stack. When we reach the root (phrase 0), we
dump the contents of the stack, followed by the extra character.
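Here's a rough sketch, in C, of the decoding step just described. The
dictionary is kept as a pair of arrays (one parent phrase number and one
character per phrase); each <phrase, char> pair is output by walking up to
the root while pushing characters onto a stack. The array names and the
size limit are made up for the example:

    #include <stdio.h>

    /* Illustrative LZ78 decoder tables: parent[p] is the phrase that p
       was built from, suffix[p] is the character that was tacked on.
       Phrase 0 is the empty root. */
    #define MAX_PHRASES 4096

    int           parent[MAX_PHRASES];
    unsigned char suffix[MAX_PHRASES];
    int           next_phrase = 1;      /* phrase 0 is reserved */

    void decode_pair(int phrase, unsigned char ch)
    {
        unsigned char stack[MAX_PHRASES];
        int sp = 0;
        int p = phrase;

        while (p != 0) {                /* walk up, pushing characters     */
            stack[sp++] = suffix[p];
            p = parent[p];
        }
        while (sp > 0)                  /* pop them off in the right order */
            putchar(stack[--sp]);
        putchar(ch);                    /* and dump the extra character    */

        parent[next_phrase] = phrase;   /* build the same dictionary the   */
        suffix[next_phrase] = ch;       /* encoder did, one phrase per pair*/
        next_phrase++;
    }

When the table fills up, a real implementation would clear it and start
over with just the null phrase, as described above.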
(If you're Not Quite Getting It, try working through the previous example,
but this time build the tree as you go (hey, if you've got the lesson
printed out, it's a good excuse to scribble all over it). It's a lot
easier to understand if you watch the tree being built. It is VERY
important that you get LZ78 straight in your head, because LZW takes some
odd turns and can be very difficult to follow.)

-=*=-

One important consideration is how the phrase number will be encoded.
Since there is no limit on its size, we need an efficient way to encode
it. The solution is to encode it in a variable number of bits; recall that
a number up to P can be represented in floor(log2(P)) + 1 bits (which is a
fancy way of saying that you can go from 0 to 255 with 8 bits). So, you
start out with one bit. After phrase 1, you use two bits. After phrase 3,
three bits. After phrase 7, four bits, and so on. The size is
automatically expanded by both the compressor and decompressor, so there
is no confusion over how many bits to get.

The character is simply dumped into the output. This is wasteful, but is
corrected by LZW.

-=*=-

As you can see by thinking about how the tree is built, there is no limit
on the length of the phrases other than what happens when we run out of
memory. It should also be clear that each new phrase is made up of an old
phrase plus one character (which is what I was trying to say back at the
very beginning, only it made less sense then).

From the example we did, you can see that construction of the dictionary
proceeds more slowly than it did for LZ77. It doesn't contain the COMPLETE
set of strings seen so far. However, it does contain a UNIQUE set of
strings seen so far - each addition to the table is a previously seen
prefix plus one character, which combine to form an as-yet unseen string.

From a theoretical standpoint, the compression is "asymptotically optimal"
as the size of the input increases, meaning that an indefinitely long
input will be compressed down to a size very close to its entropy.
Unfortunately, most inputs are considerably shorter than infinity, and
you'd have to compress everything ever written before you got close. (Of
course, you'd probably run out of memory first...)

There's a restriction on the previous statement which says that the source
input must be "ergodic", which means that "any sequence it produces
becomes entirely representative of the source as its length grows longer
and longer." Whatever. It just means that if you have nothing but 'a's for
2 gigabytes, you're going to get a big dip in compression (and a
discontinuity in your asymptote) when you start hitting 'b's. Most kinds
of input, especially text, are ergodic, so this is a pretty weak
restriction. (Don't worry about this much; it's an interesting result, but
not a useful one.)

LZ78 is the basis of LZW, which is used in many different places
(ShrinkIt, GIF pictures, v.42bis for modems, etc.). LZW will be discussed
in depth next lesson.
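Here's a small sketch, in C, of the variable-width phrase number output
described above. The width grows on the schedule given earlier (one bit to
start, two bits after phrase 1, three after phrase 3, and so on), and the
decompressor widens its reads at the same points so the two stay in sync.
The bit-packing routine put_bits() is hypothetical and not shown, and the
variable names are mine:

    extern void put_bits(unsigned int value, int nbits);  /* hypothetical */

    int phrase_bits = 1;            /* start out with one bit             */
    int next_phrase = 1;            /* next phrase number to be assigned  */

    void output_pair(int phrase, unsigned char ch)
    {
        put_bits((unsigned int) phrase, phrase_bits);
        put_bits(ch, 8);            /* the character is simply dumped     */

        /* A new phrase was just defined. Widen the field as soon as the
           next phrase number to be assigned won't fit in the current
           width; the decompressor applies the same rule. */
        next_phrase++;
        if (next_phrase >= (1 << phrase_bits))
            phrase_bits++;
    }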
-=O=- Relation to statistical schemes

As I mentioned in passing a while back, any dictionary scheme can be
converted into an equivalent statistical scheme. There is a paper out
there somewhere which examines LZ78 from this standpoint. By assigning
probabilities to the branches of the tree and to future events, it can be
shown that LZ78 does poorly near the start of the input, when it has no
information in the tree. This corresponds to what we saw in practice: LZ78
did poorly at first, but seemed to be gathering steam as it went.

It can be interesting to try to combine something like LZW with either a
statistical model or an LZ77 model until it gets going. The results aren't
all that great, but it's something to do if you're bored.

-=O=- PROGRAM

Hey, I already said there wasn't one.

-=O=- DEEP THOUGHTS

- What happens when you run these backwards? (If you're having trouble
  picturing it, assume a full dictionary for LZ78 and a buffer full of
  stuff for LZ77. Then start feeding random bits to the decoder.) What
  effect does the extra character added after each offset/phrase have?

- I previously stated that bigger dictionaries == better compression (back
  in Lesson 3 or so). Does this hold true for LZ77? What about LZ78?
  (Hint: consider the uniqueness property of LZ78.)

- When decoding LZ78, the characters are pushed onto a stack as the tree
  is traversed. How big does the stack need to be?

-=!=-
This document is Copyright by GEnie and Andy McFadden