
Hacking Data Compression  Lesson 8  LZW
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie

This week we delve into the famous LZW algorithm.  It probably
would've made more sense to deal with LZ77 variants first, but this is
one of the more interesting algorithms and I figured everyone would
want a chance to look at it early on.  We'll catch up with LZSS in
lesson 10.

The week's lesson picks up where we left off after the discussion of
LZ78 in lesson 7, so it would probably be a good idea to go back and
read that part before you begin this one.

REMINDER: LZW is patented by IBM and Unisys.  Don't even THINK about
using it in something you make money off of without asking for a
license first.

-=O=-     Difference from LZ78

Welch, Terry A., "A technique for high-performance data compression,"
IEEE Computer, Volume 17, #6, pages 8-19, June 1984.

LZ78 outputs two symbols each time, the longest matching phrase and
the first character that didn't match.  This makes it easy to rebuild
the dictionary while decoding; you just go to the tree entry for that
phrase and add a new branch corresponding to that character.

However, adding the extra 8 bits of data onto every phrase increases
the size of the output.  LZW was designed to eliminate the extra
character.

The important change is that only phrase pointers are output.  This
means it's no longer sufficient to start with only the null phrase in
the dictionary as in LZ78 (chicken and egg problem).  Instead, phrases
$00 to $ff represent the set of single-character phrases.

An important side effect is that instead of sandwiching the first
character that didn't match between two phrases, we include it as the
first part of the NEXT phrase.  This causes some interesting problems
when decoding because it's possible to encounter a phrase which is
still missing the last character.


Time for an example.  I'm going to use the string from page 227 of
_Text Compression_, namely "aabababaaa".

We start off with one entry in the table for each input character.
This means that we could start with just 0=a and 1=b in the table and
add new ones at 2, but I found this more confusing than helpful.  I'm
going to start the new phrase pointers at $101 ($00-$ff are used for
the characters, and $100 can have a special meaning as we will see
later).

So, we start processing the input.  We have phrase 0="a" and phrase
1="b".

First we get an 'a', which happens to be in the dictionary.  So, we
get the next character, another 'a'.  Since "aa" isn't in the
dictionary, we set $101="aa" and output the prefix for 'a' (0).

Here's where things get different.  With LZ78, we would've output the
0 followed by the second 'a'.  With LZW, we start the NEXT phrase with
the LAST character, so we just output the 0 and hang onto the 'a' for
the next phrase.

Lo and behold, the phrase "a" is in the dictionary.  Next character is
a 'b', which isn't in there, so set $102="ab" and output the prefix
for 'a' (0).

Next we start off with the 'b' we just read.  Next character is an
'a', so set $103="ba" and output the prefix for 'b' (1).

Start off with 'a', next character is a 'b', which is phrase $103.
Next character is an 'a', so set $104="aba" and output $103.

Start with the 'a', find a 'b' (phrase $103), find another 'a' (phrase
$104), and then another 'a'.  Set $105="abaa" and output $104.

Finally, we use the previous 'a', read an 'a' (phrase $101), and then
hit the end of input.  So we output $101, and are done.

Here's a table to summarize (same as in _Text Compression_):

Input:     a         a         b         ab        aba       aa
Output:   $00       $00       $01       $102      $104      $101

The phrase list looks like this:

     Phrase #       Phrase         How derived
     --------       ------         -----------
       $00            a             (initial)
       $01            b             (initial)

       $101           aa            $00  + 'a'
       $102           ab            $00  + 'b'
       $103           ba            $01  + 'a'
       $104           aba           $102 + 'a'
       $105           abaa          $104 + 'a'

If you look carefully at LZ78 vs LZW, you will see that there is a
greater amount of overlap between phrases, and therefore the phrase
table grows more quickly.  There are LZ78 variants which actually
combine the last two phrases to form new phrases; this causes the
phrase table to expand much more quickly.


Now let's decode the values we output.  Again, our dictionary has
0="a" and 1="b" in it.  The input stream will be:

$00       $00       $01       $102      $104      $101

The first thing we get is $00, so we output an 'a' and set $101="a"
plus whatever comes next (I'm going to write $101="a+").

Next up is $00, so we output another 'a', set $101="aa", and set
$102="a+".

Next we get $01, so we output a 'b', set $102="ab", and $103="b+".

Next is $102.  We output "ab", set $103="ba", and set $104="ab+".

Now we get $104.  Currently, $104 is "ab+" - we haven't figured out
what the last character is yet!  However, we know that the FIRST
character of the next phrase is equal to the LAST character of the
previous phrase; since both of those are phrase $104, we know that the
last character of phrase $104 must be the same as the first character
of $104.  Therefore, we set $104="aba", output "aba", and set
$105="aba+".

Finally we get $101, so we output "aa", set $105="abaa", and set
$106="a+".  Now we're out of input, and are done.

This funny situation is referred to as the "KwKwK" case, because it
happens whenever the input follows that pattern (where K is a
character and w is a string of characters).  This arises because of
the overlapping character in the middle and the "delayed" final
character.

-=O=-     Other Considerations

The first issue is what to do when the phrase table gets full.  We
can't just grow the tree without bound unless we've got unlimited
amounts of memory available.  There are two basic approaches.

(1) Stop adapting.  After the table is full, compression becomes
static.  This is fast and works reasonably well if the first part of
the input is representative of the remainder of the input, but does
poorly otherwise.

(2) Clear the table.  When the table fills, completely erase it, and
start over from $101.

There are a number of variations on (2).  You could remove the oldest
phrases or the least used phrases, but this requires fancy data
structures and slows things down a lot.  Recent versions of the UNIX
"compress" command combine the two: compression becomes static when
the table fills, but the table is cleared if compression performance
starts to degrade.

The sample code for this lesson uses phrase $100 as an explicit "clear
table" code (this is the "special use" mentioned earlier).  It's not
strictly necessary, but it makes the code easier to understand and
debug.  I also use two consecutive clear table codes as an end-of-file
marker.


As I mentioned in the previous lesson, you can get better results by
encoding the phrase pointer in a variable number of bits.  Since we
already have 257 entries in our dictionary, we start with codes that
are 9 bits wide.

The size of the phrase table and the maximum width of the codes are
directly related.  Programs like ShrinkIt limit the codes to 12 bits,
which means that there are at most 4096 entries in the phrase table
(which includes the first 256 single-character entries and the $100
clear code).  This is done for practical reasons, because the data
structures holding the phrases are 5 or 6 bytes per entry, or 20-24K
(quite a bit of space on an Apple ][+ or //e).

Some programs have an explicit "change bit width" code (e.g. Nelson's
_The Data Compression Book_).  This is slightly silly; it's obvious
that $3ff will fit in 10 bits while $400 will not.  You may be able to
save a few cycles by hardcoding some indices, however.

Of course, if speed is your primary goal, then it would make sense to
just do away with the variable bit widths altogether.  Using
fixed-width 12-bit output codes hurts compression but makes the
compressed data easier to encode and decode.

No program that I'm aware of has pointers wider than 16 bits (the
default for UNIX compress).  After a certain point you see diminishing
returns.  Also, a file that fills up a phrase table with 64K entries
would have to be pretty large, so the extra table space needed would
be a waste for most things.

-=O=-     Uses

LZW is remarkably fast for an adaptive scheme (easily the fastest
adaptive scheme in this course), and beats most of the others as far
as compression performance goes.  As we will soon see, the algorithm
is very straightforward, making it ideal for a pure hardware
implementation.

As most of you should already be aware, LZW is the key to ShrinkIt and
GS/ShrinkIt (the former clears the table after every 4K of *input*;
the latter uses $100 to clear it when the table fills; both have an
RLE pass immediately preceding the LZW).  It's used in UNIX compress,
which is used for just about everything.

It's also part of the GIF (Graphics Interchange Format) standard.  I
don't know if the authors of commercial GIF converters have to pay a
fee or not; feel free to ask them.

One interesting application is v.42bis, the data compression standard
for high-speed modems.  They used a cute trick where they send a
special code to toggle the compression off when it fails to make the
data smaller.  Both sides continue to update the LZW trees, so that if
compression improves it can be turned back on.  This way, you get the
benefits of compression without having to worry about the transmission
slowing down if the data being sent is already compressed.

(Nelson's book has an interesting note: there was some debate over
whether or not LZW was suitable due to the patent restrictions, so
Unisys decided to charge a one-time fee of $25,000 per manufacturer
for licensing.  This was enough to protect Unisys's patent and
(apparently) make everybody happy.)

And, of course, it's part of HardPressed (obligatory self-promotion).

-=O=-     PROGRAM

As usual, there are C and assembly implementations of a simple LZW
compressor.

I wrote it to be easy to read, and to be easy to compare the C and
assembly versions.  There are a number of optimizations which could be
made, but those are left as an exercise for the reader.

When you get right down to it, there really isn't all that much code
here.  Part of it is the relative simplicity of the LZW algorithm, and
part is an elegant implementation.  If you compare the C and assembly
versions, you may find that it's actually easier to understand the
assembly code, because LZW translates well to 65816.


The encoder and decoder are strikingly different.  First, the encoder.

LZW requires a tree with nodes that can have up to 256 children.  That
would require 4096*256 entries to represent entirely, which is
unacceptable.  We could use a dynamic tree structure, but that
requires either space for 256 child pointers per node, or an expensive
linear search.

The usual solution is to use a hash table, finding entries based on
the prefix and character, and probing after a hash collision.  We will
have at most 4096-257 entries in our tree (12-bit codes, minus the 256
single-character phrases and the clear-table code), so we can use any
prime number larger than 3839.  To keep collisions to a minimum, we
use 5119.

Each entry in the hash table has three values: "entry" (the value from
$101 to $fff), "prefix" (the prefix represented by the parent node),
and "chr" (the character we added to get here).  This gives us a
(roughly) 25K hash table.  To conserve memory, we can overlay this
with the tables used for decoding (the assembly version does, the C
version doesn't).

The assembly version uses three different arrays so that the same
index register can be used to access each.  The C version uses an
array of structures instead.


The first thing the encoder does is clear out the table by setting the
"entry" fields to zero.  This marks the hash table entries as empty.

The outer loop grabs a character, and falls into the main loop.  This
character becomes the initial prefix.  Because of the way the
implementation works, in a sense we're always running one step ahead
of what we're working on... we won't output the prefix until after
we've located the next.  This is because we need to store
(prefix,character) in the tree at the node BELOW the node for "prefix"
for fast searches.  (Did that make sense?  From the node for the
prefix, you take the path for that character to a child node.  That
child node gets (prefix,character) stored in it.)

Once we have a prefix, we fall into the "input loop", which keeps
grabbing characters until we find an empty spot in the tree or run out
of input.  When we find a blank spot in the tree, we create the new
node (by filling in the entry in the hash table) and branch back to
the main loop, with the prefix equal to the last character we found.

The hash function is rather simple, but effective.  The secondary
probe is also nothing fancy; it just walks backward through the table.

If we determine that the table has filled, we output the last
character read (which would have become the prefix for the next trip
around), and then output $100.  Then we clear the table out, reset
entry to $101, and branch back to the beginning.

When we reach the end of the file, we want to make sure that there are
two consecutive $100s at the end of the output.  If we just output a
$100, however, then the put_code call after "eof_hit" will output
another one (take a look at the code to see what I mean).  Outputting
two more would make a total of four, which is wasteful, so we check
for that and avoid it.  Take a careful look at how "entry" is adjusted
and what values are used for it when calling put_code(); if you don't
get these right, the encoder and decoder can become out of sync.


The put_code() (or PutCode) routine deserves little attention.  It's
really nothing fancy; just outputs a variable-width code from 9 to 12
bits.  It's similar to the routines we've seen before, though we can
get away with a minor optimization: since the codes are always at
least 9 bits, we know that we can output at least one byte every time.

You could replace the routine with one that uses only fixed-width
12-bit codes, and get a slight increase in speed with a slight loss in
compression.  The speed increase is actually smaller than you might
expect... the main routine spends a considerable amount of time
running around probing through the hash table.


The decoder needs to reconstruct an identical tree, but doesn't have
the searching problem that the encoder does.  The input consists of
phrase pointers ranging from $00 to $fff, which can be used as indices
into an array.  Thus the decoder data structure acts as both a simple
array and a 256-ary tree.

The decoder table entries have two fields, "prefix" (a pointer to the
node's parent) and "chr" (the character added by this prefix).  So
"ab" might be represented by node $104, which contains ($61,'b'),
meaning that you should output a 'b' after prefix $61 (which is simply
the letter 'a').

As mentioned earlier, we are going to walk up the tree to grab the
characters, but need to output them as if we were walking down.  The
example in the above paragraph illustrates the need to stuff the
characters onto a stack: we want to output 'b' after everything in
prefix $61.  A fully degenerate tree, which you would get by
compressing a file full of identical characters, could be 4096-256
= 3840 bytes tall, so the stack must be at least that large.  This is
more space than we want to grab from bank 0, so we use a software
stack instead.


We start off by initializing all of the various counters.  There's no
need to initialize the table, since the choice of which entry to use
is driven by the encoded output.  For this reason, it's possible to
have more than one compressed representation of a file.  For example,
the encoder could send table clears at odd times, or use only part of
a phrase (neither of which is likely to be helpful).

If "entry" is $101, then we either just got a table clear code or are
just beginning.  So, we grab a character to start things off.  Since
we know that the first code is always < $100 (unless it's a table
clear), we just output it immediately.

Each pass through the loop, we get another phrase pointer with
get_code() (or GetCode).  If it's a table clear, we reset the indices
and go back to the start of the loop.  If it's larger than the next
available entry in the table, then something went wrong.  If it's
exactly equal to the next available entry, then we've encountered the
KwKwK case, so we push on the last character from the previous pass
through the loop (this is equivalent to filling in the "+" in the
example above).

With that out of the way, we begin walking up the tree, pushing
characters on as we go.  We know we've reached the top when the
pointer becomes less than $100.  Instead of pushing the last pointer
on and then popping it off immediately, we just output it.  (Note that
we're outputting a pointer as if it were a character.  In theory, this
is wrong.  However, the low 8 bits of the pointer are equal to the
character, and the high 1-4 bits are always zero, so it works out.)

After we've output the entire stack, we add the prefix/character
combination to the table.  Actually, we're adding the PREVIOUS prefix,
because we didn't know what the last character was until this time
through the loop.  The CURRENT prefix isn't added until later.  Thus,
we're still running one step ahead of ourselves, which can be
confusing.

The nice thing about the decoder is that nothing special needs to be
done at EOF.  When we see two consecutive table clears, we can just
stop.


The get_code() (or GetCode) routine is also rather unremarkable.  It
extracts variable-width codes from 9 to 12 bits in width.  Since we're
running ahead of ourselves, we have to use entry+1 instead of entry.

Beyond that, it's just a matter of getting all the bytes needed,
shifting them into place, and then ANDing off the unnecessary stuff.


That's about it for the code.  It compares favorably to ShrinkIt in
terms of speed, but remember that ShrinkIt also has to do a CRC and an
RLE pass.  Not to mention that P8 ShrinkIt is written in 65c02, so all
the nice 16-bit stuff we were doing is much harder.

If you want a better understanding of the code, try stepping through
it with the examples we did earlier in the lesson.

With some optimizing this can be made pretty fast.  Have at it.


- Is it possible to pre-load the phrase table with things like "if",
"and", "the", "that", and so on?  What effect would it have?

- Suppose you wanted to remove phrases from the table on an LRU (Least
Recently Used) basis.  How would you communicate this to the decoder?
What would have to change in both the encoder and decoder?

- Why doesn't the decoder have a problem with KwJwK or KwKwJ (where K
and J are different characters)?

- Is it possible to remove the stack from the decoder?

- Does the code work correctly for an empty file?


This document is Copyright by Genie and Andy McFadden