Hacking Data Compression Lesson 10 LZSS By Andy McFadden GEnie A2Pro Apple II University - Copyright (C) 1993 GEnie Not much theory in this lesson. Most of what you need to know was covered in the description of LZ77 back in lesson 7. The sample code can be a bit daunting at first, so that will be covered in detail. Next week's lesson covers three variations on LZSS, namely LZH, LZARI, and LZB. These three represent the current state of the art for general-purpose file archivers. You can usually do better with a statistical scheme (which weren't discussed in depth in this course), but those can be tens or hundreds of times slower. [ NOTE: it's well past the time to put the lesson up and the assembly code still doesn't work. It will have to wait for another time. An executable built from the C code has been provided. ] -=O=- Overview Storer, J.A. and Szymanski, T.G., "Data compression via textual substitution", Journal of the Association for Computing Machinery, vol 29, issue 4, October 1982, pages 928-951. LZSS does for LZ77 what LZW does for LZ78: removes the extra character. (If you don't remember how LZ77 works, go back and read the first part of lesson 7!) LZW's way of doing this - outputting nothing but pointers - is not practical for LZ77. Recall that the LZ77 dictionary is the last few thousand bytes of input (typically 4K or 8K). There is no way to guarantee that all 256 characters appeared recently, so that approach won't work. Instead, LZSS (named after the two authors of the paper) uses a free mix of pointers and characters. They are differentiated by sending an extra bit (1 for characters, 0 for pointers) with each. Typically, the encoder will send 8 of the flag bits in a single byte, followed by eight byte-aligned pointers or characters; this avoids costly bit shifting, making the encoder and decoder significantly faster. That's basically it. -=*=- Let's talk tree. Recall from lesson 7 that LZ77 tries to match an input string against every position in the sliding window (usually implemented as a ring buffer). Doing a few thousand string comparisons is not a very efficient way to do things. A common solution - but not the only one - is to build sorted binary trees with the strings in them. If we define the maximum size of a string as F (where F is a constant, not hex for 15), then we don't need end-of-string markers, and can define a string as simply a position in the ring buffer. Suppose the ring buffer holds "abcdefghikjlmnopqrstuvwzy". String #0 is "abcdefghijklmno", string #1 is "bcdefghijklmnop", and so on. There is a blank spot for the F entries occupied by the look-ahead buffer, since strings can't start inside the buffer. However, they CAN wrap around the ends of the ring buffer (by definition of "ring" buffer). While we're here, let's define a parameter for the size of the ring buffer, call it N. N is usually a power of 2 between 4096 and 32768. The number of strings we will compare against is therefore N-F. Now, we could make a great big tree with N-F strings in it, but that would leave us doing roughly log(N) comparisons... assuming the tree was balanced, which it won't be. An easy way to reduce the maximum height is to use 256 different trees, one for every possible initial character. Some implementations will use 4096 trees, choosing them by hashing on the first three characters. The latter is usually more efficient, because it avoids clusters for things like English text (with a small set of initial characters), but uses more memory. (There are many memory vs. speed trade-offs possible here.) We will add nodes to the trees as characters slide in, and delete them as the slide out (when the first character of the string gets stepped on by the lookahead buffer). Most of the yucky-looking stuff in the code is just insert/delete code for binary trees. -=O=- PROGRAM The C sources which *everybody* uses were written by Haruhiko Okumura, and can be found as "lzss.c" in the "samples10" directory. The code in the "samples10/C" directory had some minor modifications made to make it fit the standard class "shell". Incidentally, the fact that the original LZSS code and the class code take the same arguments on the command line is not coincidental. The code uses N=4096 and F=18. This allows the pointer's position to be represented in 12 bits (0-4095), and the string length to fit in the remaining 4 (3-18, offset by 3). Since the <pointer,length> pair occupies 16 bits, matches of length 1 or 2 are sent as individual characters, and match lengths can begin at 3. Some improvement in compression can be gained by using N=8192, but this requires using F=10, matches only on even boundaries, or the loss of the ability to store bytes on byte boundaries. Being able to avoid shifting bits into position speeds things up greatly, especially when extracting. -=*=- The tree is represented with three arrays, which give the left son, right son, and parent for every node. There is no "node" array; the ring buffer ("text_buf") provides that. Note that the right child has 256 more entries than lson and dad. These 256 entries represent the root nodes of the 256 binary trees. The tree initialization routine (InitTree) does two things: initializes the 256 root rson entries to NIL, and then clears all of the parent entries to NIL. A NIL dad pointer is used to indicate that the node is not part of the tree. -=*=- Starting with Encode()... First thing is to initialize the trees with InitTree(). Next is to set code_buf[0] to zero. code_buf is used to hold the output until we get eight items, where an item is a pointer or a character. Once we get the whole batch, we dump it all at once. code_buf[0] has the flag bits for the items, and code_buf[1] through code_buf[16] hold the actual data. code_buf_ptr holds our current position in code_buf; it advances 1 byte for characters and 2 for pointers. Next we set "mask" to 1. mask serves two roles: it is used to logically OR a bit into code_buf[0], and it tells us when code_buf is full. Since mask is a single byte, and we initialized it to 1, after 8 left shifts it will become zero (clever, no?) Next we initialize "s" and "r" (love those one-character variables). "s" is the leading edge of the lookahead buffer, and therefore points to the string which is about to be deleted. "r" is the trailing edge, and therefore points to the string which is about to be added (or was just added). We start with s=0 and r=N-F, which is a little weird since it means we will be starting at the "end" of the buffer. Next step is to initialize the ring buffer. It doesn't really matter what we fill it with so long as both encoder and decoder use the same values. Usually spaces or zeros are used. Now we're ready to read some characters in. The idea is to read F characters into the buffer, starting at position r and not quite reaching s. We have to be careful though, because there might not be F characters in the file. "len" holds the actual number of characters we read. (In general, "len" represents the number of characters left in the file, but is at most F... so, when there are more than F characters left, "len" is ignored. It makes more sense if you read the code.) Now we need to insert the strings into the tree. Since the decoder can overlap into the look-ahead buffer (see lesson 7 if you don't remember), we are going to insert F different strings into the tree. The first is a space followed by 17 characters, then two spaces followed by 16 characters, and so on. The source code claims that inserting them with a gradually increasing #of spaces reduces the chances of forming degenerate trees; I'm not entirely convinced this is the case, but what the heck. Finally, we insert the full string we have read. -=*=- One small item I have neglected to mention up until now is that the InsertNode() routine does more than just insert the node in the proper position in the binary tree. It turns out that searching for string matches and searching for the node's position are the same operation, so we do both at once. Two variables, "match_length" and "match_position", are set by InsertNode. These tell us how long the match was and where it occurred. That last InsertNode(r), which added all F characters we have read from the input, did this: (1) search for a match. Unless it began with a space, it didn't find any. On the other hand, if it was 18 spaces, then it would've matched all 18 characters. It appears at first that it would be impossible to extract this - the decoder has no knowledge of the contents of the file - but keep in mind that the decoder also starts out with a buffer full of spaces, and don't forget the "memory move" fill technique mentioned in lesson 7. (2) insert the string. This amounts to adjusting lson, rson, and dad appropriately (exact details discussed later). Since the string is added to the tree after the search, there is no possibility of the string matching against itself. With the initial match_length and match_position in hand, we fall into the main loop. -=*=- First thing to do is to make sure match_length isn't huge. Remember than "len" is either F or the number of characters left in the file, so it's impossible to match more than len characters. However, InsertNode may get confused at the end of the file, because the last few characters in the lookahead buffer aren't valid. Next comes the important part in the life of a pointer: determining whether the match_length is large enough. It's compared against a constant called THRESHOLD, defined as 2, representing the size of a pointer. If the match_length is <= THRESHOLD, then we gain nothing by using a pointer. In fact, storing a match of length 2 may actually hurt us, because the second character could be the start of a really good match. (In fact, we don't really know that accepting a match of length 3, 4, ... is the BEST choice. We're using "greedy" parsing, which means that we grab the first thing that looks good.) For characters, we just stuff the character into code_buf, increment code_buf_ptr, and OR "mask" into code_buf[0]. We set match_length to 1 to indicate that we have used one byte of input. For pointers, we encode the position and length into a single 16-bit word. It's done in sort of a weird way: the low byte is the low byte of the position, but the low nybble of the high byte is the length, and the high nybble of the high byte is the high nybble of the offset. (Got all that?) How about a picture (hi bit on the left): Length = llll (4 bits) Position = ppppqqqqrrrr (12 bits) Pointer = ppppllllqqqqrrrr (16 bits) It's a little faster in assembly to do llllppppqqqqrrrr because we're using 16-bit operations instead of 8-bit, but not by much. Once we output the appropriate thang, we shift "mask" left one bit and check to see if it becomes zero. If it does, we can dump the whole buffer full of stuff, and reset code_buf[0], code_buf_ptr, and mask. -=*=- Now we need to add match_length new characters into the lookahead buffer, since we just consumed them. We count down match_length, doing the following: (1) read a new character. We won't need it for a little bit, but we want to grab it now to test for end of file. (2) delete the string at position s. Recall that this is the leading edge of the lookahead buffer. (3) stuff the new character into the ring buffer at position s. (4) if the lookahead buffer wraps around the end of the ring buffer, we also stuff the new character in past the end of the ring. This makes string comparisons faster because we don't have to have them wrap around the ring. (5) s and r are incremented by one, modulo N. N doesn't have to be a power of 2, but if it isn't then the modulo operation becomes costly. (6) insert the node at position r. Recall that this is the trailing edge of the lookahead buffer, and therefore the string we just inserted is the F characters found there. The Fth character is the one we just read. Keep in mind that s and r were incremented between placing the character at s and inserting the string at r... the order of operations is important. This is repeated until the lookahead buffer is full again. -=*=- At the very bottom of the main loop is a quaint little while loop which only kicks in at the end of the file (i+1 will always be equal to last_match_length unless EOF was encountered). This routine does the required InsertNode() and DeleteNode() calls without attempting to get any more characters. The result is that the last part of the lookahead buffer will be invalid, so "len" is decremented at every step. If we do all the node shifting needed but haven't reached the end of the input, we go around again. When len finally bottoms out, the main loop exits. Below the main loop is a small code fragment which just outputs any pending data left in code_buf. Since all of the data is byte-aligned, there is no need to send the file's length or a distinguished stop symbol (in fact, we *can't* send a distinguished stop symbol without doing something like defining offset 4095 as EOF, and then avoiding it like the plague). The decoder can just stop when it reaches the end of the file. -=*=- Now, back to InsertNode and DeleteNode... These aren't as bad as they look. InsertNode starts off by setting "cmp=1". This ensures that the search loop follows the right child first, which is a good thing to do since the 256 tree roots only have right kids. Next it sets key = &text_buf[r], which is just a pointer into the string buffer. The string pointed to by "key" is the string we're adding; it's just done so we don't have to keep typing "text_buf[r]" all the time (it's also faster). Next comes p = N + 1 + key[0]. p is a pointer to the tree root; rson[p] is the first node in the tree. Remember that we pick from the 256 trees based on the first character of the string... key[0]. Now we set rson[r] = lson[r] = NIL. Well, we're inserting node "r" into the tree, so it can't have any kids right now, and it makes it easier to clear out the pointers ahead of time. Finally match_length = 0. The reason for this should be obvious: we haven't matched anything yet. Now we drop into the body of the loop. "cmp" will be the result of a strcmp() operation; it is 0 for equal, 1 if one string is larger than the other, -1 if the other string is larger. Since we forced it to 1 up front, we will search down the right child. The current node is "p", which right now is the root of the tree. If the right child of the current node (rson[p]) is NIL, we've hit the end of the tree, and can just insert node r right there. We set rson[p]=r, dad[r]=p, and we're done. If not, we walk down the tree by setting p=rson[p]. An identical set of operations is performed if we went down the left branch instead (the "else" clause). Now comes the string comparison. We know that the first character matched, or we wouldn't be in this particular tree. So we compare from position 1 to position F-1. If the characters at some position don't match, we break out with cmp set positive or negative depending on which came first ASCII-alphabetically. "i" will have the number of characters which matched. If "i" is larger than the previous match length, set match_position to the current node (p). If the match_length is equal to F, then we aren't going to find a better match anywhere, so we break out of the loop. This repeats until we walk off the end of a tree or we match all F characters. If we walk off the end, we add the node and exit. Otherwise, we keep going left or right depending on the outcome of the string comparison. (This is why degenerate trees are bad... you keep going, and going, and going...) If we matched all 18 characters, we still have a little bit of work to do. If the file were full of an 18-character string repeating over and over, we would eventually have a VERY unbalanced binary tree with N-F copies of that string. Which is fine if you're looking for that particular one, but if you want to find OTHER strings with the same first character then you get to search through the whole degenerate list. To avoid this, we only store one copy of the string in the tree. Since the one currently in the tree will be deleted sooner than the one we're adding now, we just replace the old one with the new one, and set the pointers for the old one to NIL. The really icky-looking stuff below the main loop in InsertNode does just that, and nothing more. Remember that "r" is the node we're adding, and "p" is the current node; both point to identical strings (in different places in the buffer). All it does is set r's dad to p's dad, r's lson/rson to p's lson/rson, then tells p's kids that r is now their dad, and finally tells p's parent that r is now its child (the if/else is to determine if p was the left child or the right child). It looks ugly, but it's nothing special. Draw a small tree and try it. Finally, dad[p] is set to NIL, which marks it as unused, so that DeleteNode won't try to remove it. -=*=- DeleteNode is equally icky-looking, but it's all pointer manipulation. Its job is to remove node p from the tree. First we check to see if dad[p] is NIL; if so, it's not in the tree, and we can exit immediately. Now we try to find a successor for p, named q. If p has only one child, then that child will become q. (If rson[p] is NIL, then set q to the left child. If lson[p] is NIL, then set q to the right child. It's okay for q to be equal to NIL, because we have defined NIL = N. This is why the array declarations use N+1 instead of N.) If neither child is NIL, we have some work to do. We need to replace p with a node that is larger than p's left child but smaller than p's right child. Since this is an ordered binary tree, all nodes on the left are smaller than the current node, and all nodes on the right are larger. Therefore, the node which falls between the left and right children of p is the rightmost descendant of the left child. (It would be equally valid to use the leftmost descendant of the right child.) The icky part starts out by assigning q = lson[p]. If q doesn't have a right son, we don't have to search any further. If it does, we walk q down the right pointers until we reach a node with no right child. There, we remove the node from the tree by replacing the pointer to it with a pointer to its left child (which may be NIL, or a large subtree... all we know is that it has no RIGHT descendants). That done, we set the pointers so that it becomes the parent of lson[p] (remember him?) This node becomes q. Now then, we're back to the place where p has two kids, but now we know that the right child of lson[p] is NIL (either it started out that way, or all the icky stuff we just did stuffed in a node with no right child). Now we can just set the right child of q to the tree which was formerly on the right side of p. With all that done, we know that p points to the original node and q points to the replacement, and that p's kids are being pointed to by somebody. Now we just remove p from existence by setting q's dad to p's dad, and pointing p's parent down at q (same old "if" for left/right child). The coup de grace is dad[p] = NIL, making it officially gone. I *strongly* recommend you draw a binary tree and remove a node. ASCII graphics would not come out well, so I'm not even going to attempt it. Use a tree which looks like: A / \ / \ B p <-- remove 'p' / \ / \ v w / \ / \ x y / / z Just walk through the routine and watch what it does. Replace v, w, and y with NIL and see how things change. You should end up with y replacing p, and z as a child of v. This is similar to the deletion routine in Sedgewick's _Algorithms in C_, though that one uses pointers instead of array indices. -=*=- Somewhere along the way, we might want to unpack the file. Extraction is a truly trivial operation. The input is a stream of pointers and characters; the characters are just copied out, and the pointers just give an offset and range to copy from the ring buffer. The only reason this is more expensive than a simple copy operation is that we have to update the ring buffer while we work. Decode() starts off by initializing the array to spaces. r is again set to N-F, but this time there is no "s". We set "flags" to zero to indicate that the next byte is a flag byte. The main loop starts off by shifting "flags" to the right. flags will be initialized to $ffxx, where "xx" is the byte read. That way, after eight logical right shifts the high byte will be zero (kinda like the trick used in the encoder). If flags is zero, we get a character from the input, and initialize flags with it. If we hit EOF, we are done, and can exit. Hitting EOF anywhere else is a bad sign, but we don't check for it in this implementation. The flag for the next item on the input stream is in bit 0 (the least significant bit, unless you spend too much time around IBM mainframes). If it's a 1, we read one character from the input, send it to the output, and add it to text_buf at position r. r is incremented, modulo N. If it's a zero, we read two characters, and decode them into <position,length> values. We then copy the characters from that range to the output and to text_buf at position r, incrementing r each time. That's it. We keep doing this until we run out of input. It is perfectly valid for <position> to be equal to r+1, which allows us to do the "memory move" fill stuff. -=*=- And there you have it. A very long description of what is really a very simple algorithm. It can be tricky to get all the details just right - having stray tree pointers or not getting enough characters can be tough problems to find - but it's not doing anything really hideous, despite the code's appearance. -=O=- DEEP THOUGHTS - We used greedy parsing, which can cause us to miss long matches which begin in the middle of short string. One modification is to use "lazy parsing", which is greedy, but doesn't output a pointer until it is sure that the current match is longer than the match found at the NEXT position as well. What would need to change to implement this? - Lazy parsing is one step closer to "optimal parsing", in which the best possible combination of pointers and characters is found. How hard would this be to implement? - Having a larger ring buffer and longer maximum match length will improve compression... usually. They will also increase the size of the <offset,length> pointer, which means that shorter matches will occupy more space because the bit widths are larger. Must they? -=!=-
This document is Copyright by Genie and Andy McFadden