[fadden-logo]

Back to index


Hacking Data Compression  Lesson 10  LZSS
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1993 GEnie


Not much theory in this lesson.  Most of what you need to know was
covered in the description of LZ77 back in lesson 7.  The sample code
can be a bit daunting at first, so that will be covered in detail.

Next week's lesson covers three variations on LZSS, namely LZH, LZARI,
and LZB.  These three represent the current state of the art for
general-purpose file archivers.  You can usually do better with a
statistical scheme (which weren't discussed in depth in this course),
but those can be tens or hundreds of times slower.

[ NOTE: it's well past the time to put the lesson up and the assembly
code still doesn't work.  It will have to wait for another time.  An
executable built from the C code has been provided. ]

-=O=-     Overview

Storer, J.A. and Szymanski, T.G., "Data compression via textual
substitution", Journal of the Association for Computing Machinery, vol
29, issue 4, October 1982, pages 928-951.


LZSS does for LZ77 what LZW does for LZ78: removes the extra
character.  (If you don't remember how LZ77 works, go back and read
the first part of lesson 7!)

LZW's way of doing this - outputting nothing but pointers - is not
practical for LZ77.  Recall that the LZ77 dictionary is the last few
thousand bytes of input (typically 4K or 8K).  There is no way to
guarantee that all 256 characters appeared recently, so that approach
won't work.

Instead, LZSS (named after the two authors of the paper) uses a free
mix of pointers and characters.  They are differentiated by sending an
extra bit (1 for characters, 0 for pointers) with each.  Typically,
the encoder will send 8 of the flag bits in a single byte, followed by
eight byte-aligned pointers or characters; this avoids costly bit
shifting, making the encoder and decoder significantly faster.

That's basically it.

-=*=-

Let's talk tree.

Recall from lesson 7 that LZ77 tries to match an input string against
every position in the sliding window (usually implemented as a ring
buffer).  Doing a few thousand string comparisons is not a very
efficient way to do things.

A common solution - but not the only one - is to build sorted binary
trees with the strings in them.  If we define the maximum size of a
string as F (where F is a constant, not hex for 15), then we don't
need end-of-string markers, and can define a string as simply a
position in the ring buffer.

Suppose the ring buffer holds "abcdefghikjlmnopqrstuvwzy".  String #0
is "abcdefghijklmno", string #1 is "bcdefghijklmnop", and so on.
There is a blank spot for the F entries occupied by the look-ahead
buffer, since strings can't start inside the buffer.  However, they
CAN wrap around the ends of the ring buffer (by definition of "ring"
buffer).

While we're here, let's define a parameter for the size of the ring
buffer, call it N.  N is usually a power of 2 between 4096 and 32768.
The number of strings we will compare against is therefore N-F.

Now, we could make a great big tree with N-F strings in it, but that
would leave us doing roughly log(N) comparisons... assuming the tree
was balanced, which it won't be.  An easy way to reduce the maximum
height is to use 256 different trees, one for every possible initial
character.  Some implementations will use 4096 trees, choosing them by
hashing on the first three characters.  The latter is usually more
efficient, because it avoids clusters for things like English text
(with a small set of initial characters), but uses more memory.
(There are many memory vs. speed trade-offs possible here.)

We will add nodes to the trees as characters slide in, and delete them
as the slide out (when the first character of the string gets stepped
on by the lookahead buffer).  Most of the yucky-looking stuff in the
code is just insert/delete code for binary trees.

-=O=-     PROGRAM

The C sources which *everybody* uses were written by Haruhiko Okumura,
and can be found as "lzss.c" in the "samples10" directory.  The code
in the "samples10/C" directory had some minor modifications made to
make it fit the standard class "shell".  Incidentally, the fact that
the original LZSS code and the class code take the same arguments on
the command line is not coincidental.

The code uses N=4096 and F=18.  This allows the pointer's position to
be represented in 12 bits (0-4095), and the string length to fit in
the remaining 4 (3-18, offset by 3).  Since the <pointer,length> pair
occupies 16 bits, matches of length 1 or 2 are sent as individual
characters, and match lengths can begin at 3.

Some improvement in compression can be gained by using N=8192, but
this requires using F=10, matches only on even boundaries, or the loss
of the ability to store bytes on byte boundaries.  Being able to avoid
shifting bits into position speeds things up greatly, especially when
extracting.

-=*=-

The tree is represented with three arrays, which give the left son,
right son, and parent for every node.  There is no "node" array; the
ring buffer ("text_buf") provides that.

Note that the right child has 256 more entries than lson and dad.
These 256 entries represent the root nodes of the 256 binary trees.

The tree initialization routine (InitTree) does two things:
initializes the 256 root rson entries to NIL, and then clears all of
the parent entries to NIL.  A NIL dad pointer is used to indicate that
the node is not part of the tree.

-=*=-

Starting with Encode()...  First thing is to initialize the trees with
InitTree().  Next is to set code_buf[0] to zero.

code_buf is used to hold the output until we get eight items, where an
item is a pointer or a character.  Once we get the whole batch, we
dump it all at once.  code_buf[0] has the flag bits for the items, and
code_buf[1] through code_buf[16] hold the actual data.  code_buf_ptr
holds our current position in code_buf; it advances 1 byte for
characters and 2 for pointers.

Next we set "mask" to 1.  mask serves two roles: it is used to
logically OR a bit into code_buf[0], and it tells us when code_buf is
full.  Since mask is a single byte, and we initialized it to 1, after
8 left shifts it will become zero (clever, no?)

Next we initialize "s" and "r" (love those one-character variables).
"s" is the leading edge of the lookahead buffer, and therefore points
to the string which is about to be deleted.  "r" is the trailing edge,
and therefore points to the string which is about to be added (or was
just added).  We start with s=0 and r=N-F, which is a little weird
since it means we will be starting at the "end" of the buffer.

Next step is to initialize the ring buffer.  It doesn't really matter
what we fill it with so long as both encoder and decoder use the same
values.  Usually spaces or zeros are used.

Now we're ready to read some characters in.  The idea is to read F
characters into the buffer, starting at position r and not quite
reaching s.  We have to be careful though, because there might not be
F characters in the file.  "len" holds the actual number of characters
we read.  (In general, "len" represents the number of characters left
in the file, but is at most F... so, when there are more than F
characters left, "len" is ignored.  It makes more sense if you read
the code.)

Now we need to insert the strings into the tree.  Since the decoder
can overlap into the look-ahead buffer (see lesson 7 if you don't
remember), we are going to insert F different strings into the tree.
The first is a space followed by 17 characters, then two spaces
followed by 16 characters, and so on.  The source code claims that
inserting them with a gradually increasing #of spaces reduces the
chances of forming degenerate trees; I'm not entirely convinced this
is the case, but what the heck.

Finally, we insert the full string we have read.

-=*=-

One small item I have neglected to mention up until now is that the
InsertNode() routine does more than just insert the node in the proper
position in the binary tree.  It turns out that searching for string
matches and searching for the node's position are the same operation,
so we do both at once.

Two variables, "match_length" and "match_position", are set by
InsertNode.  These tell us how long the match was and where it
occurred.

That last InsertNode(r), which added all F characters we have read
from the input, did this:

     (1) search for a match.  Unless it began with a space, it didn't
find any.  On the other hand, if it was 18 spaces, then it would've
matched all 18 characters.  It appears at first that it would be
impossible to extract this - the decoder has no knowledge of the
contents of the file - but keep in mind that the decoder also starts
out with a buffer full of spaces, and don't forget the "memory move"
fill technique mentioned in lesson 7.

     (2) insert the string.  This amounts to adjusting lson, rson, and
dad appropriately (exact details discussed later).  Since the string
is added to the tree after the search, there is no possibility of the
string matching against itself.

With the initial match_length and match_position in hand, we fall into
the main loop.

-=*=-

First thing to do is to make sure match_length isn't huge.  Remember
than "len" is either F or the number of characters left in the file,
so it's impossible to match more than len characters.  However,
InsertNode may get confused at the end of the file, because the last
few characters in the lookahead buffer aren't valid.

Next comes the important part in the life of a pointer: determining
whether the match_length is large enough.  It's compared against a
constant called THRESHOLD, defined as 2, representing the size of a
pointer.

If the match_length is <= THRESHOLD, then we gain nothing by using a
pointer.  In fact, storing a match of length 2 may actually hurt us,
because the second character could be the start of a really good
match.  (In fact, we don't really know that accepting a match of
length 3, 4, ... is the BEST choice.  We're using "greedy" parsing,
which means that we grab the first thing that looks good.)

For characters, we just stuff the character into code_buf, increment
code_buf_ptr, and OR "mask" into code_buf[0].  We set match_length to
1 to indicate that we have used one byte of input.

For pointers, we encode the position and length into a single 16-bit
word.  It's done in sort of a weird way: the low byte is the low byte
of the position, but the low nybble of the high byte is the length,
and the high nybble of the high byte is the high nybble of the offset.
(Got all that?)  How about a picture (hi bit on the left):

     Length = llll (4 bits)        Position = ppppqqqqrrrr (12 bits)

     Pointer = ppppllllqqqqrrrr (16 bits)

It's a little faster in assembly to do llllppppqqqqrrrr because we're
using 16-bit operations instead of 8-bit, but not by much.

Once we output the appropriate thang, we shift "mask" left one bit and
check to see if it becomes zero.  If it does, we can dump the whole
buffer full of stuff, and reset code_buf[0], code_buf_ptr, and mask.

-=*=-

Now we need to add match_length new characters into the lookahead
buffer, since we just consumed them.  We count down match_length,
doing the following:

     (1) read a new character.  We won't need it for a little bit, but
we want to grab it now to test for end of file.

     (2) delete the string at position s.  Recall that this is the
leading edge of the lookahead buffer.

     (3) stuff the new character into the ring buffer at position s.

     (4) if the lookahead buffer wraps around the end of the ring
buffer, we also stuff the new character in past the end of the ring.
This makes string comparisons faster because we don't have to have
them wrap around the ring.

     (5) s and r are incremented by one, modulo N.  N doesn't have to
be a power of 2, but if it isn't then the modulo operation becomes
costly.

     (6) insert the node at position r.  Recall that this is the
trailing edge of the lookahead buffer, and therefore the string we
just inserted is the F characters found there.  The Fth character is
the one we just read.  Keep in mind that s and r were incremented
between placing the character at s and inserting the string at r...
the order of operations is important.

This is repeated until the lookahead buffer is full again.

-=*=-

At the very bottom of the main loop is a quaint little while loop
which only kicks in at the end of the file (i+1 will always be equal
to last_match_length unless EOF was encountered).  This routine does
the required InsertNode() and DeleteNode() calls without attempting to
get any more characters.  The result is that the last part of the
lookahead buffer will be invalid, so "len" is decremented at every
step.

If we do all the node shifting needed but haven't reached the end of
the input, we go around again.  When len finally bottoms out, the main
loop exits.

Below the main loop is a small code fragment which just outputs any
pending data left in code_buf.  Since all of the data is byte-aligned,
there is no need to send the file's length or a distinguished stop
symbol (in fact, we *can't* send a distinguished stop symbol without
doing something like defining offset 4095 as EOF, and then avoiding it
like the plague).  The decoder can just stop when it reaches the end
of the file.

-=*=-

Now, back to InsertNode and DeleteNode...

These aren't as bad as they look.  InsertNode starts off by setting
"cmp=1".  This ensures that the search loop follows the right child
first, which is a good thing to do since the 256 tree roots only have
right kids.

Next it sets key = &text_buf[r], which is just a pointer into the
string buffer.  The string pointed to by "key" is the string we're
adding; it's just done so we don't have to keep typing "text_buf[r]"
all the time (it's also faster).

Next comes p = N + 1 + key[0].  p is a pointer to the tree root;
rson[p] is the first node in the tree.  Remember that we pick from the
256 trees based on the first character of the string... key[0].

Now we set rson[r] = lson[r] = NIL.  Well, we're inserting node "r"
into the tree, so it can't have any kids right now, and it makes it
easier to clear out the pointers ahead of time.

Finally match_length = 0.  The reason for this should be obvious: we
haven't matched anything yet.

Now we drop into the body of the loop.  "cmp" will be the result of a
strcmp() operation; it is 0 for equal, 1 if one string is larger than
the other, -1 if the other string is larger.  Since we forced it to 1
up front, we will search down the right child.  The current node is
"p", which right now is the root of the tree.

If the right child of the current node (rson[p]) is NIL, we've hit the
end of the tree, and can just insert node r right there.  We set
rson[p]=r, dad[r]=p, and we're done.  If not, we walk down the tree by
setting p=rson[p].  An identical set of operations is performed if we
went down the left branch instead (the "else" clause).

Now comes the string comparison.  We know that the first character
matched, or we wouldn't be in this particular tree.  So we compare
from position 1 to position F-1.  If the characters at some position
don't match, we break out with cmp set positive or negative depending
on which came first ASCII-alphabetically.  "i" will have the number of
characters which matched.

If "i" is larger than the previous match length, set match_position to
the current node (p).  If the match_length is equal to F, then we
aren't going to find a better match anywhere, so we break out of the
loop.

This repeats until we walk off the end of a tree or we match all F
characters.  If we walk off the end, we add the node and exit.
Otherwise, we keep going left or right depending on the outcome of the
string comparison.  (This is why degenerate trees are bad... you keep
going, and going, and going...)

If we matched all 18 characters, we still have a little bit of work to
do.  If the file were full of an 18-character string repeating over
and over, we would eventually have a VERY unbalanced binary tree with
N-F copies of that string.  Which is fine if you're looking for that
particular one, but if you want to find OTHER strings with the same
first character then you get to search through the whole degenerate
list.

To avoid this, we only store one copy of the string in the tree.
Since the one currently in the tree will be deleted sooner than the
one we're adding now, we just replace the old one with the new one,
and set the pointers for the old one to NIL.

The really icky-looking stuff below the main loop in InsertNode does
just that, and nothing more.  Remember that "r" is the node we're
adding, and "p" is the current node; both point to identical strings
(in different places in the buffer).

All it does is set r's dad to p's dad, r's lson/rson to p's lson/rson,
then tells p's kids that r is now their dad, and finally tells p's
parent that r is now its child (the if/else is to determine if p was
the left child or the right child).  It looks ugly, but it's nothing
special.  Draw a small tree and try it.

Finally, dad[p] is set to NIL, which marks it as unused, so that
DeleteNode won't try to remove it.

-=*=-

DeleteNode is equally icky-looking, but it's all pointer manipulation.
Its job is to remove node p from the tree.

First we check to see if dad[p] is NIL; if so, it's not in the tree,
and we can exit immediately.

Now we try to find a successor for p, named q.  If p has only one
child, then that child will become q.  (If rson[p] is NIL, then set q
to the left child.  If lson[p] is NIL, then set q to the right child.
It's okay for q to be equal to NIL, because we have defined NIL = N.
This is why the array declarations use N+1 instead of N.)

If neither child is NIL, we have some work to do.  We need to replace
p with a node that is larger than p's left child but smaller than p's
right child.

Since this is an ordered binary tree, all nodes on the left are
smaller than the current node, and all nodes on the right are larger.
Therefore, the node which falls between the left and right children of
p is the rightmost descendant of the left child.  (It would be equally
valid to use the leftmost descendant of the right child.)

The icky part starts out by assigning q = lson[p].  If q doesn't have
a right son, we don't have to search any further.  If it does, we walk
q down the right pointers until we reach a node with no right child.
There, we remove the node from the tree by replacing the pointer to it
with a pointer to its left child (which may be NIL, or a large
subtree... all we know is that it has no RIGHT descendants).  That
done, we set the pointers so that it becomes the parent of lson[p]
(remember him?)  This node becomes q.

Now then, we're back to the place where p has two kids, but now we
know that the right child of lson[p] is NIL (either it started out
that way, or all the icky stuff we just did stuffed in a node with no
right child).  Now we can just set the right child of q to the tree
which was formerly on the right side of p.

With all that done, we know that p points to the original node and q
points to the replacement, and that p's kids are being pointed to by
somebody.  Now we just remove p from existence by setting q's dad to
p's dad, and pointing p's parent down at q (same old "if" for
left/right child).  The coup de grace is dad[p] = NIL, making it
officially gone.

I *strongly* recommend you draw a binary tree and remove a node.
ASCII graphics would not come out well, so I'm not even going to
attempt it.  Use a tree which looks like:

                         A
                        / \
                       /   \
                      B     p      <-- remove 'p'
                           / \
                          /   \
                         v     w
                        / \
                       /   \
                      x     y
                           /
                          /
                         z

Just walk through the routine and watch what it does.  Replace v, w,
and y with NIL and see how things change.  You should end up with y
replacing p, and z as a child of v.

This is similar to the deletion routine in Sedgewick's _Algorithms in
C_, though that one uses pointers instead of array indices.

-=*=-

Somewhere along the way, we might want to unpack the file.

Extraction is a truly trivial operation.  The input is a stream of
pointers and characters; the characters are just copied out, and the
pointers just give an offset and range to copy from the ring buffer.
The only reason this is more expensive than a simple copy operation is
that we have to update the ring buffer while we work.

Decode() starts off by initializing the array to spaces.  r is again
set to N-F, but this time there is no "s".  We set "flags" to zero to
indicate that the next byte is a flag byte.

The main loop starts off by shifting "flags" to the right.  flags will
be initialized to $ffxx, where "xx" is the byte read.  That way, after
eight logical right shifts the high byte will be zero (kinda like the
trick used in the encoder).

If flags is zero, we get a character from the input, and initialize
flags with it.  If we hit EOF, we are done, and can exit.  Hitting EOF
anywhere else is a bad sign, but we don't check for it in this
implementation.

The flag for the next item on the input stream is in bit 0 (the least
significant bit, unless you spend too much time around IBM
mainframes).  If it's a 1, we read one character from the input, send
it to the output, and add it to text_buf at position r.  r is
incremented, modulo N.

If it's a zero, we read two characters, and decode them into
<position,length> values.  We then copy the characters from that range
to the output and to text_buf at position r, incrementing r each time.

That's it.  We keep doing this until we run out of input.  It is
perfectly valid for <position> to be equal to r+1, which allows us to
do the "memory move" fill stuff.

-=*=-

And there you have it.  A very long description of what is really a
very simple algorithm.  It can be tricky to get all the details just
right - having stray tree pointers or not getting enough characters
can be tough problems to find - but it's not doing anything really
hideous, despite the code's appearance.

-=O=-     DEEP THOUGHTS

- We used greedy parsing, which can cause us to miss long matches
which begin in the middle of short string.  One modification is to use
"lazy parsing", which is greedy, but doesn't output a pointer until it
is sure that the current match is longer than the match found at the
NEXT position as well.  What would need to change to implement this?

- Lazy parsing is one step closer to "optimal parsing", in which the
best possible combination of pointers and characters is found.  How
hard would this be to implement?

- Having a larger ring buffer and longer maximum match length will
improve compression... usually.  They will also increase the size of
the <offset,length> pointer, which means that shorter matches will
occupy more space because the bit widths are larger.  Must they?

-=!=-

This document is Copyright by Genie and Andy McFadden