
Hacking Data Compression  Lesson 7  Ziv and Lempel
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie


There is NO source code for this week's lesson.  If you're only after
sample code, you can stop reading now.

This lesson is an introduction to the algorithms on which most recent
compressors are based.  They are known as LZ77 and LZ78, referring to
the papers published by Ziv and Lempel in 1977 and 1978, respectively
(the transposition of 'L' and 'Z' was made early and propagated
widely).  They form the basis of almost all modern adaptive dictionary
compression techniques.

They are also patented to death.  All LZ78-class algorithms and many
LZ77-class algorithms are the victims of software patents, meaning
that if you use them you will have to pay the patent holders or face a
court challenge.  Many of the patents are based on very rocky soil,
but few of us can put up enough money to fight them in court.  Most of
the patent holders will not go after non-commercial uses of patented
algorithms, which is why we have programs like ShrinkIt and UNIX
compress.  However, if you want to use them in a commercial product
(e.g. HardPressed), you will have to pay.

-=O=-     The story so far

Just to make sure we're all up to speed...

- Dictionary compressors replace a symbol or collection of symbols
with an index into a dictionary.  How the symbols are chosen and how
the index is represented in the output are implementation-dependent.
See Lesson 3 for a more in-depth discussion of dictionary models.

- Adaptive dictionary compressors start out with no knowledge or with
a small static dictionary, and then expand the dictionary with pieces
of the input as they work.  See Lesson 5 for a discussion of the
merits of adaptive models.

We've already seen one example of an adaptive dictionary compressor,
the MTF (Move To Front) compressor from Lesson 3.  However, that only
worked with whole words, and had some complications.

-=*=-

What makes these algorithms valuable is that they combine the
advantages of an adaptive model with the speed and efficiency of
dictionary models.  Not only will LZ78 compress better than your
average semi-adaptive Huffman compressor, it will also do it much more
quickly.

What is surprising about these schemes is that they are simple in both
concept and implementation.

-=O=-     LZ77

Jacob Ziv and Abraham Lempel, "A universal algorithm for sequential
data compression", IEEE Transactions on Information Theory, Volume 23,
Number 3, May 1977, pages 337-343.


LZ77 works by treating the input file as the dictionary.  Instead of
encoding an index referring to a dictionary full of strings, it
encodes a triple <off, len, char>, where "off" is the file offset at
which the string starts, "len" is the length of the match, and "char"
is the first character that didn't match.

For efficiency reasons, most LZ77-based schemes use a "sliding window"
approach.  The compressor keeps track of the last N bytes of data
(usually 4K or 8K) in a buffer, sliding it along one character at a
time, adding the next character at the front of the buffer and
removing the character at the end.  The buffer looks something like
this:

     0  1  2  3  4  5  6  ||  7  8  9
     a  b  b  a  c  a  c      b  a  c

The characters on the left have already been compressed.  The
characters on the right do double duty as a lookahead buffer (which
can be useful, as explained later) and as the string to compress next.
The number of bytes in the lookahead buffer is referred to as 'F', so
the first N-F characters in the buffer have already been encoded.

When compressing, LZ77 searches through the buffer to find the best
match.  This is just a simple string comparison, so a simple-minded
approach would roll through and compare every string in turn (i.e.,
compare F bytes starting from offset 0 with the bytes in the lookahead
buffer, then compare F bytes starting from offset 1, and so on).  When
it finds the best match or runs out of buffer, it outputs the buffer
offset at which it was found, and how many characters it matched.

After compressing a string, we shift everything left by "len"+1
characters (the match plus the explicit character), and grab that many
more characters from the input.  It
should be clear that the longest string you can match is equal to F,
the size of the lookahead buffer.  (This is not a limitation of the
family of algorithms - you can do away with the lookahead buffer
entirely if you want - but it's the way LZ77 works.)

To correctly deal with the case where NOTHING matched (perhaps a
character seen for the first time), the first character that didn't
match is also output.  There are ways to avoid doing this, which we
will see in a later lesson.
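
To make the search concrete, here is a rough sketch of it in C.  The
window size N, lookahead size F, the buf[] layout, and the name
find_match() are all made up for illustration, and the match is capped
at F-1 bytes so the mismatched character is still inside the
lookahead; a real compressor also needs the surrounding I/O and
window-shifting code.

#define N 4096          /* total window size */
#define F 16            /* lookahead buffer size */

struct triple {
    int  off;           /* offset of the best match in the window */
    int  len;           /* number of characters that matched */
    unsigned char ch;   /* first character that didn't match */
};

/* Brute-force search: compare the lookahead (buf[N-F..N-1]) against
 * a string starting at every offset in the already-encoded part of
 * the window (0..N-F-1), keeping the longest match found. */
struct triple find_match(const unsigned char *buf)
{
    struct triple t;
    int i, j;

    t.off = 0;
    t.len = 0;
    for (i = 0; i < N - F; i++) {
        for (j = 0; j < F - 1; j++) {   /* cap at F-1: leave room for ch */
            if (buf[i + j] != buf[N - F + j])
                break;
        }
        if (j > t.len) {                /* longest match so far */
            t.off = i;
            t.len = j;
        }
    }
    t.ch = buf[N - F + t.len];          /* first mismatched character */
    return t;
}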

All the decompression routines need to do is copy "len" bytes of data
at the specified offset to the output, and tack on the extra
character.  This makes extraction very fast, since it's really nothing
more than a memory copy.

(I'm being somewhat imprecise here; what's important is to see that
its dictionary is nothing more than the last N-F bytes of input.)

-=*=-

The neat thing about the lookahead buffer is that you can do something
reminiscent of an old Apple II monitor memory fill technique.

Suppose you have this situation:

     1  2  3  4      5  6  7  8  9
     a  b  c  x  ||  x  x  x  x  y

The string comparisons are allowed to overlap into the lookahead
buffer, so long as at least the first character is outside.  So to
compress the string "xxxxy", you would output <4, 4, 'y'> because the
first match occurred at 4, it ran for four bytes, and the first
mismatch was 'y'.

When this is being extracted, the decompressor will see something
like:

     <?, ?, 'x'> <4, 4, 'y'>

So it will already have the 'x' at position 4.  It then copies the 'x'
from position 4 to position 5, then copies position 5 (which is now an
'x') to position 6, and then copies position 6 (which is also now an
'x') to position 7.  Position 7 gets copied to position 8, and finally
it sticks the 'y' into position 9.

It's not immediately obvious why this works in general; after all,
we're matching part of the string with itself!  The trick is that we
are matching the LATER pieces of the string with the EARLIER pieces of
the string, and the EARLIER pieces of the string with stuff that we've
already compressed (or extracted).
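
Here is a sketch of the extraction step for one triple in C.  For
simplicity the offset is treated as an absolute position in the output
rather than a position in a circular window, and expand_triple() is
just an illustrative name.  The point to notice is that the copy runs
forward one byte at a time, which is exactly why an overlapping match
like the one above expands correctly.

/* Expand a single <off, len, char> triple.  "out" is the output
 * buffer, "*pos" the current output position. */
void expand_triple(unsigned char *out, int *pos, int off, int len,
                   unsigned char ch)
{
    int i;

    for (i = 0; i < len; i++)
        out[(*pos)++] = out[off + i];   /* may read a byte we just wrote */
    out[(*pos)++] = ch;                 /* tack on the mismatched character */
}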

-=*=-

There are some other implementation details to consider.  First, the
sliding window.  Shifting characters over by 1 every time we need more
input is rather slow.  The usual solution is to use a circular buffer,
and then treat all values mod N.  For this reason, the buffer size is
usually a power of 2 (so you can do a bitwise AND to do the modulus).
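
In C, the index arithmetic looks something like this (N and next_pos()
are illustrative names, not anything from a real implementation):

#define N 4096                          /* window size, a power of 2 */

/* Advance an index one position around the circular window; the
 * AND replaces the modulus because N is a power of two. */
static int next_pos(int pos)
{
    return (pos + 1) & (N - 1);         /* same as (pos + 1) % N */
}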

Second, we could really use some auxiliary data structures to speed up
the searching.  This doesn't have to be terribly complicated.  One
commonly used technique is to build a binary tree with N-F entries in
it (one for every encoded string), each of which is F bytes long.
This sounds like it uses a ton of memory, but it actually doesn't,
because each entry in the tree is just an offset into the sliding
window.  Nodes are added and deleted as the window shifts.

Third, we haven't said what's in the buffer at the outset.  It
actually doesn't matter so long as the encoder and decoder have the
same picture.  Traditionally it is filled with either spaces or zeros.

-=*=-

This is one of the more intuitive models we've seen.  Variations on
this allow the string to start at any position in the file and go on
for any length, but this requires some fancy encoding techniques and
may require infinite amounts of memory.

The neat thing about this is that it will grab whole words or even
groups of words that have been used in the recent past.  The penalty
for encountering new words is slight, since many words can be broken
down into smaller words that have been seen previously.  This, plus
the slick RLE-like nature of the lookahead buffer (which will work for
pairs or groups of symbols, making it almost as effective as
PackBytes), contributes to a very effective model.

The main drawback to LZ77 is the compression speed.  This can be
reduced with fancy hashing techniques, but there's no way to avoid
doing a lot of string comparisons.

From a theoretical viewpoint, the failing of LZ77 is that it forgets
about data older than 4K or 8K.  Circumventing this problem not only
requires large amounts of memory, it also implies doing string
comparisons against the entire file, so the time required increases
linearly.  Also, the fixed maximum value of "len" gets in the way,
because it limits the best attainable compression ratio to the maximum
"len" (i.e., F) divided by the size of the <off, len, char> triple.

From a practical standpoint, the method of encoding the dictionary
index is lame.  If your first five characters are "abcde", then you
will have a series of triples that look like <0, 0, 'a'> <0, 0, 'b'>,
and so on.  LZSS removes the painful waste from the encoding, and LZH
and LZB take things a step further by encoding the offset and length
values more efficiently.

-=O=-     LZ78

Jacob Ziv and Abraham Lempel, "Compression of individual sequences via
variable-rate coding", IEEE Transactions on Information Theory, Volume
24, Number 5, September 1978, pages 530-536.

The LZ78 family of algorithms represents a completely different
approach to the problem.  It is interesting from a practical
standpoint (because it compresses quickly), and from a theoretical
standpoint (because the compression gradually becomes optimal).

Instead of treating the input as a big wad of text, LZ78 breaks the
input into "phrases", where each phrase is the longest matching
previously seen phrase plus the first character that didn't match.

That didn't make much sense.  Let's try an example (I'm going to pull
the one out of Bell/Cleary/Witten so I don't blow it and confuse you
worse).

Suppose you have the input string "aaabbabaabaaabab".  Kinda boring,
but hey.  The LZ78 compressor gets the string one character at a time.

First, it gets 'a'.  It doesn't have any phrases in its dictionary, so
it says "1 = a" and outputs <0, 'a'>.  This means, "when decoding,
take phrase 0 (the empty phrase), and tack on an 'a'."  The decoder
automatically assigns the result of this to phrase 1, which keeps the
decoder in sync with the encoder.

Next, it gets another 'a'.  It has a phrase in its dictionary which
matches it, so it goes and grabs the next character, another 'a'.  It
assigns "2 = 1+a", and outputs <1, 'a'>, since we got phrase #1
followed by an 'a'.

Next character is a 'b'.  We haven't seen this before, so set "3 = b"
and output <0, 'b'>.

Next up is another 'b'.  Matches phrase 3, so grab the next character,
an 'a'.  We don't have "ba" in the dictionary, so set "4 = 3+a" and
output <3, 'a'>.

Now we're starting to roll.  So far we have:

        Input:   a     aa    b     ba
Phrase number:   1     2     3     4
       Output: <0,a> <1,a> <0,b> <3,a>

(Note that "ab" is NOT a known phrase, even though it has appeared in
the input.  Thus LZ78 doesn't catch ALL substrings like LZ77 does.)

Next character is a 'b', phrase 3.  Next one after that is an 'a',
phrase 4 (NOT phrase 1... we're looking for phrase 3 followed by an
'a', not phrase 0 followed by an 'a').  Next is another a, but we
haven't seen "baa" before, so set "5 = 4+a" and output <4, 'a'>.  (The
decoder will know that 4 = 3+a, and that 3 = 0+b, so it will output
"baa".)

Next character is a 'b' (phrase 3), followed by an 'a' (phrase 4),
followed by another 'a' (phrase 5), and then another 'a'.  So we set
"6 = 5+a" and output <5, 'a'>.  This means we just replaced the string
"baaa" with the pair <5, 'a'>, so our compression rate is improving.

Finally, we get a 'b' (phrase 3), an 'a' (phrase 4), and a 'b' (not
seen yet), so we set "7 = 4+b" and output <4, 'b'>.  The final results
look like this:

        Input:   a     aa    b     ba   baa  baaa   bab
Phrase number:   1     2     3     4     5     6     7
       Output: <0,a> <1,a> <0,b> <3,a> <4,a> <5,a> <4,b>

A couple of things should be pointed out.  First, this is cutting the
input up into little pieces, so if your next input string is
"baabaaa", you will end up with a new phrase for "baab" (5+b), whereas
LZ77 would have been able to match the whole thing.

On the other hand, there is no limit on the length of the phrases.  As
you can see, they slowly get longer as we process more and more input.
There is no limit on how far back a phrase can reference.  When we
fill up the memory we have available, we can simply clear the table
and start compressing again with only the null phrase (phrase 0).

From a practical standpoint, LZ78 has MUCH faster compression, because
the phrases can be represented by a "trie" data structure.  Each node
in the trie represents a particular phrase.  There is one child for
every character which can follow that phrase.  (Note that it is NOT
necessary to have one child for every POSSIBLE character which can
follow; only those we have already seen.)  The trie for the above
sequence would look like this:

                         0
                     a /   \ b
                     /       \
                   /           \
                  1             3
               a /           a /
                /             /
               /             /
              2             4
                         a / \ b
                          /   \
                         /     \
                        5       7
                     a /
                      /
                     /
                    6

(This is a binary tree because there were only two different
characters in the input.  In practice there can be up to 256 children
of a given node.  Note that we start out with just node 0, and
gradually add paths and nodes as we go.  This can be done efficiently,
as we will see later on in the LZW lesson.)

When considering the input string, we start at node 0, and traverse
the appropriate branch at each step.  So if we got "baab" next, we
would start at 0, take the 'b' branch to 3, take the 'a' branch to 4,
take the 'a' branch to 5, and then try to take the nonexistent 'b'
branch.

When we reach a node that doesn't have a branch for the next
character, we create a new node and a branch to it, then output the
number of the node we stopped at and the character we stopped with.
This enables the decoder to reconstruct an identical tree.  For the
example in the preceding paragraph, we would take the last node we
went through (5) and add a 'b', so we'd output <5,b>.
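
Here is a rough sketch of that compression loop in C.  The first-child
/ next-sibling node layout, the emit_pair() output routine, the table
size, and the table-full handling are simplified stand-ins rather than
anything definitive.

#include <stdio.h>

#define MAX_PHRASES 4096

struct node {
    int           child;    /* first child of this phrase, or -1 */
    int           sibling;  /* next sibling under the same parent, or -1 */
    unsigned char ch;       /* character on the edge from the parent */
};

static struct node trie[MAX_PHRASES];
static int num_phrases;

static void emit_pair(int phrase, int ch)   /* stand-in for real output */
{
    printf("<%d,%c>\n", phrase, ch);
}

/* Return the child of 'parent' reached by character 'c', or -1. */
static int find_child(int parent, int c)
{
    int n = trie[parent].child;

    while (n != -1 && trie[n].ch != (unsigned char)c)
        n = trie[n].sibling;
    return n;
}

void lz78_compress(FILE *in)
{
    int cur = 0;                        /* node 0 is the empty phrase */
    int c, next;

    trie[0].child = -1;
    num_phrases = 1;

    while ((c = getc(in)) != EOF) {
        next = find_child(cur, c);
        if (next != -1) {               /* phrase + c is already known */
            cur = next;
            continue;
        }
        emit_pair(cur, c);              /* longest known phrase, then mismatch */
        if (num_phrases < MAX_PHRASES) {    /* add the new phrase to the trie */
            trie[num_phrases].ch = (unsigned char)c;
            trie[num_phrases].child = -1;
            trie[num_phrases].sibling = trie[cur].child;
            trie[cur].child = num_phrases;
            num_phrases++;
        }
        cur = 0;                        /* next phrase starts back at the root */
    }
    /* (A real implementation also has to flush a partial phrase left
     *  over at end of input; that detail is omitted here.) */
}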

When decoding (recall that the output consists of <phrase, char>
pairs), we start at the node specified, and work our way up, pushing
characters onto a stack.  When we reach the root (phrase 0), we dump
the contents of the stack, followed by the character.
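
A sketch of the decoding step in C might look like this; parent[],
lastch[], and output_byte() are illustrative stand-ins, and the
decoder grows its dictionary in step with the encoder.

#include <stdio.h>

#define MAX_PHRASES 4096

static int           parent[MAX_PHRASES];   /* phrase 0 is the empty phrase */
static unsigned char lastch[MAX_PHRASES];
static int           num_phrases = 1;

static void output_byte(int c) { putchar(c); }  /* stand-in output routine */

/* Expand one <phrase, char> pair and add the new phrase it defines. */
void decode_pair(int phrase, int ch)
{
    unsigned char stack[MAX_PHRASES];
    int sp = 0;
    int p = phrase;

    while (p != 0) {                    /* walk up toward the root ... */
        stack[sp++] = lastch[p];
        p = parent[p];
    }
    while (sp > 0)                      /* ... then dump the stack in order */
        output_byte(stack[--sp]);
    output_byte(ch);                    /* the explicit character */

    if (num_phrases < MAX_PHRASES) {    /* grow the dictionary in step */
        parent[num_phrases] = phrase;
        lastch[num_phrases] = (unsigned char)ch;
        num_phrases++;
    }
}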

(If you're Not Quite Getting It, try working through the previous
example, but by building the tree as you go (hey, if you've got the
lesson printed out, it's a good excuse to scribble all over it).  It's
a lot easier to understand if you watch the tree being built.  It is
VERY important that you get LZ78 straight in your head, because LZW
takes some odd turns and can be very difficult to follow.)

-=*=-

One important consideration is how the phrase number will be encoded.
Since there is no limit on the size, we need an efficient way to
encode it.  The solution is to encode it in a variable number of bits;
recall that a number from 0 up to P can be represented in
floor(log2(P)) + 1 bits (which is a fancy way of saying that you can
go from 0 to 255 with 8 bits).

So, you start out with one bit.  After phrase 1, you use two bits.
After phrase 3, three bits.  After phrase 7, four bits, and so on.
The size is automatically expanded by both the compressor and
decompressor, so there is no confusion over how many bits to get.
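
As a small illustration, here is one way to compute the current width
in C; phrase_bits() is a made-up name, and "count" is the number of
phrases defined so far, counting the empty phrase 0.

/* One bit while only phrase 0 exists, two bits once phrase 1 exists,
 * three bits once phrase 3 exists, and so on. */
static int phrase_bits(int count)
{
    int bits = 1;

    while ((1 << bits) <= count)        /* grow until 2^bits exceeds count */
        bits++;
    return bits;
}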

The character is simply dumped into the output.  This is wasteful, but
is corrected by LZW.

-=*=-

As you can see by thinking about how the tree is built, there is no
limit on the length of the phrases other than what happens when we run
out of memory.  It should also be clear that each new phrase is made
up of an old phrase plus one character (which is what I was trying to
say back at the very beginning, only it made less sense then).

From the example we did, you can see that construction of the
dictionary proceeds more slowly than it did for LZSS.  It doesn't
contain the COMPLETE set of strings seen so far.  However, it does
contain a UNIQUE set of strings seen so far - each addition to the
table is a previously seen prefix plus one character which combine to
form an as-yet unseen string.

From a theoretical standpoint, the compression is "asymptotically
optimal" as the size of the input increases, meaning that an
indefinitely long input will be compressed down to a size very close
to its entropy.  Unfortunately, most inputs are considerably shorter
than infinity, and you'd have to compress everything ever written
before you got close.  (Of course, you'd probably run out of memory
first...)

There's a restriction on the previous statement which says that the
source input must be "ergodic", which means that "any sequence it
produces becomes entirely representative of the source as its length
grows longer and longer."  Whatever.  It just means that if you have
nothing but 'a's for 2 gigabytes, you're going to get a big dip in
compression (and a discontinuity in your asymptote) when you start
hitting 'b's.  Most kinds of input, especially text, are ergodic, so
this is a pretty weak restriction.  (Don't worry about this much; it's
an interesting result, but not useful.)

LZ78 is the basis of LZW, which is used in many different places
(ShrinkIt, GIF pictures, v.42bis for modems, etc).  LZW will be
discussed in depth next lesson.

-=O=-     Relation to statistical schemes

As I mentioned in passing a while back, any dictionary scheme can be
converted into an equivalent statistical scheme.  There is a paper out
there somewhere which examines LZ78 from this standpoint.

By assigning probabilities to the branches of the tree and to future
events, it can be shown that LZ78 does poorly near the start of the
input when it has no information in the tree.  This corresponds to
what we saw in practice: LZ78 did poorly at first, but seemed to be
gathering steam as it went.

It can be interesting to try to combine something like LZW with either
a statistical model or an LZ77 model until it gets going.  The results
aren't all that great, but it's something to do if you're bored.

-=O=-     PROGRAM

Hey, I already said there wasn't one.

-=O=-     DEEP THOUGHTS

- What happens when you run these backwards?  (If you're having
trouble picturing it, assume a full dictionary for LZ78 and a buffer
full of stuff for LZ77.  Then start feeding random bits to the
decoder.)  What effect does the extra character added after each
offset/phrase have?

- I previously stated that bigger dictionaries == better compression
(back in Lesson 3 or so).  Does this hold true for LZ77?  What about
LZ78?  (Hint: consider the uniqueness property of LZ78.)

- When decoding LZ78, the characters are pushed onto a stack as the
tree is traversed.  How big does the stack need to be?

-=!=-

This document is Copyright by Genie and Andy McFadden