Hacking Data Compression  Lesson 8  LZW
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie


This week we delve into the famous LZW algorithm.  It probably
would've made more sense to deal with LZ77 variants first, but this is
one of the more interesting algorithms and I figured everyone would
want a chance to look at it early on.  We'll catch up with LZSS in
lesson 10.

This week's lesson picks up where we left off after the discussion of
LZ78 in lesson 7, so it would probably be a good idea to go back and
read that part before you begin this one.

REMINDER: LZW is patented by IBM and Unisys.  Don't even THINK about
using it in something you make money off of without asking for a
license.


-=O=-     Difference from LZ78

Welch, Terry A., "A technique for high-performance data compression,"
IEEE Computer, Volume 17, #6, pages 8-19, June 1984.

LZ78 outputs two symbols each time, the longest matching phrase and
the first character that didn't match.  This makes it easy to rebuild
the dictionary while decoding; you just go to the tree entry for that
phrase and add a new branch corresponding to that character.

However, adding the extra 8 bits of data onto every phrase increases
the size of the output.  LZW was designed to eliminate the extra
character.

The important change is that only phrase pointers are output.  This
means it's no longer sufficient to start with only the null phrase in
the dictionary as in LZ78 (chicken and egg problem).  Instead, phrases
$00 to $ff represent the set of single-character phrases.

An important side effect is that instead of sandwiching the first
character that didn't match between two phrases, we include it as the
first part of the NEXT phrase.  This causes some interesting problems
when decoding because it's possible to encounter a phrase which is
still missing the last character.

-=*=-

Time for an example.  I'm going to use the string from page 227 of
_Text Compression_, namely "aabababaaa".

We start off with one entry in the table for each input character.
This means that we could start with just 0=a and 1=b in the table and
add new ones at 2, but I found this more confusing than helpful.  I'm
going to start the new phrase pointers at $101 ($00-$ff are used for
the characters, and $100 can have a special meaning as we will see
later).

So, we start processing the input.  We have phrase 0="a" and phrase
1="b".

First we get an 'a', which happens to be in the dictionary.  So, we
get the next character, another 'a'.  Since "aa" isn't in the
dictionary, we set $101="aa" and output the prefix for 'a' (0).

Here's where things get different.  With LZ78, we would've output the
0 followed by the second 'a'.  With LZW, we start the NEXT phrase with
the LAST character, so we just output the 0 and hang onto the 'a' for
now.

Lo and behold, the phrase "a" is in the dictionary.  Next character is
a 'b', which isn't in there, so set $102="ab" and output the prefix
for 'a' (0).

Next we start off with the 'b' we just read.  Next character is an
'a', so set $103="ba" and output the prefix for 'b' (1).

Start off with 'a', next character is a 'b', giving "ab", which is
phrase $102.  Next character is an 'a', so set $104="aba" and output
$102.

Start with the 'a', find a 'b' (phrase $102), find another 'a' (phrase
$104), and then another 'a'.  Set $105="abaa" and output $104.

Finally, we use the previous 'a', read an 'a' (phrase $101), and then
hit the end of input.  So we output $101, and are done.

Here's a table to summarize (same as in _Text Compression_):

Input:     a         a         b         ab        aba       aa
Output:   $00       $00       $01       $102      $104      $101

The phrase list looks like this:

     Phrase #       Phrase         How derived
     --------       ------         -----------
       $00            a             (initial)
       $01            b             (initial)

       $101           aa            $00  + 'a'
       $102           ab            $00  + 'b'
       $103           ba            $01  + 'a'
       $104           aba           $102 + 'a'
       $105           abaa          $104 + 'a'

If you look carefully at LZ78 vs LZW, you will see that there is a
greater amount of overlap between phrases, and therefore the phrase
table grows more quickly.  There are LZ78 variants which actually
combine the last two phrases to form new phrases; this causes the
phrase table to expand much more quickly.
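
To tie the example together, here's a minimal C sketch of the encoding
loop we just traced.  It uses a slow linear search in place of the
hash table that the real sample code uses, and it omits table
clearing; the names are mine, not the sample code's:

    #include <stdio.h>

    #define MAX_ENTRIES 4096
    #define FIRST_FREE  0x101        /* $100 reserved for table clear */

    static unsigned short dict_prefix[MAX_ENTRIES];  /* parent phrase */
    static unsigned char  dict_chr[MAX_ENTRIES];     /* appended char */
    static int next_entry = FIRST_FREE;

    /* Linear search for (prefix,chr); slow, but mirrors the example. */
    static int dict_find(int prefix, int chr)
    {
        int i;
        for (i = FIRST_FREE; i < next_entry; i++)
            if (dict_prefix[i] == prefix && dict_chr[i] == chr)
                return i;
        return -1;
    }

    static void encode(FILE *in, void (*put_code)(int))
    {
        int prefix, chr, i;

        if ((prefix = getc(in)) == EOF)
            return;                       /* empty input */
        while ((chr = getc(in)) != EOF) {
            if ((i = dict_find(prefix, chr)) >= 0) {
                prefix = i;               /* keep extending the phrase */
            } else {
                put_code(prefix);         /* longest match found */
                if (next_entry < MAX_ENTRIES) {
                    dict_prefix[next_entry] = prefix;  /* prefix + chr */
                    dict_chr[next_entry] = chr;
                    next_entry++;
                }
                prefix = chr;      /* next phrase STARTS with this chr */
            }
        }
        put_code(prefix);                 /* flush the final phrase */
    }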

-=*=-

Now let's decode the values we output.  Again, our dictionary has
0="a" and 1="b" in it.  The input stream will be:

$00       $00       $01       $102      $104      $101

The first thing we get is $00, so we output an 'a' and set $101="a"
plus whatever comes next (I'm going to write $101="a+").

Next up is $00, so we output another 'a', set $101="aa", and set
$102="a+".

Next we get $01, so we output a 'b', set $102="ab", and $103="b+".

Next is $102.  We output "ab", set $103="ba", and set $104="ab+".

Now we get $104.  Currently, $104 is "ab+" - we haven't figured out
what the last character is yet!  However, we know that the missing
last character of the newest entry is always the FIRST character of
the NEXT phrase we decode; since both of those are phrase $104 here,
the last character of $104 must be the same as its first character.
Therefore, we set $104="aba", output "aba", and set $105="aba+".

Finally we get $101, so we output "aa", set $105="abaa", and set
$106="a+".  Now we're out of input, and are done.

This funny situation is referred to as the "KwKwK" case, because it
happens whenever the input follows that pattern (where K is a
character and w is a string of characters).  This arises because of
the overlapping character in the middle and the "delayed" final
character.
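
Here's the decoding side as a matching C sketch.  The KwKwK branch is
the pair of lines that fill in the "+" before the phrase is used;
recursion stands in for the explicit stack that the real sample code
uses (more on that later).  Again, no table clearing, and the names
are mine:

    #include <stdio.h>

    #define MAX_ENTRIES 4096
    #define FIRST_FREE  0x101

    static unsigned short dict_prefix[MAX_ENTRIES];
    static unsigned char  dict_chr[MAX_ENTRIES];

    /* Output a phrase front-to-back; returns its first character. */
    static int output_phrase(int code, FILE *out)
    {
        int first;
        if (code < FIRST_FREE) {          /* single character */
            putc(code, out);
            return code;
        }
        first = output_phrase(dict_prefix[code], out);
        putc(dict_chr[code], out);
        return first;
    }

    static void decode(int (*get_code)(void), FILE *out)
    {
        int next_entry = FIRST_FREE;
        int code, prev, first;

        if ((prev = get_code()) < 0)
            return;
        putc(prev, out);           /* first code is a plain character */
        first = prev;
        while ((code = get_code()) >= 0) {
            if (code == next_entry) {        /* KwKwK: entry is "w+"; */
                dict_prefix[code] = prev;    /* the '+' must be the   */
                dict_chr[code] = first;      /* phrase's first char   */
            }
            first = output_phrase(code, out);
            if (next_entry < MAX_ENTRIES) {      /* previous phrase   */
                dict_prefix[next_entry] = prev;  /* plus this one's   */
                dict_chr[next_entry] = first;    /* first character   */
                next_entry++;
            }
            prev = code;
        }
    }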

-=O=-     Other Considerations

The first issue is what to do when the phrase table gets full.  We
can't just grow the tree without bound unless we've got unlimited
amounts of memory available.  There are two basic approaches.

(1) Stop adapting.  After the table is full, compression becomes
static.  This is fast and works reasonably well if the first part of
the input is representative of the remainder of the input, but does
poorly otherwise.

(2) Clear the table.  When the table fills, completely erase it, and
start over from $101.

There are a number of variations on (2).  You could remove the oldest
phrases or the least used phrases, but this requires fancy data
structures and slows things down a lot.  Recent versions of the UNIX
"compress" command combine the two: compression becomes static when
the table fills, but the table is cleared if compression performance
worsens.

The sample code for this lesson uses phrase $100 as an explicit "clear
table" code (this is the "special use" mentioned earlier).  It's not
strictly necessary, but it makes the code easier to understand and
debug.  I also use two consecutive clear table codes as an end-of-file
indicator.

-=*=-

As I mentioned in the previous lesson, you can get better results by
encoding the phrase pointer in a variable number of bits.  Since we
already have 257 entries in our dictionary, we start with codes that
are 9 bits wide.

The size of the phrase table and the maximum width of the codes are
directly related.  Programs like ShrinkIt limit the codes to 12 bits,
which means that there are at most 4096 entries in the phrase table
(which includes the first 256 single-character entries and the $100
clear code).  This is done for practical reasons, because the data
structures holding the phrases are 5 or 6 bytes per entry, or 20-24K
(quite a bit of space on an Apple ][+ or //e).

Some programs have an explicit "change bit width" code (e.g. Nelson's
_The Data Compression Book_).  This is slightly silly; it's obvious
that $3ff will fit in 10 bits while $400 will not.  You may be able to
save a few cycles by hardcoding some indices, however.
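
Concretely, both sides can compute the width from the table state they
already share.  A small sketch (my own helper, not from the sample
code; the encoder passes the highest code it might emit, and the
decoder passes entry+1, for reasons explained later):

    /* Smallest width (at least 9 bits) that can hold max_code. */
    static int code_width(int max_code)
    {
        int bits = 9;
        while (max_code >= (1 << bits))  /* e.g. $3ff -> 10, $400 -> 11 */
            bits++;
        return bits;
    }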

Of course, if speed is your primary goal, then it would make sense to
just do away with the variable bit widths altogether.  Using
fixed-width 12-bit output codes hurts compression but makes the
compressed data easier to encode and decode.

No program that I'm aware of has pointers wider than 16 bits (the
default for UNIX compress).  After a certain point you see diminishing
returns.  Also, a file that fills up a phrase table with 64K entries
would have to be pretty large, so the extra table space needed would
be a waste for most things.

-=O=-     Uses

LZW is remarkably fast for an adaptive scheme (easily the fastest
adaptive scheme in this course), and beats most of the others as far
as compression performance goes.  As we will soon see, the algorithm
is very straightforward, making it ideal for a pure hardware
implementation.

As most of you should already be aware, LZW is the key to ShrinkIt and
GS/ShrinkIt (the former clears the table after every 4K of *input*;
the latter uses $100 to clear it when the table fills; both have an
RLE pass immediately preceding the LZW).  It's used in UNIX compress,
which is used for just about everything.

It's also part of the GIF (Graphics Interchange Format) standard.  I
don't know if the authors of commercial GIF converters have to pay a
fee or not; feel free to ask them.

One interesting application is v.42bis, the data compression standard
for high-speed modems.  They used a cute trick where they send a
special code to toggle the compression off when it fails to make the
data smaller.  Both sides continue to update the LZW trees, so that if
compression improves it can be turned back on.  This way, you get the
benefits of compression without having to worry about the transmission
slowing down if the data being sent is already compressed.

(Nelson's book has an interesting note: there was some debate over
whether or not LZW was suitable due to the patent restrictions, so
Unisys decided to charge a one-time fee of $25,000 per manufacturer
for licensing.  This was enough to protect Unisys's patent and
(apparently) make everybody happy.)

And, of course, it's part of HardPressed (obligatory self-promotion).

-=O=-     PROGRAM

As usual, there are C and assembly implementations of a simple LZW
compressor.

I wrote it to be easy to read, and to be easy to compare the C and
assembly versions.  There are a number of optimizations which could be
made, but those are left as an exercise for the reader.

When you get right down to it, there really isn't all that much code
here.  Part of it is the relative simplicity of the LZW algorithm, and
part is an elegant implementation.  If you compare the C and assembly
versions, you may find that it's actually easier to understand the
assembly code, because LZW translates well to 65816.

-=*=-

The encoder and decoder are strikingly different.  First, the encoder.

LZW requires a tree with nodes that can have up to 256 children.  That
would require 4096*256 entries to represent entirely, which is
unacceptable.  We could use a dynamic tree structure, but that
requires either space for 256 child pointers per node, or an expensive
linear search.

The usual solution is to use a hash table, finding entries based on
the prefix and character, and probing after a hash collision.  We will
have at most 4096-257 entries in our tree (12-bit codes, minus the 256
single-character phrases and the clear-table code), so we can use any
prime number larger than 3839.  To keep collisions to a minimum, we
use 5119.

Each entry in the hash table has three values: "entry" (the value from
$101 to $fff), "prefix" (the prefix represented by the parent node),
and "chr" (the character we added to get here).  This gives us a
(roughly) 25K hash table.  To conserve memory, we can overlay this
with the tables used for decoding (the assembly version does, the C
version doesn't).

The assembly version uses three different arrays so that the same
index register can be used to access each.  The C version uses a
struct.
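
In C, such an entry might be declared like this (the field names
follow the description above; the actual sample code may differ):

    #define HASH_SIZE 5119      /* the prime chosen above */

    struct hash_entry {
        unsigned short entry;   /* $101-$fff, or 0 if the slot is empty */
        unsigned short prefix;  /* code of the parent phrase */
        unsigned char  chr;     /* character appended to the prefix */
    };

    static struct hash_entry table[HASH_SIZE];  /* 5 bytes each, ~25K */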

-=*=-

The first thing the encoder does is clear out the table by setting the
"entry" fields to zero.  This marks the hash table entries as empty.

The outer loop grabs a character, and falls into the main loop.  This
character becomes the initial prefix.  Because of the way the
implementation works, in a sense we're always running one step ahead
of what we're working on... we won't output the prefix until after
we've located the next.  This is because we need to store
(prefix,character) in the tree at the node BELOW the node for "prefix"
for fast searches.  (Did that make sense?  From the node for the
prefix, you take the path for that character to a child node.  That
child node gets (prefix,character) stored in it.)

Once we have a prefix, we fall into the "input loop", which keeps
grabbing characters until we find an empty spot in the tree or run out
of input.  When we find a blank spot in the tree, we create the new
node (by filling in the entry in the hash table) and branch back to
the main loop, with the prefix equal to the last character we found.

The hash function is rather simple, but effective.  The secondary
probe is also nothing fancy; it just walks backward through the table.
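
I won't reproduce the sample code here, but the general shape is
something like this, using three parallel arrays as the assembly
version does (the hash function itself is my illustration; the sample
code's may differ):

    #define HASH_SIZE 5119

    static unsigned short hash_entry[HASH_SIZE];   /* 0 = empty slot */
    static unsigned short hash_prefix[HASH_SIZE];
    static unsigned char  hash_chr[HASH_SIZE];

    /* Return the slot holding (prefix,chr), or the empty slot where
     * it belongs.  At most 3839 of the 5119 slots are ever filled,
     * so an empty slot always exists and the loop terminates. */
    static int find_slot(int prefix, int chr)
    {
        int index = ((chr << 4) ^ prefix) % HASH_SIZE;  /* simple hash */

        while (hash_entry[index] != 0 &&
               (hash_prefix[index] != prefix || hash_chr[index] != chr)) {
            if (--index < 0)             /* secondary probe: walk     */
                index = HASH_SIZE - 1;   /* backward, wrapping around */
        }
        return index;
    }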

If we determine that the table has filled, we output the last
character read (which would have become the prefix for the next trip
around), and then output $100.  Then we clear the table out, reset
entry to $101, and branch back to the beginning.

When we reach the end of the file, we want to make sure that there are
two consecutive $100s at the end of the output.  If we just output a
$100, however, then the put_code call after "eof_hit" would output
another one (take a look at the code to see what I mean).  Outputting
two more would make a total of four, which is wasteful, so we check
for that and avoid it.  Take a careful look at how "entry" is adjusted
and what values are used for it when calling put_code(); if you don't
get these right, the encoder and decoder can become out of sync.

-=*=-

The put_code() (or PutCode) routine deserves little attention.  It's
really nothing fancy; just outputs a variable-width code from 9 to 12
bits.  It's similar to the routines we've seen before, though we can
get away with a minor optimization: since the codes are always at
least 9 bits, we know that we can output at least one byte every time.

You could replace the routine with one that uses only fixed-width
12-bit codes, and get a slight increase in speed with a slight loss in
compression.  The speed increase is actually smaller than you might
expect... the main routine spends a considerable amount of time
running around probing through the hash table.
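
For reference, here's roughly what a variable-width put_code() can
look like in C, packing bits low-order first (the sample code's bit
order and buffering may well differ - treat this as a sketch):

    #include <stdio.h>

    static unsigned long out_buf = 0;   /* bits waiting to be written */
    static int out_count = 0;           /* how many bits are waiting */

    static void put_code(int code, int width, FILE *out)
    {
        out_buf |= (unsigned long)code << out_count;
        out_count += width;
        while (out_count >= 8) {        /* width >= 9 guarantees at   */
            putc((int)(out_buf & 0xff), out);  /* least one byte out  */
            out_buf >>= 8;
            out_count -= 8;
        }
    }

    /* At end of output, any leftover bits need one final byte. */
    static void flush_bits(FILE *out)
    {
        if (out_count > 0) {
            putc((int)(out_buf & 0xff), out);
            out_buf = 0;
            out_count = 0;
        }
    }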

-=*=-

The decoder needs to reconstruct an identical tree, but doesn't have
the searching problem that the encoder does.  The input consists of
phrase pointers ranging from $00 to $fff, which can be used as indices
into an array.  Thus the decoder data structure acts as both a simple
array and a 256-ary tree.

The decoder table entries have two fields, "prefix" (a pointer to the
node's parent) and "chr" (the character this entry adds to the
prefix).  So
"ab" might be represented by node $104, which contains ($61,'b'),
meaning that you should output a 'b' after prefix $61 (which is simply
the letter 'a').

As mentioned earlier, we are going to walk up the tree to grab the
characters, but need to output them as if we were walking down.  The
example in the above paragraph illustrates the need to stuff the
characters onto a stack: we want to output 'b' after everything in
prefix $61.  A fully degenerate tree, which you would get by
compressing a file full of identical characters, could be 4096-256 =
3840 bytes tall, so the stack must be at least that large.  This is
more space than we want to grab from bank 0, so we use a software
stack instead.
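
In C, the walk-up and unwind might look like this (an ordinary array
stands in for the software stack; the names are illustrative):

    #include <stdio.h>

    #define FIRST_FREE 0x101
    #define STACK_SIZE 3840          /* tallest possible phrase */

    static unsigned short dict_prefix[4096];
    static unsigned char  dict_chr[4096];

    static unsigned char  stack[STACK_SIZE];
    static int stack_ptr = 0;

    /* Climb from the node toward the root, stacking characters. */
    static int walk_up(int code)
    {
        while (code >= FIRST_FREE) {
            stack[stack_ptr++] = dict_chr[code];
            code = dict_prefix[code];
        }
        return code;         /* a plain character, $00-$ff */
    }

    /* Output a whole phrase: first character, then unwind the stack. */
    static void dump_phrase(int code, FILE *out)
    {
        putc(walk_up(code), out);
        while (stack_ptr > 0)
            putc(stack[--stack_ptr], out);
    }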

-=*=-

We start off by initializing all of the various counters.  There's no
need to initialize the table, since the choice of which entry to use
is driven by the encoded output.  For this reason, it's possible to
have more than one compressed representation of a file.  For example,
the encoder could send table clears at odd times, or use only part of
a phrase (neither of which is likely to be helpful).

If "entry" is $101, then we either just got a table clear code or are
just beginning.  So, we grab a character to start things off.  Since
we know that the first code is always < $100 (unless it's a table
clear), we just output it immediately.

Each pass through the loop, we get another phrase pointer with
get_code() (or GetCode).  If it's a table clear, we reset the indices
and go back to the start of the loop.  If it's larger than the next
available entry in the table, then something went wrong.  If it's
exactly equal to the next available entry, then we've encountered the
KwKwK case, so we push on the last character from the previous pass
through the loop (this is equivalent to filling in the "+" in the
explanation).

With that out of the way, we begin walking up the tree, pushing
characters on as we go.  We know we've reached the top when the
pointer becomes less than $100.  Instead of pushing the last pointer
on and then popping it off immediately, we just output it.  (Note that
we're outputting a pointer as if it were a character.  In theory, this
is wrong.  However, the low 8 bits of the pointer are equal to the
character, and the high 1-4 bits are always zero, so it works out.)

After we've output the entire stack, we add the prefix/character
combination to the table.  Actually, we're adding the PREVIOUS prefix,
because we didn't know what the last character was until this time
through the loop.  The CURRENT prefix isn't added until later.  Thus,
we're still running one step ahead of ourselves, which can be
confusing.

The nice thing about the decoder is that nothing special needs to be
done at EOF.  When we see two consecutive table clears, we can just
exit.

-=*=-

The get_code() (or GetCode) routine is also rather unremarkable.  It
extracts variable-width codes from 9 to 12 bits in width.  Since we're
running ahead of ourselves, we have to use entry+1 instead of entry.

Beyond that, it's just a matter of getting all the bytes needed,
shifting them into place, and then ANDing off the unnecessary stuff.
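
Here's a matching sketch of such a reader, using the same
low-order-first convention as the put_code() sketch earlier:

    #include <stdio.h>

    static unsigned long in_buf = 0;    /* bits not yet handed out */
    static int in_count = 0;

    /* Returns the next code, or -1 at end of input. */
    static int get_code(int width, FILE *in)
    {
        int c, code;

        while (in_count < width) {      /* gather the bytes needed */
            if ((c = getc(in)) == EOF)
                return -1;
            in_buf |= (unsigned long)c << in_count;
            in_count += 8;
        }
        code = (int)(in_buf & ((1L << width) - 1));  /* AND off extras */
        in_buf >>= width;
        in_count -= width;
        return code;
    }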

-=*=-

That's about it for the code.  It compares favorably to ShrinkIt in
terms of speed, but remember that ShrinkIt also has to do a CRC and an
RLE pass.  Not to mention that P8 ShrinkIt is written in 65c02, so all
the nice 16-bit stuff we were doing is much harder.

If you want a better understanding of the code, try stepping through
it with the examples we did earlier in the lesson.

With some optimizing this can be made pretty fast.  Have at it.

-=O=-     DEEP THOUGHTS

- Is it possible to pre-load the phrase table with things like "if",
"and", "the", "that", and so on?  Would effect would it have?

- Suppose you wanted to remove phrases from the table on an LRU (Least
Recently Used) basis.  How would you communicate this to the decoder?
What would have to change in both the encoder and decoder?

- Why doesn't the decoder have a problem with KwJwK or KwKwJ (where K
and J are different characters)?

- Is it possible to remove the stack from the decoder?

- Does the code work correctly for an empty file?

-=!=-

This document is Copyright by Genie and Andy McFadden