Hacking Data Compression  Lesson 11  LZH, LZARI, and LZB
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1993 GEnie

This is the last of the lessons dealing with specific compression methods. The twelfth and final lesson will be on writing HardPressed(tm) compression modules.

All three of the methods described this week are nothing more than LZSS with improved encoding. Thus, lesson 7 (LZ77/LZ78) and lesson 10 (LZSS) are prerequisites.

-=O=-

A brief history of time

I've included a file called "History" in the samples directory. It's an explanation of the history behind LZH, and it includes a brief description of several of the algorithms we've explored during this course. You should be able to read and understand all of it by now. If you're having trouble following one of the sections, go back and reread the lesson that covered it. Sometimes this stuff takes a while to soak in, and this file may help to point out the weak spots.

Perhaps the most interesting part about it is the stuff that ISN'T said. The LZSS implementation from last week began as a simple compressor (no binary trees) back in 1988. Recall that LZ77 was introduced in 1977, and LZSS was published in 1982. Everybody was so hung up on LZW that LZSS was all but ignored until Haruhiko Okumura wrote some code and started blowing the doors off existing software.

Anyway. It all began with LZSS.

-=O=-

LZARI and LZH

The first attempt at improving compression involved arithmetic coding. This was a natural choice, given that it's (theoretically) the best possible encoding method. The problem was how to add it.

Encoding the free characters and the pointers separately made sense, because they really have no relation to each other. If a simple frequency encoder can reduce the size of English text by 20-30%, then the amount of space used by the individual characters should be reduced by a similar amount. (Well, okay, the results would vary, because you are no longer compressing English text, but rather the stuff that didn't match a previous string. Words with rare letters would likely end up as individual characters more often than common words, resulting in a flatter probability distribution and less compression. I don't have exact figures.)

Attempting to adaptively compress all 65536 possible two-byte pointers would be insane (imagine how small the probability intervals would be!). In fact, trying to adaptively compress all 4096 buffer positions would be pretty crazy.

The solution used in LZARI is somewhere between brilliant and deranged. The free characters and the match lengths were encoded adaptively with a single frequency table, by using "character" 253+length to represent the match length (where length goes from 3 to F). Okumura used F=60 to get longer matches (remember, we no longer have to worry about things being an integral number of bits, so we don't need to stop at 18 characters). The position was encoded with a static-frequency model, which assumed that matches are most often found in the most recently seen data. The program comments point out that this is probably sub-optimal. But it worked.

So. When you want to output a lone character, you call EncodeChar() with a value between 0 and 255. When you want to output a pointer, you call EncodeChar() with a value between 256 and 256+F-THRESHOLD, and then EncodePosition() with a value between 0 and N-1. When decoding, you ask for the next character and check the value. If it's >= 256, then what follows is a pointer; otherwise you just output the value as a character.
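To make that concrete, here is a minimal C sketch of the decode loop this implies. DecodeChar() and DecodePosition() stand for the arithmetic decoder's two entry points, the ring-buffer handling follows the LZSS code from lesson 10, and the position is treated as an absolute buffer offset; the provided lzari.c may differ in details, so treat this as an illustration rather than a copy of it.

#include <stdio.h>

#define N         4096      /* ring buffer size, as in lesson 10 */
#define THRESHOLD 3         /* shortest match coded as a pointer */

extern int DecodeChar(void);      /* arithmetic decoder: 0..255, or 256 and up for a match length */
extern int DecodePosition(void);  /* arithmetic decoder: buffer offset, 0..N-1 */

void decode(FILE *out, unsigned long textsize)
{
    static unsigned char text_buf[N];
    unsigned long count = 0;
    int c, pos, len, i, r = 0;

    while (count < textsize) {
        c = DecodeChar();
        if (c < 256) {                       /* a free character */
            putc(c, out);
            text_buf[r++] = (unsigned char) c;
            r &= (N - 1);
            count++;
        } else {                             /* a pointer: length, then position */
            len = c - 256 + THRESHOLD;       /* "character" 253+length */
            pos = DecodePosition();
            for (i = 0; i < len; i++) {      /* copy len bytes out of the ring buffer */
                c = text_buf[(pos + i) & (N - 1)];
                putc(c, out);
                text_buf[r++] = (unsigned char) c;
                r &= (N - 1);
                count++;
            }
        }
    }
}

The encoder is the mirror image: it just makes the EncodeChar()/EncodePosition() calls described above.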
The neat part here is that the extra bit we had to use to tag characters and positions got swallowed up by the arithmetic encoding. If there are more pointers than free characters, the extra information needed to tell the difference will be less than one bit for pointers and more than one bit for characters. (This is the beauty of a good encoder.)

-=*=-

As we saw back in lesson 6, arithmetic encoding can be *very* slow. So, a guy named Haruyasu Yoshizaki changed the LZARI sources to use Huffman encoding instead. The result, which he called LZHUF but which is more commonly known as LZH, compressed almost as well as LZARI but in a reasonable amount of time.

The method used to encode characters and match lengths is identical, but uses adaptive Huffman. For the buffer offset, the upper 6 bits are encoded with a static Huffman tree. It's the same general idea, but it can be encoded and decoded with a simple table lookup. The lower 6 bits are sent without any encoding. The adaptive Huffman routine should look familiar; it was the basis of the C code presented back in lesson 5.

-=O=-

LZB and binary tricks

And now for something completely different. If you turn to page 222 in the Good Book (i.e. Bell/Cleary/Witten) you will see a description of something called "LZB". Basically, instead of Huffman codes, you use simple variable-width integers.

What's a variable-width integer? Glad you asked. The variable-width part just means that the value is expressed in a variable number of bits. Huffman codes fall into this category. Recall that Huffman codes have a "unique prefix" property, so that you can always tell where the code ends because it follows a unique path through a binary tree. What we need is a way to encode values such that we can tell where the end of the integer is without needing auxiliary data structures.

The simplest example is unary coding:

    value   binary coding
      1     01
      2     001
      3     0001
      4     00001
      5     000001
     16     00000000000000001
     64     [...]
    256     [...]

A bunch of zeros equal to the value, followed by a '1' stop bit. However, expressing an arbitrary value between 0 and 4095 would tend to be wasteful. So, we consider some other ideas. These come from Appendix A in Bell/Cleary/Witten, which also includes some other neat ideas. (Check out the start-step-stop codes.)

First off, consider binary coding with the leading zeros removed:

    value   binary coding
      1     1
      2     10
      3     11
      4     100
      5     101
     16     10000
     64     1000000
    256     100000000

Note that they all have leading '1's. Now we remove those:

    value   binary coding
      1     ---
      2     0
      3     1
      4     00
      5     01
     16     0000
     64     000000
    256     00000000

Problem is, we don't know how long these are. So, we encode the length with the unary coding we looked at first (colons added for clarity):

    value   binary coding
      1     1:
      2     01:0
      3     01:1
      4     001:00
      5     001:01
     16     00001:0000
     64     0000001:000000
    256     000000001:00000000

This method is actually floor(log i) zeros, followed by a 1, followed by the binary code without the leading 1. Since the first part is a unary count, there is one '0' for every bit in the second part. This means we can intersperse the data bits with the zeros, like this:

    value   binary coding
      1     1
      2     001
      3     011
      4     00001
      5     00011
     16     000000001
     64     0000000000001
    256     00000000000000001

This has the same length, but can be implemented with a simple shift routine (assembly: shift left; if the carry is set then exit, else shift the next bit into the integer we're forming).
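Here's a small C sketch of that last scheme, matching the final table above. The bit I/O is a toy in-memory buffer just to keep the example self-contained, and the names put_varint()/get_varint() are mine, not from any of the provided sources; get_varint() is exactly the shift-and-test loop just described.

#include <stdio.h>

static unsigned char buf[64];
static int wpos = 0, rpos = 0;          /* write/read positions, in bits */

static void put_bit(int b)
{
    if (b) buf[wpos >> 3] |= 0x80 >> (wpos & 7);
    wpos++;
}

static int get_bit(void)
{
    int b = (buf[rpos >> 3] >> (7 - (rpos & 7))) & 1;
    rpos++;
    return b;
}

/* Emit v (v >= 1): for each bit below the leading 1, most significant
   first, emit a '0' and then the data bit; finish with a '1' stop bit. */
void put_varint(unsigned v)
{
    int k = 0;
    while ((v >> (k + 1)) != 0) k++;    /* k = floor(log2 v) */
    while (k-- > 0) {
        put_bit(0);
        put_bit((v >> k) & 1);
    }
    put_bit(1);
}

/* The "shift left, exit on carry set" loop from the text. */
unsigned get_varint(void)
{
    unsigned v = 1;
    while (get_bit() == 0)
        v = (v << 1) | get_bit();
    return v;
}

int main(void)
{
    unsigned i;
    for (i = 1; i <= 16; i++) put_varint(i);
    for (i = 1; i <= 16; i++) printf("%u ", get_varint());
    putchar('\n');                      /* prints 1 2 3 ... 16 */
    return 0;
}

put_varint(5) produces 00011 and put_varint(16) produces 000000001, just as in the table.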
There are other variations, which do things like use the above codes to store the length, or have several stages of codes.

-=*=-

LZSS starts out with an empty buffer and gradually adds characters. If we've only got 16 characters in the buffer, then the match position must occur at offsets 0-15, so we only need 4 bits to store the value. The codes grow longer as the dictionary grows.

LZB uses this method to store the buffer offset. Clearly this only has an effect for the first N bytes of data (where N is the size of the ring buffer), but we can now choose N=8192 or even N=32768 and pay no penalty on files smaller than 8K or 32K.

To store the match length, LZB uses the last variable-width integer scheme examined in the previous section. Since there is no fixed size, there is no longer a need for a parameter F. The match can be as large as the buffer (or larger, if you want to play games with the overlap feature of the decoder).

The absence of F and the changing size of the buffer lead to an interesting problem: what is the value of "THRESHOLD", i.e. when is a match long enough to be worth coding as a pointer instead of a series of characters? Before you shout "three", keep in mind that an LZSS pointer for a two-byte match is 16 bits + 1 tag bit, whereas two characters would take 16 bits + 2 tag bits... we're actually wasting one bit each time we ignore a two-byte match in LZSS. (It does make things go faster, though.)

Since this method still uses the tag bit from LZSS to tell pointers and characters apart, the minimum size of a pointer is 2 + k bits, where k is ceiling(log n). Back up a step. We need 1 bit to tell that it's a pointer. The minimum match length is 1, and we represent that with 1 bit. "n" is the number of characters currently in the buffer; if there are 7 characters in the buffer, log(n) is 2.x, so we let k=3. Thus, the minimum size of a pointer is 2+k bits.

Bell/Cleary/Witten found that the best value for p (the minimum useful match length) is around floor((3+k)/8)+1. So if n <= 4096, the value would be p=2, meaning that matches of length 2 or more should be encoded as a pointer. (Consider: 12 bits of position + 5 bits for the length + 1 bit for the tag = 18 bits, and two characters + 2 tag bits = 18 bits. You start losing ground when matches reach 10 characters (8+p), but for English most matches are short anyway.) There isn't a definite "best value" for p, but it can be stored in a table with log(N) entries and adjusted based on experimentation.

As with LZSS, the threshold is subtracted from the match length, so that the lengths start at zero. Whoops, can't do that: we have no way of expressing zero with our number scheme (how very Roman). Instead, we start at 1, so lengths are stored as (length - p + 1).

-=*=-

The neat part about LZB is that it gets compression roughly equal to LZH and LZARI, but you don't have to have all the icky Huffman and arithmetic encoding junk around. In fact, the results in the back of Bell/Cleary/Witten indicate that LZB is the best performer of the LZ77 variants.

In terms of encoding speed, LZB will be considerably slower than LZSS, because it has to do a lot of bit shifting. Also, it will be comparing potentially much longer strings, and probably more of them if we increase N as well. On the other hand, it doesn't have to do all of the fancy adaptive stuff that LZH and LZARI do, so it will likely be faster than those. For decoding, LZB will be doing the same things LZSS does, but will have to do a lot of bit-shifting as well. The reduction in speed will certainly be worth the improvement in compression. Of course, LZB should decode considerably faster than either LZH or LZARI.
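Before moving on to the practical details, here is a rough sketch of how an LZB encoder might emit its two kinds of tokens, reusing put_bit() and put_varint() from the earlier sketch. The function and variable names are illustrative, not taken from Bell/Cleary/Witten or any particular implementation.

/* 'n' is how many characters have entered the ring buffer so far, so a
   position fits in k = ceiling(log2 n) bits; p is the minimum match
   length worth coding as a pointer, per the formula above. */

static int pos_bits(unsigned n)          /* bits needed for offsets 0..n-1 */
{
    int k = 1;
    while ((1u << k) < n) k++;
    return k;
}

static void put_literal(unsigned char c)
{
    int i;
    put_bit(0);                          /* tag: free character */
    for (i = 7; i >= 0; i--)             /* raw 8-bit character */
        put_bit((c >> i) & 1);
}

static void put_pointer(unsigned pos, unsigned len, unsigned n)
{
    int i, k = pos_bits(n);
    int p = (3 + k) / 8 + 1;             /* minimum useful match length */

    put_bit(1);                          /* tag: pointer */
    for (i = k - 1; i >= 0; i--)         /* position in k bits */
        put_bit((pos >> i) & 1);
    put_varint(len - p + 1);             /* length, biased so length p maps to 1 */
}

The decoder mirrors this: read the tag bit, then either 8 raw bits or k position bits followed by get_varint(), with both sides tracking n as characters arrive.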
From a practical standpoint, it should be obvious that reserving part of the ring buffer as a lookahead buffer is no longer practical... the size of the lookahead buffer was equal to F. Instead, we have to use a separate buffer to hold the data as it is being read, or just ungetch() the first character which doesn't match.

It's interesting that so few people use this. Probably because they don't see how it could be better than LZH. After all, it isn't using "optimal" coding on the match lengths, it still has the extra tag bits, and it makes no attempt to compress the free characters.

-=O=-

PROGRAM(S)

There is no assembly code. As far as I know, there are no (working) assembly-language implementations of LZH, LZB, LZARI, or even plain arithmetic encoding for the IIgs. The reason is that they are a lot of work, especially if you want to make them fast.

I have provided C code with minimal changes made for LZH and LZARI; I don't know of any PD sources for LZB. I believe LZB is used in the MS-DOS archiver "ARJ", but the author only provides source for the decompressor (with the explicit limitation that the code cannot be used in another archiver program... whatever happened to the hacker ethic?)

The original LZH code had some implementation-specific code in it (it only worked on 16-bit architectures); I fixed it up a bit. Other than that, the code is pretty much unchanged.

It's all rather straightforward, especially since you've already seen all of the pieces before. This stuff just puts it all together.

NOTE: these programs can be a bit slow. Compressing MoriaGS with LZH (from 577K to 227K!) took 43 minutes on an accelerated IIgs. It only took 5 minutes to uncompress.

-=O=-

DEEP THOUGHTS

- LZARI uses arithmetic encoding, which should do better than a Huffman encoder and much better than simple unary encoding. This does not appear to be the case here; the differences are slight, possibly in LZB's favor. Why?

- How could LZB's variable-width integer scheme be applied to RLE? To MTF?

-=!=-
This document is Copyright by GEnie and Andy McFadden