
Hacking Data Compression  Lesson 3  Somewhat Clever Ad-Hoc Schemes
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie


This week's topic covers three schemes which were designed to do a
reasonable job of compressing English text.  I classify them as "ad
hoc" because they were created specifically to take advantage of the
properties that English text has, and will do poorly on almost
anything else.

First up is a quick description of an algorithm that MacWrite used,
then comes MTF (Move To Front) presented with an introduction to
dictionary models, and finally digram coding.  The latter is the most
interesting from a theoretical viewpoint because the discussion of it
leads into the ideas behind the most powerful compression schemes yet
devised.

But I'm getting ahead of myself.  The only one I've implemented is a
simple digram encoder; the other two didn't seem worth the effort.

-=O=-     MacWrite

Let's get this out of the way: Apple, Macintosh, and MacWrite are
trademarks of Apple Computer, Inc.  I don't know if MacWrite is or
isn't, but I feel better (and safer) now that I've said it.

Back in '84, the Macintosh was released to the world with 128K of
memory and a single 400K disk drive built in.  The designers of the
original Mac word processor decided that 400K wasn't much room for a
whole pile of data files, so they decided to build a simple form of
text compression into MacWrite.

The basic idea is to take the 15 most common characters in the English
language, and store them in 4 bits (say, values 0-14).  The 16th value
(15) was used as an "escape" character, indicating that the next 8
bits represented a single character.  This means that there are 15
characters which will be stored in 4 bits, and (256-15 =) 241
characters which will be stored in 12 bits (4 bit esc + 8 bit char).

(Aside: the use of an explicit escape character is fairly common in
data compression schemes which focus on a particular type of data.
This allows the author to concentrate on the common cases with the
tradeoff that some nasty things may happen on the rare ones.  An
example of this appeared last week in the differential encoding scheme
- you can make the lossy sound compression schemes lossless by using
an escape character - and the same idea can be thought of as the
foundation of RLE.)

Clearly, the choice of those 15 characters is crucial to the success
of this scheme.  For English text, the traditional wisdom has been
that the letters in "ETAOIN SHRDLU" are the most frequent, in that
order.  What's nice about it is that it can be pronounced, serving as
its own mnemonic device.  More recent studies have shown that "ETAONI
RSHDLC" is probably closer to American English.  The characters used
by MacWrite are " etnroaisdlhcfp" (the space is significant here, and
in fact is the most common *character* in English text).

How well does this do?  Well, if the set of 15 characters accounts for
more than 50% of the text, the scheme will win (one frequent character
plus one infrequent character requires two bytes, which is exactly how
much space it would require before packing).  It turns out that they
account for well over 50%, and this scheme typically reduces text by
20%.
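
Here is a rough sketch of the packing loop in C.  This is NOT the
actual MacWrite code - just an illustration of the 4-bit idea, using
the character list shown above and a little put_nibble() helper (my
name, not Apple's) that pairs nibbles up into bytes.  A real
compressor would also have to record the original length somewhere,
so that a pad nibble at the end of the file isn't mistaken for data.

    #include <stdio.h>
    #include <string.h>

    /* the 15 most common characters, as used by MacWrite */
    static const char common[16] = " etnroaisdlhcfp";

    static int nib_buf, nib_count;      /* half-finished output byte */

    /* collect nibbles and write them out a byte at a time */
    static void put_nibble(int val, FILE *outfp)
    {
        if (nib_count == 0) {
            nib_buf = val << 4;         /* hold onto it for now */
            nib_count = 1;
        } else {
            putc(nib_buf | val, outfp); /* got two, ship the byte */
            nib_count = 0;
        }
    }

    /* pack one character: 4 bits if it's common, a 4-bit escape plus
     * all 8 bits of the character if it isn't */
    static void pack_char(int ch, FILE *outfp)
    {
        const char *p = memchr(common, ch, 15);

        if (p != NULL) {
            put_nibble((int)(p - common), outfp);   /* values 0-14 */
        } else {
            put_nibble(15, outfp);                  /* escape */
            put_nibble((ch >> 4) & 0x0f, outfp);
            put_nibble(ch & 0x0f, outfp);
        }
    }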

-=*=-

It's interesting to compare this scheme with the 7-bit packing from
lesson 1.  Recall that the packing method reduced text by exactly
12.5%, but was incapable of coping with 8-bit data.  Here we have a
scheme which does much better, and isn't much more difficult to
implement.  However, it isn't consistent: a file full of the letter
'e' will be maximally compressed (50% of original size), but a file
full of the letter 'q' will be expanded to 150% of the original size.

This scheme can be improved by turning the compression off for parts
of the text which don't compress well.  One approach is to do what
PackBytes does and chop the file into fixed-size chunks, each of which
is compressed individually.  Another approach is to insert a different
kind of escape code into the data (i.e. different from the binary
"1111" used to indicate that an 8-bit value follows), perhaps by
reserving one of the 8-bit literal values.  When that value is
encountered, the compression status (on or off) is toggled.  This
would cause dramatic expansion if a run of the reserved value appeared
in the input (do you see why?), but the problem could be avoided by
using the escape character rotation scheme discussed in the RLE
chapter of lesson 2.

Another way to look at this algorithm is as a dramatically simplified
static Huffman encoder.  The advantages and flaws are identical.  (If
you're not familiar with Huffman, fret not: it's introduced next
week.)

-=O=-     Dictionary Schemes

Before I present MTF, a brief description of a "dictionary" model is
in order.  Dictionary-based schemes work by replacing a series of
characters (say, an English word) with an index into a list of words.
If your dictionary contains "and", "the", "that", "which", and so on,
then occurrences of them would be replaced by the numbers 0, 1, 2, and
3, respectively.

If you replaced all occurrences of " the " with a single 8-bit value,
you would get a small but non-negligible reduction in size.  If you
increase the size of your dictionary, adding the 256 most common words
in the English language, you will do even better.  One universal
property of dictionary schemes is:

    bigger dictionary == better compression
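
To make this concrete, here is a tiny sketch (not taken from any real
compressor) that replaces a handful of words with one-byte codes.  It
assumes 7-bit ASCII input, so a byte with the high bit set can only
be a dictionary code, and it ignores the question of what happens to
the whitespace between words.

    #include <stdio.h>
    #include <string.h>

    static const char *dict[] = { "and", "the", "that", "which" };
    #define DICT_SIZE  (sizeof(dict) / sizeof(dict[0]))

    /* output one word, either as a dictionary code or as-is */
    static void put_word(const char *word, FILE *outfp)
    {
        int i;

        for (i = 0; i < DICT_SIZE; i++) {
            if (strcmp(word, dict[i]) == 0) {
                putc(0x80 | i, outfp);  /* found: one-byte code */
                return;
            }
        }
        fputs(word, outfp);             /* not found: copy it through */
    }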

Returning to the lessons from the first class, note that the
dictionary represents the MODEL used for compression.  It says nothing
about how the dictionary indices are ENCODED in the output.  Some
algorithms (like LZW and LZSS) simply store the value, while others
(like LZH) will use a more sophisticated encoder to get better
results.

-=*=-

There are many different ways to implement dictionary models.  Static
dictionaries (such as one with common English words) are easy to
program but not very effective.  Spanish, for example, has a slightly
different "feel" to it, even though the sizes of words and
distributions of characters are fairly similar to English; a static
dictionary designed for English will still die horribly on _Don
Quixote_.

Semi-adaptive dictionaries are built up by scanning the input, and are
sent along with the compressed data.  These will fare better than
static dictionaries, because they use the most common words found in
the input.  Unfortunately, the cost of sending the dictionary grows as
the size of the dictionary grows.  Since better compression is
achieved with larger dictionaries, there's a break-even point where a
large dictionary will begin to make the output larger.

All of this can be avoided with an adaptive dictionary model, which
builds the dictionary from the input as it works.  The first time it
sees the word "hoser", it sends it through uncompressed, and adds it
to the dictionary.  The next time, it sends a dictionary offset, just
as a static or semi-adaptive scheme would.  The uncompression routine
just does the operation in reverse; when it sees "hoser" in the
compressed data, it adds the word to its dictionary, and can recognize
the dictionary offset when it appears from then on.

All of the algorithms in the LZ family (LZW, LZSS, LZH, etc) use
adaptive dictionary models.  What makes them remarkable compared to
other algorithms is that not only do they compress well, they are also
extremely fast.

-=O=-     MTF (Move To Front)

MTF compression is interesting because it's a very intuitive way to
compress data.  It uses a simple adaptive dictionary approach on whole
words, where a word is defined as everything between spaces or
newlines.  Note that it doesn't operate on the whitespace itself, and
therefore lends itself to a lossy implementation (where " " and "  "
would be indistinguishable in the output).  This can be worked around,
though we won't deal with it here.

The algorithm looks like this:

    for every word in the input
        if the word is one we have seen before, then
            output its position in the list
            move it to the front of the list
        else
            output the word itself, uncompressed
            add it to the list of words, at the front

This gradually builds up a list of words it has seen.  The most recent
word seen is always moved to the head of the list (an MRU - Most
Recently Used - ordering), so that commonly used words will tend to
cluster near the front.  By encoding small dictionary indices in fewer
bits than larger indices, we get better compression than we would if
we had simply substituted an integer for a word.

That's all there is to it.  In theory at least.
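
If you want to see it in code anyway, here is a sketch of the encoder
side with all of the messy parts (discussed below) swept under the
rug: it uses a fixed-size array and a linear search instead of a
linked list, calls strdup() for every new word, and writes text like
"#42" for dictionary positions so you can see what it's doing,
instead of encoding the indices compactly.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAX_WORDS   4096

    static char *word_list[MAX_WORDS];  /* most recent word is at [0] */
    static int word_count;

    /* encode one word with a move-to-front word list */
    static void mtf_word(const char *word, FILE *outfp)
    {
        int i;

        for (i = 0; i < word_count; i++) {
            if (strcmp(word_list[i], word) == 0)
                break;
        }

        if (i < word_count) {
            char *found = word_list[i];

            fprintf(outfp, "#%d ", i);  /* seen before: send position */
            memmove(&word_list[1], &word_list[0], i * sizeof(char *));
            word_list[0] = found;       /* and move it to the front */
        } else {
            fprintf(outfp, "%s ", word);    /* new word: send it as-is */
            if (word_count == MAX_WORDS)    /* full: drop the oldest */
                free(word_list[--word_count]);
            memmove(&word_list[1], &word_list[0],
                    word_count * sizeof(char *));
            word_list[0] = strdup(word);    /* add it at the front */
            word_count++;
        }
    }

The decoder keeps an identical list and does exactly the same
shuffling, so the positions always agree.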

-=*=-

MTF as presented has one fatal flaw: the dictionary grows without
bound.  If the input has more words than can fit in memory, we're in
trouble.

An easy solution is to cut off the list at an arbitrary value, say
4096 entries.  Because we're moving the most recently used word to the
front, the entries which get dropped will be rare, so transmitting
them again will not be a major burden.  This is not a complete
solution - if each word is 8000 characters long, we may still run out
of memory - but you get the idea.

One interesting property is that, at any time, the dictionary will
contain the 4096 most *recently* seen words, NOT the 4096 most
*frequently* seen words.  If I go insane and write "All work and no
play makes Jack a dull boy" for several hundred pages, those 10 words
will still top the list of most frequently used words even after they
are followed by a couple hundred pages of newspaper clippings.  However,
MTF will have forgotten all about them 4096 words later unless they
appear again.

Having the adaptation weigh recently seen data more heavily than older
data can be a big win sometimes.  For example, I've used "4096"
repeatedly in the last few paragraphs, but didn't until this section
and won't again.  The danger is that, if adaptation moves too quickly,
the model will spend all of its time adapting to the changes and never
"lock on" to what is occurring frequently and what isn't.

-=*=-

A few notes about MTF.  I didn't implement it because it gets kind of
"messy".  First, there's the dictionary.  You want a data structure
which can be inserted into at the head and deleted from in the middle,
like a doubly linked list.  Second, unless you want to impose a
minimum word length (good) and a maximum word length (bad), you have
to deal with variable-sized words.  Doing this efficiently (i.e. not
calling malloc() every time) can become a chore - especially if memory
begins to fragment.

Third, you need a way to efficiently encode integers of various
magnitudes.  The appendix to _Text Compression_ contains a list of
good schemes.  Since this is an encoding problem, it is also possible
to use one of the generic encoders (Huffman or arithmetic encoding),
and in fact MTF does surprisingly well when combined with one:
Bell/Cleary/Witten found that it did significantly better than LZW and
LZSS on English text.  Where this part gets messy is in figuring out,
during uncompression, which items are uncompressed words and which are
dictionary offsets.
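
As an example of the kind of thing that appendix covers, here is one
simple variable-length code, usually called an Elias gamma code: send
the length of the value in unary, then the value itself.  Small
numbers come out short and big numbers come out long, which is
exactly what MTF wants.  The put_bit() routine is assumed to exist
somewhere (it just packs bits into bytes), and since gamma codes
start at 1 you would encode the dictionary position plus one.

    #include <stdio.h>

    extern void put_bit(int bit, FILE *outfp);  /* assumed bit packer */

    /* Elias gamma code for val >= 1: N zero bits, then the value
     * itself in N+1 bits, high bit first.
     *   1 -> 1,  2 -> 010,  3 -> 011,  4 -> 00100,  5 -> 00101, ... */
    static void put_gamma(unsigned val, FILE *outfp)
    {
        int nbits, i;

        for (nbits = 0; (val >> nbits) > 1; nbits++)
            ;                           /* bits beyond the leading 1 */

        for (i = 0; i < nbits; i++)     /* the length, in unary */
            put_bit(0, outfp);
        for (i = nbits; i >= 0; i--)    /* the value, high bit first */
            put_bit((val >> i) & 1, outfp);
    }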

-=O=-     Order-N models

Most algorithms which model natural languages like English try to
predict the next character based on the input seen so far.  Simple
schemes, like the typical Huffman compressor, compress each character
according to its frequency.  This works well enough, but you can do
better: as I pointed out in lesson 1, you will get more accurate
probabilities if you take into account that 'q' is almost always
followed by 'u', so 'u' should have a higher probability than 'e' if
the previous character was a 'q'.

The simple Huffman encoder is an example of an "order-0" model.  It
has a set of probabilities which do not take into account any of the
characters immediately preceding the current one, and do not reflect
the order in which the characters appeared.  This means that "the cat
in the hat" will be compressed in exactly the same way as
"aate hte ctn  htih", even though the former is English while the
latter is gibberish.  The problem is that the order-0 model is too
general, and is missing out on characteristics specific to English
text.

Using the previous character to determine probabilities is known as an
"order-1" model, because it looks one character back.  Instead of a
single set of probabilities, there are 256 sets, one for every
possible previous character.  The set of probabilities is larger and
more difficult to maintain, but is considerably more accurate.
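
Gathering the statistics for an order-1 model is the easy part; it's
the encoding that gets complicated.  A sketch of the counting:

    #include <stdio.h>

    static unsigned long counts[256][256];  /* counts[prev][cur] */

    /* tally how often each character follows each other character */
    static void count_pairs(FILE *infp)
    {
        int prev, cur;

        prev = 0;                       /* arbitrary starting context */
        while ((cur = getc(infp)) != EOF) {
            counts[prev][cur]++;
            prev = cur;
        }
    }

Divide each row by its total and you have the 256 sets of
probabilities; the row for 'q' will have nearly all of its weight on
'u'.  Note that the table is already a quarter megabyte of counters
(with four-byte longs), which hints at why higher-order models get
expensive.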

How much better does it get?  I don't actually have a quantitative
measure.  However, if you work it backward, you will get the most
frequent character pairs instead of the most frequent characters.
Since there are some combinations which never occur (find me an "xq"
in Shakespeare), the resemblance to English will increase.

Does the modeling get better as the order gets larger?  Yes.  You
begin to get fragments of words, whole words, and parts of sentences.
After a point, however, the models become large and unwieldy, and are
unusable in practice.  Also, after a point you end up with sequences
which are unlikely to repeat... for example, there are no
100-character sequences in this file which are repeated.

You can also have an order-(-1) model, in which all probabilities are
equal (e.g. 1/256 for every element in the set of 8-bit characters).

Now take a big step back.  All of the above applies to arbitrary
chunks of data.  You can have order-N models for words, sentences,
8-byte blocks, or anything else.  The order just tells you how far
back it looks in terms of the unit data element.  It also works just
fine for input other than English text.

There are some examples of running English order-N models backward in
_Text Compression_, pages 80-81 (by character) and 84 (by word).  Some
are kind of funny, like 'She stammered, not bodily into water at then
kissed here and in color; bright red, with local assessing units.'

-=O=-     A simple digram dictionary model

A "digram" is a pair of symbols.  For this discussion, we will use it
to mean a pair of characters in a text file.

A digram encoder is an order-1 model because it considers two
characters at a time, the current one and the previous one.  A
"trigram" encoder would be order-2, "tetragram" would be order-3, and
so on.

The simplest use of digrams is a dictionary scheme which replaces
pairs of characters with a single-byte dictionary offset.  It takes
the input two characters at a time, and looks the value up in its
dictionary.  If it finds it, it outputs the dictionary offset.  If it
doesn't, it outputs the character.

The sample program for this week does exactly this.  Details are in
the "PROGRAM" section.

-=*=-

When we did the 7-bit ASCII packer (the one that packed it down to
exactly 7 bits), we had to abort if we encountered a character with
the high bit set, because we had no way of dealing with it.  We can
view this scheme as an order-(-1) model, in which the first 128
characters have an equal probability (1/128), and the last 128 have a
probability of zero.

This illustrates what is known as the "zero frequency problem."  To
make it possible to encode a character, we must assign it a non-zero
probability, no matter how unlikely it may be.  If we're just dealing
with 256 probabilities (one for each character), it isn't a big deal.

Now suppose we have an order-10 model.  If we are modeling English,
the string " etaoinshrd" is unlikely to appear.  However, it must be
given a non-zero probability or we will be unable to encode it.  In
fact, we must assign and maintain probabilities for ALL 11-character
strings.

This is prohibitively expensive.  Even if we only allow the 96 or so
printable characters, that's 96^11 (roughly 6 x 10^21) probabilities
to keep track of.

You might be tempted to suggest that we should just add the new
entries as we go along.  Well, guess what, we can, but it's not as
easy as you might think.  Keep in mind that this is NOT a dictionary
scheme; the code output for each character depends on the probability
of that character occurring, which in turn depends on what the last 10
characters were.  We're encoding this stuff one character at a time,
not 11 characters at a time.

-=*=-

The solution is called "blending."  There are all sorts of ways to go
about it (see chapter 6 in _Text Compression_), but the basic idea is
to combine an order-(-1) model with order-0, order-1, order-2, ...,
order-9, and order-10 models.

Suppose a given sequence of 10 characters has never been seen before,
so the order-10 model has no way of guessing what the next character
is.  All characters have probability 0.  If the order-9 model HAS seen
it, then it will have a nonzero probability.  If we combine all of the
probabilities together (including the order-(-1) model, which gives
equal probabilities to all characters), we are guaranteed to have a
nonzero probability.
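
In code, the simplest possible blend is just a weighted sum of the
models' estimates.  This is illustrative only - the schemes in
chapter 6 of _Text Compression_ are considerably smarter about
choosing the weights:

    /* blend the predictions for a single character.  prob[0] is the
     * order-(-1) model's estimate, prob[1] is order-0, and so on up
     * to prob[max_order+1]; weight[] holds the blending weights,
     * which sum to 1. */
    static double blended_prob(const double *prob, const double *weight,
                               int max_order)
    {
        double total = 0.0;
        int i;

        for (i = 0; i <= max_order + 1; i++)
            total += weight[i] * prob[i];

        /* the order-(-1) estimate is 1/256 for every character, so as
         * long as its weight is nonzero the blend is nonzero */
        return total;
    }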

At first this appears to make the order-10 model less efficient,
because it gets "shouted down" by the other models.  This can be
sidestepped by weighting the probabilities from the higher-order
models more heavily, reducing the effect that the lower-order models
have.  If the probability for a model is zero, then the model has no
effect and the lower-order models control the outcome.

Another way to solve the problem is by "escaping" down to a lower
model when a series of characters is seen for the first time.  This is
done by transmitting an explicit escape code to the decoder.  Note
that the escape code must have a nonzero probability or it will be
impossible to transmit!  Thus we need to be able to estimate the
probability that we will be clueless as to what the next character
will be.

This type of algorithm - which is really just a digram model on
steroids - is known as "context modeling", because the model depends
on the current "context" (i.e. the last few characters).  There are
related ideas called "state modeling" or "Markov modeling".  We won't
be going into any of these in detail because they use LOTS of memory
and are VERY slow.

-=*=-

The static dictionary digram compressor in today's sample code can be
expressed as an order-1 context model with a static set of equal
probabilities.  The characters not in the dictionary have a
probability of zero, so we escape to the order-(-1) model and output
them that way.  The escape code is in the high bit.

It turns out that ALL dictionary encoders, including fancy ones like
LZW and LZSS, can be expressed by an equivalent predictive model.  A
mathematical algorithm exists which will do the conversion.

-=O=-     PROGRAM

This week's sample program has a static dictionary of 128 character
pairs.  It reads the input, and if it sees a pair of characters that
appear in the dictionary, it outputs the dictionary offset with the hi
bit set.  If not, it outputs the character with the hi bit clear.
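
The heart of the encoder looks something like this.  It's a
paraphrase rather than a copy of the code in the archive, and it
assumes the dictionary is laid out as 128 two-character entries:

    #include <stdio.h>

    extern char dict[128][2];   /* 128 digram entries; layout assumed */

    /* digram-encode 7-bit ASCII: known pairs become one byte with the
     * hi bit set, everything else passes through with it clear */
    static void encode(FILE *infp, FILE *outfp)
    {
        int c1, c2, i;

        c1 = getc(infp);
        while (c1 != EOF) {
            c2 = getc(infp);

            for (i = 0; i < 128; i++) {     /* linear search */
                if (c2 != EOF && dict[i][0] == c1 && dict[i][1] == c2)
                    break;
            }

            if (c2 != EOF && i < 128) {
                putc(0x80 | i, outfp);  /* found: dictionary offset */
                c1 = getc(infp);        /* both characters consumed */
            } else {
                putc(c1, outfp);        /* not found: plain character */
                c1 = c2;                /* slide ahead one character */
            }
        }
    }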

You may be wondering, where does the dictionary come from?  Then
again, you might not.

So again, it only accepts 7-bit ASCII data.  It could be made to
accept 8-bit data, but then every character in the output file would
have to be 9 bits, and that tends to obfuscate a relatively simple
algorithm.  It's also instructive to do a hex/ascii dump on the
encoded output and see what's left - notably all the upper case
letters (why?).  If the output codes are always 8 bits, it's much
easier to interpret.

When I say the programs are simple, I mean simple.  I use a linear
search through the 128 entry dictionary, and I use arrays instead of
pointers in the C version, so the encoder is a tad slow.  The decoder
is as fast as anything else we've seen so far because it either
outputs the character it gets (if hi bit is clear) or outputs the two
characters in the dictionary at the index specified by the character
(if hi bit is set).
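
For comparison, the whole decoder boils down to a few lines (again
assuming the same 128 x 2 dictionary layout):

    #include <stdio.h>

    extern char dict[128][2];   /* same dictionary as the encoder */

    /* expand digram-encoded data */
    static void decode(FILE *infp, FILE *outfp)
    {
        int c;

        while ((c = getc(infp)) != EOF) {
            if (c & 0x80) {             /* hi bit set: a pair */
                putc(dict[c & 0x7f][0], outfp);
                putc(dict[c & 0x7f][1], outfp);
            } else {
                putc(c, outfp);         /* hi bit clear: as-is */
            }
        }
    }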

Oh yeah, the dictionary.  The C version has a separate include file
called "freq.h", and the assembly version just has a bunch of DC
constants in the data segment up front.  The two are identical.  Both
were generated by "scanpair.c", which takes as an argument a file to
scan, and produces on stdout (read, "it dumps on your screen") a
ready-to-go "freq.h" with values for the input.

There is no assembly source code equivalent for it.  However, there
are two different versions of the C code - one generates output for
"freq.h", the other generates it for an assembly DATA segment.
They're identical except for the output formatting routine.  Both
executables are included.  To use them, do something like this:

     scanpair cmain.c > myfreq.h            for C stuff
     ascanpair main.asm > mydata.asm        for asm stuff

The output of scanpair is more enlightening than the output of
ascanpair, so even if you don't plan to use the C code, be sure to
look at C/freq.h.  The freq.h I've included was generated from a large
blob of English text, and should be reasonably representative of
English (compare it to the results on page 79 of _Text Compression_).

Compressing the source file from which the statistics were gathered
gave me about 37% compression.  Since the maximum possible is 50% (all
input characters reduced to digrams), I'd say it worked fairly well.
Compressing the assembly or C source files gives about 27%
compression, because they have a different distribution of characters
than English does.

Some ideas for improvements are in the next section.  Implementing
them is left as an exercise for the reader.

Administrative note: nothing changed in main.asm, so I omitted it from
the lesson.  Instead, I included MAIN.ROOT and MAIN.A, which are
smaller.  (This was done to reduce the size of the archive, which is
going to be bloated with the two C programs I included...)

-=O=-     DEEP THOUGHTS

- Why is a minimum word size useful for MTF?  Why is a maximum size
bad?  Is it bad?  What role does it play when compressing binary data?

- How can whitespace be handled correctly for MTF compression?

- I compared the MacWrite character substitution to Huffman.  However,
it also looks a lot like a dictionary scheme.  Which is it?

- What would you get if you ran each of these backward?  Can you see
why MTF does so well when compressing English?

- Computing the frequency tables ahead of time is a pain, because the
values are compiled in to the compressor.  There are two alternatives
to static dictionaries, adaptive and semi-adaptive.  Semi-adaptive
would compute the table right before compressing, and put it in the
file before any compressed data (resulting in a 256 byte header).
Adaptive starts with an empty dictionary and fills it in as it goes
along.  What would be an easy way to implement an adaptive digram
encoder?  (HINT: it's in this lesson!)

- Was the freq.h file included in this lesson generated from Apple II
text files?

- Why doesn't MTF suffer from the zero-frequency problem?  Or does it?

-=!=-

This document is Copyright by GEnie and Andy McFadden