Hacking Data Compression
Lesson 2: Basics
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie

This lesson covers some really basic stuff which everybody knows how to do. Why bother with it? For one thing, to be complete. For another, not everybody has copies of RLE code lying around, and sample code is worth its weight in silicon.

Two basic ideas will be presented here: RLE and differential coding. We also take a brief detour into the world of lossy compression. If you already know everything there is to know about this stuff, go off and twiddle your thumbs for another week, when we'll pick up with some more interesting (but still conceptually simple) algorithms.

-=O=-

RLE

RLE stands for "Run Length Encoding". It works by taking a series of identical bytes (a "run") and replacing them with a code which specifies what the character was and how many of them there were (the "length").

The usual way to form codes is with a distinguished "escape" character. Escape characters don't have anything to do with the Esc key (ASCII 27); that's just what they're called. Think about how hitting the Esc key in most programs makes the program stop its current activities and do something else. The usage here is similar.

When a run of characters is encountered, say:

    lda   #$1234        ;load the secret number

an RLE encoder would compress the "run" of spaces between the "1234" and the ";" by outputting the escape character, then a space, and then the number of spaces. If your escape character were ']', your output would look like:

    lda   #$1234] 8;load the secret number
                ^^^
                ||+--- count == 8 spaces
                |+---- character is a space
                +----- escape character

Decoding is just the opposite of encoding. The input data is copied to the output until an escape character is hit. At that point the decoder reads the character and the count, outputs that many characters, and then resumes copying data. That's basically it.

However, there is one slight complication: what happens if the escape character appears in the input? If we just copied it to the output, the decoder would see it as the start of a run of characters, and would output garbage. The solution is to output it as a one-byte run of characters, so you'd output an escape character (indicating the start of a run), another escape character (indicating the character to use), and then a one (the number of characters). Note that we can still encode runs of escape characters in the usual way.

Since we use 3 bytes to output a run, we only lose ground if we encounter a solo escape character or two in a row. If we encounter three escape characters in a row, we would output "]]3", which is exactly the same size. The worst case for RLE is a file full of solo escape characters, something like "]a]a]a]a]a]a]a": every ']' costs two extra bytes, so the file can grow to as much as double its original size.

From this we can see that the choice of escape character is fairly important (the value $db - high-ASCII '[' - is a popular choice). It should be the least frequently seen character in the input.

We can make it less important by CHANGING it every time. For example, we could add some number relatively prime to 256 (say 51) to it after every run. That way we only repeat escape characters once every 256 runs, so a burst of '$db's won't screw us up. So long as both the encoder and the decoder adjust the escape value after each run, there's no chance of confusion.

Note that this does make the data more susceptible to transmission errors. Text compressed with plain RLE is still more or less readable, but if some error (say, modem line noise) caused the decoder to become out of sync with the encoder, it would look for the wrong escape character and would start spitting out garbage periodically. (NOTE: I don't know if 51 is a good choice or not. Seems nice enough.)
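To make the decoding side concrete, here is a minimal stream-based decoder for the scheme just described, written in C. It's only a sketch, and it bakes in a few assumptions from the discussion above: the escape value starts at $db, the count byte is the literal run length, and the escape value is bumped by 51 after every run (drop that line if you want a fixed escape character).

#include <stdio.h>

#define ESCAPE_INIT  0xdb    /* high-ASCII '[', a popular choice */
#define ESCAPE_BUMP  51      /* relatively prime to 256 */

/* Decode escape-character RLE from 'in' to 'out'.
 * A run is stored as: escape byte, character, count.
 */
void rle_decode(FILE *in, FILE *out)
{
    int c, ch, count;
    int escape = ESCAPE_INIT;

    while ((c = getc(in)) != EOF) {
        if (c != escape) {
            putc(c, out);                 /* ordinary character, just copy it */
            continue;
        }
        ch = getc(in);                    /* character to repeat */
        count = getc(in);                 /* how many of them */
        if (ch == EOF || count == EOF)
            break;                        /* truncated input */
        while (count-- > 0)
            putc(ch, out);
        escape = (escape + ESCAPE_BUMP) & 0xff;   /* rotate the escape value */
    }
}

The matching encoder has to make exactly the same adjustment after each run it emits, or the two will wander out of sync.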
-=*=-

Note that it isn't necessary to do the encoding exactly as described. Some routines reverse the order of the character and the escape code; it may make the encoder or decoder simpler in certain situations.

It's also possible to avoid using an escape character altogether by assuming that two identical characters will be followed by a length byte. For example, if the input were:

    abcDDDDefGGhi

the output would be:

    abcDD2efGG0hi
         ^    ^
         |    +--- zero Gs follow
         +-------- two more Ds follow

indicating that there are two more 'D's, for a total of four. If there are only two characters, as with the 'G's, an extra 0 must be added.

So the worst case for this scheme is a series of double characters (like "aabbccdd"), where each pair must be encoded in three bytes ("aa0bb0cc0dd0"). This gives a maximum increase of +50% - actually a friendlier worst case than the escape scheme's - but we aren't necessarily gaining anything: if the input text contains more occurrences of double characters than occurrences of the escape character, we will do worse.

-=*=-

A few efficiency notes... Since we represent runs with three bytes, there is no reason to represent runs of two or three bytes specially (in fact, specifying a run of two bytes would be a loss). In terms of decompression speed, the overhead involved in outputting a run of three characters is higher than that for just copying three characters to the output, so RLE encoders should only encode runs of four or more.

Whenever we decode a run, we can assume that there will be at least one copy of that character in the output (why else would we be doing it?). Thus the sequence "escape|char|0" has no meaning; we just told the decompressor that there was a run of 0 bytes, so please get on the ball and output all zero of those bytes right this very minute. We ought to change the meaning slightly, so that the length byte is one LESS than the length of the run. Now a length of zero represents one byte, and 255 represents a run of 256. In fact, if it weren't for the special handling needed for the escape character, we could treat the length byte as length-4.

(And if you want to get really confusing, you could handle the length byte for runs of escape characters specially. But the extra overhead involved is probably not worth the effort.)

-=*=-

It's a simple-minded scheme and, not surprisingly, the results aren't very good for most files. It does well when the data being compressed has large sections of the same character, such as large sparse //gs executables, disk images, or simple graphics images.

(ASIDE: ShrinkIt does an RLE pass before the LZW. This was done because ShrinkIt began its existence as a disk packer, not a file packer. Since most disks will have large sections of unused space filled with zeros, RLE was helpful. However, it does very little when packing normal files, and can sometimes get in the way.)

The primary advantage of RLE over other methods is speed. It's hard to compress data faster than RLE does, because it spends most of its time doing little more than copying data.

-=O=-

PROGRAM (RLE)

There are two basic ways to implement RLE: operate on a stream or on a buffer of data. The stream approach is more widely useful, and fits in with the way other compressors work, so it will be used here. Operating on a buffer will usually be faster because you don't have to deal with the overhead of character I/O, but you end up having to compare pointers against the end of the buffer, so the routine itself becomes more complex. At any rate, the sample code takes a stream-based approach, so that is how the example is written. I threw in the RLE decoder from YankIt as well.
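Since the sample files themselves aren't reproduced in this text, here is a rough C sketch of what a stream-based encoder can look like. It is written to match the decoder sketch shown earlier (rotating escape value, literal count byte) and only bothers encoding runs of four or more; treat it as an illustration, not as the code that ships with the lesson.

#include <stdio.h>

#define ESCAPE_INIT  0xdb    /* must match the decoder */
#define ESCAPE_BUMP  51
#define MIN_RUN      4       /* shorter runs aren't worth encoding */
#define MAX_RUN      255     /* the count has to fit in one byte */

/* Emit one run.  Escape characters always get encoded, even solo ones;
 * anything else shorter than MIN_RUN is cheaper to copy verbatim.
 */
static void emit(FILE *out, int ch, int count, int *escape)
{
    if (ch == *escape || count >= MIN_RUN) {
        putc(*escape, out);
        putc(ch, out);
        putc(count, out);
        *escape = (*escape + ESCAPE_BUMP) & 0xff;   /* stay in step with the decoder */
    } else {
        while (count-- > 0)
            putc(ch, out);
    }
}

void rle_encode(FILE *in, FILE *out)
{
    int c, prev, count;
    int escape = ESCAPE_INIT;

    prev = getc(in);
    if (prev == EOF)
        return;                          /* empty input */
    count = 1;
    while ((c = getc(in)) != EOF) {
        if (c == prev && count < MAX_RUN) {
            count++;                     /* still in the same run */
        } else {
            emit(out, prev, count, &escape);
            prev = c;
            count = 1;
        }
    }
    emit(out, prev, count, &escape);     /* flush whatever is left */
}

Even in this stripped-down form you can see that the fiddly part is deciding when a run ends, including the final flush at end of input.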
It's interesting to note that the encoder can be surprisingly complicated, especially when compared to the decoder. The reason for this is clear when you represent RLE as a finite state machine, with states representing the number of matches seen so far. The machine needs 256 states to count the run length, and the first three treat the end of a run differently than the rest. It's also complicated by the presence of escape codes, which need three entirely different states, and by an extra transition needed for when the end of data is reached.

The decoder also has 256 states, but 255 of them are identical. There's basically one state for "not handling a run" and one for "handling a run"; the rest are all just there to count the number of characters.

The C and assembly code function identically, but there is a slight difference in the way the encoders act at the end of input. For the C version I just copied the code (about 8 lines); for the assembly version I did some fancy skipping around.

Both work pretty much the way you'd expect. The encoders get a character and compare it with the last one. If it matches, look for more. If it doesn't, output what we have. The decoder is almost trivial.

That about wraps it up for generic RLE. Try it on some files and see how it does. One good example is "main.asm" in the assembly sample files directory... Orca assembly files (well, the ones without tabs in them) squash extremely well with RLE. There is also a "test" subdirectory which contains some RLE test files.

-=O=-

PackBytes

PackBytes is in the Miscellaneous tool set of the //gs toolbox (call number $2603, described on page 14-39 of Volume I of the Toolbox Reference). It's most commonly used when compressing Super Hi-Res images into Apple Preferred format (type $c0/$0002). It bears a very strong resemblance to RLE - the basic idea is the same - but it goes about it in a different way.

The output is a series of chunks which begin with a one-byte header. The header byte has two flag bits and 6 length bits, defined as follows:

    00xxxxxx - 1 to 64 bytes follow - all different
    01xxxxxx - 3, 5, 6, or 7 repeats of next byte
    10xxxxxx - 1 to 64 repeats of next 4 bytes
    11xxxxxx - 1 to 64 repeats of next byte taken as 4 bytes

Let's take them one at a time.

For 00xxxxxx, you have the two flag bits (00) and 6 length bits (xxxxxx). With 6 bits you can represent a number between 0 and 63; they used the "length-1" trick and treat it as a length between 1 and 64. If the flag bits are 00, then the next 1 to 64 bytes of data are to be copied without change.

If the flag bits are 01, then it only accepts 2, 4, 5, or 6 and adds one, giving 3, 5, 6, or 7. Yes, it's weird, but it'll make more sense in a minute. It takes the next byte in the input and repeats it that many times. This is similar to what we were doing earlier, except that the escape char is in the first two bits, the length is in the next 6 bits, and the character to repeat comes last. I don't know what PackBytes would do if you gave it a length other than 2, 4, 5, or 6. Be afraid. Be very afraid.

If the flag bits are 10, then it repeats the next 4 bytes 1 to 64 times. So if you have the sequence "ABCDABCDABCD" it would code "10 000010 ABCD", meaning 'output "ABCD" 3 times'. This is a bit more sophisticated than generic RLE because it can take advantage of things like repeated two-byte integers stored in a data file.

If the flag bits are 11, then it takes the next byte, expands it into 4, and then repeats it 1 to 64 times. This is like generic RLE but with a repeat count of 4, 8, 12, etc. Note that by combining this with the repeat counts of 3/5/6/7 above you can represent any run of 3 or more characters (now does it make sense?).
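To make the chunk format concrete, here is a rough C sketch of an unpacker for data laid out this way. It works from the format description above, not from Apple's code, and it is NOT the actual UnPackBytes tool call - the real routine has its own interface (see the tech note mentioned below) - so treat it purely as an illustration.

#include <stddef.h>

/* Expand a buffer of PackBytes-style chunks.  'srclen' is the length of
 * the packed data; returns the number of bytes written to 'dst'.
 * (Sketch only: no overflow checking on the destination buffer.)
 */
size_t unpack_bytes(const unsigned char *src, size_t srclen, unsigned char *dst)
{
    size_t in = 0, out = 0;

    while (in < srclen) {
        int flags = src[in] >> 6;            /* top two bits */
        int count = (src[in] & 0x3f) + 1;    /* low six bits, "length-1" trick */
        int i, j;
        in++;

        switch (flags) {
        case 0:                              /* 00: 1-64 literal bytes */
            for (i = 0; i < count; i++)
                dst[out++] = src[in++];
            break;
        case 1:                              /* 01: 3, 5, 6, or 7 repeats of next byte */
            for (i = 0; i < count; i++)
                dst[out++] = src[in];
            in++;
            break;
        case 2:                              /* 10: 1-64 repeats of next 4 bytes */
            for (i = 0; i < count; i++)
                for (j = 0; j < 4; j++)
                    dst[out++] = src[in + j];
            in += 4;
            break;
        case 3:                              /* 11: next byte taken as 4, repeated 1-64 times */
            for (i = 0; i < count; i++)
                for (j = 0; j < 4; j++)
                    dst[out++] = src[in];
            in++;
            break;
        }
    }
    return out;
}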
I'm not going to describe the interface. I will say that, ON PAIN OF DEATH, you MUST read IIgs Tech Note #094, "Packing It In (and Out)", if you plan to use these routines. It explains some common mistakes and some bugs in the routines.

-=*=-

Does PackBytes do better than generic RLE? Sure looks that way. By merging the length into the header byte and all but doing away with the escape codes, it reduces the amount of excess garbage we need to output. However, shaving one byte here and there doesn't really amount to much, and the added complexity of the algorithm makes for a rather unpleasant speed trade-off.

One place where it DOES do well is in handling patterns of two or four bytes. If your goal is to compress English text, you will not see an improvement. If you are compressing a big pile of identical two or four-byte integers, this will help quite a bit.

How much can it expand a file? By one byte for every 64. This is far better than RLE's worst case.

Note that I don't think I've ever actually used PackBytes, because the interface is kind of annoying unless you know how big the data will get before you start to uncompress it. There's no sample code which uses it. A brief (but sort of icky) sample program appears in the toolbox ref; it's probably worth staring at just to see how it works.

-=O=-

Differential

And now for something completely different. Differential encoding does not make the data any smaller. Thus, it isn't really a "compression" scheme in itself; rather, it is a way of modifying the data so that a compressor can make more sense of it.

Switching back to theory mode for a moment, differential encoding techniques try to change the nature of the data so that the entropy with respect to a particular model is lower. The techniques described here are very simple, but the ideas behind them are powerful and crop up now and again.

-=*=-

Let's start with a simple example: digitized sound files. Your basic system beep is a digital representation of analog data which tends to have a wave-like nature. This means that it is rare to see a long series of identical values; more likely you will see a series of increasing values followed by a series of decreasing values.

To most compressors, this looks like random garbage. There are few identifiable patterns, and the distribution of values is relatively flat when compared to something like English text. However, a series of ascending values may be part of a straight line in the wave, so each is higher than the last by a fixed increment. This suggests that if we could compress the slope instead of the points, we might get somewhere.

The easiest way to do this is by storing the DIFFERENCE between the current value and the last one. So if the input were:

    5  3  5  8 10 12 13 15

the output would be:

    5 -2  2  3  2  2  1  2

Because each point in the digitized waveform is related to its neighbors (assuming a continuous wave), the difference between adjacent points will tend to be small.
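The transformation itself is tiny. Here is a sketch of a byte-oriented differential pass in C (mine, not the lesson's source file), done the way the sample code described later behaves: each output byte is the difference between the current byte and its predecessor, modulo 256, with the "previous" value starting at zero.

#include <stdio.h>

/* Forward pass: each output byte is (current - previous) modulo 256.
 * The "previous" value starts at zero.
 */
void diff_encode(FILE *in, FILE *out)
{
    int c, prev = 0;

    while ((c = getc(in)) != EOF) {
        putc((c - prev) & 0xff, out);
        prev = c;
    }
}

/* Inverse pass: add each difference back onto the running value. */
void diff_decode(FILE *in, FILE *out)
{
    int c, prev = 0;

    while ((c = getc(in)) != EOF) {
        prev = (prev + c) & 0xff;
        putc(prev, out);
    }
}

Run a sound file through diff_encode and look at a histogram of the output: the pile-up of values around zero is what the next step takes advantage of.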
We can run the new data through a Huffman encoding pass (which encodes frequent symbols in fewer bits, infrequent symbols in more bits) and get significantly better compression than we could have achieved without the differential pass. Since the values tend to cluster near zero, Huffman will see a probability distribution concentrated in a small area instead of one spread almost evenly over all 256 values.

However, life is not a bed of roses. Or even geraniums. Digitized sound tends to have a lot of noise, which produces discontinuities in the waves. A noisy sample will end up having large, sudden shifts in values (audible "pops"), and the output of the differential encoder will have a wider range of values, causing Huffman to do poorly.

-=*=-

If you take a look at the sample differential output above, you will note that the values are all rather small. Suppose you had this bright idea that you could represent the values in exactly 4 bits. This has the dramatic effect of cutting your sound file in half. You are guaranteed to get exactly 50% compression; why let Huffman have all the fun? However, there's one small problem: what do you do when the next value exceeds the possible range?

Enter lossy compression. Lossy compression differs from lossless compression in that the compressed data does not need to be an exact representation of the source data. This sort of thing would be unacceptable for English text (although the Cliff's Notes people haven't suffered noticeably), but for digitized sound samples it's perfectly acceptable. You can drastically alter a sound file without making changes detectable by the casual observer - slight changes in pitch, the addition or removal of a faint click, etc. Putting a limit on the difference might actually remove some of the excess noise from a sample.

Still, a range of -8 to +7 or so isn't very much for 8-bit sound. Most likely we would end up flattening the waves, ending up with a rather muffled, lifeless noise. We can do better.

-=*=-

Suppose that, instead of storing the difference between this byte and the next, we stored the difference between where we PREDICT the next byte will be and where it ACTUALLY is? Suppose we've seen the sequence:

    10 14 21 25 30

The simple-minded approach would record the difference between 30 and whatever came next. If the next value were 40, we'd be out of luck, and would have to store a "7" and be content with the loss. In fact, if the stream of data increased by 10 each time, we'd never catch up until it hit 255 and came back down (muffled, lifeless...).

By examining the stream, we see that the most likely value for the next byte is 35. If the value IS 35, we store 0. If it's 40, we store 5. We're still out of luck if it's 45, but by analyzing the stream we can guess that the next value will be higher still - say 55 - and will eventually catch up.

Another possibility is to use non-linear differences. If our sample starts to do something like:

    1 2 4 8 16 32 64 128

we're never going to be quite in step. What we can do instead is double the size of each differential step, so that instead of -8 to +7 we would have -16 to +14 (using -16, -14, -12, -10 instead of -8, -7, -6, -5). This makes our differential value inexact, but it will be off by at most 1. Such a change is probably inaudible, and is certainly less noticeable than being off by 32 or so.

Reverting to theory mode once again, we can see that the entropy of a sound file is directly related to how good our MODEL of sound is. If we have the original brain-dead model, which assumed that the next value was going to be identical to the current value, we don't do very well. A simple model which just looks at the last two values and interpolates a third will do considerably better. A fancy model which watches for trends or takes into account the sine-wave nature of some sounds can do even better, possibly accomplishing in 3 bits what the earlier models can do in 4 or more. Other techniques take into account the nature of sound or of the human ear; for example, faint noises or high-pitched sounds may be inaudible when played at the same time as loud, low-frequency sound.
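Here is a toy C sketch of the predict-and-store-the-error idea with a clamped 4-bit error, so you can see how the encoder and decoder stay in step. The two-point linear predictor and the -8..+7 range are choices made just for this example; this is not ADPCM, and it is nothing like as clever as the ACE tools discussed next.

/* Toy lossy coder: predict the next 8-bit sample by extending the line
 * through the last two reconstructed samples, then store the prediction
 * error clamped to a 4-bit range (-8..+7).  The errors would fit two to
 * a byte; packing them is left as an exercise, and here each error is
 * stored in a whole byte for clarity.
 */
static int clamp(int v, int lo, int hi)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

void predictive_encode(const unsigned char *in, signed char *out, int n)
{
    int prev1 = 0, prev2 = 0;       /* last two RECONSTRUCTED samples */
    int i;

    for (i = 0; i < n; i++) {
        int predicted = clamp(prev1 + (prev1 - prev2), 0, 255);
        int err = clamp(in[i] - predicted, -8, 7);
        int reconstructed = clamp(predicted + err, 0, 255);

        out[i] = (signed char)err;
        prev2 = prev1;
        prev1 = reconstructed;      /* track what the decoder will see */
    }
}

void predictive_decode(const signed char *in, unsigned char *out, int n)
{
    int prev1 = 0, prev2 = 0;
    int i;

    for (i = 0; i < n; i++) {
        int predicted = clamp(prev1 + (prev1 - prev2), 0, 255);
        int sample = clamp(predicted + in[i], 0, 255);

        out[i] = (unsigned char)sample;
        prev2 = prev1;
        prev1 = sample;
    }
}

The important detail is that the encoder predicts from the RECONSTRUCTED samples (what the decoder will actually have), not from the original input; otherwise the two drift apart and the error accumulates.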
One popular model is Adaptive Differential Pulse Code Modulation, or ADPCM (the CCITT G.721 standard is a well-known example). This is what the ACE tool set in the //gs toolbox uses. A fairly simple description can be found on pages 27-4 and 27-5 of volume 3 of the Apple //gs Toolbox Reference. From what I've been able to sort out, it appears to use predictive differential coding with a range of values which varies depending on the rate of change in the input. In other words, it does most of the stuff I just described.

-=*=-

Well, I got pretty lost in the implementation. The important idea in the previous section is that we PREDICT what comes next and then store the amount of ERROR in our prediction. In other words, we record how surprised we were. Sound familiar?

-=O=-

PROGRAM

The source code included here does the simple differential coding, intended to be used as a preprocessor for something like a Huffman encoder. It transforms each byte into the difference between it and its predecessor (modulo 256). The algorithm starts with an initial value of zero. It is so utterly trivial I'm not going to bother describing how it works. Just look at the code.

It seems to help a bit when compressing sound files with ShrinkIt, so if you don't happen to have a Huffman encoder lying around, try that. But don't worry... in a couple of weeks, you'll have one.

You will get the best results on samples which are fairly smooth and continuous; digitized musical instruments come to mind. Speech pulled from movies may not do so well.

-=*=-

I have also included the C source to a fast ADPCM routine which was intended to work on 16-bit sound samples (it does 16:4, which is 75% compression). I haven't bothered converting it to assembly or adjusting it for 8-bit sound samples because we have the ACE toolset available. However, the code is easy to understand (it has comments in all the right places), and it's really short. The neat thing about it is that you can tell it not to use any multiply instructions, so a //gs version would be pretty quick. It'd be interesting to do the conversion and see how it compares to the ACE toolset; any takers?

-=O=-

DEEP THOUGHTS

- would an advantage be gained by using a two-byte count for RLE? What are some alternative ways of encoding the length for data with really long sections of zeros? More interestingly, what effect would considering the data a (2-byte) word at a time have? (i.e., you compare the bytes at 0 and 1 with those at 2 and 3, then 4 and 5, and so on.)

- would data generated by PackBytes look significantly different from that generated by RLE? From this, would you conclude that PackBytes compresses English text better, worse, or about as well as RLE?

- suppose you wanted to compress data first with RLE and then with the 7-bit compression from last week. What changes would have to be made to RLE to accommodate the 7-bit scheme?
  Does the compression have to be done in two separate passes (i.e., do the first pass to a file, then execute the second pass and send the output to another file)?

- the examples for differential compression were done on bytes in a sound file. What are some other kinds of data which might benefit? Ideas: words in the dictionary, pictures in games like The Bard's Tale. Note that for large data items like graphics images you can get away with doing a differential pass and then a simple RLE scheme which just removes zeroes from the data.

-=!=-
This document is Copyright by GEnie and Andy McFadden