Hacking Data Compression  Lesson 6  Arithmetic Coding
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie

Okay, here's where things start to get weird.

This week's subject is arithmetic encoding, a method of encoding symbols
that can produce output which is extremely close in size to the entropy
of the input.  It is neither easy to understand nor easy to explain, but
I'll give it my best shot.  It's really too slow to be useful on an
Apple II unless you play some games with it (which I will also endeavor
to explain), so there's no assembly code.  There's plenty of C code,
however.

The book I mention all too often (_Text Compression_, by Bell, Witten,
and Cleary) is probably the definitive reference on arithmetic coding.
My explanation is largely derived from chapter 5, and revolves around
their implementation of it.  The section in _The Data Compression Book_,
by Mark Nelson, is also good.

-=O=-  It does WHAT?

As I have alluded to in earlier lessons, arithmetic coding can represent
several bytes with a single bit, and can handle probabilities more
accurately than encoders like Huffman or Shannon-Fano (which always
round probabilities to powers of two).  To accomplish this, it must do
something a bit strange.

Simply put, it encodes the entire input stream into a single floating
point value between 0 and 1.  By being fairly clever about it, we don't
need infinite-precision math routines to handle it.  When you realize
that the binary representation of a floating point number looks like
"0.011001011011", you can see that it's just a "0." followed by a stream
of bits.  By interpreting the value a few bits at a time instead of as a
single value, we can do all of the processing with integer math.

-=*=-

I'm going to start off with an overview of how things work.  For the
moment, assume that we have infinitely precise floating point numbers at
our disposal, so we can work in decimal.

The output of the arithmetic coder is going to be a number X, where
0 <= X < 1 (the authors of _Text Compression_ use the mathematical
notation "[0, 1)", which means the same thing).  The model feeding us
probabilities will have divided the symbols up so that the sum of their
probabilities is equal to 1, which means that the numbers divide the
region 0 <= X < 1 into neat little pieces.

Suppose we have three symbols {A, B, C}, with probabilities:

        A    0.3
        B    0.2
        C    0.5

This means that they each get their own piece of territory:

        A    0   <= X < 0.3
        B    0.3 <= X < 0.5
        C    0.5 <= X < 1

Each one ends up just below the next value.

So we start out with 0 <= X < 1.  When encoding a message (say, "CBCCA"),
each step causes the encoder to reduce that range to just the piece
occupied by the symbol we read.  So when we read the initial 'C', we
reduce the range to 0.5 <= X < 1.

The next time we read a symbol, we reduce the range we just computed.
Since the next character is 'B', we want to reduce the region to the
area ranging from 0.3 to 0.5 INSIDE 0.5 <= X < 1.
Let me try to illustrate it, starting over from the beginning:

  For lower bound = 0, upper bound = 1,
        A    0   <= X < 0.3   -->   0      0.3
        B    0.3 <= X < 0.5   -->   0.3    0.5
        C    0.5 <= X < 1     -->   0.5    1

We found a 'C', so we use 0.5 and 1, leading to:

  For lower bound = 0.5, upper bound = 1,
        A    0   <= X < 0.3   -->   0.5    0.65
        B    0.3 <= X < 0.5   -->   0.65   0.75
        C    0.5 <= X < 1     -->   0.75   1

The second character is a 'B', so we use 0.65 and 0.75, giving us:

  For lower bound = 0.65, upper bound = 0.75,
        A    0   <= X < 0.3   -->   0.65   0.68
        B    0.3 <= X < 0.5   -->   0.68   0.7
        C    0.5 <= X < 1     -->   0.7    0.75

Next up is a 'C', so we use the interval 0.7 to 0.75, resulting in:

  For lower bound = 0.7, upper bound = 0.75,
        A    0   <= X < 0.3   -->   0.7    0.715
        B    0.3 <= X < 0.5   -->   0.715  0.725
        C    0.5 <= X < 1     -->   0.725  0.75

The fourth character is a 'C', so we would take 0.725 to 0.75.  The
fifth character is an 'A', so we would take 0.725 to 0.7325.  If we had
more data, we would just keep on going this way.  At each step, the
interval (or in fact any number between the high and low values)
uniquely represents the message.

Before we discuss decoding, notice how the "0.7" locked into place.  It
will ALWAYS be 0.7, and really has no effect on future calculations.
This is important when dealing with the output incrementally, because it
allows us to shift away the 0.7, leaving just the "active" parts of the
number.

BTW, the formula I'm using to compute the intervals at each step looks
like this (still using floating point arithmetic):

        range = high - low
        high  = low + range * CumProb[symbol]
        low   = low + range * CumProb[symbol-1]

CumProb holds the cumulative probabilities for each symbol.  So
CumProb[A] holds 0.3, CumProb[B] holds 0.5, and CumProb[C] holds 1.0.
Note that we are skirting around the less-than part of the inequalities,
because the code above computes EXACTLY 1.0 instead of
just-slightly-less-than 1.0.  It's a lot easier to implement this way,
and we can handle it correctly with some fancy footwork.

Notice that, because CumProb and range are always positive, "low" never
decreases.  Since CumProb is always less than or equal to one, and range
gets smaller at each step, "high" never increases.  They just get closer
and closer to each other.  (What happens if our floating point routines
slip up and they MEET each other?)

-=*=-

The final result of our compression was the interval
0.725 <= X < 0.7325, which means that any number in that range is
sufficient to uniquely identify the message.  We can pick any number we
want, but for simplicity let's just grab the "low" value (note that the
"high" value would be invalid).

So we've got this number and the same probability distribution as
before.  If we start from 0 <= X < 1 again, we start off with:

  For lower bound = 0, upper bound = 1,
        A    0   <= X < 0.3   -->   0      0.3
        B    0.3 <= X < 0.5   -->   0.3    0.5
        C    0.5 <= X < 1     -->   0.5    1

So, where does the number fit?  Well, 0.725 is between 0.5 and 1, so the
first character must be a 'C'.  So, we now apply the same transformation
that we did before when we were encoding, and reduce the range to
0.5 <= X < 1:

  For lower bound = 0.5, upper bound = 1,
        A    0   <= X < 0.3   -->   0.5    0.65
        B    0.3 <= X < 0.5   -->   0.65   0.75
        C    0.5 <= X < 1     -->   0.75   1

So where does 0.725 fit in?  In the 'B' range, so the second character
must've been a 'B'.  We again reduce the range as before, and continue
on in this manner until we've decoded the entire message.
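To tie the walkthrough together, here's a minimal floating-point sketch
of the whole round trip.  This is my own illustration, NOT this week's
program: symbols are numbered 0=A, 1=B, 2=C, and a leading 0.0 is added
to cum_prob[] so that cum_prob[sym] and cum_prob[sym+1] play the roles
of CumProb[symbol-1] and CumProb[symbol] from the formula above.

/* Toy floating-point arithmetic coder for the "CBCCA" example.
 * Relies on doubles behaving well enough for a five-symbol message. */
#include <stdio.h>

static const double cum_prob[4] = { 0.0, 0.3, 0.5, 1.0 };
static const char names[] = "ABC";

int main(void)
{
    int    msg[5] = { 2, 1, 2, 2, 0 };        /* "CBCCA" */
    double low = 0.0, high = 1.0, range, value;
    int    i, sym;

    for (i = 0; i < 5; i++) {                 /* encode: narrow the range */
        range = high - low;
        high  = low + range * cum_prob[msg[i] + 1];
        low   = low + range * cum_prob[msg[i]];
        printf("after '%c': %g <= X < %g\n", names[msg[i]], low, high);
    }

    value = low;                              /* any X in the range works */
    low = 0.0;
    high = 1.0;
    for (i = 0; i < 5; i++) {                 /* decode: find X's piece */
        range = high - low;
        for (sym = 0; value >= low + range * cum_prob[sym + 1]; sym++)
            ;                                 /* linear search */
        printf("decoded '%c'\n", names[sym]);
        high = low + range * cum_prob[sym + 1];
        low  = low + range * cum_prob[sym];
    }
    return 0;
}

Run against "CBCCA", it narrows the interval to 0.725 <= X < 0.7325 and
then decodes the same five symbols back out of value = 0.725.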
Two small issues.  First, how do we know when to stop?  For example, the
output "0.0" could mean "A", "AA", "AAA", and so on.  We can code an
infinite number of As with one bit!  We can tell the decoder to stop
with one of the techniques used earlier (either record the length at the
start of the file, or use a distinguished stop symbol).

Second, how do we find out which range it belongs in?  Well, the easiest
approach is to do a linear search through the CumProb table to find a
range which matches.  This is the approach we will use.  It's also
possible to do a binary search on it, but if we're using an adaptive
model it can be expensive to sort the table after each character.

-=*=-

Hopefully by now you can see how it works.  We start with the interval
0 <= X < 1, and at each step we reduce it according to the probability
of the symbol.  Infrequent symbols will reduce the interval more
sharply, requiring more bits than frequent symbols.  After each
reduction, we effectively expand it to full size, and then chop another
piece out of it.  Eventually, the pieces get pretty small.  It's hard to
illustrate when you've only got 80 columns and standard text characters.

-=O=-  Implementation issues

Now we have to throw away the infinite precision floating point number
stuff, and deal with 1s and 0s.

We saw that the leftmost digit will "freeze" and never change again;
this happens exactly when the digit is the same in both the high and low
boundaries (if you don't believe me, go back and look).  Whenever the
most significant bit of high and low is equal, we can shift it out and
write it to the file, because it will never change and will not have any
effect on future calculations.

The effects of simply removing it and of shifting it out are the same.
Whether we change 0.725 - 0.7325 to 0.025 - 0.0325 or 0.25 - 0.325 is
immaterial; each step depends on the DIFFERENCE between them.  This is
what allows us to use less precise arithmetic.

-=*=-

We will represent the bounds as integers, so it is convenient to switch
from [low, high) to [low, high]; that is, low <= X <= high.  What we
really want to do is represent it as [low, high + 0.11111...), where
0.11111... is an infinitely repeating binary fraction.  This suggests
that, when we shift the most significant bit off, we should shift 0s
into low and 1s into high, making it look like there's an infinite
string of 1s trailing off the end of high.

Here's how we put this concept to use.  When encoding, we want to output
bits and shift whenever the most significant bits of high and low are
the same.  So, if we're using 16-bit integers to represent the values,
we can just logically AND off the other bits with 0x8000 and compare
them.  If they match, we output the hi bit, shift both values to the
left, and then add one to the high value to preserve the 0.1111...
effect.

(You can get the same effect in Pascal by comparing against the halfway
point, which in this case is 0x8000.  If both values are above or below
it, then the hi bits must match and you know what to output.  You can
get the bit-shift effect by multiplying by 2, though you will have to
multiply (low - HALF) and (high - HALF) by 2 when the MSB is 1 if you
want to be portable to architectures with 32-bit integers.  It's also
interesting to draw it this way, because you want to do the shift
whenever high and low are both on the same side of the midpoint of the
interval.)

Note that more than one bit may match at any step.  Thus, we need to put
the shift-out stuff in a while loop.

When decoding, after we have figured out what the current symbol is, we
can shift out any bits which match, and shift in some new bits from the
input.  It's very much like encoding.
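To make the shift-out step concrete, here is a small sketch of the
encoder-side loop just described.  This is my own illustration, not code
from the "cacm" directory; output_bit() is a hypothetical routine that
appends one bit to the output file, and the underflow case discussed
below is ignored for now.

/* Shift out and write every bit on which low and high agree. */
#define TOP_BIT 0x8000

extern void output_bit(int bit);

static unsigned int low, high;              /* 16-bit interval bounds */

static void shift_out_matching_bits(void)
{
    while ((low & TOP_BIT) == (high & TOP_BIT)) {
        output_bit((high & TOP_BIT) != 0);
        low  = (low  << 1) & 0xFFFF;        /* shift a 0 into low  */
        high = ((high << 1) & 0xFFFF) | 1;  /* shift a 1 into high */
    }
}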
-=*=-

How do we identify what the next character should be?  Before we get
into that, we need a slight change in the way we store things.  We're
going to store frequency counts instead of probabilities, because
probabilities are floating point values.  Also, following the example
from _Text Compression_, we're going to store things backward, so that
CumFreq[1] is the largest and CumFreq[max_symbol] is the smallest.
CumFreq[0] holds the total frequency count, so that

        CumProb[n] = CumFreq[n] / CumFreq[0]

This allows us to deal with probabilities as integers.  CumFreq[0] must
start out at 1.

Now, to figure out which symbol's probabilities bracket the value we're
working with, we use:

        range = high - low + 1;
        srch = ((value - low + 1) * CumFreq[0] - 1) / range;
        for (sym = 1 ; CumFreq[sym] > srch ; sym++) ;

This isn't all that clear.  Rearranged, the assignment for "srch" looks
like:

               (value - low + 1) * CumFreq[0] - 1
        srch = ----------------------------------
                         high - low + 1

If you drop the +/- 1 stuff (which is there because of integer
truncation), you get:

                value - low
        srch = ------------- * CumFreq[0]
                high - low

This means that srch is being assigned a frequency count that is
proportional to the distance "value" is inside the interval.  If value
is equal to low, srch will be zero.  If it's (almost) equal to high,
then it will be (almost) equal to CumFreq[0], which translates to a
CumProb of (almost) 1.

-=*=-

It should be clear from the above that integer round-off errors could be
catastrophic if they ended up mapping two different symbols to the same
part of the interval.  Because of this, we need to guarantee that the
difference between low and high is big enough to prevent it.

This is probably the hardest part to figure out.  Nelson took one
approach, Bell/Witten/Cleary took another.  I'm going to sort of weasel
through both of them.

Suppose your range (represented as a FINITE-precision value) is
something like high = 50004, low = 49995.  They're converging, but
slowly.  It is possible for this to stretch out indefinitely, making it
(apparently) impossible to deal with incrementally, because we will
eventually run out of precision.  At that point, the difference between
low and high will be so slight that we will be stuck.

However, the middle digits are all 0s in the high one and all 9s in the
low one.  Something similar will also hold for low and high represented
in binary.  Basically, we can store a *count* of the number of digits in
the middle which don't match, and then output them when the hi digit
finally becomes equal.  Until then, we just shift them out without
touching the hi digit; for the decimal example above, you'd have
high = 54, low = 49, and underflow_count = 3.

The trick is, we won't know if these are supposed to be 0s or 1s until
the most significant bits become equal.  When they do, we want to output
the OPPOSITE value.  Why the opposite?  It's hard to explain without a
picture, but it should be clear if you notice that the strings will look
like "100001" and "011110".  It will converge to one or the other,
determined by the hi bit.

Note that even this is not a complete solution.  The underflow can
continue forever, so we will eventually overflow the underflow counter.
We can make it highly improbable by using a large enough integer to hold
the count, but a large enough message with the correct set of values
will cause the compressor to break.

We implement underflow prevention by checking the next bit over from the
hi bit (0x4000).  If the hi bits are not equal, and the next-bits are
not equal to each other or the hi bits, then we need to shift them out
of the picture.  This is actually easier than it sounds: if the hi bits
are not equal, then we know that the high hi bit is 1, and the low hi
bit is 0.  Then danger can only occur if the high next-bit is 0 and the
low next-bit is 1, which we can check for easily.

The cute trick here is that we don't need to do any weird shifting; we
can just invert the next-bit and shift it left.  This has the same
effect as yanking the next-bit out and shifting left while preserving
the hi bit.

Bell/Witten/Cleary explain and implement this by dividing the interval
into quarters.  If high and low are in their respective halves but both
are in the middle quarters, then you are headed for trouble.  (I
actually found all the half/quarter stuff to be confusing... it makes
for nice illustrations, but I tend to think in terms of bit-level
operations.)

If it feels like we are outputting a lot of bits here, your instincts
are correct.  This problem mainly surfaces for highly improbable events.

Incidentally, the decoder should be performing the exact same steps as
the encoder, so this discussion pertains to that as well.
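Here's how the earlier shift-out sketch might be extended to count
underflow bits, using the 0x8000/0x4000 tests just described.  Again,
this is my own illustration rather than the CACM code; output_bit(),
TOP_BIT, NEXT_BIT, and renormalize() are made-up names, and low/high are
assumed to be kept to 16 bits.

#define TOP_BIT  0x8000
#define NEXT_BIT 0x4000

extern void output_bit(int bit);

static unsigned int  low, high;     /* 16-bit interval bounds */
static unsigned long underflow;     /* pending "opposite" bits */

static void renormalize(void)
{
    for (;;) {
        if ((low & TOP_BIT) == (high & TOP_BIT)) {
            /* Top bits match: emit the bit, then the pending opposites. */
            int bit = (high & TOP_BIT) != 0;
            output_bit(bit);
            while (underflow > 0) {
                output_bit(!bit);
                underflow--;
            }
        } else if ((low & NEXT_BIT) && !(high & NEXT_BIT)) {
            /* Underflow: top bits differ and the next bits are closing
             * in (low = 01..., high = 10...).  Count the bit, and throw
             * away the second-highest bit by inverting it before the
             * shift below, which preserves the hi bit. */
            underflow++;
            low  ^= NEXT_BIT;
            high ^= NEXT_BIT;
        } else {
            break;                  /* nothing more to shift out */
        }
        low  = (low  << 1) & 0xFFFF;        /* shift a 0 into low  */
        high = ((high << 1) & 0xFFFF) | 1;  /* shift a 1 into high */
    }
}

When the top bits finally do match, the pending bits all come out as the
opposite of that bit, which is exactly the "100001" / "011110" behavior
described above.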
-=*=-

We are also faced with an overflow problem because of the
multiplications we are doing, i.e. range * CumFreq[].  If we want to fit
that into an integer, then we must have:

        f <= c - 2
        f + c <= p

where f is the number of bits for the frequency counts, c is the number
of bits for high and low, and p is the precision to which arithmetic is
performed (i.e. the size of your integers).

Restricting p to 16 doesn't do very well, because you end up with c=9
and f=7.  Since each symbol must have a frequency count of at least 1,
and CumFreq must be < 128, it isn't possible to represent a full set of
256 symbols.  So, we'll have to bite the proverbial bullet and let p be
32, which allows us to choose c=16 and f=14 (this expedites operations
on 16-bit ints but still has a fairly large range of frequencies).

-=*=-

We're going to continue the fine tradition of ending a file with a
distinguished stop symbol.  In this case, we need to output the bits for
the stop symbol, and then a few more to ensure that the encoded string
falls within the final range.  It turns out that it is necessary to send
two more bits (either 10 or 01) to ensure that it ends up in the right
quarter of the interval.  After those two, it doesn't matter what we
output.

(I would've thought that outputting a nice fat string of 0s would be
sufficient, but this is not the case.  The idea is to avoid underflow
problems.  Keep in mind that arithmetic coding doesn't just yank a few
bits off and find the appropriate symbol; it continuously updates low
and high, and uses them to find the interval in which the symbol lies.
If the interval is very small after the stop symbol (which is, by
definition, infrequent!), then you want to make sure the decoder can
unambiguously decode the symbol.)

-=*=-

So that's basically how it works.  By carefully selecting the bit-widths
of the variables to prevent overflow, and carefully watching the high
and low markers to prevent underflow, we can incrementally process the
whole floating point value.

It's not remarkably efficient because of the multiply and divide
instructions we have to do, but there are ways of avoiding these,
usually at the cost of poorer compression.
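Since so much hinges on those bit-width choices, here is one way they
might be written down as constants.  The names are my own, in the spirit
of the header file described in the PROGRAM section below, not copied
from it.

/* p = 32 (do the arithmetic in longs), c = 16, f = 14.
 * Check the constraints from the text: f <= c - 2 and f + c <= p. */
#define CODE_BITS   16                        /* c: bits in low and high */
#define TOP_VALUE   ((1L << CODE_BITS) - 1)   /* largest 16-bit value    */
#define FREQ_BITS   14                        /* f: bits for frequencies */
#define MAX_FREQ    ((1 << FREQ_BITS) - 1)    /* limit on CumFreq[0]     */

The adaptive model then rescales its counts whenever the total would
exceed MAX_FREQ, which is what keeps the range * CumFreq[] products
inside 32 bits.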
-=O=-  Models

In case I haven't made this clear enough already, an encoder is only as
good as the model driving it.  Arithmetic coding will encode symbols
according to their probabilities in a manner which gets very close to
the entropy of the input (the factors reducing effectiveness are message
termination overhead, use of finite-precision arithmetic, and the
scaling of frequencies so that they don't exceed the maximum).  However,
recall that the entropy is defined in terms of a particular model.

The model included with this week's program is the simple adaptive model
included in the CACM article and in _Text Compression_.  It doesn't
really do anything that the adaptive Huffman encoder didn't; it just
keeps a count of frequencies, and rescales when necessary.  The
mechanics of it are explained in the next chapter.

The presence of a nearly-optimal, universal encoder allows us to use
models which would not have been feasible otherwise.  For example, go
back to lesson 3, where we discussed order-N models and "blending".  We
can now use higher-order statistical models (like order-N context models
or finite-state models) to compute the probabilities.

Chapter 6 of _The Data Compression Book_ contains an excellent
discussion (with source code) of an efficient higher-order model.  If
you find this stuff intriguing, you might want to get a copy.
Incidentally, the figures in the book indicate that the speed of the
algorithm is less than one-tenth of LZW - and this is on a machine with
hardware multiply and divide instructions.

-=*=-

One interesting alternative is to use strictly bit-level probabilities.
Since you have only two elements, you can do away with the multiply and
divide instructions.  A method for doing this is presented in section
5.3.4 of _Text Compression_, but unfortunately the idea (known as
Q-coding) is patented.

-=O=-  PROGRAM

All of the sample code is in C.  The only one I have ported to Orca/APW
is in the directory "cacm", which holds the canonical sources.  The
files here are:

  arithmetic_coding.h    --> arith.coding.h
        Header file with bit-width definitions.

  encode.c               --> encode.ORIG
  decode.c               --> decode.ORIG
        Front-end to the compressors; replaced with the usual class
        cmain.c and carith.c.

  arithmetic_encode.c    --> arith.enc.c
  arithmetic_decode.c    --> arith.dec.c
        The main body of the encoder.

  bit_input.c            --> bit.input.c
  bit_output.c           --> bit.output.c
        Bit-level I/O routines.

  model.h
        Interface to the model.

  adaptive_model.c       --> adapt.mdl.c
        The adaptive model itself.

Note how the model is actually in a separate file, with a well-defined
interface ("start_model()" and "update_model()").  This clearly
distinguishes the model from the encoder.

-=*=-

There isn't a whole lot to say about the coder.  The main loops in both
the encoder and decoder work the way they have all along: get a
character, encode it, loop until EOF, output a stop symbol.  Opposite
stuff for the decoder.

The bit-level I/O stuff is pretty straightforward.  They do some sanity
checking in the input routines, but nothing fancy.

The encoder and decoder do exactly what I said they did.  The encoder
recomputes the range and then does some comparisons to see if it should
output bits or capture underflow bits.  The decoder uses the formula I
gave to find the symbol, and then does the same batch of stuff to check
for underflow bits or no longer needed bits.

-=*=-

The implementation of the model is interesting.  Recall that we need to
keep a list of the cumulative totals, preferably in sorted order to make
searching easier.  So, we keep the list in frequency order, with a
parallel index indicating which symbol is associated with each entry.
That way, when a symbol changes rank, we can just change the index.

When a symbol is received, its frequency is incremented, and the
cumulative frequencies of it and all those above it are incremented (so
we need two different arrays here).  This makes lookups fast, and
updates are fast for small alphabets (e.g. mostly a-z with some A-Z0-9,
instead of the entire set of 256 characters).  The cumulative
frequencies must be recalculated after the list is scaled.

Note that, since the table is sorted in frequency order, the most
frequent items are at the head of the list, so a linear search will be
reasonably efficient for small alphabets.
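Here's a simplified sketch of that bookkeeping.  start_model() and
update_model() are the interface names mentioned above, but the bodies
are my own illustration of the idea, NOT the actual adapt.mdl.c; NSYM,
MAX_FREQ, and the array names are assumptions.

/* Table slot 1 is the most frequent symbol, slot NSYM the least.
 * cum_freq[i] is the total count of all slots above i, so cum_freq[0]
 * is the grand total and cum_freq[NSYM] is 0, matching the backward
 * CumFreq[] layout used earlier. */
#define NSYM      257                /* 256 byte values plus stop symbol */
#define MAX_FREQ  16383              /* 14-bit limit on cum_freq[0]      */

static int freq[NSYM + 1];
static int cum_freq[NSYM + 1];
static int index_to_sym[NSYM + 1];   /* table slot -> symbol */
static int sym_to_index[NSYM];       /* symbol -> table slot */

static void start_model(void)
{
    int i, total = 0;
    for (i = 0; i < NSYM; i++) {     /* every symbol starts with count 1 */
        sym_to_index[i] = i + 1;
        index_to_sym[i + 1] = i;
    }
    for (i = NSYM; i >= 1; i--) {
        freq[i] = 1;
        cum_freq[i] = total;
        total += freq[i];
    }
    cum_freq[0] = total;
}

static void update_model(int sym)
{
    int i = sym_to_index[sym];

    if (cum_freq[0] >= MAX_FREQ) {      /* rescale: halve every count,   */
        int total = 0, j;               /* keeping each one at least 1,  */
        for (j = NSYM; j >= 1; j--) {   /* then rebuild the cumulative   */
            freq[j] = (freq[j] + 1) / 2;        /* totals from the       */
            cum_freq[j] = total;                /* bottom of the table.  */
            total += freq[j];
        }
        cum_freq[0] = total;
    }

    /* If other slots have the same count, move this symbol above them so
     * the table stays sorted; only the index entries need to be swapped. */
    while (i > 1 && freq[i - 1] == freq[i])
        i--;
    if (i < sym_to_index[sym]) {
        int old = sym_to_index[sym];
        int other = index_to_sym[i];
        index_to_sym[i] = sym;      sym_to_index[sym] = i;
        index_to_sym[old] = other;  sym_to_index[other] = old;
    }

    /* Bump the count, then bump the cumulative totals of everything
     * above it (i.e. cum_freq[0 .. i-1]). */
    freq[i]++;
    while (--i >= 0)
        cum_freq[i]++;
}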
-=*=-

That about does it.  PLEASE take a look at the code, even if you don't
know C.  It looks like it was translated from Pascal (they use the
Half/Quarter stuff instead of bit operations, use "foo += 1" instead of
"foo++", etc), so it shouldn't be too hard to understand.

The "for" loop that does the table search looks like this:

        for (blah ; ack ; splat) ;

Note the semicolon on the end.  This means "do nothing."  The value
desired will be in the loop counter when the loop exits.

-=*=-

The contents of the "fast" directory are for a more optimized version of
the same thing.  The "low" directory contains a version which avoids the
divides at the expense of lower-precision arithmetic for the
probabilities.  I haven't examined either of these closely, and haven't
tried to port them to the //gs, but I thought they would be interesting.

The "acms" directory contains an interesting variant from somebody's
thesis, called arithmetic coding with scalar quantization.  See the
README file for more info.

There are some sources for arithmetic coders with higher-order models,
but I wasn't able to ascertain what restrictions were placed on the
distribution.  They're really slow on 80x86s though, and the code is not
exactly intuitive, so I doubt they'd have much value here.

-=O=-  DEEP THOUGHTS

It's hard to get much deeper than this.

-=O=-  LEGAL STUFF

Here's the blurb from the top of the "acms" stuff.

    Arithmetic coding routines by A.C. Popat, 1990.  For theory, see
    Popat, Ashok C., "Scalar Quantization with Arithmetic Coding," SM
    thesis, MIT Dept. of Elec. Eng. and Comp. Sci., Cambridge,
    Massachusetts, June 1990.

    Copyright 1990 Massachusetts Institute of Technology
    All rights reserved.

    Permission to use, copy, or modify this software and its
    documentation for educational and research purposes only and without
    fee is hereby granted, provided that this copyright notice appear on
    all copies and supporting documentation.  For any other uses of this
    software, in original or modified form, including but not limited to
    distribution in whole or in part, specific prior permission must be
    obtained from M.I.T. and the author.  These programs shall not be
    used, rewritten, or adapted as the basis of a commercial software or
    hardware product without first obtaining appropriate licenses from
    M.I.T.  M.I.T. makes no representations about the suitability of
    this software for any purpose.  It is provided "as is" without
    express or implied warranty.

-=!=-
This document is Copyright by GEnie and Andy McFadden