Hacking Data Compression  Lesson 6  Arithmetic Coding
By Andy McFadden
GEnie A2Pro Apple II University - Copyright (C) 1992 GEnie

Okay, here's where things start to get weird.

This week's subject is arithmetic encoding, a method of encoding symbols
that can produce output which is extremely close in size to the entropy
of the input.  It is neither easy to understand nor easy to explain, but
I'll give it my best shot.  It's really too slow to be useful on an
Apple II unless you play some games with it (which I will also endeavor
to explain), so there's no assembly code.  There's plenty of C code,
however.

The book I mention all too often (_Text Compression_, by Bell, Witten,
and Cleary) is probably the definitive reference on arithmetic coding.
My explanation is largely derived from chapter 5, and revolves around
their implementation of it.  The section in _The Data Compression Book_,
by Mark Nelson, is also good.

-=O=-  It does WHAT?

As I have alluded to in earlier lessons, arithmetic coding can represent
several bytes with a single bit, and can handle probabilities more
accurately than encoders like Huffman or Shannon-Fano (which always
round probabilities to powers of two).  To accomplish this, it must do
something a bit strange.

Simply put, it encodes the entire input stream into a single floating
point value between 0 and 1.  By being fairly clever about it, we don't
need infinite-precision math routines to handle it.  When you realize
that the binary representation of a floating point number looks like
"0.011001011011", you can see that it's just a "0." followed by a stream
of bits.  By interpreting the value a few bits at a time instead of as a
single value, we can do all of the processing with integer math.

-=*=-

I'm going to start off with an overview of how things work.  For the
moment, assume that we have infinitely precise floating point numbers at
our disposal, so we can work in decimal.

The output of the arithmetic coder is going to be a number X, where
0 <= X < 1 (the authors of _Text Compression_ use the mathematical
notation "[0, 1)", which means the same thing).  The model feeding us
probabilities will have divided the symbols up so that the sum of their
probabilities is equal to 1, which means that the numbers divide the
region 0 <= X < 1 into neat little pieces.

Suppose we have three symbols {A, B, C}, with probabilities:

        A    0.3
        B    0.2
        C    0.5

This means that they each get their own piece of territory:

        A    0   <= X < 0.3
        B    0.3 <= X < 0.5
        C    0.5 <= X < 1

Each one ends up just below the next value.

So we start out with 0 <= X < 1.  When encoding a message (say, "CBCCA"),
each step causes the encoder to reduce that range to just the piece
occupied by the symbol we read.  So when we read the initial 'C', we
reduce the range to 0.5 <= X < 1.

The next time we read a symbol, we reduce the range we just computed.
Since the next character is 'B', we want to reduce the region to the
area ranging from 0.3 to 0.5 INSIDE 0.5 <= X < 1.
Let me try to illustrate it, starting over from the beginning:

  For lower bound = 0, upper bound = 1,
        A    0   <= X < 0.3   -->   0      0.3
        B    0.3 <= X < 0.5   -->   0.3    0.5
        C    0.5 <= X < 1     -->   0.5    1

We found a 'C', so we use 0.5 and 1, leading to:

  For lower bound = 0.5, upper bound = 1,
        A    0   <= X < 0.3   -->   0.5    0.65
        B    0.3 <= X < 0.5   -->   0.65   0.75
        C    0.5 <= X < 1     -->   0.75   1

The second character is a 'B', so we use 0.65 and 0.75, giving us:

  For lower bound = 0.65, upper bound = 0.75,
        A    0   <= X < 0.3   -->   0.65   0.68
        B    0.3 <= X < 0.5   -->   0.68   0.7
        C    0.5 <= X < 1     -->   0.7    0.75

Next up is a 'C', so we use the interval 0.7 to 0.75, resulting in:

  For lower bound = 0.7, upper bound = 0.75,
        A    0   <= X < 0.3   -->   0.7    0.715
        B    0.3 <= X < 0.5   -->   0.715  0.725
        C    0.5 <= X < 1     -->   0.725  0.75

The fourth character is a 'C', so we would take 0.725 to 0.75.  The
fifth character is an 'A', so we would take 0.725 to 0.7325.  If we had
more data, we would just keep on going this way.  At each step, the
interval (or in fact any number between the high and low values)
uniquely represents the message.

Before we discuss decoding, notice how the "0.7" locked into place.  It
will ALWAYS be 0.7, and really has no effect on future calculations.
This is important when dealing with the output incrementally, because it
allows us to shift away the 0.7, leaving just the "active" parts of the
number.

BTW, the formula I'm using to compute the intervals at each step looks
like this (still using floating point arithmetic):

        range = high - low
        high  = low + range * CumProb[symbol]
        low   = low + range * CumProb[symbol-1]

CumProb holds the cumulative probabilities for each symbol.  So
CumProb[A] holds 0.3, CumProb[B] holds 0.5, and CumProb[C] holds 1.0.
Note that we are skirting around the less-than part of the inequalities,
because the code above computes EXACTLY 1.0 instead of
just-slightly-less-than 1.0.  It's a lot easier to implement this way,
and we can handle it correctly with some fancy footwork.

Notice that, because CumProb and range are always positive, "low" never
decreases.  Since CumProb is always less than or equal to one, and range
gets smaller at each step, "high" never increases.  They just get closer
and closer to each other.  (What happens if our floating point routines
slip up and they MEET each other?)

-=*=-

The final result of our compression was the interval
0.725 <= X < 0.7325, which means that any number in that range is
sufficient to uniquely identify the message.  We can pick any number we
want, but for simplicity let's just grab the "low" value (note that the
"high" value would be invalid).

So we've got this number and the same probability distribution as
before.  If we start from 0 <= X < 1 again, we start off with:

  For lower bound = 0, upper bound = 1,
        A    0   <= X < 0.3   -->   0      0.3
        B    0.3 <= X < 0.5   -->   0.3    0.5
        C    0.5 <= X < 1     -->   0.5    1

So, where does the number fit?  Well, 0.725 is between 0.5 and 1, so the
first character must be a 'C'.  So, we now apply the same transformation
that we did before when we were encoding, and reduce the range to
0.5 <= X < 1:

  For lower bound = 0.5, upper bound = 1,
        A    0   <= X < 0.3   -->   0.5    0.65
        B    0.3 <= X < 0.5   -->   0.65   0.75
        C    0.5 <= X < 1     -->   0.75   1

So where does 0.725 fit in?  In the 'B' range, so the second character
must've been a 'B'.  We again reduce the range as before, and continue
on in this manner until we've decoded the entire message.
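To tie the walkthrough together, here's a minimal floating-point sketch
of the whole round trip.  This is my own illustration, NOT this week's
program: symbols are numbered 0=A, 1=B, 2=C, and a leading 0.0 is added
to cum_prob[] so that cum_prob[sym] and cum_prob[sym+1] play the roles
of CumProb[symbol-1] and CumProb[symbol] from the formula above.

/* Toy floating-point arithmetic coder for the "CBCCA" example.
 * Relies on doubles behaving well enough for a five-symbol message. */
#include <stdio.h>

static const double cum_prob[4] = { 0.0, 0.3, 0.5, 1.0 };
static const char names[] = "ABC";

int main(void)
{
    int    msg[5] = { 2, 1, 2, 2, 0 };        /* "CBCCA" */
    double low = 0.0, high = 1.0, range, value;
    int    i, sym;

    for (i = 0; i < 5; i++) {                 /* encode: narrow the range */
        range = high - low;
        high  = low + range * cum_prob[msg[i] + 1];
        low   = low + range * cum_prob[msg[i]];
        printf("after '%c': %g <= X < %g\n", names[msg[i]], low, high);
    }

    value = low;                              /* any X in the range works */
    low = 0.0;
    high = 1.0;
    for (i = 0; i < 5; i++) {                 /* decode: find X's piece */
        range = high - low;
        for (sym = 0; value >= low + range * cum_prob[sym + 1]; sym++)
            ;                                 /* linear search */
        printf("decoded '%c'\n", names[sym]);
        high = low + range * cum_prob[sym + 1];
        low  = low + range * cum_prob[sym];
    }
    return 0;
}

Run against "CBCCA", it narrows the interval to 0.725 <= X < 0.7325 and
then decodes the same five symbols back out of value = 0.725.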
Two small issues.  First, how do we know when to stop?  For example, the
output "0.0" could mean "A", "AA", "AAA", and so on.  We can code an
infinite number of As with one bit!  We can tell the decoder to stop
with one of the techniques used earlier (either record the length at the
start of the file, or use a distinguished stop symbol).

Second, how do we find out which range it belongs in?  Well, the easiest
approach is to do a linear search through the CumProb table to find a
range which matches.  This is the approach we will use.  It's also
possible to do a binary search on it, but if we're using an adaptive
model it can be expensive to sort the table after each character.

-=*=-

Hopefully by now you can see how it works.  We start with the interval
0 <= X < 1, and at each step we reduce it according to the probability
of the symbol.  Infrequent symbols will reduce the interval more
sharply, requiring more bits than frequent symbols.  After each
reduction, we effectively expand it to full size, and then chop another
piece out of it.  Eventually, the pieces get pretty small.  It's hard to
illustrate when you've only got 80 columns and standard text characters.

-=O=-  Implementation issues

Now we have to throw away the infinite precision floating point number
stuff, and deal with 1s and 0s.

We saw that the leftmost digit will "freeze" and never change again;
this happens exactly when the digit is the same in both the high and low
boundaries (if you don't believe me, go back and look).  Whenever the
most significant bit of high and low is equal, we can shift it out and
write it to the file, because it will never change and will not have any
effect on future calculations.

The effects of simply removing it and of shifting it out are the same.
Whether we change 0.725 - 0.7325 to 0.025 - 0.0325 or 0.25 - 0.325 is
immaterial; each step depends on the DIFFERENCE between them.  This is
what allows us to use less precise arithmetic.

-=*=-

We will represent the bounds as integers, so it is convenient to switch
from [low, high) to [low, high]; that is, low <= X <= high.  What we
really want to do is represent it as [low, high + 0.11111...), where
0.11111... is an infinitely repeating binary fraction.  This suggests
that, when we shift the most significant bit off, we should shift 0s
into low and 1s into high, making it look like there's an infinite
string of 1s trailing off the end of high.

Here's how we put this concept to use.  When encoding, we want to output
bits and shift whenever the most significant bits of high and low are
the same.  So, if we're using 16-bit integers to represent the values,
we can just logically AND off the other bits with 0x8000 and compare
them.  If they match, we output the hi bit, shift both values to the
left, and then add one to the high value to preserve the 0.1111...
effect.

(You can get the same effect in Pascal by comparing against the halfway
point, which in this case is 0x8000.  If both values are above or below
it, then the hi bits must match and you know what to output.  You can
get the bit-shift effect by multiplying by 2, though you will have to
multiply (low - HALF) and (high - HALF) by 2 when the MSB is 1 if you
want to be portable to architectures with 32-bit integers.  It's also
interesting to draw it this way, because you want to do the shift
whenever high and low are both on the same side of the midpoint of the
interval.)

Note that more than one bit may match at any step.  Thus, we need to put
the shift-out stuff in a while loop.

When decoding, after we have figured out what the current symbol is, we
can shift out any bits which match, and shift in some new bits from the
input.  It's very much like encoding.
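To make the shift-out step concrete, here is a small sketch of the
encoder-side loop just described.  This is my own illustration, not code
from the "cacm" directory; output_bit() is a hypothetical routine that
appends one bit to the output file, and the underflow case discussed
below is ignored for now.

/* Shift out and write every bit on which low and high agree. */
#define TOP_BIT 0x8000

extern void output_bit(int bit);

static unsigned int low, high;              /* 16-bit interval bounds */

static void shift_out_matching_bits(void)
{
    while ((low & TOP_BIT) == (high & TOP_BIT)) {
        output_bit((high & TOP_BIT) != 0);
        low  = (low  << 1) & 0xFFFF;        /* shift a 0 into low  */
        high = ((high << 1) & 0xFFFF) | 1;  /* shift a 1 into high */
    }
}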
-=*=-

How do we identify what the next character should be?  Before we get
into that, we need a slight change in the way we store things.  We're
going to store frequency counts instead of probabilities, because
probabilities are floating point values.  Also, following the example
from _Text Compression_, we're going to store things backward, so that
CumFreq[1] is the largest and CumFreq[max_symbol] is the smallest.
CumFreq[0] holds the total frequency count, so that

        CumProb[n] = CumFreq[n] / CumFreq[0]

This allows us to deal with probabilities as integers.  CumFreq[0] must
start out at 1.

Now, to figure out which symbol's probabilities bracket the value we're
working with, we use:

        range = high - low + 1;
        srch = ((value - low + 1) * CumFreq[0] - 1) / range;
        for (sym = 1 ; CumFreq[sym] > srch ; sym++) ;

This isn't all that clear.  Rearranged, the assignment for "srch" looks
like:

               (value - low + 1) * CumFreq[0] - 1
        srch = ----------------------------------
                         high - low + 1

If you drop the +/- 1 stuff (which is there because of integer
truncation), you get:

                value - low
        srch = ------------- * CumFreq[0]
                high - low

This means that srch is being assigned a frequency count that is
proportional to the distance "value" is inside the interval.  If value
is equal to low, srch will be zero.  If it's (almost) equal to high,
then it will be (almost) equal to CumFreq[0], which translates to a
CumProb of (almost) 1.

-=*=-

It should be clear from the above that integer round-off errors could be
catastrophic if they ended up mapping two different symbols to the same
part of the interval.  Because of this, we need to guarantee that the
difference between low and high is big enough to prevent it.

This is probably the hardest part to figure out.  Nelson took one
approach, Bell/Witten/Cleary took another.  I'm going to sort of weasel
through both of them.

Suppose your range (represented as a FINITE-precision value) is
something like high = 50004, low = 49995.  They're converging, but
slowly.  It is possible for this to stretch out indefinitely, making it
(apparently) impossible to deal with incrementally, because we will
eventually run out of precision.  At that point, the difference between
low and high will be so slight that we will be stuck.

However, the middle digits are all 0s in the high one and all 9s in the
low one.  Something similar will also hold for low and high represented
in binary.  Basically, we can store a *count* of the number of digits in
the middle which don't match, and then output them when the hi digit
finally becomes equal.  Until then, we just shift them out without
touching the hi digit; for the decimal example above, you'd have
high = 54, low = 49, and underflow_count = 3.

The trick is, we won't know if these are supposed to be 0s or 1s until
the most significant bits become equal.  When they do, we want to output
the OPPOSITE value.  Why the opposite?  It's hard to explain without a
picture, but it should be clear if you notice that the strings will look
like "100001" and "011110".  It will converge to one or the other,
determined by the hi bit.

Note that even this is not a complete solution.  The underflow can
continue forever, so we will eventually overflow the underflow counter.
We can make it highly improbable by using a large enough integer to hold
the count, but a large enough message with the correct set of values
will cause the compressor to break.

We implement underflow prevention by checking the next bit over from the
hi bit (0x4000).  If the hi bits are not equal, and the next-bits are
not equal to each other or the hi bits, then we need to shift them out
of the picture.  This is actually easier than it sounds: if the hi bits
are not equal, then we know that the high hi bit is 1, and the low hi
bit is 0.  Then danger can only occur if the high next-bit is 0 and the
low next-bit is 1, which we can check for easily.

The cute trick here is that we don't need to do any weird shifting; we
can just invert the next-bit and shift it left.  This has the same
effect as yanking the next-bit out and shifting left while preserving
the hi bit.

Bell/Witten/Cleary explain and implement this by dividing the interval
into quarters.  If high and low are in their respective halves but both
are in the middle quarters, then you are headed for trouble.  (I
actually found all the half/quarter stuff to be confusing... it makes
for nice illustrations, but I tend to think in terms of bit-level
operations.)

If it feels like we are outputting a lot of bits here, your instincts
are correct.  This problem mainly surfaces for highly improbable events.

Incidentally, the decoder should be performing the exact same steps as
the encoder, so this discussion pertains to that as well.
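Here's how the earlier shift-out sketch might be extended to count
underflow bits, using the 0x8000/0x4000 tests just described.  Again,
this is my own illustration rather than the CACM code; output_bit(),
TOP_BIT, NEXT_BIT, and renormalize() are made-up names, and low/high are
assumed to be kept to 16 bits.

#define TOP_BIT  0x8000
#define NEXT_BIT 0x4000

extern void output_bit(int bit);

static unsigned int  low, high;     /* 16-bit interval bounds */
static unsigned long underflow;     /* pending "opposite" bits */

static void renormalize(void)
{
    for (;;) {
        if ((low & TOP_BIT) == (high & TOP_BIT)) {
            /* Top bits match: emit the bit, then the pending opposites. */
            int bit = (high & TOP_BIT) != 0;
            output_bit(bit);
            while (underflow > 0) {
                output_bit(!bit);
                underflow--;
            }
        } else if ((low & NEXT_BIT) && !(high & NEXT_BIT)) {
            /* Underflow: top bits differ and the next bits are closing
             * in (low = 01..., high = 10...).  Count the bit, and throw
             * away the second-highest bit by inverting it before the
             * shift below, which preserves the hi bit. */
            underflow++;
            low  ^= NEXT_BIT;
            high ^= NEXT_BIT;
        } else {
            break;                  /* nothing more to shift out */
        }
        low  = (low  << 1) & 0xFFFF;        /* shift a 0 into low  */
        high = ((high << 1) & 0xFFFF) | 1;  /* shift a 1 into high */
    }
}

When the top bits finally do match, the pending bits all come out as the
opposite of that bit, which is exactly the "100001" / "011110" behavior
described above.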
-=*=-

We are also faced with an overflow problem because of the
multiplications we are doing, i.e. range * CumFreq[].  If we want to fit
that into an integer, then we must have:

        f <= c - 2
        f + c <= p

where f is the number of bits for the frequency counts, c is the number
of bits for high and low, and p is the precision to which arithmetic is
performed (i.e. the size of your integers).

Restricting p to 16 doesn't do very well, because you end up with c=9
and f=7.  Since each symbol must have a frequency count of at least 1,
and CumFreq must be < 128, it isn't possible to represent a full set of
256 symbols.  So, we'll have to bite the proverbial bullet and let p be
32, which allows us to choose c=16 and f=14 (this expedites operations
on 16-bit ints but still has a fairly large range of frequencies).

-=*=-

We're going to continue the fine tradition of ending a file with a
distinguished stop symbol.  In this case, we need to output the bits for
the stop symbol, and then a few more to ensure that the encoded string
falls within the final range.  It turns out that it is necessary to send
two more bits (either 10 or 01) to ensure that it ends up in the right
quarter of the interval.  After those two, it doesn't matter what we
output.

(I would've thought that outputting a nice fat string of 0s would be
sufficient, but this is not the case.  The idea is to avoid underflow
problems.  Keep in mind that arithmetic coding doesn't just yank a few
bits off and find the appropriate symbol; it continuously updates low
and high, and uses them to find the interval in which the symbol lies.
If the interval is very small after the stop symbol (which is, by
definition, infrequent!), then you want to make sure the decoder can
unambiguously decode the symbol.)

-=*=-

So that's basically how it works.  By carefully selecting the bit-widths
of the variables to prevent overflow, and carefully watching the high
and low markers to prevent underflow, we can incrementally process the
whole floating point value.

It's not remarkably efficient because of the multiply and divide
instructions we have to do, but there are ways of avoiding these,
usually at the cost of poorer compression.
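Since so much hinges on those bit-width choices, here is one way they
might be written down as constants.  The names are my own, in the spirit
of the header file described in the PROGRAM section below, not copied
from it.

/* p = 32 (do the arithmetic in longs), c = 16, f = 14.
 * Check the constraints from the text: f <= c - 2 and f + c <= p. */
#define CODE_BITS   16                        /* c: bits in low and high */
#define TOP_VALUE   ((1L << CODE_BITS) - 1)   /* largest 16-bit value    */
#define FREQ_BITS   14                        /* f: bits for frequencies */
#define MAX_FREQ    ((1 << FREQ_BITS) - 1)    /* limit on CumFreq[0]     */

The adaptive model then rescales its counts whenever the total would
exceed MAX_FREQ, which is what keeps the range * CumFreq[] products
inside 32 bits.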
-=O=-  Models

In case I haven't made this clear enough already, an encoder is only as
good as the model driving it.  Arithmetic coding will encode symbols
according to their probabilities in a manner which gets very close to
the entropy of the input (the factors reducing effectiveness are message
termination overhead, use of finite-precision arithmetic, and the
scaling of frequencies so that they don't exceed the maximum).  However,
recall that the entropy is defined in terms of a particular model.

The model included with this week's program is the simple adaptive model
included in the CACM article and in _Text Compression_.  It doesn't
really do anything that the adaptive Huffman encoder didn't; it just
keeps a count of frequencies, and rescales when necessary.  The
mechanics of it are explained in the next chapter.

The presence of a nearly-optimal, universal encoder allows us to use
models which would not have been feasible otherwise.  For example, go
back to lesson 3, where we discussed order-N models and "blending".  We
can now use higher-order statistical models (like order-N context models
or finite-state models) to compute the probabilities.

Chapter 6 of _The Data Compression Book_ contains an excellent
discussion (with source code) of an efficient higher-order model.  If
you find this stuff intriguing, you might want to get a copy.
Incidentally, the figures in the book indicate that the speed of the
algorithm is less than one-tenth of LZW - and this is on a machine with
hardware multiply and divide instructions.

-=*=-

One interesting alternative is to use strictly bit-level probabilities.
Since you have only two elements, you can do away with the multiply and
divide instructions.  A method for doing this is presented in section
5.3.4 of _Text Compression_, but unfortunately the idea (known as
Q-coding) is patented.

-=O=-  PROGRAM

All of the sample code is in C.  The only one I have ported to Orca/APW
is in the directory "cacm", which holds the canonical sources.  The
files here are:

  arithmetic_coding.h    --> arith.coding.h
        Header file with bit-width definitions.

  encode.c               --> encode.ORIG
  decode.c               --> decode.ORIG
        Front-end to the compressors; replaced with the usual class
        cmain.c and carith.c.

  arithmetic_encode.c    --> arith.enc.c
  arithmetic_decode.c    --> arith.dec.c
        The main body of the encoder.

  bit_input.c            --> bit.input.c
  bit_output.c           --> bit.output.c
        Bit-level I/O routines.

  model.h
        Interface to the model.

  adaptive_model.c       --> adapt.mdl.c
        The adaptive model itself.

Note how the model is actually in a separate file, with a well-defined
interface ("start_model()" and "update_model()").  This clearly
distinguishes the model from the encoder.

-=*=-

There isn't a whole lot to say about the coder.  The main loops in both
the encoder and decoder work the way they have all along: get a
character, encode it, loop until EOF, output a stop symbol.  Opposite
stuff for the decoder.

The bit-level I/O stuff is pretty straightforward.  They do some sanity
checking in the input routines, but nothing fancy.

The encoder and decoder do exactly what I said they did.  The encoder
recomputes the range and then does some comparisons to see if it should
output bits or capture underflow bits.  The decoder uses the formula I
gave to find the symbol, and then does the same batch of stuff to check
for underflow bits or no longer needed bits.

-=*=-

The implementation of the model is interesting.  Recall that we need to
keep a list of the cumulative totals, preferably in sorted order to make
searching easier.  So, we keep the list in frequency order, with a
parallel index indicating which symbol is associated with each entry.
That way, when a symbol changes rank, we can just change the index.

When a symbol is received, its frequency is incremented, and the
cumulative frequencies of it and all those above it are incremented (so
we need two different arrays here).  This makes lookups fast, and
updates are fast for small alphabets (e.g. mostly a-z with some A-Z0-9,
instead of the entire set of 256 characters).  The cumulative
frequencies must be recalculated after the list is scaled.

Note that, since the table is sorted in frequency order, the most
frequent items are at the head of the list, so a linear search will be
reasonably efficient for small alphabets.
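Here's a simplified sketch of that bookkeeping.  start_model() and
update_model() are the interface names mentioned above, but the bodies
are my own illustration of the idea, NOT the actual adapt.mdl.c; NSYM,
MAX_FREQ, and the array names are assumptions.

/* Table slot 1 is the most frequent symbol, slot NSYM the least.
 * cum_freq[i] is the total count of all slots above i, so cum_freq[0]
 * is the grand total and cum_freq[NSYM] is 0, matching the backward
 * CumFreq[] layout used earlier. */
#define NSYM      257                /* 256 byte values plus stop symbol */
#define MAX_FREQ  16383              /* 14-bit limit on cum_freq[0]      */

static int freq[NSYM + 1];
static int cum_freq[NSYM + 1];
static int index_to_sym[NSYM + 1];   /* table slot -> symbol */
static int sym_to_index[NSYM];       /* symbol -> table slot */

static void start_model(void)
{
    int i, total = 0;
    for (i = 0; i < NSYM; i++) {     /* every symbol starts with count 1 */
        sym_to_index[i] = i + 1;
        index_to_sym[i + 1] = i;
    }
    for (i = NSYM; i >= 1; i--) {
        freq[i] = 1;
        cum_freq[i] = total;
        total += freq[i];
    }
    cum_freq[0] = total;
}

static void update_model(int sym)
{
    int i = sym_to_index[sym];

    if (cum_freq[0] >= MAX_FREQ) {      /* rescale: halve every count,   */
        int total = 0, j;               /* keeping each one at least 1,  */
        for (j = NSYM; j >= 1; j--) {   /* then rebuild the cumulative   */
            freq[j] = (freq[j] + 1) / 2;        /* totals from the       */
            cum_freq[j] = total;                /* bottom of the table.  */
            total += freq[j];
        }
        cum_freq[0] = total;
    }

    /* If other slots have the same count, move this symbol above them so
     * the table stays sorted; only the index entries need to be swapped. */
    while (i > 1 && freq[i - 1] == freq[i])
        i--;
    if (i < sym_to_index[sym]) {
        int old = sym_to_index[sym];
        int other = index_to_sym[i];
        index_to_sym[i] = sym;      sym_to_index[sym] = i;
        index_to_sym[old] = other;  sym_to_index[other] = old;
    }

    /* Bump the count, then bump the cumulative totals of everything
     * above it (i.e. cum_freq[0 .. i-1]). */
    freq[i]++;
    while (--i >= 0)
        cum_freq[i]++;
}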
-=*=-

That about does it.  PLEASE take a look at the code, even if you don't
know C.  It looks like it was translated from Pascal (they use the
Half/Quarter stuff instead of bit operations, use "foo += 1" instead of
"foo++", etc), so it shouldn't be too hard to understand.

The "for" loop that does the table search looks like this:

        for (blah ; ack ; splat) ;

Note the semicolon on the end.  This means "do nothing."  The value
desired will be in the loop counter when the loop exits.

-=*=-

The contents of the "fast" directory are for a more optimized version of
the same thing.  The "low" directory contains a version which avoids the
divides at the expense of lower-precision arithmetic for the
probabilities.  I haven't examined either of these closely, and haven't
tried to port them to the //gs, but I thought they would be interesting.

The "acms" directory contains an interesting variant from somebody's
thesis, called arithmetic coding with scalar quantization.  See the
README file for more info.

There are some sources for arithmetic coders with higher-order models,
but I wasn't able to ascertain what restrictions were placed on the
distribution.  They're really slow on 80x86s though, and the code is not
exactly intuitive, so I doubt they'd have much value here.

-=O=-  DEEP THOUGHTS

It's hard to get much deeper than this.

-=O=-  LEGAL STUFF

Here's the blurb from the top of the "acms" stuff.

    Arithmetic coding routines by A.C. Popat, 1990.  For theory, see
    Popat, Ashok C., "Scalar Quantization with Arithmetic Coding," SM
    thesis, MIT Dept. of Elec. Eng. and Comp. Sci., Cambridge,
    Massachusetts, June 1990.

    Copyright 1990 Massachusetts Institute of Technology
    All rights reserved.

    Permission to use, copy, or modify this software and its
    documentation for educational and research purposes only and without
    fee is hereby granted, provided that this copyright notice appear on
    all copies and supporting documentation.  For any other uses of this
    software, in original or modified form, including but not limited to
    distribution in whole or in part, specific prior permission must be
    obtained from M.I.T. and the author.  These programs shall not be
    used, rewritten, or adapted as the basis of a commercial software or
    hardware product without first obtaining appropriate licenses from
    M.I.T.  M.I.T. makes no representations about the suitability of
    this software for any purpose.  It is provided "as is" without
    express or implied warranty.

-=!=-
This document is Copyright by GEnie and Andy McFadden