Hacking Data Compression
This is a twelve-part course on lossless data compression. It was originally presented in the "A2Pro" forum of the GEnie online service, starting back in October 1992 (almost exactly 10 years ago to the day that I type this).
Each part of the course consists of a lesson and some source code. The sources were presented in C and 65816 assembly language, the latter intended for use on an Apple IIgs computer. The C code was actually developed under SunOS, but was usable with either of the 16-bit compilers then available for the IIgs.
Some of the commentary is out of date. For example, this was written before gzip and the "deflate" algorithm existed, at a time when "freeze" was being explored as a replacement for the UNIX "compress" command. You can see the precursors to these in lesson 11. On the flip side, we're still using Huffman encoding (e.g. it's used in bzip2), and GIF LZW is still widely used.
I've made the lesson text available here. For the full original lesson files, complete with source code and other goodies, you can download the full archive in ShrinkIt format. You will need a utility like NuLib2 or CiderPress to unpack it.
- Lesson 1: Introduction
- Course overview and introduction to compression concepts.
- Lesson 2: Basics
- Discussion of RLE and differential coding.
- Lesson 3: Somewhat Clever Ad-Hoc Schemes
- Simple character encodings, MTF, digrams; intro to higher-order models.
- Lesson 4: Shannon, Fano, and Huffman
- Explanation of probability encoding.
- Lesson 5: Adaptive Encoding
- Adaptive vs. static methods, Splay trees, adaptive Huffman.
- Lesson 6: Arithmetic Coding
- How arithmetic encoding works; includes the CACM sources and variations.
- Lesson 7: Ziv and Lempel
- Discussion of LZ77 and LZ78.
- Lesson 8: LZW
- Detailed explanation of the LZW algorithm.
- Lesson 9: NuFX and ShrinkIt
- Discussion about the compression used in ShrinkIt.
- Lesson 10: LZSS
- Detailed explanation of the LZSS algorithm.
- Lesson 11: LZH, LZARI, and LZB
- How to improve LZSS with Huffman, arithmetic, and unary coding.
- Lesson 12: HardPressed(tm)
- How to write compression modules for HardPressed.
I also have the original syllabus, which more or less matches the above.
Historical note: this was written during the same period that HardPressed was under development, and while I was working full-time. This might help to explain why a 12-week course was presented over a period of nine months.
For anyone intrigued by the "stackable" compression modules described in lesson 12, I have made the complete source code to HardPressed available on this site.