This is the first part of a three-part article about data compression, with particular focus on media (images, audio and video). It does not require any knowledge of information theory, mathematics or programming. I have tried to use language accessible to anyone without sacrificing the correctness and thoroughness of the information provided.
Use the following links to display each part of the article:
- Part 1 - Lossless data compression
- Part 2 - Lossy compression of stills and audio
- Part 3 - Video compression
When some issues are considered more complex (or less relevant), they will be placed inside a note box. You don't need to read the notes to understand the fundamental principles being described, but you might want to check them out anyway, especially if you noticed a potential problem with some special case of the processes described in the previous paragraph(s).
Binary digit (bit) will sometimes be abbreviated as "b", and bit octet (byte) will sometimes be abbreviated as "B". Those abbreviations might be prefixed by standard SI magnitude symbols like "k" (kilo), "M" (mega) or "G" (giga). So "kb" stands for "kilobit", "kB" stands for "kilobyte", "Mb" stands for "megabit", and so on. When the number is represented using a "binary magnitude" (i.e., using a factor of 1024, instead of 1000), it will use the IEC binary prefixes instead (KiB, MiB, GiB). See the FAQ for more details about this distinction.
The advantages of using data compression are pretty straightforward. It lets you store more stuff in the same space, and it lets you transfer that stuff in less time, or with less bandwidth.
I used the word "stuff" and not "information" because, for pure data compression algorithms, the compressed file contains exactly the same amount of information as the original one. Of course, this is using the term "information" in a mathematical sense (normal people wouldn't consider a ZIP-compressed text file particularly "informative"). As mentioned above, this article isn't going to be too technical and you don't need any specific knowledge of the underlying mathematics or terminology to understand it, but still, it's important to point out that there is a distinction between information and data (okay, "stuff" isn't the actual technical term for it). For example, if someone sends you the same e-mail twice, you have two e-mails' worth of data, but just one e-mail's worth of information.
Note...
Actually, you might consider that, by receiving the same e-mail twice, you have learned something new (ex., that your friend could be having short-term memory problems), so the repetition itself might be considered to add a little bit of information.
The fundamental idea behind digital data compression is to take a given representation of information (a chunk of binary data) and replace it with a different representation (another chunk of binary data) that takes up less space (space here being measured in binary digits, better known as bits), and from which the original information can later be recovered. If the recovered information is guaranteed to be exactly identical to the original, the compression method is described as "lossless". If the recovered information is not guaranteed to be exactly identical, the compression method is described as "lossy".
Let's start by looking at a simple lossless compression algorithm, called "run-length encoding", or RLE.
Let's say we have a text file consisting of the following characters:
AAAAAAAAAABBCDEEFFFFFFFFFFFFFGGGGGGGGGGGGGGGGG
Run-length encoding replaces "runs" (that is, sequences of identical characters) with a single character, followed by the "length of the run" (the number of characters in that sequence), or vice-versa (first the length and then the character, the order isn't important as long as it's always the same). If we apply this algorithm to the text file above, we'll get something like this:
10A 2B 1C 1D 2E 13F 17G
We have just gone from 46 characters to 17, and that's using a plain text representation for the lengths. I used spaces between the pairs, and different colours for the lengths and characters, to make it easier to read, but naturally these wouldn't be necessary for a computer. In computer-generated files, each character typically requires one byte to be stored (in fact, "character" is commonly used to describe any value that's exactly one byte long, even if it's not in a text file), and a single byte can also be used to represent run lengths up to 255 characters long (256, if we also use the value "zero"), so the above could be encoded in just 14 bytes (7 lengths and 7 characters, each pair requiring 2 bytes).
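If you're curious about what this looks like in practice, here is a minimal Python sketch of the scheme just described (the function names are my own, and clarity is favoured over speed):

```python
def rle_encode(data: bytes) -> bytes:
    """Plain RLE: every run becomes a (length, character) pair of bytes."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))   # length first, then the character
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    out = bytearray()
    for length, char in zip(data[0::2], data[1::2]):
        out += bytes([char]) * length
    return bytes(out)

sample = b"AAAAAAAAAABBCDEEFFFFFFFFFFFFFGGGGGGGGGGGGGGGGG"
assert len(rle_encode(sample)) == 14               # 7 pairs of 2 bytes each
assert rle_decode(rle_encode(sample)) == sample
```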
But you may have noticed that not all sequences of characters became shorter using this new representation. For example, "BB" became "2B", which takes up the same two bytes. And "C" actually doubled in size (from one to two bytes), becoming "1C".
It probably also occurred to you that runs of more than two identical characters are rare in human languages. So if you take a real sentence, such as:
Mary had a little lamb.
And run it through our RLE "compressor", you get something like this:
1M 1a 1r 1y 1 1h 1a 1d 1 1a 1 1l 1i 2t 1l 1e 1 1l 1a 1m 1b 1.
In other words, the only part that doesn't get bigger is the "tt" in "little", and even that just manages to stay the same size (2 bytes). So we've just gone from 23 bytes (23 characters) to 44 (remember, spaces in the original sentence must be encoded, too, otherwise they won't be there when we decompress it). Not exactly brilliant, as far as "compression" goes.
However, if we have, say, an invoice, where a lot of lines are similar to this:
Item X............................................................$100.00
Then maybe RLE can do something for us. After all, there are 60 dots between the item name and its price. Although "Item X" would double in size (from 6 to 12 bytes), and "$100.00" would also grow (though just from 7 to 10 bytes, since the "00" sequences wouldn't grow), the sequence of dots would go from 60 to 2 bytes. In other words, the complete line would go from 73 bytes to 24. That's a reduction to about 1/3rd of the original size. Not too shabby.
But if the same file contained a couple of paragraphs of "normal" text, with few or no repetitions, all our gains would be lost. So ideally, we'd like to have a way to compress only some parts of the file, while leaving others uncompressed.
We can do this quite simply by defining a special "marker" that indicates the beginning or the end of each block of compressed text. For example, let's say we use an asterisk as our marker, and determine that compression is off by default and will only be turned on when there are three or more identical characters in a row. We now get (after compression):
Item X*[60].*$100.00
That's just 17 bytes (the square brackets indicate that the number between them is stored as a value, and not as a sequence of decimal digits; the value "60" can be represented by a single byte, in a computer file). And if we run the previous sentence through our "enhanced RLE" compressor, we now get:
Mary had a little lamb.
There's no sequence that would benefit from RLE compression, so our compressor leaves it alone. No gain, but no loss, either. It's starting to look as if RLE can, at the very least, "pay for itself" (i.e., not make files bigger). But what if the original file contained an asterisk? If we just pass that asterisk directly into our compressed file, like any other character, that's going to confuse our decompressor (it's going to assume that the compression has been toggled, and misinterpret the rest of the file). We can deal with this in several ways, but the most generic one is to define a new rule: if our marker appears in the source data, that part of the data is always encoded (i.e., it's treated as compressible data). So, if we have the following source data:
Figaro was the city's factotum*.
And run it through our compressor, we get:
Figaro was the city's factotum*[1]**.
This increased the size from 32 bytes to 35, so, ideally, we would instead pick a different character to use as our "marker". As long as the encoder and decoder both use the same character and know the same rules, it will work. We could find out which character is rarest in written English, and always use that, making the situation above extremely rare. But we might want to use our compressor on other data types (not just English text), so ideally it would scan the source file, find the rarest character in it, and use that as the special marker (and list it in the file header, so the decoder would know which character to look for).
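Scanning for the rarest byte is simple enough to sketch in a few lines of Python (again, the function name is illustrative):

```python
from collections import Counter

def pick_marker(data: bytes) -> int:
    """Return the least frequent byte value in the source data,
    to be stored in the header and used as the toggle marker."""
    counts = Counter(data)
    return min(range(256), key=lambda value: counts.get(value, 0))
```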
Note that the method described above works for any number of markers in a row. For example, if you have a sequence of three markers (here represented by asterisks) in the source file:
***
And the compression toggle was already "on", those would be encoded as:
[3]*
If the compression toggle was "off", then it would be turned on by one asterisk, and the sequence would be encoded as:
*[3]*
...optionally adding another marker at the end to turn compression back off (if the following data wasn't compressible). The following examples should clarify how the data would be encoded if the marker sequence appeared in the middle of compressible or incompressible data:
The sequence AAA**DDDD would become *[3]A[2]*[4]D .
The sequence ABC**DEFG would become ABC*[2]**DEFG .
Note...
How does the decoder know that the asterisk inside the compressed block is a character from the original text, and not another compression toggle marker? Because the previous byte was a run length. Counting from a marker, odd bytes are run lengths and even bytes are characters from the original data. The length and character always appear as pairs, so an asterisk can only be considered as a compression toggle marker if there is an even number of bytes between it and the previous marker. This means the decompressor will never confuse a "character" with a marker, even if they have the same byte value. It could, however, confuse a run length with a marker, as described below.
We still have to deal with the case of a run length being identical to the value of our "special marker" byte. For example, if our marker was an asterisk, that has the ASCII byte value 42, then any sequence of 42 characters would cause an asterisk (i.e., a byte with the value 42) to be inserted in the compressed file. Later, the decoder would have no way to distinguish this asterisk from a "real" marker, since both would be bytes with the value 42, and both could appear in the same position relative to the previous marker.
For example, if there was a sequence of 42 identical characters in our source file:
AAAHH!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!???
This would be encoded as:
*[3]A[2]H[42]![3]?
However, the byte [42] happens to be the ASCII code for the asterisk, so the sequence above would be seen by the decoder as:
*[3]A[2]H*![3]?
In other words, the decoder would interpret the counter [42] as a compression toggle, and treat anything after it as uncompressed data, failing to decompress the exclamation and question marks, and adding a byte with the value 3 between them (a byte with the value 3 is actually interpreted as "end of text" by some text editors, which would make things even more confusing).
To deal with this, the encoder needs to translate all run-length values (ex., using an indexed table, or some rule that the decoder recognises), so that the run length of 42-character sequences is represented by another byte value (ex., zero). This means that although a byte can have 256 different values, our run length bytes can only use 255 different values, because one is reserved as our "special marker". This might seem like a small detail, but, if we ignored it, some files created by our compressor would later be impossible to decompress correctly.
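A minimal sketch of one such translation rule (my own illustration, not the only possibility): since a run length of zero never occurs naturally, that value can stand in for the reserved one, assuming the marker itself isn't zero:

```python
def encode_run_length(length: int, marker: int) -> int:
    """A run length equal to the marker byte is remapped to 0,
    which is never needed as a real run length."""
    return 0 if length == marker else length

def decode_run_length(byte: int, marker: int) -> int:
    return marker if byte == 0 else byte
```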
Note...
In fact, a byte with the value zero (often called a NULL) is usually an excellent choice as a default marker, since we would probably never want to use zero as a run length anyway. However, if the source file we are trying to process contains a lot of null bytes, we might end up wasting a significant amount of space encoding those, so scanning the source for the rarest character is still the preferred solution.
There is one last optimisation worth mentioning, which is related to sequences of exactly two identical characters. Since those sequences always take up two bytes (ex., AA or [2]A), the decision to compress them or not should (ideally) be based on whether they appear immediately after compressible data or not. For example:
ABCDZZ would be encoded as ABCDZZ (i.e., the "ZZ" sequence would be treated as uncompressible).
AAAAZZ would be encoded as *[4]A[2]Z (i.e., the "ZZ" sequence would be treated as compressible).
Although this makes no difference to the size of that sequence itself, it avoids adding unnecessary markers to define the start or end of compressed blocks.
Let's look at what we've achieved so far: we started with an algorithm that looked promising, realised it was no good for text (most text files would double in size), isolated the problem (normal text rarely has more than two identical letters in a row), came up with a solution (the marker to divide the data in compressed and non-compressed blocks), added a way to encode the marker itself when it appears in the source data, and added two extra optimisations (use the rarest character as the marker, and avoid needless compression toggles in the case of 2-character sequences).
Note...
At this point I could include the full source code of a generic RLE compressor and decompressor, but that's not really the point of this article. If you are interested in programming, the text and sketches above have enough information to get you started, and you can also find several open-source RLE compressors on the internet.
Still, all the optimisations and special cases we added above were merely a way to avoid making the "compressor" increase the size of the files. The actual compression (i.e., reduction in size) will only work when there are sequences of more than three identical characters, and in normal text that's relatively rare. Most large text files do have enough such sequences (ex., of spaces, or linefeed characters) to let our "enhanced RLE" compressor save a few bytes, but a 2% or 3% compression factor probably isn't worth the hassle of compressing and decompressing every single file whenever we want to use it.
So, is RLE useless, except as a case study? No.
There is one type of data where very long sequences of identical values are common. As you may have guessed, I'm talking about images, specifically about high-contrast images with few colours, such as schematic drawings, logos, text scans (including most faxes) and cartoons. It's also common, in some CG animations, to use a plain background (to be replaced later), meaning that hundreds or even thousands of sequential pixels have exactly the same colour.
Note...
For Terry Pratchett fans, RLE is the "squeezing algorithm" described by Lieutenant Blouse in the book "Monstrous Regiment" when talking about how to speed up the transmission of maps through the clacks system.
Although another "family" of algorithms (which will be mentioned later) can provide compression at least as good as RLE even in those situations, RLE has the advantage of being very simple, and therefore very fast to execute. The TARGA (.TGA extension) image file format can use RLE compression, and is very popular with animators, since RLE-compressed TGA sequences can often be read from disk, decompressed, and shown on screen faster than uncompressed images. This is because CPUs and RAM are so much faster than hard drives that it's quicker to load, say, 500 kB of compressed data and decode it than it is to load 1000 kB of uncompressed data. Although a more advanced compression algorithm might get the size of that image down even further (ex., to 300 kB), the extra complexity would require more CPU cycles to decompress, resulting in overall lower performance. Also, virtually all image editing applications can read and write TGA files, which makes them a good choice for transferring frames from one program to another. Unfortunately some programs only support uncompressed TGA files.
Other image file formats that can use RLE compression include BMP, PCX, PackBits (a TIFF sub-format) and ILBM (an old format, used mainly in Amiga computers). Fax machines also use RLE (combined with another technique called entropy coding) to compress data before transmitting it.
Pixel colour information is generally stored as a sequence of three values, representing the amounts of red, green and blue (RGB) that define it. Black is usually represented as [0] [0] [0], while white is represented as [255] [255] [255] (this is because 255 is the highest number that can be stored in a single byte; you can think of it as "100%"). Red is represented as [255] [0] [0], green is represented as [0] [255] [0], orange is represented as [255] [128] [0], and so on. If you're familiar with digital colour representation this should be trivial, and if you're not, you don't really need to worry about it now. The only thing to remember is that (in most digital image formats, including TGA) each pixel's colour is represented not by one, but by three numbers.
Note...
In the case of images with built-in transparency data, each pixel is represented by four values, where the fourth (called alpha) defines the opacity of that pixel. Here we will assume the image has no opacity channel, but the same principles would apply if it did (with 4 channels instead of 3). The alpha channel is usually the easiest one to compress, since most pixels tend to be 100% opaque or 100% transparent.
Remember how RLE works? It "compresses" sequences of identical values. So what happens if we have, say, a completely orange image? Each pixel will be represented by a sequence of 3 bytes with the values 255, 128 and 0 (or similar ones, depending on the exact tone of orange).
[255][128][0][255][128][0] ...etc.
This sequence will repeat itself thousands of times (307200 times, for a 640x480 image), but there will never actually be a sequence of three or more identical values in our data. In other words, our RLE algorithm won't do anything to it.
Fig. 1 - A sequence of orange pixels
So, what can we do to make it work? We could rewrite the algorithm to use sequences of 24 bits (3 bytes) instead of 8 bits (1 byte), but there is a simpler (and generally more efficient) solution.
Instead of sorting the pixel colour data by pixel, we sort it by channel (a channel is each of the colour components; in this case red, green and blue). So, instead of having a file with 307200 repetitions of [255] [128] [000], we have a file with 307200 repetitions of [255], followed by 307200 repetitions of [128], followed by 307200 repetitions of [0]. And this kind of file is what RLE eats for breakfast.
It can't encode runs of more than 255 identical values, but each time one of those runs is encoded, it shrinks from 255 bytes to 2. Each of the three sequences of 307200 identical values can be encoded in just 2410 bytes. The full pixel data goes from 900 KiB to just over 7 KiB. That's a reduction of more than 99%!
This is a very special case (an image with a single colour, essentially just a "background"), but sorting the data by channel will improve the RLE compressibility of nearly all images. Naturally, we must also update the decompressor so that it restores the original order (by interleaving the channels).
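Here is what that reordering step might look like in Python, assuming the pixel data is a flat sequence of bytes in RGB order:

```python
def interleaved_to_planar(pixels: bytes) -> bytes:
    """[R G B R G B ...] -> all the R bytes, then all the G, then all the B."""
    return pixels[0::3] + pixels[1::3] + pixels[2::3]

def planar_to_interleaved(planes: bytes) -> bytes:
    """The decompressor's inverse step: restore the original pixel order."""
    n = len(planes) // 3
    out = bytearray()
    for r, g, b in zip(planes[:n], planes[n:2 * n], planes[2 * n:]):
        out += bytes((r, g, b))
    return bytes(out)

orange = bytes([255, 128, 0]) * 307200     # our single-colour 640x480 image
planar = interleaved_to_planar(orange)     # three very long runs: RLE-friendly
assert planar_to_interleaved(planar) == orange
```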
This example shows how some of the steps involved in data compression don't reduce the amount of data per se, but simply reorganise it (based on some knowledge or "hunch" about patterns in that data) in such a way that the compression steps are more efficient. This same reorganisation wouldn't do anything for text or audio data, because there's no 3-byte pattern in the distribution of those types of data. It would just take longer and make the algorithm more complex, with no benefits. But when we apply a compression algorithm to a specific type of data, adapting it to that data's "typical" patterns can definitely pay off.
Fig. 2 - Sorting RGB channel data sequentially to improve RLE compression
Naturally, RLE-compressed TGA files use the trick described above, as do most other image compression algorithms, even when they're not based on RLE (or when the data isn't in RGB, for that matter).
As you have probably guessed, though, RLE by itself will never achieve high levels of compression on photographic images, or any other kind of image where there are no adjacent pixels with the same colour. A typical RLE image compressor will reduce the size of an uncompressed 24-bit RGB photograph by less than 5% (ex., from 900 KiB to 860 KiB). Reducing the number of colours in the image improves the compression (the fewer colours there are, the more likely it is that two adjacent pixels have the same colour), but decreases the image quality.
There are, however, some compression techniques that combine RLE with other algorithms, as we'll see below.
This concludes our analysis of the RLE algorithm (for now). I will not go into as much detail about the next algorithms, since they are significantly more complex, and specific optimisations are hard to describe without filling the pages with mathematical formulas or program code.
In our example for the RLE image compressor we had a file where a sequence of three bytes - [255] [128] [0] - was repeated several thousand times. Since RLE wasn't able to compress that kind of pattern directly, we added another step to our algorithm, to sort the pixel colour data in a different way. We were able to include that transformation in our algorithm because we already knew that RGB image data tends to follow a certain pattern (that repeats itself after 3 bytes). But what if we were compressing a file where the typical patterns have an unknown length, or where there are multiple patterns, with different lengths?
Imagine we have a 24-bit RGB image (8 bits per channel) where the first pixel is blue, the second one is orange, the third one is blue again, the fourth is orange, and so on. In other words, let's say the (uncompressed) data for our image looks like this:
[0][0][255] [255][128][0] [0][0][255] [255][128][0] ...etc.
Fig. 3 - Blue and orange pixel pattern
This sequence might seem slightly compressible, since there is a sequence of three repeated values followed by a sequence of two repeated values that appears again and again in our file. However, having to constantly switch between compressed and uncompressed mode would actually make the file larger:
[0] [0][0][255] [255] (5 bytes) would be converted into *[3][0][2][255]* (6 bytes, including the markers).
Applying our "RGB optimisation" (sorting the data by channel) wouldn't help, either. We'd get this:
R: [0][255][0][255][0][255] ...etc.
G: [0][128][0][128][0][128] ...etc.
B: [255][0][255][0][255][0] ...etc.
Still no sequences of identical values, so RLE wouldn't be able to compress it. We could come up with a more elaborate way to reorganise the data, but what if we then got a file with sequences of three different pixels instead of two? We'd have to keep changing our compressor (and decompressor) to adapt it to the contents of each image. Not very practical.
Instead, wouldn't it be better if the compressor could look for data patterns, regardless of their length, and replace the repetitions of those patterns with some sort of "counter", like RLE does for individual values?
Before going into specific details, let me call your attention to some examples of "compression" in our everyday language.
Instead of writing "Red, Green and Blue", we often write just "RGB". The same goes for things like HTML ("Hyper-Text Markup Language"), PC ("Personal Computer"), RAID ("Redundant Array of Inexpensive Disks"), and so on. Acronyms and initialisms are a form of data compression. However, they can't be "uncompressed" by themselves; the reader needs to know (or find out) their meaning by matching them to the "uncompressed" version.
In books, it's common to use a footnote to explain unusual terms or acronyms. Another option is to present the acronym only after it has been described. For example, the manual for a compositing program might include:
All these effects are available when working with high dynamic range (HDR) images.
And, from then on, always abbreviate the expression "high dynamic range" (because the authors assume people read the previous sentence, and still remember the meaning of that abbreviation):
To avoid clipping the highlights, make sure HDR mode is enabled.
When someone says "I saw a gemsbok", you might need to reach for your dictionary or encyclopedia to learn that a gemsbok is a South-African antelope. So, in a way, the word "gemsbok" was also a "compressed" version of "species of antelope native to South Africa". As long as you remember that, you can use the "compressed" term (in other words, you only need to reach for the dictionary the first time you come across the term - unless you happen to forget it).
Abraham Lempel is a computer scientist and Jacob Ziv is an electrical engineer. Together, in 1977 they published an algorithm, called LZ77, based on some of the principles described above. A slightly different version was published in 1978, under the name LZ78. And perhaps its most famous variant was published in 1984 by Terry Welch, under the name LZW.
All of these algorithms (and several others, based on them) belong to a family known as "dictionary-based compressors". They work by replacing redundant (i.e., repeated) source data with references to its previous appearance (LZ77) or by explicit references to a "dictionary" compiled from all the data in the source file (LZ78).
For example, if we have this sentence:
The Flying Spaghetti Monster is real. I believe in the Flying Spaghetti Monster.
We can replace the second instance of the words "he Flying Spaghetti Monster" (we can't include the "T" from the first word because one is uppercase and the other is lowercase) with a "length-distance pair" - a pair of numbers representing the distance to the previous occurrence of those words and the number of characters that are identical. In this example, that pair would be length = 27 & distance = 51. Again, we can use a special marker (represented here by an asterisk) to indicate that what follows is compressed data:
The Flying Spaghetti Monster is real. I believe in t*[27][51]*.
When the decoder encounters that code, it will step back 51 characters and copy a 27-character sequence into the new position. So we've just compressed the words "he Flying Spaghetti Monster" (27 bytes) into two values (plus two markers), which take up only 4 bytes.
This technique (of looking back and copying sequences using a length-distance pair) is called a "sliding window" algorithm. It's a special type of dictionary algorithm, where the dictionary is built along the way, and older sequences are forgotten (as they fall out of the "window"). If we chose to use a single byte to store the distance (of the length-distance pair), then we'd be limited to going back 256 characters, and there wouldn't be any point in keeping older sequences. If we chose to use a 2-byte value (known as a "short integer", or "int16") instead, then we could go back up to 65536 characters, and so on. The bigger the size of the window, the more efficient the compression will be, but it will require more RAM (system memory) to run.
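The core of the sliding window search can be sketched quite briefly, though a real implementation would use hash tables (and cleverer encodings for the output tokens) instead of this deliberately naive brute-force search. A rough Python illustration, with names and parameters of my own choosing:

```python
def lz77_encode(data: bytes, window: int = 65536, min_match: int = 4) -> list:
    """Greedy LZ77 sketch: a list of literal bytes and (length, distance) pairs."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):    # every position in the window
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_match:
            out.append((best_len, best_dist))
            i += best_len
        else:
            out.append(data[i])                    # a single literal byte
            i += 1
    return out

def lz77_decode(tokens: list) -> bytes:
    out = bytearray()
    for token in tokens:
        if isinstance(token, tuple):
            length, dist = token
            for _ in range(length):                # byte by byte, so a copy may
                out.append(out[-dist])             # overlap the data it creates
        else:
            out.append(token)
    return bytes(out)
```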
Note...
Why 256 characters when 255 is the maximum value a byte can have? And why 65536 when 65535 is the maximum value a short integer can have? Simple: we can use the value 0 (zero) to represent 256 and 65536, respectively, since it would never be used otherwise. We don't necessarily want to do that, though; we might prefer to reserve the value 0 to indicate that some error has occurred.
In a "pure" dictionary based encoder, the sequences are stored at the beginning of the compressed file, and remembered for all the duration of the compression and decompression. That allows the compressor to achieve better results with the same amount of RAM, but requires extra processing (to create the dictionary, the compressor needs to analyse all the source data before it starts to actually compress it).
So, let's see what our new algorithms can do with the image examples mentioned above. First with the pure orange background (which RLE was able to handle quite well, bringing the size down to about 7 KiB). The source data (RGB byte values) looks like this:
[255] [128] [0] [255] [128] [0] ...etc. (for a total of 307200 pixels, or 921600 bytes)
Fig. 1b - A sequence of orange pixels
Using our LZ77-style sliding window technique, we can replace those 921600 bytes with this:
[255][128][0] * [921597][3] *
Note...
It might sound a bit confusing to say that the decoder must go back 3 bytes and then copy 921597 bytes. But if you think of this operation as taking place one byte at a time, you'll realise that it's perfectly possible, since new bytes are being inserted after our original three, and those are the source for the next copy operations.
We can't store the number 921597 as a single byte (the maximum would be 255 - assuming we don't reassign zero), and we can't even store it as a short integer (the maximum would be 65535), so let's assume we use a 4-byte value (known as an "integer", or "int32"), that can store values from 0 to 4294967295.
This means we have three source data bytes, followed by a 1-byte marker, followed by a 4-byte length, followed by a 1-byte distance, followed by another 1-byte marker (which isn't really necessary, since there's no data after it, but let's include it anyway). This adds up to 10 bytes.
Note...
Although this might seem a lot better than the result we got from our RLE compressor in the previous chapter, the comparison is unfair, because we limited the RLE compressor to using a single byte to store the length for each sequence, and we are allowing the LZ compressor to use 4-byte integers. If we had allowed our RLE encoder to use 4-byte values, it would have been able to compress the raw image data into 15 bytes; not as good as the LZ algorithm, but pretty close.
If we use an LZ78-style compressor instead, our compressed data would look something like this:
Dictionary: [X][255][128][0]
Reference list: [921597][X]
Where X is a code generated by the encoder. It's harder to measure the exact size here, since the file must use different types of markers to separate codes from symbols, lengths from references, and the dictionary itself from the reference list, but overall the results for such a simple pattern would be very similar to the sliding window version (in fact, both types of algorithm are mathematically equivalent).
Now let's look at the second case, where the colour of the pixels alternates between blue and orange (which we couldn't compress at all, using RLE). The (uncompressed) source data looks like this:
[0][0][255] [255][128][0] [0][0][255] [255][128][0] ...etc. (for a total of 921600 bytes)
Fig. 3b - Blue and orange pixel pattern
If we apply our LZ77-style compressor to it, we get this:
[0][0][255][255][128][0]*[921594][6]*
Which can be stored in 13 bytes.
The length-distance pairs can be concatenated, of course. For example, to store an image made from a pattern of sequences of 16 white pixels followed by 32 black pixels, we could do this:
[255]*[47][1]*[0]*[95][1][921456][144]*
That's a total of 21 bytes (counting each length as a 4-byte value and each distance as a single byte, as above). When that data is processed by the decompressor, the first length-distance pair adds 47 bytes with the value 255, which (added to our first byte) describe 16 white pixels (3 bytes per pixel, RGB). The second length-distance pair adds 95 zeros to our existing zero, thus describing 32 black pixels.
Fig 4. - Pattern with 16 white and 32 black pixels
The third length-distance pair takes all the bytes created by the first two (a total of 144 bytes) and repeats them for a further 921456 bytes, to make a grand total of 921600 bytes (the total number of bytes in an uncompressed 8-bits-per-channel RGB 640x480 image).
Given these results, it's easy to see why the LZ algorithms are the basis of most modern lossless data compressors. Naturally, the more complex the pattern(s), the bigger the file will get. The patterns described above were all relatively short and simple (hence the spectacular compression ratios), but even on a plain text file (which RLE would hardly be able to compress at all), LZ algorithms can deliver a compression ratio of about 3:1 (in other words, a 1 MB text file can usually be compressed to less than 350 kB).
LZ algorithms compress data by locating and eliminating redundancy (repeated patterns). This makes them very good at compressing some types of images (ex., images with regular-pattern halftoning), but their performance is still highly influenced by the number of colours used. 24-bit photographs can have so many different colours that repeated patterns are relatively rare, and even LZ algorithms specifically optimised for image compression (that treat the RGB channels separately, etc.) will rarely achieve more than 30% compression (i.e., a 900 KiB uncompressed photo will typically be compressed to something between 600 and 700 KiB).
As with RLE, reducing the number of colours will improve compression, but at the cost of image quality. The chart below (linear scale) shows the influence of the number of colours on the compression achieved by an LZ algorithm. The uncompressed image size was 900 KiB.
Fig 5. - Influence of palette size on LZ compression
It might seem strange that the 8-bit (and therefore 256-tone) grayscale image is less compressible than the 256-colour image, but this is perfectly normal and happens with most images. Limiting a colour image's palette to 256 entries forces some similar colours to be merged (i.e., replaced by a single colour), which increases the chances of repeated sequences in the file (and that, in turn, makes it more compressible). Converting to grayscale is less likely to make adjacent pixels identical, since the resulting image preserves most of its original luminance information, and luminance tends to have more variation than hue, in "natural" images (we'll see how this can be used to improve compression in the second article in this series).
Given the complexity of LZ compressors, this article will not include any specific implementation or optimisation details. If you are interested in more information, you should be able to find it easily on the internet.
LZ algorithms are used by several lossless image file formats, such as GIF and TIFF. They are also used by generic data compression formats such as ZIP, RAR, 7-Zip, etc.
Entropy coding (or encoding) is a technique that assigns codes to blocks of data (sometimes called "symbols") in such a way that the length of each code grows as the statistical probability of its symbol falls (ideally, the code length is proportional to the negative logarithm of the probability). In other words, it's a way to assign shorter codes to common data blocks, while assigning longer codes to rarer data blocks.
Naturally, it's only useful when the codes use a variable length, and when the number of codes is such that not all can use the shortest size.
Note...
This might sound similar to dictionary compression, but it operates at a different level. A dictionary encoder will usually replace variable-length "definitions" with fixed-length "references" (ex., replace a 42 character sequence with an 8-byte reference, then replace a 53 character sequence with another 8-byte reference, and so on). Entropy coding typically replaces fixed-length "symbols" (data blocks) with variable-length "codes" (shorter codes for more frequent symbols). This is why the two methods are sometimes used in sequence: data is compressed using a dictionary encoder, and then the dictionary entries themselves are entropy-coded.
An effect similar to entropy encoding can be observed in human languages. Frequent words (like "yes", "the", or "get") are shorter than infrequent words (like "otolaryngologist", "comprehend" or "dermatoglyphics").
A practical example: Let's say we want to assign a number to each type of pet that people may have, and there are 15 different types of pets. Since there are more than 10, some pets will get a single-digit number (0 to 9), while others will get a double-digit number (10 to 14). By assigning the single-digit numbers to the most common pets, we cut down on how much we'll need to write later.
One of the most common algorithms for entropy encoding is known as "Huffman coding", named after its creator, David Huffman. It works by generating codes in such a way that a code for one symbol is never identical to the start of another code. This ensures that codes can't be confused with each other, and does away with the need for special markers. For example, if we had the following codes:
Code A: [1] [1]
Code B: [0] [0]
Code C: [1] [1] [0]
Code D: [0] [1] [1]
And then came across this sequence:
[1] [1] [0] [0] [1] [1]
There would be no way to decide if it meant [C] [D] or [A] [B] [A] (because A is identical to the start of C). So we'd have to introduce some sort of marker (ex., [1] [0]), which the decoder would look for to determine where the next code would start. This marker would then be unavailable for use in any code (or we'd have the same problem all over again). By ensuring that no code matches the prefix of another code, Huffman coding eliminates the need for that "special marker". Huffman coding is relatively simple and quick to compute, which makes it a popular type of entropy coding.
Note...
Huffman coding is so popular, in fact, that the term "Huffman" is frequently applied to other forms of prefix-free entropy coding, even when they don't actually use David Huffman's algorithm.
Applying this concept (prefix-free coding) to the pet list example above, to eliminate the chance of pet numbers being confused with each other in situations where people have more than one pet (ex., confusing "1 2" with "12"), we could simply not use number "1". That way only the numbers above nine would start with a "1". Naturally, this would mean we'd have to go one number higher (ex., up to 15 instead of 14), but it would also mean we wouldn't have to use spaces between each person's various pets. For example, if your list of pets consisted of the following codes...
0, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
...then, "0221112" could only mean one pet of type "0", two pets of type "2", one pet of type "11" and one pet of type "12"). In fact, since no two codes have the same start, we would write them in any order (ex.: "1112022" or "2212011"), and it would always mean the same.
If there were more than 19 types of pets (and you had to use the number "20" to represent one of them), then you would also need to eliminate the code "2", and so on (or, in some cases, it might be preferable to eliminate the code "10" and use "100", "101", "102", etc. for the new codes; it's all about balancing the length of the codes with the frequency of the source blocks, to achieve the best compression possible).
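For the curious, Huffman's algorithm itself is short enough to sketch: repeatedly merge the two least frequent entries, adding one bit to the front of their codes each time. A rough Python version, with illustrative names:

```python
import heapq
from collections import Counter

def huffman_codes(data: bytes) -> dict:
    """Build a prefix-free code table: no code is the start of another,
    and the most frequent bytes end up with the shortest codes."""
    freq = Counter(data)
    # Heap entries: (frequency, tiebreaker, partial code table).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)     # the two least frequent entries
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        tiebreak += 1
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
    return heap[0][2]

print(huffman_codes(b"AAAABBBC"))
# {65: '0', 67: '10', 66: '11'} -- 'A' (most frequent) gets the 1-bit code
```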
A superior (but slower) method of entropy coding is known as "arithmetic coding". This encodes any sequence of values as a single number between 0.0 and 1.0. How does this work? The encoder starts by analysing all the data and determining which symbols (blocks of data) are more common and which are less. For example, let's say we are trying to encode this set of symbols:
A A A B B B A C
Here we have four As, three Bs and one C. So the probability of A is 50% (4 out of 8), the probability of B is 37.5% (3 out of 8) and the probability of C is 12.5% (1 out of 8).
Now that our encoder knows that, it divides the 0.0 - 1.0 interval in segments, each of which corresponds to one of the symbols to be encoded. More frequent symbols get bigger segments (meaning they require less precision, which is equivalent to having "shorter" codes). Using the example above, with a direct statistical correspondence, the intervals would be:
Symbol A: 0.000 - 0.500 (50%)
Symbol B: 0.500 - 0.875 (37.5%)
Symbol C: 0.875 - 1.000 (12.5%)
Now the encoder looks at the first symbol in our set (it's an "A"), and with that it determines that the final coding will have to be between 0.000 and 0.500. It then divides that interval in the same way described above, to see where the second symbol will "land":
Symbol A: 0.0000 - 0.2500 (50% of the interval)
Symbol B: 0.2500 - 0.4375 (37.5% of the interval)
Symbol C: 0.4375 - 0.5000 (12.5% of the interval)
Since the second symbol is also an "A", our encoder now knows that the final coding will be a number between 0.00 and 0.25. It does the same for the third symbol (again an "A"), reducing the interval to 0.000 - 0.125. The next symbol is a "B", so it lands in the 0.500 - 0.875 part of the 0.000 - 0.125 interval (in other words, it lands between 0.0625 and 0.109375). And it keeps doing that for the rest of the symbols:
5th symbol (B): 0.0859375 - 0.103515625
6th symbol (B): 0.0947265625 - 0.101318359375
Fig 6. - Arithmetic coding progressive subdivision (first 6 symbols)
7th symbol (A): 0.0947265625 - 0.0980224609375
8th symbol (C): 0.0976104736328125 - 0.0980224609375
This means that any number between 0.0976104736328125 and 0.0980224609375 represents our original sequence. Since the two bounds only start to differ at the third decimal digit, a short number like 0.098 (which falls inside the interval) is enough. We have just encoded an 8-symbol sequence (where each "symbol" could in fact represent several bytes) as a single 3-digit number (plus the symbol table, with each symbol's probability, and the total length, which will also be required for decoding).
The decoder only needs to perform the reverse operation to recover the original sequence. It looks at the number we stored (0.098), and with that it determines the first symbol (the number is between 0.0 and 0.5, so the first symbol must be an "A"). Then it divides the interval corresponding to the first symbol (0.0 - 0.25 now represents A, 0.25 - 0.4375 represents B, and 0.4375 - 0.5 represents C) and sees where the encoded number lands. It's between 0.0 and 0.25, so the second symbol is also an A. And so on, until it has recovered the full sequence.
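The encoding side of this process fits in a few lines of Python. Using exact fractions sidesteps rounding problems (real implementations are far more careful about this); the probability table is the one from the example above:

```python
from fractions import Fraction

def arithmetic_encode(symbols, probabilities):
    """Narrow the [0, 1) interval once per symbol; any number inside
    the final interval identifies the whole sequence."""
    low, width = Fraction(0), Fraction(1)
    for sym in symbols:
        start = Fraction(0)
        for s, p in probabilities.items():
            if s == sym:
                low += start * width    # move to this symbol's segment
                width *= p              # shrink to the segment's size
                break
            start += p
    return low, low + width

probs = {"A": Fraction(1, 2), "B": Fraction(3, 8), "C": Fraction(1, 8)}
low, high = arithmetic_encode("AAABBBAC", probs)
print(float(low), float(high))   # ~0.097610 and ~0.098022; 0.098 falls inside
```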
By combining prediction algorithms (see below) with entropy coding, some lossless image compressors can achieve size reductions approaching 50% for some photographs. Entropy coding is also used in conjunction with other algorithms, such as RLE and LZ. It's used, for example, by the lossless video codec HuffYUV, the PNG image file format, several audio codecs, and some lossy compression formats (more on this below, and in the second part of this article).
Note...
Although the description of arithmetic coding above is formally correct, real-world implementations can have further optimisations, such as adaptive or non-linear interval division, numeric base optimisation and conversion, etc. None of that is particularly relevant to understanding the fundamental principle, though. If you want more information, it's relatively easy to find on the net.
We saw above how reordering the values in an image file allowed the RLE algorithm to achieve much better compression. That was only possible because we had some "special knowledge" about the kind of patterns present in an image (namely that each pixel was represented by three values). Identifying such "typical patterns" or tendencies in specific kinds of data is the key to efficient compression.
And the tendency in "natural" images (and sound files, too) is that values tend to vary smoothly. It's very rare for a photograph to have, for example, a red pixel followed by a blue pixel followed by a white pixel. Natural scenes tend to have slow progressions of colours, and this is made even more likely by the "blurriness" introduced by the capture equipment (ex., camera). This means that, once we know the colour of two pixels, we can make a reasonable prediction about the colour of the next one.
For example, if the brightness of our first two pixels is [100] and [112], we can estimate that the value of the next pixel will be [124]. Our prediction might be correct or it might be incorrect, but either way, it can help us to compress the data. How? Simple: we don't store the prediction; we store the error.
What I mean by that is that we subtract our prediction from the real value of the next pixel, and store the result. How does that help us? By now you've probably guessed it, but let's look at an example. Imagine we have this sequence:
[190] [195] [199] [202] [200] [198] [196] [193] [192]
The first value might as well be stored as it is, since we can't really "predict" it from anything. For the second value, we can "predict" there will be no variation (i.e., the predicted value is [190]). But, in fact, there was some variation (+5). So we store that:
[190] [+5]
For all values after that, we can predict that the variation will be identical to the previous one. So, for the next value, our prediction is [200] (195+195-190). But the real value is 199. So we store the difference: [-1]. The prediction for the next value is 203 (199+199-195), but the real value is [202]. So we store the difference : [-1]. And so on. If we repeat this for the whole sequence, we get this:
[190] [+5] [-1] [-1] [-5] [0] [0] [-1] [+2]
As you probably noticed, this didn't reduce the number of values we need to store (it's still one value per pixel - or three, if the pixel used 3 colour channels), but as long as our prediction was reasonably close, the actual values will be much lower than the original ones (the closer the prediction, the lower the error values). And this means two things. First, it means the values might be able to fit in fewer bits (just looking at the example above you can see all the errors have a single digit, while the original values had 3, so they would even take up fewer decimal digits, let alone binary ones). And secondly, if the values are all in a smaller range, that means there's a bigger chance that some of them will be repeated. And repeated values increase the likelihood of repeated sequences, which means we can then apply dictionary compression (or entropy coding followed by RLE, or any other form of compression) to the list of errors.
To recover the original image, the decoder would first reverse the actual compression, then start "predicting" values and correcting them by adding the stored error values. As long as the encoder and decoder use exactly the same prediction algorithm, the final data is guaranteed to match.
Note that the prediction algorithm given above is rather simplistic (looking only at two values, and assuming linear variation). Real-world algorithms can look at many more samples and make predictions based on different types of curves.
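Still, even the simplistic two-sample linear predictor is worth sketching, since it reproduces the example above exactly (in Python, with illustrative names):

```python
def predictive_encode(values: list) -> list:
    """Predict v[i-1] + (v[i-1] - v[i-2]) and store only the error."""
    out = values[:1]                             # first value stored as-is
    if len(values) > 1:
        out.append(values[1] - values[0])        # first prediction: "no change"
    for i in range(2, len(values)):
        prediction = 2 * values[i - 1] - values[i - 2]
        out.append(values[i] - prediction)
    return out

def predictive_decode(errors: list) -> list:
    values = errors[:1]
    if len(errors) > 1:
        values.append(values[0] + errors[1])
    for error in errors[2:]:
        values.append(2 * values[-1] - values[-2] + error)
    return values

print(predictive_encode([190, 195, 199, 202, 200, 198, 196, 193, 192]))
# [190, 5, -1, -1, -5, 0, 0, -1, 2]
```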
Note...
You may have heard the term "delta" associated with this technique in two ways: First, a linear prediction algorithm that works by calculating the difference between two values and then using that result can be described as a "linear delta" predictor. And second, the act of storing the difference between two values, rather than the values themselves, can be described as "delta coding". This is simply due to the fact that "delta" is the name of the Greek letter "Δ", corresponding to "D", and is typically used in mathematics to mean "difference". In other words, "delta compression" can mean many different things, and is used in several contexts (as we'll see in the final part of this guide). Interpret it simply as meaning that, at some point in the algorithm, something is being subtracted from something else, and the result - or difference between the two - is being used, rather than the original values.
Prediction and error coding (followed by dictionary compression or entropy coding and RLE) is a very successful technique for compressing "natural media" (meaning images, sounds and other data sets with an analog origin). Like most compression techniques, prediction works better on images with less fine detail, but, unlike LZ, it does not require exact pattern repetition. Prediction techniques are used by image formats such as PNG, as well as ADPCM-encoded audio. This will be discussed again in part two of this article, since these kinds of techniques are also used by some lossy compression formats.
The following charts show the compression that can be achieved by RLE, LZ and predictive encoders on different kinds of images. Note that the first chart uses a logarithmic scale; the difference between compressed and uncompressed sizes is so big that the last three bars would be almost invisible otherwise.
Fig. 7 - Compression of 24-bit 3-colour logo (logarithmic scale)
Fig. 8 - Compression of a 24-bit photo
Fig. 9 - Compression of a 32-bit (RGB + Alpha) CG render
As you can see, the advantage of the predictive compression is greater in more "natural" images (the photo). In images with high-contrast edges, prediction offers little (CG render) or no (logo) benefits over dictionary encoding.
For video files with no temporal compression, the ratios will be similar to the ones achieved with stills. More details about video compression techniques will be discussed in the next articles in this series.
The chart below shows the approximate level of compression that can be achieved by LZ and predictive compressors on three kinds of media: text (English), DAT-quality speech (48 kHz, 16 bits, mono) and CD-quality music (44.1 kHz, 16 bits, stereo). The sizes are presented as a percentage of the original (uncompressed) file size. The bars for LZ audio compression have two values; the highest value is using a "plain" LZ compressor, the lower value is using audio-specific optimisations. The difference between the music and speech compression ratios is caused mainly by the periods of silence (common in speech but almost non-existent in music), where the long sequences of low values increase the likelihood of repeated patterns.
Fig. 10 - Text and audio compression (size as percentage of original)
Despite a reasonable performance in audio compression (especially when the audio-specific optimisations are enabled), LZ-style algorithms are rarely used for this purpose. This is mainly due to two reasons: First, audio files are comparatively much smaller than image or video files, which means they can usually be kept uncompressed during production. And second, lossy algorithms can achieve much higher compression ratios without a significant loss in quality, so those are generally used for distribution.
There are several lossless audio compression formats (such as FLAC, True Audio, WavPack or Monkey's Audio) that use prediction techniques (sometimes combined with LZ, RLE and entropy coding) to achieve size reductions of up to 50% for music and 60% for speech. Unfortunately these formats are not supported by most current Windows and Mac DCC (digital content creation) applications, so they need to be converted (ex., to uncompressed PCM) before they can be loaded into audio and video projects.
And so we come to the end of the first article in this series. I hope you found it interesting and accessible.
It's important to note that all the compression algorithms described in this article are lossless. In other words, if the data is compressed and then decompressed, the result will be exactly identical to the original. This is essential, for example, when compressing computer programs, where a change to a single bit can cause the program not to run, or make it behave in unpredictable and potentially disastrous ways. It's also advisable to use lossless compression for media during intermediate production stages, to avoid or minimise the degradation of the source material as it's processed or transferred between applications.
Note...
How much was this guide worth to you? More than your private island in the Caribbean? Less than a bag of microwave popcorn? You decide. Click the link below and enter the amount you consider fair. You will need a valid credit card or a PayPal account.
PAY WHAT YOU LIKE!
If you have any questions or comments about this guide, post a message in the forum. If you quote this guide on other websites, in your e-mails, academic papers, etc., please include a link to this page.
In the next part of this article we'll look at situations where some changes to the original data are considered acceptable (lossy compression), and how some of those techniques can be used to improve lossless compression as well.