Foreword

This is the third and final part of a three-part article about data compression, with particular focus on media (images, audio and video). It does not require any knowledge of information theory, mathematics or programming. I have tried to use language accessible to anyone without sacrificing the correctness and thoroughness of the information provided.

When some issues are considered more complex (or less relevant), they will be placed inside a note box. You don't need to read the notes to understand the fundamental principles being described, but you might want to check them out anyway, especially if you noticed a potential problem with some special case of the processes described in the previous paragraph(s).

Binary digit (bit) will sometimes be abbreviated as "b", and bit octet (byte) will sometimes be abbreviated as "B". Those abbreviations might be prefixed by standard SI magnitude symbols like "k" (kilo), "M" (mega) or "G" (giga). So "kb" stands for "kilobit", "kB" stands for "kilobyte", "Mb" stands for "megabit", and so on. When the number is represented using a "binary magnitude" (i.e., using a factor of 1024, instead of 1000), it will use the IEC binary prefixes instead (kiB, MiB, GiB). See the FAQ for more details about this distinction.

Introduction to part 3

In the first part of this article we looked at lossless data compression algorithms, as well as at some of the techniques that can be used to improve their performance. In the second part we looked at some of the techniques used in lossy compression (where the data obtained after decompression is not necessarily identical to the original), and how some of those can be adapted to improve lossless compression.

In this third part we're going to look at how the concepts and techniques described in the first two parts can be applied to video compression, as well as some unrelated techniques used by modern video compressors.

Note...


Posting uncompressed videos online would make this article very slow to download, and posting videos using multiple compressors would force the readers to download and install all the codecs used. In addition to that, browsers' built-in video player plug-ins are not designed to give fine control over playback (ex., frame-by-frame, forwards and backwards, etc.). For these reasons, I have decided to illustrate the compression techniques with sequences of stills, rather than video files.

Moving pictures

A painting during an earthquake, that's a moving picture. Video clips and movies, on the other hand, generally consist of a sequence of individual images, shown in quick succession. Similarities between these images trick our brain into thinking that it's seeing moving objects (or a moving point of view, etc.).

This gives us our first (and most straightforward) video compression technique: simply compress each image individually, using any of the compression techniques described in the previous two parts of this guide. This is how the popular video editing codecs MJPEG and PhotoJPEG work: they compress each frame as a JPEG image. DV works in a similar way (also using DCT-based compression, but with some specific optimisations for use in a linear medium like tape), as do several other codecs (some lossy, some lossless).

This type of video compression, where each frame is compressed independently, is known as "intraframe" compression ("intra" is a Latin word meaning "within").

A few technical details are worth pointing out, though.

First, there's the issue of speed. Our playback device must be able to decompress the images quickly enough to display them at the required speed (ex., 25 or 29.97 frames per second). And if we want to be able to record images in real time (ex., from a camera), the compressor must also be able to compress them quickly enough (alternatively, if we have enough space and bandwidth, we can capture the images with no compression and compress them later, without the real-time requirement, or even keep them uncompressed the whole time, although that is unlikely to be viable for broadcasting or distribution).

Then there's the matter of the file structure. Still image files generally start with some information about the file itself (what is known as the "file header"). This includes the type of compression, the number and precision of the colour channels, the image dimensions, etc.. Since most of these elements are unlikely to change during playback, and since they are likely to be included in the movie file's own header, they can be stripped from the individual frames to save space.

Saving space isn't always the #1 priority, though. This is a good time to point out the difference between a digital video file and a digital video stream.

A "file" is something with a well-defined beginning (and, usually, a well-defined end), that is fully accessible to the playback device at all times. When you include frames 350 to 400 of an AVI file in your NLE's timeline, the software opens the AVI, looks at its header to determine the compression type, resolution, etc., and then processes as much of the compressed data as it needs in order to extract the required frames.

Note...


Note that frames can have different sizes, in bytes (ex., a black frame will usually compress to a much smaller size than a complex image), so it's not always easy to determine which part of the compressed file corresponds to which frame(s). This means that, even if the file uses intraframe compression, the decoder might need to read some extra data, to help it locate the requested frames (there are several techniques to do this: using indexes, performing binary searches, padding frames to keep an exact bitrate, etc.).
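
To make the index idea concrete, here's a minimal Python sketch: a simple table mapping frame numbers to byte offsets, so a decoder can jump straight to a frame without scanning the whole file. The frame sizes and header size are made up for illustration; real containers (such as AVI) store a similar, more elaborate index.

```python
def build_index(frame_sizes, header_size):
    """Return a list where index[n] is the byte offset of frame n."""
    offsets = []
    pos = header_size
    for size in frame_sizes:
        offsets.append(pos)
        pos += size
    return offsets

# Frames compress to different sizes (a black frame is tiny, a busy one large):
sizes = [1200, 35000, 34200, 900, 36100]
index = build_index(sizes, header_size=64)
print(index[3])  # 70464: the byte offset where frame #3 starts
```

With such a table stored in the file, "which bytes correspond to frame #3?" becomes a single lookup instead of a search.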

This kind of file access isn't always practical, however. Let's say you decided to watch a live internet video broadcast, but you logged into the site 15 minutes after it started. Does this mean your browser has to download those 15 minutes before it can play back the "current" part of the video (which by then won't even be "current" anymore)? Well, if the broadcast is done using a traditional video file, then yes. This is where streams come in.

A "stream", as the name suggests, is something that does not necessarily have a well-defined start or end; it just flows by. Specifically, in the case of digital video streams, it refers to data organised in such a way that it can be decoded without access to "distant" parts of the data (like the beginning or the end). This can be achieved simply by introducing into the stream, at regular intervals, the information from the file header (ex., before each frame, or once per second, or once every 100 frames, etc.). This information is usually prefixed by a "special" sequence of bytes (one that never occurs in the compressed data itself), so the decoder can identify it.
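
As a small illustration, here's how a decoder joining mid-stream might scan for such a marker, in Python. The payload bytes here are made up, but the start-code prefix shown (0x00 0x00 0x01) is the one actually used by MPEG streams.

```python
MARKER = b"\x00\x00\x01"  # start-code prefix (MPEG streams use this sequence)

def find_next_header(stream_bytes, start=0):
    """Return the offset of the first start code at or after `start`, or -1."""
    return stream_bytes.find(MARKER, start)

# Join the "broadcast" mid-stream: skip everything before the first header.
data = b"\x7f\x42\x9a" + MARKER + b"header+frames..."
offset = find_next_header(data)
print(offset)  # 3 -> decoding can begin here, with no need for earlier data
```

The encoder, for its part, must guarantee that this byte sequence never appears inside the compressed payload (real formats do this with "escaping" rules).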

Note...


Note that, even in the case of intraframe compression, this refers to the information in the video file's headers, not the headers of the individual stills, since those wouldn't have information about things such as frame rate, audio, etc..

Not all internet video uses true streams, though (in fact, most doesn't). Many players use the term "streaming video" to describe the ability to start playing back a file before it has been fully downloaded, but they still need to download the start of the file (you can't play the last 2 seconds without downloading everything that came before them). This isn't a true video stream (which requires dedicated servers and protocols). As a general rule, on the internet, only live broadcasts use real streams. A good example of a stream format outside the internet is the way data is recorded to digital video tapes (any part of the tape can be played back without access to the start of the tape).

Note...


Some popular video-hosting sites use a hybrid solution, where the video is kept as individual files (ex., in FLV format) and a simple server-side script receives "seek" commands from the client player, and then serves the appropriate part of the file. This only works with files that have been previously indexed, so the server knows which part of the data corresponds to which frame, and it still requires that the beginning of the file is downloaded (so the player has access to the file headers). While technically not true streaming video, this approach is almost indistinguishable, from the site visitors' point of view.

Using a stream, therefore, requires slightly higher bandwidth (or, if the bandwidth is fixed, leaves less bandwidth for the video itself), but that can be more than offset by the practical advantages of being able to decode a segment of the video (and audio, of course) without having to download the whole thing from the start. If you know people will always have the full video data available (ex., because you're distributing it on a CD-ROM), it makes perfect sense to use a normal video file, with as little repetition as possible, to save space.

Animation as description

In the previous section I briefly mentioned vector graphics. Vector graphics are a way to describe images as a series of geometric shapes, parametric deformations and procedural textures. For example, instead of storing a red circle over a black background as a 640x480 image, you can store it as a list of vector entities, like this:

Resolution: X=640px, Y=480px
Background: R=0.0, G=0.0, B=0.0
Object01: Type="Circle", R=1.0, G=0.0, B=0.0, X=0.5, Y=0.5, Radius=0.25

Where X, Y and the Radius are represented as a fraction of the image width, height and diagonal, and where R, G and B are represented in a floating-point scale. The example above is obviously not an optimised representation; it's meant to be easily readable, but still, it only takes up around 130 bytes, whereas a 640x480 bitmap (24-bit RGB, uncompressed) would take up 900 kiB. Even with compression, it would probably still take up more than 130 bytes. But that's not the point right now. The point is that we can also use vectors to animate the elements in our vector image. For example:

Keyframe03: Target="Object01:X", Value=0.4, Time=00:00:05.000

The line above could instruct our "vector animation player" to change the value of the "X" property of object "Object01" to "0.4" at the timecode "00:00:05.000". Exactly how that change would occur (suddenly at that frame, or slowly interpolated from the previous keyframe, etc.) would depend on other settings, which aren't relevant right now; this is just a simple readable example, not a real-world implementation.
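
To illustrate the idea, here's a minimal Python sketch of the kind of interpolation a "vector animation player" might perform between two keyframes. The linear behaviour and the values are just assumptions for this example; real players also support easing curves, holds, and so on.

```python
def interpolate(value_a, time_a, value_b, time_b, t):
    """Linearly interpolate a property value at time t between two keyframes."""
    if t <= time_a:
        return value_a
    if t >= time_b:
        return value_b
    fraction = (t - time_a) / (time_b - time_a)
    return value_a + (value_b - value_a) * fraction

# Object01's X property moves from 0.5 (at t=0s) to 0.4 (at t=5s):
print(interpolate(0.5, 0.0, 0.4, 5.0, 2.5))  # ~0.45, halfway through the move
```

The player evaluates this for every property, on every frame it renders, which is why the frame rate can be anything the hardware can keep up with.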

So, just as vector graphics can take up far less space than bitmaps, so can vector animation (also known as parametric animation) take up a lot less space than traditional video files, where each frame has to be built "from scratch". Of course, just as vector graphics are only suitable for certain types of images, so is vector animation only appropriate for certain kinds of motion.

It's important to note, however, that vector animation can be applied to bitmaps (not just to geometric vector graphics). This is, in fact, one of the main uses of a popular internet browser plug-in called Flash (if only people would use it for something useful, instead of forcing visitors to sit through hideous and unbearably long website "intros").

Note...


Note that the paragraph above refers to the base Flash platform, not to Flash Video, which is really a conventional "bitmap-based" video codec that happens to be built into the Flash web browser plug-in.

Let's say you have a banner at the top of your web page that consists of a pan across a landscape while your logo bounces up and down. You could store this as a "full-motion video" file, where each frame is essentially a separate bitmap. Or you could store it as three elements: the background bitmap, the logo (which can be a bitmap or a vector object) and a list of vector animation commands to make the background scroll while the logo bounces.

In a way, vector animation files are similar to the "project" or "composition" files used by NLE and compositing software, in the sense that their output combines the input files with a series of transformations, effects and transitions. The main difference is that vector animation files must be simple enough to play back in real time, while NLE and compositor project files are meant to be rendered to a ("full-motion") video file, so they don't need to play back directly.

One advantage of vector graphics is resolution independence. If we take our "circle" example from above:

Resolution: X=640px, Y=480px
Background: R=0.0, G=0.0, B=0.0
Object01: Type="Circle", R=1.0, G=0.0, B=0.0, X=0.5, Y=0.5, Radius=0.25

And decide we want it to be 2048x1536 pixels instead, we only need to change the first line. Everything else uses a relative scale, so the image is simply "re-rendered" at our new resolution, preserving full detail. If we scale a bitmap from 640x480 to 2048x1536, chances are it will end up looking either pixelated or blurry (depending on which resize algorithm is used).
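
As a quick illustration, here's how the relative coordinates from the example could be converted to pixels at any target resolution: a Python sketch using the width/height/diagonal convention described earlier (the function name is made up).

```python
import math

def circle_pixels(res_x, res_y, x, y, radius):
    """Convert the relative circle parameters to pixel values."""
    diagonal = math.hypot(res_x, res_y)  # length of the image diagonal
    return round(x * res_x), round(y * res_y), round(radius * diagonal)

print(circle_pixels(640, 480, 0.5, 0.5, 0.25))    # (320, 240, 200)
print(circle_pixels(2048, 1536, 0.5, 0.5, 0.25))  # (1024, 768, 640)
```

Same description, two resolutions: only the first line of the vector file changed, and the circle is re-rendered with full detail at both sizes.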

Vector animation has a similar advantage, but applied to time. When a normal video clip is slowed down (ex., to 10% of its original speed), motion becomes less fluid; objects appear to remain still for a while and then jump from that position to the next. There are ways to disguise this (such as making a cross-fade between consecutive frames, or using CPU-intensive algorithms to "fake" new, interpolated frames), but with vector animation there are no such problems. You simply move the keyframes apart, keeping the same frame rate, and the player will generate more intermediate frames. Or, alternatively, you increase the frame rate (keeping the keyframes where they are), and motion becomes more fluid. This is assuming your player has enough computing power to generate that number of frames per second, of course.

A smart player might analyze your system's speed, determine how many frames per second it can reliably generate, and adjust the playback frame rate so that people with faster systems get a better experience. An even smarter player might change the playback speed dynamically, in real time, depending on the complexity of each frame and the time it takes to render it. Flash doesn't actually do this (in fact, Flash animations normally have a preset frame rate and won't increase it even if your system is 10x faster than required), but there is one type of software that does: games.

Most PC games will constantly adjust the frame rate depending on the complexity of the images they are rendering. Some games will also adjust the complexity (level of detail) of the images and 3D models, if they decide that the frame rate is too low or unnecessarily high.

So what does all this have to do with video compression? Well, vector animation can be considered as a form of "motion information compression" and, as we'll see later, that concept can be applied to traditional video as well. And variable frame rate is an example of how we can get variable amounts of information (i.e., a different number of frames per second) from the same source data.

Note...


Vector animation has gained more relevance with a recent concept called "machinima". Machinima stands for "machine cinema" or "machine animation", and refers to the creation and playback of movies using the real-time capabilities of videogame engines. This isn't merely a low-budget, fast-production alternative to traditional CGI; machinima allows "live" performances, where "puppeteers" control 3D models, cameras and other in-game elements, and spectators watch in real time, using their systems to render the images based on the vector information arriving through their network connections. Although everyone receives the same source data (assuming their connections are fast enough and error-free, of course), each system will render different images, depending on its capabilities and user preferences.

Most machinima is still distributed as traditional video files, though, with a fixed quality and frame rate, so that everyone can watch it, without having to buy and install the game (or stand-alone 3D engine) used to create it.

Temporal delta encoding

And now for something not quite completely different.

So far we've looked at intraframe video compression, where each frame is compressed independently, and parametric (or vector) animation, where only changes between parameters (coordinates, colours, angles, etc.) are stored. Now let's look at ways in which the two concepts can be combined.

As you might recall from the previous part of this guide, one way to recover the original image from a lossily-compressed copy was to combine it with another image (called the "delta"), generated by subtracting the compressed copy from the original. Although, in that case, we were subtracting (that is, determining the per-pixel difference between) two "versions" of the same image, there is no reason why we can't determine the difference between two different images. Namely, between two consecutive frames in a video clip.

And, here too, the smaller the differences between the images, the lower the values in the delta image will be. And the lower the values, the more compressible the delta will be (with either lossless or lossy compression algorithms).
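
Here's a small Python experiment illustrating the point, using flat arrays of "pixel" values and zlib as a stand-in for a real lossless coder. The frame data is just seeded random noise, chosen so the full frame is nearly incompressible.

```python
import random
import zlib

rng = random.Random(42)
frame1 = bytes(rng.randrange(256) for _ in range(64 * 64))  # a noisy frame
pixels = bytearray(frame1)
for i in range(100, 110):                                   # a small change
    pixels[i] = (pixels[i] + 40) % 256
frame2 = bytes(pixels)

# Per-pixel difference, stored modulo 256 so it fits back into one byte:
delta = bytes((b - a) % 256 for a, b in zip(frame1, frame2))

print(len(zlib.compress(frame2)) > len(zlib.compress(delta)))  # True

# The decoder rebuilds frame #2 from frame #1 plus the delta:
rebuilt = bytes((a + d) % 256 for a, d in zip(frame1, delta))
print(rebuilt == frame2)  # True
```

The delta is almost entirely zeros, so it compresses to a tiny fraction of the full frame, while still reconstructing frame #2 exactly.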

The following figure shows two frames from an animation, followed by the (absolute value) delta between those images.

Temporal delta

Fig 1. - Delta between consecutive frames


The original images are 640x480, 24-bit RGB files. The delta image takes up only 15% of the original frames, using lossless compression.

This technique (of storing only the difference between consecutive frames) is known as "temporal delta encoding", and has been used by animation software for almost three decades (ex., Cyber Paint's SEQ format, Autodesk Animator's FLI / FLC format, etc.). Temporal compression (of any type, not just delta) is also known as "interframe compression" ("inter" means "between", in Latin).

Delta compression can be quite efficient for cartoon-style animation, but has three big problems for video. The first is that, the higher the number of colours in the image, the higher the chance that a pixel will have different colours in two consecutive frames (cartoons tend to use fewer colours than video). The second is that, even for areas with no motion, video noise (grain) can cause pixels to change colour, adding "detail" to the delta image (and therefore making compression less efficient). The third is that something as simple as a camera pan (which is far more common in "real" video than in cartoon-style animation) will cause big changes in the delta, although much of the image is identical between the frames (just at different coordinates).

We can deal with the first two problems by quantizing the colour palette, or by implementing a "temporal tolerance" value for pixel colour variation. But those solutions do very little for the third problem. If only we could apply the concepts from vector animation to video, and somehow encode one frame of the camera pan as a parametric transformation of the previous one...
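
The "temporal tolerance" idea can be sketched in a few lines of Python. The tolerance value is arbitrary, and note that discarding small differences makes the scheme slightly lossy; real codecs handle noise in more sophisticated ways.

```python
def tolerant_delta(prev, curr, tolerance):
    """Signed per-pixel difference, with near-zero differences clamped to 0."""
    out = []
    for a, b in zip(prev, curr):
        d = b - a
        out.append(0 if abs(d) <= tolerance else d)
    return out

prev = [100, 100, 100, 200]
curr = [101,  99, 140, 200]   # the first two pixels only changed due to noise
print(tolerant_delta(prev, curr, tolerance=2))  # [0, 0, 40, 0]
```

The grain-induced changes vanish from the delta, leaving only the "real" change, so the runs of zeros become long (and compressible) again.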

Motion compensation

And now for something even less completely different.

Let's say we have a video clip that consists of a camera pan across a landscape. Frame #2 is essentially identical to frame #1, but everything is shifted 14 pixels to the left. So we could encode this frame by storing a motion vector:

[Dx=-14, Dy=0]

Followed by a bitmap containing only the new part of the image (a 14-pixel wide "slice" to be added on the right edge).

This procedure (which we'll look at in more detail below) is known as "motion compensation" or "motion-compensated interframe prediction". The terms "motion search" and "motion vector" are sometimes also used to describe this type of encoding technique, but they refer to specific parts of the algorithm, and not to the concept as a whole.

Of course, it's very unlikely that the new image is exactly identical to the previous one shifted left by 14 pixels (due to video noise, lens distortion, etc.). So how do we solve that? Simple, we add a difference image, calculated by subtracting our "shifted" version of frame #1 from the real frame #2. In fact, since we're going to add that, we don't even need to store the new "slice"; it'll simply be included in the difference image.
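
Here's a toy Python sketch of the whole round trip (shift, residual, reconstruction) using tiny 5x2 "frames" and a 1-pixel pan instead of the 14-pixel one. Out-of-frame pixels are simply treated as 0 here, which real codecs handle more carefully.

```python
def shift_left(frame, dx):
    """Shift every row left by dx pixels, padding the right edge with 0."""
    return [row[dx:] + [0] * dx for row in frame]

def residual_of(predicted, actual):
    """Per-pixel difference: what must be added to `predicted` to get `actual`."""
    return [[a - p for p, a in zip(rp, ra)] for rp, ra in zip(predicted, actual)]

def apply_residual(predicted, residual):
    """Decoder side: predicted frame plus residual."""
    return [[p + r for p, r in zip(rp, rr)] for rp, rr in zip(predicted, residual)]

frame1 = [[10, 20, 30, 40, 50],
          [11, 21, 31, 41, 51]]              # a tiny "camera pan" source frame
frame2 = [[20, 30, 40, 50, 60],
          [21, 31, 41, 51, 61]]              # same scene, shifted 1px left

predicted = shift_left(frame1, 1)            # apply motion vector [Dx=-1, Dy=0]
residual = residual_of(predicted, frame2)    # mostly zeros, plus the new edge

# The decoder rebuilds frame #2 from frame #1, the vector and the residual:
print(apply_residual(predicted, residual) == frame2)  # True
```

Note that the new "slice" on the right edge simply shows up in the residual, exactly as described above, so nothing else needs to be stored.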

The following figures show two frames from a camera pan, followed by the delta between them, followed by the delta between frame #2 and a horizontally shifted copy of frame #1. All sizes refer to the original frames (720x576 pixels, 24-bit RGB, saved in JPEG format using Photoshop CS2, with the quality slider set to "8").

Pan frame 1

Fig. 2 - Pan frame #1 (101 kiB)

Pan frame 2

Fig. 3 - Pan frame #2 (100 kiB)


Frame delta

Fig. 4 - Delta between #1 and #2 (121 kiB)

Shifted delta

Fig. 5 - Delta after shifting (53 kiB)


Although there are some differences in the scene itself, the "motion compensated delta" is less than half the size of a full frame. The straight delta between the two frames actually ends up bigger than the original frames, because the DCT algorithm can't compress it as efficiently (remember, DCT compressors work best with "natural" images, which isn't the case of the delta). The motion vector itself (which also needs to be stored, for the shifted version) takes up only a couple of bytes, so it's not relevant for the final size.

In this context (where the difference information is used to "correct" the result of a vector transformation), the term "residual" is normally used instead of "delta". It's basically the same process, but using different terms makes it easier to understand that (in the case of motion compensation algorithms), the difference information is used to "clean up" after the main encoding technique, rather than to encode the image directly.

Note...


In the case of algorithms with a DCT stage, the residual can be calculated either in pixel or frequency space (in other words, instead of subtracting pixel values, the encoder can encode the blocks first and then subtract the DCT coefficient matrices). Thinking about the process in terms of pixel data makes it much easier to visualise, though, so here I will use actual images instead of DCT tables.

It's important to note that motion compensation is a technique for the compression of full-motion video (i.e., movie clips, where each frame was originally a bitmap, whether generated by a computer or recorded with a camera). Vector animation, by contrast, is built from the start as a series of parametric effects and transformations applied to graphical objects (lines, geometric shapes, text, bitmaps, etc.). If you rendered a vector animation to a video file (ex., an AVI) and then compressed it using a motion compensation algorithm, the end result would probably be bigger (and look worse) than the original vector animation.

We now have a way to "encode" camera pans using just two numbers, which, while not perfect (due to video grain, etc.), can greatly reduce the size of the residual (compared to a "straight" delta). But what if, instead of a camera pan, we have some objects moving across the frame? Wouldn't it be nice if our algorithm could identify shapes, figure out how they moved from one frame to the next, and encode that as a motion vector, so that the residual can be simplified?

Well, identifying shapes (ex., a car, or a person, moving across the frame) is a very complex task, which heavy-duty AI systems are only now beginning to tackle. But if you're familiar with a process called "tracking", you know that something similar can be achieved in real-time (or even faster) by graphics workstations and high-end personal computers.

Trackers work by taking a rectangular selection from one frame and looking for the most similar area in a different frame (or in multiple frames), within a "search area" surrounding the previous position (ex., a maximum distance of 40 pixels). The tracker then stores the coordinates of the best match, which can later be used, for example, to pin a CG element to a specific object in the background video. By always using a rectangle as the tracker's "reference", no complex shape analysis needs to be made. This means some care must be taken when defining the source (to avoid divergent motion, for example), but a human can make that decision quickly and (with a bit of experience) accurately.

So, how can we adapt this to video compression? We can't really ask the user to manually define tracking sources for every object in a scene. And if we track the image as a whole, it'll only be good for camera pans. The solution is to divide the image into a grid of blocks and track all of them. If the blocks are too big, it'll be hard to get good matches (because their content is unlikely to move as a whole). If the blocks are too small, we'll end up with so many motion vectors that we don't save any space. A typical size used by real-world encoders is 16x16 pixel blocks. In other words, a standard-definition PAL frame is divided into 1620 blocks (1350 for NTSC), and each of them is then "tracked".
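
Below is a deliberately brute-force Python sketch of this block matching, using SAD (sum of absolute differences) as the similarity measure, with 2x2 blocks and a 1-pixel search radius instead of the real-world 16x16 blocks and much larger search windows.

```python
def sad(prev, x, y, curr, bx, by, size):
    """SAD between the block of `curr` at (bx, by) and the area of `prev` at (x, y)."""
    total = 0
    for j in range(size):
        for i in range(size):
            total += abs(prev[y + j][x + i] - curr[by + j][bx + i])
    return total

def best_match(prev, curr, bx, by, size, search):
    """Offset (dx, dy) into `prev` that best matches the block of `curr` at (bx, by)."""
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if 0 <= x <= len(prev[0]) - size and 0 <= y <= len(prev) - size:
                cost = sad(prev, x, y, curr, bx, by, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best[1], best[2]

# A small bright shape that moved one pixel to the left between two frames:
prev = [[0, 0, 0, 0],
        [0, 9, 8, 0],
        [0, 7, 6, 0],
        [0, 0, 0, 0]]
curr = [[0, 0, 0, 0],
        [9, 8, 0, 0],
        [7, 6, 0, 0],
        [0, 0, 0, 0]]
print(best_match(prev, curr, 0, 1, 2, 1))  # (1, 0): the match is 1px to the right
```

Real encoders avoid testing every offset like this; as described above, they start from a predicted vector (based on the block's previous motion, or on neighbouring blocks) and search outwards from there.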

The amount of processing power required to do this depends on the amount of motion in the source video, as well as its predictability. For example, if a block moved 4 pixels to the left between the previous two frames, the encoder might start its search for the next match by assuming it moved another 4 pixels (that is, start at [-4,0]) and then try [-4,-1], [-5,0], [-4,1], [-3,0], and so on, in a series of concentric rings. Since nearby blocks tend to move in similar ways, basing a block's motion vector on the average of nearby blocks also helps speed up the tracking (and reduce the amount of data specific to each block, improving compression). Thanks to these kinds of optimisations, and to the fact that most live video has large areas with no motion at all, motion search can often be performed on standard-definition video in real time, even using a home PC. HD (high-definition) video has approximately 5 times more data, so it's still hard to encode in real time without faster (or specialised) processors. Some high-end GPUs can be used to accelerate some stages of video compression (and decompression).

Note...


Most motion compensation encoders store motion vectors using a half-pixel or quarter-pixel scale. This isn't really important to understand the fundamental principle, so I won't go into much detail here. A search for "sub-pixel motion" using your favourite internet search engine should return plenty of results.

Let's look at a simplified example, using an enlarged detail from a video clip. These are its first two frames:

Frame #1

Fig 6. - Frame #1 (enlarged detail)

Frame #2

Fig 7. - Frame #2 (enlarged detail)


The first frame is stored using intraframe DCT-based compression, which was described in the previous part of the guide. The second frame is going to be encoded using motion compensation, so we start by dividing it into blocks (sometimes called "macroblocks"):

Macroblocks

Fig 8. - Macroblock division


To avoid making the image too cluttered, let's look at just four of these blocks, labeled A, B, C and D:

Blocks A, B, C and D

Fig 9. - Blocks A, B, C and D


The motion search algorithm will try to locate the most similar blocks in the (compressed version of the) first frame. It does this by testing multiple positions around the original location (sometimes using motion information from previous frames to make a "guess" about the best place to start). The image below shows the best matches found in frame #1:

Source matches

Fig 10. - Best matches on frame #1


The final residual is not calculated relative to the (lossily) compressed version of the first frame, however. It's calculated relative to the original (uncompressed) frame #1. This avoids cumulative errors and can in fact improve quality from one frame to the next (by correcting the error in the original frame's DCT quantisation). If a block is an exact match, its residual will appear as pure, uniform black. The bigger the difference, the higher (brighter) the absolute values in the residual. Here we were able to find pretty good matches, and although the block residuals aren't perfectly black, the values are low and reasonably uniform, meaning the residual will be quite compressible.
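
A small Python experiment shows why this matters, with coarse quantisation standing in for DCT loss. The "open loop" version takes differences between original frames (so the initial error is never corrected), while the "closed loop" version takes them against what the decoder actually has.

```python
def quantise(value, step=8):
    """Lossy stand-in for DCT quantisation: round down to a multiple of `step`."""
    return (value // step) * step

original = [100, 103, 107, 110, 114, 118]  # one pixel's value over six frames

open_loop = [quantise(original[0])]
closed_loop = [quantise(original[0])]
for i in range(1, len(original)):
    # Open loop: difference between the two *originals* (error is never corrected):
    open_loop.append(open_loop[-1] + quantise(original[i] - original[i - 1]))
    # Closed loop: difference against what the decoder actually reconstructed:
    closed_loop.append(closed_loop[-1] + quantise(original[i] - closed_loop[-1]))

print(original[-1] - open_loop[-1])    # 22: the error keeps growing
print(original[-1] - closed_loop[-1])  # 6: the error stays within one quantisation step
```

In the closed-loop version, whenever the accumulated error grows past the quantisation step, the residual "notices" it and corrects it, which is exactly the behaviour described above.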

Best matches (detail)

Fig 11. - Matched blocks and residual (enlarged)


Now let's look at the relative position between the image blocks in frame #2 and the best match locations in frame #1:

Relative position

Fig 12. - Relative position (motion vector)


Unsurprisingly, block A hasn't "moved" at all. The best match is at the same location, which means its motion vector is [0,0]. Blocks C and D have moved horizontally, by different amounts (the car is moving faster than the pedestrian's leg). Their motion vectors are [-6,0] and [-4,0], respectively. But something curious has happened with block B. Although the car is moving horizontally across the frame and the building is perfectly still, block B's motion vector is [0,-5]. In other words, it has "moved" vertically by 5 pixels. This illustrates that, although this technique is called "motion compensation" and the offset values are called "motion vectors", they don't necessarily describe the motion of objects in the scene. They simply encode the difference in position between an image block and its best match in a different frame. Here all the vectors are either vertical or horizontal, but diagonal vectors (ex., [4,3]) are just as common.

Frame #3 (not shown here) would be compressed the same way: the encoder would look for matches in (the encoded version of) frame #2, and store the residuals relative to (the original version of) frame #3. And so on for all the other frames.


Note...


In a previous section I mentioned that some slow-motion algorithms try to generate intermediate frames to preserve motion fluidity (an effect sometimes known as "retiming"). This is done by applying a motion search algorithm (just like the ones used by motion compensation compressors), and then using the motion vectors to perform a kind of "automatic morphing" between two frames, inserting enough new frames to achieve the intended playback speed.

Motion-compensated frame interpolation has made its way into several post-production and compositing software packages, sometimes as part of the standard effects, sometimes as a separate plug-in. While not as perfect as using a real (and usually expensive) high-speed camera, slow motion generated this way can look very realistic, especially if the original footage did not have very fast motion and was blur-free.

In situations such as the (apparent) vertical motion of block B, in the example above, these interpolation algorithms run into some problems, and produce strange results (because the intermediate frames will show the apparent motion of the blocks, not the real motion of the objects in the scene). Some post-production packages allow the user to adjust the motion vectors manually, prior to final rendering, to deal with these situations.

Some TV sets use a simplified version of this technique to double the scan frequency or de-interlace video. The same technique is also used by some PAL - NTSC converters. See the links at the end of this article for more details.

Practical aspects of MPEG encoding

Several modern video codecs use these motion detection and encoding techniques. Since these were first standardised by the MPEG (Moving Picture Experts Group) working group, they are often referred to as "MPEG-style" encoders. These include MPEG-1, MPEG-2, MPEG-4, DivX, XviD, H.264 (also known as AVC), VC-1, VP6, etc. (some of these follow the specifications of the MPEG working group, while others were developed independently, but they all use similar techniques).

Note...


It's important to note that MP3 does not stand for MPEG-3, and is not a video compression codec. There were indeed plans for a video codec called MPEG-3, but it was abandoned in favour of a modified version of MPEG-2. MP3 stands for "MPEG-1 Audio Layer 3".

Some decoders have the option to superimpose the motion vectors onto the video. The following image was generated by playing back an animation (compressed in MPEG) using FFDShow.

Motion vectors overlay

Fig 13. - Motion vectors overlay

Notice how two blocks at the middle of the piano show extremely small motion vectors, since they were uniformly black (or dark gray) in both the frame being displayed and the previous one. This is another example of how the "perfect match" might not correspond to the motion perceived by the human brain.

The short but not-quite-zero motion vectors in the white background are caused by noise in the original video (the video was originally captured from an analog source, and has a small amount of grain, almost invisible but enough to offset the "ideal match" by 1 or 2 pixels, making the compression slightly less efficient).


Encoding every frame as a transformation of the previous one can achieve some pretty impressive results in terms of compression ratio, but it has a few drawbacks.

If we need to display frame #500, we'd have to decode all the frames since the start, since #500 is based on #499, which in turn is based on #498, and so on. This can take a long time even on a high-end computer, let alone a cheap, relatively slow processor like the ones found in handheld or set-top players. So how can we solve this? Simple: we introduce "keyframes" at regular intervals along the video. These keyframes are compressed like the first one (using intraframe compression). When the decoder is asked to return frame #500, it'll only have to go back until it locates a keyframe, and then decode frames from there. If we insert a keyframe once a second, for example, even a real-time decoder will only take about 1 second to "seek" to the requested position (most decoders can work faster than real time, so the real delay would probably be something under 1/4 of a second).
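That seeking logic can be sketched in a few lines. This assumes a hypothetical index of keyframe positions (something a player could build while scanning the file); the function name and interface are my own.

```python
def frames_to_decode(target, keyframes):
    """Return the frames that must be decoded in order to display
    frame `target`: everything from the nearest preceding keyframe
    up to the target itself.  `keyframes` is a sorted list of
    keyframe indices."""
    # Find the last keyframe at or before the target frame...
    start = max(k for k in keyframes if k <= target)
    # ...and decode forward from there.
    return list(range(start, target + 1))
```

With keyframes at #0, #250 and #475, displaying frame #500 only requires decoding frames #475 through #500, instead of all 501 frames from the start of the clip.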

Another problem is that, if we have a scene change at some point in our clip, the first frame of the new scene probably won't have any blocks similar to the previous one. That means a lot of data will have to be stored in the residual, and the motion vectors are more or less useless. Since most video compressors operate with a fixed bitrate, this means that the bits used to store the motion vectors are taking up bandwidth that could be put to better use by storing the whole image as an independent frame. Therefore, whenever there's a scene change (or indeed any frame that's very different from the previous one), it would be desirable to encode it as a keyframe. This can make our keyframe distribution irregular, but will improve the quality of the video clip. Most modern encoders can do this automatically: if they detect that two frames are so different that it would be a waste to use motion compensation, they encode the second one as a keyframe.
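A toy version of such a scene-cut detector might look like this. The pixel tolerance and the 50% threshold are illustrative values I picked for the example, not taken from any real encoder (real encoders usually compare the estimated cost of the two encodings instead).

```python
def choose_frame_type(prev_frame, frame, threshold=0.5):
    """Decide between a keyframe (I) and a predicted frame (P) by
    measuring how different two frames are: the fraction of pixels
    whose value changed by more than a small tolerance.  If too
    many pixels changed, motion compensation is probably a waste."""
    total = changed = 0
    for row_a, row_b in zip(prev_frame, frame):
        for a, b in zip(row_a, row_b):
            total += 1
            if abs(a - b) > 16:
                changed += 1
    return 'I' if changed / total > threshold else 'P'
```

Two identical frames come out as 'P' (motion compensation will do well), while a hard cut from a black frame to a white one comes out as 'I'.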

In MPEG terminology, the keyframes are known as "I-pictures" (I for "intraframe") while the frames based on previous ones are known as "P-pictures" (P for "predicted").

This is a good time to mention a couple of other terms that are commonly used when talking about MPEG-style video encoding: "GOP", "profile" and "level".

No, there's nothing political about it. GOP stands for "group of pictures" and refers to the way picture types are organised inside the MPEG file. For example, if a file is said to use a GOP with the structure "IPPPPPPP\", that means it will encode the first frame as an I-picture, the next seven frames as P-pictures, and then start a new GOP. The backslash at the end is just to make clear that the GOP ends there.
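The pattern notation can be expanded mechanically over a clip; here is a small sketch (the function name and interface are my own, and the trailing backslash is simply dropped, since it only marks the end of the pattern):

```python
def expand_gop(pattern, n_frames):
    """Expand a GOP pattern like "IPPPPPPP" over a clip, returning
    the picture type of each frame.  The pattern repeats: each time
    it wraps around, a new GOP starts with its I-picture."""
    pattern = pattern.rstrip('\\')
    return [pattern[i % len(pattern)] for i in range(n_frames)]
```

So the "IPPPPPPP\" structure above, applied to a 10-frame clip, yields I P P P P P P P I P: the ninth frame starts a new GOP.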

GOPs always start with an I-picture, and their maximum length affects two things: how much memory the decoder must have available, and how long a delay there will be between reading the compressed data and being able to decode it (this is important mainly for "real-time" broadcasts). Since players might need to "go back" a full GOP in order to decode a frame, they need to have enough memory to load the complete GOP (the alternative would be to go back one frame at a time in the source media, which can take a long time). Since devices like set-top players are meant to be cheap to produce (compared to a PC, for example), distribution formats (like DVD) normally use a relatively short GOP limit (ex., 15 frames for PAL and 18 for NTSC). This allows them to react quickly to user commands without the need for large amounts of memory or very fast processors.

Note...


GOPs are usually allowed to be longer in NTSC because NTSC frames have lower resolution than PAL frames. In other words, more frames can fit in the same amount of memory.

And that brings us to the other two concepts. A "profile" is essentially a subset of the original standard (ex., "allow only I-pictures"), and a "level" is a set of constraints to the values of encoding parameters (ex., "limit resolution to 320x240 pixels"). These are used to ensure that a given video clip can be played back on a given device.

DVDs, for example, are required to comply with MPEG-2 "Main Profile" and "Main Level" (usually abbreviated as "MP@ML"). This means that any file that goes beyond those limits might not play back correctly on a set-top player (even if it plays back on a PC, which has more memory, a faster processor, more complete decoding software, etc.).

Note...


In fact, the DVD standard imposes a few restrictions that go beyond MP@ML. For example, MP@ML allows bitrates up to 15 Mb/s, while DVD is limited to 9.8 Mb/s. MP@ML allows any resolution up to 720x576, while DVDs can only use a few specific resolutions, and so on. If you are interested in the exact details, check the links at the end of this article.
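As an illustration of this kind of compliance checking, here is a small (hypothetical) checker against the DVD restrictions just mentioned. The list of allowed PAL resolutions below is partial, for the sake of the example; check the links at the end of this article for the full set.

```python
def check_dvd_limits(width, height, bitrate_mbps):
    """Check a clip's parameters against (some of) the DVD-Video
    restrictions: at most 9.8 Mb/s, and one of a few specific
    resolutions (partial PAL list shown here for illustration)."""
    allowed_pal = {(720, 576), (704, 576), (352, 576), (352, 288)}
    problems = []
    if bitrate_mbps > 9.8:
        problems.append('bitrate exceeds 9.8 Mb/s')
    if (width, height) not in allowed_pal:
        problems.append('resolution not allowed on DVD (PAL)')
    return problems
```

A 720x576 clip at 8 Mb/s passes, while the same clip at 15 Mb/s (legal under MP@ML, but over the DVD limit) does not, which is exactly why "plays on a PC" does not imply "plays on a set-top player".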

There is a third kind of frame in MPEG (in addition to I and P): bidirectional frames, or "B-pictures". And this is where it starts to get weird.

B-pictures can be based on blocks from frames that came before them (just like P-pictures) but also on frames that come after them. This means they are more likely to find a good match, and thus achieve better compression. I'm sure you spotted a problem with this approach: if frame #2 is based on frame #15, how can the player show frame #2? Well, the answer is pretty straightforward (it needs to decode frame #15 first), but it comes with a few implications. Imagine you had a GOP like this (frame number in blue, picture type in magenta):

1I 2B 3B 4B 5B 6B 7B 8B 9B 10B 11B 12B 13B 14B 15P

To decode frame #2 in that GOP, our player would need to read all frames until #15 (because frame #2 is a B-picture, based on two non-B-pictures: #1 and #15).

Now, remember the video needs to be playable in real-time. And let's say we are using a relatively slow medium, like a floppy disk, that can only transfer 100 frames per second into memory. If our playback speed is 25 frames per second, this means we have a "comfortable" margin of 400% (our media can transfer data 4x faster than necessary for real-time playback). But is it really that comfortable? Let's see what would happen in the case above (remember, we can read four frames while each frame is being displayed; each "tick" corresponds to 1/25th of a second):

Tick 1 (playback started)
Read frame #1 (I)
Read frame #2 (B)
Read frame #3 (B)
Read frame #4 (B)

Tick 2 (display frame #1)
Read frame #5 (B)
Read frame #6 (B)
Read frame #7 (B)
Read frame #8 (B)

Tick 3 (oops! we can't decode frame #2 yet - show #1)
Read frame #9 (B)
Read frame #10 (B)
Read frame #11 (B)
Read frame #12 (B)

Tick 4 (oops! we can't decode frame #2 yet - show #1)
Read frame #13 (B)
Read frame #14 (B)
Read frame #15 (P) (finally we can decode frame #2)
Read frame #16 (I) (start of next GOP)

Tick 5 (we can now show frame #2)

(etc.)

We've just delayed playback by 2 frames (ticks 3 and 4), probably throwing audio and video out of sync, and causing the customer to demand a refund.

But wait, there is a GOP length limit. We know that, at most, we'll have to read 15 frames before we can decode any frame in the GOP. So our player could work this way instead:

Tick 1 (playback started)
Read frame #1 (I)
Read frame #2 (B)
Read frame #3 (B)
Read frame #4 (B)

Tick 2 (wait)
Read frame #5 (B)
Read frame #6 (B)
Read frame #7 (B)
Read frame #8 (B)

Tick 3 (wait)
Read frame #9 (B)
Read frame #10 (B)
Read frame #11 (B)
Read frame #12 (B)

Tick 4 (wait)
Read frame #13 (B)
Read frame #14 (B)
Read frame #15 (P)
Read frame #16 (I) (start of next GOP, we can begin playback)

Tick 5 (display frame #1)
Read frame #17 (B)
Read frame #18 (B)
Read frame #19 (B)
Read frame #20 (B)

Tick 6 (display frame #2)
Read frame #21 (B)
Read frame #22 (B)
Read frame #23 (B)
Read frame #24 (B)

Tick 7 (display frame #3)
Read frame #25 (B)
Read frame #26 (B)
Read frame #27 (B)
Read frame #28 (B)

Tick 8 (display frame #4)
Read frame #29 (B)
Read frame #30 (B)
Read frame #31 (P) (end of this GOP, let's flush the first one)

Tick 9 (display frame #5)
Tick 10 (display frame #6)
Tick 11 (display frame #7)
Tick 12 (display frame #8)
Tick 13 (display frame #9)
Tick 14 (display frame #10)
Tick 15 (display frame #11)
Tick 16 (display frame #12)
Tick 17 (display frame #13)
Tick 18 (display frame #14)
Tick 19 (display frame #15)

Tick 20 (display frame #16)
Read frame #32 (I) (start of next GOP)
Read frame #33 (B)
Read frame #34 (B)
Read frame #35 (B)

(etc.)

In other words, as long as we buffer a full GOP (that is, load it fully into memory before we begin playback), we guarantee that we'll never drop frames during playback. This is assuming our source media is able to sustain its speed, of course - if the data is coming through a variable-speed network connection, all bets are off, and we probably should buffer a lot more than one GOP.

This method works, but it introduces a delay between what is being read and what is being played back (the higher the GOP limit, the bigger the delay). This can make our player feel a bit unresponsive. Maybe there is some way to minimise that delay, and let our decoder buffer as little as possible before it can begin playing back. Here is our GOP again:

1I 2B 3B 4B 5B 6B 7B 8B 9B 10B 11B 12B 13B 14B 15P

But this time, let's store it in our file like this:

1I 15P 2B 3B 4B 5B 6B 7B 8B 9B 10B 11B 12B 13B 14B

All we did was store frame #15 right after frame #1. It's still frame #15 (meaning it should only be displayed after frame #14), but it's stored before the B-pictures that depend on it. How would our player behave now?

Tick 1 (playback started)
Read frame #1 (I)
Read frame #15 (P)
Read frame #2 (B)
Read frame #3 (B)

Tick 2 (display frame #1)
Read frame #4 (B)
Read frame #5 (B)
Read frame #6 (B)
Read frame #7 (B)

Tick 3 (display frame #2)
Read frame #8 (B)
Read frame #9 (B)
Read frame #10 (B)
Read frame #11 (B)

(etc.)

By loading frame #15 right at the start, we were able to decode frame #2 as soon as we loaded it, which let us start playback almost immediately without fear of dropping frames later. In fact, we don't even need to read data that fast. Thanks to our new frame order, the decoder could work like this:

Tick 1 (playback started)
Read frame #1 (I)
Read frame #15 (P)

Tick 2 (display frame #1)
Read frame #2 (B)

Tick 3 (display frame #2)
Read frame #3 (B)

(etc.)

In other words, our decoder could now work fine even if we were only able to read 2 frames per tick.

Needless to say, MPEG encoders do exactly what was described above: they store the frames in the order they will need to be decoded in, not the order they will be displayed in. Decoders will start playback as quickly as possible and use the excess speed of the source media to buffer a couple of GOPs in the background, so they can deal with any read delays that may happen later.
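That display-to-decode reordering can be sketched as follows for the simple MPEG-1/2 case described above, where each run of B-pictures depends on the I- or P-picture that follows it in display order (the function and its frame-list representation are my own, for illustration):

```python
def decode_order(gop):
    """Reorder a GOP from display order to decode (storage) order:
    each I- or P-picture is moved in front of the B-pictures that
    precede it in display order, since those B-pictures cannot be
    decoded until their forward reference is available."""
    out, pending_b = [], []
    for frame_no, ptype in gop:
        if ptype == 'B':
            # Hold B-pictures back until their forward reference
            # (the next I or P) has been emitted.
            pending_b.append((frame_no, ptype))
        else:
            out.append((frame_no, ptype))
            out.extend(pending_b)
            pending_b = []
    out.extend(pending_b)
    return out
```

Applied to the 15-frame GOP above (1I 2B ... 14B 15P), it produces exactly the storage order shown earlier: 1I 15P 2B 3B ... 14B.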

Although B-pictures typically result in better compression, they have a few disadvantages (besides the added complexity in the encoder and decoder). First, they take longer to encode (because the encoder must look for block matches in two frames). Second, in situations where the bidirectionality doesn't help (ex., because the following frame has no similar blocks), they can produce worse results than P-pictures. And third, to avoid cyclic dependency situations, they cannot normally be used as a source for the compression of other frames. In other words, even if a B-picture manages to encode the original video frame perfectly, the next frame cannot re-use any of it.

The above limitations apply to MPEG-1 and MPEG-2 (used by the VCD and DVD formats, for example). Newer formats, based on MPEG-4, eliminate some of these problems by giving the encoder more freedom:

In MPEG-2, P-pictures are always based on the previous (I or P) picture. In MPEG-4 they can be based on any preceding picture (within the GOP), including B-pictures.

In MPEG-2, B-pictures are always based exactly on the previous and following (P or I) picture. In MPEG-4 they can be based on any other pictures in the GOP, again including B-pictures. It's up to the encoder to make sure there are no cyclic dependencies (ex., if frame #3 depends on frame #4 and frame #4 depends on frame #2, then frame #2 cannot depend on frame #3, and so on).
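An encoder could guard against such cycles with an ordinary cycle check over its frame reference choices before committing to them. A minimal sketch (the graph representation is an assumption of the example: each frame maps to the frames it is predicted from):

```python
def has_cycle(deps):
    """Check a frame reference graph for cyclic dependencies using
    a depth-first search.  `deps` maps a frame number to the frames
    it is predicted from; an encoder must reject any set of
    reference choices that forms a cycle, since no frame in the
    cycle could ever be decoded first."""
    WHITE, GRAY, BLACK = 0, 1, 2   # unvisited / in progress / done
    state = {}

    def visit(frame):
        state[frame] = GRAY
        for ref in deps.get(frame, ()):
            if state.get(ref, WHITE) == GRAY:
                return True        # back edge: we found a cycle
            if state.get(ref, WHITE) == WHITE and visit(ref):
                return True
        state[frame] = BLACK
        return False

    return any(visit(f) for f in deps if state.get(f, WHITE) == WHITE)
```

The example from the text (frame #3 depends on #4, #4 on #2, and #2 on #3) is correctly flagged as cyclic, while a legal set of choices, where everything eventually traces back to an I-picture, is not.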

These changes, coupled with longer GOPs and other improvements (such as variable-size blocks), make MPEG-4 significantly more efficient than MPEG-2, typically achieving a 50% or higher increase in compression, for the same visual quality.

There ain't no such thing as a free lunch, though, and that increase in compression comes with an even bigger increase in processing. MPEG-4 encoders can work in real-time, but they have to take some shortcuts which make the end result only marginally better than MPEG-2. For that matter, even with MPEG-2, working in real-time makes it impossible to apply some techniques (such as multi-pass VBR encoding) that help to improve the quality vs. size ratio.

Note...


MPEG-4 is still a developing standard, and different MPEG-4 codecs may not be 100% compatible. Typically, MPEG-4 codecs implement either MPEG-4 part 2 (also known as "Simple Profile" or "Advanced Simple Profile", which is similar to MPEG-2) or MPEG-4 part 10 (also known as "Advanced Video Coding" or "H.264"). Theoretically, encoders based on part 10 can achieve a much better quality vs. size ratio than those based on part 2, but current AVC implementations aren't quite mature, so (especially at lower resolutions and bitrates) SP / ASP encoders (such as XviD) often deliver better results. That is likely to change as AVC encoders improve.

MPEG facts

A few facts about MPEG

Conclusion

Conclusion

And so we come to the end of the third and final part of this compression primer. I hope you've found it interesting and accessible. The subject of compression is a vast one, and these articles are by no means thorough, but I believe they contain enough information to give readers a good understanding of the fundamental principles behind the most common compression algorithms, and perhaps make them curious enough to look for more detailed information elsewhere.

Note...


How much was this guide worth to you? More than your diamond-encrusted sportscar? Less than a packet of chewing gum? You decide. Click the link below and enter the amount you consider fair. You will need a valid credit card or a PayPal account.

PAY WHAT YOU LIKE!

If you have any questions or comments about this guide, post a message in the forum. If you quote this guide on other websites, in your e-mails, academic papers, etc., please include a link to this page.

In the unlikely event that you got here without reading the first parts of this article, use the following links:

Below is a list of links to pages with more information about these subjects. Many of them are to Wikipedia articles, some of which are useful mainly for the "external links" at the end (no sense in repeating all of them here). I can't guarantee their accuracy (in fact, I can guarantee that some of them have a few errors), but they should be a good starting point for people interested in going a little deeper into information theory, or knowing the exact details of specific formats.

Links

Links & references

Compression theory

http://en.wikipedia.org/wiki/Information_theory
http://en.wikipedia.org/wiki/Data_compression
http://en.wikipedia.org/wiki/Information_entropy
http://en.wikipedia.org/wiki/Kolmogorov_complexity
http://en.wikipedia.org/wiki/Redundancy_(information_theory)

Lossless compression

http://en.wikipedia.org/wiki/Run-length_encoding
http://en.wikipedia.org/wiki/LZ77
http://en.wikipedia.org/wiki/GIF
http://en.wikipedia.org/wiki/Huffman_coding
http://en.wikipedia.org/wiki/Arithmetic_coding
http://www.data-compression.info/Algorithms/AC/index.htm
http://www.maximumcompression.com/algoritms.php
http://www.vectorsite.net/ttdcmp_1.html
http://en.wikipedia.org/wiki/PNG
http://en.wikipedia.org/wiki/FLAC
http://en.wikipedia.org/wiki/Wavpack

Lossy compression

http://en.wikipedia.org/wiki/Color_space
http://en.wikipedia.org/wiki/YCbCr
http://en.wikipedia.org/wiki/Chroma_subsampling
http://en.wikipedia.org/wiki/Quantization_(image_processing)
http://en.wikipedia.org/wiki/Quantization_(signal_processing)
http://en.wikipedia.org/wiki/Transform_coding
http://en.wikipedia.org/wiki/Discrete_cosine_transform
http://en.wikipedia.org/wiki/JPEG
http://en.wikipedia.org/wiki/Audio_data_compression
http://en.wikipedia.org/wiki/Modified_discrete_cosine_transform
http://en.wikipedia.org/wiki/MP3
http://en.wikipedia.org/wiki/Wavelet
http://en.wikipedia.org/wiki/JPEG_2000
http://jpeg.org/

Temporal compression

http://en.wikipedia.org/wiki/Delta_encoding
http://en.wikipedia.org/wiki/MPEG
http://www.chiariglione.org/mpeg/
http://en.wikipedia.org/wiki/Video_compression_picture_types
http://en.wikipedia.org/wiki/Motion_Compensation
http://www.bbc.co.uk/rd/pubs/papers/paper_14/paper_14.shtml (MPEG-2)
http://dvd-hq.info/dvd_compression.php
http://dvd-hq.info/faq.php
http://www.microsoft.com/taiwan/whdc/archive/TempRate1.mspx (temporal rate conversion)


Copyright

Copyright