Time for another WebRTC Basics: Video Codecs
I’ve yapped about video codecs more than once here on this blog. But what is a video codec, exactly?
If you’re a web developer who is starting to use WebRTC, then until now there was little reason for you to know about video codecs. Consider this your primer on video coding.
A video codec takes a raw video stream, which can come in different resolutions, color depths, frame rates, etc., and compresses it.
This compression can be lossless, where all data is maintained (so when you decompress it you get the exact same content), BUT it is almost always going to be lossy. The notion is that we can lose data that the human eye doesn’t notice anyway. So when we compress video, we take that into account and throw stuff out relative to the quality we wish to get. The more we throw out, the lower the quality we end up with.
A video codec comes in two pieces:
- Encoder – takes the raw video data and compresses it
- Decoder – takes the compressed data created by an encoder and decompresses it
With lossy compression, the decoded stream will differ from the original one: its quality will be degraded.
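To make the lossy round trip concrete, here is a minimal Python sketch of a toy “codec” that does nothing but uniform quantization. It is an illustration under heavy assumptions, not how any real video codec is built: `encode`, `decode` and the `step` parameter are hypothetical names, and real codecs quantize transform coefficients rather than raw pixel values.

```python
# Toy lossy "codec": uniform scalar quantization of 8-bit pixel values.
# Hypothetical illustration only; it shows why a lossy decoder cannot
# reproduce the encoder's input exactly.


def encode(pixels: list[int], step: int) -> list[int]:
    """Encoder: map each 0..255 pixel to a quantization index.
    A larger step means fewer distinct indices, i.e. less data to send."""
    return [round(p / step) for p in pixels]


def decode(indices: list[int], step: int) -> list[int]:
    """Decoder: map each index back to a representative pixel value."""
    return [min(255, i * step) for i in indices]


def max_error(original: list[int], decoded: list[int]) -> int:
    """Worst-case per-pixel difference between input and decoded output."""
    return max(abs(a - b) for a, b in zip(original, decoded))


row = [12, 13, 15, 90, 91, 200, 201, 255]
for step in (1, 4, 16):
    decoded = decode(encode(row, step), step)
    print(step, decoded, max_error(row, decoded))
```

With `step=1` the round trip is lossless; as `step` grows there are fewer distinct values to send, and the worst-case error grows with it (up to `step / 2`).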
The Decoder is the Spec
What many miss is that the only thing defining a video codec is a specification for its decoder:
Given a compressed video stream, what actions need to take place to decompress it.
There is no encoder specification. The assumption is that once you know what the compressed result needs to look like, it is up to you to compress the video as you see fit. Which brings us to the next point.
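A toy example of “the decoder is the spec”: below, a hypothetical run-length format is defined only by its `decode` function, and two different encoders both produce streams that this decoder accepts. The format and all function names are made up for illustration; real specifications such as H.264 or VP8 describe the decoding process in far more detail.

```python
# A toy format, defined (like real codecs) only by its decoder:
# the bitstream is a list of (count, value) pairs, and decoding
# expands each pair into `count` copies of `value`.


def decode(stream: list[tuple[int, int]]) -> list[int]:
    out: list[int] = []
    for count, value in stream:
        out.extend([value] * count)
    return out


def naive_encoder(pixels: list[int]) -> list[tuple[int, int]]:
    """Valid but lazy: emit one pair per pixel."""
    return [(1, p) for p in pixels]


def greedy_encoder(pixels: list[int]) -> list[tuple[int, int]]:
    """Also valid: merge runs of identical pixels into a single pair."""
    stream: list[tuple[int, int]] = []
    for p in pixels:
        if stream and stream[-1][1] == p:
            count, _ = stream[-1]
            stream[-1] = (count + 1, p)
        else:
            stream.append((1, p))
    return stream


pixels = [7, 7, 7, 7, 3, 3, 9]
for encoder in (naive_encoder, greedy_encoder):
    stream = encoder(pixels)
    assert decode(stream) == pixels  # both streams conform to the "spec"
    print(encoder.__name__, len(stream), "pairs")
```

Both encoders are “compliant”; they differ only in how compact their output is (7 pairs versus 3 here), which is exactly the freedom encoder developers get.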
Generally speaking, decoders will differ from each other in their performance: how much CPU they take to run, how much memory they need, etc.
The Encoder is… Magic
Or more like a large set of heuristics.
When encoding video, you need to decide many things: how much time and effort to invest in motion estimation, how aggressive to be when compressing each part of the current frame, etc.
You can’t really get to the ultimate compression, as finding it would take far too long. So you end up with a set of heuristics, some “guidelines” or “shortcuts” that your encoder takes when it compresses the video image.
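Motion estimation is a good place to see such a shortcut. The Python sketch below compares an exhaustive block-matching search, scored by the sum of absolute differences (SAD), with a variant that stops as soon as a candidate looks “good enough”. The tiny frames, the search order and the `threshold` parameter are assumptions for illustration; real encoders use far more elaborate search patterns.

```python
# Block-matching motion estimation on tiny grayscale frames (lists of rows).
# Toy illustration only: real encoders use smarter search patterns.


def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block at (bx, by) in
    `cur` and the block displaced by (dx, dy) in `ref`."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def candidates(radius):
    """All offsets within +/-radius, smallest displacements first, since
    motion between consecutive frames is usually small."""
    offsets = [(dx, dy) for dy in range(-radius, radius + 1)
               for dx in range(-radius, radius + 1)]
    return sorted(offsets, key=lambda o: abs(o[0]) + abs(o[1]))


def search(cur, ref, bx, by, n, radius, threshold=None):
    """Return (best_offset, best_sad, candidates_tried).
    With threshold=None this is an exhaustive search; otherwise it stops
    at the first candidate whose SAD is <= threshold (a heuristic)."""
    height, width = len(ref), len(ref[0])
    best, best_cost, tried = None, float("inf"), 0
    for dx, dy in candidates(radius):
        # Skip displaced blocks that would fall outside the reference frame.
        if not (0 <= bx + dx <= width - n and 0 <= by + dy <= height - n):
            continue
        tried += 1
        cost = sad(cur, ref, bx, by, dx, dy, n)
        if cost < best_cost:
            best, best_cost = (dx, dy), cost
        if threshold is not None and cost <= threshold:
            break  # "good enough": stop searching
    return best, best_cost, tried


# Reference frame with a unique value per pixel; the current frame is the
# same picture shifted one pixel to the left.
ref = [[x + 10 * y for x in range(8)] for y in range(8)]
cur = [[ref[y][min(x + 1, 7)] for x in range(8)] for y in range(8)]

print(search(cur, ref, 2, 2, 4, 2))                # exhaustive
print(search(cur, ref, 2, 2, 4, 2, threshold=0))   # early exit
print(search(cur, ref, 2, 2, 4, 2, threshold=16))  # looser early exit
```

Here the exhaustive search tries all 25 candidates, while the early-exit search with `threshold=0` finds the same motion vector, (1, 0), after only 4. With `threshold=16` it stops at the very first candidate and settles for a worse match: that trade between speed and compression is what encoder heuristics get tuned for.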
Oftentimes, the encoder is based on experience, a lot of trial and error and tweaking done by the codec developers. The result is as much art as it is science.
Encoders will differ from each other not only in their performance but also in how well they end up compressing (and “how well” can’t be summed up in a single metric value).
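PSNR (peak signal-to-noise ratio) is one of the most common single-number quality metrics, and a few lines of Python show why it can’t tell the whole story: two very different distortions of the same block can get an identical score. The flat block and both distortions are made-up inputs for illustration.

```python
import math


def psnr(original, decoded, peak=255):
    """Peak signal-to-noise ratio in dB for 8-bit samples (higher is better)."""
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)


block = [100] * 16                    # a flat 4x4 block, flattened
spread = [v + 2 for v in block]       # every pixel off by 2: a faint shift
speck = block[:15] + [block[15] + 8]  # one pixel off by 8: a concentrated error

print(round(psnr(block, spread), 2), round(psnr(block, speck), 2))
```

Both cases have a mean squared error of 4 and therefore the same PSNR (about 42.11 dB), even though the error is distributed very differently across the block.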
A large piece of what a codec does is brute force.
As an example, most modern codecs today split an image into macroblocks, each of which goes through a DCT (or a close integer approximation of one). A 720p frame (1280x720) holds 3,600 macroblocks of 16x16 pixels; at 30 frames per second, that’s over 100,000 macroblocks to process every second.
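The arithmetic behind that claim, as a Python sketch. It assumes 16x16 macroblocks (the size used by H.264 and VP8) and 30 frames per second; the naive `dct2` function is a textbook orthonormal 2-D DCT-II included for illustration only, since real codecs use fast, often integer, approximations on smaller sub-blocks.

```python
import math


def macroblocks_per_frame(width, height, mb=16):
    """Number of mb x mb macroblocks needed to cover a frame.
    Partial blocks at the right/bottom edges still have to be coded."""
    return math.ceil(width / mb) * math.ceil(height / mb)


def dct2(block):
    """Naive orthonormal 2-D DCT-II of an N x N block.
    Each of the N*N outputs sums over all N*N inputs: O(N^4) work."""
    n = len(block)

    def scale(k):
        return math.sqrt((1 if k == 0 else 2) / n)

    return [
        [
            scale(u) * scale(v) * sum(
                block[y][x]
                * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                * math.cos((2 * y + 1) * v * math.pi / (2 * n))
                for y in range(n)
                for x in range(n)
            )
            for u in range(n)
        ]
        for v in range(n)
    ]


per_frame = macroblocks_per_frame(1280, 720)  # 720p
print(per_frame, per_frame * 30)              # per frame, per second at 30 fps

# A flat 8x8 block compresses well: all its energy lands in one coefficient.
coeffs = dct2([[10] * 8 for _ in range(8)])
print(round(coeffs[0][0], 6))
```

For 720p this gives 3,600 macroblocks per frame, or 108,000 per second at 30 fps, and each of them still has to go through transform, quantization and more.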
The same goes for motion estimation and other bits and pieces of the video codec.
To that end, many video codec implementations are hardware accelerated: either the whole codec runs on dedicated hardware, or just its ugly, compute-heavy pieces do, with “software” managing the larger picture of the codec implementation itself.
It is also why hardware support for a codec is critical for its market success and adoption.
A video codec doesn’t work in a vacuum. Especially not when the purpose of it all is to send the video over a network.
Networks have different characteristics of available bandwidth, packet loss, latency, jitter, etc.
When a video encoder is running, it has to take these things into account and compensate for them: reducing the bitrate it produces when there’s network congestion, resetting its encoding and sending a full frame instead of partial ones, etc.
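A minimal Python sketch of the first of those reactions, lowering the bitrate under congestion. The thresholds (back off above 10% packet loss, increase by 5% below 2%, otherwise hold) follow the loss-based controller described in the Google Congestion Control (GCC) Internet-Draft; the `min_bps`/`max_bps` clamps and the loss reports are hypothetical. Real WebRTC stacks also use delay-based estimation, and the request for a full frame normally comes from the receiver over RTCP (PLI/FIR), which this sketch does not model.

```python
def next_bitrate(current_bps, loss_fraction, min_bps=100_000, max_bps=2_500_000):
    """Loss-based bitrate update; loss_fraction is in 0.0..1.0."""
    if loss_fraction > 0.10:
        # Heavy loss: likely congestion, back off in proportion to the loss.
        target = current_bps * (1 - 0.5 * loss_fraction)
    elif loss_fraction < 0.02:
        # Little or no loss: probe for more bandwidth.
        target = current_bps * 1.05
    else:
        # Moderate loss: hold the current rate.
        target = current_bps
    # Hypothetical clamps so the encoder stays within sane limits.
    return int(round(min(max(target, min_bps), max_bps)))


bitrate = 1_000_000
for loss in [0.0, 0.0, 0.25, 0.30, 0.05, 0.01]:  # made-up loss reports
    bitrate = next_bitrate(bitrate, loss)
    print(f"loss={loss:.0%} -> {bitrate} bps")
```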
Implementations of the same codec also differ in how they “invest” their bitrate. Which again brings us to the next topic.
Different Implementations for Different Content Types (and use cases)
Not all video codec implementations are created equal. It is important to understand this when picking a codec to use.
When Google added VP9 to YouTube, it essentially made two compromises:
- Only a decoder had to be implemented inside the browser
- The encoder runs offline, not in real time
Real-time encoding is hard. It means you can’t think twice about how to encode things. You can’t go back to fix things you’ve done. There’s just not enough time. So you use single-pass encoders. These encoders look at the incoming raw video stream only once and decide how to compress each block of data as soon as they see it. They can’t, for example, wait a few frames to decide how best to compress it.
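A toy Python sketch of single-pass rate control: each frame is encoded exactly once, at a quantization parameter (QP) chosen only from how previous frames turned out. The bit-cost model (bits halve every +6 QP, loosely mirroring how H.264’s quantizer step size doubles every 6 QP), the 10% tolerance and the frame complexities are all assumptions for illustration.

```python
def single_pass_encode(complexities, target_bits, start_qp=26, tolerance=0.10):
    """Return (qp, bits) per frame. No lookahead, no second pass: once a
    frame is encoded, the only thing left to adjust is the next frame's QP."""
    qp = start_qp
    log = []
    for complexity in complexities:
        bits = complexity / 2 ** (qp / 6)  # toy model: bits halve every +6 QP
        log.append((qp, round(bits)))
        if bits > target_bits * (1 + tolerance):
            qp = min(qp + 1, 51)  # overshot: compress the next frame harder
        elif bits < target_bits * (1 - tolerance):
            qp = max(qp - 1, 0)  # undershot: spend more on the next frame
    return log


# A sudden jump in complexity (a scene change) overshoots the budget for
# several frames before the controller catches up.
frames = [200_000] * 5 + [1_000_000] * 15
for qp, bits in single_pass_encode(frames, target_bits=10_000):
    print(qp, bits)
```

An encoder allowed to look ahead could have seen the jump coming and planned for it; a real-time, single-pass encoder can only react after the fact.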
Is your content mostly static, coming from a PowerPoint presentation with mouse movements on top? That’s different from the head-shot video common in web meetings, which is in turn different from the high motion of the latest James Bond trailer for Spectre.
And in many ways, you pick your codec implementation based on the content type.
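In a browser, the closest knobs an application has for this are the `contentHint` property of a video `MediaStreamTrack` (with values such as `"motion"`, `"detail"` and `"text"`) and WebRTC’s `degradationPreference` sender parameter (`"maintain-framerate"`, `"maintain-resolution"`, `"balanced"`). The Python sketch below is a hypothetical mapping from the three content types mentioned above to those values; the content-type labels and the specific pairings are assumptions, and in an actual web app you would apply them from JavaScript, e.g. `track.contentHint = "text"`.

```python
# Hypothetical content-type -> settings table. The value strings are the
# ones the browser APIs define; the pairings are a judgment call.
PROFILES = {
    # Slides: sharp text matters more than smooth motion.
    "slides": {"contentHint": "text", "degradationPreference": "maintain-resolution"},
    # Web-meeting head shot: moderate motion, keep a balance.
    "head-shot": {"contentHint": "motion", "degradationPreference": "balanced"},
    # Movie trailer: high motion, keep the frame rate up.
    "trailer": {"contentHint": "motion", "degradationPreference": "maintain-framerate"},
}

# Neutral fallback; an empty contentHint leaves the choice to the browser.
DEFAULT = {"contentHint": "", "degradationPreference": "balanced"}


def settings_for(content_type):
    """Return the settings for a known content type, else DEFAULT."""
    return PROFILES.get(content_type, DEFAULT)


for kind in ("slides", "head-shot", "trailer", "unknown"):
    print(kind, settings_for(kind))
```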
A Word about WebRTC
WebRTC brings with it a huge challenge for browser vendors.
They need to create a codec implementation that is smart enough to deal with all these different types of content while running on a variety of hardware types and configurations.
From what we’ve seen in the past several years, they do quite well (though there’s always room for improvement).
Next time you wonder why you should use WebRTC instead of building your own solution, remember that having someone implement this video codec for you is one of the reasons.
So… which of these video codecs should you use in your application? Here’s a free mini video course to help you decide.