Last updated: May 2, 2026

In video coding, temporal scalability is the ability to decode only some of the frames in a video stream instead of the whole stream. This enables an SFU, for example, to reduce the bitrate sent towards viewers who don't have enough bandwidth or CPU to handle the whole stream. It also lets devices that lose a packet continue decoding the stream partially until an intra-frame is received.

Temporal scalability is one of the scalability dimensions usually attributed to SVC. In WebRTC it is available for VP8 without a full SVC implementation.

What is Temporal Scalability?

Temporal scalability is a concept that is crucial to the smooth operation of video streams, especially in the context of WebRTC and group video conferencing. To understand temporal scalability, we first need to look at the structure of video streams.

A video stream is composed of different types of frames. There are I-frames, or intra-frames, which contain all the data about a specific frame, much like how a JPG stores data about an image. Then there are P-frames, or predicted frames, which store only the changes from one frame to the next, relying on previous frames for the full picture. Usually, an I-frame will be followed by many consecutive P-frames, each depending on the one before it, creating a long dependency chain.

The Challenge of Packet Loss

This system of dependent P-frames works efficiently as long as the network is stable. However, when a packet is lost, the entire chain of frames from that point on is disrupted, since each P-frame depends on the one before it. To recover, a new I-frame needs to be transmitted, which is bandwidth-intensive.
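The effect of a break in the chain can be sketched in a few lines. This is a minimal illustration (the `Frame` shape and helper are invented for this sketch, not a real decoder API): once any frame in a flat P-frame chain is lost, every later frame is undecodable until the next keyframe.

```typescript
// Illustrative sketch, not a real decoder: with a flat chain of P-frames,
// losing a single frame makes every later frame undecodable until the
// next keyframe restarts the chain.
type Frame = { id: number; isKeyframe: boolean };

function decodableFrames(frames: Frame[], lostIds: Set<number>): number[] {
  const decoded: number[] = [];
  let chainBroken = false;
  for (const f of frames) {
    if (f.isKeyframe) chainBroken = false;          // a keyframe restarts the chain
    if (lostIds.has(f.id)) { chainBroken = true; continue; }
    if (!chainBroken) decoded.push(f.id);           // a P-frame needs an intact chain
  }
  return decoded;
}

// Keyframe 0, then P-frames 1..5; frame 2 is lost in transit.
const frames: Frame[] = [0, 1, 2, 3, 4, 5].map(id => ({ id, isKeyframe: id === 0 }));
console.log(decodableFrames(frames, new Set([2]))); // → [ 0, 1 ]  (frames 3-5 are stuck)
```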

How Temporal Scalability Works

Temporal scalability offers a solution to this problem. It introduces the concept of layering frames, typically referred to as L0 and L1. Let's say we're transmitting at 30 frames per second. We start with a keyframe, followed by L0 frames that depend only on the keyframe and on earlier L0 frames. In between, L1 frames are inserted that depend on the L0 frames, but, crucially, no L0 frame ever depends on an L1 frame. This creates a structured hierarchy of frames instead of one long flat chain.

These layer counts surface in code as SVC scalability modes like L1T2 and L1T3, configured via RTCRtpEncodingParameters.scalabilityMode.
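The repeating layer pattern behind those mode names can be sketched as follows. The helper below is illustrative only (it is not part of any browser API); it assumes the common round-robin assignment of frames to temporal layers: L1T2 alternates T0/T1, and L1T3 repeats T0, T2, T1, T2 every four frames.

```typescript
// Illustrative helper (not a browser API): which temporal layer a frame
// at a given index belongs to, for the repeating patterns typically used
// by L1T2 (T0 T1 T0 T1 ...) and L1T3 (T0 T2 T1 T2 ...).
function temporalLayer(frameIndex: number, temporalLayers: 2 | 3): number {
  if (temporalLayers === 2) {
    return frameIndex % 2;                  // even frames are T0, odd are T1
  }
  const pos = frameIndex % 4;               // L1T3 repeats every 4 frames
  return pos === 0 ? 0 : pos === 2 ? 1 : 2;
}

console.log([0, 1, 2, 3].map(i => temporalLayer(i, 2))); // → [ 0, 1, 0, 1 ]
console.log([0, 1, 2, 3].map(i => temporalLayer(i, 3))); // → [ 0, 2, 1, 2 ]
```

Dropping everything above T0 in the L1T2 pattern removes every other frame, which is exactly the halving of the frame rate described below.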

The Benefits of Temporal Scalability

The real advantage of temporal scalability is the flexibility it provides. For instance, if we want to reduce the frame rate from 30 to 15 frames per second, we can simply drop all L1 frames, maintaining the dependency chain between the L0 frames. This can be done by the SFU, which decides which viewers receive which stream: reducing the stream to 15 frames per second, and by extension the bitrate, for those with less bandwidth available to them.
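The SFU-side decision can be sketched in a few lines. This is a simplified model (the `LayeredFrame` shape and `forwardForViewer` helper are assumptions for illustration, not a real SFU API): forwarding only temporal layer 0 of a 30 fps L1T2 stream yields 15 fps without breaking any dependency, because no L0 frame references an L1 frame.

```typescript
// Sketch of an SFU's per-viewer filter: forward only frames at or below
// the viewer's maximum temporal layer. Dropping L1 never breaks decoding,
// since L0 frames depend only on other L0 frames.
type LayeredFrame = { id: number; temporalLayer: number };

function forwardForViewer(frames: LayeredFrame[], maxTemporalLayer: number): LayeredFrame[] {
  return frames.filter(f => f.temporalLayer <= maxTemporalLayer);
}

// One second of a 30 fps L1T2 stream: frames alternate T0 / T1.
const second: LayeredFrame[] = Array.from({ length: 30 }, (_, id) => ({ id, temporalLayer: id % 2 }));
console.log(forwardForViewer(second, 1).length); // → 30  (full frame rate)
console.log(forwardForViewer(second, 0).length); // → 15  (L1 frames dropped)
```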

This ability to build a dependency tree within the encoder and selectively drop frames to adjust the frame rate is what defines temporal scalability.

Temporal Scalability in Different Codecs

Temporal scalability is available in VP8, where it is used together with simulcast. In VP9 and AV1 you'll find SVC (Scalable Video Coding), which includes temporal scalability as part of its capabilities. To some extent, it is also available in H.264. Interestingly, Google has modified the implementation in VP8 from three temporal layers to just two, which has implications for the flexibility and quality of streams.

Impact on Selective Forwarding Units (SFU)

The introduction of temporal scalability significantly enhances the capabilities of an SFU. Without these techniques, an SFU would have to forward whatever it receives. However, with simulcast and temporal scalability, the number of alternatives available to an SFU increases.

For example, from three different streams at varying bitrates, we can now effectively have six options by dropping frames to reduce the frame rate from 30 to 15 frames per second. This added flexibility allows for larger calls with higher quality than would be possible otherwise.
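The arithmetic above is simple but worth making explicit. A back-of-the-envelope sketch (the variable names are just for illustration): each of the three simulcast streams can be forwarded at its full frame rate or at half rate by dropping L1 frames, giving six combinations.

```typescript
// Each simulcast stream can be forwarded at full or halved frame rate
// (the halving comes from dropping its L1 temporal layer).
const simulcastStreams = 3;
const frameRatesPerStream = 2; // e.g. 30 fps, or 15 fps via temporal scalability
console.log(simulcastStreams * frameRatesPerStream); // → 6 forwarding options
```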

Temporal scalability vs simulcast vs full SVC

| | Temporal scalability | Simulcast | Full SVC (spatial + temporal) |
|---|---|---|---|
| What varies | Frame rate only | Entire stream (resolution + fps + bitrate) | Resolution + frame rate + quality |
| Encoder sends | 1 stream with dependent layers | 2-3 separate streams | 1 stream with spatial and temporal layers |
| SFU drops | L1 frames to halve the frame rate | Entire streams | Individual spatial or temporal layers |
| Codec support | VP8, VP9, AV1 | All codecs | VP9, AV1 |
| Overhead vs. single stream | Low (5-10% extra bitrate) | High (~2x for 3 streams) | Medium |
| Best for | Handling variable bandwidth within a single stream | Broad compatibility, maximum flexibility | Large-scale conferencing with VP9/AV1 |

In practice, for most WebRTC implementations: temporal scalability (VP8 with 2 temporal layers, or VP9 L1T3) is the first step; simulcast is the most widely deployed option (simulcast streams will usually also enable and make use of temporal scalability when available); and full spatial SVC is reserved for large SFU-based calls where VP9 or AV1 support is guaranteed.

Looking to learn more about WebRTC? 

Check my WebRTC training courses

About WebRTC Glossary

The WebRTC Glossary is an ongoing project where users can learn more about WebRTC related terms. It is maintained by Tsahi Levent-Levi of BlogGeek.me.