IP networks are “jittery” by nature.
When packets are sent, there is no guarantee that they will be delivered with the same time gaps between them as when they were sent.
Many voice codecs, for example, generate and send audio frames 50 times a second, or once every 20 milliseconds. This means that on the receiving end we expect to receive 50 new packets every second, at an interval of 20 milliseconds between consecutive packets (50 packets x 20 milliseconds = 1,000 milliseconds = 1 second). If that were the case, it would be a simple matter of decoding each incoming packet and playing it back immediately through the device’s speakers.
The thing is, even if we do send a new packet every 20 milliseconds, modern IP networks don’t guarantee that these packets will arrive at 20-millisecond intervals on the receiving end. The difference, or deviation, from the expected 20-millisecond interval is called jitter.
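One rough way to see this deviation in practice is to look at the gaps between packet arrival times and compare each gap against the expected 20 milliseconds. The sketch below is illustrative only; the function name and the hard-coded arrival times are made up for the example and are not part of any WebRTC API.

```python
EXPECTED_INTERVAL_MS = 20  # codec sends one audio frame every 20 ms

def interarrival_deviations(arrival_times_ms):
    """Deviation of each inter-packet gap from the expected 20 ms.

    A perfectly smooth network would yield all zeros; any non-zero
    value is jitter on that particular gap.
    """
    deviations = []
    for prev, curr in zip(arrival_times_ms, arrival_times_ms[1:]):
        gap = curr - prev
        deviations.append(gap - EXPECTED_INTERVAL_MS)
    return deviations

# Packets were sent 20 ms apart, but arrive with network-induced variation:
arrivals = [0, 21, 39, 62, 80]
print(interarrival_deviations(arrivals))  # [1, -2, 3, -2]
```

Real RTP stacks smooth this per-gap measurement into a running jitter estimate (RFC 3550 describes one such formula), but the per-gap deviation is the raw ingredient.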
How jitter affects media quality
The higher the jitter, the worse the resulting media quality in WebRTC.
Since voice and video are time-sensitive, incoming media packets need to be collected, reordered, and then timed based on the sequence and timing in which they were generated, not the sequence and timing in which they were received. This is handled by the Jitter Buffer in the Media Engine. WebRTC has its own jitter buffer implementation that takes into account the network’s latency, any observed packet loss, the exhibited jitter, and the “distance” between the incoming audio and video packets, since it needs to lip-sync them as well.
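The collect-and-reorder step above can be sketched as a toy buffer that holds packets keyed by sequence number and only releases them, in order, once a fixed number have accumulated. This is a deliberately simplified model under assumed names (`JitterBuffer`, `depth`): WebRTC’s actual buffer adapts its depth to the observed jitter and latency rather than using a fixed one.

```python
import heapq

class JitterBuffer:
    """Toy fixed-depth jitter buffer: reorders packets by sequence number."""

    def __init__(self, depth=3):
        self.depth = depth  # packets to accumulate before playout starts
        self.heap = []      # (seq, payload) pairs, ordered by seq

    def push(self, seq, payload):
        """Store an arriving packet, regardless of arrival order."""
        heapq.heappush(self.heap, (seq, payload))

    def pop(self):
        """Return the next in-order packet, or None while still buffering."""
        if len(self.heap) < self.depth:
            return None
        return heapq.heappop(self.heap)

buf = JitterBuffer(depth=3)
# Packets arrive out of order: 2, 1, 3
for seq in (2, 1, 3):
    buf.push(seq, f"frame-{seq}")
print(buf.pop())  # (1, 'frame-1'): lowest sequence number plays first
```

The buffering depth is the trade-off at the heart of a real jitter buffer: a deeper buffer absorbs more jitter but adds playout delay, which is why production implementations size it dynamically.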