Mixing is a multiparty conferencing architecture in which a central server, the MCU (Multipoint Control Unit), combines all participants’ media streams into a single composite output.
How audio mixing works
Audio mixing is computationally inexpensive compared with video mixing. The MCU decodes all incoming audio streams, adds the waveforms together, and sends a single mixed audio stream to each participant. Each participant’s own audio is subtracted from the mix they receive to prevent echo, so the mix is computed per participant. Audio mixing remains common even in architectures that otherwise rely on an SFU.
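The sum-then-subtract step described above can be sketched as follows. This is a minimal illustration, not how any particular MCU implements it: it assumes every stream has already been decoded to 16-bit PCM frames of equal length, and the names `mix_minus` and `clamp` are hypothetical.

```python
from typing import Dict, List

# Assumption: samples are signed 16-bit PCM, already decoded and time-aligned.
INT16_MIN, INT16_MAX = -32768, 32767


def clamp(sample: int) -> int:
    """Clip a summed sample back into the 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, sample))


def mix_minus(frames: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Return, for each participant, the sum of everyone else's audio.

    frames maps a (hypothetical) participant id to one frame of samples.
    """
    if not frames:
        return {}
    length = len(next(iter(frames.values())))
    if any(len(frame) != length for frame in frames.values()):
        raise ValueError("all frames must have the same number of samples")

    # Sum all waveforms once, sample by sample.
    total = [sum(column) for column in zip(*frames.values())]

    # Subtract each participant's own samples from the shared total.
    return {
        pid: [clamp(t - s) for t, s in zip(total, own)]
        for pid, own in frames.items()
    }
```

Summing once and subtracting per participant keeps the work linear in the number of participants. A real mixer would typically also handle jitter, gain control, and silence detection, none of which are shown here.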
How video mixing works
Video mixing (compositing) is far more expensive:
- Decode all incoming video streams
- Scale and position each stream in a layout (grid, spotlight, etc.)
- Render the composite frame
- Re-encode and send to each participant
The decode, composite, and re-encode work makes video mixing the most CPU-intensive conferencing architecture, which is why routing (SFU) architectures have largely replaced mixing for video.
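The "scale and position" step for a grid layout comes down to computing one rectangle per stream. The sketch below shows only that geometry, under the assumption of a near-square grid with equal-sized tiles. The function name `grid_layout` is hypothetical, and decoding, scaling, and encoding are left to whatever media library the MCU uses.

```python
import math
from typing import List, Tuple

# A tile is (x, y, width, height) in pixels within the composite frame.
Tile = Tuple[int, int, int, int]


def grid_layout(count: int, width: int, height: int) -> List[Tile]:
    """Place `count` streams in a near-square grid inside a width x height frame."""
    if count <= 0:
        return []
    cols = math.ceil(math.sqrt(count))   # columns in the grid
    rows = math.ceil(count / cols)       # rows needed to fit every stream
    tile_w, tile_h = width // cols, height // rows
    return [
        ((i % cols) * tile_w, (i // cols) * tile_h, tile_w, tile_h)
        for i in range(count)
    ]
```

For example, `grid_layout(4, 1280, 720)` yields four 640x360 tiles in a 2x2 arrangement. The expensive part is not this arithmetic but scaling every decoded frame into its tile and re-encoding the result.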
When mixing is still used
- Audio conferences: Audio mixing is efficient and produces clean output
- Legacy endpoints: SIP/PSTN devices that can only handle a single stream
- Live broadcast: Broadcasting to a social network (Facebook Live, YouTube Live, etc.) requires creating a single media stream and sending it to the platform, usually via an RTMP interface
- Recording: Producing a single mixed recording is simpler than compositing separate streams later
See also: routing (SFU architecture), mesh (P2P architecture)


