WebRTC Multiparty Video Alternatives, and Why SFU is the Winning Model


It’s the money stupid.

We all love to hate the model of an MCU (besides those who sell MCUs that is).

There are in general 3 main models of deploying a multiparty video conference:

  1. Mesh – where each participant sends his media to all other participants
  2. MCU – where a participant is “speaking” to a central entity who mixes all inputs and sends out a single stream towards each participant
  3. SFU – where a participant sends his media to a central entity, who routes all incoming media as he sees fit to participants – each one of them receiving usually more than a single stream

I’ve taken the time to use testRTC to show the differences on the network between the 3 multiparty video alternatives on the network.

To sum things up:

  • Mesh fails miserably relatively fast. Anything beyond 3 isn’t usable anywhre in a commercial product if you ask me
  • MCU seems the best approach when it comes to load on the network
  • SFU is asymmetric in nature – similar to how ADSL is (though this can be reduced, just not in Jitsi in the specific scenario I tried)

This being the case, how can I even say that SFU is the winning model for WebRTC?

It all comes down to the cost of operating the service.

Here’s what an MCU does in front of each participant:

How media gets processed by an MFU

How media gets processed by an MFU

Here’s what an SFU does in front of each participant:

How media gets processed by an SFU

How media gets processed by an SFU

To make things easy for you, I’ve marked with colors varying from green to red the amount of effort it puts on a CPU to deal with it.

The most taxing activity in an MCU is the encoding and decoding of the video. With the current and upcoming changes in video and displays, this isn’t going to lessen any time soon:

  • Google just switched to VP9, which takes up more CPU
  • 4K displays and cameras are becoming a reality. 8K is being discussed already. This means 4 times the resolutions of full HD

If anything – things are going to get worse here before they get any better.

It is no surprise then that MCUs scale on single machines in the 10’s of ports or low 100’s at best; while SFUs scale on single machines in the 1,000’s of ports or low 10,000’s.

Which brings us to two very important aspects of this:

  1. Price per port, where an SFU will ALWAYS be lower than MCU – by several factors
  2. Deployment complexity

The first reason is usually answered by people that if you want quality – you need to pay for it. Which is always true. Until you start reminding yourself that video calling today is priced at zero for the most part.

The second reason isn’t as easy to ignore. If you aim for cloud based services needing to serve multiple customers, your aim is to go to 10,000 or more parallel sessions. Sometimes millions or more. Here would be a good time to remind you that WhatsApp crossed the billion monthly active users and most messaging services become interesting when they cross 100 million monthly active users.

With such numbers, placing 100 times more machines to support an MCU architecture instead of an SFU one is… prohibitive. There are more costs that needs to be factored in, such as power consumption, rack space and higher administration costs.

The end result?

An SFU model is by far the most popular deployment today for WebRTC services.

Does it fit all use cases? No

Will it fit your use case? Maybe

Do customers care? No


Need to pick an open source WebRTC media server framework for your project? Check out this free selection worksheet.


Robert Welbourn says:
March 7, 2016

One other thing to bear in mind is that, with an SFU, it’s a relatively easy thing to alter the layout of the web page or mobile app to suit the service provider’s needs. With an MCU, you’re pretty much stuck with whatever screen layout the MCU vendor offers, such as 1 large video + 5 small.

Philipp Hancke says:
March 8, 2016

There is no such thing as an “MCU architecture” or an “SFU architecture”. Your system architecture will include one or both building blocks. If you only need a SFU: lucky you!

Zohar Babin says:
June 23, 2017

Philipp is correct, it’s never just one or the other – outside of peer 2 peer one on one call, most use-cases will require a mix of both architecture.

A typical “MCU” focused vendor will decode all incoming streams and merge all decoded streams into one in one encoding pass (one per template times ABR flavors). Then clients will be able to chose between templates (usually there are no more than 3: active speaker, grid, and active + screenshare). It will then also include routing in that it will decide which client should receive which flavor and template. It will be far more efficient on bandwidth consumption (each client only upload once, and download once), and indeed little more taxing on CPU.. here’s why it’s not “by far” more encoding:

A typical “SFU” doesn’t just re-route streams, because if it did, you’ll always end up with poor user experience with many interruptions due to network load. If your conference includes more than 2 participants, in a pure SFU architecture each participant will have to upload once (or more in case there’s ABR) and download several streams (one per participant). Such a scenario will be very taxing on bandwidth and since in most cases upload connections are slow, it’s also impractical to achieve proper ABR with pure SFU. Hence, a more common SFU implementation will also include decoding and encoding, in-order to create the ABR flavors of each participant (which means transcoding pipe per client per flavor), and then it will route the proper flavor to each client. It will still be heavier on bandwidth due to each client receiving multiple streams instead of one. Additionally, the SFU architecture is taxing on the client side rendering side too, and thus requires further specialized applications to be installed on the client side.. as most mobile browsers will crash in the face of having to play multiple video streams concurrently.
In addition, another downside of an SFU is that connecting to devices and applications such as room systems and/or broadcasting to sites like YouTube and Facebook will always require further encoding (usually called “Gateways”) in order to mix all streams into one rtmp/rtsp/webrtc upload.

The exception to above described SFU challenges is ; SVC based SFU. Bringing an important advantage over pure SFU/MCU: It requires less encoding on the server (MCU) for ABR or less upload BW is needed (SFU). In an SVC based SFU each client uploads an “onion of video quality layers” (Read more; http://info.vidyo.com/rs/vidyo/images/WP-Vidyo-SVC-Video-Communications.pdf), the server is then able to decide which layers to send to each client depending on the client, screen size, network connection, etc. This makes the server significantly more efficient, since it’s able to deliver ABR video without having to perform transcoding on the server. Lastly, even with SVC, supporting other devices/sites will require a Gateway for mixing all streams into one (including all browsers that don’t support SVC), and it often requires the use of specialized applications due to SVC being more complex to implement and rather new in browsers.

mekya says:
January 29, 2019

Another approach is Adaptive SFU

| -> High Quality Encode -> |
receive -> decode ->| | -> Send according to bandwidth
|-> Low Quality Encode -> |

We use this approach at Ant Media Server https://antmedia.io
In addition it supports traditional SFU as well.

Krishna says:
June 26, 2020

How to find a particular site is using Muc or SFU or Mesh?

    Tsahi Levent-Levi says:
    June 26, 2020

    On a 3-way or bigger group call, do the following:

    * Open chrome://webrtc-internals
    * Check how many incoming and outgoing voice/video channels you see
    * 1 incoming and 1 outgoing voice/video channels that is an MCU
    * 1 outgoing and many incoming voice/video channels that is an SFU
    * many incoming and many outgoing voice/video channels that is Mesh

    Not always, but close enough.