What Are the Challenges of a DIY WebRTC SFU?

January 25, 2016

S doesn’t stand for Simple in WebRTC SFU.

What are you doing about your WebRTC SFU requirements?

I have noticed recently that more and more companies are attempting to create their own SFU. SFU stands for Selective Forwarding Unit, and it is by far the most popular and cost-efficient architecture today for multiparty video with WebRTC. With it, all participants send their video to a single entity (usually in multiple resolutions/bitrates), and that single entity decides (selectively) which of the incoming video streams to route to each participant.
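
To make the "selective" part concrete, here is a minimal sketch of the routing decision, not how any particular SFU does it. It assumes each sender uploads a few simulcast layers and that the SFU already has a bandwidth estimate per receiver. The names `Layer`, `select_layer` and `route`, and the even split of the budget across senders, are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One simulcast encoding that a sender uploads to the SFU (hypothetical model)."""
    name: str
    height: int
    bitrate_kbps: int


def select_layer(layers, available_kbps):
    """Pick the highest-bitrate layer that fits the budget.

    Falls back to the lowest-bitrate layer when nothing fits.
    """
    if not layers:
        raise ValueError("sender offers no layers")
    ordered = sorted(layers, key=lambda layer: layer.bitrate_kbps)
    chosen = ordered[0]
    for layer in ordered:
        if layer.bitrate_kbps <= available_kbps:
            chosen = layer
    return chosen


def route(senders, receivers):
    """Decide which layer of each remote sender is forwarded to each receiver.

    senders:   participant -> list of Layer
    receivers: participant -> estimated downlink in kbps
    Assumption: the downlink budget is split evenly across remote senders.
    """
    plan = {}
    for receiver, budget_kbps in receivers.items():
        remote = [s for s in senders if s != receiver]
        share = budget_kbps / len(remote) if remote else 0
        plan[receiver] = {s: select_layer(senders[s], share).name for s in remote}
    return plan


if __name__ == "__main__":
    layers = [Layer("low", 180, 150), Layer("mid", 360, 500), Layer("high", 720, 1500)]
    senders = {"alice": layers, "bob": layers, "carol": layers}
    receivers = {"alice": 4000, "bob": 1200, "carol": 200}
    for receiver, choices in route(senders, receivers).items():
        print(receiver, choices)
```

Note that nothing is decoded or re-encoded here: the SFU only chooses which of the already encoded streams to forward. A real SFU would also react to changing bandwidth, dominant speaker, and layout, which is where the complexity discussed below comes from.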

One such popular framework is the Jitsi Videobridge.

Up until today, an SFU for WebRTC was rather simplistic. You had VP8 to contend with as a developer but that’s about it. Life was good. You built your service and mostly whined about incompatibility between browsers.

Things have changed.

Here are a few things that you need to take into consideration when you either build your own WebRTC SFU or adopt/acquire one from others:

  • Do you use VP8 or VP9 in your SFU?
    • Google is already adding VP9 to Chrome
    • How long will it take until it catches on for some use cases?
    • VP9 is a better codec, so why not use it?
  • Can it support multiple codecs simultaneously?
    • Before the end of this year, we will have VP8, VP9 and H.264 available to us in browsers
    • Not all browsers will support them all
    • VP8 seems like the lowest common denominator at the moment
    • This may change to H.264 when Microsoft Edge and Chrome support it though
    • An SFU supporting only VP8 will start looking old pretty fast – and won’t work on Edge
    • Staying in H.264/VP8 land will not perform as well as VP9 in terms of perceived quality for the users
    • So it would be beneficial to be able to use whatever is available at the best possible quality
    • Which makes it a lot more complex for an SFU – more decisions to make with more data points to take into consideration
  • Mobile
    • Mobile doesn’t like multiple, simultaneous video decoders
    • Especially not when this is hardware accelerated – not all smartphone hardware can work this way
    • For mobile devices, you just might want to select a single video stream to send to them – or combine multiple video streams into a single one (which looks more like an MCU, but who cares?)
  • Broadcast
    • In many new use cases, people want to have multiple participants chatting, but many more passively viewing
    • Can an SFU scale there? And if it can’t, what will you do instead?
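
The multi-codec question in the list above can be sketched in a few lines. This is a rough illustration only: the preference order and the capability sets are made-up assumptions (they are not real browser capabilities), and `common_codec` and `encodings_needed` are hypothetical helper names, not part of any WebRTC API.

```python
# Assumed preference order, best perceived quality first (illustrative only).
PREFERENCE = ["VP9", "H264", "VP8"]


def common_codec(capabilities, preference=PREFERENCE):
    """Return the most preferred codec that every participant supports, or None.

    capabilities: participant -> set of codec names it can encode and decode
    """
    if not capabilities:
        return None
    shared = set.intersection(*(set(codecs) for codecs in capabilities.values()))
    for codec in preference:
        if codec in shared:
            return codec
    return None


def encodings_needed(capabilities, sender, preference=PREFERENCE):
    """For each receiver, the best codec that both it and `sender` support.

    A value of None marks a receiver that cannot be served without transcoding.
    """
    own = set(capabilities[sender])
    needed = {}
    for receiver, codecs in capabilities.items():
        if receiver == sender:
            continue
        needed[receiver] = next(
            (c for c in preference if c in own and c in codecs), None
        )
    return needed


if __name__ == "__main__":
    # Hypothetical capability sets, not a statement about actual browsers.
    caps = {"a": {"VP8", "VP9", "H264"}, "b": {"VP8", "H264"}, "c": {"H264"}}
    print(common_codec(caps))
    print(encodings_needed(caps, "a"))
```

When `common_codec` returns `None`, or when the shared codec is much worse than what most participants could use, the SFU has to choose between asking senders for more than one encoding and handing some streams to a transcoder. That is the "more decisions with more data points" problem.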

Like any other technology, once you get down to it, there’s more to do than the proof of concept. Consider these aspects at the beginning of your project – and not when you need to seriously rethink your architecture.

  1. Erm… SFUs are supposed to be (relatively) codec-agnostic. You don’t need to support VP9 because you’re not decoding packets. The reality is slightly more complicated, as usual. Choosing the right codec depending on the participants’ capabilities and switching dynamically is the complicated issue here, but that needs to be in the command & control logic.

    Building an SFU is not something to be undertaken lightly. It may be easy to have something running initially, but then you hit all kinds of problems from the list you mention. And more…

    1. While the basic principles are simple, indeed there are a lot of details that matter. Specifically for the need to know the codec, this is definitely the case for scalable codecs, before generic frame marking is available. By knowing the codec, you can get layering information which is crucial for the SFU to do its magic. A simulcast SFU, on the other hand, can indeed be mostly agnostic.

  2. Tsahi,

    As you point out, adding codecs adds complexity.

    When different devices have differing codec support, what strategy would you apply to avoid the need for transcoding? It seems to me that H.264 will become the lowest common denominator, negating some of the benefits of advanced codecs like VP9. On the other hand, an SFU-mediated conference might support VP9, while also having desktop clients encode a low quality, low bandwidth H.264 stream for mobile clients. (A sort of poor man’s SVC, if you will.)

    1. Robert,

      Your guess here is as good as mine (and probably even better). There are no good alternatives that will be elegant.

      The truth is that this is going to stay messy with different strategies used by different vendors.

  3. What are the reasons for an SFU to support codecs?

    In my opinion, one of the reasons why you need to take care of codecs in an SFU is recording. Some customers want audio and video to be recorded in the SFU, which requires decrypting the traffic and doing codec-dependent processing. The other one I know is the one Fippo wrote: choosing the right codec depending on the participants’ capabilities.

    Are there other reasons?

    1. iwashi, codecs differ from one another in the quality they offer, their network resiliency and their ability to accommodate some advanced SFU-related capabilities. There is also a lot of fine-tuning that an SFU needs to undergo to optimize for the network – and this is highly dependent on the codec.

  4. (Disclaimer: I have been working on transcoding MCUs since the early 2000s, so my opinions are very biased)

    Let’s face it, SFUs are a very good theoretical solution that doesn’t work in real life. Well, at least not in this real life. Check the preconditions that an SFU requires in order to work properly:

    -All endpoints must support the same codec
    -The codec must support temporal/spatial scalability (either SVC or simulcast)
    -Endpoints need to have enough CPU power to handle it (so hardware-based codecs are a must for mobiles/tablets)

    As Tsahi points out in the article, none of these requirements fits nicely with WebRTC today or in the near future.

    Building an SFU is quite easy, but you are putting most of the effort on the endpoints. And as you don’t control the endpoints, there is nothing you can do on your server to fix issues.

    Or if you do, then you have to move to something that looks much more like a transcoding MCU than an SFU.

    That said, the main advantage of SFUs is that they are fairly inexpensive, so the caveats and limitations may be fine for free users, but obviously not for paying users requiring QA.

    1. Sergio,

      Thanks for sharing. While I agree there are complexities here, for the most part, the majority of multiparty WebRTC solutions out there today are predominantly SFU based – and some of them are commercial offerings.

      The realities of the web and the economics around most WebRTC business models put the SFU as the only viable solution – one that at times outshines MCUs (Hangouts is a great example).

      1. Hangouts requires a plugin for Firefox, so it is cheating a bit.. 😉

        Also, I have received quite a lot of feedback from people for whom the SFU model is not working at all.

        As you say, business models are driving the technical solution, which is perfectly fine. But that is my point: SFUs are not the best solution for WebRTC, they are the best one that you can afford 😉

        IMHO, the future is a hybrid SFU-MCU model, at least until we have handsets with hardware-based VP9 for WebRTC everywhere..

        1. The value of an SFU is a lightweight core, with the intelligence residing on the client side, because everyone now has more CPU power and bandwidth at hand.
          With a lightweight, dumb core and a smart client, developers get great possibilities instead of being limited as they are by current MCUs, softswitches, etc.

          For example, every client sends out 2 streams, H.264 + VP9, and the SFU selectively forwards the appropriate codec to the other clients. This would easily solve the codec compatibility issue.

          With a smart client, you should be smarter… I have a post about smart clients on my blog.

          1. I am not sure that encoding two different video codecs in the client and sending them both to the SFU is something I’d call best practice or smart, but I guess it may have its uses.

            Thanks for sharing.

          2. Yup, it might not be best practice, nor smart. That is not my point.

            Clients are getting more powerful and intelligent; it is the client that is smart. A smart client greatly reduces the need for a powerful/smart core. In that context, transcoding is not a necessary feature for an MCU/SFU.

          3. I totally agree with Choubb. Experience in other domains has shown that, to achieve scale (think a cloud service), you want to push the complexity to the edge. It’s much easier to add complexity there, vs. the server. A useful example is web servers and web browsers. The web server (ignoring scripting) just fetches content; it’s the browser that does the heavy lifting of decoding content, compositing and rendering according to HTML/CSS. Imagine how difficult it would be to scale the web if web servers dished pre-rendered pages. With an SFU, all decoding and compositing is done at the client and not the server.

    2. Fixing stuff in the server removes pressure on the endpoint to fix stuff. Which did not exactly work out great if you look at the last two decades of SIP.

      SFUs are not a silver bullet, they are a building block in your overall architecture (which may include MCUs, media servers for recording or other processing, etc). The separation of the focus logic from the SFU was the main reason for me to choose the JVB a while back.

  5. Sigh … how is it that we are so attached to the past that we’d even prefer expensive and ugly over nice and inexpensive?

    OK, let’s get to work then:

    * Mobile doesn’t like SFU.

    Does it now? I’ll try to remember this next time I attend our 12 person standup from my mobile:


    * SFUs work in theory but not in real life

    Sergio, I suppose that’s why you are building one now :). I understand that investing 15 years of your life can make you partial, but still … never worked? When you take the combination of Hangouts, Janus (including Slack’s version) and all Jitsi installations today (including ones like HipChat, join.me, Highfive without even counting the thousands of others), then I am pretty sure that SFUs are hosting more conferences today than MCUs could have ever hoped to (thank you BlueJeans for saving some honor for the MCU model).

    * How would an SFU let you stream to an unlimited audience?

    Well, exactly the same way as anything else. One example from yesterday’s Jitsi community call:


    (not hundreds of watchers this time but I hope you don’t doubt YouTube’s ability to handle that).

    Here’s how you can have that yourself:


    * But what about Edge and VP9?

    I’ll give you that this one is still an unknown. Am I worried? Not in the least. Remember: more conferences are happening over SFUs today than over any other architecture. Microsoft have shown themselves very open on the codec matter, so I am confident we will have some nice surprises coming from there.

    Cheers! 🙂

    1. Let me add my sigh here to Emil’s and Tsahi’s…

      I’ve been in video communications technology since 1990, and we (Vidyo) created the very first SFU, which has been on the market since 2008.

      You can see Vidyo’s SFUs working in real life, live, at CERN: http://avc-dashboard.web.cern.ch/Vidyo (this is a live portal of the currently deployed system). So yes, it does work in real life: the world’s largest particle physics laboratory relies on it. (I will refrain from further name dropping.)

      If that’s not enough, note that both Hangouts and Skype use versions of this. So this is very real.

      Does that mean that we don’t need transcoding anymore? Certainly not – there are tons of legacy (and not so legacy) gear out there that needs to be connected. Does that mean that you don’t need a CDN for streaming to thousands or millions of people? Of course you do.

      But for the vast majority of use cases, including many with WebRTC, you can just use an SFU and enjoy low complexity and – most significantly – very low delay. This does wonders for the user experience. Transcoding is only used as a last resort, not all the time.

      There is no silver bullet that solves everything. The SFU has proved to be a great new tool in the video communications arsenal, and today dominates in both commercial and open source implementations of multipoint video systems. I don’t think anyone can argue with this.
