What’s the Best Size for a WebRTC SFU Media Server?

08/01/2019

Small, Medium, Big or Extra Large? How do you like your WebRC SFU Media Server?

I just checked AWS. If I had to build the most bad-ass, biggest, meanest, scalest, siziest server for WebRTC. One that can handle gazillions of sessions, I’d go for this one:

A machine to drool over… Should buy such a toy to write my articles on.

Or should I go for the biggest machine out there?

I did a round-up of some of the people who develop these SFUs. And guess what? None of them is ordering the XL machine.

They go for a Medium or Medium Well. Or should I say Medium Large?

Media servers, Signaling, NAT traversal – do you know what it takes to install and manage your own WebRTC infrastructure? Check out this free video course on the untold story of the WebRTC servers backend.

Anyways – here are a few things to think about when picking a machine for your SFU:

Going BIG on your SFU

As big as they come that’s how big you wanna take them.

We called it scale up in the past. Taking the same monolith application and put it on a bigger machine to get more juice out of it.

It’s not all bad, and there are good reasons to go that route with a media server:

Managing less machines

If one big machine does the work of 10 smaller machines, then all in all, you’ll need 1/10 the number of machines to handle the same workload.

In many ways, scaling is non-linear. To get to linear scaling, you’ll need to put a lot of effort. Different bits and pieces of your architecture will start breaking once you scale too much. In this sense, having less machines to manage means less scaling headaches as well.

Having bigger rooms

Group calling is what we’re after with media servers. Not always, but mostly.

Getting 4 people in a room is easy. 20? Harder. 500? Doable.

The bigger the rooms, the more you’ll need to start addressing it with your architecture and scale out strategies.

If you take smaller machines, say ones that can handle up to 100 concurrent users, then getting any group meeting to 100 participants or more is going to be quite a headache – especially if the alternative is just to use a bigger machine spec.

The bigger the rooms you want, the bigger the machines you’ll aim for (up to a point – if you want to cater for 100+ users in a room, I’d aim for other scaling metrics and factors than just enlarging the machines).

Less fragmentation

Similar to how you fit chunks of memory allocations into physical memory, fitting group sessions into media servers, and maybe even cascading them across machines will end up with fragmentation headaches for you.

Let’s say some of your meetings are really large and most are pretty smallish. But you don’t really know in advance which is which. What would be the best approach of starting to fit new rooms into existing media servers? This isn’t a simple question to answer, and it gets harder the smaller the machines are.

Simpler architecture (=no cascading)

If you are setting up the media server for a specific need, say catering for the needs of a hospital, then the size is known in advance – there’s a given number of hospital beds and they aren’t going to expand exponentially over night. The size of the workforce (doctors and nurses) is also known. And these numbers aren’t too big. In such a case, aiming for a large machine, with an additional one acting as active/passive server for high availability will be rather easy.

Aiming for smaller machines might get you faster to the need to scale out in your architecture. And scaling out has its own headaches and management costs.

Simpler

Bigger machines are going to be simpler in many ways.

Going small on your SFU

This is something I haven’t thought about as an alternative – at least not until a few years ago when I was helping a client in picking a media server for his cloud based service. One of the parameters that interested him was how small was considered too small by each media server vendor – trying to understand the overhead of a single media server process/machine/application.

I asked, and got good answers. I since decided to always look at this angle as well with the projects I handle. Here’s where smaller is better for WebRTC media servers:

Easier to upgrade

I dealt with upgrading WebRTC media servers in the past.

There are two things you need to remember and understand:

  1. WebRTC moves fast (and breaks things while doing so)
  2. You’ll need to update your backend rather frequently, including your media servers

The most common approach to upgrades these days is to drain media servers – when wanting to upgrade, block new sessions from going into some of the media servers, and once the sessions the are already handling are closed, kill and upgrade that media server. If it takes too long – just kill the sessions.

Smaller machines make it easier to drain them as they hold less sessions in them to begin with.

Having more machines also means you can mark more on them in parallel for draining without breaking the bank.

Blast radius of crashes

This is what started me on this article to begin with.

I took the time to watch Werner Vogels’s keynote from AWS re:Invent which took place November 2018. In it, he explains what got AWS on the route to build their own databases instead of using Oracle, and why cloud has different requirements and characteristics.

Here’s what Werner Vogels said:

With blast radius we mean that if a failure happens, and remember: everything fails all the time. Whether this is hardware or networking or transformers or your code. Things fail. And what you want to achieve is that you minimize the impact of such a failure on your customers.

Basically, if something fails, the minimum set of customers should be affected, if that’s the case.

Everything fails all the time.

And we do want to minimize who’s affected by such failures.

The more media servers we have (because they are smaller), the less customers will be affected if one of these servers fail. Why? Because our blast radius will be smaller.

CPU utilization

Here’s something about most modern media servers you might not have known – they don’t eat up CPU. Well… they do, but less than they used to a decade ago.

In the past, media servers were focused on mixing media – the industry was rallied around the MCU concept. This means that all video and audio content had to be decoded and re-encoded at least once. These days, it is a lot more common for vendors to be using a routing model for media – in the form of SFUs. With it, media gets routed around but never decoded or encoded.

Media servers, Signaling, NAT traversal – do you know what it takes to install and manage your own WebRTC infrastructure? Check out this free video course on the untold story of the WebRTC servers backend.

In an SFU, network I/O and even memory gets far more utilized than the CPU itself. When vendors go for bigger machines, they end up using less of the CPU of the machines, which translates into wasted resources (and you are paying for that waste).

At times, cloud vendors throttle network traffic, putting a limit at the number of packets you can send or receive from your cloud servers, which again ends up as putting a limit to how much you can push through your servers. Again, causing you to go for bigger machines but finding it hard to get them fully utilized.

Smaller machines translates into better CPU utilization for your SFU in most cases.

Number of Cores/CPUs and Your SFU’s Architecture

Big or small, there’s another thing you’ll need to give your thought to – and that’s the architecture of the media server itself.

Media servers contain two main components (at least for an SFU):

  1. Control/signaling
  2. Media routing

Sometimes, they are coupled together, other times, they are split between threads or even processes.

In general, there are 3 types of architectures that SFUs take:

  1. Have a single process handle both control and media; doing it in a multithreaded mode
  2. Have separate processes that can scale out, running each on its own machine or thread
  3. Decoupling control and media and having both of them scale out independently of each other

Me? I like the third alternative for large scale deployments. Especially when each process there is also running a single thread (I don’t really like multithreaded architectures and prefer shying away from them if possible).

That said, that third option isn’t always the solution I suggest to clients. It all depends on the use case and requirements.

In any case, you do need to give some thought to this as well when you pick a machine size – in almost all cases, you’ll be used a multi-core multi-threaded machine anyway, so better make the most of it.

How Do You Like Your SFU?

Back to you.

Media servers, Signaling, NAT traversal – do you know what it takes to install and manage your own WebRTC infrastructure? Check out this free video course on the untold story of the WebRTC servers backend.

Responses

Philipp Hancke says:
January 8, 2019

Great post! I like the “blast radius” concept.

On AWS you’ll usually pay at least 10x as much for the network traffic as you do for the CPU, making the cost of additional machines relatively irrelevant. Which will also affect decisions like how to deploy stuff. Of course, once you realize how expensive AWS is for running media servers you will need to reconsider all that 😉

Reply
    Tsahi Levent-Levi says:
    January 8, 2019

    Thanks. I liked the way Vogel explained it.

    And definitely thanks for the cost angle – somehow I failed to mention it or even think about it when writing this article.

    Reply
David MacLaren says:
January 8, 2019

I was trying to avail myself of your kind offer…

“…Check out this free video course on the untold story of the WebRTC servers backend…”

I switched all the script/tracking/ad-blocking widgets but the links don’t seem to go anywhere..

This whole page is helpful btw.. thanks as usual!

Reply
David MacLaren says:
January 8, 2019

this is from chrome where your link works.. does not work for me in either firefox regular or dev.. does work in opera.

also, when I tried to reply to your comment in firefox I was dished up a WP error.. “Comment is a spam.” 🙂

anyway all good.. I signed up but my normal address did not receive anything so i used a gmail account and all was fine..

Reply
Pavel Punsky says:
January 12, 2019

Not all components of the machine scale at the same rate – biggest machine has only x4 faster network than the largest machine (according to advertised numbers). In reality, network performance is capped by amount of cores processing network stack, the fact that webrtc is mostly UDP packets, etc.

So that c5.18xlarge != 9 * c5.2xlarge.

Reply
Real Time Weekly #263: January 14, 2019 says:
January 14, 2019

[…] What’s the Best Size for a WebRTC SFU Media Server?   […]

Reply
Eran Kalmanson says:
February 9, 2019

Great article Tsahi! I have just made the whole team read it, and are getting a quiz on Monday. I’ll update you on the results.

Reply

Comment