How Many Users Can Fit in a WebRTC Call?

By Tsahi Levent-Levi

January 15, 2018

How many users?As many as you like. You can cram anywhere from one to a million users into a WebRTC call.

You’ve been asked to create a group video call, and obviously, the technology selected for the project was WebRTC. It is almost the only alternative out there and certainly the one with the best price-performance ratio. Here’s the big question: How many users can we fit into that single group WebRTC call?

Need to understand your WebRTC group calling application backend? Take this free video mini-course on the untold story of WebRTC’s server side.

At least once a week I get approached by someone saying WebRTC is peer-to-peer and asking me if you can use it for larger groups, as the technology might not fit for such use cases. Well… WebRTC fits well into larger group calls.

You need to think of WebRTC as a set of technological building blocks that you mix and match as you see fit, and the browser implementation of WebRTC is just one building block.

The most common building block today in WebRTC for supporting group video calls is the SFU (Selective Forwarding Unit). a media router that receives media streams from all participants in a session and decides who to route that media to.

What I want to do in this article, is review a few of the aspects and decisions you’ll need to take when trying to create applications that support large group video sessions using WebRTC.

Analyze the Complexity

The first step in our journey today will be to analyze the complexity of our use case.

With WebRTC, and real time video communications in general, we will all boil down to speeds and feeds:

Speeds – the resolution and bitrate we’re expecting in our service
Feeds – the stream count of the single session

Let’s start with an example.

Assume you want to run a group calling service for the enterprise. It runs globally. People will join work sessions together. You plan on limiting group sessions to 4 people. I know you want more, but I am trying to keep things simple here for us.

The illustration above shows you how a 4 participants conference would look like.

Magic Squares: 720p

If the layout you want for this conference is the magic squares one, we’re in the domain of:

You want high quality video. That’s what everyone wants. So you plan on having all participants send out 720p video resolution, aiming for WQHD monitors (that’s 2560×1440). Say that eats up 1.5Mbps (I am stingy here – it can take more), so:

Each participant in the session sends out 1.5Mbps and receives 3 streams of 1.5Mbps
Across 4 participants, the media server needs to receive 6Mbps and send out 18Mbps

Summing it up in a simple table, we get:

Resolution	720p
Bitrate	1.5Mbps
User outgoing	1.5Mbps (1 stream)
User incoming	4.5Mbps (3 streams)
SFU outgoing	18Mbps (12 streams)
SFU incoming	6Mbps (4 streams)

Magic Squares: VGA

If you’re not interested in resolution that much, you can aim for VGA resolution and even limit bitrates to 600Kbps:

Resolution	VGA
Bitrate	600Kbps
User outgoing	0.6Mbps (1 stream)
User incoming	1.8Mbps (3 streams)
SFU outgoing	7.2Mbps (12 streams)
SFU incoming	2.4Mbps (4 streams)

The thing you may want to avoid when going VGA is the need to upscale the resolution on the display – it can look ugly, especially on the larger 4K displays.

With crude back of the napkin calculations, you can potentially cram 3 VGA conferences for the “price” of 1 720p conference.

Hangouts Style

But what if our layout is a bit different? A main speaker and smaller viewports for the other participants:

I call it Hangouts style, because Hangouts is pretty known for this layout and was one of the first to use it exclusively without offering a larger set of additional layouts.

This time, we will be using simulcast, with the plan of having everyone send out high quality video and the SFU deciding which incoming stream to use as the dominant speaker, picking the higher resolution for it and which will pick the lower resolution.

You will be aiming for 720p, because after a few experiments, you decided that lower resolutions when scaled to the larger displays don’t look that good. You end up with this:

Each participant in the session sends out 2.2Mbps (that’s 1.5Mbps for the 720p stream and the additional 700Kbps for the other resolutions you’ll be simulcasting with it)
Each participant in the session receives 1.5Mbps from the dominant speaker and 2 additional incoming streams of ~300Kbps for the smaller video windows
Across 4 participants, the media server needs to receive 8.8Mbps and send out 8.4Mbps

Resolution	720p highest (in Simulcast)
Bitrate	150Kbps – 1.5Mbps
User outgoing	2.2Mbps (1 stream)
User incoming	1.5Mbps (1 stream) 0.3Mbps (2 streams)
SFU outgoing	8.4Mbps (12 streams)
SFU incoming	8.8Mbps (4 streams)

This is what have we learned:

Different use cases of group video with the same number of users translate into different workloads on the media server.

And if it wasn’t mentioned specifically, simulcast works great and improves the effectiveness and quality of group calls (simulcast is what we used in our Hangouts Style meeting).

Across the 3 scenarios we depicted here for 4-way video call, we got this variety of activity in the SFU:

	Magic Squares: 720p	Magic Squares: VGA	Hangouts Style
SFU outgoing	18Mbps	7.2Mbps	8.4Mbps
SFU incoming	6Mbps	2.4Mbps	8.8Mbps

Here’s your homework – now assume we want to do a 2-way session that gets broadcasted to 100 people over WebRTC. Now calculate the number of streams and bandwidths you’ll need on the server side.

How Many Users Can be Active in a WebRTC Call?

That’s a tough one.

If you use an MCU, you can get as many users on a call as your MCU can handle.

If you are using an SFU, it depends on a 3 different parameters:

The level of sophistication of your media server, along with the performance it has
The power you’ve got available on the client devices
The way you’ve architected your infrastructure and worked out cascading

We’re going to review them in a sec.

Same Scenario, Different Implementations

Anything about 8-10 users in a single call becomes complicated. Here’s an example of a publicly available service I want to share here.

The scenario:

9 participants in a single session, magic squares layout
I use testRTC to get the users into the session, so it is all automated
I run it for a minute. After that, it kills the session since it is a demo
It takes into account that with 9 people on the screen, reducing resolutions for all to VGA, but it allocates 1.3Mbps for that resolution
Leading to the browsers receiving 10Mbps of data to process

The media server decided here how to limit and gauge traffic.

And here’s another service with an online demo running the exact same scenario:

Now the incoming bitrate on average per browser was only 2.7Mbps – almost a fourth of the other service.

Same scenario. Different implementations.

What About Some Popular Services?

What about some popular services that do video conferencing in an SFU routed model? What kind of size restrictions do they put on their applications?

Here’s what I found browsing around:

Google Hangouts – up to 25 participants in a single session. It was 10 in the past. When I did my first-ever office hour for my WebRTC training, I maxed out at 10, which got me to start using other services
Hangouts Meet – placed its maximum number at 50 participants in a single session
Houseparty – decided on 8 participants
Skype – 25 participants
appear.in – their PRO accounts support up to 12 participants in a room
Amazon Chime – 16 participants on the desktop and up to 8 participants on iOS (no Android support yet)
Atlassian Stride and Meet Jitsi – 50 participants

Does this mean you can’t get above 50?

My take on it is that there’s an increasing degree of difficulty as the meeting size increases:

The CPaaS Limit on Size

When you look at CPaaS platforms, those supporting video and group calling often have limits to their meeting size. In most cases, they give out an arbitrary number they have tested against or are comfortable with. As we’ve seen, that number is suitable for a very specific scenario, which might not be the one you are thinking about.

In CPaaS, these numbers vary from 10 participants to 100’s of participants in a single sesion. Usually, if you can go higher, the additional participants will be view-only.

Key Points to Remember

Few things to keep in mind:

The higher the group size the more complicated it is to implement and optimize
The browser needs to run multiple decoders, which is a burden in itself
Mobile devices, especially older ones, can be brought down to their knees quite quickly in such cases. Test on the oldest, puniest devices you plan on supporting before determining the group size to support
You can build the SFU in a way that it doesn’t route all incoming media to everyone but rather picks partial data to send out. For example, maybe only a single speaker on the audio channels, or the 4 loudest streams

Sizing Your Media Server

Sizing and media servers is something I have been doing lately at testRTC. We’ve played a bit with Kurento in the past and are planning to tinker with other media servers. I get this question on every other project I am involved with:

How many sessions / users / streams can we cram into a single media server?

Given what we’ve seen above about speeds and feeds, it is safe to say that it really really really depends on what it is that you are doing.

If what you are looking for is group calling where everyone’s active, you should aim for 100-500 participants in total on a single server. The numbers will vary based on the machine you pick for the media server and the bitrates you are planning per stream on average.

If what you are looking for is a broadcast of a single person to a larger audience, all done over WebRTC to maintain low latency, 200-1,000 is probably a better estimate. Maybe even more.

Big Machines or Small Machines?

Another thing you will need to address is on which machines are you going to host your media server. Will that be the biggest baddest machines available or will you be comfortable with smaller ones?

Going for big machines means you’ll be able to cram larger audiences and sessions into a single machine, so the complexity of your service will be lower. If something crashes (media servers do crash), more users will be impacted. And when you’ll need to upgrade your media server (and you will), that process can cost you more or become somewhat more complicated as well.

The bigger the machine, the more cores it will have. Which results in media servers that need to run in multithreaded mode. Which means they are more complicated to build, debug and fix. More moving parts.

Going for small machines means you’ll hit scale problems earlier and they will require algorithms and heuristics that are more elaborate. You’ll have more edge cases in the way you load balance your service.

Scale Based on Streams, Bandwidth or CPU?

How do you decide that your media server achieved full capacity? How do you decide if the next session needs to be crammed into a new machine or another one or be placed on the current media server you’re using? If you use the current one, and new participants want to join a session actively running in this media server, will there be room enough for them?

These aren’t easy questions to answer.

I’ve see 3 different metrics used to decide on when to scale out from a single media server to others. Here are the general alternatives:

Based on CPU – when the CPU hits a certain percentage, it means the machine is “full”. It works best when you use smaller machines, as CPU would be one of the first resources you’ll deplete.

Based on Bandwidth – SFUs eat up lots of networking resources. If you are using bigger machines, you’ll probably won’t hit the CPU limit, but you’ll end up eating too much bandwidth. So you’ll end up determining the capacity available by way of bandwidth monitoring.

Based on Streams – the challenge sometimes with CPU and Bandwidth is that the number of sessions and streams that can be supported may vary, depending on dynamic conditions. Your scaling strategy might not be able to cope with that and you may want more control over the calculations. Which will lead to you sizing the machine using either CPU or bandwidth, but placing rules in place that are based on the number of streams the server can support.

–

The challenge here is that whatever scenario you pick, sizing is something you’ll need to be doing on your own. I see many who come to use testRTC when they need to address this problem.

Cascading a Single Session

Cascading is the process of connecting one media server to another. The diagram below shows what I mean:

We have a 4-way group video call that is spread across 3 different media servers. The servers route the media between them as needed to get it connected. Why would you want to do this?

#1 – Geographical Distribution

When you run a global service and have SFUs as part of it, the question that is raised immediately is for a new session, which SFU will you allocate for it? In which of the data centers? Since we want to get our media servers as close as possible to the users, we either have pre-knowledge about the session and know where to allocate it, or decide by some reasonable means, like geolocation – we pick the data center closest to the user that created the meeting.

Assume 4 people are on a call. 3 of them join from New York, while the 4th person is from France. What happens if the French guy joins first?

The server will be hosted in France. 3 out of 4 people will be located far from the media server. Not the best approach…

One solution is to conduct the meeting by spreading it across servers closest to each of the participants:

We use more server resources to get this session served, but we have a lot more control over the media routes so we can optimize them better. This improved media quality for the session.

#2 – Fragmented Allocations

Assume that we can connect up to 100 participants in a single media server. Furthermore, every meeting can hold up to 10 participants. Ideally, we won’t want to assign more than 10 meetings per media server.

But what if I told you the average meeting size is 2 participants? It can get us to this type of an allocation:

This causes a lot of wasted server resources. How can we solve that?

By having people commit in advance to the maximum meeting size. Not something you really want to do
Taking a risk, assume that if you allocate 50% of a server’s capacity, the rest of the capacity you leave for existing meetings allowing them to grow. You still have wasted resources, but to a lower degree. There will be edge cases where you won’t be able to fill out the meetings due to server resources
Migrating sessions across media servers in an effort to “defragment” the servers. It is as ugly as it sounds, and probably just as disrupting to the users
Cascade sessions. Allow them to grow across machines

That last one of cascading? You can do that by reserving some of a media server’s resources for cascading existing sessions to other media servers.

#3 – Larger Meetings

Assuming you want to create larger meetings than one a single media server can handle, your only choice is to cascade.

If your media server can hold 100 participants and you want meetings at the size of 5,000 participants, then you’ll need to be able to cascade to support them. This isn’t easy, which explains why there aren’t many such solutions available, but it definitely is possible.

Mind you, in such large meetings, the media flow won’t be bidirectional. You’ll have fewer participants sending media and a lot more only receiving media. For the pure broadcasting scenario, I’ve written a guest post on the scaling challenges on Red5 Pro’s blog.

Recap

We’ve touched a lot of areas here. Here’s what you should do when trying to decide how many users can fit in your WebRTC calls:

Whatever meeting size you have in mind it is possible to support with WebRTC
1. It will be a matter of costs and aligning it with your business model that will make or break that one
2. The larger the meeting size, the more complex it will be to get it done right, and the more limitations and assumptions you’ll need to add to the equation
Analyze the complexity you need to support
1. Count the incoming and outgoing streams to each device and media server
2. Decide on the video quality (resolution and bitrate) for each stream
Define the media server you’ll be using
1. Select a machine type to run the media server on
2. Figure out the sizing needed before you reach scale out
3. Check if the growth is linear on the server’s resources
4. Decide if you scale out based on bandwidth, CPU, streams count or anything else
Figure how cascading fits into the picture
1. Offer with it better geolocation support
2. Assist in resource fragmentation on the cloud infrastructure
3. Or use it to grow meetings beyond a single media server’s capacity

What’s the size of your WebRTC meetings?