WebRTC Resilience

By Tsahi Levent-Levi

March 16, 2026  

The journey to a reliable WebRTC service is paved with difficult choices: where do you stop investing in resilience, and how do you strike the perfect balance between cost, complexity, and a bulletproof user experience?

WebRTC resilience runs at multiple levels and layers of a service. It isn’t enough to focus on the media plane or the signaling plane only. You’ll need to figure out the whole infrastructure and pick and choose where to invest your effort.

This time, I want to outline the challenges and suggest the various best practices and solutions.

Key Takeaways

  • WebRTC resilience requires investment across multiple infrastructure layers, not just media and signaling planes
  • Recent outages demonstrate the importance of resilience in WebRTC; outages can last for hours and impact major services
  • Best practices for resilience include managing client behavior, ensuring media server redundancy, and optimizing signaling server connections
  • Organizations must weigh their resilience strategies against costs, balancing factors like blast radius and availability zones
  • Using managed services, like CPaaS providers, can simplify WebRTC resilience by outsourcing infrastructure concerns

Why resilience in WebRTC is now a “thing”

In the last couple of months we’ve seen several big outages. They took out parts of the internet, and with it, also some widely known and used WebRTC services and service providers.

Such outages are both common and rare: common across the industry as a whole, yet rare enough for any single vendor that they still catch us off guard.

I asked a random AI tool to cobble together a list of outages from the last 12 months and got this in return (thanks ChatGPT this time). Here are the top 10 out of the 35 incidents it gave me:

| Date | Length | Vendor | Region | Failed Component | Major Services Impacted | Severity |
|---|---|---|---|---|---|---|
| May 7 2025 | ~3h | Cloudflare | Global | Workers control-plane database failure | SaaS apps running on Workers | High |
| Jun 12 2025 | ~2–3h | GCP | Multi-region | Service Control dependency overload | Spotify, Discord | High |
| Jul 14 2025 | ~2h | Cloudflare | Global | Network routing configuration issue | Discord, GitHub Pages sites | High |
| Oct 9 2025 | ~2h | Azure | Global | Identity / access service outage | Xbox Live, Microsoft services | High |
| Oct 20 2025 | ~15h | AWS | us-east-1 | DNS / DynamoDB dependency cascade | Snapchat, Netflix, Reddit, Duolingo | High |
| Oct 20 2025 | ~15h | AWS | us-east-1 | Internal service cascade (~140 AWS services) | Fortnite, Roblox | High |
| Oct 29 2025 | ~3h | Azure | Multi-region | Networking configuration rollout issue | Starbucks app, Minecraft | High |
| Nov 18 2025 | ~3h | Cloudflare | Global | Bot management config generation bug | ChatGPT, X, Canva, Shopify, Indeed | High |
| Dec 5 2025 | ~25m | Cloudflare | Global | WAF emergency patch bug | LinkedIn, Zoom, Canva | Medium |
| Jan 26 2026 | ~3h | DigitalOcean | Global | Control plane / API / control panel outage | Droplet management for hosted SaaS apps | Medium |

Some quick observations here:

  1. AWS fumbled the ball here, with a 15-hour outage. The rest were a lot shorter in length…
  2. Everyone fumbles and has outages. Everyone. These affect even large and established vendors. No matter how prepared you are – your service will fail at some point
  3. The most common reason? Configuration mistakes…

Then there’s the war with Iran, where Iran decided to target data centers as part of its retaliation.

Here, and in future wars, it is quite obvious that data centers will be valid targets.

In many ways, you need to plan, design and execute for such outcomes – anything from a minor 15-minute outage to the total loss of a regional data center due to a direct missile hit…

How exactly one should weigh the need for resilience against data sovereignty while using the public cloud is anyone’s guess. For the most part, I’ll be skipping this conundrum, focusing on things you can actually do and work with.

High availability of WebRTC infrastructure

Let’s start by mapping our WebRTC infrastructure, to understand what kind of architectural changes each component requires to meet resilience requirements.

Client app

Here’s an important tidbit – with WebRTC, the client application is an important player in resilience. You’re likely already doing things here to deal with people refreshing their page on purpose or by mistake by hitting F5 or closing a tab and reopening the URL.

What you are doing there is solving for resilience in the face of user behavior in web browsers. What we need to add to the equation is resilience to infrastructure outages.

The main things to consider here are these:

  • Assume signaling connections will get severed. Decide how to act if they do – do you reconnect them automatically? Notify the user and retry the connection? Just stop the meeting and let the user figure out what to do next?
  • Media flows are going to drop on you. Again, this is due to the nature of networks (or just moving around with a smartphone on a call). The same solutions you have for dealing with user mobility and network fluctuations can and should be extended to media server failures

In most cases, you need to “rise above” the WebRTC peer connection mindset and extend it towards scenarios where the peer connections fail but a meeting/session can and should still be maintained.

WebRTC gives you the ICE restart functionality to be able to renegotiate media connections.

  • Be sure to make use of it. On top of it, implement a mechanism to recreate a peer connection if and when the need arises (because it will)
  • Also figure out how to handle the case where media goes down right after signaling does – how do you deal with this edge case?
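The client-side recovery logic above can be sketched as a small policy function. This is a minimal sketch: the failure kinds, retry thresholds and action names are all my own illustrative assumptions, and in a browser the ICE restart itself would go through `RTCPeerConnection.restartIce()` plus renegotiation.

```typescript
// Sketch: deciding how a client recovers from a connection failure.
// The types and thresholds here are assumptions for illustration,
// not part of the WebRTC API.
type FailureKind = "ice-failed" | "signaling-lost" | "both-down";
type RecoveryAction = "ice-restart" | "recreate-peer-connection" | "end-session";

function pickRecovery(kind: FailureKind, attempt: number): RecoveryAction {
  if (attempt >= 5) return "end-session"; // stop retrying at some point
  if (kind === "ice-failed") {
    // First try the cheap path: an ICE restart renegotiates candidates
    // on the existing peer connection. If that already failed once,
    // tear down and rebuild the peer connection instead.
    return attempt === 0 ? "ice-restart" : "recreate-peer-connection";
  }
  // If signaling is also gone, reconnect signaling first and then
  // rebuild the media path from scratch.
  return "recreate-peer-connection";
}
```

The point is to rank recovery steps from cheapest to most disruptive, and only give up once the cheaper paths have been exhausted.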

Media

Here we’re looking at WebRTC media servers. You may have multiple types of such servers – SFU, MCU, gateways or recording servers.

While media servers are stateful, they are relatively easy to migrate away from, making it natural to add resilience to them – especially since the same solutions assist with scaling and maintenance. You can direct users to new media servers rather easily when needed and resume the service with only 1-2 seconds of disruption. It isn’t nice, but it is better than nothing.

Some best practices here include:

  • Keeping the number of users on a single media server in the low hundreds at most. This ensures a smaller blast radius when machines go down
  • Having functionality that flushes meetings off a media server when an upgrade or maintenance is required will serve you well for resilience too
    • This can be done by limiting meeting duration and then forcing the server to shut down
    • Allow a grace period before killing servers, or wait for all sessions to finish (with a timeout… to shut the server down anyway)
    • The logic here usually assumes the client is self-sufficient enough to move to a new media server mid-session (did you read the previous section?). My suggestion is to make that happen
  • Have the ability to connect users to different media server regions and not lock yourself to a “closest” only paradigm. Not many implement this one
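The drain-before-shutdown flow described above boils down to a simple per-session decision. A sketch, where the grace-period handling and action names are my own assumptions rather than any particular orchestrator’s API:

```typescript
// Sketch: per-session decision while a media server is draining for
// maintenance or shutdown. Action names and the grace logic are
// illustrative assumptions.
type DrainAction = "migrate-now" | "let-finish" | "force-close";

function drainDecision(
  clientSupportsMigration: boolean, // can this client move servers mid-session?
  elapsedDrainMs: number,           // how long the server has been draining
  graceMs: number,                  // grace period before killing the server anyway
): DrainAction {
  // Clients that can move mid-session are migrated to a new media server right away.
  if (clientSupportsMigration) return "migrate-now";
  // Otherwise, let the session finish naturally while the grace period lasts…
  if (elapsedDrainMs < graceMs) return "let-finish";
  // …and past the timeout, shut it down anyway.
  return "force-close";
}
```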

NAT traversal

TURN servers are also stateful, but hold less state than media servers. In a large deployment, you will have multiple TURN servers per availability zone, as well as support for multiple regions.

When some fail on you, WebRTC is capable of renegotiating routes through other TURN servers. The way this is usually handled? ICE restarts – so if you’ve implemented them properly, you won’t even notice that the cause was a failed TURN server; you’ll just reroute to a better TURN server/connection.

What you will need is to learn about failures quickly, so you can reroute the traffic properly – something DNS round-robin routing does rather well today.

My biggest suggestion here? Just use a reputable third party for a managed TURN service. But be sure to check what they do for resilience (obviously).
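Before triggering that ICE restart, the client needs a fresh list of ICE servers that excludes the failed TURN server. A sketch of that selection step – the health data is assumed to come from your own monitoring, and the URLs are placeholders:

```typescript
// Sketch: rebuilding the ICE server list after a TURN failure, ahead of
// an ICE restart. Health/RTT data is assumed to come from your own
// monitoring; server URLs are placeholders.
interface TurnServer {
  url: string;
  healthy: boolean;
  rttMs: number; // measured round-trip time from this client
}

function pickIceServers(servers: TurnServer[], max = 2): string[] {
  return servers
    .filter((s) => s.healthy)           // drop servers known to be down
    .sort((a, b) => a.rttMs - b.rttMs)  // prefer the closest healthy server
    .slice(0, max)                      // a short list is enough for ICE
    .map((s) => s.url);
}
```

The resulting URLs would feed the `iceServers` field of the peer connection configuration before renegotiating.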

Signaling

Signaling servers and resilience are a tricky combination. They are stateful by nature, usually holding WebSocket connections to all active users. They are also great at scale, dealing with thousands if not tens of thousands of users on each server.

Their blast radius is such that when a server goes down, many sessions are going to suffer and shut down.

Here are some thoughts on what’s needed:

  • Assume signaling servers may fail
    • Have the client application logic take care of trying to reconnect
    • If possible, give it time in which media servers still operate while you are trying to reconnect – as opposed to shutting down the call altogether and retrying everything from scratch
  • When thousands of users try to reconnect immediately, you hit the thundering herd problem: too many devices reconnecting at once to a new server, causing more problems. You may want to throttle or space them out with random timeouts before reconnection attempts
  • Try to make these servers as stateless as possible. Use message queues, in-memory data grids such as Redis, etc. This won’t eliminate the problem, but it will make it more manageable
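The throttling of reconnection attempts is typically done with jittered exponential backoff. A sketch with illustrative defaults – the base delay, cap and "full jitter" choice are assumptions, not a standard:

```typescript
// Sketch: jittered exponential backoff for signaling reconnects, to avoid
// the thundering herd. The defaults are illustrative, not a standard.
function reconnectDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  // Exponential growth of the backoff window, capped at capMs.
  const windowMs = Math.min(capMs, baseMs * 2 ** attempt);
  // "Full jitter": pick a uniform random delay in [0, windowMs) so that
  // clients that failed together do not reconnect together.
  return Math.random() * windowMs;
}
```

Each client waits `reconnectDelayMs(attempt)` before its next attempt, which spreads the reconnect storm across the backoff window instead of hammering the new server all at once.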

Load balancers

These aren’t WebRTC specific, but there’s more to them with WebRTC.

When users “land” on a WebRTC service, they first interact with the HTTP application server, which directs them to the signaling server. These are most likely allocated using classic load balancers. The application? Usually found in a single region for most implementations. The signaling? Occasionally spread out geographically – but latency on the order of 100ms does not really matter here, so when it is spread out, that is usually done for resilience.

This explainer from Google on the signaling allocation logic of Google Duo is quite interesting (and overengineered).

But I digress… back to load balancers.

The media servers are different. In many cases, these load balancers have actual application logic – the balancing act might happen elsewhere or in a custom fashion. That’s due to the nature of the traffic (UDP-based) and the type of decisions required (connecting multiple users to the same session while trying to get each user to a nearby media server).

Resilience here comes in two stages:

  1. Making sure you know when machines are down and are able to reroute traffic accordingly – and doing that for every layer of the WebRTC infrastructure in the way that layer needs it
  2. Hardening and introducing resilience to the load balancers themselves. If you rely on Route 53 and it drops… What then?
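The first stage – knowing when machines are down and rerouting around them – can be sketched as a health-check driven routing table. In real deployments this logic lives in a custom allocator rather than a classic load balancer; the field names and staleness threshold here are assumptions:

```typescript
// Sketch: health-check driven routing for media servers. Field names and
// the staleness threshold are illustrative assumptions.
interface Backend {
  id: string;
  region: string;
  lastHeartbeat: number; // epoch ms of the last heartbeat received
  load: number;          // 0..1 utilization reported by the server
}

function routable(backends: Backend[], now: number, staleMs = 10_000): Backend[] {
  // A backend that missed its heartbeats is treated as down
  // and removed from rotation.
  return backends.filter((b) => now - b.lastHeartbeat <= staleMs);
}

function pickBackend(backends: Backend[], now: number): Backend | undefined {
  const alive = routable(backends, now);
  // Route new sessions to the least-loaded healthy backend.
  return alive.sort((a, b) => a.load - b.load)[0];
}
```

The same pattern repeats at every layer: each one needs its own definition of "healthy" and its own rerouting decision.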

Application

Then there’s the application itself. This is out of scope of WebRTC, but definitely something to address for resilience.

Here you cram things like the application servers, databases, redis instances, routing logic, DNS, etc.

Be sure to get your application resilience in order for all the bits and pieces that aren’t WebRTC. That includes any third parties you might be using that are out of your control – figuring out how to make their outages cause as little harm as possible is important.

Regions and data centers

Resilience isn’t just about high availability and rerouting traffic. It is also about some important decisions you need to make about your infrastructure. Here are a few that immediately come to mind.

Making use of availability zones

Use multiple availability zones in a single region for a given IaaS vendor.

At times, one zone might be down while others continue to work without failure.

Making use of this characteristic of IaaS regions is best practice when it comes to resiliency of web applications today and in that sense, WebRTC services aren’t any different.

I can say from my CPaaS and Video API reports that many of these vendors make use of this, and also have multiple close-by regions handled in a similar way – when one US-East region fails, another US-East data center region can take the burden for the time being.
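Spreading instances across availability zones is mechanically simple – place new servers round-robin across the zones so that losing one zone takes out only a fraction of capacity. A sketch, with placeholder zone names:

```typescript
// Sketch: spreading server instances across availability zones so a
// single-AZ failure takes out only a fraction of capacity. Zone names
// are placeholders.
function spreadAcrossZones(count: number, zones: string[]): Record<string, number> {
  const perZone: Record<string, number> = {};
  for (const z of zones) perZone[z] = 0;
  for (let i = 0; i < count; i++) {
    // Round-robin placement keeps the zones balanced as the fleet grows.
    perZone[zones[i % zones.length]] += 1;
  }
  return perZone;
}
```

With 10 servers over 3 zones, no zone holds more than 4 servers – so a zone outage costs at most 40% of capacity instead of 100%.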

Handling a region outage

What do you do when a whole region fails?

There are two types of regions to discuss here – the hub and the spokes. The hub is where application logic and databases reside, and usually signaling as well. The spokes are where media servers and TURN servers are found (there are more of them, to keep them closer to users).

For the spokes, the assumption is that when a region fails, user traffic needs to be rerouted to the closest other region that is available.

So if a Frankfurt region is down, traffic might be redirected to Paris or Ireland data centers for example.

It makes sense to keep spare capacity at 10-20% of total capacity or more. This comes in handy for quickly recuperating and reestablishing traffic if one region fails altogether.
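The spoke failover decision – Frankfurt down, fall back to Paris or Ireland – can be sketched as a tiny selection function. The latency table here is made up for illustration; in practice these numbers would come from your own measurements:

```typescript
// Sketch: rerouting users to the closest available spoke region when
// their preferred region is down. The latency numbers in the tests are
// invented for illustration.
function pickRegion(
  preferred: string,
  healthy: string[],                  // regions currently serving traffic
  latencyMs: Record<string, number>,  // measured latency from this user
): string | undefined {
  // Stick with the preferred region while it is up.
  if (healthy.includes(preferred)) return preferred;
  // Otherwise fall back to the closest healthy region.
  const sorted = [...healthy].sort((a, b) => latencyMs[a] - latencyMs[b]);
  return sorted[0];
}
```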

For the hub(s), a different kind of high availability and fault tolerance is usually necessary – one where context and state are stored and managed across regions using distributed data grids. Since this is harder to implement and pull off, it is often skipped. Until AWS us-east-1 fails…

Single vendor vs multi vendor IaaS

Another important question is whether you go for a single IaaS vendor strategy or a multi-vendor one. Do you go “all in” on AWS, or do you spread the load across vendors?

Some applications end up with signaling and application logic on AWS, and media servers and TURN spread across multiple vendors (usually DigitalOcean or Oracle, due to bandwidth pricing considerations).

This isn’t resilience. It is just cost planning.

For resilience, what we’re looking for is to have multiple IaaS vendors used for media servers and TURN – and even for the other signaling and application servers.

You can, for example, keep the second IaaS vendor up and running with 0 servers allocated, increasing the server count there to handle load spikes when they take place – due to outages in the other IaaS vendor or due to the nature of your service.

This approach is less popular – but does exist in the industry.
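The warm-standby math behind that approach is straightforward: spin up servers on the second vendor only for the demand the primary can no longer absorb. A sketch – the capacity model and names are assumptions:

```typescript
// Sketch: sizing a warm standby on a second IaaS vendor. The "demand
// units" capacity model is an illustrative assumption.
function standbyServersNeeded(
  demandUnits: number,     // current load, in whatever unit you track
  primaryCapacity: number, // capacity still available on the primary vendor
  unitsPerServer: number,  // capacity one standby server adds
): number {
  // Only the overflow beyond the primary vendor's capacity needs
  // standby servers; round up to whole servers.
  const overflow = Math.max(0, demandUnits - primaryCapacity);
  return Math.ceil(overflow / unitsPerServer);
}
```

In steady state this returns 0 – the standby vendor runs no servers – and it only starts allocating when the primary vendor degrades or a demand spike hits.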

Routes less traveled with WebRTC reliability

There are additional approaches I see vendors take. Some are better than others.

They are here for completeness, but also to open your mind to what’s possible.

How much is enough reliability

Here’s something you need to decide: how much reliability is enough for you?

At what point do you decide to stop investing in reliability? It is quite expensive to build and operate.

The bigger you are, the more sense it makes to invest in this – economies of scale and similar concepts at play here.

This is why everything I’ve explained above can be seen as shades of gray where you pick and choose how far to go with these solutions.

Blast radius considerations

When servers, availability zones or regions fail – how big is the impact? For how long will there be a service interruption? How long will sessions take to reconnect? How many users and meetings are going to “go down the drain”?

Going for smaller machines means a smaller blast radius and likely a better resilience implementation on your part. But it usually also means less scale for larger sessions.

You’ll need to strike a balance here that you’re comfortable with.
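The trade-off lends itself to back-of-the-envelope math. A sketch, with invented numbers – the point is only that smaller servers mean each failure hits fewer users, at the cost of running more servers:

```typescript
// Sketch: back-of-the-envelope blast radius math. All numbers in the
// tests are illustrative.
function worstCaseImpact(totalUsers: number, usersPerServer: number): {
  servers: number;           // fleet size needed at this server size
  affectedPerFailure: number; // users hit when one full server dies
} {
  const servers = Math.ceil(totalUsers / usersPerServer);
  // Worst case: a fully loaded server goes down.
  return { servers, affectedPerFailure: Math.min(usersPerServer, totalUsers) };
}
```

For 10,000 users, capping servers at 200 users means running 50 servers with at most 200 users hit per failure; capping at 2,000 users means only 5 servers, but 2,000 users dropped whenever one goes down.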

Managed CPaaS or Video API versus build on your own

This stuff is complicated. For most of us, using a managed service is the better approach.

We outsource this problem to others. For the same reason we use AWS and don’t build our own server farms (I know – some of you do), we shouldn’t be delving into building our own WebRTC infrastructure.

My best suggestion to anyone starting out is to use a Video API vendor. Someone else will then be in charge of figuring out resilience – and you can always ask them how they do that, to gain some confidence in them first.

Managed vendors have outages too, but these are likely smaller and shorter than the outages of those who build and run everything on their own. What you lose in control of the situation you probably gain in actual uptime.

Multivendor on the CPaaS layer

I’ve seen some who argue for the use of multiple Video API or CPaaS vendors. This gives you negotiation power with them, and also saves you when one of them has an outage and the other(s) don’t.

I don’t think this is worth it. The loss of features when going to the lowest common denominator of capabilities here, coupled with behavior changes across these platforms – especially for video services – is going to make this a headache and in a way, cause you to innovate slower than you should.

Figuring out your WebRTC strategy

Be sure to read Gustavo’s take on outages and resilience in WebRTC. It is a good read from someone who’s built such systems at scale more than once.

If you need help figuring out where to take your WebRTC service, how to build resilience into it or how to optimize the media experience you offer – just contact me.

