Handling session disconnections in WebRTC

08/04/2019

WebRTC disconnections are quite common, but you can “fix” many of them just by careful planning and proper development.

Years ago, I developed the H.323 Protocol Stack at RADVISION (later turned Avaya, turned Spirent, turned Softil). I was there as a developer, R&D manager and then the product manager. My code is probably still in that codebase, lovingly causing products around the globe to crash from time to time – like any other developer, I have my share of bugs left behind.

Anyways, why am I mentioning this?

I had a client asking me recently about disconnections in WebRTC. And it kinda reminded me of a similar issue (or set of issues) we had with the H.323 stack and protocol years back.

If you bear with me a bit – I promise it will be worth your while.

This week I am starting the office hours for my WebRTC course. The next office hour (after the initial “hi everyone”) will cover WebRTC disconnections.

Check out the course – and maybe go over the first module for free.

A quick intro to H.323 signaling and transport

H.323 is like SIP, just better and more complex. At least for me, who started his way in VoIP with H.323 (I will always have a soft spot for it). For many years, H.323 worked by opening two separate TCP connections to transport its signaling: the first for passing what is called the Q.931 protocol and the second for passing the H.245 protocol.

If you would like to compare it to the way WebRTC handles things, then Q.931 is how you set up the connection – have the users find each other. H.245 is similar to what SDP and JSEP are for (I am blatantly ignoring H.225 here, another protocol in H.323 which takes care of registration and authentication).

Once Q.931 and H.245 get connected, you start adding the RTP/RTCP stuff over UDP, which gets you quite a lot of connections.

Add to that complexities like tunneling H.245 over Q.931, using something called faststart instead of H.245 (or before H.245), then sprinkle a dash of “parallel H.245” and then a bit of NAT traversal and/or security and you get a lot of places that require testing and a huge number of edge cases.

Where can H.323 get “stuck” or disconnected?

With so many connections, there are a lot of places that things can go wrong. There are multiple state machines (one for Q.931 state, one for H.245 state) and there are different connections that can get severed for one reason or another.

Oh – and in H.323 (at least in its earlier specifications that I had the joy to work with), when the Q.931 or H.245 connection gets severed – the whole session is considered disconnected, so you go and kill the RTP/RTCP sessions.

At the time, we suffered a lot from zombie sessions due to different edge cases. We ended up with solutions that were either based on the H.323 specification itself or best practices we created along the way.

Here are a few of these:

  • If the Q.931 connection gets severed – kill the session
  • If the H.245 connection gets severed – kill the session
  • If you don’t receive media or media control packets on RTP or RTCP respectively for a configurable period of time (think 5-10 seconds) – kill the session
  • When a state machine for Q.931 or H.245 initiates – start a timer. If that timer ends and the state machine didn’t get to the connected state – switch the state to timeout and… – kill the session
  • Killing the session means trying to gracefully close all connections, but if we can’t do that within a short timeout period – we just shut things down to reclaim the resources for later use
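The rules above can be sketched as a small watchdog. This is only an illustration in TypeScript, not code from the actual H.323 stack: the names (`SessionWatchdog`, `setupTimeoutMs`, `mediaTimeoutMs`, `KillReason`) are made up for this example, and time is passed in explicitly so the logic can be exercised without real timers.

```typescript
// Hypothetical sketch of the "kill the session" rules; all names are assumptions.
type KillReason = "signaling-severed" | "setup-timeout" | "media-timeout";

interface WatchdogConfig {
  setupTimeoutMs: number; // how long a state machine may take to reach "connected"
  mediaTimeoutMs: number; // how long RTP/RTCP may stay silent (think 5-10 seconds)
}

class SessionWatchdog {
  private startedAt: number;
  private cfg: WatchdogConfig;
  private connectedAt: number | null = null;
  private lastMediaAt: number;
  private severed = false;

  constructor(startedAt: number, cfg: WatchdogConfig) {
    this.startedAt = startedAt;
    this.cfg = cfg;
    this.lastMediaAt = startedAt;
  }

  // The signaling state machine reached its connected state.
  onConnected(now: number): void {
    this.connectedAt = now;
    this.lastMediaAt = now;
  }

  // An RTP or RTCP packet arrived.
  onMediaPacket(now: number): void {
    this.lastMediaAt = now;
  }

  // The Q.931 or H.245 connection was lost.
  onSignalingSevered(): void {
    this.severed = true;
  }

  // Call periodically; returns why the session should be killed, or null to keep it.
  check(now: number): KillReason | null {
    if (this.severed) return "signaling-severed";
    if (this.connectedAt === null) {
      return now - this.startedAt > this.cfg.setupTimeoutMs ? "setup-timeout" : null;
    }
    return now - this.lastMediaAt > this.cfg.mediaTimeoutMs ? "media-timeout" : null;
  }
}
```

Whatever calls `check()` would then try a graceful close first and force the shutdown if that does not finish within its own short timeout.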

H.323 existed before smartphones. Systems were usually tethered to an Ethernet cable, or at most connected over WiFi from one static location at a time. There was no notion of roaming or moving between networks, which meant there was no need to ask yourself whether a connection got severed because of a switch in the network or because of a real issue.

Life was simple:

And if you were really insistent then maybe this:

(in real life scenarios, these two simplistic state machines were a lot bigger and more complicated, but their essence was based on these concepts)

Back to WebRTC signaling and transport

WebRTC is simpler and more complicated than H.323 at the same time.

It is simpler, as there is only SRTP. There’s no signaling that is standardized or preselected for WebRTC. And for the most part, the one you use will probably require only a single connection (as opposed to the two in H.323). It also has far fewer alternatives built into the specification itself than H.323 has.

It is more complicated, as you own the signaling part. You make that selection, so you better make a good one. And while at it, implement it reasonably well and handle all of its edge cases. This is never a simple task even for simple signaling protocols. And it’s now on you.

Then there’s the fact that networks today are more complex. Users expect to move around while communicating, and you should expect scenarios where users switch networks in mid-session.

If you use WebRTC in a browser, then you get these interesting aspects associated with your implementation:

  1. When you close the browser, the session dies
  2. When you close the tab where the WebRTC session lives, the session dies
  3. When you refresh the page where the WebRTC session lives, the session dies
  4. When you click a link to move to a different page (even on the same site), the session dies

There is a lot of dying taking place in the browser. The server, or the other client, will need to “sniff” these scenarios, as they might not be gracefully disconnected, and decide what to do about them.
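One common way to “sniff” an ungraceful exit is a heartbeat over your signaling channel: the client pings every few seconds, and the server treats a long silence as a dead client. Here is a minimal sketch of the server-side bookkeeping, assuming you already have some signaling transport in place. The `HeartbeatTracker` name, its methods and the timeout value are all hypothetical, and time is passed in as a number so the logic stays easy to test.

```typescript
// Hypothetical server-side bookkeeping for spotting clients that vanished
// without saying goodbye (closed tab, refresh, crash, lost network).
class HeartbeatTracker {
  private lastSeen: Record<string, number> = {};
  private timeoutMs: number;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Call whenever a heartbeat (or any other message) arrives from a client.
  beat(clientId: string, now: number): void {
    this.lastSeen[clientId] = now;
  }

  // Call when a client disconnects gracefully (it managed to send a "bye").
  leave(clientId: string): void {
    delete this.lastSeen[clientId];
  }

  // Call periodically; returns the clients that went silent and forgets them.
  sweep(now: number): string[] {
    const dead = Object.keys(this.lastSeen).filter(
      (id) => now - this.lastSeen[id] > this.timeoutMs
    );
    dead.forEach((id) => delete this.lastSeen[id]);
    return dead;
  }
}
```

What you do with the ids returned by `sweep()` is the actual decision: notify the other participant, free the media server resources, or keep the slot open for a while in case the user is just refreshing the page.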

Where can WebRTC get “stuck” or disconnected?

We can split disconnections of WebRTC into 3 broad categories:

  1. Failure to connect at all
  2. Media disconnections
  3. Signaling disconnections

In each, there will be multiple scenarios, defining the reasons for failure as well as how to handle and overcome such issues.

In broad strokes, here’s what I’d do in each of these 3 categories:

#1 – Failure to connect at all

There’s a decent number of failures happening when trying to connect WebRTC sessions. They range from not being able to even send out an SDP, through interoperability issues across browsers and devices, to ICE negotiation failing to connect media.

In many of these cases, better configuration of the service as well as focus on edge cases would improve the situation.

If you experience connection failures for 10% or more of the sessions – you’re doing something wrong. Some can get it as low as 1% or less, but oftentimes that depends on the type of users your service attracts.

This leads to another very important aspect of using WebRTC:

Measure what you can if you want to be able to improve it in the future
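As a rough idea of what such a measurement could look like, here is a small sketch that turns a list of session setup outcomes into a failure rate. The outcome labels and the `summarizeSetups` function are assumptions made for this example; your own service would log whatever categories make sense for it.

```typescript
// Hypothetical outcome labels, loosely following the failure types mentioned above.
type SetupOutcome = "connected" | "signaling-failed" | "interop-failed" | "ice-failed";

interface SetupReport {
  total: number;
  failed: number;
  failureRate: number; // 0..1, e.g. 0.1 means 10% of sessions failed to connect
  byOutcome: Record<SetupOutcome, number>;
}

function summarizeSetups(outcomes: SetupOutcome[]): SetupReport {
  const byOutcome: Record<SetupOutcome, number> = {
    "connected": 0,
    "signaling-failed": 0,
    "interop-failed": 0,
    "ice-failed": 0,
  };
  outcomes.forEach((o) => {
    byOutcome[o] += 1;
  });
  const total = outcomes.length;
  const failed = total - byOutcome["connected"];
  // Avoid dividing by zero when nothing has been recorded yet.
  const failureRate = total === 0 ? 0 : failed / total;
  return { total: total, failed: failed, failureRate: failureRate, byOutcome: byOutcome };
}
```

Breaking the number down by outcome is the useful part: a high share of ICE failures points at your TURN/STUN configuration, while signaling failures point at your own code.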

#2 – Media disconnections

Sometimes, your sessions will simply disconnect.

There are many reasons why that can happen:

  • The firewall policies of the access point used are configured to kill P2P encrypted traffic (blame all them bittorrent-hating-IT-people)
  • The user switched from one network to another in mid-session, and you should follow WebRTC’s ICE restart mechanism
  • The other end crashed, closed or just went offline

Each of these requires different handling – some in the code, while others need manual handling (think customer support working out the configuration with a customer to resolve the firewall issue).
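For the part that is handled in code, the decision usually hangs off the ICE connection state of the peer connection. The sketch below is one possible policy, not the only one: the `decideIceAction` function, the `IcePolicy` fields and the action names are hypothetical, and the thresholds are placeholders you would tune for your own service. The state names themselves are the standard `iceConnectionState` values.

```typescript
// The standard RTCPeerConnection iceConnectionState values.
type IceState =
  | "new" | "checking" | "connected" | "completed"
  | "disconnected" | "failed" | "closed";

// Hypothetical actions for the application to take.
type IceAction = "none" | "wait" | "restart-ice" | "give-up";

interface IcePolicy {
  disconnectedGraceMs: number; // how long to tolerate "disconnected" before acting
  maxRestarts: number;         // how many ICE restarts to try before giving up
}

function decideIceAction(
  state: IceState,
  msInState: number,
  restartsSoFar: number,
  policy: IcePolicy
): IceAction {
  switch (state) {
    case "disconnected":
      // Often transient (a brief network hiccup), so give it a moment first.
      if (msInState < policy.disconnectedGraceMs) return "wait";
      return restartsSoFar < policy.maxRestarts ? "restart-ice" : "give-up";
    case "failed":
      return restartsSoFar < policy.maxRestarts ? "restart-ice" : "give-up";
    case "closed":
      return "give-up";
    default:
      // new, checking, connected, completed: nothing to fix.
      return "none";
  }
}
```

In a browser you would call something like this from an `iceconnectionstatechange` handler and map `restart-ice` to `pc.restartIce()`, followed by a fresh offer/answer exchange over your signaling. If the other end crashed, the restarts will fail too, which is what the `maxRestarts` limit is for.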

#3 – Signaling disconnections

Unlike H.323, if signaling gets disconnected, WebRTC doesn’t even know about it, so it won’t immediately cause the session itself to disconnect.

The first thing you’ll need to do is decide how you want to proceed in such cases – do you treat this as a session failure/disconnection, or do you let the show go on?

If you treat these as failures, then I suggest killing peer connections based on the status of your websocket connection to the server. If you are on the server side, then once a connection is lost, you should probably go ahead and kill the media paths – either from your media server towards the “dead” session leg or from the other participant on a P2P connection/session.

If you want to make sure the show goes on, you will need to try and reconnect the peer connection towards the same user/session somehow. In which case, additional signaling logic in your connection state machine along with additional timers to manage it will be necessary.
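Both options can live behind a single decision function. The following is a minimal sketch under a few assumptions: signaling runs over one connection (a websocket, for example), reconnection uses exponential backoff, and all of the names here (`SignalingPolicy`, `ReconnectConfig`, `nextStepAfterSignalingLoss`) are invented for the example.

```typescript
// Hypothetical policy switch: tear everything down, or keep media and retry signaling.
type SignalingPolicy = "kill-session" | "keep-media";

interface ReconnectConfig {
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // cap on the backoff delay
  maxAttempts: number; // after this many failed retries, give up
}

type NextStep =
  | { kind: "close-peer-connection" }
  | { kind: "retry-signaling"; delayMs: number };

// attempt = number of reconnection attempts that have already failed (0 on first loss).
function nextStepAfterSignalingLoss(
  policy: SignalingPolicy,
  attempt: number,
  cfg: ReconnectConfig
): NextStep {
  if (policy === "kill-session" || attempt >= cfg.maxAttempts) {
    return { kind: "close-peer-connection" };
  }
  // Exponential backoff: base, 2x base, 4x base, ... capped at maxDelayMs.
  const delayMs = Math.min(cfg.maxDelayMs, cfg.baseDelayMs * Math.pow(2, attempt));
  return { kind: "retry-signaling", delayMs: delayMs };
}
```

The same function works on the server side as well: there, `close-peer-connection` would translate to killing the media path towards the “dead” leg, and the retry window is how long you are willing to keep that leg alive while waiting for the user to come back.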

Announcing the WebRTC course snippets module

Here’s the thing.

My online WebRTC training has everything in it already. Well… not everything, but it is rather complete. What I’ve noticed is that I get repeat questions from different students and clients on very specific topics. They are mostly covered within lessons of the course, but they sometimes feel “buried” within the hours and hours of content.

This is why I decided to start creating course snippets. These are “lessons” that are 3-5 minutes long (as opposed to 20-40 minutes long), with a purpose to give an answer to one specific question at a time. Most of the snippets will be actionable and may contain additional materials to assist you in your development. This library of snippets will make up a new course module.

Here are the first 3 snippets that will be added:

  1. WebRTC session disconnections
  2. ICE servers configuration
  3. A quick review of QUIC

While we’re at it, office hours for the course start today. If you want to learn WebRTC, now is the best time to enroll.