RTC@Scale 2022 summary and insights

By Tsahi Levent-Levi

March 7, 2022

Read the latest RTC@Scale 2023 summary and insights.

RTC@Scale was Facebook’s virtual WebRTC event, covering current and future topics. Here’s the summary so you can pick and choose the sessions that are relevant to you.

WebRTC Insights is a subscription service I have been running with Philipp Hancke for over a year now. The purpose of it is to make it easier for developers to get a grip of WebRTC and all of the changes happening in the code and browsers – to keep you up to date so you can focus on what you need to do best – build awesome applications.

We got into a kind of a flow:

Once every two weeks we finalize and publish a newsletter issue
Once a month we record a video summarizing libwebrtc release notes

It is fun to do and the feedback we’re getting is positive.

That said, being us, means that we can’t really sit still… or in this case – Philipp…

We published this on Monday the week after the event took place to our WebRTC Insights clients, and now, we’re opening it up for everyone as well.

Why an RTC@Scale summary?
KEYNOTE, PANEL AND WRAP-UP
SESSION 1: FUTURE RTC EXPERIENCES
SESSION 2: AUDIO ML
SESSION 3: VIDEO
SESSION 4: RESILIENCE AND ENCRYPTION
Want to try out WebRTC Insights?

Why an RTC@Scale summary?

Philipp decided it would make sense to summarize the recent RTC@Scale “recruiting event” that Facebook did – the RSVP was explicitly asking for consent to be contacted. The technical depth of the talks was amazing so we’ve added an “out of order” issue for you, just for this 😎

The intent is for you to *not* spend 5 hours but rather to focus on the select sessions that are relevant for you.

The event setup was simple:

All sessions were pre-recorded and simply played back at the time of the event
The QA sessions were done live
Everything resided inside a Facebook Live link
The event page is (obviously) on Facebook: https://www.facebook.com/atscaleevents/videos/?ref=page_internal
If you want to watch the full 5 hours, you can use this link

KEYNOTE, PANEL AND WRAP-UP

Real-time Communication for Today and Future Experiences / Maher Saba @ Meta

Product-focused, make your product managers watch
Now this is a good recruiting pitch with all the fancy things you could work on!
One wonders if you will get interviewed on a VR whiteboard when applying…

Panel: RTC in the Metaverse / Sriram Srinivasan, Mike Arcuri, Paul Boustead, and Cullen Jennings

Product-oriented, a lot of talking. Watch with a glass of wine
40 minutes felt too long
The question everyone avoids is “what is Fortnite doing?”

SESSION 1: FUTURE RTC EXPERIENCES

These sessions focus on roadmap and far future views. We’d rather have a bit more on the here and now and the immediate future requirements than what would happen in 3, 5 or 10 years time, but hey – they are recruiting 😉

Holographic Video Calling / Nitin Garg @ Meta

What will the technology stack for holographic video calling look like?
This is 5+ years into the future?
- Encoding a single frame takes 30s currently (on an i7 laptop)
- It needs to be ~3ms to be really interesting, which is roughly a 10,000x speedup
Comments on BWE, delay, rate control and FEC are relevant today
- “Typical” behavior of BWE @ 2930s looks far too unstable
Holographic video calling is a nice topic, but niche at the moment. There are a lot more pressing aspects of scale that need to be dealt with first

Spatial Communications at Scale in Virtual Environments / Paul Boustead @ Dolby

Spatial audio in virtual worlds
- Experience of rotating your head is important
- Rendering the loudest 3 streams is what WebRTC does by default
P2P vs forward vs mixing
- Server side mixing with HRTF (Head Related Transfer Function) vs multichannel spatial codec
- The bigger the group, the more sense it would make to switch to spatial mixing of audio (assuming you’re into spatial audio)
Audio chain considerations
- Watch this part for generally useful considerations
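The “loudest 3 streams” behaviour mentioned above can be pictured with a small sketch. This is not libwebrtc’s mixer code; the function name `select_loudest` and the per-stream level dictionary are assumptions made purely for illustration.

```python
import heapq


def select_loudest(levels, n=3):
    """Return the ids of the n streams with the highest audio level.

    `levels` maps a (hypothetical) stream id to its current audio level,
    here a linear value between 0.0 (silence) and 1.0 (full scale).
    """
    # nlargest returns the n highest-level entries, loudest first.
    loudest = heapq.nlargest(n, levels.items(), key=lambda item: item[1])
    return [stream_id for stream_id, _ in loudest]


levels = {"alice": 0.62, "bob": 0.05, "carol": 0.40, "dave": 0.71, "erin": 0.12}
print(select_loudest(levels))
```

A spatial renderer would then position only the selected streams, which is why the choice between forwarding and server-side mixing matters more as the group grows.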

RTC3 / Justin Uberti @ Clubhouse

Great separation into phases, make product manager(s) watch
- Interesting that he classifies 2010-2019 as mobile-driven and 2020+ as meeting-driven. “meetings usage eclipses call usage”
Reliability may be the expectation but who is working on that?
There is a lot to be desired on audio, where WebRTC has been (is?) neglected
WebRTC for music – Who remembers his 2013 Google IO session?
Speech to text is becoming a table stakes feature
We need a better mute button
- But we taught people to mute when not speaking for a decade now…
Group communication and SFUs
- Building a good SFU is still hard, value in e2e stack. Who owns that stack? For the client side that is still Google
- Justin mentions Agora and Twilio in PaaS and large group calls. Twilio is limited at 50 users; there are others with better group calling solutions (Look at Vonage and Daily for example)
- The WebRTC WATCHLISTS file is a really dumb metric to gauge vendors
Unifying RTC and HTTP/QUIC worlds
- How the RTC congestion controller gets along with the QUIC one is unsolved
- Also read here for more thoughts on QUIC and RTC
Unrelated to the content itself – smart cameras with auto zoom can be super annoying
Most of this session was focused on the history of WebRTC and the requirements of Clubhouse (audio-only). While we believe audio is important, video can’t be neglected either

Live QA

Watch if you found the sessions worthwhile
Justin Uberti does not wear the same clothes as in the recorded talk, breaking immersion!

SESSION 2: AUDIO ML

Audio ML is quite interesting. Large vendors are at it, and when (if?) the results will trickle into vanilla WebRTC is yet to be seen. Key takeaway: ML-based noise suppression is more important than echo cancellation these days.

Developing Machine Learning Based Speech Enhancement Models for Teams and Skype / Ross Cutler @ Microsoft

Watch if you care about audio quality, but be warned that it is very technical (and scientific)
Specific “what could have been better” questions can turn the common (and somewhat useless) “five star rating” into something that is actually actionable
Audio capture pipeline enhancements for noise suppression
- Lots of almost-scientific evaluation
- CPU perf evaluation followed by A/B testing in the field
Audio capture pipeline enhancements for combined AEC/NS
- No A/B testing results sadly
Packet loss concealment

Can AI Disrupt Speech Compression? / Jan Skoglund @ Google

Watch if you want to learn more about audio codecs
Use-case is 2G/3G connections and limited data plans
WaveNet sounds drunk with background noise or music
Lyra and SoundStream
- Realtime performance on a smartphone CPU
Lots of listening comparisons
Combine denoiser and codec
Guess what kind of music he plays 🎸

Live QA

Watch if you found the sessions worthwhile

SESSION 3: VIDEO

AV1 is coming. It will take time to be here. To get a grip over it and see what companies are doing, we got Google and Visionular.

Google is what goes inside WebRTC. Visionular is what you can buy commercially on the market for server or proprietary implementations.

Your focus should probably be on low bitrates and slide sharing scenarios.

AV1 Encoder for RTC / Marco Paniconi @ Google

Watch many times if you are a video expert. Otherwise just read this summary
RTC requirements differ from “encode a video”. Encoding screen share? We got you!
There is a “webrtc team” they are working with?
- Ah, the one that maintains apprtc… which is down. Yes there is a deployment guide but… can you click the link? No…. (like many, we’re still frustrated that appr.tc was taken down so suddenly and with no public explanation)
- “AV1X” is gone as of M96. See PSA. Missing from the release notes of course!
Unsurprisingly Duo and Meet are the use-cases driving this
- Make sure to review the BW reqs on that slide
AV1 is being tested in Meet for screen share? We will monitor!
- AV1 has a special mode for screen sharing
SVC is there but the WebRTC-SVC API to enable it is not making much progress

AV1 for RTC: Current and Future / Zoe Liu @ Visionular

Easier to follow than Marco due to being a more sales-y deck
Watch if you are considering licensing what Visionular does
A bit long for a sales deck
Lots of numbers, great if you understand those
apprtcmobile is … well, the state of that is unclear

Live QA

Watch if you found the sessions worthwhile
AV1 in Duo was low-bitrates, low resolutions. Tsahi predicted this would be the roll-out pattern
No, SVC is not there yet (as an API). Unless it is enabled by SDP munging too…?

SESSION 4: RESILIENCE AND ENCRYPTION

We found this part to be most applicable to current problems. This is where you should be spending your time and focus right now

Making Meta RTC Audio More Resilient / Andy Yang @ Meta

Highly applicable to WebRTC today. A primer on audio resilience, watch!
The presentation style is a very welcome change, giving a roadmap!
- As developers explaining the impact of your work is important
Excellent overview of common audio problems resulting from packet loss and jitter
Great comparison between NACK, opus FEC and RED
- …and how the mechanisms work in detail
- NACK for audio is a nonstandard feature. See here
- Note that opus in-band FEC has reduced quality and that “no additional bitrate overhead for FEC” is not a good idea while video is active.
  - Good explanation of the downside of in-band FEC for the SFU (removing FEC is possible but nontrivial)
  - The other main problem with in-band FEC is the lack of a control surface
- Duplication adapting to bursty loss is theoretically interesting
- SFUs adaptation of RED was brought up by Jitsi’s Boris on WebRTCHacks
- Bandwidth adaptation of RED in libwebrtc/chrome is not solved yet
Resiliency recap
- This is a great slide, but what it says about WebRTC support for “duplication” is wrong: it was there and is available in Chrome as of M96
- Overprotection is a problem, RED+fec makes no sense
- Here’s how we’d summarize these techniques:

Make sure to read about opus FEC and RED here
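To make RED a bit more concrete, here is a minimal sketch of how an RFC 2198 payload is laid out: one 4-byte header per redundant block, a 1-byte header for the primary block, then the block data in the same order. The function name and the default payload type 111 (a value often seen for Opus in Chrome SDP) are assumptions for illustration, not something shown in the talk.

```python
import struct


def build_red_payload(primary, redundant, block_pt=111):
    """Build an RFC 2198 RED payload.

    primary:   bytes of the newest encoded frame
    redundant: list of (timestamp_offset, payload) for older frames, oldest first
    block_pt:  payload type of the wrapped codec (111 is an assumed value)
    """
    headers = b""
    for ts_offset, payload in redundant:
        if not (0 <= ts_offset < 1 << 14 and len(payload) < 1 << 10):
            raise ValueError("block does not fit the 14-bit offset / 10-bit length fields")
        # 32-bit header: F=1 | 7-bit PT | 14-bit timestamp offset | 10-bit length
        word = (1 << 31) | ((block_pt & 0x7F) << 24) | (ts_offset << 10) | len(payload)
        headers += struct.pack("!I", word)
    # Final 1-byte header for the primary block: F=0 | 7-bit PT
    headers += struct.pack("!B", block_pt & 0x7F)
    return headers + b"".join(p for _, p in redundant) + primary


# With 48 kHz Opus and 20 ms frames, the previous frame sits 960 ticks back.
payload = build_red_payload(b"NEW", [(960, b"OLD")])
print(payload.hex())
```

The 10-bit length field is the reason a redundant block cannot exceed 1023 bytes, and the explicit per-block headers are what give an SFU something to work with when it wants to add or strip redundancy, unlike opus in-band FEC.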

Resiliency vs delay
- Classic E-model diagram
- Great latency analysis of the stack with breakdown of the budget
- A rare NetEQ and jitter buffer explanation. NetEQ remains relevant a decade after the GIPS acquisition
- Note that there is no RTX for audio so the packet may be treated as “just” late (a plain resend). This is a major issue for video where rtx is used most of the time to avoid this problem. Do we need RTX for audio? Maybe…
- NACK and retransmissions will increase the jitter buffer delay otherwise?
- WebRTC in the browser does offer a very limited control surface for this kind of experimentation… but it is clearly necessary
Technical metrics vs actual user perception
- Measuring technical metrics (see e.g. RED post on hacks) is easy
- Actual perception is hard
- A very open problem indeed!
Summary – rewind, watch!
- We want to know your story, tell our recruiter. Great pitch!
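For readers who have not met the E-model behind the “resiliency vs delay” diagram, the sketch below shows the general shape of the trade-off. It uses the widely cited Cole and Rosenbluth simplification of ITU-T G.107 rather than the full model, and the constants come from that simplification, not from the talk. Treat the numbers as illustrative only.

```python
def delay_impairment(one_way_delay_ms):
    """Delay impairment Id from the simplified E-model."""
    d = one_way_delay_ms
    # The second term only kicks in above roughly 177.3 ms one-way delay.
    extra = 0.11 * (d - 177.3) if d > 177.3 else 0.0
    return 0.024 * d + extra


def r_to_mos(r):
    """Map an E-model R factor to an estimated MOS (G.107 conversion)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)


def estimate_mos(one_way_delay_ms, equipment_impairment=0.0):
    """Estimate MOS from delay plus a codec/loss impairment value Ie."""
    # 94.2 is the base R value used in the simplification (an assumption here).
    r = 94.2 - delay_impairment(one_way_delay_ms) - equipment_impairment
    return r_to_mos(r)


for delay in (50, 150, 300, 500):
    print(delay, round(estimate_mos(delay), 2))
```

Every resilience mechanism that adds jitter buffer delay pushes the call to the right on this curve, which is why the latency budget breakdown in the talk is worth the rewind.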

Private Calling at WhatsApp / Xi Deng @ Meta

Again, giving a roadmap and mission statement is great!
15 billion minutes talking on whatsapp each day…
- Remember the 2018 “3 billion monthly” for Chrome?
- One wonders how they compare to the largest telcos in the world
Great definition of “privacy” when it comes to calling. Metadata? Such a pun!
Interesting threat scenario
- “no trust to faceless corporations” (how meta can Meta be?)
- Do not leak location (or IP) to strangers. Zoom auto accept anyone?
- Multi-device messaging and calling is a hard problem
Conflict for using data to improve service
- What metrics are sensitive and which ones can you use to improve?
Private 1:1 calls
- Pass-through servers seem like a relic of Whatsapp starting with XMPP as a protocol back in the day
- Multi-device diverges from modern XMPP though
- See also later slide on challenges of client-centric multi device
Decoupled relay server
- The Whatsapp stack seems still different from the Messenger one and not using “standard” terminology even
- Electing a common relay server seems wrong. ICE does not require that
- Whatsapp seems to use a relay-first approach with opportunistic P2P
- Disabling P2P for “strangers” is a very good practice

E2EE for media content
- SRTP RFC 3711 does not provide E2EE. “master secret” is a very specific SRTP term. This is equivalent to SDES (boo) but is protected by E2EE (using the Signal protocol) which makes it ok-ish…
- Having to generate different master secrets for different devices seems bad compared to DTLS-SRTP
- It is concerning that Whatsapp continues to use SDES effectively and does not consider DTLS-SRTP (with its small setup latency) to be a solution
- Identity is already a problem for chat messages. One wonders what percentage of sessions have a verified identity
Audio-video switch
- A classic example of signaling glare
- Unclear why a distributed consent algorithm is needed
- The use-case for “oh my phone is an actual phone and can not do video” is shrinking
Multiparty
- In XMPP terms the “group call storage” would be a MUC room
- Selecting the best SFU makes more sense here than for relay servers
- Warp protocol might be a frame header in the RTP payload before the actual codec payload
- Unclear why the “master secret” which is a SRTP term (and hence on the leg between client and SFU) needs to change when participants join or leave
Recruiting pitch at the end too!

Group Call End-to-End Encryption and the Challenges of Encrypting Large Calls / Abo-Talib Mahfoodh @ Meta

Highly relevant if you are looking at E2EE for WebRTC
And another session with a mission and roadmap!
Recap of the SFU architecture and what it means for encryption
Where does frame encryption happen in the client pipeline
- libwebrtc provides the FrameEncryptorInterface and FrameDecryptorInterface since 2018 but no implementations. Insertable Streams could not reuse those sync interfaces
Key negotiation
- Sender key vs session key approaches
- Session key is weaker than E2EE and only protects from the SFU which is still relevant in some use-cases
- Note that the sender key is symmetric: all receivers must know it to decrypt, which means they could also encrypt with it. This is not a problem since the receivers can not send media with the SSRCs of the sender so impersonation is not possible
- Joining the call requires a ratchet operation (which is cheap)
- Someone leaving the call requires a rekey which is O(n^2) so expensive
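The difference in cost between a join and a leave can be sketched as follows. This is a toy model under our own assumptions, not Meta’s key schedule: the ratchet is a single HMAC-SHA256 step with a made-up label, and the rekey cost simply counts one pairwise message per sender per remaining receiver.

```python
import hashlib
import hmac
import os


def ratchet(sender_key):
    """Derive the next sender key with a one-way HMAC-SHA256 step."""
    # One-way: someone handed the ratcheted key cannot recover older keys,
    # so a new joiner cannot decrypt media sent before they joined.
    return hmac.new(sender_key, b"ratchet", hashlib.sha256).digest()


def leave_rekey_messages(n_remaining):
    """Pairwise messages when one participant leaves and all senders rekey."""
    # Every remaining sender generates a fresh key and delivers it,
    # individually encrypted, to each of the other remaining participants.
    return n_remaining * (n_remaining - 1)


key = os.urandom(32)
print(len(ratchet(key)))  # one cheap hash per sender on join
for n in (5, 50, 100):
    print(n, leave_rekey_messages(n))
```

A ratchet cannot be used on leave because the departing participant already holds the current key and could run the same hash, so fresh keys have to be distributed instead.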

Scaling group call E2EE
- How large do you need to scale at? A meeting with 100 participants is not “private” so session keys might be more appropriate
- Prioritizing key exchange based on whether you are planning to send becomes important
- Rekey is expensive and larger calls have a higher participant churn making this a hard problem. A small time window to batch this operation helps
- Failure to deliver rekey messages is odd, signaling has to be reliable or something is wrong with your overall system
No recruiting pitch?!

Live QA

Watch if you found the sessions worthwhile
Out-of-band FEC does not work for audio due to the latency increase. It works for video where frames are split into multiple packets
“RS code” is Reed-Solomon FEC, https://datatracker.ietf.org/doc/html/rfc5510
MLS is the IETF effort for standardizing group key exchange
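To illustrate the out-of-band FEC point from the QA, here is a minimal sketch using a single XOR parity packet. Note that this swaps in plain XOR parity for the Reed-Solomon code discussed in the session because it fits in a few lines; the function names are ours, and passing the lost packet’s length as an argument is a simplification (real schemes carry it in the FEC header).

```python
def xor_parity(packets):
    """Return one parity packet protecting a group of media packets."""
    size = max(len(p) for p in packets)
    parity = bytearray(size)
    for packet in packets:
        # Shorter packets are treated as zero-padded to the longest one.
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received, parity, missing_length):
    """Rebuild the single lost packet from the survivors and the parity."""
    rebuilt = xor_parity(list(received) + [parity])
    return rebuilt[:missing_length]


# One video frame split into three packets, the middle one is lost.
frame = [b"slice-one", b"slice-2", b"slice-three"]
parity = xor_parity(frame)
print(recover_missing([frame[0], frame[2]], parity, len(frame[1])))
```

For video, all packets of the group belong to the same frame and leave the sender together, so the parity adds no waiting time. For audio, each packet is its own 20 ms frame, so protecting a group of k packets means the receiver may wait for up to k frames before it can repair a loss.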

Want to try out WebRTC Insights?

What you are seeing here isn’t the run-of-the-mill issue of the WebRTC Insights newsletter. It wasn’t even intended. But it does show the effort and focus we put on everything WebRTC for our clients. Watching a five hour event twice and producing actionable notes is not an easy task. It changed our weekend plans, but we ended up being very satisfied with the results, if only for our own notes.

If your company is relying heavily on WebRTC, then you should at the very least try this out. Reach out to me via the form at the end of the WebRTC Insights landing page and I’ll send you a sample issue.
