RTC@Scale 2022 summary and insights

By Tsahi Levent-Levi

March 7, 2022  

Read the latest RTC@Scale 2023 summary and insights.

RTC@Scale was Facebook’s virtual WebRTC event, covering current and future topics. Here’s the summary so you can pick and choose the relevant ones for you.

WebRTC Insights is a subscription service I have been running with Philipp Hancke for over a year now. Its purpose is to make it easier for developers to get a grip on WebRTC and all of the changes happening in the code and browsers, keeping you up to date so you can focus on what you do best: building awesome applications.

We got into a kind of a flow:

  • Once every two weeks we finalize and publish a newsletter issue
  • Once a month we record a video summarizing libwebrtc release notes

It is fun to do and the feedback we’re getting is positive.

That said, being us means that we can’t really sit still… or in this case – Philipp…

We published this on Monday the week after the event took place to our WebRTC Insights clients, and now, we’re opening it up for everyone as well.

Why an RTC@Scale summary?

Philipp decided it would make sense to summarize the recent RTC@Scale “recruiting event” that Facebook did – the RSVP was explicitly asking for consent to be contacted. The technical depth of the talks was amazing so we’ve added an “out of order” issue for you, just for this 😎

The intent is for you to *not* spend 5 hours but rather to focus on the select sessions that are relevant for you.

The event setup was simple:

KEYNOTE, PANEL AND WRAP-UP

Real-time Communication for Today and Future Experiences / Maher Saba @ Meta

  • Product-focused, make your product managers watch
  • Now this is a good recruiting pitch with all the fancy things you could work on!
  • One wonders if you will get interviewed on a VR whiteboard when applying…

Panel: RTC in the Metaverse / Sriram Srinivasan, Mike Arcuri, Paul Boustead, and Cullen Jennings

  • Product-oriented, a lot of talking. Watch with a glass of wine
  • 40 minutes felt too long
  • The question everyone avoids is “what is Fortnite doing?”

SESSION 1: FUTURE RTC EXPERIENCES

These sessions focus on roadmap and far future views. We’d rather have a bit more on the here and now and the immediate future requirements than on what would happen in 3, 5 or 10 years’ time, but hey – they are recruiting 😉

Holographic Video Calling / Nitin Garg @ Meta

  • What will the technology stack for holographic video calling look like?
  • This is 5+ years into the future?
    • Encoding a single frame takes 30s currently (on i7 laptop)
    • It needs to be ~3ms to be really interesting
  • Comments on BWE, delay, rate control and FEC are relevant today
    • “Typical” behavior of BWE @ 2930s looks far too unstable
  • Holographic video calling is a nice topic, but niche at the moment. There are a lot more pressing aspects of scale that need to be dealt with first

Spatial Communications at Scale in Virtual Environments / Paul Boustead @ Dolby

  • Spatial audio in virtual worlds
    • Experience of rotating your head is important
    • Rendering the 3 loudest streams is what WebRTC does by default
  • P2P vs forward vs mixing
    • Server side mixing with HRTF (Head Related Transfer Function) vs multichannel spatial codec
    • The bigger the group, the more sense it would make to switch to spatial mixing of audio (assuming you’re into spatial audio)
  • Audio chain considerations
    • Watch this part for generally useful considerations
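
The “loudest 3 streams” behavior mentioned above can be pictured as a top-N selection over per-stream audio levels. Here is a minimal sketch in Python, assuming each stream reports a level between 0.0 and 1.0. The stream names and the `select_loudest` helper are made up for illustration; this is not the actual WebRTC mixer code.

```python
import heapq


def select_loudest(levels, n=3):
    """Return the ids of the n loudest streams, loudest first.

    levels: dict mapping a stream id to its current audio level (0.0 to 1.0).
    Ties are broken by stream id so the result is deterministic.
    """
    ranked = heapq.nlargest(n, levels.items(), key=lambda item: (item[1], item[0]))
    return [stream_id for stream_id, _ in ranked]


# Hypothetical levels for a five-person room.
levels = {"alice": 0.82, "bob": 0.10, "carol": 0.55, "dave": 0.31, "erin": 0.02}
print(select_loudest(levels))
```

A spatial renderer would still need a position for each selected stream, which is where the P2P vs forward vs mixing trade-offs from the talk come in.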

RTC3 / Justin Uberti @ Clubhouse

  • Great separation into phases, make product manager(s) watch
    • Interesting that he classifies 2010-2019 as mobile-driven and 2020+ as meeting-driven. “meetings usage eclipses call usage”
  • Reliability may be the expectation but who is working on that?
  • There is a lot to be desired on audio, where WebRTC has (is?) been neglected
  • WebRTC for music – Who remembers his 2013 Google IO session?
  • Speech to text is becoming a table stakes feature
  • We need a better mute button
    • But we taught people to mute when not speaking for a decade now…
  • Group communication and SFUs
    • Building a good SFU is still hard, value in e2e stack. Who owns that stack? For the client side that is still Google
    • Justin mentions Agora and Twilio in PaaS and large group calls. Twilio is limited at 50 users; there are others with better group calling solutions (Look at Vonage and Daily for example)
    • The WebRTC WATCHLISTS file is a really dumb metric to gauge vendors
  • Unifying RTC and HTTP/QUIC worlds
    • How the RTC congestion controller gets along with the QUIC one is unsolved
    • Also read here for more thoughts on QUIC and RTC
  • Unrelated to the content itself – smart cameras with auto zoom can be super annoying
  • Most of this session was focused on the history of WebRTC and the requirements of Clubhouse (audio-only). While we believe audio is important, video can’t be neglected either

Live QA

  • Watch if you found the sessions worthwhile
  • Justin Uberti does not wear the same clothes as in the recorded talk, breaking immersion!

SESSION 2: AUDIO ML

Audio ML is quite interesting. Large vendors are at it, and when (if?) the results will trickle into vanilla WebRTC is yet to be seen. Key takeaway: ML-based noise suppression is more important than echo cancellation these days.

Developing Machine Learning Based Speech Enhancement Models for Teams and Skype / Ross Cutler @ Microsoft

  • Watch if you care about audio quality but very technical (and scientific)
  • Specific “what could have been better” questions can turn the common (and somewhat useless) “five star rating” into something that is actually actionable
  • Audio capture pipeline enhancements for noise suppression
    • Lots of almost-scientific evaluation
    • CPU perf evaluation followed by A/B testing in the field
  • Audio capture pipeline enhancements for combined AEC/NS
    • No A/B testing results sadly
  • Packet loss concealment

Can AI Disrupt Speech Compression? / Jan Skoglund @ Google

  • Watch if you want to learn more about audio codecs
  • Use-case is 2G/3G connections and limited data plans
  • WaveNet sounds drunk with background noise or music
  • Lyra and SoundStream
    • Realtime performance on a smartphone CPU
  • Lots of listening comparisons
  • Combine denoiser and codec
  • Guess what kind of music he plays 🎸

Live QA

  • Watch if you found the sessions worthwhile

SESSION 3: VIDEO

AV1 is coming. It will take time to be here. To get a grip over it and see what companies are doing, we got Google and Visionular.

Google is what goes inside WebRTC. Visionular is what you can buy commercially on the market for server or proprietary implementations.

Your focus should probably be on low bitrates and slide sharing scenarios.

AV1 Encoder for RTC / Marco Paniconi @ Google

  • Watch many times if you are a video expert. Otherwise just read this summary
  • RTC requirements differ from “encode a video”. Encoding screen share? We got you!
  • There is a “webrtc team” they are working with?
    • Ah, the one that maintains apprtc… which is down. Yes there is a deployment guide but… can you click the link? No…. (like many, we’re still frustrated that appr.tc was taken down so abruptly and with no public explanation)
    • “AV1X” is gone as of M96. See PSA. Missing from the release notes of course!
  • Unsurprisingly Duo and Meet are the use-cases driving this
    • Make sure to review the BW reqs on that slide
  • AV1 is being tested in Meet for screen share? We will monitor!
    • AV1 has a special mode for screen sharing
  • SVC is there but the WebRTC-SVC API to enable it is not making much progress

AV1 for RTC: Current and Future / Zoe Liu @ Visionular

  • Easier to follow than Marco due to being a more sales-y deck
  • Watch if you are considering licensing what Visionular does
  • A bit long for a sales deck
  • Lots of numbers, great if you understand those
  • apprtcmobile is … well, the state of that is unclear

Live QA

  • Watch if you found the sessions worthwhile
  • AV1 in Duo was low-bitrates, low resolutions. Tsahi predicted this would be the roll-out pattern
  • No, SVC is not there yet (as an API). Unless it is enabled by SDP munging too…?

SESSION 4: RESILIENCE AND ENCRYPTION

We found this part to be most applicable to current problems. This is where you should be spending your time and focus right now

Making Meta RTC Audio More Resilient / Andy Yang @ Meta

  • Highly applicable to WebRTC today. A primer on audio resilience, watch!
  • The presentation style is a very welcome change, giving a roadmap!
    • As developers explaining the impact of your work is important
  • Excellent overview of common audio problems resulting from packet loss and jitter
  • Great comparison between NACK, opus FEC and RED
    • …and how the mechanisms work in detail
    • NACK for audio is a nonstandard feature. See here
    • Note that opus in-band FEC has reduced quality and that “no additional bitrate overhead for FEC” is not a good idea while video is active.
      • Good explanation of the downside of in-band FEC for the SFU (removing FEC is possible but nontrivial)
      • The other main problem with in-band FEC is the lack of a control surface
    • Duplication adapting to bursty loss is theoretically interesting
    • SFUs adaptation of RED was brought up by Jitsi’s Boris on WebRTCHacks
    • Bandwidth adaptation of RED in libwebrtc/chrome is not solved yet
  • Resiliency recap
    • This is a great slide but WebRTC support for “duplication” is wrong, it was there and is available in Chrome as of M96
    • Overprotection is a problem, RED+fec makes no sense
    • Here’s how we’d summarize these techniques:
  • Resiliency vs delay
    • Classic E-model diagram
    • Great latency analysis of the stack with breakdown of the budget
    • A rare NetEQ and jitter buffer explanation. NetEQ remains relevant a decade after the GIPS acquisition
    • Note that there is no RTX for audio so the packet may be treated as “just” late (a plain resend). This is a major issue for video where rtx is used most of the time to avoid this problem. Do we need RTX for audio? Maybe…
    • NACK and retransmissions will increase the jitter buffer delay otherwise?
    • WebRTC in the browser does offer a very limited control surface for this kind of experimentation… but it is clearly necessary
  • Technical metrics vs actual user perception
    • Measuring technical metrics (see e.g. RED post on hacks) is easy
    • Actual perception is hard
    • A very open problem indeed!
  • Summary – rewind, watch!
    • We want to know your story, tell our recruiter. Great pitch!
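
To make the RED part of that comparison more concrete, here is a minimal sketch of how a RED payload is laid out according to RFC 2198: each redundant block gets a 4-byte header (F bit, 7-bit payload type, 14-bit timestamp offset, 10-bit length), and the newest frame gets a 1-byte header and goes last. The `pack_red` helper, the payload type 111 and the frame sizes are assumptions for illustration only; this is not the libwebrtc implementation.

```python
import struct


def pack_red(primary, redundant, payload_type=111):
    """Build a RED (RFC 2198) payload.

    primary: bytes of the newest encoded frame.
    redundant: list of (timestamp_offset, payload) for older frames, oldest first.
    payload_type: payload type of the wrapped codec (111 is just an assumed
    value here; the real one is negotiated in SDP).
    """
    if not 0 <= payload_type < (1 << 7):
        raise ValueError("payload type must fit in 7 bits")
    headers = b""
    blocks = b""
    for ts_offset, payload in redundant:
        if not 0 <= ts_offset < (1 << 14):
            raise ValueError("timestamp offset must fit in 14 bits")
        if len(payload) >= (1 << 10):
            raise ValueError("block length must fit in 10 bits")
        # F=1 (more headers follow) | 7-bit PT | 14-bit offset | 10-bit length
        word = (1 << 31) | (payload_type << 24) | (ts_offset << 10) | len(payload)
        headers += struct.pack("!I", word)
        blocks += payload
    # Final header: F=0 plus the payload type, a single byte.
    headers += struct.pack("!B", payload_type)
    return headers + blocks + primary


# One level of redundancy: the previous 20 ms frame (960 ticks at 48 kHz).
older = b"\x01" * 40
newest = b"\x02" * 45
packet = pack_red(newest, [(960, older)])
print(len(packet), len(packet) - len(newest))
```

The second printed number is the extra bytes per packet spent on redundancy, which is why bandwidth adaptation for RED matters once video is competing for the same bitrate.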

Private Calling at WhatsApp / Xi Deng @ Meta

  • Again, giving a roadmap and mission statement is great!
  • 15 billion minutes talking on WhatsApp each day…
    • Remember the 2018 “3 billion monthly” for Chrome?
    • One wonders how they compare to the largest telcos in the world
  • Great definition of “privacy” when it comes to calling. Metadata? Such a pun!
  • Interesting threat scenario
    • “no trust to faceless corporations” (how meta can Meta be?)
    • Do not leak location (or IP) to strangers. Zoom auto accept anyone?
    • Multi-device messaging and calling is a hard problem
  • Conflict for using data to improve service
    • What metrics are sensitive and which ones can you use to improve?
  • Private 1:1 calls
    • Pass-through servers seem like a relic of Whatsapp starting with XMPP as a protocol back in the day
    • Multi-device diverges from modern XMPP though
    • See also later slide on challenges of client-centric multi device
  • Decoupled relay server
    • The Whatsapp stack still seems different from the Messenger one and does not even use “standard” terminology
    • Electing a common relay server seems wrong. ICE does not require that
    • Whatsapp seems to use a relay-first approach with opportunistic P2P4121
    • Disabling P2P for “strangers” is a very good practice
  • E2EE for media content
    • SRTP RFC 3711 does not provide E2EE. “master secret” is a very specific SRTP term. This is equivalent to SDES (boo) but is protected by E2EE (using the Signal protocol) which makes it ok-ish…
    • Having to generate different master secrets for different devices seems bad compared to DTLS-SRTP
    • It is concerning that Whatsapp continues to use SDES effectively and does not consider DTLS-SRTP (with its small setup latency) to be a solution
    • Identity is already a problem for chat messages. One wonders what percentage of sessions have a verified identity
  • Audio-video switch
    • A classic example of signaling glare
    • Unclear why a distributed consent algorithm is needed
    • The use-case for “oh my phone is an actual phone and can not do video” is shrinking
  • Multiparty
    • In XMPP terms the “group call storage” would be a MUC room
    • Selecting the best SFU makes more sense here than for relay servers
    • Warp protocol might be a frame header in the RTP payload before the actual codec payload
    • Unclear why the “master secret” which is an SRTP term (and hence on the leg between client and SFU) needs to change when participants join or leave
  • Recruiting pitch at the end too!

Group Call End-to-End Encryption and the Challenges of Encrypting Large Calls / Abo-Talib Mahfoodh @ Meta

  • Highly relevant if you are looking at E2EE for WebRTC
  • And another session with a mission and roadmap!
  • Recap of the SFU architecture and what it means for encryption
  • Where does frame encryption happen in the client pipeline
    • libwebrtc provides the FrameEncryptorInterface and FrameDecryptorInterface since 2018 but no implementations. Insertable Streams could not reuse those sync interfaces
  • Key negotiation
    • Sender key vs session key approaches
    • Session key is weaker than E2EE and only protects from the SFU which is still relevant in some use-cases
    • Note that the sender key is symmetric and all receivers must know it to decrypt, but they could encrypt with it. This is not a problem since the receivers can not send media with the SSRCs of the sender so impersonation is not possible
    • Joining the call requires a ratchet operation (which is cheap)
    • Someone leaving the call requires a rekey which is O(n^2) so expensive
  • Scaling group call E2EE
    • How large do you need to scale to? A meeting with 100 participants is not “private” so session keys might be more appropriate
    • Prioritizing key exchange based on whether you are planning to send becomes important
    • Rekey is expensive and larger calls have a higher participant churn making this a hard problem. A small time window to batch this operation helps
    • Failure to deliver rekey messages is odd, signaling has to be reliable or something is wrong with your overall system
  • No recruiting pitch?!
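
To illustrate why a join is cheap and a leave is expensive with sender keys, here is a minimal sketch. It assumes a hash-based ratchet (HMAC-SHA256 with a fixed label) and pairwise delivery of fresh keys after a leave. The `ratchet` and `rekey_messages` helpers and the label are made up for illustration; this is not the scheme Meta presented.

```python
import hashlib
import hmac


def ratchet(key: bytes) -> bytes:
    """Derive the next key from the current one with a one-way function.

    Existing members can do this locally on a join; the joiner only receives
    the new key and cannot go backwards to decrypt earlier media.
    """
    return hmac.new(key, b"ratchet", hashlib.sha256).digest()


def rekey_messages(remaining: int) -> int:
    """Pairwise key deliveries after a leave.

    The leaver could compute any ratcheted key, so every remaining sender
    needs a fresh key and must deliver it to every other remaining
    participant: remaining * (remaining - 1) messages.
    """
    return remaining * (remaining - 1)


key0 = bytes(32)  # placeholder key, all zeros, for the demo only
key1 = ratchet(key0)
print(key1.hex()[:16])

for call_size in (5, 50, 100):
    # One participant leaves, so call_size - 1 remain.
    print(call_size, rekey_messages(call_size - 1))
```

The quadratic growth in the second loop is what makes batching several leaves into one rekey within a small time window attractive for large calls.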

Live QA

Want to try out WebRTC Insights?

What you are seeing here isn’t the run-of-the-mill issue of a WebRTC Insights newsletter. It wasn’t even intended. But it does show the effort and focus we put on everything WebRTC for our clients. Watching a five hour event twice and producing actionable notes is not an easy task. It changed our weekend plans but we ended up being very satisfied with the results, if only for our own notes.

If your company is relying heavily on WebRTC, then you should at the very least try this out. Reach out to me via the form at the end of the WebRTC Insights landing page and I’ll send you a sample issue.

