WebRTC is the backbone of Voice AI, yet our analysis reveals that not all voice bot implementations are created equal; read on to learn the essential best practices for echo cancellation, Opus bitrates, and network configuration to ensure your bot works for everyone.

Voice AI is THE thing today. The new interface that will solve all of our troubles. Like HAL 9000 from 2001: A Space Odyssey, we’re all going to chat ourselves to death with our favorite LLM friend.
Now… Voice AI done right requires WebRTC today. Why?
- Because it just works
- It offers low latency
- The best AI frameworks for Voice AI are built on top of and with WebRTC infrastructure. And they are open source
In the past few months we’ve been looking at the webrtc-internals dump files of quite a few Voice AI services using rtcStats. We found some interesting implementation decisions and places where there’s room for improvement. All that led to this article, focusing on best practices in using WebRTC for Voice AI.
Key Takeaways
- Voice AI requires proper use of WebRTC to ensure low latency and reliable connections
- Implementations of Voice AI using WebRTC can vary greatly; some common pitfalls include unnecessary renegotiations and improper TURN server configurations
- It’s important to enable echo cancellation and configure Opus bitrates correctly to avoid wasting resources and degrading audio quality
- Avoid using dual peer connections and negotiating video channels when unnecessary to simplify the architecture and reduce complexities
- To optimize your Voice AI, use rtcStats for diagnostics or seek expert consulting to refine your implementation
Table of contents
- Voice AI WebRTC implementations aren’t equal
- Single peer connection vs dual peer connection
- Unnecessary renegotiations
- TURN servers use
- Echo cancellation considerations
- Noise suppression
- Opus bitrates
- Silence management and bitrate
- Disable DTX
- Managing the AI’s volume level
- Video channels
- How many data channels?
- Where are audio quality algorithms needed
- Is there a need for an SFU?
- How’s your Voice AI implementation doing?
Voice AI WebRTC implementations aren’t equal

Here’s something we found surprising when we started looking at services: they aren’t created equal.
At times, WebRTC may seem like a monolith of a thing with a single path of implementation for everything, but that’s far from true these days. After almost 15 years, WebRTC has grown in richness, capabilities and available options. The tools embodied inside WebRTC are optional to use and it is your role as a developer to pick and choose the ones for your application.
In the next sections, I will review some of the design decisions we’ve seen, giving my own opinion about what’s best and why.
Single peer connection vs dual peer connection

😐
Many of the services use 2 peer connections instead of 1.
For me, a 1:1 use case doesn’t need more than a single peer connection for voice and video communications.
That said, in many cases, we see 2 separate peer connections in the implementation. One incoming and one outgoing. Why is that? Most likely because the infrastructure used was originally built and designed for group meetings, where at times it makes sense to split between the two.
My preference? Use a single peer connection for a person to bot scenario. Feel free to use more in group scenarios – as you would with group meetings.
The real harm in having 2 peer connections? Dealing with failures (on one of the connections or both) and making it harder to debug when there are issues – and there are always issues…
👉 From the rtcStats Showcase: Simli, for example, uses 2 peer connections whereas UneeQ uses 1 peer connection
Unnecessary renegotiations

For some reason, quite a few of the Voice AI services open an empty peer connection, and then renegotiate it multiple times until they get just what they need: an audio stream and data channels.
This is likely due to the use of and reliance on group conferencing technologies and connecting through an SFU, but in many ways it is useless for Voice AI. When the only thing you speak to is a bot, a single offer-answer round is enough – there’s no need to renegotiate, and no one else is going to “barge” in and join this “meeting”.
Again – these layers just add up complexity where none is needed.
My suggestion? Keep things simple where they are supposed to be simple.
👉 From the rtcStats Showcase: ElevenLabs renegotiates streams multiple times
TURN servers use

This one is just going back to basics. It seems like many Voice AI vendors are clueless about WebRTC, networking and NAT traversal. So the end result is quite a few instances where we’ve seen improper use of TURN servers and their configuration.
What needs to be done here is to go back to basics and the usual best practices of TURN configuration.
Here are things we’ve seen go wrong for Voice AI services:
- Not having any TURN servers at all (while this might be reasonable for some rare use cases, for the most part it just means fewer people being able to connect to your Voice AI interface). TURN can be discarded by those using ICE-TCP, though that’s advanced and doesn’t seem to be deployed by Voice AI vendors yet
- Having a configuration with only TURN/UDP or only TURN/TCP or similar. Better to have all 3 alternatives: UDP, TCP and TLS
- Configuring more than a single STUN server. Guess what? You don’t even need a single STUN server configured if you configure a TURN/UDP server…
- Getting connected to a distant TURN server. In some cases, I got connected via a US TURN server. I am located in Israel. Check your deployment and be sure to have TURN servers where your users are and that these TURN servers actually get allocated for the relevant sessions
👉 From the rtcStats Showcase: No TURN, no TURN TCP, no TURN TLS, wrong location, …
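As a hedged sketch, here’s what a sane ICE configuration might look like, with all three TURN transports on a single credentialed entry. The hostname, ports and credentials below are placeholders – swap in your own deployment’s values:

```javascript
// Sketch of an RTCPeerConnection configuration covering all three TURN
// transports. turn.example.com and the credentials are placeholders.
const pcConfig = {
  iceServers: [
    {
      urls: [
        'turn:turn.example.com:3478?transport=udp', // TURN/UDP
        'turn:turn.example.com:3478?transport=tcp', // TURN/TCP
        'turns:turn.example.com:443?transport=tcp', // TURN over TLS
      ],
      username: 'webrtc-user',
      credential: 'secret',
    },
    // Note: no separate STUN entry. A reachable TURN/UDP server also
    // answers STUN binding requests, so stun: URLs add nothing here.
  ],
};
```

In the browser, this object would then be passed to `new RTCPeerConnection(pcConfig)`.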
Echo cancellation considerations

In too many cases, we noticed that echo cancellation is disabled in the WebRTC audio configuration (the constraints passed to getUserMedia).
Frankly? It is hard to find a good reason for doing this.
Talking to a voice bot still requires incoming audio and outgoing audio. Your users might be using separate microphones and speakers, where good acoustic echo cancellation is needed.
If there’s any echo spilling out into the audio, your speech to text service might be hard pressed to figure out what is being said – and you definitely don’t want that.
Check if you disable echo cancellation and if you do – answer to yourself why are you doing that.
👉 From the rtcStats Showcase: Tavus disables AEC
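To make sure you aren’t turning echo cancellation off by accident, check the audio constraints you pass to getUserMedia. A minimal sketch (browsers default these flags to true, so spelling them out here is mostly documentation):

```javascript
// Audio constraints keeping browser-side processing enabled.
// echoCancellation is the one you really don't want to turn off.
const audioConstraints = {
  audio: {
    echoCancellation: true, // AEC: keeps the bot's voice out of the mic signal
    noiseSuppression: true, // optional if you suppress noise server-side instead
    autoGainControl: true,  // normalizes the captured volume level
  },
};
// In the browser: navigator.mediaDevices.getUserMedia(audioConstraints)
```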
Noise suppression

When it comes to Voice AI, noise suppression is oftentimes done on the server side. You can see Krisp being added on the server side in many such cases.
In many ways, noise suppression can be added on the client side as well with WebRTC. Krisp has a similar client-side implementation available, and so do others.
Doing this client-side can save a bit on the computing power you’re spending on the servers.
Which approach to take here? That’s up to you, but I wanted to share this option as well.
Opus bitrates

As with the volume level (more on that below), you decide on the server how to configure your Opus codec.
And it is Opus – we haven’t seen anyone use G.711 on the WebRTC-leg of their voice AI application.
What’s surprising is seeing 30, 60 or even more kbps of incoming Opus audio from a Voice AI service over WebRTC. Why does that matter? Because this is a waste of resources:
- Your service uses higher bitrate (and someone is paying for that bandwidth use)
- Constrained networks may have a hard time sending that bitrate towards the user
Our approach? Keep it low. No need to send more than WebRTC is sending via Opus by default if your Voice AI service isn’t an opera singer.
Be sure to also check your outgoing bitrate and to ask yourself if you’re sending too much there as well for some reason.
👉 From the rtcStats Showcase: High audio bitrates; extremely high audio bitrate
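One hedged way to cap this from the client side is munging the SDP before applying it, adding a maxaveragebitrate parameter to the Opus fmtp line. setOpusMaxBitrate below is a hypothetical helper name, not an API of any SDK; server-side you’d configure the Opus encoder directly instead:

```javascript
// Cap the Opus bitrate by setting maxaveragebitrate (in bits per second)
// on the Opus fmtp line of an SDP blob. Sketch only – a hypothetical helper.
function setOpusMaxBitrate(sdp, bps) {
  const rtpmap = sdp.match(/a=rtpmap:(\d+) opus\/48000/);
  if (!rtpmap) return sdp; // no Opus in this SDP
  const pt = rtpmap[1];
  const fmtpRe = new RegExp(`a=fmtp:${pt} ([^\\r\\n]*)`);
  const fmtp = sdp.match(fmtpRe);
  if (!fmtp) {
    // No fmtp line for Opus yet: append one
    // (assumes the audio m-line is the last section of the SDP)
    return sdp + `a=fmtp:${pt} maxaveragebitrate=${bps}\r\n`;
  }
  const params = fmtp[1].includes('maxaveragebitrate')
    ? fmtp[1].replace(/maxaveragebitrate=\d+/, `maxaveragebitrate=${bps}`)
    : `${fmtp[1]};maxaveragebitrate=${bps}`;
  return sdp.replace(fmtpRe, `a=fmtp:${pt} ${params}`);
}
```

You’d call this on the SDP string between createOffer/createAnswer and setLocalDescription.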
Silence management and bitrate

When people don’t speak, audio codecs have the option of marking it over the network as silence. Silence packets are smaller, making for lower bitrate use.
Interestingly, not many Voice AI services encode the speech the bot generates with a variable bitrate. This means you’re getting a “high” bitrate even when the Voice AI service is actually in listening mode and not generating any real audio of its own.
Me? I am against waste. Especially when that waste is a single configuration parameter away from being resolved…
👉 From the rtcStats Showcase: with silence management
Disable DTX

Using DTX in group video calling is great. It reduces the number of packets the media server sends towards the users during silence, which in turn reduces the network load on the server. It works especially well in large groups, where one person is speaking while everyone else is just listening.
In a 1:1 session between a human and a bot, though, the math is different. The use of DTX may even increase the latency of the response you receive from the bot – something you definitely don’t want.
Here, it is usually best to leave voice activity detection to the server, where more advanced methods can be used.
👉 From the rtcStats Showcase: DTX is enabled
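If your SDK turns DTX on via SDP, one hedged option is stripping the usedtx parameter from the Opus fmtp line before applying the SDP. stripOpusDtx is a hypothetical helper name for illustration:

```javascript
// Remove usedtx=1 from Opus fmtp parameters in an SDP blob, so silence
// frames keep flowing and voice activity detection stays a server-side
// decision. Sketch only – a hypothetical helper.
function stripOpusDtx(sdp) {
  return sdp
    .replace(/;usedtx=1/g, '')   // usedtx in the middle or at the end of the list
    .replace(/usedtx=1;?/g, ''); // usedtx as the first (or only) parameter
}
```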
Managing the AI’s volume level

The media pipeline in a WebRTC voice bot goes something like this:
Human → Microphone → WebRTC → Network → Speech to Text → LLM → Text to Speech → WebRTC → Network → Speaker → Human
When you generate your own speech from the text, you also decide what volume to use prior to encoding that audio signal.
In some cases, the volume level was higher than the default/average we usually expect on WebRTC. That’s because WebRTC tries to normalize volume levels captured by the microphone before encoding them.
You have a decision to make here. Do you:
- Use a volume level that is lower than the default, making it harder for people to hear what’s being said
- Keep the normalized volume level used by WebRTC
- Go for a higher volume level
With a higher volume level, on one hand, I can hear better what’s being said. On the other hand, it might be too loud – akin to shouting, or turning on the car radio just to get bombarded by a wall of sound.
Figure out your preference here. Me? I’d stick with the normalized levels used by WebRTC.
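If you do decide to adjust the level server-side before encoding, the operation itself is just a linear gain with clipping protection. applyGain is a hypothetical helper for illustration, operating on float PCM samples in the [-1, 1] range:

```javascript
// Apply a linear gain to a block of float PCM samples, clamping to [-1, 1]
// so that boosted samples don't wrap around and distort.
function applyGain(samples, gain) {
  return samples.map((s) => Math.max(-1, Math.min(1, s * gain)));
}
```

A gain of 1.0 leaves the level untouched – which, per the above, is where I’d keep it.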
Video channels

Here’s something interesting…
We saw Voice AI services that open up video channels in scenarios that were pure voice – no video exchanged at all. In such a case, just don’t negotiate video channels. There’s no need for it.
Another thing we noticed for Voice AI services where video was actually shared from the user’s side (for vision related scenarios) – sometimes, it was done using simulcast. For the 1:1 nature of these scenarios – simulcast is likely hurting video quality and not improving it. Just don’t use simulcast if it isn’t needed (and it isn’t for 1:bot cases).
And if your AI service can’t handle or deal with variable video bitrates, then you should probably solve that if user experience is important to you.
👉 From the rtcStats Showcase: simulcast used, video resolution too high
How many data channels?

Some Voice AI services don’t open data channels. Others open quite a few of them.
It isn’t always apparent that these data channels are even needed – and oftentimes some of them weren’t used in our experiments.
Check which data channels you open, with what configurations and most importantly – why…
👉 From the rtcStats Showcase: no data channels, 1 data channel, 4 data channels
Where are audio quality algorithms needed

There are quite a few WebRTC media resilience algorithms and tools one can use.
From the human speaking towards the LLM? These algorithms aren’t that important in this direction. The purpose isn’t to make the audio easy for a person to listen to and understand – it only needs to work for the speech to text engine, which has its own algorithms for dealing with degraded audio at times.
From the bot to the human, using these tools definitely makes sense – that’s where the listener we care about sits. So using whatever we usually do for human to human communication would make sense here as well.
Then there’s stereo… somehow, some Voice AI services negotiate stereo towards the user. Why would we need that for general purpose agentic use cases? I have no clue. These aren’t gaming services so the need likely doesn’t exist. Moreover, stereo takes more CPU to generate and more network to send – and many text to speech engines default to mono and not stereo anyway.
Pick and choose here what you really need, and don’t blindly select algorithms that “improve quality” – they might hurt you instead of help you.
Is there a need for an SFU?

SFUs are used for group calling and live streaming. Most Voice AI services that use WebRTC for some reason go through an SFU. The reason is that this is how Video API vendors have set things up. Call it legacy. Call it the way we do things around here.
Is that the best approach?
It probably isn’t if you’re doing a 1:bot conversation. In such a case, the SFU is just an extra leg in the session – one that adds latency and another moving part to worry about.
If you can work on an architecture that doesn’t need SFU it is likely a win for you.
👉 You are still likely to need a media server/gateway to “translate” WebRTC to WebSocket to fit into the speech to text engine as many of these work with a WebSocket interface.
How’s your Voice AI implementation doing?

Voice AI is the thing these days. Everywhere we go, that’s the main theme and discussion. The focus of many WebRTC builders has shifted towards Voice AI.
With this focus, there’s a plethora of solutions, demos, services and applications that use WebRTC in Voice AI services, but they aren’t created equal.
For Voice AI you’ll need to deal with prompts, conversation drifts, hallucinations, evals, and a lot of other aspects related to the use of generative AI.
The thing is, in many ways, we need to also take care of the basics of audio communications – make sure that the audio is sent and received in the best and most efficient way possible. Otherwise, our bot won’t work as expected (it will when you test it, but it won’t for some of your users).
The success of your Voice AI hinges on a stable, efficient audio transport layer. If you’ve identified potential pitfalls in TURN server use, echo cancellation, or Opus bitrates, it’s time to resolve your hidden implementation debt.
You have two ways to ensure your WebRTC foundation is optimized for efficiency and reliability:
- Self-diagnose with rtcStats: Use our open-source and freemium rtcstats.com combo to look “under the hood” of your WebRTC implementation and identify the critical optimizations needed. Start raising an eyebrow at your implementation today
- Expert consulting: For hands-on help reviewing and perfecting your service, we provide consulting to ensure your bot works for all your users
Ready to optimize? Reach out to me to learn more about consulting or start exploring your data with rtcstats.com.
