Voice Activity Detection (VAD) is used by voice codecs to detect when nothing is being said, meaning there is no speech worth encoding and sending. In VoIP, it serves as a simple compression mechanism: no bandwidth or processing is wasted on low-volume audio or content that isn’t relevant.
VAD algorithms need to separate speech from noise, accurately identifying when the microphone is capturing only noise. That noise (or low-level audio with no speech in it) can then be removed and replaced with comfort noise. VAD is also used to decide when to switch to DTX (discontinuous transmission), where regular audio packets are not sent during silence.
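To make the speech-versus-noise decision concrete, here is a minimal sketch of an energy-based VAD with a slowly adapting noise floor. It is an illustration only, not the algorithm any particular codec uses (production VADs typically also rely on spectral features or trained models). The function names, the parameters `threshold_ratio`, `noise_adapt`, `initial_noise` and `min_noise`, and all of their default values are assumptions chosen for the example.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame of float samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def energy_vad(frames, threshold_ratio=3.0, noise_adapt=0.05,
               initial_noise=0.01, min_noise=1e-4):
    """Return one boolean per frame: True = speech, False = noise/silence.

    All parameters are illustrative assumptions, not values from a real codec.
    """
    noise_floor = initial_noise
    decisions = []
    for frame in frames:
        level = frame_rms(frame)
        # A frame counts as speech when it is clearly louder than the noise floor.
        is_speech = level > noise_floor * threshold_ratio
        if not is_speech:
            # Track the background level slowly, and only on non-speech frames,
            # so that speech does not inflate the noise estimate.
            noise_floor = max(
                min_noise,
                (1.0 - noise_adapt) * noise_floor + noise_adapt * level,
            )
        decisions.append(is_speech)
    return decisions
```

A sender could use the per-frame result to stop encoding during non-speech frames, which is the point at which comfort noise and DTX come into play.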
Another, more recent use of VAD is turn detection in conversations with an AI agent. Here, the end of a spoken prompt needs to be found so that the speech can be converted to text and passed to a generative AI model. In most cases, this relies on a different VAD than the one used in codecs: one that is more accurate but also more resource intensive, which means running it on servers instead of on the user’s device.
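As a rough illustration of how turn detection can sit on top of a VAD, the sketch below declares the end of a prompt once enough speech has been heard and is then followed by a sustained run of non-speech frames. This is a simplified assumption about how such logic might look, not a description of any specific product; real turn detectors often combine VAD output with semantic or prosodic cues. The function name `find_end_of_turn` and the timing defaults are hypothetical.

```python
def find_end_of_turn(decisions, frame_ms=20, min_speech_ms=200, end_silence_ms=600):
    """Return the index of the frame where the turn is considered finished.

    `decisions` is a sequence of per-frame VAD results (True = speech).
    Returns None if no end of turn is found. Timing values are illustrative.
    """
    speech_frames_needed = min_speech_ms // frame_ms
    silence_frames_needed = end_silence_ms // frame_ms

    speech_count = 0  # total speech frames seen so far
    silence_run = 0   # consecutive non-speech frames after enough speech

    for i, is_speech in enumerate(decisions):
        if is_speech:
            speech_count += 1
            silence_run = 0  # a short pause does not end the turn
        elif speech_count >= speech_frames_needed:
            silence_run += 1
            if silence_run >= silence_frames_needed:
                return i
    return None
```

The trade-off sits in `end_silence_ms`: a shorter value makes the agent respond faster but risks cutting the speaker off during a natural pause.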