
How VAD (Voice Activity Detection) Works in Modern Transcription
Voice activity detection sits between the microphone and the transcription model, deciding which parts of the audio contain speech and which do not. It is one of the unglamorous pieces of the pipeline that almost nobody discusses, but it has an outsized impact on transcript quality. This post explains what VAD does, how it works, and why it matters for the output you actually see.
What VAD Actually Does
A voice activity detector produces a yes-or-no label for every short slice of audio: speech is happening here, or it is not. The granularity is typically 10 to 100 milliseconds per decision.
The output is a timeline that looks like this:
- 0:00 to 0:03: silence
- 0:03 to 0:15: speech
- 0:15 to 0:18: silence
- 0:18 to 0:42: speech
- 0:42 to 0:45: non-speech (laughter or noise)
The transcription model only processes the speech segments. The silent or non-speech segments are skipped, which saves compute and produces cleaner output.
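As a rough illustration of how per-frame decisions become a timeline like the one above, the sketch below collapses a list of yes/no frame labels into speech segments. The function name `frames_to_segments` and the one-second frame size used in the example are assumptions made for readability; real detectors work with much shorter frames.

```python
from typing import List, Tuple


def frames_to_segments(labels: List[bool], frame_ms: int = 30) -> List[Tuple[float, float]]:
    """Collapse per-frame speech labels into (start_s, end_s) speech segments."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current speech run, if any
    for i, is_speech in enumerate(labels):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:  # audio ended while speech was still active
        segments.append((start * frame_ms / 1000, len(labels) * frame_ms / 1000))
    return segments


# The 45-second timeline above, using one label per second for readability.
timeline = [False] * 3 + [True] * 12 + [False] * 3 + [True] * 24 + [False] * 3
print(frames_to_segments(timeline, frame_ms=1000))  # [(3.0, 15.0), (18.0, 42.0)]
```

Only the returned segments would be handed to the transcription model; everything else is skipped.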
Why VAD Matters for Transcription Quality
Three concrete benefits.
Avoiding Hallucination
Modern transcription models sometimes produce text during non-speech audio. Such a model has learned that audio plus a prompt should yield text, so it generates text even when there is nothing useful to transcribe. This is hallucination.
Whisper Large-v3 has documented cases of hallucinating phrases like "thanks for watching" or "subscribe to the channel" during silent stretches, because its training data included a lot of YouTube videos that ended with those phrases over background music. Without VAD, you can see these phantom texts in transcripts of audio that has nothing of the kind.
VAD prevents this by removing non-speech regions before the transcription model sees them. The model never has the chance to hallucinate over silence because the silence never reaches it.
Saving Compute
Transcribing 10 seconds of silence costs about as much as transcribing 10 seconds of speech, because the model still processes the full spectrogram. For long recordings with significant silence (a 90-minute lecture with 15 minutes of pauses), VAD can cut transcription time by 15 to 20 percent.
For consumer transcription tools this is a small concern. For high-volume production deployments (call centers, broadcast captioning) it adds up to real cost savings.
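The lecture example can be checked with a line of arithmetic. The sketch below assumes transcription cost scales linearly with the amount of audio processed, which is a simplification; `compute_saving` is a hypothetical helper name.

```python
def compute_saving(total_min: float, non_speech_min: float) -> float:
    """Fraction of audio the transcription model no longer has to process."""
    return non_speech_min / total_min


# 90-minute lecture with 15 minutes of pauses removed by VAD.
print(f"{compute_saving(90, 15):.1%}")  # 16.7%
```

Under that assumption the saving lands inside the 15 to 20 percent range quoted above.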
Cleaner Timestamps
When VAD marks regions as silent, the transcription model does not need to figure out timestamps in those regions. The result is more reliable word-level timing in the speech regions because the model is not anchoring its output to the wrong reference points.
How VAD Works Under the Hood
The technical approaches have evolved over the decades. Modern systems converge on a few common patterns.
Energy-Based VAD
The simplest possible VAD. Compute the energy (volume) of each audio frame; mark frames above a threshold as speech and below as silence.
This works for clean audio in quiet rooms. It fails as soon as there is background noise that is louder than the threshold, or speech that is quieter than the noise. Energy-based VAD is rarely used in modern transcription pipelines.
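A minimal sketch of the idea, assuming mono float samples in the range -1 to 1 and an arbitrary threshold of 0.02 (both are illustrative choices, not values from any particular tool):

```python
import math
from typing import List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def energy_vad(samples: Sequence[float], sample_rate: int = 16000,
               frame_ms: int = 20, threshold: float = 0.02) -> List[bool]:
    """Label each full frame as speech (True) when its RMS exceeds the threshold."""
    frame_len = sample_rate * frame_ms // 1000
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        labels.append(rms(samples[start:start + frame_len]) > threshold)
    return labels


def tone(amplitude: float, n: int, freq: float = 200.0, sr: int = 16000) -> List[float]:
    """Synthetic sine wave standing in for real audio."""
    return [amplitude * math.sin(2 * math.pi * freq * t / sr) for t in range(n)]


# Half a second of near-silence followed by half a second of a loud tone.
audio = tone(0.001, 8000) + tone(0.3, 8000)
labels = energy_vad(audio)
print(labels.count(False), labels.count(True))  # 25 25
```

The weakness described above is visible in the design: any steady noise with an RMS above 0.02 would be labeled as speech, because the detector only sees loudness.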
Spectral-Based VAD
Look at the spectral characteristics of each frame. Speech has a distinctive distribution of energy across frequencies; noise typically does not. By comparing the spectral shape of each frame to a learned speech model, you can distinguish speech from noise more reliably than with energy alone.
WebRTC VAD, an open-source VAD from Google's WebRTC project, uses this approach. It is fast, simple, and good enough for many use cases. It still misclassifies music-with-vocals as speech and can struggle with very noisy environments.
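To make "spectral shape" concrete, the sketch below computes spectral flatness, one classical feature for this job: energy concentrated in a few frequency bins gives a value near 0, while broadband noise gives a much higher value. This is an illustrative feature only and is not how the WebRTC detector is implemented; the naive DFT is used to keep the example dependency-free, and a real system would use an FFT.

```python
import cmath
import math
import random
from typing import List, Sequence


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Naive DFT power spectrum over bins 1..N/2-1 (slow, but fine for a sketch)."""
    n = len(frame)
    spectrum = []
    for k in range(1, n // 2):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 + 1e-12)  # small floor avoids log(0)
    return spectrum


def spectral_flatness(frame: Sequence[float]) -> float:
    """Geometric mean divided by arithmetic mean of the power spectrum."""
    p = power_spectrum(frame)
    log_mean = sum(math.log(v) for v in p) / len(p)
    return math.exp(log_mean) / (sum(p) / len(p))


n = 160  # 10 ms at 16 kHz
# A pure tone stands in for a strongly harmonic frame; uniform random samples for noise.
harmonic = [math.sin(2 * math.pi * 10 * t / n) for t in range(n)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(n)]

print(spectral_flatness(harmonic) < spectral_flatness(noise))  # True
```

A spectral detector would combine features like this across several frequency bands and compare them against a model of speech, rather than thresholding a single number.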
Neural VAD
The modern standard. A small neural network is trained on labeled audio (speech versus non-speech, with various background conditions) to predict the speech label for each frame.
Silero VAD, an open-source neural VAD released in 2020, is widely used in production pipelines including parts of the Whisper ecosystem. It handles noise, music, and edge cases significantly better than spectral approaches. The model is small enough to run in real time on CPU.
Larger neural VAD models exist for specialized use cases, but Silero-style models hit the sweet spot for general-purpose transcription.
VAD in the Whisper Pipeline
OpenAI's reference Whisper implementation does not include external VAD by default. The model has some internal VAD-like behavior (it can produce "no speech" tokens) but this is unreliable enough that most production pipelines run a separate VAD pass.
The typical Whisper-based pipeline looks like:
- Audio is fed through Silero VAD or similar.
- Speech regions are extracted and packed into chunks of up to 30 seconds (Whisper's window size).
- Each chunk is transcribed by Whisper Large-v3.
- The outputs are concatenated with timestamps adjusted to match the original audio timeline.
This is meaningfully more reliable than feeding Whisper the raw audio directly, especially for long files with significant silence. Our broader post on Whisper Large-v3 covers the model itself.
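The chunking and timestamp-adjustment steps in the pipeline above can be sketched as follows. The names `Chunk`, `pack_segments`, and `to_original_time` are hypothetical, and the hard split at 30 seconds is a simplification: a production pipeline would usually prefer to split at a pause so that no word is cut in half.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

MAX_CHUNK_S = 30.0  # Whisper's window size, as stated above


@dataclass
class Chunk:
    # Each piece is (offset_in_chunk, original_start, duration), in seconds.
    pieces: List[Tuple[float, float, float]] = field(default_factory=list)
    duration: float = 0.0


def pack_segments(segments: List[Tuple[float, float]],
                  max_len: float = MAX_CHUNK_S) -> List[Chunk]:
    """Pack speech segments back to back into chunks no longer than max_len."""
    chunks = [Chunk()]
    for start, end in segments:
        while start < end:
            current = chunks[-1]
            room = max_len - current.duration
            if room <= 1e-9:  # current chunk is full, open a new one
                chunks.append(Chunk())
                continue
            take = min(room, end - start)
            current.pieces.append((current.duration, start, take))
            current.duration += take
            start += take
    return [c for c in chunks if c.pieces]


def to_original_time(chunk: Chunk, t: float) -> float:
    """Map a timestamp inside a chunk back onto the original audio timeline."""
    for offset, original_start, duration in chunk.pieces:
        if offset <= t <= offset + duration:
            return original_start + (t - offset)
    raise ValueError("timestamp is outside this chunk")


chunks = pack_segments([(3.0, 15.0), (18.0, 42.0)])
print(len(chunks))                         # 2
print(to_original_time(chunks[0], 13.0))   # 19.0
print(to_original_time(chunks[1], 2.0))    # 38.0
```

A word that the model places 13 seconds into the first chunk actually occurred at 19 seconds in the recording, because the three seconds of silence between the two speech regions were removed before transcription.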
VAD in the Deepgram Pipeline
Deepgram Nova-3 includes built-in voice activity detection as part of the integrated end-to-end pipeline. Users do not need to add a separate VAD stage; the model handles it internally.
This is operationally simpler than the Whisper approach (one model, one configuration) but also less transparent. You cannot inspect the VAD output separately from the transcription output, which can make debugging difficult.
For most users the difference is invisible. For advanced use cases where you need to instrument the pipeline (research, custom integrations) the explicit VAD approach gives you more control. Our post on Deepgram Nova-3 covers the engine.
What VAD Cannot Do
A few common misconceptions worth flagging.
VAD does not identify speakers. It distinguishes speech from non-speech, not Speaker A from Speaker B. Speaker identification is a separate problem; see handling multiple speakers in AI.
VAD does not improve transcription accuracy on speech regions. It only changes which regions get transcribed. The model's accuracy on the regions it does process is unaffected by VAD (except indirectly, by removing distractor content from before and after).
VAD does not clean up background noise. It only labels regions; the actual audio is still noisy. For noise cleanup, see our post on dealing with background noise in transcription.
VAD does not differentiate types of non-speech. Laughter, music, environmental sound, and pure silence all typically get labeled as "not speech" with no distinction between them. Some specialized systems separate these, but standard VAD does not.
When VAD Helps Most
A few use cases where VAD has the biggest impact.
Long recordings with intermittent silence. Lectures with thinking pauses, meetings with discussion gaps, podcasts with ad breaks. The accuracy and cost benefit are largest here.
Voice memos and dictation. Short bursts of speech separated by silence are the worst case for hallucination without VAD; the model has nothing to anchor to during the silences.
Field recordings. Audio recorded in real-world environments often has long stretches of ambient sound between bursts of relevant speech. VAD makes these usable.
Multi-track recordings. When each speaker has their own track, that track has speech only when that speaker talks. VAD per track is essential to clean output.
Tuning VAD for Your Use Case
For most users, VAD is invisible because the pipeline ships with sensible defaults. For users who care about the tuning:
Sensitivity matters. Aggressive VAD (high speech threshold) skips more audio and risks dropping real speech that is quiet. Permissive VAD (low threshold) catches more speech but also lets through more non-speech that may cause hallucination.
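A small sketch of the trade-off, assuming the detector emits a speech probability per frame (as neural detectors typically do). The probabilities and both thresholds are made-up values for illustration.

```python
from typing import List, Sequence


def apply_threshold(probs: Sequence[float], threshold: float) -> List[bool]:
    """Mark a frame as speech when its probability reaches the threshold."""
    return [p >= threshold for p in probs]


# Hypothetical per-frame speech probabilities: a quiet onset, loud middle, soft tail.
probs = [0.05, 0.20, 0.45, 0.90, 0.95, 0.40, 0.10]

permissive = apply_threshold(probs, 0.3)
aggressive = apply_threshold(probs, 0.7)
print(sum(permissive), sum(aggressive))  # 4 2
```

The aggressive setting keeps only the two confident frames and drops the quiet onset and tail; the permissive setting keeps them, along with anything else that happens to score above 0.3.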
Minimum segment length matters. Very short segments (under 200ms) often produce poor transcription output. A VAD configuration that merges nearby segments produces cleaner downstream results than one that processes every short speech burst independently.
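One way to implement this, sketched under assumptions: segments closer together than `max_gap_s` are merged, then anything still shorter than `min_len_s` is dropped. The 200 ms minimum comes from the text above; the 300 ms gap is an arbitrary illustrative value.

```python
from typing import List, Tuple

Segment = Tuple[float, float]


def merge_segments(segments: List[Segment], max_gap_s: float = 0.3,
                   min_len_s: float = 0.2) -> List[Segment]:
    """Merge segments separated by short gaps, then drop very short leftovers."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap_s:
            # Close enough to the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return [(s, e) for s, e in merged if e - s >= min_len_s]


raw = [(0.0, 0.125), (1.0, 1.5), (1.75, 2.0), (5.0, 5.0625)]
print(merge_segments(raw))  # [(1.0, 2.0)]
```

Merging before filtering matters: a short burst that sits next to a longer segment is absorbed into it instead of being discarded.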
Padding matters. Most VAD systems pad the detected speech regions slightly before and after, to avoid clipping the beginnings and ends of words. Without padding, you can lose the first phoneme of every utterance.
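Padding can be sketched as widening each segment on both sides, clamping to the bounds of the recording, and merging any segments that now overlap. The 250 ms pad in the example is an assumption chosen to keep the numbers simple; `pad_segments` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float]


def pad_segments(segments: List[Segment], pad_s: float = 0.1,
                 audio_len_s: Optional[float] = None) -> List[Segment]:
    """Widen each (sorted) segment by pad_s on both sides and merge overlaps."""
    padded: List[Segment] = []
    for start, end in segments:
        start = max(0.0, start - pad_s)
        end = end + pad_s
        if audio_len_s is not None:
            end = min(audio_len_s, end)
        if padded and start <= padded[-1][1]:
            # Padding made this segment touch the previous one: merge them.
            padded[-1] = (padded[-1][0], max(padded[-1][1], end))
        else:
            padded.append((start, end))
    return padded


raw = [(0.125, 1.0), (1.25, 2.0), (4.0, 5.0)]
print(pad_segments(raw, pad_s=0.25, audio_len_s=5.125))
# [(0.0, 2.25), (3.75, 5.125)]
```

The first two segments merge because their padded edges overlap, and the last segment is clamped to the end of the recording.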
For users of consumer transcription tools, these knobs are not exposed and the defaults are usually fine. For developers integrating transcription into their own pipelines, these are the parameters that matter.
Where to Start
VAD is not something most users need to think about directly. If your transcripts have phantom text in silent stretches, or significant accuracy issues at the start and end of utterances, the underlying VAD is probably the issue. Running your audio through the 60-minute CATT free tier is a quick way to see whether the VAD-included pipeline produces clean output on your specific audio profile. The transcription confidence scores post covers what to look for in problematic regions, and the broader why transcription makes mistakes post covers the full failure mode taxonomy.