Why Transcription Makes Mistakes: The Failure Modes Explained

Mamane B. Moussa | May 26, 2026 | Updated July 2, 2026 | 11 min read

The Short Answer

AI transcription errors fall into five distinct categories, each with a different root cause. Knowing which category a mistake belongs to tells you where to focus the fix. Random-looking errors usually are not random at all.

Acoustic Errors: The Model Misheard

The underlying cause is a degraded audio signal. When the sound reaching the model is ambiguous, the model guesses, and it sometimes guesses wrong.

The most common acoustic culprits:

Narrowband phone audio. Standard telephony samples speech at 8 kHz, which by the Nyquist limit removes all frequency content above 4 kHz. A 2025 benchmark on real call-center recordings found that even the best system achieved only 87.7% accuracy on that audio, compared to the 95% to 99% routinely seen on wideband recordings. The missing high frequencies are exactly what distinguishes certain fricatives and stops from each other.

Background noise. Noise masks the speech signal and lowers the signal-to-noise ratio. What the model receives is a blend of voices and environment, not a clean voice. See dealing with background noise in transcription for the technical options.

Distance and reverberation. A microphone across a room picks up a reverb smear of the original sound. By the time the waveform reaches the model, consonant edges are blurred and short words disappear into room reflections.

Microphone artifacts. Plosive pops, wind noise, clipping, and dropout each remove information the model needs to distinguish phonemes. Clipping in particular is unrecoverable: the waveform peaks are flattened at full scale, so the model sees a plateau where the detail of a consonant should be.
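
Several of these culprits can be measured before you send anything to a transcription engine. The following is a minimal sketch, assuming audio samples already decoded to floats in the range -1.0 to 1.0 and a noise-only stretch you have picked by hand. The function names and the tolerance value are illustrative assumptions, not part of any particular tool.

```python
import math

def nyquist_hz(sample_rate_hz):
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2

def clipping_ratio(samples, full_scale=1.0, tolerance=1e-3):
    """Fraction of samples sitting at, or within tolerance of, full scale."""
    if not samples:
        return 0.0
    limit = full_scale - tolerance
    return sum(1 for s in samples if abs(s) >= limit) / len(samples)

def snr_db(speech, noise):
    """Rough signal-to-noise ratio in dB.

    Assumes `speech` is a stretch with talking and `noise` is a stretch
    of room tone only; both are hand-picked, which is an assumption.
    """
    def power(x):
        return sum(v * v for v in x) / len(x)
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise stretch is digital silence; SNR is undefined")
    return 10.0 * math.log10(power(speech) / noise_power)

# A sine wave driven 50% past full scale, then hard-limited like an overloaded input.
overdriven = [max(-1.0, min(1.0, 1.5 * math.sin(2 * math.pi * i / 100)))
              for i in range(1000)]

print(nyquist_hz(8000))            # telephony: nothing above this survives
print(clipping_ratio(overdriven))  # a large fraction of samples are pinned
```

A clipping ratio well above zero or a low SNR points at the recording chain, not the engine, which is the argument for fixing the source first.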

My take: acoustic errors are almost always fixable at the source. A directional microphone at close range in a quiet room is a bigger accuracy upgrade than switching transcription providers.

Linguistic Errors: The Model Heard Right but Chose Wrong

The model received the audio correctly but selected the wrong word from acoustically similar options. This is a language-model failure, not a signal failure.

Homophones

"Their," "there," and "they're" sound identical. The model uses surrounding context to pick one, and with short sentences or fast speaker turns, that context is thin. The same applies to "affect" versus "effect," "principal" versus "principle," and hundreds of similar pairs. The choice is probabilistic, and the model is sometimes confidently wrong.

Out-of-Vocabulary (OOV) Words

This is one of the most consistent failure modes. Every transcription model is shaped by the vocabulary of its training data. Words that are rare or absent there, most often proper nouns, brand names, uncommon surnames, and specialized terms, get misrecognized as the nearest familiar equivalent. Subword-based models can in principle spell any word, but in practice they strongly favor the sequences they were trained on.

A company named "Atrion" comes back as "Adrian." A drug name like "Dupixent" becomes "duplex aunt." A founder's name gets rendered as the closest common word. OOV errors hurt precisely where accuracy matters most, because names and brand terms carry meaning that a generic substitution destroys.

Practical mitigation: both Deepgram and Whisper-based systems support vocabulary biasing or initial prompts. Feeding a list of expected proper nouns before transcription substantially reduces OOV misses.
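
Preparing that list is mostly bookkeeping. Here is a minimal sketch that turns a glossary into a single prompt string, deduplicated and capped in length. The function name and the character budget are assumptions for illustration; each engine has its own limit and its own parameter name (the open-source whisper package, for example, takes an `initial_prompt` argument to `transcribe`).

```python
def build_vocabulary_prompt(terms, max_chars=800):
    """Join expected terms into one prompt string.

    Terms are deduplicated case-insensitively, keeping the first spelling.
    `max_chars` is an assumed budget, not a documented engine limit.
    """
    seen = set()
    kept = []
    used = 0
    for term in terms:
        term = term.strip()
        key = term.lower()
        if not term or key in seen:
            continue
        extra = len(term) + (2 if kept else 0)  # ", " separator
        if used + extra > max_chars:
            break
        seen.add(key)
        kept.append(term)
        used += extra
    return ", ".join(kept)

prompt = build_vocabulary_prompt(["Atrion", "Dupixent", "atrion", " dysarthria "])
print(prompt)
```

Put the terms most likely to be misheard first, since anything past the budget is dropped.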

Numbers and Formats

"Three fifty," "three hundred fifty," "350," and "3:50" are all plausible interpretations of the same utterance depending on context. The model picks one and often picks the wrong register. Technical contexts where numbers are dense, such as legal documents, financial calls, or scientific lectures, tend to produce a higher number error rate.

Domain Vocabulary

Medical, legal, scientific, and technical fields all have jargon that is underrepresented in general training data. "Dysarthria" gets rendered as "this arthria." "Fiduciary" occasionally comes back as "field theory." The model defaults to the more common word that occupies a similar acoustic slot.

For more on the mechanics of how models choose between competing candidates, see transcription accuracy explained.

Speaker Errors: Right Words, Wrong Person

Diarization assigns words to speakers. When it fails, the words are correct but the attribution is wrong. According to research published in 2025, real-world single-channel diarization error rates run between 10% and 18% on unsegmented audio.

Why Voices Get Confused

Speaker diarization works by extracting a voice embedding for each speaker segment and clustering them. Two speakers with similar pitch, speaking rate, and accent land close together in that embedding space, and the clustering algorithm misassigns segments.
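
To see why similar voices collapse into one label, consider a toy version of the clustering step. This sketch uses made-up three-dimensional embeddings and a greedy threshold rule. Production systems use learned embeddings with hundreds of dimensions and more careful clustering, so treat the numbers and the algorithm as illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def cluster_segments(embeddings, threshold=0.9):
    """Greedy clustering: a segment joins the first cluster whose
    representative is at least `threshold` similar, else starts a new one."""
    representatives = []
    labels = []
    for emb in embeddings:
        for idx, rep in enumerate(representatives):
            if cosine_similarity(emb, rep) >= threshold:
                labels.append(idx)
                break
        else:
            representatives.append(emb)
            labels.append(len(representatives) - 1)
    return labels

# Hypothetical embeddings: two speakers with similar voices, one distinct.
alice = [0.90, 0.10, 0.30]
bob = [0.85, 0.20, 0.35]
carol = [0.10, 0.90, 0.20]

print(cluster_segments([alice, bob, carol], threshold=0.9))    # alice and bob merge
print(cluster_segments([alice, bob, carol], threshold=0.995))  # all three separate
```

Raising the threshold separates the similar pair here, but on real audio a stricter threshold also splits one speaker into several, which is the trade-off diarization systems have to tune.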

Short turns make it worse. When someone says only "right" or "mm-hmm" before passing the floor back, the model has almost no signal to anchor that speaker identity. Those one-word backchannels are frequently misattributed.

Crosstalk: The Hardest Case

When two people speak simultaneously on a single microphone, their voices sum into one waveform, and the diarization algorithm cannot reliably separate them. The result is either a dropped speaker or a phantom speaker who does not exist. Overlap-aware diarization systems explicitly detect these regions and flag them, which is better than silent misattribution, but the underlying content is still ambiguous.

Multichannel recording, one microphone per speaker, solves this cleanly. The channel separation gives the model a signal that single-mic diarization cannot reproduce.

See speaker diarization explained for a fuller breakdown of how the clustering works.

Boundary Errors: Correct Words, Wrong Positions

Words at the edges of audio segments are consistently at the highest risk of error. Two mechanisms cause this.

Voice Activity Detection Clipping

Most transcription pipelines use voice activity detection (VAD) to identify speech regions and skip silence. An aggressive VAD opens a speech region slightly too late or closes it slightly too early, clipping the first phoneme of an utterance or the last consonant of a sentence. Short words such as "yes," "no," and "and" disappear entirely. The error is invisible because there is no wrong word in the transcript; the correct word is simply absent.
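
The clipping mechanism is easy to reproduce with a toy energy-based VAD. This is a minimal sketch with invented frame energies and a fixed threshold; real VADs are usually trained models, so the threshold and padding values are assumptions chosen to make the effect visible.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def speech_frames(energies, threshold, pad=0):
    """Mark frames at or above `threshold` as speech, then extend each
    detected frame by `pad` frames on both sides."""
    active = [e >= threshold for e in energies]
    padded = list(active)
    for i, is_speech in enumerate(active):
        if is_speech:
            for j in range(max(0, i - pad), min(len(active), i + pad + 1)):
                padded[j] = True
    return padded

# Hypothetical utterance: soft onset (frame 1), loud vowels (2-3), soft final consonant (4).
energies = [0.0, 0.02, 0.30, 0.35, 0.03, 0.0]

print(speech_frames(energies, threshold=0.1))         # onset and final consonant cut
print(speech_frames(energies, threshold=0.1, pad=1))  # padding keeps both edges
```

Padding the detected regions is the usual remedy, at the cost of feeding the model a little more silence.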

Chunk Boundary Artifacts in Whisper

Whisper processes long audio in 30-second windows, shifting forward based on predicted timestamps. Words that straddle a chunk boundary are processed twice, once at the end of one window and once at the beginning of the next. The result is word duplication, word omission, or a splice that merges two distinct words. This is a documented behavior in Whisper's long-form transcription pipeline, not an edge case.

Tools like WhisperX add a dedicated alignment step that re-anchors word timestamps against the audio signal after decoding, which reduces, but does not eliminate, these boundary artifacts.

When reviewing a transcript, the segment-boundary positions, roughly every 30 seconds, are the highest-probability locations for this class of error.
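
Part of that review can be automated. The sketch below scans adjacent segments for words repeated across the seam, one signature of the duplication described above. It assumes segments arrive as dictionaries with `start`, `end`, and `text` keys, which is a common shape but an assumption here, and it will also flag legitimate repetitions, so treat hits as places to listen, not as confirmed errors.

```python
def _words(text):
    """Lowercase words with surrounding punctuation stripped."""
    return [w.strip(".,;:!?\"'") for w in text.lower().split()]

def seam_duplicates(segments, max_words=3):
    """Return (seam_time, repeated_words) wherever the tail of one segment
    is repeated at the head of the next."""
    findings = []
    for prev, nxt in zip(segments, segments[1:]):
        tail = _words(prev["text"])
        head = _words(nxt["text"])
        # Try the longest overlap first.
        for n in range(min(max_words, len(tail), len(head)), 0, -1):
            if tail[-n:] == head[:n]:
                findings.append((prev["end"], " ".join(head[:n])))
                break
    return findings

# Hypothetical transcript segments.
segments = [
    {"start": 0.0, "end": 29.6, "text": "we expect revenue to grow next quarter"},
    {"start": 29.6, "end": 58.9, "text": "next quarter. The main risk is churn"},
    {"start": 58.9, "end": 87.2, "text": "and that is all for today"},
]

print(seam_duplicates(segments))
```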

Hallucination: Words Nobody Said

The model produced text with no corresponding audio. This is qualitatively different from the other categories because there is no real input that was misinterpreted; the model invented output from nothing.

Long Silences and Non-Speech Audio

Whisper is extensively documented to generate text during silences, background music, or non-speech noise. The specific hallucinations are not random: Whisper was trained on approximately 680,000 hours of web audio, widely believed to include large quantities of captioned YouTube content, and it appears to have learned to associate silence near the end of a recording with video outro phrases. Outputs like "Thank you for watching" or "Subscribe to my channel" appearing in a transcript of a legal deposition are Whisper inserting text it learned from outro scripts, not text anyone actually spoke.

Repetition Loops

A separate and recognized failure mode is the repetition loop: the decoder gets stuck generating the same phrase repeatedly until something in the audio provides a new anchor. Research has documented this behavior in 14 of 19 languages audited. The pattern is visually obvious, a phrase repeating five or ten times in sequence, and should be treated as a signal to truncate that section.
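
Because the pattern is so regular, it can be caught with a simple scan. This is a minimal sketch that looks for a short phrase repeated back to back; the phrase length limit and the minimum repeat count are assumed defaults you would tune for your own transcripts.

```python
def find_repetition_loops(text, max_phrase_words=6, min_repeats=4):
    """Return (start_word_index, phrase, repeat_count) for each run of a
    phrase repeated back to back at least `min_repeats` times."""
    words = text.lower().split()
    loops = []
    i = 0
    while i < len(words):
        best = None  # (phrase_length, repeat_count) covering the most words
        for n in range(1, max_phrase_words + 1):
            phrase = words[i:i + n]
            if len(phrase) < n:
                break
            count = 1
            while words[i + count * n:i + (count + 1) * n] == phrase:
                count += 1
            if count >= min_repeats and (best is None or n * count > best[0] * best[1]):
                best = (n, count)
        if best:
            n, count = best
            loops.append((i, " ".join(words[i:i + n]), count))
            i += n * count  # skip past the whole loop
        else:
            i += 1
    return loops

sample = "so the plan is " + "thank you very much " * 5 + "moving on"
print(find_repetition_loops(sample))
```

The returned word index tells you where to start truncating; what replaces the loop still has to come from listening to the audio.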

Training Data Leakage

Beyond YouTube phrases, Whisper has been shown to produce phrases that appear to come from its training data rather than the input audio. These are difficult to detect automatically because they are grammatically plausible. Manual spot-checks of transcripts from silent or very quiet audio regions are the most reliable way to catch them.

One mitigation is to run VAD before the model, so that the model never sees the silent regions that trigger hallucination. Most well-configured transcription pipelines do this by default.
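
The same VAD output is also useful after the fact, as a cross-check on a transcript you already have. The sketch below is that post-hoc variant, not the pre-filtering step itself: it flags transcript segments that barely overlap any detected speech. The segment shape, the non-overlapping speech regions, and the 20% threshold are all assumptions for illustration.

```python
def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def flag_unsupported_segments(segments, speech_regions, min_speech_ratio=0.2):
    """Return segments where less than `min_speech_ratio` of the duration
    falls inside a VAD speech region. Regions must not overlap each other."""
    flagged = []
    for seg in segments:
        duration = seg["end"] - seg["start"]
        if duration <= 0:
            continue
        covered = sum(
            overlap_seconds(seg["start"], seg["end"], start, end)
            for start, end in speech_regions
        )
        if covered / duration < min_speech_ratio:
            flagged.append(seg)
    return flagged

# Hypothetical inputs: speech ends at 42 s, but text appears at 55 s.
speech_regions = [(0.0, 42.0)]
transcript = [
    {"start": 1.0, "end": 41.5, "text": "Please state your name for the record."},
    {"start": 55.0, "end": 58.0, "text": "Thank you for watching."},
]

for seg in flag_unsupported_segments(transcript, speech_regions):
    print(seg["start"], seg["text"])
```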

How the Categories Combine

A real recording almost never fails in just one category. A podcast with two hosts who occasionally talk over each other, recorded with laptop mics in a room with HVAC noise, will produce acoustic errors, diarization failures during crosstalk, OOV errors when guests mention obscure names, and hallucination during the pre-episode silence. The error profile is additive.

The most efficient diagnostic is to take ten errors from your transcript, classify each one by category, and see which category dominates. That tells you where to invest the fix effort.
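
The tally itself takes a few lines. This sketch assumes you have labeled each sampled error by hand with one of the five categories from this post; the category strings and function name are illustrative.

```python
from collections import Counter

CATEGORIES = ("acoustic", "linguistic", "speaker", "boundary", "hallucination")

def dominant_category(labels):
    """Return (category, share, counts) for the most frequent error category."""
    if not labels:
        raise ValueError("need at least one labeled error")
    unknown = set(labels) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    counts = Counter(labels)
    category, count = counts.most_common(1)[0]
    return category, count / len(labels), counts

# Ten hand-labeled errors from a hypothetical transcript.
labels = ["linguistic"] * 5 + ["acoustic"] * 3 + ["speaker", "hallucination"]
category, share, counts = dominant_category(labels)
print(category, share)
```

Ten errors is a small sample, so a narrow lead between two categories is a reason to label more errors before committing to a fix.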

Audio condition	Typical accuracy range
Studio recording, single speaker, common vocabulary	95 to 99 percent
Home office, single speaker, decent microphone	90 to 96 percent
Video conference, professional setup, multi-speaker	85 to 92 percent
Phone or VoIP call, narrowband audio	80 to 88 percent
Field recording, multi-speaker, ambient noise	70 to 85 percent
Heavy accent on degraded audio	Below 70 percent in worst cases

Accuracy ranges sourced from AssemblyAI's 2026 benchmark and Voicegain's 2025 call-center study. Individual results vary by engine and audio.

Where to Go for Fixes

This post explains the mechanisms. For the practical mitigations:

Audio quality and microphone choices: microphone tips for clear transcription and improve audio quality before transcription
Noise reduction techniques: dealing with background noise in transcription
Getting more accurate output from noisy files: transcribe with poor audio quality
Why accuracy numbers vary between services: transcription accuracy explained

If you want to run your own audio through a test before investing time in fixes, ConvertAudioToText's free tier lets you upload up to 10 minutes without an account. The result will show you which error category is dominant for your specific file.

Frequently Asked Questions

Why does my transcription say "thanks for watching" when nobody said that?

This is a documented Whisper hallucination tied to its training data. Whisper learned from roughly 680,000 hours of web audio, a large fraction of which is widely believed to have come from captioned YouTube videos. The model associates silence or non-speech noise with the typical endings of those videos and generates such phrases as output. It is not transcribing something faint in the background; it is pattern-matching on the absence of speech.

Why are proper names almost always wrong in my transcripts?

Proper nouns, surnames, brand names, and unusual first names are underrepresented in general training data. When the model encounters a name it has rarely or never seen, it substitutes the closest in-vocabulary word that fits the acoustic pattern. This is the out-of-vocabulary (OOV) problem, and it is one of the most consistent failure modes across all transcription engines. Vocabulary biasing or an initial prompt listing expected names is the direct fix.

Why does the same speaker sometimes get labeled as two different people?

Speaker diarization works by clustering voice embeddings. Short utterances, single words or backchannels, do not carry enough acoustic signal to anchor a speaker identity reliably. The model assigns them to whichever cluster is nearest at that moment, which may not be the correct speaker. This is especially common during fast exchanges where short responses alternate quickly.

Can AI transcription ever hallucinate on clearly spoken audio?

Yes, but it is rare. The more common scenario is hallucination during silence, music, or non-speech noise. On clearly spoken audio, the most likely errors are substitutions (wrong word chosen) rather than insertions (word invented). True hallucination on clean speech does happen, particularly with repetition loops where the decoder gets stuck, but it is a different mechanism from the silence-triggered kind.

Is there anything I can do about transcription errors caused by overlapping speakers?

The most effective solution is recording with separate microphones, one per speaker. Multichannel audio eliminates the waveform blending that makes crosstalk impossible to diarize correctly. If that is not practical, most modern diarization systems have an overlap detection mode that at least flags the ambiguous segments instead of making a silent wrong attribution. Post-edit those flagged segments manually.

Sources

AssemblyAI, "How accurate is speech-to-text in 2026?" https://www.assemblyai.com/blog/how-accurate-speech-to-text
Voicegain, "2025 Speech-to-Text Accuracy Benchmark for 8 kHz Call Center Audio Files" https://www.voicegain.ai/post/2025-speech-to-text-accuracy-benchmark-for-8-khz-call-center-audio-files
TechCrunch, "OpenAI's Whisper transcription tool has hallucination issues, researchers say" https://techcrunch.com/2024/10/26/openais-whisper-transcription-tool-has-hallucination-issues-researchers-say/
Gladia, "AI Model Biases: What went wrong with Whisper by OpenAI?" https://www.gladia.io/blog/ai-model-biases-what-went-wrong-with-whisper-by-openai
AssemblyAI, "What is speaker diarization and how does it work?" https://www.assemblyai.com/blog/what-is-speaker-diarization-and-how-does-it-work
arXiv, "Investigation of Whisper ASR Hallucinations Induced by Non-Speech Audio" https://arxiv.org/html/2501.11378v1
Kerson AI Solutions, "Accent Bias in Speech Recognition: Challenges, Impacts, and Solutions" https://kerson.ai/research/accent-bias-in-speech-recognition-challenges-impacts-and-solutions/
