Why Transcription Makes Mistakes: The Full Taxonomy

ConvertAudioToText Team | May 26, 2026 | 8 min read

Even the best modern transcription engines produce wrong words. The question is not whether transcripts have errors but where the errors come from and what you can do about them. This post walks through every major category of transcription error, why it happens, and the practical mitigations that work.

The Five Big Categories

Transcription errors fall into five rough categories.

  1. Acoustic errors: the model misheard what was said.
  2. Linguistic errors: the model picked the wrong word from acoustically similar options.
  3. Speaker errors: words attributed to the wrong speaker.
  4. Boundary errors: words spliced incorrectly, missed at boundaries, or duplicated.
  5. Hallucination: words inserted that nobody said.

Each category has different causes and different fixes.

Acoustic Errors

The model misheard the audio. The actual sound was not clearly distinguishable from another sound, and the model picked the wrong interpretation.

Common Causes

Low signal-to-noise ratio. Background noise that masks the speech signal. Our deeper post on background noise transcription covers this in detail.

Microphone artifacts. Wind noise, plosives, distortion, clipping, dropouts. Each one removes information the model needs to disambiguate phonemes.

Bandwidth limitations. Phone audio sampled at 8 kHz cannot represent frequencies above 4 kHz, so it loses high-frequency content that distinguishes certain phonemes. Recordings made through narrow-bandwidth codecs (older VoIP, some cellular calls) routinely produce more errors than wideband audio.

Distance to microphone. Audio recorded from across a room has more reverberation and a worse signal-to-noise ratio than audio recorded six to twelve inches from the mouth.
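To put a number on "low signal-to-noise ratio," here is a minimal sketch that estimates SNR in decibels from two excerpts of the same recording: one where someone is talking and one that contains only the background. The function names and the toy sample values are ours, for illustration only, and the speech excerpt in a real recording also contains the noise, so treat the result as a rough estimate.

```python
import math

def rms(samples):
    """Root-mean-square level of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def snr_db(speech, noise):
    """Rough SNR in decibels from a speech excerpt and a noise-only excerpt."""
    return 20 * math.log10(rms(speech) / rms(noise))

# Toy samples: speech at 0.5 amplitude, background at 0.05 amplitude.
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.05, -0.05, 0.05, -0.05]
print(round(snr_db(speech, noise), 1))  # 20.0
```

A tenfold amplitude gap works out to 20 dB. Moving the microphone closer raises the speech level without raising the room noise, which is why distance matters so much.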

What to Do

Improve the recording. The single biggest accuracy upgrade is recording cleaner audio. A directional microphone, six to twelve inches from the speaker, in a quiet room produces dramatically better results than a built-in laptop mic from across a desk.

Run noise reduction before transcription. Tools like iZotope RX or Adobe Audition's noise reduction can recover meaningful audio from moderately noisy sources.

Accept that some audio is unrecoverable. A phone recording from ten years ago with bad codecs and background noise is not going to produce a clean transcript. For these cases, human transcription is often the only path.

Linguistic Errors

The model heard the audio correctly but chose the wrong word from the available options. The choice was constrained by the model's understanding of language, and it picked a plausible but incorrect option.

Common Patterns

Homophones. "Their" versus "there" versus "they're." "Affect" versus "effect." The model has to use context to disambiguate, and sometimes it gets it wrong, especially in short sentences with little surrounding context.

Proper nouns. Names of people, companies, places, products, and brands are underrepresented in general training data. A novel name often gets transcribed as the closest common-word equivalent. "Antoinette" may come back correctly, or as "Annet" or "Annette," depending on the model's training set.

Domain-specific vocabulary. Medical terms, legal jargon, technical acronyms, brand names in specialized industries. The model defaults to the more common word that sounds similar.

Numbers. "Three-fifty" versus "three hundred fifty" versus "350" versus "3:50." All four can be correct depending on context; the model picks one and is sometimes wrong.

Acronyms and initialisms. Letter strings that sound like words ("IT" versus "it," "US" versus "us"). The model has to detect the acronym intent from context.

What to Do

Use keyterm biasing or initial prompts. Both Whisper and Deepgram support biasing the model toward expected vocabulary. Provide a list of proper nouns, brand names, and domain terms before transcription.
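As a rough sketch of the biasing step, the helper below turns a term list into a short prompt string. The function name, the "Glossary:" framing, and the 600-character budget are our assumptions, not anything either vendor prescribes. Whisper only considers a limited amount of prompt text, so a short list of the terms you most expect is safer than a long one.

```python
def build_vocab_prompt(terms, max_chars=600):
    """Join unique terms into a glossary-style prompt, stopping at max_chars."""
    prefix = "Glossary: "
    seen = set()
    kept = []
    length = len(prefix)
    for term in terms:
        clean = term.strip()
        key = clean.lower()
        if not key or key in seen:
            continue  # skip blanks and case-insensitive duplicates
        extra = len(clean) + 2  # the term plus a ", " separator
        if length + extra > max_chars:
            break
        seen.add(key)
        kept.append(clean)
        length += extra
    return prefix + ", ".join(kept) + "."

prompt = build_vocab_prompt(["Deepgram", "WhisperX", "deepgram", "diarization"])
print(prompt)  # Glossary: Deepgram, WhisperX, diarization.

# With the open-source openai-whisper package, the string would be passed as:
#   model.transcribe("meeting.wav", initial_prompt=prompt)
```

Deepgram takes its vocabulary hints as request parameters rather than a free-text prompt, so check its documentation for the exact option on the model you use.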

Run a manual review pass on proper nouns. Even after biasing, names sometimes come back wrong. A find-and-replace pass through the transcript catches consistent misspellings quickly.
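The find-and-replace pass can be scripted once you know which misspellings recur. This is a minimal sketch: the function name and the example mapping are illustrative, and it matches whole words only, so fixing "Annet" does not touch "Annette."

```python
import re

def apply_corrections(text, corrections):
    """Replace whole-word misspellings using a {wrong: right} mapping."""
    for wrong, right in corrections.items():
        pattern = r"\b" + re.escape(wrong) + r"\b"
        # A function replacement avoids backslash surprises in `right`.
        text = re.sub(pattern, lambda _m, r=right: r, text, flags=re.IGNORECASE)
    return text

fixed = apply_corrections(
    "Annet joined the call. Later annet shared the deck.",
    {"Annet": "Antoinette"},
)
print(fixed)  # Antoinette joined the call. Later Antoinette shared the deck.
```

Keep the mapping per project rather than global. A correction that is right for one client's names can be wrong for another's.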

Use domain-specific tools when available. Specialized transcription services exist for medical, legal, and other vocabulary-heavy domains. Our broader post on AI medical scribes covers the medical case.

Speaker Errors

The transcription is correct but the speaker labels are wrong. Speaker A's words are attributed to Speaker B, or two speakers are merged into one, or one speaker is split into two.

Common Causes

Similar voices. Two speakers with similar pitch, accent, and speaking style end up close in the diarization model's embedding space. The clustering algorithm misassigns segments.

Short turns. A speaker who says only "yeah" or "right" before passing the conversation back gets very little embedding signal. These short turns are frequently misattributed.

Overlapping speech. When two speakers talk at once, the model has to pick one. Speaker labels often flip incorrectly during overlap. Our deeper post on handling overlapping speech covers this in detail.

Single-microphone recordings. When all speakers share one microphone, the diarization loses the channel-level separation signal that would otherwise help.

What to Do

Record with separate microphones when you can. Multi-track recordings with one mic per speaker make diarization essentially trivial.

Accept that diarization on single-mic multi-speaker recordings will need cleanup. Build a manual review step into your workflow.
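One way to make that review step cheaper is to queue only the turns most likely to be mislabeled. The sketch below flags very short turns, which get the weakest speaker signal. The dictionary layout and both thresholds are assumptions for illustration; adapt them to whatever your diarization tool returns.

```python
def flag_short_turns(turns, min_words=3, min_seconds=1.0):
    """Return turns that are too short to trust the speaker label."""
    flagged = []
    for turn in turns:
        duration = turn["end"] - turn["start"]
        word_count = len(turn["text"].split())
        if word_count < min_words or duration < min_seconds:
            flagged.append(turn)
    return flagged

turns = [
    {"speaker": "A", "start": 0.0, "end": 6.2, "text": "So the launch moved to next quarter."},
    {"speaker": "B", "start": 6.2, "end": 6.6, "text": "Yeah."},
    {"speaker": "A", "start": 6.6, "end": 11.0, "text": "Which means the budget review slips too."},
]
for turn in flag_short_turns(turns):
    print(turn["speaker"], turn["start"], turn["text"])  # B 6.2 Yeah.
```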

Use specialized tools for diarization-heavy use cases. Our broader post on handling multiple speakers in AI covers the dedicated approaches.

Boundary Errors

Errors at the boundaries of utterances: the first or last word of a phrase is dropped, or two words are spliced together, or a word is duplicated across a segment boundary.

Common Causes

Voice activity detection edges. Aggressive VAD that clips the start or end of words. Our post on how VAD works covers the dedicated stage.

Chunk boundaries in models with fixed windows. Whisper processes audio in 30-second chunks, and words that straddle the boundary can be partially missed or duplicated.

Compound words and contractions. "Going to" versus "gonna," "want to" versus "wanna." The model sometimes splits or merges these incorrectly.

What to Do

Trust the timestamps with skepticism near segment boundaries. If a word lands very close to a chunk boundary, it may be partially miscaptured.
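If your engine returns word-level timestamps, you can flag the suspicious words automatically. This sketch assumes chunk seams fall on a fixed 30-second grid, which is a simplification: some Whisper implementations shift each window based on the previous segment's timestamps. The field names and the half-second tolerance are also assumptions.

```python
def near_chunk_boundary(words, chunk_seconds=30.0, tolerance=0.5):
    """Return words whose start or end lies within `tolerance` of a chunk seam."""
    flagged = []
    for word in words:
        for t in (word["start"], word["end"]):
            k = round(t / chunk_seconds)  # index of the nearest seam
            # k == 0 is the start of the file, which is not a seam.
            if k >= 1 and abs(t - k * chunk_seconds) <= tolerance:
                flagged.append(word)
                break
    return flagged

words = [
    {"text": "hello", "start": 0.2, "end": 0.6},
    {"text": "budget", "start": 29.7, "end": 30.1},
    {"text": "review", "start": 41.0, "end": 41.4},
]
print([w["text"] for w in near_chunk_boundary(words)])  # ['budget']
```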

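For long files, prefer tools that handle chunk boundaries gracefully. Many modern Whisper wrappers (faster-whisper, WhisperX) can segment audio at pauses found by voice activity detection rather than at fixed offsets, which keeps boundaries away from the middle of words.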
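If you are stitching chunk outputs together yourself, duplicated words at the seam are the easiest boundary error to repair. Here is a minimal sketch that drops a repeated run of up to three words when joining segments. The function names and the three-word limit are our own choices, and the heuristic will also remove a genuine repetition that happens to land on a seam, so use it with a review pass.

```python
import string

def _norm(word):
    """Lowercase a word and strip surrounding punctuation for comparison."""
    return word.strip(string.punctuation).lower()

def join_segments(segments, max_overlap=3):
    """Join text segments, dropping words repeated across each seam."""
    merged = []
    for seg in segments:
        words = seg.split()
        drop = 0
        # Try the longest candidate overlap first.
        for k in range(min(max_overlap, len(words), len(merged)), 0, -1):
            tail = [_norm(w) for w in merged[-k:]]
            head = [_norm(w) for w in words[:k]]
            if tail == head:
                drop = k
                break
        merged.extend(words[drop:])
    return " ".join(merged)

print(join_segments(["we should ship on Friday", "on Friday if testing passes"]))
# we should ship on Friday if testing passes
```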

Run a quick listen-through of the start and end of each minute. These are the highest-likelihood spots for boundary errors.

Hallucination

The model produced text that nobody actually said. The audio at that timestamp was silent, music, or non-speech noise.

Common Causes

Long silences. The model produces filler text during stretches of silence, especially in audio with conversational pauses.

Background music. Songs with lyrics sometimes get transcribed; instrumental music sometimes produces phantom text.

Repetition loops. The model occasionally gets stuck in a loop, producing the same phrase repeatedly until something interrupts.

Training data leakage. Whisper has documented cases of producing phrases like "thanks for watching" or "subscribe to the channel" because its training data included many YouTube videos with those phrases.

What to Do

Use voice activity detection. A well-configured VAD prevents the model from ever seeing the silent or non-speech regions. Most modern transcription pipelines include VAD; the English transcription pipeline is configured this way.
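To show the idea, here is a deliberately simple energy-based detector: it splits the audio into short frames and keeps the spans whose level clears a threshold. This is a stand-in, not what production pipelines run. Real VADs are usually trained models, and the frame size, threshold, and function name below are assumptions.

```python
import math

def speech_regions(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return (start_s, end_s) spans whose frame RMS is at or above threshold."""
    frame_len = int(sample_rate * frame_ms / 1000)
    regions = []
    start = None
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        level = math.sqrt(sum(s * s for s in frame) / len(frame))
        t = i / sample_rate
        if level >= threshold and start is None:
            start = t  # speech begins
        elif level < threshold and start is not None:
            regions.append((start, t))  # speech ends
            start = None
    if start is not None:
        regions.append((start, len(samples) / sample_rate))
    return regions

# Toy signal at 1 kHz: 0.3 s silence, 0.3 s of "speech", 0.3 s silence.
rate = 1000
audio = [0.0] * 300 + [0.5, -0.5] * 150 + [0.0] * 300
print(speech_regions(audio, rate))  # [(0.3, 0.6)]
```

Only the returned spans would be sent to the transcription model, so it never gets the chance to invent text for the silent stretches.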

Watch for obvious patterns. If your transcript ends with "thanks for watching" but the audio is a podcast, that is a hallucination. Find and remove.
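A small script can do the first pass of that check. This sketch removes trailing segments that contain a known stock phrase. The phrase list holds only the two examples mentioned above, and the function name is ours. It deletes the whole segment, so log what it removes in case a speaker really did say it.

```python
KNOWN_PHRASES = ("thanks for watching", "subscribe to the channel")

def strip_trailing_hallucinations(segments, phrases=KNOWN_PHRASES):
    """Drop segments at the end of a transcript that contain a stock phrase."""
    cleaned = list(segments)
    while cleaned and any(p in cleaned[-1].lower() for p in phrases):
        cleaned.pop()
    return cleaned

segments = [
    "Welcome back to the show.",
    "That wraps up this episode.",
    "Thanks for watching!",
]
print(strip_trailing_hallucinations(segments))
# ['Welcome back to the show.', 'That wraps up this episode.']
```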

Catch repetition loops. If a phrase repeats more than twice in a row, that is almost certainly a loop. Truncate.
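That rule translates directly into code. In this minimal sketch, any segment that repeats more than twice in a row is collapsed to a single copy, and shorter runs are left alone because people do repeat themselves. The function name and the segment-level comparison are assumptions; a loop inside one long segment would need phrase-level matching instead.

```python
from itertools import groupby

def collapse_loops(segments, max_repeats=2):
    """Collapse runs of identical consecutive segments longer than max_repeats."""
    out = []
    for _, group in groupby(segments, key=lambda s: s.strip().lower()):
        run = list(group)
        out.extend(run if len(run) <= max_repeats else run[:1])
    return out

print(collapse_loops(
    ["Okay.", "I think so.", "I think so.", "I think so.", "I think so.", "Next item."]
))
# ['Okay.', 'I think so.', 'Next item.']
```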

Combining Categories

Real-world errors often span multiple categories. A noisy recording of two speakers with overlapping speech and domain-specific vocabulary produces errors of all five types simultaneously. The mitigations stack: better audio, biasing, multi-track recording, VAD, and manual review.

The honest expectation for modern transcription:

Audio condition                                        Expected accuracy
Clean studio, single speaker, common vocabulary        96 to 98 percent
Decent home office, single speaker, common vocabulary  92 to 96 percent
Multi-speaker meeting, professional mics               88 to 94 percent
Phone or conference call audio                         82 to 90 percent
Field recording or messy multi-speaker                 70 to 85 percent
Heavy accents on degraded audio                        Below 70 percent in worst cases

Our broader breakdown of accuracy across conditions is in the transcription pricing comparison and transcription for accented English posts.
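To measure where your own audio lands, you need a hand-corrected reference for a sample of it. The sketch below assumes accuracy means one minus word error rate (WER), a common convention, though this post does not define the metric behind the figures above. WER counts the substitutions, deletions, and insertions needed to turn the machine transcript into the reference, divided by the reference length.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

ref = "please send their report to the legal team today"
hyp = "please send there report to legal team today"
wer = word_error_rate(ref, hyp)
print(f"accuracy: {100 * (1 - wer):.1f}%")  # accuracy: 77.8%
```

Two errors in nine words looks harsh, but short samples exaggerate the rate. Score at least a few minutes of audio before comparing against a range.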

Using Confidence to Triage Errors

The most efficient approach to catching errors is to focus on low-confidence words rather than reading the full transcript. Our post on transcription confidence scores covers how to use these signals in your review workflow.
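In practice the triage can be as simple as the sketch below, which pulls out words under a confidence threshold and lists the least certain first. The "confidence" field name and the 0.6 cutoff are assumptions. Engines expose and scale this value differently, so tune the cutoff against a transcript you have already corrected.

```python
def low_confidence_words(words, threshold=0.6):
    """Return words below the threshold, least confident first."""
    flagged = [w for w in words if w["confidence"] < threshold]
    return sorted(flagged, key=lambda w: w["confidence"])

words = [
    {"text": "send", "start": 1.0, "confidence": 0.98},
    {"text": "Annet", "start": 1.4, "confidence": 0.41},
    {"text": "the", "start": 1.9, "confidence": 0.95},
    {"text": "affect", "start": 2.1, "confidence": 0.55},
]
for w in low_confidence_words(words):
    print(w["start"], w["text"], w["confidence"])
```

The timestamps let a reviewer jump straight to each flagged word in the audio instead of reading the whole transcript.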

Where to Start

If your transcript has obvious errors, classify them by category. If most of the errors are acoustic, work on recording quality. If most are linguistic, work on biasing and vocabulary lists. If most are speaker errors, work on multi-track recording. The mitigations are different for each category. Running a baseline through the 60-minute ConvertAudioToText (CATT) free tier on your typical audio is the fastest way to see which error category dominates for your specific use case.
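If you tag each error by hand while reviewing a sample, a tally makes the dominant category obvious. This is a small sketch with hypothetical names. The category labels and suggested fixes mirror the sections of this post, and the example counts are made up.

```python
from collections import Counter

FIXES = {
    "acoustic": "improve recording quality",
    "linguistic": "add biasing and vocabulary lists",
    "speaker": "move to multi-track recording",
    "boundary": "use tools that handle chunk boundaries",
    "hallucination": "add voice activity detection",
}

def dominant_category(labels):
    """Return (category, share of all errors, suggested fix) for the top category."""
    category, count = Counter(labels).most_common(1)[0]
    return category, count / len(labels), FIXES[category]

# Made-up tally from reviewing one transcript.
labels = ["linguistic"] * 6 + ["acoustic"] * 3 + ["speaker"]
category, share, fix = dominant_category(labels)
print(f"{category}: {share:.0%} of errors, so {fix}")
# linguistic: 60% of errors, so add biasing and vocabulary lists
```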
