
Fixing Code-Switching Errors in Transcription

Mamane B. Moussa | May 26, 2026 | Updated July 2, 2026 | 12 min read


Why Mixed-Language Audio Breaks

The fix for most code-switching failures is picking an engine with a native multilingual model and telling it not to anchor to a single language. If you are using AssemblyAI, pass speech_model: "universal" and set your two language codes. If you are using Deepgram, set language=multi with Nova-3 or Flux. If you are using Whisper, leave language=None so the model detects the language instead of being forced into one. The rest of this post explains why those choices matter and what to do when they still fail.

Most transcription engines are built around a single-language assumption. They receive audio, pick a primary language, and map every phoneme against that language's vocabulary. When a speaker switches languages mid-sentence, the model faces three bad options: render the secondary language as phonetic approximations of the primary one ("para crear" becomes "para crayer"), lock onto the new language and stay there even after the speaker switches back, or drop the mismatched segments entirely. Each produces a broken transcript.

The deeper problem is at language boundaries specifically. Research on multilingual speech models shows that switch-point WER (the error rate measured in the 2-3 word window around each language transition) can run 30-50 points higher than the overall blended WER. A model that scores 10% WER overall can still mangle every code-switch. Blended accuracy numbers hide this.
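To see this gap in your own data, you can score switch points separately from the rest of the transcript. The sketch below is one possible way to do that, not a standard tool. It assumes you have a reference transcript with a language tag per word, charges each edit to a reference word through a Levenshtein alignment, and reports WER inside a window around each language change. The function names and the window convention are illustrative assumptions.

```python
from typing import List, Set, Tuple


def align_errors(ref: List[str], hyp: List[str]) -> Tuple[List[int], int]:
    """Levenshtein-align hyp to ref; return errors charged per ref word and total edits."""
    n, m = len(ref), len(hyp)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)

    # Walk the table backwards and charge every edit to a reference word.
    errors = [0] * n
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            errors[i - 1] += int(ref[i - 1] != hyp[j - 1])  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            errors[i - 1] += 1  # deletion
            i -= 1
        else:
            if n:
                errors[max(i - 1, 0)] += 1  # insertion, charged to the preceding ref word
            j -= 1
    return errors, dp[n][m]


def switch_window(langs: List[str], window: int = 2) -> Set[int]:
    """Indices of reference words within `window` words on either side of a language change."""
    idx: Set[int] = set()
    for i in range(1, len(langs)):
        if langs[i] != langs[i - 1]:
            idx.update(range(max(0, i - window), min(len(langs), i + window)))
    return idx


def wer_report(ref: List[str], langs: List[str], hyp: List[str], window: int = 2) -> Tuple[float, float]:
    """Return (overall WER, WER restricted to the switch-point window)."""
    errors, total = align_errors(ref, hyp)
    overall = total / max(len(ref), 1)
    idx = switch_window(langs, window)
    at_switch = sum(errors[k] for k in idx) / len(idx) if idx else 0.0
    return overall, at_switch


ref = "I was finalizando the deal today".split()
langs = ["en", "en", "es", "en", "en", "en"]
hyp = "I was finally sando the deal today".split()
overall, at_switch = wer_report(ref, langs, hyp, window=1)
print(f"overall WER {overall:.2f}, switch-point WER {at_switch:.2f}")
# prints: overall WER 0.33, switch-point WER 0.67
```

Even in this tiny example, all of the errors sit inside the switch window, which is why the blended number understates the problem.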

The Three Failure Modes, Named

Understanding which failure mode you are hitting tells you which fix to apply.

Phonetic substitution: the engine stays in the primary language and renders secondary-language words as near-homophones. This is the most common failure and the easiest to spot. "Finalizando" becomes "finally sando." Fix: switch to a multilingual engine.

Language lock: the engine detects the switch and commits to the new language, then fails to switch back. The rest of the transcript is in the wrong language. Fix: use per-segment detection rather than per-job detection.

Boundary hallucination: the model loses context at a transition point and generates plausible-sounding text that was never spoken. This is the hardest failure to catch because the output looks clean but is wrong. It is especially common with Whisper on short audio segments where silence or noise falls at a language boundary. Fix: use VAD (voice activity detection) preprocessing to remove silent segments before they reach the model, and prefer purpose-built code-switch engines over general multilingual models for switch-heavy content.
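As a rough illustration of the VAD step, the sketch below drops low-energy frames before the audio reaches the model. It is a naive energy gate, not a real voice activity detector: production pipelines normally use a trained VAD model, and the frame length and RMS threshold here are assumed values you would need to tune for your recordings.

```python
from typing import List, Sequence


def drop_silent_frames(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 30,            # assumed frame size
    rms_threshold: float = 0.01,   # assumed threshold for samples scaled to [-1, 1]
) -> List[float]:
    """Keep only frames whose RMS energy reaches the threshold."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    kept: List[float] = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        rms = (sum(x * x for x in frame) / len(frame)) ** 0.5
        if rms >= rms_threshold:
            kept.extend(frame)
    return kept


# Two bursts of "speech" separated by silence, at a toy 1 kHz sample rate.
audio = [0.5] * 60 + [0.0] * 90 + [-0.5] * 60
trimmed = drop_silent_frames(audio, sample_rate=1000)
print(len(audio), "->", len(trimmed))
```

Removing frames shifts every timestamp that follows, so keep a mapping back to the original timeline if you need aligned timestamps in the final transcript.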

Fix 1: Use a Purpose-Built Code-Switch Engine

The clearest advance in this space since 2024 is that several major APIs now offer dedicated code-switch modes, not just "multilingual support."

Deepgram Nova-3 with language=multi supports code switching across 10 languages: English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch. Set language=multi in the query string when calling /listen. For streaming, Deepgram recommends endpointing=100 (100ms endpoint detection) specifically for code-switched audio. Nova-3 Multilingual delivered a roughly 34% relative reduction in batch WER in its March 2026 update, with the largest gains at code-switch boundaries.
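A minimal sketch of how those query parameters fit together, using only the standard library to build the request URL. The language=multi and endpointing=100 values come from the paragraph above. The host, the /v1/listen path, and the model value nova-3 are assumptions drawn from Deepgram's public API and should be checked against the current documentation. The helper name is hypothetical.

```python
from urllib.parse import urlencode


def deepgram_listen_url(streaming: bool = False) -> str:
    """Build a /listen URL with code-switching enabled (hypothetical helper)."""
    params = {"model": "nova-3", "language": "multi"}
    if streaming:
        # 100 ms endpointing, the streaming setting recommended for code-switched audio
        params["endpointing"] = "100"
    base = "wss://api.deepgram.com" if streaming else "https://api.deepgram.com"
    return f"{base}/v1/listen?{urlencode(params)}"


print(deepgram_listen_url())
print(deepgram_listen_url(streaming=True))
```

The actual request still needs an authorization header and the audio payload, both of which are left out here.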

Deepgram Flux Multilingual (flux-general-multi), released in general availability in April 2026, supports the same 10 languages with native code-switching built into the model architecture rather than as a detection layer on top of a monolingual model. You can pass optional language_hint parameters to bias detection when you know which languages are present.

AssemblyAI Universal-3 Pro handles code switching for pre-recorded audio across English, Spanish, Portuguese, French, German, and Italian. The constraint to know: you can specify a maximum of two language codes per transcription request, and one of them must be English. The non-English language should be the dominant one in the audio for best results.
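Because these constraints are easy to violate in a batch pipeline, it can help to check the language list before sending a request. The helper below is a hypothetical pre-flight check that encodes only the rules stated above (at most two codes, English required, six supported languages). It does not call AssemblyAI, and the two-letter codes are an assumption about how the languages are identified.

```python
from typing import Iterable, List

# Languages listed above for pre-recorded code switching (codes are assumed).
SUPPORTED = {"en", "es", "pt", "fr", "de", "it"}


def validate_code_switch_languages(codes: Iterable[str]) -> List[str]:
    """Return the de-duplicated, lowercased codes, or raise ValueError if a rule is broken."""
    unique = list(dict.fromkeys(c.lower() for c in codes))
    if not 1 <= len(unique) <= 2:
        raise ValueError("specify one or two language codes")
    unsupported = [c for c in unique if c not in SUPPORTED]
    if unsupported:
        raise ValueError(f"not in the code-switch set: {unsupported}")
    if "en" not in unique:
        raise ValueError("one of the language codes must be English")
    return unique


print(validate_code_switch_languages(["es", "EN"]))
```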

For real-time streaming specifically, AssemblyAI's Universal-Streaming model processes six languages in a single forward pass (English, Spanish, French, German, Italian, Portuguese) without requiring manual language specification.

For language pairs outside these sets, Whisper-based tools remain the broadest fallback. See Fix 2.

See also: Deepgram Nova-3 explained for a deeper look at the multilingual model architecture.

Fix 2: Use Whisper with Auto-Detection for Unsupported Pairs

For language pairs not covered by the purpose-built engines above (Wolof + French, Cantonese + English, Taglish, Singlish, Portunhol, and others), Whisper Large-v3 is the most practical option because its training data covered a broader range of language combinations.

The critical setting: set language=None so Whisper detects the language from the audio rather than being forced into a single language code.

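```python
import whisper

model = whisper.load_model("large-v3")
audio_file = "interview.mp3"  # hypothetical path to your recording
result = model.transcribe(
    audio_file,
    language=None,  # auto-detect instead of forcing one language
    task="transcribe",
)
```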

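Setting a specific language code forces the model to interpret all audio as that language. Setting it to None lets Whisper detect the language from the audio instead. One caveat: the reference openai-whisper package runs that detection once, on the first 30 seconds of the file, and reuses the result for the rest of the job. If the dominant language changes later in the recording, transcribe in chunks so detection runs again for each chunk.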
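If you want a language estimate for every 30-second window regardless of how your Whisper build handles auto-detection, one option is to window the audio yourself. The sketch below is a minimal, library-free version: detect_per_window and its detect_fn argument are hypothetical names, and the detector is injected so the windowing logic can be tested without loading a model.

```python
from typing import Callable, List, Sequence, Tuple


def detect_per_window(
    samples: Sequence[float],
    sample_rate: int,
    detect_fn: Callable[[Sequence[float]], str],
    window_s: float = 30.0,
) -> List[Tuple[float, str]]:
    """Return (window start in seconds, detected language) for each fixed-size window."""
    window_len = max(1, int(sample_rate * window_s))
    results: List[Tuple[float, str]] = []
    for start in range(0, len(samples), window_len):
        chunk = samples[start:start + window_len]
        results.append((start / sample_rate, detect_fn(chunk)))
    return results


# Toy detector standing in for a real model: positive signal means "es", otherwise "en".
def toy_detector(chunk: Sequence[float]) -> str:
    return "es" if sum(chunk) > 0 else "en"


toy_audio = [-1.0] * 4 + [1.0] * 4 + [-1.0] * 2
print(detect_per_window(toy_audio, sample_rate=2, detect_fn=toy_detector, window_s=2.0))
```

With openai-whisper, detect_fn could wrap whisper.pad_or_trim, whisper.log_mel_spectrogram, and model.detect_language on 16 kHz mono audio, then return the most probable language code. Treat that wiring as an assumption to verify against the version you have installed.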

The honest caveat: Whisper's code-switch accuracy degrades at language boundaries, and it is more prone to hallucination at those points than the purpose-built engines. For Hinglish specifically, no engine handles it reliably in 2026. Deepgram Nova-3 covers Hindi in its language=multi set (see Fix 1), and Whisper is the fallback for edge cases, but either way expect to do more manual correction than you would for Spanglish or French-English content, where training data is heavier.


Fix 3: Specify Languages Explicitly Where Supported

For engines that support a language list rather than a single code, explicitly naming the languages present in the audio helps the model focus its detection.

```python
# Google Cloud Speech-to-Text (Chirp 3, V2 API)
config = {
    "model": "chirp_3",
    "language_codes": ["es-US", "en-US"]  # up to 3 languages; fewer = more accurate
}
```

Google's documentation notes that specifying fewer languages increases detection accuracy. The V2 API also supports language_codes: ["auto"] on Chirp 3 for fully automatic detection, though explicit codes outperform auto-detection when you know the language pair.

For Google Cloud STT, the feature is available in the global region and the us and eu multi-regions only.

Code-Switching Patterns and Engine Choices

The right engine depends on which languages are switching. Here is a practical breakdown by common pair.

| Language Pair | Best Engine | Notes |
| --- | --- | --- |
| Spanglish (Spanish + English) | Deepgram Nova-3 `language=multi`, AssemblyAI U3-Pro, Whisper | Three solid options; all have heavy training data for both languages |
| Hinglish (Hindi + English) | Deepgram Nova-3 `language=multi`, Whisper (auto) | Hindi is in Deepgram's 10-language set; Whisper fallback for edge cases |
| French + Arabic | Whisper (auto) | Neither Deepgram multi nor AssemblyAI U3-Pro covers Arabic in code-switch mode |
| Taglish (Tagalog + English) | Whisper (auto), Google Cloud STT | No purpose-built code-switch mode for Tagalog |
| Wolof + French | Whisper (auto) | Whisper is the only realistic option; expect more correction |
| Mandarin/Cantonese + English | Whisper (auto) | Cantonese has less training data than Mandarin; expect higher error rate |
| Portunhol (Portuguese + Spanish) | Deepgram Nova-3 `language=multi`, Whisper | Both languages in Deepgram's set |
| German + English | Deepgram Nova-3, AssemblyAI U3-Pro, Flux | All three engines cover this pair |

For multilingual meeting recordings where you need speaker separation alongside language detection, see speaker diarization explained and multilingual meeting transcription.

AWS Transcribe: Multi-Language Identification vs. Code-Switching

AWS Transcribe added multi-language identification for both batch and streaming jobs. The feature detects dominant languages per segment and labels them in the transcript output. It can identify "bilingual speakers who alternate between languages, such as US English and Hindi-IN, by identifying and transcribing each language separately."

The important distinction: this is language identification and per-segment transcription, not mid-sentence code-switching. For structured alternation (speaker A answers in English, speaker B responds in Spanish), it works well. For intra-sentential mixing ("I was finalizando the deal"), it is less reliable because the switching happens within a single detection window.

Redaction and custom language models are not currently supported alongside multi-language identification.
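For a batch job, multi-language identification is switched on through the job parameters. The sketch below only builds the keyword arguments you would pass to boto3's start_transcription_job. The job name, S3 URI, and helper name are placeholders, and the parameter names should be confirmed against the current AWS API reference before use.

```python
from typing import Any, Dict, Sequence


def multi_language_job_params(
    job_name: str,
    media_uri: str,
    language_options: Sequence[str],
) -> Dict[str, Any]:
    """Keyword arguments for a batch job with multi-language identification enabled."""
    return {
        "TranscriptionJobName": job_name,
        "Media": {"MediaFileUri": media_uri},
        "IdentifyMultipleLanguages": True,
        # Narrowing the candidate languages when you know the pair.
        "LanguageOptions": list(language_options),
    }


params = multi_language_job_params(
    "bilingual-call-001",                 # placeholder job name
    "s3://example-bucket/call-001.wav",   # placeholder S3 URI
    ["en-US", "hi-IN"],
)
print(params)
# To submit: boto3.client("transcribe").start_transcription_job(**params)
```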

Fix 4: Split the Audio for Predictable Switches

For recordings with structured language alternation rather than sentence-level mixing, splitting by language segment and transcribing each part separately produces the cleanest output.

This is worth the extra steps when: the content has predictable language boundaries (an interview where questions are in English, answers in Mandarin), the stakes are high enough to justify the extra work, or the language pair is not supported by a purpose-built engine.

The workflow: identify language boundaries (manually or with a language detection tool), cut into segments, transcribe each segment with the matching language setting, merge in sequence. For regular content with a predictable structure, this can be scripted.
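The workflow above can be sketched as a small script. This is a minimal version under stated assumptions: the language boundaries are already known, the audio is a sequence of samples in memory, and transcribe_fn is a hypothetical callable that wraps whichever engine you use and accepts a chunk plus a language code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Segment:
    start_s: float
    end_s: float
    language: str  # language code passed through to the engine


def transcribe_by_segment(
    samples: Sequence[float],
    sample_rate: int,
    segments: Sequence[Segment],
    transcribe_fn: Callable[[Sequence[float], str], str],
) -> str:
    """Cut the audio at known boundaries, transcribe each part, and merge in time order."""
    parts: List[str] = []
    for seg in sorted(segments, key=lambda s: s.start_s):
        lo = int(seg.start_s * sample_rate)
        hi = int(seg.end_s * sample_rate)
        text = transcribe_fn(samples[lo:hi], seg.language).strip()
        if text:
            parts.append(f"[{seg.language}] {text}")
    return "\n".join(parts)


# Stand-in engine so the sketch runs without any transcription backend.
def fake_engine(chunk: Sequence[float], language: str) -> str:
    return f"{len(chunk)} samples"


print(transcribe_by_segment(
    list(range(10)), 1,
    [Segment(4, 10, "zh"), Segment(0, 4, "en")],
    fake_engine,
))
```

Labeling each merged part with its language makes it easier to route the transcript to the right reviewer afterwards.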

Manual Cleanup for Code-Switched Transcripts

Even with the best engine choices, code-switched transcripts benefit from a targeted review pass.

Check switch-point words first. The words immediately before and after each language boundary are where most errors concentrate. Scan those positions before reading the full transcript.

Fix phonetic substitutions. Where the model rendered one language as an approximate phonetic match in the other language, the error is usually recognizable if you know both languages. "Para crayer" for "para crear," "I told my mom acha" for "I told my mom accha."

Verify proper nouns. Names of people, places, and brands that cross language boundaries often get rendered inconsistently or phonetically. A bilingual native speaker review is the most efficient check for high-stakes content.

Do not edit for reading flow at the expense of accuracy. Code-switched speech can read awkwardly in transcript form, but the transcript's job is to capture what was said, not to read smoothly. Flag awkward passages for human review rather than silently smoothing them.

For accuracy benchmarks across engines, see transcription accuracy explained.

When the Problem Is Not Code-Switching

Some audio gets misdiagnosed as a code-switching problem when it is actually something else.

Accented speech in one language: a speaker with a strong regional accent may trigger the model's language detection. This is not code-switching and the fix is different: use a model with strong accent robustness rather than a multilingual code-switch mode.

Loanwords and borrowings: a speaker saying "I ordered sushi at the izakaya" has not code-switched. Single loanwords embedded in otherwise-monolingual sentences are handled well by any modern engine. This becomes a transcription issue only when the loanwords are rare and the model renders them phonetically.

Structured bilingual content: if each utterance is cleanly one language (speaker A in English, speaker B in French), you have bilingual content rather than code-switching. Use multi-language identification (AWS Transcribe, Google Cloud STT) or per-speaker language assignment rather than a code-switch model.

The rule of thumb: if switching happens mid-sentence (mixed grammatical structure within one utterance), it is code-switching. If switching happens at utterance boundaries, it is bilingual content. The fix differs.
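The rule of thumb can be expressed as a small check over language-tagged utterances. This sketch assumes you already have a language tag per word from some upstream tagger, and that isolated loanwords have been tagged with the language of the surrounding sentence. The function name and labels are illustrative.

```python
from typing import List


def classify_mixing(utterances: List[List[str]]) -> str:
    """Classify a recording from per-word language tags, one list per utterance."""
    # Any utterance containing more than one language means mid-sentence switching.
    if any(len(set(tags)) > 1 for tags in utterances):
        return "code-switching"
    # Otherwise look at whether different utterances use different languages.
    languages = {tags[0] for tags in utterances if tags}
    return "bilingual content" if len(languages) > 1 else "monolingual"


print(classify_mixing([["en", "en", "es", "en", "en"]]))
print(classify_mixing([["en", "en"], ["fr", "fr", "fr"]]))
```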

For recorded multilingual meetings, create meeting minutes from audio covers the end-to-end workflow including language handling.

My take: in 2025, the practical advice was "use Whisper and accept the limitations." In mid-2026, the picture is more differentiated. For the 10 languages covered by Deepgram's language=multi and the 6 languages in AssemblyAI's U3-Pro code-switch mode, purpose-built engines are now measurably better than Whisper at switch boundaries. Whisper is still the right choice for language pairs those engines do not cover, but it is no longer the default recommendation for everything. Match the engine to the language pair, not to the workflow.

If you need a quick transcript from multilingual audio without configuring an API, ConvertAudioToText handles auto language detection with no account required for the first run.

FAQ

What is the best transcription engine for Hinglish in 2026?

Deepgram Nova-3 with language=multi is the strongest option, as Hindi is included in its 10-language code-switching set. Whisper Large-v3 with language=None is a reasonable fallback. No current engine handles Hinglish with high accuracy across all regional accents; expect more manual correction for Hinglish than for Spanglish or French-English content. Claims of 80-90% accuracy for Hinglish are not verifiable against independent benchmarks.

How does AssemblyAI code switching differ from Deepgram's?

For pre-recorded audio, AssemblyAI Universal-3 Pro supports two language codes per request, one of which must be English, with best results when the non-English language is dominant. Deepgram Nova-3 with language=multi covers 10 languages without requiring English to be one of them, and does not impose a two-language limit. For real-time streaming, both offer dedicated multilingual streaming models (AssemblyAI Universal-Streaming for 6 languages, Deepgram Flux for 10 languages), and the gap narrows. Pick based on which language pairs you need.

Does AWS Transcribe support code switching mid-sentence?

AWS Transcribe's multi-language identification works well for structured bilingual content where each utterance is one language, but it is less reliable for intra-sentential code-switching (mixing within a single sentence). The feature detects the dominant language per audio segment and transcribes each segment separately. For mid-sentence switching, Deepgram Nova-3 or AssemblyAI Universal-3 Pro are better choices if your language pairs overlap.

Why does Whisper hallucinate at language boundaries?

Whisper's decoder generates text based on learned patterns from training audio. When audio content is ambiguous at a language switch point (especially near silence or background noise), the model defaults to statistically likely continuations rather than the actual speech, producing plausible-sounding fabricated text. Preprocessing with voice activity detection (VAD) removes the silence segments most likely to trigger this. For switch-heavy content, purpose-built code-switch models are less prone to boundary hallucination because their architecture handles the switch signal explicitly rather than via pattern continuation.

Sources

AssemblyAI Code Switching documentation: https://www.assemblyai.com/docs/pre-recorded-audio/code-switching
AssemblyAI Universal-Streaming multilingual blog: https://www.assemblyai.com/blog/real-time-transcription-code-switches-multilingual-speakers
Deepgram Multilingual Code Switching documentation: https://developers.deepgram.com/docs/multilingual-code-switching
Deepgram Flux Multilingual launch (April 2026): https://deepgram.com/learn/deepgram-launches-flux-multilingual-press-release
Deepgram Nova-3 Multilingual WER update (March 2026): https://deepgram.com/learn/nova-3-multilingual-major-wer-improvements-across-languages
Google Cloud Speech-to-Text multiple languages documentation: https://docs.cloud.google.com/speech-to-text/docs/multiple-languages
AWS Transcribe language identification documentation: https://docs.aws.amazon.com/transcribe/latest/dg/lang-id.html
