
How AI Speech Recognition Works: The Neural Pipeline (2026)

Mamane B. Moussa | April 14, 2026 | Updated July 2, 2026 | 12 min read

The Pipeline at a Glance

Modern AI speech recognition is an encoder-decoder neural network that reads a log-mel spectrogram of your audio and outputs text tokens, all in one pass with no separate acoustic model or pronunciation dictionary in between. The model learns the mapping from sound to words jointly, which is why accuracy has improved so sharply in the past five years. If you want the older hybrid-pipeline story (HMM, MFCC, separate acoustic and language models), see how speech recognition works. If you want the product-side view (what happens between your file upload and the transcript you download), see how AI transcription works. This post covers the modern neural architecture itself.

From Sound Wave to Model Input

Before any neural network sees your audio, the raw waveform has to become a format the model can process.

The standard input format is a log-mel spectrogram. The audio is divided into overlapping 25 ms frames, with a new frame starting every 10 ms. Each frame is windowed and passed through a Fourier transform (together, this is the Short-Time Fourier Transform, or STFT) to produce a frequency spectrum, which is then filtered through a bank of triangular filters spaced on the mel scale (a perceptual frequency scale that matches how human hearing works). Logarithmic compression is applied to the filter outputs, which approximates the ear's logarithmic sensitivity to loudness. The result is a 2D grid: time on one axis, mel frequency on the other, with intensity showing how much energy is present in each frequency band at each moment.

Whisper large-v3 uses 128 mel frequency bins (earlier Whisper versions and most other models use 80). This 2D grid is what the encoder reads.
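The framing and mel-spacing arithmetic above can be sketched in a few lines. This is a minimal illustration, not any model's actual front end: the 16 kHz sample rate, the 0 to 8,000 Hz range, the HTK-style mel formula, and the no-padding frame count are all assumptions chosen for the example, and the function names are hypothetical.

```python
import math


def hz_to_mel(hz):
    """HTK-style mel formula (one common convention; others exist)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_mels, f_min, f_max):
    """Center frequencies (Hz) of n_mels filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_mels)]


def frame_count(n_samples, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Number of full frames that fit in the signal, with no padding (assumption)."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


if __name__ == "__main__":
    # One second of audio at an assumed 16 kHz sample rate.
    frames = frame_count(16000, 16000)
    centers = mel_center_frequencies(128, 0.0, 8000.0)
    print("grid shape (mel bins x frames):", len(centers), "x", frames)
    print("lowest / highest filter center (Hz):", round(centers[0]), round(centers[-1]))
```

The filter centers crowd together at low frequencies and spread out at high ones, which is the perceptual spacing the mel scale is meant to capture.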

The Encoder: Extracting Meaning from Spectrogram Patches

The encoder's job is to turn the spectrogram into a rich internal representation of what is being said.

Transformer encoders process the spectrogram in overlapping patches with multi-head self-attention. Each attention head compares every position in the spectrogram with every other position to identify which parts of the audio context are relevant to each other. Convolutional layers earlier in the stack (or convolutional sub-blocks, as in the Conformer architecture) capture local patterns, while transformer layers capture long-range dependencies. This context is what lets the model tell that the "right" in a sentence about turning a corner does not carry the same meaning as the "right" in a sentence about being correct, even before the decoder has committed to either reading.
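The "compare every position with every other position" step can be sketched as single-head scaled dot-product self-attention. This toy version is an assumption-heavy simplification: it uses identity projections instead of learned query, key, and value matrices, a single head, and tiny hand-written frame vectors.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head self-attention with identity projections (toy assumption).

    Returns (outputs, weights), where weights[i][j] is how strongly
    position i attends to position j.
    """
    d = len(frames[0])
    outputs, weights = [], []
    for query in frames:
        # Scaled dot product between this position and every position.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in frames
        ]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted mix of all frame vectors.
        outputs.append(
            [sum(wi * value[j] for wi, value in zip(w, frames)) for j in range(d)]
        )
    return outputs, weights


if __name__ == "__main__":
    toy_frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    _, attn = self_attention(toy_frames)
    for row in attn:
        print([round(x, 3) for x in row])
```

In the toy input, the first two frames are similar, so the first frame places more weight on the second than on the third.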

The Conformer (published by Google at INTERSPEECH 2020) adds a convolution module inside each transformer block, mixing local and global processing in a way that works particularly well for speech because acoustic features have strong local correlation.

The Decoder: Generating Text, Token by Token

The decoder is where the model actually produces text.

Whisper uses an autoregressive seq2seq decoder with cross-attention. The decoder generates text tokens one at a time. At each step it uses self-attention over the tokens it has already output (to maintain coherent language) and cross-attention to the encoder's output (to stay grounded in the audio evidence). The cross-attention weights determine which parts of the spectrogram the model is "listening to" as it writes each token.
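The token-by-token loop can be sketched as greedy decoding. Everything here is illustrative: `toy_step` is a hypothetical stand-in for the real decoder (which would combine self-attention over the prefix with cross-attention to the encoder output), and real systems often use beam search rather than a pure greedy choice.

```python
EOS = "<eos>"


def greedy_decode(step, max_len=20):
    """Generate tokens one at a time, always taking the highest-scoring one.

    `step` maps the tokens produced so far to a {token: score} dict.
    """
    tokens = []
    for _ in range(max_len):
        scores = step(tokens)
        best = max(scores, key=scores.get)
        if best == EOS:
            break
        tokens.append(best)
    return tokens


def toy_step(prefix):
    """Hypothetical scorer that follows a fixed script based on prefix length."""
    script = ["turn", " right", " here", EOS]
    target = script[min(len(prefix), len(script) - 1)]
    return {tok: (1.0 if tok == target else 0.0) for tok in script}


if __name__ == "__main__":
    print("".join(greedy_decode(toy_step)))
```

The important property is that each call to `step` sees the full prefix, which is what makes the decoding sequential.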

Tokens are subword units (BPE vocabulary), not phonemes. The model never produces an explicit phoneme layer; it maps audio directly to word-piece tokens in one learned process. If you have heard that modern ASR still uses phoneme-level acoustic models and a separate statistical language model, that description fits the classic hybrid pipeline, which CTC-based models partially and seq2seq models almost entirely replaced.

CTC, Seq2Seq, and Transducers: Three Strategies for One Problem

Not all neural ASR systems use the same output mechanism. Understanding the three main approaches is the core of this post.

CTC (Connectionist Temporal Classification) was proposed in 2006 and remains widely used (wav2vec 2.0 from Facebook AI is the prominent modern example). The model outputs a probability distribution over tokens at every audio frame. CTC then collapses the frame-level sequence by merging repeated tokens and removing blank tokens to produce the final transcript. The key property is that CTC assumes conditional independence between output tokens: each frame's prediction is made without access to previously predicted tokens. This makes CTC fast and easy to train, but the no-language-model-in-the-loop assumption means it can be lexically incoherent on rare words.
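The collapse rule (merge repeats, then drop blanks) is short enough to show directly. This is a minimal sketch of greedy CTC post-processing on an already-chosen frame-level sequence; the `_` blank symbol and character-level tokens are assumptions for readability.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol for this example


def ctc_collapse(frame_tokens):
    """Collapse a frame-level CTC output: merge repeats, then remove blanks."""
    merged = [token for token, _ in groupby(frame_tokens)]
    return "".join(token for token in merged if token != BLANK)


if __name__ == "__main__":
    frames = list("hh_e_ll_lloo")
    print(ctc_collapse(frames))
```

The blank between the two runs of `l` is what lets the output keep a genuine double letter: without it, the repeated frames would merge into a single `l`.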

Seq2seq attention (encoder-decoder) is what Whisper uses. The autoregressive decoder has full language-model-like behavior: each token conditions on everything that came before. This produces more natural, coherent text, especially on out-of-vocabulary or domain-specific words. The tradeoff is that decoding is sequential and slower, and the model needs to process the full audio before generating (batch processing rather than streaming).

RNN-T (Recurrent Neural Network Transducer) combines the strengths of both. Its acoustic encoder and prediction network (a small autoregressive language model over previously predicted tokens) are fused through a joint network. The key advantage is streaming: RNN-T can emit tokens as audio arrives, without waiting for the end of the utterance. This is why transducer-based models dominate production streaming applications such as voice assistants and real-time captioning systems. The approach originated in work by Alex Graves and has become the backbone of on-device ASR in products like Google's Recorder app on Pixel phones.

Architecture	Output strategy	Language modeling	Streaming-ready	Representative model
CTC	Frame-level, collapsed	External or none	Yes (with care)	wav2vec 2.0
Seq2seq attention	Autoregressive, token by token	Built-in	No (needs full audio)	Whisper
RNN-T Transducer	Joint encoder + prediction net	Built-in	Yes, naturally	Google/on-device ASR

Hybrid CTC + attention training is increasingly common: the CTC loss encourages monotonic alignment during training, helping the attention decoder converge faster, while the attention decoder handles final output.
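One common way to write this joint training objective is as a weighted sum of the two losses. The symbols below, including the interpolation weight \lambda, are notation assumed for this sketch rather than values taken from any particular system.

```latex
\mathcal{L}_{\text{joint}} = \lambda \, \mathcal{L}_{\text{CTC}} + (1 - \lambda) \, \mathcal{L}_{\text{attention}}, \qquad 0 \le \lambda \le 1
```

Setting \lambda = 0 recovers pure attention training, and \lambda = 1 recovers pure CTC training.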

What Training Data Actually Shapes

Training data volume and diversity determine generalization more than architecture alone.

The original Whisper (2022) trained on 680,000 hours of weakly supervised multilingual audio, including 117,000 hours covering 96 languages other than English and 125,000 hours of translation data. Whisper large-v3 (released late 2023) was trained on 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled audio generated using Whisper large-v2, for roughly 5 million hours total. It has 1.55 billion parameters and supports 99 languages.

"Weakly supervised" means the training labels came from existing internet sources (subtitles, transcripts) rather than painstakingly annotated studio recordings. The diversity of conditions this introduces is part of why the model handles real-world audio better than many competitors trained on smaller but cleaner corpora.

Some models are trained jointly on transcription, translation, language identification, and timestamp prediction. In Whisper's case, task conditioning is handled entirely through special decoder tokens prepended to the output sequence: the same transformer weights learn all four tasks simultaneously. This multi-task training appears to produce richer audio representations than single-task training.
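As a rough sketch of what that conditioning looks like, the function below assembles a Whisper-style decoder prefix. The token spellings follow the convention used by Whisper's tokenizer, but the helper itself is hypothetical and is not part of any library.

```python
def whisper_style_prefix(language, task, timestamps):
    """Build the special-token prefix that tells the decoder what to do.

    Hypothetical helper: language is a code such as "en", task is
    "transcribe" or "translate", timestamps toggles timestamp prediction.
    """
    if task not in {"transcribe", "translate"}:
        raise ValueError("task must be 'transcribe' or 'translate'")
    prefix = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        # Timestamp prediction is switched off with one extra token.
        prefix.append("<|notimestamps|>")
    return prefix


if __name__ == "__main__":
    print("".join(whisper_style_prefix("en", "transcribe", timestamps=False)))
    print("".join(whisper_style_prefix("fr", "translate", timestamps=True)))
```

Changing one token in the prefix switches the task, while the model weights stay the same.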

Speaker Diarization: A Parallel System

Diarization, figuring out who said what, is not built into the core ASR model. It runs as a separate pipeline alongside transcription.

Modern neural diarization extracts speaker embeddings, then clusters them. The audio is chunked into overlapping segments. A neural encoder (commonly a TDNN or ResNet architecture) converts each segment into a high-dimensional vector that captures the speaker's vocal characteristics: pitch patterns, formant frequencies, vocal tract shape. These embeddings are clustered hierarchically. Segments with similar embeddings are assigned the same speaker label.

The core challenge is that the number of speakers is unknown. Modern systems use learned similarity thresholds to decide where to cut the dendrogram. Overlapping speech (two people talking at the same time) remains an active research problem: most production systems handle simultaneous speech by assigning one speaker per frame, which means the quieter voice in an overlap is typically lost.
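The embed-then-cluster idea, including the threshold that decides how many speakers there are, can be sketched with toy vectors. This is a simplified illustration: the 2D embeddings, the fixed 0.8 cosine threshold (production systems learn theirs), and the average-linkage choice are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def cluster_speakers(embeddings, threshold=0.8):
    """Agglomerative clustering that stops when no pair is similar enough.

    Returns one speaker label per segment. The number of speakers is not
    given in advance; it falls out of the similarity threshold.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def avg_similarity(c1, c2):
        sims = [cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best, pair = -2.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = avg_similarity(clusters[a], clusters[b])
                if sim > best:
                    best, pair = sim, (a, b)
        if best < threshold:
            break  # this is the "cut" in the dendrogram
        a, b = pair
        merged = clusters.pop(b)
        clusters[a].extend(merged)

    labels = [0] * len(embeddings)
    for speaker, members in enumerate(clusters):
        for i in members:
            labels[i] = speaker
    return labels


if __name__ == "__main__":
    toy = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
    print(cluster_speakers(toy))
```

Note that each segment receives exactly one label, which is the one-speaker-per-frame limitation described above.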

Tools for meeting transcription and interview transcription apply diarization as a post-processing step after the core ASR completes.

Accuracy: What the Numbers Actually Mean

The headline accuracy figures you see in marketing material almost always reflect clean read speech, not real-world audio.

Whisper large-v3 achieves 2.7% WER on LibriSpeech test-clean, which is read audiobook audio with studio-level quality. On the Hugging Face Open ASR Leaderboard (mid-2026 snapshot), Whisper large-v3 posted a mean WER of 7.44% across diverse real-world test sets. Top-ranked models like Cohere's Transcribe (5.42% mean WER) and ElevenLabs Scribe v2 (5.83%) push the state of the art, but these numbers shift as new models enter the leaderboard.

For practical content, real-world WER in meetings, podcasts, and phone calls typically ranges from 8% to 15% depending on noise level, number of speakers, and accent diversity. The gap between benchmark WER and practical WER is the main reason you should pilot any ASR system on your own audio before committing.

My take: WER on LibriSpeech tells you about the model's ceiling. WER on your actual content is the only number that matters for your use case.
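To measure WER on your own audio, you need a reference transcript and the model's output. The sketch below computes word error rate as word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The whitespace tokenization and lack of text normalization are simplifying assumptions; benchmark numbers typically normalize casing and punctuation first.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,  # deletion
                    cur[j - 1] + 1,  # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "turn right at the corner"
    hypothesis = "turn write at corner"
    print(f"WER: {wer(reference, hypothesis):.0%}")
```

In this example, one substitution ("right" to "write") and one deletion ("the") against five reference words give a WER of 40%.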

[Figure: ConvertAudioToText speech-to-text tool handling an uploaded audio file]

Why End-to-End Models Beat the Old Hybrid Pipeline

The classic pipeline had a feature extractor, an HMM-based acoustic model that mapped features to phonemes, a pronunciation lexicon, and a separately trained n-gram language model. Each stage was optimized independently, so errors in one stage were invisible to the others.

End-to-end models jointly optimize the entire mapping from audio to text. If a word sounds acoustically ambiguous, the neural decoder can immediately use sentence context to resolve it, because acoustic modeling and language modeling share the same set of weights. There is no seam between "what phoneme is this?" and "what word does this phoneme sequence spell?". This is the architectural reason why accuracy on difficult audio improved so sharply between 2019 and 2023.

The shift came in two waves. CTC-based models like DeepSpeech (2014) were the first practical end-to-end ASR systems. Transformer-based seq2seq models from around 2020 onward extended this further by removing the conditional-independence assumption that made CTC's LM integration weak.

If you want a detailed look at what happens after the ASR output exists (punctuation restoration, diarization merge, paragraph formatting, export), see how AI transcription works. If you want the pre-neural history starting with HMMs and MFCCs, how speech recognition works covers the full arc.

What Still Limits Modern ASR

Architecture improvements have a ceiling. Some limitations come from physics and data, not model design.

Overlapping speech is genuinely hard: two voices occupy the same time-frequency bins, and separating them requires source separation models (like Conv-TasNet) running before the ASR model, which adds latency and complexity.

Low-resource languages remain poorly served. A language with only a few hundred hours of labeled training data can produce a model with WER up to an order of magnitude higher than a high-resource language, regardless of architecture sophistication. This data scarcity is a key reason AI struggles with low-resource languages.

Background music is categorically different from background noise. Steady noise can often be reduced with noise suppression, and non-speech stretches can be skipped with voice activity detection. Music occupies the same frequency ranges as speech, so the model often hallucinates lyrics or words when music is present. This is a known failure mode for Whisper specifically.

Proprietary models like Deepgram Nova optimize for different tradeoffs (latency, streaming, domain adaptation) using architectures their teams have not published in detail. For a breakdown of what Nova-3 does differently, see Deepgram Nova-3 explained.

For work that needs clean transcripts fast without running your own ASR infrastructure, ConvertAudioToText handles the full pipeline for you, including diarization and export to SRT, VTT, and plain text.

Frequently Asked Questions

What is the difference between CTC and seq2seq in speech recognition?

CTC predicts a token at every audio frame independently, then collapses the sequence. It is fast and streaming-friendly but makes no language-model-level decisions during decoding. Seq2seq (like Whisper) uses an autoregressive decoder with cross-attention to the encoder: it generates each token conditioned on all previous tokens and all audio evidence together. This produces more coherent text, especially on domain-specific vocabulary, but requires processing the full audio before generating output.

Why do modern ASR models not use phonemes as an intermediate step?

Older hybrid systems used phonemes because the acoustic model and language model were trained separately: the acoustic model needed a fixed output space (phonemes) that the language model could consume. End-to-end models learn the full audio-to-text mapping in one pass, so they output subword tokens (BPE units) directly. The model implicitly learns whatever intermediate representations are useful, which are not necessarily phonemes.

How does multi-task training improve speech recognition?

Training a model simultaneously on transcription, translation, language identification, and timestamp prediction forces it to build audio representations that generalize across tasks. Whisper, for example, conditions all four tasks through special prefix tokens prepended to the decoder's input sequence: the same encoder weights must capture enough acoustic detail to support all of them. The result appears to be a model with better accent and language robustness than a transcription-only model trained on the same data.

How accurate is AI speech recognition on real-world audio in 2026?

On read speech (LibriSpeech test-clean), leading models achieve around 2-3% WER. On real-world audio, including meetings, podcasts, and phone calls, WER typically ranges from 8-15% depending on noise, overlapping speakers, and accent diversity. Mid-2026 leaderboard leaders like Cohere's Transcribe report 5.42% mean WER across mixed test sets, though these numbers reflect the leaderboard composition and shift as new models are evaluated.

What makes the Conformer architecture different from a standard transformer for ASR?

A standard transformer uses self-attention across all positions to capture global context. The Conformer, published by Google at INTERSPEECH 2020, adds a convolution module inside each transformer block: the convolution follows the self-attention module, and the pair is sandwiched between two feed-forward modules. This allows each block to capture both local acoustic patterns (via convolution) and long-range dependencies (via attention). Because speech has strong local structure (acoustic features are highly correlated with neighboring frames), this combination outperforms a pure transformer for ASR on most benchmarks.

Sources

OpenAI Whisper GitHub (model card, architecture, training data): https://github.com/openai/whisper
Hugging Face: openai/whisper-large-v3 model card: https://huggingface.co/openai/whisper-large-v3
Google Research: Conformer (INTERSPEECH 2020): https://research.google/pubs/conformer-convolution-augmented-transformer-for-speech-recognition/
Hugging Face Audio Course, Seq2Seq architectures: https://huggingface.co/learn/audio-course/chapter3/seq2seq
Hugging Face Transformers: wav2vec2 documentation: https://huggingface.co/docs/transformers/en/model_doc/wav2vec2
pyannote.ai: What is Speaker Diarization: https://www.pyannote.ai/blog/what-is-speaker-diarization
Hugging Face Open ASR Leaderboard (mid-2026 snapshot): https://huggingface.co/spaces/hf-audio/open_asr_leaderboard
AssemblyAI: Word error rate is broken (2026): https://www.assemblyai.com/blog/word-error-rate-is-broken
