Deepgram Nova-3 Explained: Speed, Accuracy, and Streaming

ConvertAudioToText Team · May 26, 2026 · 7 min read

Deepgram released Nova-3 in January 2025, replacing Nova-2 as their flagship speech recognition model. It powers a meaningful share of real-time and high-throughput transcription tooling shipped in 2025 and 2026, including parts of the ConvertAudioToText pipeline. This post explains what Nova-3 is, what changed from Nova-2, and where it fits in a stack alongside Whisper.

The Short Version

Nova-3 is Deepgram's third-generation end-to-end speech recognition model, built to serve both batch and streaming use cases. It is a closed commercial model (architecture details are not public) optimized for low latency and high throughput. It handles code-switched speech across 36 languages and offers enterprise-tier customization. Pricing on the Deepgram cloud sits at roughly $0.0043 per minute for the pre-recorded API.

If you have used a transcription tool that produces results in seconds rather than minutes, Nova-3 or one of its predecessors is often the engine doing the work.

Architecture: What We Know

Deepgram has not published parameter counts, layer counts, or the specific transformer variant Nova-3 uses. What is documented publicly:

  • It is an end-to-end model, meaning the same network handles acoustic modeling, language modeling, and final text production rather than three separate stages.
  • It is trained on a multimillion-hour proprietary audio dataset spanning enterprise call audio, podcasts, conversational data, and noisy real-world recordings.
  • It supports both streaming (real-time) and pre-recorded API modes, with the same underlying model serving both.
  • It produces word-level timestamps and confidence scores directly.

Deepgram's broader engineering posture has been to optimize aggressively for inference latency and cost per minute. Nova-3 continues that line: Deepgram's public benchmarks claim a 30 to 40 percent improvement in throughput over Nova-2 with comparable or better accuracy.

What Changed from Nova-2

Deepgram's own release notes for Nova-3 highlight a few targeted improvements over Nova-2.

Better multilingual coverage. Nova-2 was strongest in English with adequate performance in a handful of other languages. Nova-3 expanded the formally supported language set to 36 and added code-switching detection within a single audio stream.

Reduced hallucination on edge cases. Older Deepgram models occasionally produced text during music-only or near-silent stretches. Nova-3 includes additional voice activity detection inside the pipeline that suppresses these false positives.

Enterprise customization. Nova-3 supports keyterm boosting (telling the model to expect certain proper nouns or domain terms) more reliably than Nova-2's keyword feature. For customers with industry-specific vocabularies, this is a meaningful improvement.

Better speaker diarization. The diarization output now handles up to 12 speakers reliably, compared to roughly 6 to 8 in Nova-2. Our deeper post on handling multiple speakers in AI covers what diarization can and cannot do.
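As a rough illustration of how diarized output is consumed downstream, the sketch below collapses word-level results into speaker turns. It assumes each word arrives as a dict with a `word` string and an integer `speaker` label, which is our understanding of the shape returned when diarization is enabled. The sample data and the `group_turns` helper are invented for this example.

```python
from itertools import groupby

# Hypothetical diarized word results; field names are assumptions.
words = [
    {"word": "hello", "speaker": 0},
    {"word": "there", "speaker": 0},
    {"word": "hi", "speaker": 1},
    {"word": "how", "speaker": 0},
    {"word": "are", "speaker": 0},
    {"word": "you", "speaker": 0},
]

def group_turns(words):
    """Collapse consecutive words from the same speaker into one turn."""
    turns = []
    # groupby only merges adjacent items, which is exactly what a "turn" is.
    for speaker, run in groupby(words, key=lambda w: w["speaker"]):
        turns.append((speaker, " ".join(w["word"] for w in run)))
    return turns

for speaker, text in group_turns(words):
    print(f"Speaker {speaker}: {text}")
```

Because grouping is by adjacency rather than by speaker ID, a speaker who talks twice produces two separate turns, which preserves the order of the conversation.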

Strengths in Production Use

Nova-3 consistently shines in five areas.

Real-time latency. Nova-3's streaming API produces transcripts with end-to-end latency under 300 milliseconds in good conditions. For live captioning, voice assistants, and meeting transcription where users see text as someone speaks, this latency is where Deepgram pulls ahead of batch-oriented models like Whisper.

Throughput at scale. Deepgram's infrastructure is designed for high-volume processing. Customers transcribing thousands of hours per day report stable throughput where self-hosted or single-tenant deployments would struggle.

Cost per minute. At roughly $0.0043 per minute on the standard tier, Nova-3 is among the cheapest commercial transcription APIs available. For high-volume use cases (call centers, broadcast captioning, enterprise compliance), this matters.

Custom vocabulary support. The keyterm feature lets you provide a list of expected terms (company names, drug names, technical jargon) and biases the model toward them. Accuracy on proper nouns and domain-specific vocabulary improves meaningfully with even a short keyterm list.
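A keyterm list is typically passed as repeated query parameters on the transcription request. The sketch below only builds the request URL and makes no network call. The endpoint and the `model` and `keyterm` parameter names reflect our reading of Deepgram's public API, so treat them as assumptions and check the current API reference before relying on them. `build_listen_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed pre-recorded endpoint; verify against Deepgram's API reference.
BASE_URL = "https://api.deepgram.com/v1/listen"

def build_listen_url(keyterms, model="nova-3"):
    """Return a request URL with one `keyterm` query parameter per term."""
    params = [("model", model)]
    # A list of tuples lets the same parameter name repeat once per term.
    params += [("keyterm", term) for term in keyterms]
    return f"{BASE_URL}?{urlencode(params)}"

print(build_listen_url(["metformin", "tail latency"]))
```

Multi-word terms are URL-encoded by `urlencode`, so a phrase such as "tail latency" travels as a single keyterm rather than two.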

Confidence scores. Nova-3 produces well-calibrated word-level confidence scores. Our post on transcription confidence scores covers how to use these in downstream applications.
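One common downstream use is flagging uncertain words for human review. The following is a minimal sketch, assuming each word carries a `confidence` value between 0 and 1. The sample words, the 0.80 threshold, and the `flag_for_review` helper are illustrative choices, not values recommended by Deepgram.

```python
# Hypothetical word-level results; the `confidence` field name is assumed.
words = [
    {"word": "take", "confidence": 0.99},
    {"word": "metformin", "confidence": 0.62},
    {"word": "twice", "confidence": 0.97},
    {"word": "daily", "confidence": 0.95},
]

REVIEW_THRESHOLD = 0.80  # illustrative cutoff; tune it on your own audio

def flag_for_review(words, threshold=REVIEW_THRESHOLD):
    """Return the transcript with low-confidence words wrapped in brackets."""
    marked = [
        f"[{w['word']}?]" if w["confidence"] < threshold else w["word"]
        for w in words
    ]
    return " ".join(marked)

print(flag_for_review(words))  # take [metformin?] twice daily
```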

Known Weaknesses

Nova-3 falls short in four areas.

Closed model. You cannot self-host Nova-3, fine-tune it on your own data (outside of enterprise contracts), or run it offline. For applications with strict data residency or air-gap requirements, an open model like Whisper is the only path.

Language coverage compared to Whisper. Nova-3 supports 36 languages well; Whisper supports roughly 99 with varying quality. For projects with content in less-common languages, Whisper often wins by default.

Accuracy ceiling on the hardest audio. On clean, single-speaker English audio, Nova-3 and Whisper Large-v3 produce comparable accuracy. Under the most degraded conditions (heavily accented speech, low-bitrate phone audio, multiple overlapping speakers), neither model holds a consistent lead: the gap between the two is small and depends on the use case. Our broader post on Whisper Large-v3 covers the comparison.

Transparency. Because the model architecture and training data are not public, evaluating systematic biases or failure modes requires hands-on testing rather than academic literature review. For research or regulated use cases where model provenance matters, this can be a constraint.

Streaming vs Batch

Nova-3 serves both streaming and batch transcription through different API endpoints, but the use cases for each are distinct.

Streaming

Streaming is for use cases where the user sees output as it is being produced. Live captioning, real-time meeting transcription, voice assistant input, and accessibility tools all fit here. Keeping latency under 500 milliseconds matters more than absolute accuracy, which can be marginally lower because the user often sees corrections appear as the model gains more context.

Nova-3's streaming mode produces interim transcripts (best guess, may be revised) followed by finalized transcripts (locked) as additional audio arrives. Most streaming integrations handle the swap automatically.
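The interim-to-final swap can be sketched as a small state holder: each interim message overwrites the previous guess, and a final message locks the segment. The messages below are simulated and flattened to two fields, `transcript` and `is_final`. Real streaming payloads are nested more deeply, so both the message shape and the `LiveTranscript` class are assumptions made for illustration.

```python
class LiveTranscript:
    """Holds locked segments plus the one interim segment still in flux."""

    def __init__(self):
        self.final_segments = []
        self.interim = ""

    def handle(self, message):
        if message["is_final"]:
            # Lock the segment and clear the interim guess it replaces.
            self.final_segments.append(message["transcript"])
            self.interim = ""
        else:
            # Each interim result supersedes the previous one.
            self.interim = message["transcript"]

    def display(self):
        parts = self.final_segments + ([self.interim] if self.interim else [])
        return " ".join(parts)

# Simulated message sequence (not captured from a real session).
messages = [
    {"transcript": "the quick", "is_final": False},
    {"transcript": "the quick brown fox", "is_final": False},
    {"transcript": "The quick brown fox.", "is_final": True},
    {"transcript": "jumps", "is_final": False},
]

live = LiveTranscript()
for msg in messages:
    live.handle(msg)
    print(live.display())
```

Keeping the interim text separate from the locked segments means a revision never rewrites text the user has already seen finalized.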

Batch

Batch mode is for pre-recorded audio. Latency is measured in seconds for short files and minutes for long files. Accuracy is higher than streaming mode because the model has the full audio context.

Most workloads on the ConvertAudioToText (CATT) transcription tools run in batch mode. Real-time processing is common in live captioning workflows but less so in standard transcription tools.

Pricing Context

Deepgram's published pricing on Nova-3:

Tier                                 Price per minute
Nova-3 pre-recorded                  About $0.0043
Nova-3 streaming                     About $0.0058
Self-service onboarding free tier    $200 in credits

For comparison, our broader transcription pricing comparison puts Deepgram's per-minute cost at roughly half of OpenAI's Whisper API and a fifth of Rev's AI tier. For volume use cases, this matters. For solo creators or small teams, flat-rate unlimited plans like the CATT $9.99 tier usually beat per-minute pricing once you cross 40 hours of audio per month.
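The break-even point between per-minute and flat-rate pricing is simple arithmetic, sketched below using only the figures quoted in this section. The helper names are ours, and the rates are the approximate ones listed above, so actual invoices may differ.

```python
BATCH_RATE = 0.0043   # USD per minute, pre-recorded (approximate, from above)
STREAM_RATE = 0.0058  # USD per minute, streaming (approximate, from above)
FLAT_PLAN = 9.99      # USD per month, the flat-rate tier mentioned above

def break_even_hours(flat_price, rate_per_minute):
    """Hours of audio per month at which per-minute cost equals the flat price."""
    return flat_price / rate_per_minute / 60

def minutes_for_budget(budget, rate_per_minute):
    """Whole minutes of audio a fixed budget buys at a per-minute rate."""
    return int(budget / rate_per_minute)

print(f"{break_even_hours(FLAT_PLAN, BATCH_RATE):.1f} hours")  # 38.7 hours
print(minutes_for_budget(200, BATCH_RATE), "batch minutes")
print(minutes_for_budget(200, STREAM_RATE), "streaming minutes")
```

At the pre-recorded rate the crossover lands just under 39 hours per month, which is where the roughly 40-hour rule of thumb comes from. The same $200 credit buys noticeably fewer streaming minutes than batch minutes because of the higher streaming rate.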

Nova-3 in the CATT Pipeline

ConvertAudioToText routes audio to different engines based on use case. Pre-recorded uploads to English transcription and language-specific pages route to whichever engine (Whisper Large-v3 or Nova-3) produces the best result for that audio profile. For real-time use cases or streaming applications, Nova-3 is typically the engine selected.

The practical effect: end users do not pick the engine. The pipeline does, based on language, audio length, and feature requirements (diarization, summary, real-time).

How to Decide Between Nova-3 and Whisper

A short checklist.

  • Real-time latency matters: Nova-3.
  • Self-hosted or air-gapped requirement: Whisper.
  • 99-language coverage matters: Whisper.
  • 36 well-supported languages is enough, and you want lower per-minute cost: Nova-3.
  • Closed-source compliance acceptable: Nova-3 is easier to operate.
  • Open-source compliance required: Whisper, hands down.
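The checklist can be expressed as a small routing function. This is a hypothetical sketch of the decision order, not the routing logic ConvertAudioToText actually runs. The `Request` fields, the `choose_engine` helper, and the placeholder language set are all assumptions. Hard constraints are checked first because no latency or cost advantage can override them.

```python
from dataclasses import dataclass

# Placeholder subset for illustration; not Deepgram's actual supported list.
NOVA3_LANGUAGES = {"en", "es", "fr", "de"}

@dataclass
class Request:
    language: str
    realtime: bool = False
    must_self_host: bool = False  # air-gapped or open-source requirement

def choose_engine(req):
    """Apply the checklist in priority order and return an engine name."""
    if req.must_self_host:
        return "whisper"  # a closed cloud API cannot satisfy this constraint
    if req.language not in NOVA3_LANGUAGES:
        return "whisper"  # fall back to the wider language coverage
    # Remaining cases favor Nova-3 for latency and per-minute cost.
    return "nova-3"

print(choose_engine(Request("en", realtime=True)))
print(choose_engine(Request("sw")))
```

A production router would likely weigh more signals, such as audio length or diarization needs, but the ordering principle stays the same: constraints first, preferences second.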

For most production transcription tooling in 2026, the right answer is "use both and route between them based on the request." That is what ConvertAudioToText and several other tools in the Otter AI alternatives space do behind the scenes.

Where to Start

The cleanest way to evaluate Nova-3 against your own audio is to run it through a service that uses it. The 60-minute CATT free tier covers a meaningful evaluation period. For Deepgram's own onboarding, their $200 self-service credit covers about 46,000 minutes of batch Nova-3 processing, which is more than enough to compare against your existing workflow.
