
Deepgram Nova-3 Pricing 2026: $0.0043/min Batch, $0.0077 Streaming

Mamane B. Moussa | May 26, 2026 | Updated June 30, 2026 | 14 min read

TL;DR

Deepgram Nova-3 is a closed, commercial speech-to-text model built for low-latency streaming and high-volume batch work. It runs about $0.0043 per minute for pre-recorded audio (around $0.26 per hour) and about $0.0077 per minute streaming, supports 50-plus languages with real-time code-switching across 10 of them, and ships keyterm prompting for domain vocabulary. Its accuracy and latency numbers are strong but self-published, so test on your own audio before trusting them. Pick Nova-3 for real-time and tight latency budgets; pick Whisper when you need open-source, self-hosting, or 99-language coverage.

Deepgram released Nova-3 in January 2025 as the replacement for Nova-2, its previous flagship speech recognition model. It now powers part of the real-time and high-throughput transcription tooling shipped in 2025 and 2026, and it runs inside parts of the ConvertAudioToText pipeline. This post explains what Nova-3 is, what its real prices and limits are in 2026, and how it stacks up against the open-source default, Whisper.

I spend most of my week routing audio between transcription engines, so my goal here is the version I wish vendors published: the numbers that are real, the ones that are marketing, and the line between them.

The headline accuracy and latency figures are Deepgram's own benchmarks, not independent ones, so treat them as a starting point and test on your audio. Choose Nova-3 when real-time latency or per-second billing matters. Choose Whisper when you need open-source control, self-hosting, or coverage of less common languages.

Architecture: what is actually documented

Deepgram has not published parameter counts, layer counts, or the specific transformer variant Nova-3 uses. It is a closed commercial model, so the architecture is a black box. What the company does document publicly:

It is an end-to-end model. The same network handles acoustic modeling, language modeling, and final text production rather than three separate stages.
It is trained on a large proprietary audio dataset spanning enterprise call audio, podcasts, conversational data, and noisy real-world recordings.
It serves both streaming (real-time) and pre-recorded API modes from the same underlying model.
It returns word-level timestamps and confidence scores directly.

Deepgram's engineering posture has been to optimize for inference latency and cost per minute. Nova-3 continues that line: the launch positioned it as a speed and customization play, not a raw-accuracy moonshot.

What changed from Nova-2

Deepgram's launch post for Nova-3 highlights a few targeted improvements over Nova-2.

Keyterm prompting. This is the flagship feature. You can pass up to 100 key terms (company names, drug names, technical jargon) at request time and the model biases toward them, with no model retraining. Deepgram calls it the first self-serve customization of its kind in a voice AI model. For domain-specific vocabularies, this is the most useful single addition.

Real-time code-switching. Nova-3 can transcribe audio that switches between languages mid-sentence across 10 languages (English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch) when you set the language code to `multi`. Nova-2 handled mixed-language audio poorly.

Wider language coverage. Nova-3 launched with 36 supported languages and Deepgram expanded it through 2025 to 50-plus per the current docs.

Fewer hallucinations on silence and music. Older Deepgram models sometimes produced text during music-only or near-silent stretches. Nova-3 includes voice activity detection inside the pipeline that suppresses those false positives. Our post on how voice activity detection works covers why this matters.
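To make the two request-time options above concrete (keyterm prompting and the `multi` language code), here is a minimal sketch that only builds a request URL and sends nothing. The endpoint path and the `keyterm` and `language` query parameter names are assumptions based on Deepgram's public API conventions, `build_listen_url` is a hypothetical helper, and the 100-term limit is the figure quoted in this post. Check the current API reference before relying on any of it.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names; verify against Deepgram's API reference.
DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"
MAX_KEYTERMS = 100  # limit quoted in this article


def build_listen_url(keyterms=(), language="en"):
    """Build a pre-recorded request URL for Nova-3 (hypothetical helper)."""
    keyterms = list(keyterms)
    if len(keyterms) > MAX_KEYTERMS:
        raise ValueError(f"at most {MAX_KEYTERMS} key terms per request")
    params = [("model", "nova-3"), ("language", language)]
    # One `keyterm` query parameter per expected term.
    params += [("keyterm", term) for term in keyterms]
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Domain vocabulary passed at request time, no retraining involved.
    print(build_listen_url(["ConvertAudioToText", "semaglutide"]))
    # Mixed-language audio: switch the language code to `multi`.
    print(build_listen_url(language="multi"))
```

Keeping the term list in the request rather than in a trained model is what makes this kind of customization self-serve: changing the vocabulary is a change to one query string.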

The accuracy claims, read honestly

This is where most write-ups go wrong, so read carefully. Every accuracy number Deepgram publishes for Nova-3 comes from Deepgram's own testing. There is no neutral third-party benchmark behind them. Here is what the launch post actually claims, kept separate so nothing gets conflated:

Streaming word error rate (WER): 54.2% lower than the next-best competitor, which Deepgram does not name. Median WER of 6.84% versus the competitor's 14.92%.
Batch WER: 47.4% lower than the next-best competitor. Median WER of 5.26% versus roughly 10%.
Versus Whisper specifically: Deepgram says Nova-3 was preferred on 7 of 7 languages tested, reaching up to an 8-to-1 preference on some, across roughly 200 audio samples.

The "54%" figure is against an unnamed competitor, not against Whisper. The Whisper comparison is the separate 7-of-7 preference test. A lot of secondhand posts merge those two into "54% more accurate than Whisper," which is not what Deepgram said. WER also depends heavily on audio quality, accent, and domain, so a vendor's median on its own test set is not a promise about your call recordings.

Takeaway: Nova-3's accuracy numbers are real claims from a real benchmark, but it is the vendor's benchmark. The only number that matters for your project is the one you measure on your own audio.
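Measuring on your own audio comes down to computing WER against a transcript you trust. The sketch below uses the standard definition (word-level edit distance divided by the number of reference words) with deliberately naive normalization: lowercasing and whitespace splitting only. That normalization is an assumption of this example, not Deepgram's benchmark method. The `relative_reduction` helper shows how a "percent lower" figure follows from two WER values.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # Row-by-row dynamic programming for Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # One substitution plus one deletion over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # The launch post's streaming medians: 6.84% versus 14.92%.
    print(f"{relative_reduction(14.92, 6.84):.1%}")
```

Run the same function over transcripts from every engine you are considering, on the same reference set, and compare those numbers rather than anyone's published medians.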

Strengths in production use

Where Nova-3 consistently earns its place.

Real-time latency. The streaming API produces transcripts with end-to-end latency in the 200 to 300 millisecond range in good conditions, comparable to Nova-2. For live captioning, voice assistants, and meeting transcription where users watch text appear as someone speaks, this is where Deepgram pulls ahead of batch-first models like Whisper.

Per-second billing. Deepgram bills per second of audio processed rather than rounding up to the nearest minute. On large volumes of short clips, that rounding difference is real money saved versus providers that round up.

Throughput at scale. Deepgram's infrastructure is built for high-volume processing. Teams transcribing thousands of hours per day report stable throughput where a self-hosted single-tenant deployment would strain.

Keyterm prompting. Pass a short list of expected terms and accuracy on proper nouns and domain vocabulary improves noticeably, with no retraining and no batch job. Our post on fixing mistranscribed names covers when this is worth the setup.

Well-calibrated confidence scores. Nova-3 returns word-level confidence you can act on downstream. Our post on transcription confidence scores covers how to use them.

Known weaknesses

Where Nova-3 falls short, stated plainly.

It is closed. You cannot self-host Nova-3, run it offline, or fine-tune it on your own data outside an enterprise contract. For strict data-residency or air-gapped requirements, an open model like Whisper is the only path. Our post on data residency for transcription covers when that constraint is a hard blocker.

Language coverage trails Whisper. Nova-3 supports 50-plus languages well; Whisper supports roughly 99 with varying quality. For content in less common languages, Whisper often wins by default. Our post on why AI struggles with low-resource languages covers the gap.

No daylight on the hardest audio. On clean single-speaker English, Nova-3 and Whisper Large-v3 land close. On the worst conditions (heavy accents, low-bitrate phone audio, overlapping speakers), the gap between them is small and use-case dependent. Our Whisper Large-v3 explainer covers that comparison.

No published audit trail. Because the architecture and training data are not public, evaluating systematic bias or failure modes means hands-on testing, not reading a paper. For research or regulated use where model provenance matters, that opacity is a real constraint.

Diarization has no documented speaker cap, in either direction. Deepgram's diarization detects speakers without you specifying a count in advance, and the docs do not publish a maximum. If you have seen a "supports up to 12 speakers" claim (including in an earlier version of this very post), that number is not in Deepgram's documentation. Treat speaker count as something to test on your own multi-party audio. Our post on handling multiple speakers in AI covers what diarization can and cannot do.
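One way to test speaker count on your own audio is to tally the speaker labels in a diarized response and compare them with the number of people you know were on the recording. This sketch assumes each word object carries an integer `speaker` field when diarization is enabled. That field name is an assumption to confirm in the docs, and the sample words are invented.

```python
from collections import Counter


def speaker_word_counts(words):
    """Count words per speaker label, skipping words without a label."""
    return Counter(w["speaker"] for w in words if "speaker" in w)


# Invented sample: three distinct speaker labels across five words.
SAMPLE_WORDS = [
    {"word": "hello", "speaker": 0},
    {"word": "hi", "speaker": 1},
    {"word": "there", "speaker": 1},
    {"word": "thanks", "speaker": 0},
    {"word": "everyone", "speaker": 2},
]

if __name__ == "__main__":
    counts = speaker_word_counts(SAMPLE_WORDS)
    print(f"{len(counts)} speakers detected: {dict(counts)}")
```

A label with only a handful of words is often a sign that one real speaker was split in two, which is worth checking by ear before trusting the count.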

Pricing in 2026

Here is Deepgram's published pay-as-you-go pricing for Nova-3. Figures move, so confirm on the Deepgram pricing page before you build a budget on them.

Mode	Price (Nova-3, pay-as-you-go)
Pre-recorded (batch), monolingual	About $0.0043 per minute (~$0.26/hr)
Streaming, monolingual	About $0.0077 per minute (~$0.46/hr)
Multilingual (`multi`) streaming	About $0.0058 per minute
Free signup credit	$200, no expiration
Growth plan	From about $4,000 per year, discounted rates

Nova-3 cost per hour by mode (pay-as-you-go, 2026): pre-recorded (batch) ~$0.26/hr, multilingual streaming ~$0.35/hr, streaming ~$0.46/hr.

Two things the per-minute number hides. First, Deepgram bills per second, so short-clip workloads pay less than they would under a scheme that rounds each clip up to a full minute at the same rate. Second, the $200 free credit is large: at $0.0043 per minute it covers roughly 775 hours of batch audio before you pay anything, which makes it a real evaluation budget rather than a teaser.
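The arithmetic behind those two points is short enough to check yourself. This sketch uses only the rates and the credit quoted in this post. The round-up-per-clip provider is hypothetical, included purely to show what per-second billing saves at the same rate, and the function names are made up for the example.

```python
import math

# Pay-as-you-go rates quoted in this article, in USD per minute.
RATES_PER_MINUTE = {
    "batch": 0.0043,
    "multilingual_streaming": 0.0058,
    "streaming": 0.0077,
}
FREE_CREDIT_USD = 200.0


def cost_per_hour(mode):
    return RATES_PER_MINUTE[mode] * 60


def hours_covered_by_credit(mode, credit=FREE_CREDIT_USD):
    return credit / cost_per_hour(mode)


def per_second_cost(durations_s, mode="batch"):
    """Bill exactly the seconds processed."""
    return sum(durations_s) / 60 * RATES_PER_MINUTE[mode]


def rounded_up_cost(durations_s, mode="batch"):
    """Hypothetical provider that rounds every clip up to a whole minute."""
    return sum(math.ceil(d / 60) for d in durations_s) * RATES_PER_MINUTE[mode]


if __name__ == "__main__":
    for mode in RATES_PER_MINUTE:
        print(f"{mode}: ${cost_per_hour(mode):.2f}/hr, "
              f"credit covers {hours_covered_by_credit(mode):.0f} hours")
    clips = [20] * 10_000  # ten thousand 20-second clips
    print(f"per-second: ${per_second_cost(clips):.2f}, "
          f"rounded up: ${rounded_up_cost(clips):.2f}")
```

For ten thousand 20-second clips, per-second billing comes to about a third of the rounded-up total, which is where the "real money" on short-clip workloads comes from.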

How that compares to other APIs

It is fashionable to call Deepgram "the cheapest." On batch transcription in 2026, that is no longer true. Here are base pre-recorded rates, normalized to per hour, from each provider's own pricing.

Provider / model	Batch price (per hour)
AssemblyAI Universal-2	$0.15
Rev AI Reverb	$0.20
AssemblyAI Universal-3.5 Pro	$0.21
Deepgram Nova-3	About $0.26
OpenAI gpt-4o-transcribe	$0.36
OpenAI gpt-4o-mini-transcribe	$0.18

Nova-3 is competitively priced among premium APIs, not the floor. AssemblyAI, Rev AI, and OpenAI's gpt-4o-mini-transcribe all undercut it on base batch rates. Where Deepgram wins is real-time latency and per-second billing, not the sticker price. One caveat makes any table like this only a starting point: these are base transcription rates, and some providers (AssemblyAI states this explicitly) bill speaker diarization, summarization, and other understanding features separately, so the base rate is not your total bill.

For solo creators and small teams, none of these per-minute APIs is usually the right buy anyway. Once you cross roughly 40 hours of audio per month, a flat-rate unlimited plan like the CATT $9.99 tier beats metered pricing. Our transcription pricing comparison walks through the crossover math.
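The crossover point is a one-line calculation. This sketch assumes the $9.99 tier is billed monthly and compares it only against Nova-3's base batch rate from this post. Add-on features and other providers' rates would shift the result.

```python
BATCH_RATE_PER_MINUTE = 0.0043   # Nova-3 batch rate quoted in this article
FLAT_PLAN_USD_PER_MONTH = 9.99   # assumed to be a monthly price


def metered_monthly_cost(hours, rate_per_minute=BATCH_RATE_PER_MINUTE):
    return hours * 60 * rate_per_minute


def breakeven_hours(flat_price=FLAT_PLAN_USD_PER_MONTH,
                    rate_per_minute=BATCH_RATE_PER_MINUTE):
    """Monthly hours at which metered spend equals the flat price."""
    return flat_price / (rate_per_minute * 60)


if __name__ == "__main__":
    print(f"break-even: {breakeven_hours():.1f} hours per month")
    for hours in (10, 40, 100):
        print(f"{hours} h metered: ${metered_monthly_cost(hours):.2f}")
```

Against this rate the break-even lands a little under 39 hours, consistent with the "roughly 40 hours" rule of thumb.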

Streaming vs batch

Nova-3 serves both through different endpoints, and the use cases are distinct.

Streaming

Streaming is for when the user sees output as it is produced: live captioning, real-time meeting transcription, voice-assistant input, accessibility tools. Latency under 500 milliseconds matters, and accuracy can dip slightly because the model has to commit to text before it has the full context, then revise as more audio arrives. Nova-3 streaming returns interim transcripts (best guess, may be revised) followed by finalized transcripts (locked). Most integrations handle the swap automatically.

Batch

Batch mode is for pre-recorded audio. Latency is seconds for short files and minutes for long ones, and accuracy is higher than streaming because the model has the full audio context. Most standard transcription work, including the uploads on the CATT English transcription tool, is batch. Real-time is common in live captioning workflows and rarer in upload-and-wait tools.

Where Nova-3 sits in the CATT pipeline

ConvertAudioToText routes each job to whichever engine fits it, and the user never picks. Being honest about Nova-3's real role here matters, because it is not the default for most jobs.

Figure: The ConvertAudioToText uploader: drag and drop a file, record live, or paste a URL, with the first 30 minutes free.

AssemblyAI Universal is the default premium engine for signed-up accounts, because diarized output is the priority for account holders.
Cloudflare Whisper is the default free engine for anonymous landing-page previews, because it costs $0.
Deepgram Nova-3 is the meeting-bot recording default, the diarized fallback when AssemblyAI and Gladia are unavailable, a free-tier fallback, and the engine you get if you explicitly request Deepgram.

So Nova-3 is a real, load-bearing part of the stack, but as a meeting-bot default and a reliable fallback, not the primary engine for uploads. The practical effect for users: the pipeline picks the engine based on language, audio length, account status, and feature needs (diarization, summary), and you get the best fit without choosing. The cleanest way to see the output quality is to run your own file through the 60-minute free English tool.
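To illustrate the routing described above, here is a simplified sketch. It is not the actual ConvertAudioToText code: the engine identifiers, the `pick_engine` function, and the order in which an explicit request and the meeting-bot case are checked are all assumptions, and it ignores language, audio length, and feature needs. It models only the defaults and fallbacks named in this section.

```python
# Hypothetical engine identifiers; the real pipeline's names are not public.
ENGINES = ("assemblyai", "gladia", "deepgram", "cloudflare-whisper")


def pick_engine(signed_in, meeting_bot=False, requested=None, available=ENGINES):
    """Return the first available engine in the fallback chain for this job."""
    if requested in available:
        return requested  # explicit request wins (assumed precedence)
    if meeting_bot:
        chain = ("deepgram",)  # meeting-bot recording default
    elif signed_in:
        # Diarized premium default, then the diarized fallbacks.
        chain = ("assemblyai", "gladia", "deepgram")
    else:
        # Free default for anonymous previews, then the free-tier fallback.
        chain = ("cloudflare-whisper", "deepgram")
    for engine in chain:
        if engine in available:
            return engine
    raise RuntimeError("no transcription engine available")


if __name__ == "__main__":
    print(pick_engine(signed_in=True))
    print(pick_engine(signed_in=True, available=("deepgram", "cloudflare-whisper")))
    print(pick_engine(signed_in=False, meeting_bot=True))
```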

How to decide between Nova-3 and Whisper

A short checklist.

Real-time latency matters: Nova-3.
Self-hosted or air-gapped requirement: Whisper.
99-language coverage matters: Whisper.
50-plus well-supported languages is enough and you want per-second cloud billing: Nova-3.
Closed-source compliance is acceptable and you want managed infrastructure: Nova-3.
Open-source compliance is required: Whisper, no contest.

For most production transcription tooling in 2026, the honest answer is "use both and route between them per request." That is what ConvertAudioToText and several tools in the Otter alternatives space do behind the scenes.

Frequently Asked Questions

How much does Deepgram Nova-3 cost per minute?

Nova-3 pre-recorded (batch) transcription runs about $0.0043 per minute, roughly $0.26 per hour, on the pay-as-you-go tier. Streaming runs about $0.0077 per minute, roughly $0.46 per hour. Multilingual streaming is about $0.0058 per minute. Deepgram bills per second of audio, and new accounts get a $200 free credit with no expiration. Prices change, so confirm on Deepgram's pricing page before budgeting.

Is Nova-3 more accurate than Whisper?

On Deepgram's own benchmarks, Nova-3 was preferred over Whisper across all 7 languages they tested, reaching up to an 8-to-1 preference on some. That is Deepgram's test, not an independent one. In practice, on clean single-speaker English the two are close, and on the hardest audio (heavy accents, phone-quality recordings, overlapping speech) the gap is small and depends on your specific audio. The only number that matters is the one you measure yourself.

How many languages does Nova-3 support?

Nova-3 launched in January 2025 with 36 languages and expanded through 2025 to 50-plus per Deepgram's current documentation. Real-time code-switching, where the model handles audio that mixes languages mid-stream, is supported across 10 languages: English, Spanish, French, German, Hindi, Russian, Portuguese, Japanese, Italian, and Dutch. Whisper still covers more total languages, around 99, at varying quality.

Can I self-host or run Nova-3 offline?

No. Nova-3 is a closed commercial model. You cannot download it, run it offline, or fine-tune it on your own data outside an enterprise contract. If you need self-hosting, air-gapped deployment, or full data control, an open model like Whisper Large-v3 is the path to take.

What is keyterm prompting and does it actually help?

Keyterm prompting lets you pass up to 100 expected terms (company names, product names, medical or legal jargon) with a transcription request, and Nova-3 biases toward them with no retraining. It helps most when your audio is full of proper nouns or domain vocabulary a general model would miss. For a list of common names, drug names, or technical terms, even a short keyterm list improves accuracy on those specific words noticeably.

How many speakers can Nova-3's diarization handle?

Deepgram's diarization detects speakers without you specifying a count in advance, and the documentation does not publish a maximum. Claims of a specific cap like "up to 12 speakers" are not in Deepgram's docs. Treat speaker count as something to verify on your own multi-party audio, because diarization quality on overlapping or crowded recordings varies by engine.

Does Deepgram Nova-3 support real-time streaming transcription?

Yes. Nova-3 serves streaming through a WebSocket endpoint with end-to-end latency in the 200 to 300 millisecond range in good conditions. It returns interim transcripts that may be revised, then finalized transcripts that are locked, as more audio arrives. This is Nova-3's strongest use case relative to batch-first models like Whisper.

Is Nova-3 the cheapest transcription API?

Not on batch. In 2026, AssemblyAI Universal-2 ($0.15/hr), Rev AI Reverb ($0.20/hr), and AssemblyAI Universal-3.5 Pro ($0.21/hr) all undercut Nova-3's roughly $0.26/hr base batch rate. Nova-3 competes on real-time latency and per-second billing rather than the lowest sticker price. Note that some providers bill diarization and other features separately, so base rates are not total cost.

Where to start

The cleanest way to evaluate Nova-3 against your own audio is to run it through a service that uses it. The 60-minute CATT free tier covers a real evaluation. For Deepgram's own onboarding, the $200 self-service credit covers hundreds of hours of batch Nova-3 processing, which is more than enough to compare against your existing workflow before you commit a budget.


