
Speech-to-Text API Pricing 2026: Real Costs Compared

By Mamane B. Moussa | February 23, 2026 | Updated June 30, 2026 | 19 min read

TL;DR

As of June 2026, the cheapest full-featured speech-to-text APIs are AssemblyAI Universal-2 at $0.0025/min ($0.15/hr) and OpenAI gpt-4o-mini-transcribe at $0.003/min. Deepgram Nova-3 has moved up to $0.0077/min and is now mid-pack, not the price leader it was a year ago. Google Cloud and AWS Transcribe sit at the top of the range ($0.016 and $0.024 per minute) unless you use Google's 24-hour batch tier or hit AWS volume discounts. Match the API to your features (streaming, diarization, languages), not just the headline rate.

This guide compares the five APIs most teams actually shortlist in 2026: AssemblyAI, OpenAI (gpt-4o-transcribe family), Deepgram, Google Cloud Speech-to-Text, and AWS Transcribe. Every price below was pulled from each vendor's own pricing page in June 2026 and is linked at the bottom. Prices move, so always confirm against the live page before you commit a budget number.

If you want individual deep dives, we keep dedicated pages on Google Cloud Speech-to-Text pricing, AWS Transcribe pricing, and OpenAI Whisper API pricing.

Complete Pricing Comparison Table

Here is the side-by-side breakdown across all five providers as of June 2026. Rates are for standard batch (pre-recorded) transcription unless noted. AssemblyAI quotes per hour and the other four quote per minute, so both columns are shown to keep the comparison honest.

Provider	Cheapest model	$/min	$/hr	Free credit/tier	Streaming	Diarization
AssemblyAI	Universal-2 (async)	$0.0025	$0.15	$50 credit	Yes ($0.15/hr)	Yes
OpenAI	gpt-4o-mini-transcribe	$0.003	$0.18	None	No (batch only)	Yes (via gpt-4o-transcribe-diarize)
Deepgram	Nova-3 (pay-as-you-go)	$0.0077	$0.46	$200 credit	Yes	Yes
Google Cloud	Chirp (V2, dynamic batch)	$0.004	$0.24	60 min/month, ongoing	Yes ($0.016/min)	Yes
AWS Transcribe	Standard (tier 1)	$0.024	$1.44	60 min/month, 12 months	Yes	Yes

Figure: Cheapest plan, cost per hour of audio (lowest published batch price per provider, 2026; lower is cheaper). AssemblyAI $0.15/hr, OpenAI $0.18/hr, Google Cloud $0.24/hr, Deepgram $0.46/hr, AWS Transcribe $1.44/hr.

A few things have changed since this comparison was first written in early 2026, and they matter:

AssemblyAI is now the price floor for a full-featured API. Universal-2 async transcription is $0.15 per hour, which works out to $0.0025 per minute, with speaker diarization and 99-language support included.
OpenAI added a cheaper model and diarization. gpt-4o-mini-transcribe is $0.003 per minute, and the new gpt-4o-transcribe-diarize model (released October 2025) gives Whisper-family transcription built-in speaker labels, something the old Whisper API never had.
Deepgram is no longer the cheapest. Nova-3 pay-as-you-go is $0.0077 per minute ($0.46/hour), up from the Nova-2 rates that made Deepgram the budget winner in 2024 and 2025.
Google's standard rate is now $0.016 per minute on the V2 API with Chirp models, and there is a separate 24-hour dynamic batch tier at $0.004 per minute for workloads that can wait.

Pricing comparison across the five major speech-to-text API providers in 2026

Understanding Each Provider's Pricing Model

AssemblyAI Pricing

AssemblyAI prices per hour of audio and bills async (pre-recorded) and streaming separately.

Universal-2 (async): $0.15 per hour ($0.0025 per minute)
Universal-3.5 Pro (async): $0.21 per hour ($0.0035 per minute)
Universal Streaming (real-time): $0.15 per hour
Universal-3.5 Pro Realtime: $0.45 per hour
Free credit: $50 at signup, no card required

The thing to watch with AssemblyAI streaming is the billing unit. Streaming is billed by session duration, meaning the time the WebSocket connection stays open, not the length of audio you actually send. If you keep a stream open through long silences, you pay for that idle time. For async transcription this is not a concern because you pay for the audio you submit.
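To make the session-duration point concrete, here is a minimal sketch in Python. It uses the $0.15 per hour streaming rate quoted above. The 60-minute session with 20 minutes of speech is a made-up example, and the function name is illustrative rather than part of any AssemblyAI SDK.

```python
# Rate from the pricing list above: Universal Streaming at $0.15 per hour.
STREAMING_RATE_PER_HOUR = 0.15


def streaming_cost(seconds: float, rate_per_hour: float = STREAMING_RATE_PER_HOUR) -> float:
    """Cost of a span of time billed at an hourly rate."""
    return seconds / 3600 * rate_per_hour


# Hypothetical session: socket open for 60 minutes, only 20 minutes of speech sent.
session_seconds = 60 * 60
speech_seconds = 20 * 60

billed = streaming_cost(session_seconds)            # what session-based billing charges
if_billed_on_audio = streaming_cost(speech_seconds)  # what audio-based billing would charge
effective_rate = billed / (speech_seconds / 3600)   # dollars per hour of actual speech

print(f"billed on session duration: ${billed:.4f}")
print(f"if billed on audio only:    ${if_billed_on_audio:.4f}")
print(f"effective rate per hour of speech: ${effective_rate:.2f}")
```

In this example the idle time triples the effective rate per hour of speech, which is why closing idle sockets matters more than the headline price.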

Diarization, word-level timestamps, and language detection are included in the base async rate. AssemblyAI publishes 99-language coverage for async transcription, with a smaller set for real-time multilingual streaming.

A pricing note worth flagging: AssemblyAI announced that effective July 1, 2026, in-region model pricing rises about 10%, but you can keep current rates by adding "model_region": "global" to your requests. Confirm the live rate before you lock a contract.

OpenAI Pricing (gpt-4o-transcribe family)

OpenAI's audio transcription pricing is now a small family of models rather than the single Whisper endpoint it used to be.

gpt-4o-transcribe: $0.006 per minute (estimated; billed by token)
gpt-4o-mini-transcribe: $0.003 per minute (estimated; billed by token)
gpt-4o-transcribe-diarize: transcription with built-in speaker labels
No free tier, no volume discounts

Two corrections to the old conventional wisdom. First, OpenAI now has a $0.003/min option in gpt-4o-mini-transcribe, which undercuts the legacy $0.006 Whisper rate. Second, the long-standing complaint that "Whisper has no diarization" no longer holds: gpt-4o-transcribe-diarize ships speaker identification and can even map segments to up to four known speakers via reference audio.

The remaining limitations are real. The transcription API is batch only, so there is no true real-time streaming the way Deepgram, AssemblyAI, and Google offer it. The diarize model requires a chunking_strategy for audio longer than 30 seconds and does not support prompts or timestamp granularities. For a lot of podcast, interview, and meeting workloads that ship a finished file, none of that matters and the price is hard to beat.

Deepgram Pricing

Deepgram's pricing centers on the Nova-3 model in 2026.

Nova-3 monolingual (pay-as-you-go): $0.0077 per minute ($0.46 per hour)
Nova-3 multilingual: $0.0092 per minute
Growth plan: down to roughly $0.0065 per minute with a prepaid annual commitment (about 16% off)
Free credit: $200, no card required

Deepgram bills per second of audio, so a 37-second clip costs exactly 37 seconds' worth, not a rounded-up minute. That granularity is genuinely useful if you process lots of short clips.

The honest headline is that Deepgram is no longer the cheapest option. The Nova-2 rate of roughly $0.0043 per minute that made it the budget pick is being phased out of the public pricing in favor of Nova-3 at $0.0077. Deepgram still earns its place on the shortlist for low-latency streaming, strong English accuracy, and per-second billing, but the "Deepgram is always cheapest" rule of thumb from 2024 is out of date. Secondary sites still quote the old $0.0043 number; treat any figure that low as legacy Nova-2 pricing and check the live selector.

Google Cloud Speech-to-Text Pricing

Google Cloud Speech-to-Text reorganized its pricing around the V2 API and Chirp models.

Standard real-time recognition (Chirp, V2): $0.016 per minute ($0.96 per hour)
Dynamic batch: $0.004 per minute ($0.24 per hour), results delivered within 24 hours
Free tier: 60 minutes per month, ongoing, no expiration

The big shift here is consolidation. Earlier pricing had separate standard ($0.024) and enhanced ($0.036) tiers. The V2 API folds Chirp, Chirp 2, and Chirp 3 into the same $0.016 per minute rate with no per-model premium, which is both simpler and cheaper than the old enhanced tier. The standout value is dynamic batch at $0.004 per minute, which is 75% cheaper than real-time and ideal for any workload that does not need an instant answer, like overnight archive processing.

New Google Cloud accounts also get $300 in platform credits with a 90-day expiry, separate from the ongoing 60-minute monthly free allowance for Speech-to-Text specifically.

AWS Transcribe Pricing

AWS Transcribe keeps a single standard rate with automatic volume discounts that kick in as you scale.

Tier 1 (first 250,000 min/month): $0.024 per minute
Tier 2 (250K to 1M min/month): $0.015 per minute
Tier 3 (1M to 5M min/month): $0.0102 per minute
Tier 4 (5M+ min/month): $0.0078 per minute

AWS bills in one-second increments with a 15-second minimum charge per request. That 15-second floor is a real cost if you process many very short clips, because a 4-second clip still bills as 15 seconds. The free tier is 60 minutes per month for the first 12 months only, after which every minute is billable.

The volume discounts are the reason AWS stays in the conversation at scale. At more than 250,000 minutes per month (about 4,167 hours) you drop to $0.015, and the tiers keep falling to $0.0078 above 5 million minutes per month. For an AWS-native shop that already has audio in S3, the cross-service integration plus these automatic discounts can outweigh the higher list price.

Cost at Scale: Recomputed for 2026

Per-minute rates are easy to compare, but the invoice is what matters. Here are the real monthly costs at three volume tiers, recomputed from the current 2026 rates above. All figures use each provider's cheapest standard batch model.

100 Hours Per Month (6,000 minutes)

Typical for a transcription product with a few hundred active users, a podcast network processing weekly episodes, or a small support team.

Provider	Model	Monthly cost
AssemblyAI	Universal-2	$15.00
OpenAI	gpt-4o-mini-transcribe	$18.00
AssemblyAI	Universal-3.5 Pro	$21.00
Google Cloud	Dynamic batch	$24.00
OpenAI	gpt-4o-transcribe	$36.00
Deepgram	Nova-3	$46.20
Google Cloud	Chirp, real-time	$96.00
AWS Transcribe	Standard	$144.00

At 100 hours, AssemblyAI Universal-2 costs about 90% less than AWS standard and 84% less than Google's real-time rate. OpenAI's mini model and Google's 24-hour batch tier are both cheaper than Deepgram. This is the clearest sign the verdict has flipped since 2024.
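The table and percentages above can be reproduced with a few lines of Python. This is a sketch that assumes flat per-minute billing at the list prices quoted in this article, with no minimums, rounding, or volume tiers. The function names are illustrative.

```python
# Per-minute list prices quoted in this article (June 2026); confirm before budgeting.
RATES_PER_MIN = {
    "AssemblyAI Universal-2": 0.0025,
    "OpenAI gpt-4o-mini-transcribe": 0.003,
    "AssemblyAI Universal-3.5 Pro": 0.0035,
    "Google Cloud dynamic batch": 0.004,
    "OpenAI gpt-4o-transcribe": 0.006,
    "Deepgram Nova-3": 0.0077,
    "Google Cloud Chirp real-time": 0.016,
    "AWS Transcribe standard (tier 1)": 0.024,
}


def monthly_cost(hours: float, rate_per_min: float) -> float:
    """Flat-rate monthly cost in dollars for a given number of audio hours."""
    return round(hours * 60 * rate_per_min, 2)


def savings_pct(cheaper: float, pricier: float) -> float:
    """How much less the cheaper option costs, as a percentage of the pricier one."""
    return round((1 - cheaper / pricier) * 100, 1)


costs = {name: monthly_cost(100, rate) for name, rate in RATES_PER_MIN.items()}
for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name:34s} ${cost:>8,.2f}")

cheapest = costs["AssemblyAI Universal-2"]
print("vs AWS standard:", savings_pct(cheapest, costs["AWS Transcribe standard (tier 1)"]), "%")
print("vs Google real-time:", savings_pct(cheapest, costs["Google Cloud Chirp real-time"]), "%")
```

Changing the `100` to your own monthly hours gives a first-pass budget. The tiered AWS rates and the billing-unit rules covered later need separate handling.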

1,000 Hours Per Month (60,000 minutes)

A growing product, a mid-size contact center, or a media company processing content at scale.

Provider	Model	Monthly cost
AssemblyAI	Universal-2	$150.00
OpenAI	gpt-4o-mini-transcribe	$180.00
AssemblyAI	Universal-3.5 Pro	$210.00
Google Cloud	Dynamic batch	$240.00
OpenAI	gpt-4o-transcribe	$360.00
Deepgram	Nova-3	$462.00
Google Cloud	Chirp, real-time	$960.00
AWS Transcribe	Standard	$1,440.00

At 1,000 hours the gap is stark in dollars. AssemblyAI processes for $150 the same volume that costs $1,440 on AWS standard. That is $1,290 per month, or more than $15,000 per year, saved versus AWS list pricing. No AWS volume discount applies at this volume, because 60,000 minutes is well under the 250,000-minute threshold where tier 2 begins.

10,000 Hours Per Month (600,000 minutes)

Enterprise scale: global contact centers, large media archives, compliance recording, or an API serving thousands of developers.

Provider	Model	Monthly cost
AssemblyAI	Universal-2	$1,500.00
OpenAI	gpt-4o-mini-transcribe	$1,800.00
AssemblyAI	Universal-3.5 Pro	$2,100.00
Google Cloud	Dynamic batch	$2,400.00
OpenAI	gpt-4o-transcribe	$3,600.00
Deepgram	Nova-3	$4,620.00
Google Cloud	Chirp, real-time	$9,600.00
AWS Transcribe	Standard, tiered	$11,250.00

At enterprise scale AWS's volume discounts finally apply: 600,000 minutes bills as the first 250,000 at $0.024 ($6,000) plus 350,000 at $0.015 ($5,250) for $11,250, instead of $14,400 at flat rate. Even so, AssemblyAI Universal-2 at $1,500 is roughly $9,750 per month cheaper than discounted AWS. At this volume, every provider will also negotiate custom rates, so treat list prices as the starting point, not the final bill.
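The tiered calculation above generalizes to any volume. The sketch below assumes the tiers are graduated (each rate applies only to the minutes that fall inside its band), which is how the worked example in this article treats them. The tier boundaries and rates are the ones listed in the AWS section, and the function name is illustrative.

```python
# AWS Transcribe standard tiers as listed in this article: (minutes in band, $/min).
# None marks the final band, which has no upper bound.
AWS_TIERS = [
    (250_000, 0.024),     # first 250K minutes
    (750_000, 0.015),     # 250K to 1M
    (4_000_000, 0.0102),  # 1M to 5M
    (None, 0.0078),       # 5M and above
]


def tiered_cost(minutes: float, tiers=AWS_TIERS) -> float:
    """Monthly cost in dollars, charging each band of minutes at its own rate."""
    remaining = minutes
    total = 0.0
    for band_size, rate in tiers:
        if remaining <= 0:
            break
        used = remaining if band_size is None else min(remaining, band_size)
        total += used * rate
        remaining -= used
    return round(total, 2)


for hours in (100, 1_000, 10_000):
    print(f"{hours:>6,} hours: ${tiered_cost(hours * 60):,.2f}")
```

The first two volumes stay entirely inside tier 1, so they match the flat-rate tables. Only the 10,000-hour case crosses into the second band.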

Hidden Costs You Should Know

The per-minute rate is only part of the bill. Several other costs can move your total spend meaningfully.

Minimum Charges and Rounding

How a provider rounds matters when you process many short clips:

AWS Transcribe: one-second increments with a 15-second minimum per request. A 4-second clip bills as 15 seconds.
Deepgram: billed per second, no documented minimum, which is friendliest for short clips.
Google Cloud: historically billed in 15-second increments, rounded up.
AssemblyAI streaming: billed by session duration, so idle connection time counts.

If your workload is thousands of tiny clips (voice commands, short voicemails), these rules can swing your effective per-minute cost well above the headline rate.
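A short Python sketch shows how much a minimum charge inflates the effective rate on short clips. It models two of the rules listed above: the AWS 15-second minimum with one-second increments, and rounding up to 15-second increments. The $0.024 rate is the AWS tier-1 price from this article. The helper names are hypothetical, and the exact rounding of fractional seconds is an assumption.

```python
import math

AWS_RATE_PER_MIN = 0.024  # tier-1 rate quoted in this article


def aws_billable_seconds(clip_seconds: float) -> int:
    """One-second increments (assumed rounded up) with a 15-second minimum per request."""
    return max(15, math.ceil(clip_seconds))


def increment_billable_seconds(clip_seconds: float, increment: int = 15) -> int:
    """Round a clip up to the next multiple of `increment` seconds."""
    return math.ceil(clip_seconds / increment) * increment


def effective_rate_per_min(list_rate: float, clip_seconds: float, billable_seconds: float) -> float:
    """Price paid per minute of actual audio once the billing rule is applied."""
    return list_rate * billable_seconds / clip_seconds


for clip in (4, 15, 37):
    billed = aws_billable_seconds(clip)
    rate = effective_rate_per_min(AWS_RATE_PER_MIN, clip, billed)
    print(f"{clip:>2}s clip bills as {billed}s, effective ${rate:.4f}/min")
```

For the 4-second clip the effective rate is 15/4 of the list price. Clips of 15 seconds or longer pay the list price under the AWS rule.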

Data Transfer and Egress Fees

Google Cloud and AWS charge for data egress when results leave their cloud. As a rough guide:

Google Cloud: around $0.12 per GB for the first tier of egress per month
AWS: around $0.09 per GB for the first tier of egress per month
AssemblyAI, OpenAI, Deepgram: no egress fees, results returned in the API response

Transcript text is tiny (a few KB per audio hour), so egress on results is negligible. The real data cost is uploading source audio to cloud storage first. One hour of CD-quality WAV is roughly 635 MB, so 10,000 hours per month means ingesting about 6.35 TB of audio.
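The 635 MB figure comes straight from the PCM parameters of CD-quality audio (44.1 kHz, 16-bit, stereo). The sketch below derives it and scales it to 10,000 hours. It ignores the small WAV header and uses decimal units (1 MB = 1,000,000 bytes). The 16 kHz mono comparison is an illustrative assumption about a common speech format, not a figure from this article.

```python
def wav_bytes_per_hour(sample_rate_hz: int = 44_100, bytes_per_sample: int = 2, channels: int = 2) -> int:
    """Uncompressed PCM size for one hour of audio, ignoring the file header."""
    return sample_rate_hz * bytes_per_sample * channels * 3600


cd_quality = wav_bytes_per_hour()                    # 44.1 kHz, 16-bit, stereo
speech_mono = wav_bytes_per_hour(16_000, 2, 1)       # 16 kHz, 16-bit, mono (assumed speech format)

hours_per_month = 10_000
monthly_tb = cd_quality * hours_per_month / 1e12

print(f"CD-quality WAV:  {cd_quality / 1e6:.1f} MB per hour")
print(f"16 kHz mono WAV: {speech_mono / 1e6:.1f} MB per hour")
print(f"{hours_per_month:,} hours of CD-quality WAV: {monthly_tb:.2f} TB")
```

Downsampling or compressing before upload cuts the ingest volume considerably. Whether that affects accuracy depends on the provider and the source material, so test it on your own audio.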

Audio Storage Costs

If you retain source audio for compliance, reprocessing, or QA:

Google Cloud Storage (Standard): about $0.020 per GB/month
AWS S3 (Standard): about $0.023 per GB/month
Cloudflare R2: about $0.015 per GB/month, no egress fees

For large archives, a no-egress provider like Cloudflare R2 saves real money over time. Storage rates change, so confirm against the current pricing page for your region.
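Retention costs compound because each month's audio is added to everything already stored. The sketch below uses the approximate per-GB rates listed above and a hypothetical archive that grows by 500 GB per month with nothing deleted. It ignores egress, request fees, and storage-class tiers.

```python
# Approximate standard-storage rates quoted in this article, in $ per GB per month.
STORAGE_RATE_PER_GB_MONTH = {
    "Google Cloud Storage": 0.020,
    "AWS S3": 0.023,
    "Cloudflare R2": 0.015,
}


def cumulative_storage_cost(gb_added_per_month: float, months: int, rate: float) -> float:
    """Total storage spend when every month's audio is retained indefinitely."""
    stored_gb = 0.0
    total = 0.0
    for _ in range(months):
        stored_gb += gb_added_per_month  # archive grows before the month is billed
        total += stored_gb * rate
    return round(total, 2)


# Hypothetical workload: 500 GB of new audio per month, kept for a year.
for name, rate in STORAGE_RATE_PER_GB_MONTH.items():
    print(f"{name:22s} ${cumulative_storage_cost(500, 12, rate):,.2f} over 12 months")
```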

Audio Preprocessing and File Limits

Not every API accepts every format natively, and a few have file-size caps that force chunking:

OpenAI has historically enforced a 25 MB per-request file-size limit, so long recordings must be split.
AWS Transcribe expects audio in S3 or streamed in specific formats.
Deepgram, AssemblyAI, and Google Cloud accept a wide range of input formats.

If your source is video or an uncommon codec, factor in FFmpeg processing time, compute for audio extraction, and temporary storage for intermediate files. That engineering overhead is a real cost even though it never shows up on the API invoice.
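When a file-size cap forces chunking, the first step is working out how long each chunk can be. The sketch below estimates that from the audio bitrate, assuming constant-bitrate audio, a 25 MB limit interpreted as 25,000,000 bytes, and a 5% safety margin for container overhead. All three are assumptions. The FFmpeg command it builds uses the segment muxer with stream copy, and the helper names and output pattern are illustrative.

```python
import math


def max_chunk_seconds(bitrate_kbps: int, limit_mb: int = 25, safety_percent: int = 95) -> int:
    """Longest chunk, in whole seconds, that stays under the size limit at a constant bitrate."""
    usable_bits = limit_mb * 1_000_000 * 8 * safety_percent // 100
    return usable_bits // (bitrate_kbps * 1000)


def chunk_count(duration_seconds: int, chunk_seconds: int) -> int:
    """Number of chunks needed to cover the full recording."""
    return math.ceil(duration_seconds / chunk_seconds)


def ffmpeg_split_command(src: str, chunk_seconds: int, pattern: str = "chunk_%03d.mp3") -> list[str]:
    """Argument list for splitting without re-encoding; pass it to subprocess.run to execute."""
    return ["ffmpeg", "-i", src, "-f", "segment",
            "-segment_time", str(chunk_seconds), "-c", "copy", pattern]


# Hypothetical input: a 2-hour MP3 encoded at 128 kbps.
seconds_per_chunk = max_chunk_seconds(128)
chunks = chunk_count(2 * 3600, seconds_per_chunk)
print(f"{seconds_per_chunk}s per chunk, {chunks} chunks")
print(" ".join(ffmpeg_split_command("interview.mp3", seconds_per_chunk)))
```

Splitting on fixed time boundaries can cut a word in half, so production pipelines often split on silence or overlap chunks slightly and merge the transcripts afterward.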

SDK and Integration Overhead

Google Cloud and AWS both require their cloud SDKs, IAM setup, and authentication plumbing. AssemblyAI, OpenAI, and Deepgram offer simpler REST APIs you can hit with a single authenticated HTTP request. For a small team, that difference is days of saved integration and maintenance time.

Hidden costs in a speech-to-text pipeline beyond the per-minute rate

Which API Wins for Your Use Case?

There is no single best speech-to-text API. The right pick depends on volume, features, existing infrastructure, and budget. Here is the honest breakdown.

Best for Lowest Cost

Winner: AssemblyAI Universal-2 ($0.0025/min) or OpenAI gpt-4o-mini-transcribe ($0.003/min)

If price is the deciding factor and you need a full-featured API with diarization, AssemblyAI Universal-2 is the floor at $0.15 per hour. If you only need batch transcription and can live without real-time streaming, OpenAI's mini model at $0.003 per minute is right behind it with strong accuracy. Google's dynamic batch tier ($0.004/min) is a third strong budget option when a 24-hour turnaround is acceptable.

Best for Real-Time and Streaming

Winner: Deepgram or AssemblyAI

Deepgram is built for low-latency streaming with interim results, endpointing, and per-second billing, and it remains a top pick for live captioning and voice agents even at its higher Nova-3 rate. AssemblyAI's Universal Streaming at $0.15 per hour is competitive on price with intelligent endpointing and diarization. OpenAI is not an option here because its transcription API is batch only.

Best for Enterprise and High Volume

Winner: AWS Transcribe (with volume discounts) or AssemblyAI (with an enterprise agreement)

At enterprise scale the conversation moves from list prices to negotiated rates. AWS's automatic volume discounts at 250K, 1M, and 5M minutes per month make it competitive without a sales call, and the integration is seamless if you already run on AWS. If you are not locked into a cloud, AssemblyAI's per-hour rate stays cheaper at every volume tier and it negotiates enterprise pricing on top.

Best for Multilingual Applications

Winner: Google Cloud or AssemblyAI

Google Cloud's Chirp models cover a very broad set of languages with strong non-English accuracy, all at the flat $0.016 per minute rate. AssemblyAI publishes 99-language coverage for async transcription at $0.0025 per minute, which makes it both broad and cheap for finished files. OpenAI's models handle many languages and code-switching well, but only in batch mode.

Best for Speaker Diarization

Winner: AssemblyAI, with OpenAI now a real contender

AssemblyAI includes diarization in its base async rate. The notable 2026 change is OpenAI's gpt-4o-transcribe-diarize, which adds built-in speaker labels to the Whisper family for the first time and can map segments to known speakers. Deepgram, Google Cloud, and AWS all support diarization as well, so this is no longer a differentiator that forces a single choice.

When a Managed Service Beats the Raw API

Every option above is a raw API. The real question for a lot of teams is build versus buy. The list price per minute is the smallest part of the total cost once you add integration engineering, authentication, audio preprocessing (FFmpeg, format conversion, chunking around file-size limits), retry logic, storage, and the ongoing maintenance of all of it.

ConvertAudioToText plans: a flat $9.99/month for unlimited transcription instead of a metered per-minute API bill

That is the gap ConvertAudioToText fills. We run a multi-engine transcription pipeline: AssemblyAI as the primary engine for long-form and non-English audio, Deepgram as a fallback, and Cloudflare's free Whisper for the free tier. We picked AssemblyAI as the primary specifically because of the price-to-feature balance laid out above, and we are telling you that directly so you can weigh our recommendation knowing where it comes from. You upload audio or paste a URL, and you get back a formatted transcript with speaker diarization, multiple export formats (SRT, VTT, TXT), and a subtitle generator and meeting transcription layer on top, with none of the API plumbing to maintain.

If you do want the raw API route, our best speech-to-text APIs guide walks through integration and code samples. Buy the managed service when your real cost is engineering time; integrate the API directly when you have the scale and the team to own the pipeline.

Frequently Asked Questions

What is the cheapest speech-to-text API in 2026?

AssemblyAI Universal-2 at $0.0025 per minute ($0.15 per hour) is the cheapest full-featured speech-to-text API, with diarization and 99-language support included. OpenAI's gpt-4o-mini-transcribe is close behind at $0.003 per minute for batch-only transcription. If a 24-hour turnaround is acceptable, Google Cloud's dynamic batch tier at $0.004 per minute is another strong budget option. Deepgram, which was the cheapest in 2024 and 2025, is now $0.0077 per minute with Nova-3 and no longer the budget winner.

Does Deepgram still have the lowest per-minute price?

No. Deepgram's Nova-2 rate of around $0.0043 per minute made it the budget pick in prior years, but the current Nova-3 pay-as-you-go rate is $0.0077 per minute ($0.46 per hour). AssemblyAI ($0.0025/min) and OpenAI's mini model ($0.003/min) are both cheaper now. Deepgram still competes strongly on streaming latency and per-second billing, just not on headline price. Some comparison sites still quote the old $0.0043 figure, so verify against Deepgram's live pricing page.

Does Google Cloud Speech-to-Text have a free tier?

Yes. Google Cloud offers 60 minutes of free transcription per month, and the allowance does not expire. New Google Cloud accounts also receive $300 in platform credits that expire after 90 days. AWS Transcribe's free tier is also 60 minutes per month but only for the first 12 months, after which every minute is billable.

Can the OpenAI transcription API do speaker diarization now?

Yes. The old Whisper API could not label speakers, but OpenAI released gpt-4o-transcribe-diarize in October 2025, which adds built-in speaker identification and can map segments to up to four known speakers using reference audio. The main limitation is that OpenAI's transcription API is still batch only, so there is no real-time streaming the way Deepgram, AssemblyAI, and Google offer it. The diarize model also requires a chunking strategy for audio longer than 30 seconds.

How do I estimate my monthly speech-to-text API cost?

Calculate your monthly audio volume in minutes, then multiply by the per-minute rate of your chosen provider and model. For example, 500 hours per month on AssemblyAI Universal-2 is 500 hours x 60 minutes x $0.0025 = $75 per month. Watch the billing unit: AWS has a 15-second minimum per request, AssemblyAI streaming bills by session duration (idle time included), and Deepgram bills per second. Then add hidden costs like audio storage, data egress on Google Cloud and AWS, and any preprocessing compute for format conversion.

Is streaming or batch transcription cheaper?

Batch (pre-recorded) is usually cheaper or equal because real-time streaming needs dedicated infrastructure and WebSocket management. Google's dynamic batch is $0.004 per minute versus $0.016 for real-time, a 75% difference. AssemblyAI charges $0.15 per hour for both async and base streaming, but streaming bills by session duration rather than audio length. If your workload does not need an instant answer, batch is almost always the better economic choice.

Can I switch speech-to-text APIs without rebuilding my application?

The core workflow (send audio, receive text) is similar across providers, so switching mainly means changing your integration code, authentication, and possibly your preprocessing pipeline. To keep switching cheap, abstract your transcription logic behind an internal interface so you can swap providers without touching the rest of the app. A managed service like ConvertAudioToText handles that abstraction for you and runs multiple engines under the hood, so you get reliable transcription without owning the integrations.
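One way to sketch that internal interface in Python is a small protocol that every provider adapter implements, plus a helper that tries providers in order. Everything here is hypothetical: the class and function names are illustrative, and the fake provider stands in for real vendor SDK calls, which are deliberately left out.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Transcript:
    text: str
    provider: str


class Transcriber(Protocol):
    """The only surface the rest of the app is allowed to depend on."""
    def transcribe(self, audio: bytes) -> Transcript: ...


class FakeTranscriber:
    """Stand-in adapter; a real one would call a vendor API inside transcribe()."""

    def __init__(self, name: str, fail: bool = False) -> None:
        self.name = name
        self.fail = fail

    def transcribe(self, audio: bytes) -> Transcript:
        if self.fail:
            raise RuntimeError(f"{self.name} unavailable")
        return Transcript(text=f"{len(audio)} bytes transcribed", provider=self.name)


def transcribe_with_fallback(audio: bytes, providers: list[Transcriber]) -> Transcript:
    """Try each provider in order and return the first successful transcript."""
    last_error: Exception | None = None
    for provider in providers:
        try:
            return provider.transcribe(audio)
        except Exception as exc:  # a real adapter would catch narrower errors
            last_error = exc
    raise RuntimeError("all providers failed") from last_error


result = transcribe_with_fallback(
    b"\x00" * 1024,
    [FakeTranscriber("primary", fail=True), FakeTranscriber("fallback")],
)
print(result.provider, "->", result.text)
```

With this shape, swapping vendors means writing one new adapter class and changing the provider list, while callers keep using `transcribe_with_fallback` unchanged.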

The Bottom Line

The 2026 verdict is different from 2024. AssemblyAI and OpenAI's mini model are now the price leaders, Deepgram moved to the middle of the pack with Nova-3, and Google's V2 batch tier is a quietly excellent deal for non-urgent work. Shortlist on the features your workload actually needs, run your own real volume through the cost tables above, and confirm every number against the vendor's live pricing page before you commit, because these rates move and at least one (AssemblyAI's in-region pricing) was set to change on July 1, 2026, shortly after this article was last updated.

If your true constraint is engineering time rather than per-minute cost, a managed pipeline like ConvertAudioToText removes the integration and preprocessing burden entirely. If you have the scale and the team, integrate the cheapest API that meets your feature needs directly.

Which speech-to-text API is best for high volume?

At enterprise scale, AWS Transcribe's automatic volume discounts (down to $0.0078 per minute above 5 million minutes a month) make it competitive without a sales call, especially if your audio already lives in S3. If you are not tied to a cloud, AssemblyAI stays cheaper at every volume tier on list price and negotiates custom enterprise rates on top. Treat all list prices as a starting point at this scale because every provider negotiates.
