
Best Speech-to-Text APIs in 2026 (Real Prices)

By Mamane B. Moussa | February 23, 2026 | Updated June 30, 2026 | 16 min read

TL;DR

There is no single "best" speech-to-text API in 2026. For pre-recorded audio on a budget, AssemblyAI Universal-2 ($0.15/hr) and Rev Reverb ($0.20/hr) are now the cheapest full-featured options, with Cloudflare Workers AI Whisper (~$0.027/hr) cheaper still if you do not need diarization or streaming. For real-time streaming, Deepgram Nova-3 ($0.0048/min) leads on latency. For audio you also want to summarize or analyze, AssemblyAI is the one-call pick. Prices below are from each vendor's own pricing page as of June 2026.

The honest answer: there is no single best speech-to-text API in 2026

If you came here for a ranked list where #1 wins for everyone, the honest answer is that no such list exists. The right speech-to-text (STT) API depends on three things: whether your audio is live or pre-recorded, whether you need to know who said what, and how much volume you actually process. A voice agent that needs sub-200ms streaming has nothing in common with an overnight podcast pipeline that wants the lowest cost per hour.

What changed since most "best STT API" roundups were written is that prices moved a lot, and several old talking points are now wrong. AssemblyAI dropped to $0.15/hr for its Universal-2 model. Rev cut its API to $0.20/hr. OpenAI added speaker diarization and a realtime model, so the old "Whisper has no streaming and no diarization" line no longer holds. Speechmatics now publishes self-serve pricing instead of hiding behind a sales call. This guide uses prices pulled from each provider's own pricing page in June 2026, normalized so you can compare them on the same basis.

I build ConvertAudioToText on top of these engines, so this is the comparison I wish someone had handed me. For a deeper cost-only breakdown, see our speech-to-text API pricing comparison.

How to read the prices (this trips everyone up)

Vendors quote prices on different bases, and that is where most comparisons go wrong. Three things make a "cheap" headline price misleading:

Per-minute vs per-hour. $0.0077/min and $0.46/hr are the same price ($0.0077 × 60 ≈ $0.46). We show both wherever the vendor's pricing allows it.
Batch vs streaming. Pre-recorded (batch) is almost always cheaper and more accurate than real-time streaming, because the model sees the whole file before committing. If your audio is already recorded, never pay streaming rates.
Add-ons. Speaker diarization, sentiment, summarization, and PII redaction are frequently billed on top of the base transcription rate. A "$0.15/hr" model becomes $0.17/hr once you turn on a $0.02/hr diarization add-on, and per-minute add-ons add up faster still. We call out the add-on cost where it matters.
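
To make the three points above concrete, here is a minimal Python sketch of the normalization I apply before comparing vendors. The function names are my own (hypothetical helpers, not any vendor's SDK), and the rates are the ones quoted in this article, so re-check them before relying on the output.

```python
def per_min_to_per_hour(price_per_min: float) -> float:
    """Convert a per-minute price to a per-hour price."""
    return price_per_min * 60


def effective_hourly(base_per_hour: float, addons_per_hour: float = 0.0) -> float:
    """Base transcription rate plus any add-ons, all expressed in $/hr."""
    return base_per_hour + addons_per_hour


# Deepgram Nova-3 batch is quoted per minute in this article.
deepgram_batch = per_min_to_per_hour(0.0077)
# AssemblyAI Universal-2 with the +$0.02/hr diarization add-on.
assemblyai_diarized = effective_hourly(0.15, 0.02)

print(f"Deepgram Nova-3 batch: ${deepgram_batch:.2f}/hr")
print(f"AssemblyAI Universal-2 + diarization: ${assemblyai_diarized:.2f}/hr")
```

Putting everything in dollars per hour, add-ons included, is the only basis on which two headline prices can be compared fairly.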

All dollar figures below were verified against each vendor's published pricing page in June 2026. Speech-to-text pricing changes often, so treat these as a snapshot, not a contract, and click through to confirm before you commit budget.

Published pricing verified against each vendor page; flat-rate plans as the benchmark

The 2026 price comparison at a glance

Here is every provider on the same basis: pre-recorded (batch) price, what diarization adds, streaming availability, and the free tier. Sorted by batch price, cheapest first (Speechmatics publishes a "from" price, so its position is approximate).

Provider	Model	Batch price	Diarization add-on	Streaming	Free tier
Cloudflare Workers AI	Whisper	~$0.0005/min (~$0.027/hr)	Not native	No	10,000 neurons/day (~240 min)
AssemblyAI	Universal-2	$0.0025/min ($0.15/hr)	+$0.02/hr	Yes ($0.15/hr)	$50 credit
OpenAI	gpt-4o-mini-transcribe	$0.003/min ($0.18/hr)	No	Yes (realtime model)	None
Rev	Reverb	$0.0033/min ($0.20/hr)	Included	Yes	5 hours of credits
Google Cloud	Dynamic Batch	$0.004/min ($0.24/hr)	Yes	No (batch tier)	60 min/month
OpenAI	gpt-4o-transcribe	$0.006/min ($0.36/hr)	Included	Yes (realtime model)	None
AWS Transcribe	Standard (batch, tier 1)	$0.006/min ($0.36/hr)	Yes	Yes	60 min/month (12 mo)
Deepgram	Nova-3 (batch)	$0.0077/min ($0.46/hr)	+$0.002/min	Yes ($0.0048/min)	$200 credit
Speechmatics	Melia 1 / Standard	from $0.129/hr (multilingual)	Yes	Yes (~$0.40/hr)	480 min/month
Google Cloud	Standard (Chirp)	$0.016/min ($0.96/hr)	Yes	Yes	60 min/month
AWS Transcribe	Medical	$0.075/min ($4.50/hr)	Yes	Yes	60 min/month (12 mo)

Batch transcription: cost per hour

Provider	Batch price
Cloudflare Workers AI	$0.03/hr
AssemblyAI	$0.15/hr
OpenAI 4o-mini	$0.18/hr
Rev	$0.20/hr
Google Cloud	$0.24/hr
OpenAI 4o	$0.36/hr

Published batch price per hour, 2026. Lower is cheaper.

A few things jump out. The cheapest full-featured pre-recorded option in 2026 is AssemblyAI at $0.15/hr, not Deepgram, which now sits at $0.46/hr for Nova-3 batch. Deepgram is still the streaming price leader at $0.0048/min once you are doing real-time work. Cloudflare Workers AI is the genuine budget outlier at roughly $0.027/hr, with the catch that it has no native diarization or streaming. And AWS Standard batch is now $0.006/min, far below the $0.024/min figure that older articles still quote (that was the streaming rate).

Note on the July 1, 2026 AssemblyAI change: AssemblyAI is raising in-region model pricing by 10% effective July 1, 2026. You can keep the current rate by adding "model_region": "global" to your API requests. If you are reading this after that date, treat the AssemblyAI line as the global-region price.
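
As a rough illustration of that note, the sketch below builds a request body with the region setting included. The "model_region" key and "global" value are taken from the note above; the build_transcript_request helper and the example URL are hypothetical, and I am assuming the setting sits at the top level of the JSON body next to audio_url, so confirm the exact placement in AssemblyAI's documentation before shipping.

```python
import json


def build_transcript_request(audio_url: str, pin_global_region: bool = True) -> dict:
    """Build a JSON-serializable request body for a transcription job."""
    body = {"audio_url": audio_url}
    if pin_global_region:
        # Key and value as quoted in the pricing note; placement is an assumption.
        body["model_region"] = "global"
    return body


payload = build_transcript_request("https://example.com/meeting.mp3")
print(json.dumps(payload, indent=2))
```

Building the body in one helper means the region setting is applied consistently, rather than being forgotten on one code path and billed at the higher rate.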

Provider-by-provider: what each one is actually good at

Deepgram Nova-3

Deepgram built its name as the streaming-first, latency-obsessed STT provider, and that is still where it wins. Nova-3 streaming is $0.0048/min for monolingual ($0.288/hr), with sub-300ms latency that is hard to beat for live captions, voice agents, and meeting tools. New accounts get a $200 free credit with no card required, the most generous free credit in the field.

The honest catch is that Deepgram is no longer the budget batch option. Nova-3 pre-recorded is $0.0077/min ($0.46/hr) monolingual, and diarization is a separate $0.002/min add-on. If you only process recorded audio and do not need its real-time strengths, you are paying a premium for capabilities you are not using. Deepgram bills per second with no rounding, which helps on lots of short clips.

Best for: real-time streaming, voice agents, and teams that value the fastest latency over the lowest batch price. See our Deepgram Nova-3 deep dive for model details.

AssemblyAI Universal

AssemblyAI is the value story of 2026. Universal-2 pre-recorded is $0.15/hr ($0.0025/min), and the newer Universal-3.5 Pro is $0.21/hr. Standard async diarization is a small +$0.02/hr add-on, so a fully diarized transcript on Universal-3.5 Pro lands around $0.23/hr, still cheaper than Deepgram batch before diarization.

What sets AssemblyAI apart is its audio-intelligence layer. In one API call you can get the transcript plus summarization, sentiment, entity detection, topic chapters, content moderation, and PII redaction. Universal-3.5 Pro also accepts natural-language keyterm prompting (up to 1,500 words) to bias vocabulary. If you want structured data out of audio, not just text, this avoids chaining a separate LLM. New accounts get $50 in free credits, no card required. The older SLAM-1 model is deprecated; new builds should target Universal-3.5 Pro.

Best for: pre-recorded audio on a budget, and any product that needs analysis (summaries, chapters, redaction) alongside the transcript.

OpenAI (gpt-4o-transcribe and Whisper)

OpenAI quietly fixed the two biggest knocks against Whisper. There are now three transcription models: gpt-4o-mini-transcribe at $0.003/min ($0.18/hr), gpt-4o-transcribe at $0.006/min ($0.36/hr) with speaker diarization included, and a realtime model for live transcription. So the old "Whisper has no streaming and no diarization" complaint is outdated as of 2026.

The original Whisper API remains at $0.006/min, and the Whisper model itself is still open source, so you can self-host the weights with faster-whisper or whisper.cpp to trade per-minute fees for your own GPU cost. The remaining real constraint on the hosted API is the 25 MB file size limit, which forces chunking logic on long recordings. Multilingual accuracy is a genuine strength. See our OpenAI Whisper API pricing analysis and the Whisper large-v3 explainer for model internals.
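
Here is a minimal sketch of the planning half of that chunking logic: given a recording's duration and bitrate, it works out how many equal-length pieces keep each upload under 25 MB. The plan_chunks helper is hypothetical, it assumes constant-bitrate audio and a binary megabyte, and it only computes time offsets. Actually cutting the file is left to a tool such as ffmpeg.

```python
import math

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MB limit cited above; binary MB assumed


def plan_chunks(duration_s: float, bitrate_kbps: float, safety: float = 0.95):
    """Return (start, end) second offsets so each chunk stays under the limit."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    # Longest chunk that fits, with headroom for container overhead.
    max_chunk_s = (MAX_UPLOAD_BYTES * safety) / bytes_per_second
    n_chunks = max(1, math.ceil(duration_s / max_chunk_s))
    chunk_s = duration_s / n_chunks  # equal-length chunks
    return [(i * chunk_s, min((i + 1) * chunk_s, duration_s)) for i in range(n_chunks)]


# A two-hour recording encoded at 128 kbps.
chunks = plan_chunks(duration_s=2 * 60 * 60, bitrate_kbps=128)
for start, end in chunks:
    print(f"{start:7.0f}s -> {end:7.0f}s")
```

Fixed offsets can land in the middle of a word, so in practice you may want to cut on silence or overlap adjacent chunks by a second or two and de-duplicate the joined text.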

Best for: multilingual batch jobs, teams already in the OpenAI ecosystem, and anyone who wants a self-host fallback.

Google Cloud Speech-to-Text

Google Cloud still owns the multilingual breadth crown with 125+ languages and the Chirp model line. Standard recognition is $0.016/min ($0.96/hr), which is on the expensive end. The smart move is Dynamic Batch at $0.004/min ($0.24/hr) when you can wait up to 24 hours for results, a 75% discount over the standard rate and one of the cheapest paths to a transcript at scale. The free tier is 60 minutes per month indefinitely.

Best for: global products needing the widest language coverage, and high-volume batch jobs that can tolerate Dynamic Batch latency. Full breakdown in our Google Cloud Speech-to-Text pricing guide.

AWS Transcribe

AWS Transcribe is the AWS-native choice. Standard batch tier 1 is $0.006/min ($0.36/hr) with steep volume discounts that drop the rate at 250,000+ minutes/month, so it gets cheaper as you scale. Its standout feature is the Medical model at $0.075/min ($4.50/hr), purpose-built for clinical conversations and HIPAA-bound workflows, one of the few APIs trained for that domain. The free tier is 60 minutes/month for the first 12 months, then it ends.

Best for: AWS-native architectures, healthcare and clinical transcription, and teams that need audio to stay inside AWS. See our AWS Transcribe pricing breakdown.

Rev (Reverb)

Rev trained its Reverb model on millions of hours of professionally transcribed media audio, and it shows on broadcast-quality content. The API is now $0.20/hr ($0.0033/min) for Reverb, with a faster Reverb turbo at $0.10/hr, and speaker diarization is included by default (up to 8 speakers) at no extra cost. New accounts get 5 hours of free credits. The big change from older guides: Rev's API used to be $0.02/min, so the price fell roughly 6x.

Best for: podcasts, broadcast, and professionally produced media where included diarization and clean output matter.

Speechmatics

Speechmatics is the accent-and-dialect specialist out of the UK, strong on British, Australian, Indian, and South African English plus European languages. The notable 2026 change is that pricing is now self-serve and published rather than sales-only. The multilingual Melia 1 batch model starts from $0.129/hr, real-time runs from roughly $0.40/hr, and the free plan includes 480 minutes/month. On-prem container deployment is available for data-sovereignty needs.

Best for: diverse English accents, European languages, and organizations that need on-premise deployment.

Cloudflare Workers AI (Whisper)

The option almost no roundup mentions: Cloudflare Workers AI runs Whisper at the edge for 41.14 neurons per audio minute at $0.011 per 1,000 neurons, which works out to roughly $0.027/hr (whisper-large-v3-turbo is ~$0.031/hr). There is a free allocation of 10,000 neurons/day, about 240 free minutes of Whisper daily. It is by far the cheapest paid transcription here.
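
The neuron math is easy to get wrong, so here it is spelled out as a short Python sketch. The three constants are the figures quoted above; the variable names are my own.

```python
NEURONS_PER_AUDIO_MINUTE = 41.14
USD_PER_1000_NEURONS = 0.011
FREE_NEURONS_PER_DAY = 10_000

# Cost of one audio minute, then scaled to an hour.
usd_per_minute = NEURONS_PER_AUDIO_MINUTE * USD_PER_1000_NEURONS / 1000
usd_per_hour = usd_per_minute * 60
# How many audio minutes the daily free allocation covers.
free_minutes_per_day = FREE_NEURONS_PER_DAY / NEURONS_PER_AUDIO_MINUTE

print(f"${usd_per_minute:.5f}/min, ${usd_per_hour:.3f}/hr")
print(f"{free_minutes_per_day:.0f} free minutes/day")
```

Swap in the neuron rate for another model to compare it on the same per-hour basis.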

The tradeoffs are real: no native speaker diarization, no streaming, and Whisper's known limits (dialect bias, 25 MB chunking). But for high-volume, single-speaker, non-real-time work where you just need text, it is unmatched on cost.

Best for: cheap, high-volume batch transcription where diarization and streaming are not required.

Accuracy: read the benchmarks, then test your own audio

Accuracy headlines lie more than price headlines do. Every vendor benchmarks on a dataset that flatters it, so word error rate (WER) numbers from two different sources are not comparable. A "5% WER batch" claim and an "18% WER real-world" claim can both be true for the same model on different audio.

The most useful neutral reference is an independent leaderboard like Artificial Analysis AA-WER v2.0, which on clean English audio currently shows Deepgram Nova-3, OpenAI gpt-4o-transcribe, and AssemblyAI Universal all clustered in the single-digit to low-teens WER range, close enough that the difference rarely decides a project. On heavily accented or noisy audio the gaps widen, and that is exactly where your own samples matter.

The reliable rule: take a 10-minute clip of your real audio (your accents, your background noise, your domain words), run it through the two or three finalists, and read the transcripts yourself. That single test tells you more than any published WER. Our guide to improving transcription accuracy covers the levers that move real-world WER more than the model choice does.
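
If you want a number to go with your own read-through, WER is simple to compute yourself: substitutions plus deletions plus insertions, divided by the number of words in a reference transcript you trust. The sketch below is a plain word-level edit distance in Python, not any vendor's or leaderboard's scoring code, and it only lowercases and splits on whitespace. Published benchmarks usually also normalize punctuation and numerals, so expect your figures to differ from theirs.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


reference = "the quick brown fox"
print(wer(reference, "the quick brown box"))  # one substitution in four words
```

Score every finalist against the same hand-corrected reference for your clip, and the numbers become comparable in a way that vendor-published WER figures are not.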

How to choose: a decision path that matches real intent

Use this to narrow to one or two candidates, then run the 10-minute test above.

Is your audio live (streaming) or already recorded (batch)?

Live: jump to the streaming question below.
Recorded: keep going.

For recorded audio, is the lowest cost your top priority?

Yes, and you do not need diarization or streaming: Cloudflare Workers AI Whisper (~$0.027/hr) or OpenAI gpt-4o-mini-transcribe ($0.18/hr).
Yes, but you need diarization: AssemblyAI Universal-2 ($0.15/hr + $0.02/hr diarization) or Rev Reverb ($0.20/hr, diarization included).

Do you also need summaries, chapters, sentiment, or PII redaction?

Yes: AssemblyAI does it in one call.

Do you need streaming / real-time?

Lowest latency: Deepgram Nova-3 ($0.0048/min).
Broadest languages live: Google Cloud (Chirp).
Already on a cloud: AWS Transcribe streaming or Google Cloud streaming.

Do you have a special domain or market?

Medical / HIPAA: AWS Transcribe Medical.
Diverse English accents or on-premise: Speechmatics.
Widest language list: Google Cloud (125+).
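
The same path can be written down as a small Python function, which is handy if you want to encode the choice in a planning script. This is only a sketch of the questions above: the shortlist function and its parameter names are hypothetical, it assumes cost is the top priority for recorded audio, and for live audio it returns only the lowest-latency pick.

```python
from typing import List, Optional


def shortlist(live: bool, need_diarization: bool = False,
              need_analysis: bool = False,
              domain: Optional[str] = None) -> List[str]:
    """Return candidate providers following the decision path above."""
    # Special domains or markets override the general path.
    special = {
        "medical": ["AWS Transcribe Medical"],
        "accents": ["Speechmatics"],
        "on_prem": ["Speechmatics"],
        "languages": ["Google Cloud"],
    }
    if domain in special:
        return special[domain]
    if live:
        return ["Deepgram Nova-3"]
    if need_analysis:
        return ["AssemblyAI"]
    if need_diarization:
        return ["AssemblyAI Universal-2", "Rev Reverb"]
    return ["Cloudflare Workers AI Whisper", "OpenAI gpt-4o-mini-transcribe"]


print(shortlist(live=False, need_diarization=True))
```

Treat the result as the shortlist for the 10-minute test, not as the final answer.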

My honest default for a new 2026 build: if you are processing recorded audio and want one provider that is cheap and feature-complete, start with AssemblyAI. If you are building anything real-time, start with Deepgram. If you just need bulk text at the lowest possible cost, look at Cloudflare Workers AI. None of these is "the best" in the abstract; each is the best for a specific job.

Don't want to build against an API at all?

If you need transcripts but not a code integration, you do not have to wire up any of these. ConvertAudioToText is a no-code tool that handles upload, audio extraction, diarization, and export to SRT, VTT, and TXT in the browser. It routes across several of the engines above (AssemblyAI, Deepgram, and Cloudflare Whisper) under the hood, so you get the strengths of each without managing keys or fallbacks yourself. You can also generate subtitles, transcribe a meeting, or convert a video to text directly. It is the path I reach for when the job is "I need the transcript," not "I am shipping a transcription feature."

A note from Bello

I have spent a lot of 2026 wiring these engines together for real users, and the single biggest mistake I see is picking a provider off a year-old "best STT" list. The prices in those lists are usually wrong now, and the feature gaps they describe (Whisper has no diarization, Rev is expensive, Speechmatics hides its pricing) have mostly closed. Pick on your actual audio and your actual volume, verify the price on the vendor's own page the week you commit, and run the 10-minute test. That beats any ranking, including this one.

Frequently Asked Questions

What is the cheapest speech-to-text API in 2026?

For paid full-featured transcription, AssemblyAI Universal-2 at $0.15/hr ($0.0025/min) and Rev Reverb at $0.20/hr ($0.0033/min, diarization included) are the cheapest pre-recorded options. Cloudflare Workers AI running Whisper is cheaper still at roughly $0.027/hr, but it has no native speaker diarization or streaming. Google Cloud Dynamic Batch ($0.004/min) and OpenAI gpt-4o-mini-transcribe ($0.003/min) are also in the budget tier. Deepgram, often cited as cheapest in older articles, is now $0.46/hr for Nova-3 batch. Prices verified June 2026.

Which speech-to-text API is the most accurate?

There is no single most-accurate API, because every vendor benchmarks on audio that flatters it and word error rates from different sources are not comparable. On independent leaderboards for clean English audio, Deepgram Nova-3, OpenAI gpt-4o-transcribe, and AssemblyAI Universal cluster closely in the single-digit to low-teens WER range. The gap widens on accented or noisy audio. The reliable test is to run a 10-minute clip of your own real audio through your two or three finalists and read the transcripts yourself.

Which API is best for real-time streaming transcription?

Deepgram Nova-3 leads on streaming latency at $0.0048/min (monolingual) with sub-300ms latency, which is why it is the common choice for voice agents and live captions. AssemblyAI Universal Streaming ($0.15/hr) and OpenAI's realtime transcription model are strong alternatives, and Google Cloud is the pick when you need the widest language coverage in real time. For pre-recorded audio you should always use the cheaper, more accurate batch tier rather than streaming.

Do these APIs include speaker diarization, and does it cost extra?

It varies. Rev Reverb includes diarization by default (up to 8 speakers) at no extra cost, and OpenAI's gpt-4o-transcribe includes it. AssemblyAI charges a small add-on of about +$0.02/hr for standard async diarization. Deepgram charges +$0.002/min as a separate add-on. AWS Transcribe and Google Cloud support it. Cloudflare Workers AI Whisper has no native diarization. Always check whether your headline price includes diarization, because add-ons raise AssemblyAI's $0.15/hr model to $0.17/hr and Deepgram's $0.46/hr batch rate to about $0.58/hr.

Can I use OpenAI Whisper for free?

The OpenAI Whisper and gpt-4o-transcribe APIs have no free tier and charge per minute ($0.006/min for Whisper and gpt-4o-transcribe, $0.003/min for gpt-4o-mini-transcribe). However, the Whisper model is open source, so you can download the weights and run it yourself with tools like faster-whisper or whisper.cpp, trading per-minute fees for your own GPU cost. For genuinely free hosted transcription, Cloudflare Workers AI gives about 240 minutes of Whisper per day at no charge, or you can use a no-code tool like ConvertAudioToText.

Is Deepgram still the best choice in 2026?

Deepgram is still excellent for real-time streaming, where Nova-3 at $0.0048/min and sub-300ms latency is hard to beat, and it offers the most generous free credit ($200). But it is no longer the cheapest for pre-recorded audio: Nova-3 batch is $0.0077/min ($0.46/hr) plus a $0.002/min diarization add-on, which is roughly three times what AssemblyAI or Rev charge for a diarized transcript. If you only process recorded files, you are likely overpaying with Deepgram; if you build real-time products, it is still a top pick.

What changed in speech-to-text API pricing since 2025?

Prices fell sharply and several old talking points became false. AssemblyAI dropped to $0.15/hr, Rev's API fell from roughly $0.02/min to $0.20/hr, and AWS Standard batch is $0.006/min (older articles quote $0.024, which was the streaming rate). OpenAI added speaker diarization to gpt-4o-transcribe and a realtime model, so Whisper is no longer streaming-less or diarization-less. Speechmatics now publishes self-serve pricing instead of requiring a sales call. Note that AssemblyAI raises in-region pricing 10% on July 1, 2026, avoidable with model_region: global.

Should I use an STT API or a no-code transcription tool?

Use an API if you are shipping transcription as a feature inside your own product and need programmatic control, custom vocabulary, or webhooks. Use a no-code tool if you just need the transcript itself, for a meeting, interview, podcast, or video. A no-code tool like ConvertAudioToText handles upload, audio extraction, diarization, and export to SRT, VTT, and TXT without you managing API keys, fallbacks, or chunking. It routes across multiple engines internally so you get their strengths without the integration work.
