
OpenAI Whisper API Pricing 2026: $0.003–$0.006/min

By Mamane B. Moussa. Published February 23, 2026. Updated June 30, 2026. 18 min read.


TL;DR

In 2026 there is no longer a single "Whisper API price." The legacy whisper-1 model is still $0.006 per minute, but OpenAI now offers gpt-4o-mini-transcribe at $0.003/min (half the cost), gpt-4o-transcribe at $0.006/min with better accuracy, gpt-4o-transcribe-diarize at the same rate with speaker labels, plus realtime variants. The catch: only whisper-1 still returns word-level timestamps and SRT/VTT, so subtitle and video workflows often still need it. Every model shares the same 25MB and 25-minute file limit.

OpenAI Whisper API Pricing in 2026: The Short Version

When people say "Whisper API pricing," they usually mean one number: $0.006 per minute. That number is still correct for the original whisper-1 model. But in 2026 it is no longer the whole story, and treating it as one flat price is how teams end up paying twice what they need to.

OpenAI now runs several speech-to-text models on the same /v1/audio/transcriptions endpoint. whisper-1 is the legacy model. Alongside it sit gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-4o-transcribe-diarize, plus realtime variants for live audio. The cheapest OpenAI transcription option in 2026 is gpt-4o-mini-transcribe at $0.003 per minute, which is half the price of Whisper. If you are still defaulting to whisper-1 out of habit, you may be overpaying.

Here is the full lineup with current rates, pulled from OpenAI's pricing page:

Model	Price per minute	Price per hour	What it adds over Whisper
gpt-4o-mini-transcribe	$0.003	$0.18	Cheapest; better accuracy than Whisper
whisper-1 (legacy)	$0.006	$0.36	Word timestamps, SRT/VTT output
gpt-4o-transcribe	$0.006	$0.36	Higher accuracy; supports streaming
gpt-4o-transcribe-diarize	$0.006	$0.36	Speaker labels (who said what)
gpt-realtime-whisper	$0.017	$1.02	Live, low-latency streaming
gpt-realtime-translate	$0.034	$2.04	Live transcription plus translation

[Figure: OpenAI speech-to-text cost per hour by model (OpenAI list price, 2026); the values match the Price per hour column above.]

A quick read of this table: for plain batch transcription at the lowest cost, gpt-4o-mini-transcribe wins. For the best accuracy at the same price as Whisper, use gpt-4o-transcribe. For speaker labels, use the diarize variant. You should reach for the legacy whisper-1 only when you specifically need word-level timestamps or subtitle files, because the newer gpt-4o models dropped those outputs. More on that below, because it is the detail most pricing articles miss.

I am Bello M. Amadou. I work on transcription tooling, and I have watched a lot of teams pick a model on price alone and then discover that the cheap one cannot produce the SRT file their video editor needs. This guide is meant to save you that round trip.

What Changed Since the "$0.006 Flat Rate" Days

The original Whisper API launched as a deliberately simple product: one model, one price, no tiers. That simplicity was its selling point, and a lot of older articles still describe it that way. It is now out of date.

In late 2025 OpenAI shipped a new generation of audio models built on its GPT-4o architecture, and they became production stable through 2026. The practical effects:

The price floor dropped. gpt-4o-mini-transcribe runs at $0.003 per minute, undercutting the old Whisper rate by 50 percent. For high-volume, accuracy-tolerant work like internal search indexing or rough drafts, this is the new default.

Accuracy improved at the same price. gpt-4o-transcribe costs the same $0.006 per minute as whisper-1 but is positioned by OpenAI as a more accurate transcription model. Independent reviewers report a lower word error rate than Whisper v3 on benchmark audio, though the exact figures vary by test set, so treat any single percentage you see online as directional rather than gospel.

Speaker diarization is now native. The old Whisper API had no built-in way to label speakers. gpt-4o-transcribe-diarize adds automatic speaker detection and returns a diarized_json response with speaker labels and segment timestamps, at the same per-minute rate. You can even supply short reference clips for up to four known speakers.

Real-time transcription exists. The classic Whisper API was batch only. OpenAI now offers realtime transcription through gpt-realtime-whisper ($0.017/min) and a translate-as-you-speak variant ($0.034/min), and gpt-4o-transcribe itself supports streaming responses. The blanket claim that "the Whisper API cannot do real time" is now only true of whisper-1 specifically.

Whisper is still here, and still open source. OpenAI did not retire whisper-1. It remains in the model list at $0.006 per minute, and the open-source Whisper weights are still freely downloadable from GitHub for self-hosting. So this is not a rename. It is a wider menu, with Whisper as one option on it.

The Detail Most Articles Miss: Why You Still Need whisper-1

Here is the single most useful thing in this guide, and it is the reason you cannot just blindly switch everything to the cheaper gpt-4o models.

The gpt-4o transcription models dropped word-level timestamps and subtitle outputs. According to OpenAI's speech-to-text guide, whisper-1 supports the json, text, srt, verbose_json, and vtt response formats, and the timestamp_granularities[] parameter for word and segment timing. The gpt-4o-transcribe and gpt-4o-mini-transcribe models support only json and plain text. Requesting verbose_json from a gpt-4o model returns an error telling you to use json or text instead.

What that means in practice:

If you need...	Use this model	Why
Cheapest bulk transcription	gpt-4o-mini-transcribe	$0.003/min, json/text only
Best accuracy, plain transcript	gpt-4o-transcribe	$0.006/min, json/text only
Speaker labels (who said what)	gpt-4o-transcribe-diarize	$0.006/min, diarized_json
SRT or VTT subtitle files	whisper-1	Only model that outputs them
Word-level timestamps	whisper-1	Only model with timestamp_granularities
Live captions / voice apps	gpt-realtime-whisper	Streaming, $0.017/min

So if your workflow is building captions for video, generating subtitles for a YouTube upload, or aligning a transcript to a timeline, the newer and "better" models will not give you the timing data you need. You either stay on whisper-1, or you run a gpt-4o transcript through a separate alignment step. For a lot of subtitle pipelines, staying on Whisper is simply the right call, and that is a legitimate reason it is not going anywhere.
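The decision table above can be written down as a small helper. This is a minimal sketch, not an official mapping: the function name choose_model and its flags are hypothetical, and the rates and response formats are simply the ones quoted in this article. The commented-out call at the bottom assumes the official openai Python SDK; check OpenAI's current docs before relying on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Choice:
    model: str
    response_format: str
    usd_per_min: float


def choose_model(subtitles=False, word_timestamps=False,
                 speakers=False, cheapest=False) -> Choice:
    """Pick a model from the article's table (hypothetical helper)."""
    needs_timing = subtitles or word_timestamps
    if needs_timing and speakers:
        # No single row in the table covers both; you would need two passes.
        raise ValueError("timing output and speaker labels need separate models")
    if subtitles:
        return Choice("whisper-1", "srt", 0.006)
    if word_timestamps:
        # Word timing is requested via timestamp_granularities on verbose_json.
        return Choice("whisper-1", "verbose_json", 0.006)
    if speakers:
        return Choice("gpt-4o-transcribe-diarize", "diarized_json", 0.006)
    if cheapest:
        return Choice("gpt-4o-mini-transcribe", "json", 0.003)
    return Choice("gpt-4o-transcribe", "json", 0.006)


choice = choose_model(subtitles=True)
print(choice.model, choice.response_format)

# Assumed usage with the openai SDK (not executed here):
# from openai import OpenAI
# client = OpenAI()
# with open("talk.mp3", "rb") as f:
#     result = client.audio.transcriptions.create(
#         model=choice.model, file=f, response_format=choice.response_format)
```

The ValueError branch is a design choice on my part: failing loudly seemed safer than silently dropping one of the two requirements.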

The 25MB and 25-Minute File Limits

Every OpenAI transcription model, old and new, shares the same two hard limits on a single API request:

25MB maximum file size. Files over 25MB are rejected outright.
1500 seconds (25 minutes) maximum audio duration. Longer audio returns a duration-exceeded error, confirmed by multiple developers in the OpenAI community forum.

The duration cap catches people off guard because a one-hour podcast compressed to MP3 can sit comfortably under 25MB on size yet still blow past the 25-minute limit. You have to satisfy both constraints.

What 25MB looks like across formats:

Format	Approximate duration at 25MB
MP3 at 128 kbps	~26 minutes
MP3 at 64 kbps	~52 minutes
WAV (16-bit, 16 kHz mono)	~13 minutes
M4A at 128 kbps	~26 minutes
MP4 video	3-10 minutes (varies widely)

Note that even where the file size allows 52 minutes (64 kbps MP3), the 1500-second duration cap still forces you to split anything over ~25 minutes. So in 2026 the duration limit, not the file size, is usually the binding constraint.
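To see which cap binds for a given bitrate, the arithmetic behind the table can be sketched as follows. I am assuming 25MB means 25,000,000 bytes and a constant bitrate; real files carry container overhead, so treat the output as approximate.

```python
MAX_BYTES = 25_000_000   # assumption: 25MB read as 25 * 10^6 bytes
MAX_SECONDS = 1500       # 25-minute duration cap


def longest_upload(kbps: float) -> tuple[float, str]:
    """Return (max seconds per request, which limit binds) for a bitrate."""
    size_limited_s = MAX_BYTES * 8 / (kbps * 1000)
    if size_limited_s < MAX_SECONDS:
        return size_limited_s, "size"
    return float(MAX_SECONDS), "duration"


for label, kbps in [("MP3 128 kbps", 128), ("MP3 64 kbps", 64),
                    ("WAV 16-bit 16 kHz mono", 256)]:
    seconds, limit = longest_upload(kbps)
    print(f"{label}: {seconds / 60:.1f} min, bound by {limit}")
```

The WAV row uses 256 kbps because 16 bits times 16,000 samples per second on one channel is 256,000 bits per second.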

How to get around the limits

Compress before uploading. For speech, converting a WAV to a 64 kbps MP3 cuts file size by roughly 75 percent for a 16 kHz mono source and by 90 percent or more for CD-quality stereo, with no meaningful accuracy loss, because 64 kbps preserves the frequency range that speech recognition actually uses. FFmpeg does this in one command.

Extract audio from video first. A one-hour MP4 might be 500MB, but its audio track as an MP3 can be well under 25MB. Never upload video to the transcription endpoint; the API ignores the video stream anyway, so you are just wasting bandwidth and probably hitting the size limit for nothing.
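The compress and extract steps can be combined into one FFmpeg call. Below is a minimal sketch that builds the command from Python; the helper names are hypothetical, the mono downmix (-ac 1) is my own assumption for speech, and it presumes ffmpeg is installed and on your PATH.

```python
import subprocess


def build_ffmpeg_cmd(src: str, dst: str, kbps: int = 64) -> list[str]:
    """Build an ffmpeg command that drops video and writes low-bitrate audio."""
    return [
        "ffmpeg", "-y",        # overwrite the output file if it exists
        "-i", src,             # input: audio or video
        "-vn",                 # discard any video stream
        "-ac", "1",            # downmix to mono (assumption: speech content)
        "-b:a", f"{kbps}k",    # target audio bitrate
        dst,                   # output format is inferred from the extension
    ]


def compress_for_upload(src: str, dst: str, kbps: int = 64) -> None:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_ffmpeg_cmd(src, dst, kbps), check=True)


print(" ".join(build_ffmpeg_cmd("interview.mp4", "interview.mp3")))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.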

Split long files at silence. For anything over 25 minutes you must chunk. Split at natural pauses rather than mid-sentence, or you get clipped words at every boundary that you then have to stitch back together. Aim for chunks around 20 to 24 minutes to stay safely under both caps.
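One way to plan the cuts, assuming you already have a sorted list of silence timestamps in seconds (for example from FFmpeg's silencedetect filter or any other detector): take the latest silence that still keeps the chunk under the cap, and fall back to a hard cut only when no pause is available. The function name and the 24-minute default are illustrative choices, not part of any API.

```python
def plan_cuts(total_s: float, silences: list[float],
              max_chunk_s: float = 1440.0) -> list[float]:
    """Return cut points so every chunk is at most max_chunk_s long."""
    cuts: list[float] = []
    start = 0.0
    while total_s - start > max_chunk_s:
        window_end = start + max_chunk_s
        # Silences that fall inside the current window, after the last cut.
        candidates = [t for t in silences if start < t <= window_end]
        # Prefer the latest pause; otherwise cut hard at the window edge.
        cut = max(candidates) if candidates else window_end
        cuts.append(cut)
        start = cut
    return cuts


# A one-hour file with pauses detected at these offsets (made-up values).
print(plan_cuts(3600, [1000, 1400, 2000, 2700, 3500]))
```

This greedy approach favors fewer, longer chunks. If you would rather have evenly sized chunks, pick the silence closest to a target length instead of the latest one.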

Let a service handle it. If you would rather not build a compression and chunking pipeline, ConvertAudioToText does the preprocessing automatically. You upload a file of any length and it handles extraction, chunking, and reassembly behind the scenes. To be straight with you: CATT runs on Deepgram and AssemblyAI, not on Whisper, so this is a "do the plumbing for you" recommendation, not a Whisper wrapper. If you specifically want Whisper output, the OpenAI API or self-hosting is your path.

Self-Hosting Whisper: Free Software, Real Infrastructure Cost

Because Whisper is open source, you can run it on your own GPU and pay OpenAI nothing per minute. The obvious question is why anyone pays $0.006 (or $0.003) per minute when the model is free to download. The answer is that the model is free; the compute and the engineering are not.

Whisper's large-v3 weights need a capable GPU to run at usable speeds. Rough cloud costs:

GPU	Cloud cost per hour	Whisper speed	Cost per audio hour
NVIDIA A100 (80GB)	~$1.50 - $3.00	~6-10x real-time	$0.15 - $0.50
NVIDIA A10G	~$0.75 - $1.20	~3-5x real-time	$0.15 - $0.40
NVIDIA T4	~$0.35 - $0.50	~1-2x real-time	$0.18 - $0.50
NVIDIA L4	~$0.50 - $0.80	~3-4x real-time	$0.13 - $0.27

The speed column is what determines real cost. An A100 transcribing at 8x real time gets through eight audio hours per GPU hour, so at $1.50 to $3.00 per GPU hour the per-audio-hour cost lands around $0.19 to $0.38. A T4 running near real time gets you closer to one audio hour per GPU hour.
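The last column of the table is just GPU price divided by transcription speed. A quick sketch of that arithmetic, using the rough figures from the table (the function names are mine):

```python
def cost_per_audio_hour(gpu_usd_per_hour: float, speed_x: float) -> float:
    """GPU rental cost to transcribe one hour of audio at speed_x real time."""
    return gpu_usd_per_hour / speed_x


def cost_range(gpu_low: float, gpu_high: float,
               speed_low: float, speed_high: float) -> tuple[float, float]:
    """Best case: cheap GPU at top speed. Worst case: pricey GPU at low speed."""
    return (cost_per_audio_hour(gpu_low, speed_high),
            cost_per_audio_hour(gpu_high, speed_low))


low, high = cost_range(1.50, 3.00, 6, 10)   # A100 row
print(f"A100: ${low:.2f} to ${high:.2f} per audio hour")
```

This covers GPU rental only; the overhead items listed next are not in the formula.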

But raw GPU rental is the easy part of the bill. Self-hosting also means you own:

Infrastructure. Provisioning, autoscaling, failover, and uptime. Not a deploy-and-forget setup.
Job orchestration. A queue, workers, retries, and status tracking. You are rebuilding the backend that a managed API gives you for free.
Monitoring. GPU jobs fail in their own special ways: out-of-memory errors, CUDA version mismatches, driver drift. These eat engineering time.
Storage and networking. Moving audio to GPU nodes and storing results adds cost that is easy to forget when you model it on a spreadsheet.

When does self-hosting actually win on cost? As a rough guide, only at sustained high volume:

Monthly volume	API cost (at $0.003/min, mini)	Self-host estimate	Better option
Under 100 hours	Under $18	$150 - $400+	API
100 - 500 hours	$18 - $90	$200 - $600	API
500 - 2,000 hours	$90 - $360	$300 - $800	Depends
Over 2,000 hours	$360+	$400 - $1,200	Self-host (if you have the team)

I will be honest about my own bias here: the break-even looks great on paper and far worse once you price an engineer's time to build and babysit the pipeline. Note also that I dropped the API column to the $0.003 gpt-4o-mini-transcribe rate, which moves the break-even further out than the old $0.006 math suggested. If you are under a few hundred hours a month, the managed API is almost always the cheaper total cost of ownership once you count people.
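To make the break-even shift concrete, here is a rough model, assuming self-hosting costs a fixed monthly overhead plus a GPU cost per audio hour. The $150 overhead and $0.13 per audio hour below are illustrative inputs taken from the low end of the tables above, not measurements, and engineer time is deliberately left out.

```python
from typing import Optional


def break_even_hours(api_usd_per_min: float,
                     gpu_usd_per_audio_hour: float,
                     fixed_usd_per_month: float) -> Optional[float]:
    """Monthly audio hours at which self-hosting matches the API bill.

    Returns None when the GPU cost per audio hour is at or above the API
    rate, since self-hosting can then never catch up.
    """
    margin = api_usd_per_min * 60 - gpu_usd_per_audio_hour
    if margin <= 0:
        return None
    return fixed_usd_per_month / margin


for label, rate in [("mini at $0.003", 0.003), ("whisper-1 at $0.006", 0.006)]:
    hours = break_even_hours(rate, 0.13, 150)
    print(label, "->", "never" if hours is None else f"{hours:,.0f} hours/month")
```

With these inputs the break-even moves from roughly 650 hours a month at the $0.006 rate to roughly 3,000 at the $0.003 rate, and it disappears entirely once GPU cost per audio hour reaches $0.18.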

OpenAI vs the Competition: Verified June 2026 Rates

Per-minute price is the visible number, but it rarely tells you total cost, because the cheap providers often bill extras separately. Here are current, verified pay-as-you-go rates for the major batch transcription APIs.

Service	Price per minute (batch)	Diarization	Streaming	Languages
OpenAI gpt-4o-mini-transcribe	$0.003	No	Yes (gpt-4o-transcribe)	99+
OpenAI whisper-1	$0.006	No	No	50+
OpenAI gpt-4o-transcribe-diarize	$0.006	Yes (included)	No	99+
AssemblyAI Universal	~$0.0035 ($0.21/hr Pro)	Add-on (+$0.02/hr)	Yes	99+
Deepgram Nova-3	$0.0043	Add-on (+$0.002/min)	Yes ($0.0077/min)	36+
Google Cloud STT v2	$0.016 (standard)	Yes	Yes	125+
AWS Transcribe	$0.024	Yes	Yes	100+
Azure Speech	$0.003 (batch) / $0.0167 (real-time)	Yes	Yes	100+

Sources: Deepgram pricing, AssemblyAI pricing, Google Cloud STT pricing, AWS Transcribe pricing, Azure Speech pricing.

What the table actually tells you

The cheap-vs-Whisper story flipped in 2026. Older comparisons claimed Deepgram undercut Whisper. At Deepgram Nova-3's current $0.0043 per minute for pre-recorded audio, it still beats the $0.006 whisper-1 rate, but it no longer beats OpenAI's own $0.003 gpt-4o-mini-transcribe. The genuinely cheapest credible batch options today are OpenAI's mini model, Azure's batch tier (both $0.003), and AssemblyAI's Universal model (around $0.0035 per minute, derived from its $0.21 per hour rate). Deepgram is competitive but not the price leader it once was for batch.

"Per minute" hides the add-ons. Both Deepgram and AssemblyAI bill speaker diarization, summarization, and other features as separate line items. AssemblyAI's base Universal rate is low, but Speaker Identification, sentiment, and summarization each add per-hour cost on top. Deepgram adds about $0.002 per minute for diarization. OpenAI's gpt-4o-transcribe-diarize, by contrast, folds speaker labels into the flat $0.006 rate with no surcharge, which can make it the cheaper option once you actually need diarization.

Enterprise providers charge more and bundle more. Google Cloud at $0.016 and AWS Transcribe at $0.024 cost roughly five to eight times the OpenAI mini rate, but they ship streaming, diarization, custom vocabulary, and real SLAs. AWS also volume-discounts down toward $0.0078 per minute at very high scale, and Google offers a $0.003 "dynamic batch" tier for non-urgent jobs. If you are already deep in GCP or AWS and value the integration and the SLA, the higher sticker price buys real things.

Language breadth still favors the big clouds and OpenAI. Google leads at 125+ languages, the gpt-4o models cover 99+, and AWS and Azure sit around 100. Deepgram's 36 is narrower. If you transcribe many languages from one API, OpenAI and Google are the strongest fits.

For a deeper side-by-side across every provider, see our speech-to-text API pricing comparison and the full best speech-to-text APIs ranking. For a head-to-head against Google specifically, read Whisper vs Google Cloud Speech.

When OpenAI Transcription Is the Right Choice

Despite the wider field, there are clear cases where OpenAI's audio models are the best fit.

A managed alternative: ConvertAudioToText pricing with a free tier and flat monthly plans, no per-minute metering

Multilingual batch transcription on a budget. With 99+ languages on the gpt-4o models and a $0.003 floor, transcribing interviews across French, Japanese, and Portuguese from a single API with consistent pricing is hard to beat. Most rivals either charge more for non-English or support fewer languages. You can try multilingual transcription and then translate the result without a separate service.

Subtitle and timestamp workflows (on whisper-1). As covered above, if you need SRT, VTT, or word timing, whisper-1 is one of the few low-cost APIs that emits subtitle files directly. That is a real, specific reason to use it.

Small-scale and prototype projects. Ten hours of audio costs $1.80 on the mini model. No subscription, no minimum, no infrastructure. For researchers, indie developers, and early-stage products, that is about as frictionless as transcription gets. The simple endpoint and good docs make a working prototype a quick job, and you can sanity-check the experience first with our free audio-to-text tool or URL-to-text.

Open-source escape hatch. Because Whisper is open source, you can start on the API for convenience and migrate identical-quality transcription to self-hosted Whisper later if your volume justifies it. Few proprietary services offer that continuity. For more on the no-API route, see our free audio-to-text converter guide.

Estimating Your Monthly OpenAI Transcription Bill

Costs at the two common rates, so you can pick the row that matches your model choice:

Use case	Monthly volume	At $0.003/min (mini)	At $0.006/min (whisper/transcribe)
Freelance journalist, interviews	10 hours	$1.80	$3.60
Small podcast team	20 hours	$3.60	$7.20
Research team, focus groups	50 hours	$9.00	$18.00
Content agency, client media	200 hours	$36.00	$72.00
Enterprise, call recordings	1,000 hours	$180.00	$360.00
Large media archive	5,000 hours	$900.00	$1,800.00

These cover transcription only. If you also need diarization, gpt-4o-transcribe-diarize bills at the same rate as the $0.006 column with no extra charge. If you need summaries, formatting, or subtitle generation layered on top, factor those in or use a platform like ConvertAudioToText that bundles transcription, summaries, and exports together at a flat monthly rate; see the pricing page for current plan details.
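If your volume is not in the table, the rows are easy to reproduce. A minimal sketch, assuming billing is strictly linear in audio minutes at the list rates quoted in this article:

```python
def monthly_bill(audio_hours: float, usd_per_min: float) -> float:
    """Transcription-only cost for a month, rounded to cents."""
    return round(audio_hours * 60 * usd_per_min, 2)


for hours in (10, 200, 5000):
    mini = monthly_bill(hours, 0.003)
    standard = monthly_bill(hours, 0.006)
    print(f"{hours:>5} h: ${mini:,.2f} (mini) vs ${standard:,.2f} (whisper/transcribe)")
```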

Frequently Asked Questions

How much does the OpenAI Whisper API cost per minute in 2026?

The legacy whisper-1 model still costs $0.006 per minute ($0.36 per hour), billed by the second on actual audio duration. But OpenAI now also offers gpt-4o-mini-transcribe at $0.003 per minute, which is half the Whisper price, and gpt-4o-transcribe at the same $0.006 with better accuracy. So "the Whisper API price" is no longer a single number. Check OpenAI's pricing page for the current official rates.

What is the difference between whisper-1 and gpt-4o-transcribe?

Both run on the same transcription endpoint, but gpt-4o-transcribe is the newer, more accurate model and supports streaming responses. The important catch is output formats: whisper-1 can return SRT, VTT, verbose_json, and word-level timestamps, while the gpt-4o models return only json or plain text. If you need subtitle files or precise word timing, you still have to use whisper-1. For plain transcripts or speaker-labeled output, the gpt-4o models are the better pick.

Is there a free tier for OpenAI transcription?

No. Every minute is billed; there is no perpetual free tier for any of OpenAI's audio models. New OpenAI accounts sometimes receive a small one-time credit that can be applied to transcription, but that is an introductory credit, not an ongoing free tier. For genuinely free transcription, you can self-host the open-source Whisper model (you pay for GPU instead of per minute) or use a free web tool for small jobs.

Can OpenAI do speaker diarization now?

Yes. The original Whisper API could not label speakers, but gpt-4o-transcribe-diarize adds automatic speaker detection and returns a diarized_json response with speaker labels and segment timestamps, at the same $0.006 per minute as standard transcription with no diarization surcharge. You can optionally supply short reference clips for up to four known speakers to improve labeling.

What is the file size limit for the OpenAI transcription API?

Every model shares a 25MB maximum file size and a 1500-second (25-minute) maximum audio duration per request. You must satisfy both, and the duration cap is usually the one that bites first, since a one-hour podcast can be under 25MB but is still well over 25 minutes. For longer files, extract the audio from video, compress to a lower-bitrate MP3, and split at silence into chunks under both limits.

Which is cheaper in 2026, OpenAI or Deepgram?

It depends on the OpenAI model. Deepgram Nova-3 at $0.0043 per minute for pre-recorded audio undercuts the $0.006 whisper-1 rate, but it does not beat OpenAI's $0.003 gpt-4o-mini-transcribe. Deepgram also charges roughly $0.002 per minute extra for diarization, whereas OpenAI's diarize model is the flat $0.006 with speaker labels included. So for cheap plain transcription, OpenAI's mini model wins; for streaming, Deepgram is stronger. Run the math for your exact feature set rather than comparing sticker prices alone.
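Running that math looks something like this. The base rates and diarization add-ons below are the figures from the comparison table in this article; the dictionary keys and function name are made up for illustration, and provider pricing changes often, so plug in current numbers.

```python
# USD per minute, batch transcription, as quoted in the table above.
BASE_RATE = {
    "openai-diarize": 0.006,
    "deepgram-nova-3": 0.0043,
    "assemblyai-universal": 0.21 / 60,   # billed at $0.21 per hour
}
# Extra USD per minute when speaker diarization is switched on.
DIARIZATION_ADDON = {
    "openai-diarize": 0.0,               # speaker labels are included
    "deepgram-nova-3": 0.002,
    "assemblyai-universal": 0.02 / 60,   # billed at $0.02 per hour
}


def effective_rate(provider: str, diarize: bool) -> float:
    """Per-minute rate including the diarization add-on when requested."""
    extra = DIARIZATION_ADDON[provider] if diarize else 0.0
    return BASE_RATE[provider] + extra


for name in BASE_RATE:
    print(f"{name}: ${effective_rate(name, diarize=True):.4f}/min with diarization")
```

With diarization on, Deepgram lands slightly above OpenAI's flat rate, while the AssemblyAI figure stays below both on these table numbers, which is exactly why the feature set matters more than the headline price.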

Can I use the OpenAI Whisper API for real-time transcription?

The legacy whisper-1 model is batch only and cannot do real time. However, OpenAI now offers gpt-realtime-whisper at $0.017 per minute and gpt-realtime-translate at $0.034 per minute for live, low-latency streaming, and gpt-4o-transcribe supports streaming responses. So the old "Whisper can't do live audio" rule now applies only to the specific whisper-1 model, not to OpenAI's transcription lineup as a whole. For the lowest-latency live captioning, Deepgram is still a strong alternative to benchmark against.

How does API transcription compare to hiring a human transcriptionist?

Human transcription services typically charge $1.00 to $3.00 per audio minute, so at $0.003 to $0.006 per minute the OpenAI API is roughly 170 to 1,000 times cheaper. Humans still win on hard audio with heavy accents, overlapping speakers, or significant noise, where their accuracy is hard to match. For clean, single-speaker recordings, the gap in accuracy is small and the cost difference is enormous, which is why most teams now use AI transcription for first drafts and reserve human review for the audio that genuinely needs it.

