
OpenAI Whisper API Pricing Per Minute in 2026
OpenAI's Whisper API has become one of the most widely discussed transcription services since its launch, largely because of its straightforward pricing and the reputation of the open-source Whisper model behind it. But understanding what you actually pay, what you get, and what you give up compared to other options requires a closer look than most summaries provide.
This guide covers OpenAI Whisper API pricing per minute in 2026: the real costs, file size limitations, self-hosting economics, and how Whisper stacks up against competing services from Google, AWS, and Deepgram.
Whisper API Pricing at a Glance
OpenAI keeps Whisper API pricing simple. There is one price, one model, and no complicated tier system to navigate.
| Detail | Value |
|---|---|
| Price per minute | $0.006 |
| Price per hour | $0.36 |
| Billing increment | Per second (rounded to nearest second) |
| Free tier | None |
| Volume discounts | None |
| Minimum charge | Based on actual audio duration |
| Model | `whisper-1` (based on open-source Whisper large-v2) |
That flat rate of $0.006 per minute applies regardless of how much audio you process. Whether you transcribe ten minutes of audio per month or ten thousand hours, the per-minute cost stays the same. OpenAI does not offer volume discounts, committed use pricing, or a free tier for the Whisper API.
For context, one hour of transcription costs $0.36. Ten hours costs $3.60. One hundred hours costs $36. The math is straightforward, which is one of Whisper's biggest selling points for teams that want predictable billing without surprises.
You pay only for the actual duration of audio in your file. If you submit a five-minute recording, you pay for five minutes regardless of how long the API takes to process it. Silence within the file still counts toward the duration since the API processes the entire audio stream.
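The billing rule is simple enough to sketch in a few lines. This is an illustration rather than OpenAI's actual billing logic: the function name `whisper_cost` is made up for the example, and the rounding step assumes the per-second rounding listed in the table above.

```python
from decimal import Decimal, ROUND_HALF_UP

PRICE_PER_MINUTE = Decimal("0.006")  # USD, the flat rate quoted in this guide


def whisper_cost(duration_seconds: float) -> Decimal:
    """Estimate the charge for one file, billed per second of audio."""
    # Assumption: the duration is rounded to the nearest whole second.
    billed_seconds = Decimal(str(duration_seconds)).quantize(
        Decimal("1"), rounding=ROUND_HALF_UP
    )
    return billed_seconds * PRICE_PER_MINUTE / Decimal(60)


if __name__ == "__main__":
    for label, seconds in [("5 minutes", 300), ("1 hour", 3600), ("100 hours", 360_000)]:
        print(f"{label}: ${whisper_cost(seconds):.4f}")
```

Using `Decimal` avoids floating-point drift when many small charges are added up.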
Visit OpenAI's pricing page for the latest official rates.
What You Get for $0.006 Per Minute
At that price point, the Whisper API delivers a capable transcription service backed by one of the most well-known speech recognition models in the world.
Included Features
Whisper large-v2 model. The API's `whisper-1` model is based on the open-source Whisper large-v2 checkpoint. You do not need to choose between model sizes or manage different accuracy tiers. Every request uses the same model.
57+ language support. Whisper handles transcription across dozens of languages, from English and Spanish to Hindi, Arabic, Japanese, and many others. Language support is one of Whisper's strongest advantages. The full list of supported languages is available in the Whisper GitHub repository.
Automatic language detection. If you do not specify the language, Whisper will detect it automatically from the first 30 seconds of audio. This works well for most recordings, though specifying the language when you know it will improve accuracy.
Word-level timestamps. The API can return timestamps for individual words, which is useful for subtitle generation, audio search, and aligning transcripts with video.
Translation to English. Beyond transcription, Whisper offers a translation mode that transcribes non-English audio directly into English text in a single step, without needing a separate translation service.
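To show roughly how these features map onto a request, here is a minimal sketch using the official `openai` Python package. The helper names (`build_transcription_params`, `transcribe`, `translate_to_english`) are invented for this example, and it assumes an `OPENAI_API_KEY` environment variable and the `whisper-1` model identifier.

```python
from typing import Any, Dict, Optional


def build_transcription_params(language: Optional[str] = None,
                               word_timestamps: bool = False) -> Dict[str, Any]:
    """Assemble keyword arguments for a transcription request (hypothetical helper)."""
    params: Dict[str, Any] = {"model": "whisper-1"}
    if language:
        # ISO 639-1 code such as "en" or "ja"; omit it to rely on auto-detection.
        params["language"] = language
    if word_timestamps:
        # Word-level timestamps are only returned with the verbose JSON format.
        params["response_format"] = "verbose_json"
        params["timestamp_granularities"] = ["word"]
    return params


def transcribe(path: str, **options: Any):
    """Send one audio file to the transcription endpoint."""
    from openai import OpenAI  # imported lazily; requires `pip install openai`

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(path, "rb") as audio_file:
        return client.audio.transcriptions.create(
            file=audio_file, **build_transcription_params(**options)
        )


def translate_to_english(path: str):
    """Transcribe non-English audio directly into English text."""
    from openai import OpenAI

    client = OpenAI()
    with open(path, "rb") as audio_file:
        return client.audio.translations.create(model="whisper-1", file=audio_file)
```

A call such as `transcribe("interview.mp3", language="ja", word_timestamps=True)` would request Japanese transcription with per-word timing; the file name is a placeholder.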
What You Do NOT Get
Understanding Whisper's limitations is just as important as knowing its features, especially when comparing it against other paid services.
No streaming or real-time transcription. The Whisper API is batch-only. You upload a complete audio file and receive the full transcript once processing finishes. There is no WebSocket endpoint, no partial results, and no live captioning capability. If you need real-time transcription for meetings or live broadcasts, Whisper is not the right tool.
No native speaker diarization. Whisper does not identify or label different speakers in a conversation. The transcript comes as a single stream of text without any indication of who said what. You can layer third-party diarization on top, but that adds complexity and cost.
No custom vocabulary or keyword boosting. If your audio contains specialized terminology, brand names, or uncommon words, you cannot upload a custom dictionary or boost specific keywords. The API's optional `prompt` parameter can nudge the model toward particular spellings, but it is a hint rather than a guaranteed vocabulary.
No sentiment analysis or topic detection. The API returns text and timestamps. It does not analyze the emotional tone, identify discussion topics, or provide any semantic analysis of the content.
Limited punctuation control. Whisper adds punctuation automatically, but you have little control over how it punctuates. For some languages, punctuation accuracy can be inconsistent.

The 25MB File Size Limit
One of the most common frustrations with the Whisper API is its 25MB file size limit per request. For short recordings or compressed audio, this is rarely an issue. But for longer files, especially video files or high-bitrate recordings, you will hit this ceiling quickly.
A 25MB limit translates roughly to:
| Format | Approximate Duration at 25MB |
|---|---|
| MP3 at 128kbps | ~26 minutes |
| MP3 at 64kbps | ~52 minutes |
| WAV (16-bit, 16kHz mono) | ~13 minutes |
| M4A at 128kbps | ~26 minutes |
| MP4 video | 3-10 minutes (varies widely) |
If your recordings exceed these durations, you need a workaround strategy.
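The durations in the table follow directly from bitrate arithmetic. The sketch below reproduces them under two assumptions: the 25MB limit is read as 25,000,000 bytes, and the audio is constant bitrate. The function names are illustrative.

```python
LIMIT_BYTES = 25 * 1_000_000  # assumption: "25MB" means decimal megabytes


def max_minutes(bitrate_kbps: float, limit_bytes: int = LIMIT_BYTES) -> float:
    """Minutes of constant-bitrate audio that fit under the size limit."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    return limit_bytes / bytes_per_second / 60


def pcm_bitrate_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Bitrate of uncompressed PCM audio such as WAV (header size ignored)."""
    return sample_rate_hz * bit_depth * channels / 1000


if __name__ == "__main__":
    print(f"MP3 128kbps: {max_minutes(128):.1f} min")
    print(f"MP3 64kbps: {max_minutes(64):.1f} min")
    wav_kbps = pcm_bitrate_kbps(16_000, 16, 1)
    print(f"WAV 16-bit 16kHz mono ({wav_kbps:.0f}kbps): {max_minutes(wav_kbps):.1f} min")
```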
Strategy 1: Compress Before Uploading
The simplest approach is converting your audio to a compressed format before sending it to the API. Converting an uncompressed, CD-quality WAV file to MP3 at 64kbps can reduce file size by 90% or more, usually with little or no noticeable effect on transcription accuracy. Tools like FFmpeg handle this conversion efficiently. For speech-only content, 64kbps MP3 is generally sufficient, since Whisper resamples audio to 16kHz before processing it.
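One way to script the conversion is to call FFmpeg from Python. This is a sketch that assumes `ffmpeg` is installed and on the PATH; the function names are invented for the example, and the mono, 16kHz settings are a speech-oriented assumption rather than a requirement of the API.

```python
import subprocess
from typing import List


def ffmpeg_compress_command(src: str, dst: str, bitrate: str = "64k") -> List[str]:
    """Build an FFmpeg command that re-encodes audio at a lower bitrate."""
    return [
        "ffmpeg",
        "-i", src,        # input file
        "-ac", "1",       # downmix to mono (assumed fine for speech)
        "-ar", "16000",   # resample to 16kHz
        "-b:a", bitrate,  # target audio bitrate
        dst,              # output format is inferred from the extension
    ]


def compress(src: str, dst: str, bitrate: str = "64k") -> None:
    """Run the conversion and raise CalledProcessError if FFmpeg fails."""
    subprocess.run(ffmpeg_compress_command(src, dst, bitrate), check=True)
```

Calling `compress("meeting.wav", "meeting.mp3")` would write a 64kbps mono MP3 next to the original; both file names are placeholders.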
Strategy 2: Split Long Files into Chunks
For recordings that exceed 25MB even after compression, splitting the file into smaller segments is the standard approach. The key consideration here is choosing split points carefully. Splitting mid-sentence will cause the API to return incomplete sentences at chunk boundaries, and you will need to stitch the results together intelligently. Splitting at natural silence points produces cleaner results.
Most audio processing libraries can detect silence gaps in recordings and use those as split points. Aim for chunks between 20 and 24MB to leave a small buffer below the limit.
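Choosing split points can be sketched as a small planning function. It assumes you already have a list of silence intervals in seconds, for example parsed from the output of FFmpeg's `silencedetect` filter, and a maximum chunk length derived from your bitrate (at 64kbps, 24MB is about 50 minutes). The function name and the hard-cut fallback are illustrative choices, not a standard algorithm.

```python
from typing import List, Tuple


def plan_split_points(silences: List[Tuple[float, float]],
                      total_seconds: float,
                      max_chunk_seconds: float) -> List[float]:
    """Choose cut times so that no chunk is longer than max_chunk_seconds.

    Each cut lands in the middle of the last silence gap that still fits in
    the current chunk. If no gap is available, fall back to a hard cut.
    """
    if max_chunk_seconds <= 0:
        raise ValueError("max_chunk_seconds must be positive")
    midpoints = sorted((start + end) / 2 for start, end in silences)
    cuts: List[float] = []
    chunk_start = 0.0
    while total_seconds - chunk_start > max_chunk_seconds:
        limit = chunk_start + max_chunk_seconds
        candidates = [m for m in midpoints if chunk_start < m <= limit]
        cut = candidates[-1] if candidates else limit
        cuts.append(cut)
        chunk_start = cut
    return cuts


if __name__ == "__main__":
    gaps = [(290.0, 292.0), (610.0, 612.0), (1195.0, 1197.0), (1500.0, 1502.0)]
    print(plan_split_points(gaps, total_seconds=2000.0, max_chunk_seconds=1200.0))
```

When stitching results back together, add each chunk's start time to its timestamps so they line up with the original recording.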
Strategy 3: Extract Audio from Video
If you are working with video files, extract the audio track first rather than uploading the full video. A one-hour MP4 video might be 500MB, while its audio track encoded as a 64kbps MP3 is about 29MB. That is still slightly over the 25MB limit, so encode at a lower bitrate (48kbps gives roughly 22MB per hour) or split the track. The extraction step is fast and avoids sending video data that the API ignores anyway.
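Whether an extracted track fits under the limit depends on its length and bitrate. The sketch below picks the highest common MP3 bitrate that keeps a track under a 24MB target and builds an FFmpeg command that drops the video stream. The helper names, the 24MB target, and the bitrate list are assumptions made for the example.

```python
from typing import List, Optional

COMMON_MP3_KBPS = (64, 56, 48, 40, 32)  # candidate bitrates, highest first


def max_bitrate_kbps(duration_minutes: float, target_mb: float = 24.0) -> float:
    """Highest constant bitrate (kbps) that keeps the audio under target_mb."""
    total_bits = target_mb * 1_000_000 * 8
    return total_bits / (duration_minutes * 60) / 1000


def pick_bitrate(duration_minutes: float, target_mb: float = 24.0) -> Optional[int]:
    """Return a bitrate that fits, or None if the track must be split instead."""
    ceiling = max_bitrate_kbps(duration_minutes, target_mb)
    for rate in COMMON_MP3_KBPS:
        if rate <= ceiling:
            return rate
    return None


def extract_audio_command(video: str, audio: str, bitrate_kbps: int) -> List[str]:
    """FFmpeg command that keeps only the audio stream (-vn drops the video)."""
    return ["ffmpeg", "-i", video, "-vn", "-ac", "1", "-b:a", f"{bitrate_kbps}k", audio]


if __name__ == "__main__":
    rate = pick_bitrate(60)
    print(rate, extract_audio_command("talk.mp4", "talk.mp3", rate))
```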
Strategy 4: Use a Service That Handles This for You
Platforms like ConvertAudioToText handle file preprocessing automatically. You upload files of any size, and the platform manages compression, chunking, and reassembly behind the scenes. This eliminates the engineering overhead of building your own preprocessing pipeline.
Self-Hosting Whisper: Free But Not Free
Since Whisper is open source, you can run it on your own infrastructure without paying OpenAI anything. The model weights are freely available, and the codebase is well-documented. This raises an obvious question: why pay $0.006 per minute when you can run it for free?
The answer comes down to GPU costs, engineering time, and scale.
GPU Costs for Self-Hosting
Whisper's large-v3 model requires a capable GPU to run at reasonable speeds. Here is what the most common options cost across major cloud providers:
| GPU | Cloud Cost (per hour) | Whisper Processing Speed | Cost Per Audio Hour |
|---|---|---|---|
| NVIDIA A100 (80GB) | ~$1.50 - $3.00 | ~6-10x real-time | $0.15 - $0.50 |
| NVIDIA A10G | ~$0.75 - $1.20 | ~3-5x real-time | $0.15 - $0.40 |
| NVIDIA T4 | ~$0.35 - $0.50 | ~1-2x real-time | $0.18 - $0.50 |
| NVIDIA L4 | ~$0.50 - $0.80 | ~3-4x real-time | $0.13 - $0.27 |
Processing speed matters because it determines how many audio hours you can transcribe per GPU hour. An A100 can process a one-hour recording in about six to ten minutes, meaning you can transcribe roughly six to ten hours of audio per GPU hour. A T4 processes closer to real-time, so one GPU hour yields roughly one to two hours of transcribed audio.
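The last column of the table is simply the GPU's hourly price divided by its real-time speed factor. A short sketch, using the ranges from the table and illustrative function names:

```python
from typing import Tuple

GPU_OPTIONS = {
    # name: ((low $/hour, high $/hour), (slow factor, fast factor))
    "NVIDIA A100 (80GB)": ((1.50, 3.00), (6, 10)),
    "NVIDIA A10G": ((0.75, 1.20), (3, 5)),
    "NVIDIA T4": ((0.35, 0.50), (1, 2)),
    "NVIDIA L4": ((0.50, 0.80), (3, 4)),
}


def cost_per_audio_hour(gpu_cost_per_hour: float, realtime_factor: float) -> float:
    """GPU dollars spent to transcribe one hour of audio."""
    return gpu_cost_per_hour / realtime_factor


def cost_range(name: str) -> Tuple[float, float]:
    """Best case (cheap and fast) and worst case (expensive and slow)."""
    (low_price, high_price), (slow, fast) = GPU_OPTIONS[name]
    return cost_per_audio_hour(low_price, fast), cost_per_audio_hour(high_price, slow)


if __name__ == "__main__":
    for gpu in GPU_OPTIONS:
        best, worst = cost_range(gpu)
        print(f"{gpu}: ${best:.2f} - ${worst:.2f} per audio hour")
```

These figures cover GPU time only and assume the GPU is busy the whole hour; idle time raises the effective cost.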
The Hidden Costs
Raw GPU rental is only part of the equation. Self-hosting Whisper also requires:
Infrastructure management. You need to provision servers, manage scaling, handle failover, and maintain uptime. This is not a set-and-forget deployment.
Queue and job management. Processing audio at scale requires a job queue, worker management, retry logic, and status tracking. You are essentially building the backend infrastructure that managed APIs provide out of the box.
Model updates. When OpenAI releases improved model weights, you need to update your deployment. Testing, validating, and rolling out model updates takes engineering time.
Monitoring and debugging. GPU workloads fail in ways that are different from typical web services. Out-of-memory errors, driver issues, and CUDA version conflicts are common headaches.
Storage and networking. Moving audio files to your GPU servers and storing results adds bandwidth and storage costs that are easy to overlook.
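To give a sense of the plumbing involved, here is a minimal sketch of one small piece, retry logic with exponential backoff. The function name and default values are assumptions for illustration; a production worker would also need a queue, status tracking, and narrower exception handling.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retries(job: Callable[[], T],
                     max_attempts: int = 3,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> T:
    """Run a job, retrying failures with delays of base_delay, 2x, 4x, ..."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` applies.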
When Self-Hosting Saves Money
Self-hosting starts making financial sense when you process large, consistent volumes. The break-even point depends on your infrastructure choices, but as a rough guide:
| Monthly Volume | API Cost (at $0.006/min) | Self-Hosting Estimate | Better Option |
|---|---|---|---|
| Under 100 hours | Up to $36 | $150 - $400+ | API |
| 100 - 500 hours | $36 - $180 | $200 - $600 | API (usually) |
| 500 - 2,000 hours | $180 - $720 | $300 - $800 | Self-hosting (depends) |
| Over 2,000 hours | $720+ | $400 - $1,200 | Self-hosting |
These estimates assume you have engineering resources available to build and maintain the infrastructure. If you need to hire or divert engineers to manage a self-hosted Whisper deployment, the break-even point shifts significantly higher.
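A rough break-even estimate can be worked out from two numbers: your fixed monthly overhead for self-hosting and your GPU cost per audio hour. Both inputs in the sketch below are hypothetical values you would replace with your own, and the function name is illustrative.

```python
from typing import Optional

API_COST_PER_AUDIO_HOUR = 0.36  # $0.006 per minute x 60


def break_even_hours(fixed_monthly_cost: float,
                     gpu_cost_per_audio_hour: float) -> Optional[float]:
    """Monthly audio hours at which self-hosting matches the API bill.

    Returns None when the GPU cost per audio hour is not below the API rate,
    because self-hosting then never catches up.
    """
    saving_per_hour = API_COST_PER_AUDIO_HOUR - gpu_cost_per_audio_hour
    if saving_per_hour <= 0:
        return None
    return fixed_monthly_cost / saving_per_hour


if __name__ == "__main__":
    # Hypothetical inputs: $300/month overhead, $0.20 of GPU time per audio hour.
    print(break_even_hours(300.0, 0.20))
```

With those example inputs the break-even lands a little under 1,900 hours per month. Engineering salaries are not included, so the real threshold is higher.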
Self-Hosting Pros and Cons
| Pros | Cons |
|---|---|
| No per-minute charges | Significant upfront engineering effort |
| Full control over the model and pipeline | GPU costs add up quickly at low volumes |
| Data stays on your infrastructure | You handle scaling, monitoring, and updates |
| Can customize preprocessing and postprocessing | No SLA or support from OpenAI |
| No file size limits | Requires GPU expertise on your team |
Whisper vs Paid Alternatives
Price per minute is the most visible metric, but it tells an incomplete story. What matters is the total cost of achieving the transcription quality and features your workflow requires.
Pricing Comparison Table
| Service | Price Per Minute | Speaker Diarization | Streaming | Languages | Custom Vocabulary |
|---|---|---|---|---|---|
| OpenAI Whisper API | $0.006 | No | No | 57+ | No |
| Deepgram (Nova-2) | $0.0043 | Yes | Yes | 36+ | Yes |
| Google Cloud STT | $0.016 | Yes | Yes | 125+ | Yes (phrases) |
| AWS Transcribe | $0.024 | Yes | Yes | 37+ | Yes |
| Azure Speech | $0.016 | Yes | Yes | 100+ | Yes |
| AssemblyAI | $0.0065 | Yes | Yes | 20+ | No |
What the Table Reveals
Whisper is cheap but not the cheapest. Deepgram's Nova-2 model undercuts Whisper at $0.0043 per minute while also including speaker diarization, streaming, custom vocabulary, and features like sentiment analysis and topic detection. For applications that need more than raw transcription, Deepgram often provides better value. You can read more about how these services compare in our speech-to-text API pricing comparison.
Enterprise services charge more but deliver more. Google Cloud Speech-to-Text and AWS Transcribe cost roughly 2.7 to 4 times as much as Whisper ($0.016 and $0.024 per minute versus $0.006), but they include streaming support, speaker diarization, custom vocabulary, and enterprise-grade SLAs. For production applications where downtime costs real money, those extras matter.
Missing features have hidden costs. If you need speaker diarization with Whisper, you will need to add a third-party diarization service or library, which adds complexity, latency, and potentially additional cost. The "savings" from Whisper's lower per-minute rate can evaporate when you factor in the integrations needed to match what other services include natively.
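To see how add-ons change the picture, you can compare monthly bills using the per-minute prices from the table. In this sketch the add-on rate for third-party diarization is a made-up placeholder, not a quoted price, and the function name is illustrative.

```python
from typing import Dict, Optional

PRICE_PER_MINUTE = {  # USD, taken from the comparison table
    "OpenAI Whisper API": 0.006,
    "Deepgram (Nova-2)": 0.0043,
    "Google Cloud STT": 0.016,
    "AWS Transcribe": 0.024,
    "Azure Speech": 0.016,
    "AssemblyAI": 0.0065,
}


def monthly_costs(hours: float,
                  addon_per_minute: Optional[Dict[str, float]] = None) -> Dict[str, float]:
    """Monthly bill per service, plus any per-minute add-on costs you supply."""
    addons = addon_per_minute or {}
    minutes = hours * 60
    return {
        name: round(minutes * (price + addons.get(name, 0.0)), 2)
        for name, price in PRICE_PER_MINUTE.items()
    }


if __name__ == "__main__":
    # Hypothetical: a separate diarization step adds $0.002 per minute to Whisper.
    costs = monthly_costs(100, {"OpenAI Whisper API": 0.002})
    for name, cost in sorted(costs.items(), key=lambda item: item[1]):
        print(f"{name}: ${cost:.2f}")
```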
For a detailed comparison between Whisper and Google's offering, see our Whisper vs Google Cloud Speech analysis.

When Whisper Is the Right Choice
Despite its limitations, there are scenarios where the Whisper API is genuinely the best option.
Multilingual Batch Transcription
If you regularly transcribe audio in many different languages and do not need real-time results, Whisper's combination of broad language support and low cost is hard to beat. A media company transcribing interviews in French, Japanese, and Portuguese can use a single API with consistent pricing across all languages. Some competitors charge more for certain languages or support fewer of them.
Budget-Friendly Small-Scale Projects
For startups, researchers, and individual developers processing modest volumes of audio, Whisper's $0.006 per minute rate keeps costs minimal. Transcribing ten hours of audio costs $3.60 with zero infrastructure to manage. There is no minimum commitment, no subscription, and no upfront cost beyond your OpenAI API credits.
Open-Source Preference and Data Control
Organizations that prefer open-source technology can use the Whisper API as a convenient entry point and migrate to self-hosted Whisper later if volumes grow. Since the API is based on an openly available Whisper model, running the same model version yourself should produce closely matching transcripts. This is a flexibility that proprietary services cannot match. For developers exploring free transcription options, Whisper's open-source availability is a significant advantage.
Simple Transcription Without Extras
If your use case is straightforward (upload a file, get a transcript), Whisper delivers exactly that, and you do not pay for features you do not need. You are not subsidizing a diarization engine, a streaming infrastructure, or a custom vocabulary system that you will never use.
Prototyping and Development
Whisper is an excellent choice for building and testing transcription features before committing to a more feature-rich (and expensive) service. The simple API, predictable pricing, and well-documented integration guide make it easy to get a working prototype running quickly.
Estimating Your Monthly Costs
To help you budget, here are some common use cases with estimated monthly costs using the Whisper API.
| Use Case | Monthly Volume | Monthly Cost |
|---|---|---|
| Freelance journalist transcribing interviews | 10 hours | $3.60 |
| Small podcast team transcribing episodes | 20 hours | $7.20 |
| Research team processing focus groups | 50 hours | $18.00 |
| Content agency transcribing client media | 200 hours | $72.00 |
| Enterprise processing call recordings | 1,000 hours | $360.00 |
| Large-scale media archive digitization | 5,000 hours | $1,800.00 |
These costs cover only the Whisper API usage. If you need additional processing such as diarization, formatting, or subtitle generation, factor in the cost of those additional services or consider a platform like ConvertAudioToText that bundles these features together.
Frequently Asked Questions
Is there a free tier for the OpenAI Whisper API?
No. OpenAI does not offer a free tier for the Whisper API. Every minute of audio processed is billed at $0.006. New OpenAI accounts sometimes receive a small amount of free API credits that can be used toward Whisper, but this is a one-time introductory credit, not an ongoing free tier. If you need free transcription, the open-source Whisper model can be self-hosted at no software cost, though you will need to provide your own GPU infrastructure.
How does Whisper API pricing compare to hiring human transcriptionists?
Human transcription services typically charge between $1.00 and $3.00 per audio minute for standard turnaround, with rush rates going higher. At $0.006 per minute, the Whisper API is roughly 170 to 500 times cheaper than human transcription. However, human transcriptionists still deliver higher accuracy on difficult audio, including recordings with heavy accents, overlapping speakers, or significant background noise. For clean, single-speaker audio, Whisper's accuracy is competitive with human transcription at a fraction of the cost.
Can I use Whisper for real-time transcription of live meetings?
No. The Whisper API only supports batch processing. You must upload a complete audio file and wait for the full transcript to be returned. There is no streaming endpoint, no WebSocket support, and no way to receive partial results as audio is being recorded. For real-time meeting transcription, you will need a service that supports streaming, such as Deepgram, Google Cloud Speech-to-Text, or AssemblyAI.
Does OpenAI offer volume discounts for high-usage Whisper API customers?
As of early 2026, OpenAI does not offer publicly listed volume discounts for the Whisper API. The $0.006 per minute rate is the same whether you process one hour or ten thousand hours per month. Enterprise customers with very high volumes may be able to negotiate custom pricing by contacting OpenAI's sales team directly, but there is no self-serve discount structure.
What happens if my audio file exceeds the 25MB size limit?
The API will reject the request with an error. You need to reduce the file size before uploading. The most effective approaches are compressing the audio to a lower-bitrate MP3 format, extracting audio from video files to eliminate the video data, or splitting long recordings into smaller chunks. Most production implementations automate this preprocessing step so that end users never encounter the limit directly.
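A preprocessing step usually starts with a size check that decides what to do with each file. The sketch below is one possible shape for that check; the function names, the extension lists, and the reading of 25MB as 25,000,000 bytes are assumptions made for the example.

```python
import os

MAX_UPLOAD_BYTES = 25 * 1_000_000  # conservative reading of the 25MB limit
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".webm"}
UNCOMPRESSED_EXTENSIONS = {".wav", ".flac"}


def choose_action(filename: str, size_bytes: int) -> str:
    """Decide how to handle a file before sending it to the API."""
    if size_bytes <= MAX_UPLOAD_BYTES:
        return "upload"
    extension = os.path.splitext(filename)[1].lower()
    if extension in VIDEO_EXTENSIONS:
        return "extract_audio"  # drop the video stream first
    if extension in UNCOMPRESSED_EXTENSIONS:
        return "compress"       # re-encode at a lower bitrate
    return "split"              # already compressed, so cut into chunks


def preflight(path: str) -> str:
    """Same decision, reading the size from disk."""
    return choose_action(path, os.path.getsize(path))
```

A file that comes back as `extract_audio` or `compress` should be checked again after conversion, since the result may still need splitting.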