Speech-to-Text API Pricing in 2026: Google Cloud vs AWS Transcribe vs OpenAI Whisper vs Deepgram

ConvertAudioToText Team | February 23, 2026 | 16 min read

The Speech-to-Text API Pricing Landscape in 2026

Building a product that turns audio into text means choosing a speech-to-text API. And while accuracy and latency get all the attention during proof-of-concept, pricing is what determines viability at scale. A difference of $0.002 per minute sounds trivial until you are processing 10,000 hours per month and that difference is costing you $1,200 extra.

This guide breaks down speech-to-text API pricing across the four major providers in 2026: Google Cloud Speech-to-Text, AWS Transcribe, OpenAI Whisper API, and Deepgram. We tested each provider, reviewed their pricing documentation, and calculated real-world costs at multiple volume tiers so you can make an informed decision before writing a single line of integration code.

If you are evaluating these providers individually, we have dedicated deep dives on Google Cloud Speech-to-Text pricing, AWS Transcribe pricing, and OpenAI Whisper API pricing.

Complete Pricing Comparison Table

Here is the full breakdown of speech-to-text API pricing across all four providers as of February 2026.

| Feature | Google Cloud STT | AWS Transcribe | OpenAI Whisper API | Deepgram |
|---|---|---|---|---|
| Price/Min (Standard) | $0.024 | $0.024 | $0.006 | $0.0043 |
| Price/Min (Enhanced) | $0.036 | $0.024 (medical: $0.075) | N/A (single model) | $0.0059 (Nova-3) |
| Free Tier | 60 min/month | 60 min/month (12 months) | None | $200 credit |
| Streaming Support | Yes | Yes | No (batch only) | Yes |
| Languages | 125+ | 100+ | 57 | 36 |
| Speaker Detection | Yes | Yes | No (via post-processing) | Yes |
| Custom Vocabulary | Yes | Yes | Via prompting | Yes |
| Best For | Enterprise, multilingual | AWS-native apps, medical | Cost-sensitive batch jobs | High-volume, real-time |

A few things stand out immediately. Deepgram has the lowest standard rate at $0.0043 per minute. OpenAI Whisper API is the next cheapest at $0.006, but it lacks streaming and native speaker diarization. Deepgram offers the best balance between low pricing and full-featured real-time capabilities. Google Cloud and AWS Transcribe are priced nearly identically for standard transcription but diverge significantly in their enhanced and specialized tiers.

Understanding Each Provider's Pricing Model

Google Cloud Speech-to-Text Pricing

Google Cloud Speech-to-Text uses a tiered pricing model based on the recognition model and features you enable.

Base rates are calculated in 15-second increments, rounded up. If your audio clip is 16 seconds long, you pay for 30 seconds. This rounding affects cost calculations meaningfully when processing many short audio clips.

  • Standard recognition: $0.024 per minute
  • Enhanced recognition (Chirp 2): $0.036 per minute
  • Medical transcription: $0.078 per minute
  • Data logging opt-in discount: 33% reduction on standard models

The free tier gives you 60 minutes of standard recognition per month, indefinitely. This is genuinely useful for development and testing but disappears quickly in production.

Most features carry no surcharge. Enabling speaker diarization, automatic punctuation, or word-level timestamps does not change the base price. The model choice does: switching to the enhanced Chirp 2 model for better accuracy increases cost by 50%.

AWS Transcribe Pricing

AWS Transcribe keeps pricing straightforward with a single rate for standard transcription and volume-based discounts at higher tiers.

  • Standard transcription: $0.024 per minute
  • After 250K minutes/month: $0.015 per minute
  • After 1M minutes/month: $0.0102 per minute
  • Medical transcription: $0.075 per minute
  • Call analytics: $0.024 per minute (streaming), $0.012 per minute (post-call)

AWS rounds up to the nearest second rather than 15-second increments, making it slightly more cost-efficient than Google Cloud for short clips. The free tier provides 60 minutes per month for the first 12 months only. After that, every minute costs money.
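The effect of the billing increment on short clips can be sketched as follows. This is a minimal illustration that assumes each clip is billed independently and rounded up to a whole increment; `billed_cost` is a hypothetical helper, not part of either provider's SDK.

```python
import math

def billed_cost(duration_s: float, rate_per_min: float, increment_s: int) -> float:
    """Cost of one clip when the duration is rounded up to a whole increment."""
    billed_seconds = math.ceil(duration_s / increment_s) * increment_s
    return billed_seconds / 60 * rate_per_min

# A 16-second clip at the $0.024/min standard rate quoted for both providers.
google = billed_cost(16, 0.024, increment_s=15)  # billed as 30 seconds
aws = billed_cost(16, 0.024, increment_s=1)      # billed as 16 seconds
print(f"Google: ${google:.4f}  AWS: ${aws:.4f}")
```

Under these assumptions the 16-second clip costs nearly twice as much with 15-second rounding, which is why the difference matters mainly for workloads dominated by short utterances.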

The volume discounts are where AWS becomes competitive. If you are processing more than 250,000 minutes per month (roughly 4,167 hours), the price drops to $0.015 per minute, a 37.5% discount. At over a million minutes per month, it drops further to $0.0102, less than half of Google Cloud's standard rate, though still above the list prices of OpenAI Whisper and Deepgram.
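A tiered bill can be estimated with a short calculation. The sketch below assumes graduated pricing, meaning each slice of usage is charged at its own tier's rate rather than the whole volume being repriced; `AWS_TIERS` and `tiered_cost` are illustrative names, and the rates are the ones listed above.

```python
# (upper bound of the tier in minutes, rate per minute); None means no upper bound.
AWS_TIERS = [(250_000, 0.024), (1_000_000, 0.015), (None, 0.0102)]

def tiered_cost(minutes: float, tiers=AWS_TIERS) -> float:
    """Charge each slice of usage at its own tier's rate (graduated pricing)."""
    total, lower = 0.0, 0
    for upper, rate in tiers:
        if upper is None or minutes <= upper:
            total += (minutes - lower) * rate  # final, partially filled tier
            break
        total += (upper - lower) * rate        # tier fully consumed
        lower = upper
    return total

print(f"${tiered_cost(600_000):,.2f}")    # 10,000 hours per month
print(f"${tiered_cost(1_200_000):,.2f}")  # 20,000 hours per month
```

For 600,000 minutes this gives 250,000 x $0.024 plus 350,000 x $0.015, so the blended rate sits between the first two tier prices rather than at the lower one.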

OpenAI Whisper API Pricing

OpenAI's API pricing for Whisper is the simplest of all four providers.

  • Whisper large-v3: $0.006 per minute
  • No free tier
  • No volume discounts
  • No streaming

That is it. There are no tiers, no feature surcharges, and no model variants to choose between. You pay $0.006 per minute of audio processed, billed per second of input audio.

The catch is what you do not get. Whisper API is batch-only with no real-time streaming support. There is no native speaker diarization, no dedicated custom vocabulary feature (only prompt-based hints), and no word-level confidence scores. You send audio, you get text back. For many applications, that is perfectly sufficient. For others, the missing features mean you need to build or buy additional processing layers.

Deepgram Pricing

Deepgram's pricing varies based on the model and whether you need pre-recorded or streaming transcription.

  • Nova-2 (pre-recorded): $0.0043 per minute
  • Nova-2 (streaming): $0.0059 per minute
  • Nova-3 (pre-recorded): $0.0059 per minute
  • Nova-3 (streaming): $0.0065 per minute
  • Whisper Cloud (pre-recorded): $0.0048 per minute
  • Pay-as-you-go credit: $200 free to start

Deepgram's pricing is consistently the lowest among full-featured providers. Even the premium Nova-3 model for streaming comes in at $0.0065 per minute, which is still cheaper than Google Cloud's standard model.

The $200 starting credit is generous enough to process approximately 46,500 minutes (775 hours) of audio on the Nova-2 pre-recorded model, making it one of the best free tiers for serious evaluation.

Cost at Scale: Real-World Calculations

Pricing per minute is useful for comparison, but what matters is the actual invoice at the end of the month. Here are the real costs at three volume tiers that represent common production workloads.

100 Hours Per Month (Startup/SMB)

This volume is typical for a transcription SaaS serving a few hundred active users, a podcast network processing weekly episodes, or a small call center.

| Provider | Model | Monthly Cost |
|---|---|---|
| Google Cloud STT | Standard | $144.00 |
| Google Cloud STT | Enhanced (Chirp 2) | $216.00 |
| AWS Transcribe | Standard | $144.00 |
| OpenAI Whisper | large-v3 | $36.00 |
| Deepgram | Nova-2 (pre-recorded) | $25.80 |
| Deepgram | Nova-3 (pre-recorded) | $35.40 |

At 100 hours per month, Deepgram Nova-2 costs 82% less than Google Cloud or AWS standard. OpenAI Whisper at $36.00 is the next-cheapest provider, though Deepgram Nova-3 narrowly undercuts it at $35.40, and Whisper lacks streaming and diarization. The gap between Deepgram and the hyperscaler pricing is already meaningful at this volume.
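The table can be reproduced for any volume with a few lines of code. This is a sketch using the list prices quoted in this article; `RATES_PER_MIN` and `monthly_costs` are illustrative names, and the flat AWS rate deliberately ignores volume tiers, which only matter above 250,000 minutes.

```python
# Per-minute list prices as quoted in this article (not fetched from any API).
RATES_PER_MIN = {
    "Google Cloud STT (Standard)": 0.024,
    "Google Cloud STT (Chirp 2)": 0.036,
    "AWS Transcribe (Standard)": 0.024,
    "OpenAI Whisper": 0.006,
    "Deepgram Nova-2 (pre-recorded)": 0.0043,
    "Deepgram Nova-3 (pre-recorded)": 0.0059,
}

def monthly_costs(hours: float) -> dict[str, float]:
    """Flat-rate monthly cost per provider for a given number of audio hours."""
    minutes = hours * 60
    return {name: round(minutes * rate, 2) for name, rate in RATES_PER_MIN.items()}

costs = monthly_costs(100)
for name, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{name:<32} ${cost:>10,.2f}")
```

Changing the argument to `monthly_costs` is enough to rerun the comparison for your own volume, provided you stay below the AWS discount threshold.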

1,000 Hours Per Month (Growth Stage)

A growing SaaS product, a mid-size call center, or a media company processing content at scale.

| Provider | Model | Monthly Cost |
|---|---|---|
| Google Cloud STT | Standard | $1,440.00 |
| Google Cloud STT | Enhanced (Chirp 2) | $2,160.00 |
| AWS Transcribe | Standard | $1,440.00 |
| OpenAI Whisper | large-v3 | $360.00 |
| Deepgram | Nova-2 (pre-recorded) | $258.00 |
| Deepgram | Nova-3 (pre-recorded) | $354.00 |

At 1,000 hours the pattern holds. Google Cloud and AWS are running at $1,440 per month while Deepgram processes the same volume for $258. That is $14,184 saved per year by choosing Deepgram Nova-2 over Google Cloud standard. For a startup watching its burn rate, that is a meaningful difference.

10,000 Hours Per Month (Enterprise)

Enterprise-scale processing. Think global call centers, large media archives, compliance recording systems, or a transcription API serving thousands of developers.

| Provider | Model | Monthly Cost |
|---|---|---|
| Google Cloud STT | Standard | $14,400.00 |
| Google Cloud STT | Enhanced (Chirp 2) | $21,600.00 |
| AWS Transcribe | Standard (tiered) | $11,250.00 |
| OpenAI Whisper | large-v3 | $3,600.00 |
| Deepgram | Nova-2 (pre-recorded) | $2,580.00 |
| Deepgram | Nova-3 (pre-recorded) | $3,540.00 |

At enterprise scale, AWS volume discounts finally kick in: the first 250,000 minutes bill at $0.024 and the remaining 350,000 at $0.015, bringing the total to $11,250 (from $14,400 without discounts). But Deepgram Nova-2 still wins at $2,580, saving over $8,600 per month compared to discounted AWS pricing and over $11,800 per month compared to Google Cloud standard.

OpenAI Whisper at $3,600 is competitive on price but remember: no streaming, no diarization, no custom vocabulary. At 10,000 hours per month, you almost certainly need those features.

Hidden Costs You Should Know

The per-minute rate on each provider's pricing page tells only part of the story. Several additional costs can significantly affect your total spend.

Data Transfer and Egress Fees

Google Cloud and AWS charge for data egress when you move transcription results out of their cloud. If your application runs outside their ecosystem, expect to pay:

  • Google Cloud: $0.12 per GB for the first 1 TB of egress per month
  • AWS: $0.09 per GB for the first 10 TB of egress per month
  • OpenAI: No egress fees (results delivered via API response)
  • Deepgram: No egress fees

Transcript data is small (a few KB per hour of audio), so egress on results is negligible. But if you are uploading audio files to cloud storage first, the ingress of audio files and any intermediate processing adds up. A one-hour WAV file at CD quality is approximately 635 MB. Processing 10,000 hours means ingesting roughly 6.35 TB of audio per month.
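The size arithmetic for uncompressed audio is easy to check. The sketch below assumes CD-quality PCM (44.1 kHz, 16-bit, stereo) and decimal units; `pcm_bytes` is an illustrative helper.

```python
def pcm_bytes(seconds: float, sample_rate: int = 44_100,
              bit_depth: int = 16, channels: int = 2) -> float:
    """Raw PCM payload size; the WAV header adds only a few dozen bytes."""
    return seconds * sample_rate * (bit_depth // 8) * channels

one_hour_mb = pcm_bytes(3600) / 1e6           # decimal megabytes
monthly_tb = pcm_bytes(10_000 * 3600) / 1e12  # decimal terabytes
print(f"{one_hour_mb:.0f} MB per hour, {monthly_tb:.2f} TB per 10,000 hours")
```

Mono 16 kHz audio, which many speech models resample to anyway, is a small fraction of this size, so downmixing before upload is often worth considering.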

Audio Storage Costs

If you need to retain source audio for compliance, reprocessing, or quality assurance:

  • Google Cloud Storage: $0.020 per GB/month (Standard)
  • AWS S3: $0.023 per GB/month (Standard)
  • Cloudflare R2: $0.015 per GB/month (no egress fees)

For 10,000 hours of compressed audio (approximately 576 GB at 128 kbps MP3), storage costs range from about $8.64 to $13.25 per month depending on the provider, and the bill grows with every additional month of audio you retain. Using a storage provider like Cloudflare R2 that does not charge egress fees can save substantially over time.
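A quick way to size compressed storage is to work from the bitrate. This sketch assumes constant-bitrate audio and decimal gigabytes, and uses the per-GB prices listed above; the names are illustrative.

```python
# Standard-tier storage prices per GB-month, as listed in this article.
STORAGE_PER_GB_MONTH = {
    "Google Cloud Storage": 0.020,
    "AWS S3": 0.023,
    "Cloudflare R2": 0.015,
}

def compressed_gb(hours: float, bitrate_kbps: int = 128) -> float:
    """Size of constant-bitrate audio in decimal gigabytes."""
    return hours * 3600 * bitrate_kbps * 1000 / 8 / 1e9

size_gb = compressed_gb(10_000)
for provider, rate in STORAGE_PER_GB_MONTH.items():
    print(f"{provider}: {size_gb:.0f} GB -> ${size_gb * rate:.2f}/month")
```

If you retain audio indefinitely, multiply by the number of months accumulated, since each month's uploads keep incurring storage charges.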

Audio Preprocessing

Not all providers accept all audio formats natively. If your source audio is in a format that requires conversion (video files, uncommon codecs, multi-channel layouts), you need to factor in:

  • FFmpeg processing time on your own infrastructure
  • Compute costs for audio extraction and conversion
  • Temporary storage for intermediate files

Deepgram and Google Cloud accept the widest range of input formats. AWS Transcribe requires audio to be in S3 or streamed in specific formats. OpenAI Whisper accepts most common formats but has a 25 MB file size limit per request, requiring you to chunk longer files.
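Chunking for the 25 MB limit can be planned before any audio is cut. The sketch below only computes split points; it assumes constant-bitrate audio, treats the limit conservatively as 25,000,000 bytes with a 5% safety margin, and leaves the actual cutting to a tool such as FFmpeg. `plan_chunks` is a hypothetical helper.

```python
import math

def plan_chunks(total_s: float, bitrate_kbps: float,
                limit_bytes: int = 25 * 1000 * 1000, margin: float = 0.95):
    """Return (start, end) second offsets so each chunk stays under the size limit."""
    bytes_per_s = bitrate_kbps * 1000 / 8
    max_chunk_s = math.floor(limit_bytes * margin / bytes_per_s)
    if max_chunk_s < 1:
        raise ValueError("bitrate too high for the given size limit")
    n_chunks = math.ceil(total_s / max_chunk_s)
    return [(i * max_chunk_s, min((i + 1) * max_chunk_s, total_s))
            for i in range(n_chunks)]

# A two-hour recording encoded at 128 kbps.
chunks = plan_chunks(total_s=2 * 3600, bitrate_kbps=128)
print(len(chunks), chunks[0], chunks[-1])
```

Fixed boundaries can split a word in half, so a production pipeline would likely cut at silences or overlap adjacent chunks slightly and deduplicate the text afterwards.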

SDK and Integration Overhead

Google Cloud and AWS both require their respective SDKs, authentication setup, and IAM configuration. This is not a direct dollar cost, but the engineering time to integrate, maintain, and debug these SDKs is real.

OpenAI and Deepgram offer simpler REST APIs that can be called with a single HTTP request and an API key. For small teams, this difference in integration complexity translates to days of saved engineering time.
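To make the "single HTTP request" point concrete, here is a sketch that prepares (but does not send) a pre-recorded transcription request using only the standard library. The endpoint, the `model` query parameter, and the `Token` authorization scheme reflect our understanding of Deepgram's public API and should be checked against the current reference; `build_deepgram_request` is a hypothetical helper and the key and audio bytes are placeholders.

```python
import urllib.request

def build_deepgram_request(audio: bytes, api_key: str,
                           model: str = "nova-2") -> urllib.request.Request:
    """Prepare a pre-recorded transcription request without sending it."""
    return urllib.request.Request(
        url=f"https://api.deepgram.com/v1/listen?model={model}",
        data=audio,  # raw audio bytes as the request body
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "audio/wav",  # assumption: the input is a WAV file
        },
        method="POST",
    )

req = build_deepgram_request(b"RIFF...", api_key="YOUR_API_KEY")
print(req.get_method(), req.full_url)
# Sending would be urllib.request.urlopen(req), which needs a real key and audio.
```

There is no SDK, credential chain, or IAM policy involved, which is the integration difference described above.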

Which API Wins for Your Use Case?

There is no single best speech-to-text API. The right choice depends on your volume, feature requirements, existing infrastructure, and budget constraints.

Best for Startups and Small Teams

Winner: Deepgram

Startups need low costs, simple integration, and room to scale without renegotiating contracts. Deepgram checks all three boxes. The $200 free credit gives you enough runway to build and test your integration thoroughly. Nova-2 at $0.0043 per minute keeps costs manageable as you grow. The REST API is straightforward, and you get diarization and topic detection without paying extra, with streaming available at a modest premium ($0.0059 per minute on Nova-2).

Runner-up: OpenAI Whisper API if you only need batch transcription and can live without streaming. At $0.006 per minute with zero feature overhead, it is hard to beat for simplicity.

Best for Enterprise and High Volume

Winner: AWS Transcribe (with volume discounts) or Deepgram (with enterprise agreement)

At enterprise scale, the conversation shifts from list prices to negotiated rates. AWS Transcribe's automatic volume discounts at 250K and 1M minutes per month make it increasingly competitive without requiring a sales call. If you are already running on AWS infrastructure, the integration is seamless and you avoid cross-cloud data transfer.

Deepgram offers enterprise agreements with custom pricing that, according to published case studies, can bring per-minute costs below $0.003 for very high volumes. If you are not locked into a specific cloud provider, Deepgram's enterprise tier is worth exploring.

Google Cloud becomes a strong option if you are already heavily invested in GCP and value the breadth of their language support (125+ languages versus Deepgram's 36).

Best for Multilingual Applications

Winner: Google Cloud Speech-to-Text

If your product serves a global user base and needs to transcribe audio in dozens of languages with high accuracy, Google Cloud's 125+ language support is unmatched. The Chirp 2 model brings significant accuracy improvements for non-English languages compared to earlier models.

Runner-up: OpenAI Whisper supports 57 languages and handles code-switching (multiple languages in one recording) better than most competitors. At $0.006 per minute, it is a cost-effective multilingual option.

Deepgram supports 36 languages, which covers most major global languages but may fall short if you need less common languages or regional dialects.

Best for Medical and Healthcare

Winner: AWS Transcribe Medical

AWS Transcribe Medical is purpose-built for healthcare transcription with HIPAA compliance, medical vocabulary, and specialty-specific models for cardiology, neurology, oncology, radiology, and urology. At $0.075 per minute it is expensive, but healthcare organizations typically prioritize compliance and accuracy over cost.

Runner-up: Google Cloud offers medical transcription at $0.078 per minute with Healthcare API integration and HIPAA-eligible infrastructure.

Neither OpenAI Whisper nor Deepgram offers dedicated medical models, though Deepgram's custom vocabulary feature can be trained to recognize medical terminology.

Best for Real-Time and Streaming

Winner: Deepgram

Deepgram's streaming transcription at $0.0059 per minute (Nova-2) delivers sub-300ms latency with speaker diarization, interim results, and endpointing. For applications like live captioning, voice assistants, or real-time meeting transcription, this combination of low latency, full features, and low cost is hard to beat.

Runner-up: Google Cloud offers reliable streaming with broad language support but at 4x the cost of Deepgram's streaming rate.

Not an option: OpenAI Whisper API does not support streaming at all as of February 2026.

How ConvertAudioToText Keeps Costs Low

At ConvertAudioToText, we built our transcription pipeline on Deepgram's infrastructure specifically because of the pricing and feature balance outlined above. By using Deepgram's Nova models, we pass those cost savings to our users while still delivering speaker diarization, sentiment analysis, topic detection, and multiple export formats.

Our architecture processes audio asynchronously through a queue-based system, which means we can batch requests efficiently and avoid the overhead of maintaining persistent streaming connections when they are not needed. The result is a service that gives you the accuracy of Deepgram's best models at a fraction of what you would pay integrating the API directly, once you factor in infrastructure, preprocessing, and maintenance.

For developers who want the raw API, check out our best speech-to-text APIs guide for integration walkthroughs and code samples.

Pricing Trends: Where Speech-to-Text API Costs Are Heading

Over the past three years, per-minute pricing across all major providers has dropped by 30 to 50 percent. This trend shows no sign of slowing. Several forces are driving costs down:

Model efficiency improvements. Newer architectures like Deepgram's Nova-3 and Google's Chirp 2 achieve higher accuracy with fewer compute resources per minute of audio processed. As these models mature, providers pass some of those savings on to customers.

Competition from open-source models. OpenAI's open-source release of Whisper forced commercial providers to justify their pricing premium with features (streaming, diarization, custom models) that Whisper does not offer. This healthy competitive pressure keeps prices moving downward.

Hardware improvements. The shift to more efficient inference hardware (custom ASICs, optimized GPU clusters) reduces the raw compute cost of running speech recognition at scale.

Volume growth. As more applications integrate speech-to-text (accessibility features, content indexing, meeting tools, voice interfaces), total market volume increases, allowing providers to amortize infrastructure costs across more minutes.

For anyone planning a multi-year integration, this means the API you choose today will likely get cheaper over time. Lock in favorable pricing terms now, and budget for 10 to 15 percent annual cost reductions.
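For budgeting, the effect of a steady annual reduction can be projected with compound arithmetic. This is a rough planning sketch, not a forecast: it assumes the same percentage drop every year, and `projected_rate` is an illustrative helper.

```python
def projected_rate(rate_per_min: float, annual_reduction: float, years: int) -> float:
    """Compound a fixed yearly percentage reduction onto today's rate."""
    return rate_per_min * (1 - annual_reduction) ** years

# Deepgram Nova-2 list price today, projected over three years.
for reduction in (0.10, 0.15):
    rates = [projected_rate(0.0043, reduction, y) for y in range(4)]
    print(f"{reduction:.0%}/yr:", ", ".join(f"${r:.5f}" for r in rates))
```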

Frequently Asked Questions

What is the cheapest speech-to-text API in 2026?

Deepgram Nova-2 at $0.0043 per minute is the cheapest full-featured speech-to-text API. If you only need batch transcription without streaming or diarization, OpenAI Whisper API at $0.006 per minute is another low-cost option with no volume commitment. For very high volumes (over 1 million minutes per month), AWS Transcribe's tiered pricing drops to $0.0102 per minute, which narrows the gap but is still more than double Deepgram's Nova-2 list price.

Does Google Cloud Speech-to-Text have a free tier?

Yes. Google Cloud offers 60 minutes of free standard recognition per month with no time limit. This free tier persists indefinitely, unlike AWS Transcribe's free tier which expires after 12 months. The 60 minutes is enough for development and testing but not for any meaningful production workload.

Is OpenAI Whisper API good enough for production use?

For batch transcription workloads where you do not need real-time streaming or speaker diarization, Whisper API is a solid production choice. The accuracy is comparable to more expensive alternatives, especially for English. The main limitations are the lack of streaming support, the 25 MB file size limit per request, and no native speaker identification. If your use case requires any of these features, you will need to either build additional processing layers or choose a different provider.

How do I estimate my monthly speech-to-text API cost?

Calculate your monthly audio volume in minutes, then multiply by the per-minute rate of your chosen provider and model. For example, if you process 500 hours of audio per month using Deepgram Nova-2: 500 hours x 60 minutes x $0.0043 = $129 per month. Remember to factor in hidden costs like audio storage, data egress (for Google Cloud and AWS), and any preprocessing compute you need for audio format conversion.

Can I switch speech-to-text APIs without rebuilding my application?

Switching providers requires changes to your API integration code, authentication, and potentially your audio preprocessing pipeline, but the core workflow (send audio, receive text) is similar across all providers. To minimize switching costs, abstract your transcription logic behind an internal interface so you can swap providers without touching the rest of your application. Services like ConvertAudioToText handle this abstraction for you, so you get the best available transcription without managing API integrations directly.
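One way to structure that internal interface is sketched below. All names here (`Transcriber`, `FakeTranscriber`, `summarize_call`) are hypothetical; a real adapter would wrap one vendor's API behind the same method.

```python
from typing import Protocol

class Transcriber(Protocol):
    """The only surface the rest of the application is allowed to see."""
    def transcribe(self, audio: bytes) -> str: ...

class FakeTranscriber:
    """Stand-in provider; a real adapter would call one vendor's API here."""
    def __init__(self, canned_text: str) -> None:
        self.canned_text = canned_text

    def transcribe(self, audio: bytes) -> str:
        return self.canned_text

def summarize_call(audio: bytes, transcriber: Transcriber) -> str:
    """Application code depends on the interface, not on a specific vendor."""
    text = transcriber.transcribe(audio)
    return f"{len(text.split())} words transcribed"

print(summarize_call(b"\x00\x01", FakeTranscriber("hello from the fake provider")))
```

Swapping providers then means writing one new adapter class, and the fake implementation doubles as a test fixture that avoids paid API calls.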
