
Supported Audio Formats: The Compatibility Table

Mamane B. Moussa | April 14, 2026 | Updated July 2, 2026 | 12 min read


The Compatibility Table

Every major transcription tool accepts MP3, WAV, M4A, and MP4. The differences show up in the less common formats, file size caps, and codec restrictions. Use the table below to check your format before uploading, then read the later sections for context on the edge cases that cause real failures. (CATT in the table is ConvertAudioToText.)

Format	Type	CATT	Otter.ai	TurboScribe	Descript	Rev.com	Trint	Fireflies	Happy Scribe	AWS Transcribe	OpenAI Whisper API
MP3	Audio	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
WAV	Audio	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
M4A	Audio	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	No*	Yes
FLAC	Audio	Yes	No	Yes	Yes	Yes	No	No	Yes	Yes	No
AAC	Audio	Yes	No	Yes	Yes	Yes	Yes	No	Yes	No	No
OGG	Audio	Yes	Yes	Yes	No	No	No	No	Yes	Yes	No
Opus	Audio	Yes	No	Yes	No	No	No	No	Yes	Yes (via OGG/WebM)	No
WMA	Audio	Yes	Yes	Yes	No	No	Yes	No	Yes	No	No
AIFF	Audio	No	No	Yes	Yes	No	No	No	Yes	No	No
MP4	Video	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes	Yes
MOV	Video	Yes	Yes	Yes	Yes	Yes	Yes	No	Yes	No	No
AVI	Video	Yes	Yes	Yes	No	No	Yes	No	Yes	No	No
MKV	Video	Yes	Yes	Yes	No	No	No	No	Yes	No	No
WebM	Video	Yes	No	Yes	No	No	No	Yes	Yes	Yes	Yes

*AWS Transcribe batch supports the MP4 container (which can carry AAC audio), but its documentation does not list standalone M4A files by name. Check your specific use case.

Checked: July 2026. Sources listed at the bottom.
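
For scripted pre-upload checks, the table above can be transcribed into a small lookup. The sketch below is one way to do that; the names SERVICES, SUPPORT, and services_accepting are illustrative and not part of any service's API. The flags simply mirror the table as checked in July 2026, with the footnoted AWS Transcribe M4A entry recorded as "N".

```python
# Column order matches the compatibility table above.
SERVICES = [
    "CATT", "Otter.ai", "TurboScribe", "Descript", "Rev.com",
    "Trint", "Fireflies", "Happy Scribe", "AWS Transcribe",
    "OpenAI Whisper API",
]

# One Y/N flag per service, copied row by row from the table.
SUPPORT = {
    "mp3":  "YYYYYYYYYY",
    "wav":  "YYYYYYYYYY",
    "m4a":  "YYYYYYYYNY",
    "flac": "YNYYYNNYYN",
    "aac":  "YNYYYYNYNN",
    "ogg":  "YYYNNNNYYN",
    "opus": "YNYNNNNYYN",
    "wma":  "YYYNNYNYNN",
    "aiff": "NNYYNNNYNN",
    "mp4":  "YYYYYYYYYY",
    "mov":  "YYYYYYNYNN",
    "avi":  "YYYNNYNYNN",
    "mkv":  "YYYNNNNYNN",
    "webm": "YNYNNNYYYY",
}


def services_accepting(extension: str) -> list[str]:
    """Return the services whose table entry for this format is Yes."""
    flags = SUPPORT[extension.lower().lstrip(".")]
    return [name for name, flag in zip(SERVICES, flags) if flag == "Y"]


print(services_accepting(".mkv"))
```

A lookup like this only reflects the extension. It does not catch the codec-level restrictions covered later in the article.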

File Size and Duration Caps

This is where people get surprised. Format support matters less than whether your file fits under the service's upload limit.

Service	File Size Limit	Duration Limit	Notes
ConvertAudioToText	100 MB	No hard cap	Minutes-based quota applies per plan
Otter.ai	5 GB	No stated cap	3 imports/lifetime free; 10/mo Pro
TurboScribe	5 GB	10 hours	Up to 50 files at once on Unlimited
Descript	50 GB per project	15 hours for auto-transcription	3+ audio channels downmixed to stereo
Rev.com	2 GB (API upload), 5 TB (URL)	17 hours	Legal transcription: 20 GB total
Trint	3 GB (recommended)	3 hours (recommended)	Larger files accepted but error-prone
Fireflies	200 MB	150 minutes	Free tier capped at 100 MB
Happy Scribe	No stated limit	No stated limit	31 audio + 16 video formats
AWS Transcribe	2 GB	4 hours	Files must be in S3 for batch jobs
OpenAI Whisper API	25 MB	No stated cap	Affects WAV most: a 25 MB WAV is roughly 2.5 minutes of CD-quality audio, or about 13 minutes at 16 kHz mono
Deepgram (API)	2 GB	No stated cap	100+ formats; audio extraction from video
AssemblyAI (API)	2.2 GB (upload), 5 GB (URL)	10 hours	Recommends submitting in native format

The OpenAI Whisper API 25 MB cap is the most common source of confusion. A 2-hour interview as a WAV file is typically 1+ GB. The cap is on file size, not duration, so a 2-hour MP3 at 96 kbps (roughly 86 MB) will fail too. The workaround is to compress to a lower bitrate or use a service built around your file size.

The ConvertAudioToText uploader accepts MP3, WAV, M4A, FLAC, AAC, OGG, Opus, WMA, and all major video containers. The 100 MB per-file cap applies to uploads; URL-sourced jobs (YouTube, Vimeo, direct links) are not subject to this limit.
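
To see at a glance which services a given file fits, the size column of the caps table can be turned into a lookup. This is a rough sketch under stated assumptions: UPLOAD_CAP_MB and services_that_fit are hypothetical names, 1 GB is treated as 1,000 MB, only direct-upload limits are included (not URL-sourced limits), and plan-specific caps such as the Fireflies free tier are ignored.

```python
# Upload caps in MB, copied from the table above. None means no stated limit.
# Descript's figure is per project; Trint's is a recommendation, not a hard cap.
UPLOAD_CAP_MB = {
    "ConvertAudioToText": 100,
    "Otter.ai": 5_000,
    "TurboScribe": 5_000,
    "Descript": 50_000,
    "Rev.com": 2_000,
    "Trint": 3_000,
    "Fireflies": 200,
    "Happy Scribe": None,
    "AWS Transcribe": 2_000,
    "OpenAI Whisper API": 25,
    "Deepgram": 2_000,
    "AssemblyAI": 2_200,
}


def services_that_fit(size_mb: float) -> list[str]:
    """Return services whose upload cap is at least size_mb."""
    return [
        name
        for name, cap in UPLOAD_CAP_MB.items()
        if cap is None or size_mb <= cap
    ]


# A 115 MB file: too large for the two smallest caps in the table.
print(services_that_fit(115))
```

Duration limits are a separate check. A file can pass the size test and still exceed a service's maximum length.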

Audio Formats: What Each One Actually Means

MP3

MP3 is the safe default. It is universally accepted, small, and at 128 kbps or higher there is no measurable accuracy penalty compared to WAV. Below 64 kbps, compression artifacts begin to degrade speech clarity. Old voicemail recordings at 32 kbps are the most common offender.

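Practical file sizes: roughly 1 MB per minute at 128 kbps. A 2-hour recording is about 115 MB, which exceeds the OpenAI Whisper API's 25 MB cap and ConvertAudioToText's 100 MB upload cap (and the Fireflies free-tier 100 MB cap) but fits under every other limit in the table above.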
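
The per-minute figure follows directly from the bitrate. The helper below is a minimal sketch of that arithmetic; compressed_size_mb is an illustrative name, and it assumes constant-bitrate audio, 1 MB = 1,000,000 bytes, and no container or metadata overhead.

```python
def compressed_size_mb(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate size of constant-bitrate audio in MB (1 MB = 1,000,000 bytes)."""
    bits = bitrate_kbps * 1000 * duration_s
    return bits / 8 / 1_000_000


print(round(compressed_size_mb(128, 60), 2))        # one minute at 128 kbps -> 0.96
print(round(compressed_size_mb(128, 2 * 3600), 1))  # two hours at 128 kbps -> 115.2
print(round(compressed_size_mb(96, 2 * 3600), 1))   # two hours at 96 kbps -> 86.4
```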

WAV

WAV is lossless, which matters for archiving, not for transcription accuracy. A WAV and a 128 kbps MP3 of the same speech recording produce virtually identical transcription output from any modern AI model. The meaningful difference is file size: WAV runs roughly 10 MB per minute at CD quality (44.1 kHz, 16-bit stereo).
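
The 10 MB per minute figure comes from the PCM parameters themselves. As a small worked sketch (pcm_mb_per_minute is an illustrative name; it assumes uncompressed PCM, 1 MB = 1,000,000 bytes, and ignores the small WAV header):

```python
def pcm_mb_per_minute(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Size of one minute of uncompressed PCM audio in MB."""
    bytes_per_second = sample_rate_hz * (bit_depth // 8) * channels
    return bytes_per_second * 60 / 1_000_000


print(pcm_mb_per_minute(44_100, 16, 2))  # CD quality -> 10.584
print(pcm_mb_per_minute(16_000, 16, 1))  # 16 kHz mono -> 1.92
```

The second line shows why downsampling to 16 kHz mono shrinks a WAV by more than five times before any lossy compression is applied.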

If you are hitting file size limits, convert to MP3 before uploading. You will not lose any accuracy.

M4A

iPhone Voice Memos exports as M4A. It uses AAC compression inside an MP4 container, which is more efficient than MP3. Files are slightly smaller than equivalent-quality MP3s, and all major consumer transcription tools support it.

AWS Transcribe does not list M4A as a named supported format in its batch documentation; if you are using the AWS API directly, convert to MP3 or WAV first. Consumer tools (CATT, TurboScribe, Otter, Descript, Rev) handle it natively.

FLAC

FLAC is lossless compression. File sizes are 50 to 70 percent smaller than WAV while preserving every bit of the original audio. For transcription purposes it offers no accuracy advantage over a well-encoded MP3, but it is useful when you need to archive the original and upload a single file.

Not all services support it: Otter.ai, Fireflies, and Trint do not accept FLAC. Check before uploading.

OGG / Opus

OGG Vorbis and Opus are open-source formats common in Android recordings and web apps. Opus in particular is used in WebRTC (browser recordings, voice calls, Discord exports). Support is patchy at the consumer level: Descript, Rev, Trint, and Fireflies do not support OGG directly. AWS Transcribe accepts Opus in OGG or WebM containers but not OGG Vorbis; Deepgram handles both.

If your source is an Android recording in OGG, convert to MP3 first for maximum compatibility.

WMA

Windows Media Audio is fading from production use but still appears in older Windows Voice Recorder exports and legacy call-center recordings. TurboScribe, Otter.ai, and Trint accept it; Descript, Fireflies, and Rev do not.

AIFF

AIFF is Apple's uncompressed format, commonly produced by GarageBand and Pro Tools. File sizes are similar to WAV. Only TurboScribe, Descript, and Happy Scribe accept AIFF among the services in this table; if your DAW exports AIFF, check your target service before uploading.

Video Formats and Audio Extraction

All the services in this table extract the audio track from video automatically. You do not need to pre-process a video file before uploading.

MP4 works everywhere. MOV works at most consumer services but not at AWS Transcribe, Fireflies, or the OpenAI Whisper API. MKV has limited support (CATT, Otter, TurboScribe, Happy Scribe). If your source is MKV or AVI and you are using a service that does not support it, extract the audio to MP3 with FFmpeg: ffmpeg -i input.mkv -vn -ab 192k output.mp3.

For more detail on subtitle output from video files, see SRT vs VTT vs TTML: which subtitle format to use.

When Format Affects Accuracy (and When It Does Not)

In almost every real-world case, format does not affect accuracy. Modern AI transcription models process audio waveforms after decoding, so the container format and compression codec make no meaningful difference at normal quality settings.

The three cases where format genuinely matters:

Very low bitrate MP3 (under 64 kbps). Compression artifacts in sub-64 kbps audio can degrade speech intelligibility for both human listeners and AI models. This shows up in very old podcast archives, compressed voicemail exports, and aggressively small streaming audio.

Multi-generation re-encoding. Converting WAV to MP3 to OGG back to MP3 stacks lossy compression and loses quality at each step. Work from the original source file whenever possible.

Corrupted or truncated files. A partially downloaded file or an interrupted recording may pass format validation but fail mid-transcription. Re-download or re-export the source.
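
The first case can be screened for before uploading. One rough approach, sketched below, is to compute the average bitrate from the file size and duration and compare it to the 64 kbps threshold. The function names are illustrative, the duration has to come from elsewhere (a media player or ffprobe, for example), and the result is only approximate because the file size includes container overhead and metadata.

```python
LOW_BITRATE_KBPS = 64  # threshold named in the text above


def effective_bitrate_kbps(size_bytes: int, duration_s: float) -> float:
    """Average bitrate implied by a file's size and duration."""
    return size_bytes * 8 / duration_s / 1000


def is_low_bitrate(size_bytes: int, duration_s: float) -> bool:
    """True when the average bitrate falls below the 64 kbps threshold."""
    return effective_bitrate_kbps(size_bytes, duration_s) < LOW_BITRATE_KBPS


# A one-hour voicemail archive that is only 14.4 MB averages 32 kbps.
print(effective_bitrate_kbps(14_400_000, 3600))  # -> 32.0
print(is_low_bitrate(14_400_000, 3600))          # -> True
```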

For a deeper look at how sample rates and channel counts affect accuracy at the API level, see the transcription accuracy explained post.

Codec Quirks Worth Knowing

A few specifics that cause silent failures:

AWS Transcribe OGG requires Opus codec. The service accepts OGG and WebM containers, but only with Opus audio inside. An OGG Vorbis file will fail. This is documented in their input requirements.
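
Because both codecs share the .ogg extension, the only way to tell them apart is to look inside the file. The sketch below reads the first Ogg page and checks the start of its payload: an Opus stream begins with the bytes "OpusHead", and a Vorbis stream begins with 0x01 followed by "vorbis". The function name ogg_codec is illustrative, and the check only looks at the first logical stream, so treat it as a quick screen and not a full parser.

```python
def ogg_codec(first_bytes: bytes) -> str:
    """Identify the codec of the first logical stream in an Ogg file."""
    # An Ogg page header is 27 bytes, starting with the capture pattern "OggS".
    if len(first_bytes) < 27 or first_bytes[:4] != b"OggS":
        return "not-ogg"
    # Byte 26 holds the number of entries in the segment table that follows.
    segment_count = first_bytes[26]
    payload = first_bytes[27 + segment_count:]
    if payload.startswith(b"OpusHead"):
        return "opus"
    if payload.startswith(b"\x01vorbis"):
        return "vorbis"
    return "unknown"


def ogg_codec_of_file(path: str) -> str:
    """Read enough of the file to cover the first page's identification header."""
    with open(path, "rb") as handle:
        return ogg_codec(handle.read(64))
```

If the result is "vorbis", re-encode to Opus or MP3 before sending the file to AWS Transcribe.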

Google Cloud STT requires encoding declaration. Unlike Deepgram, Google's API requires you to specify the audio encoding in your request. Sending an MP3 without declaring MP3 in the encoding field will fail. Their V2 API can auto-detect some formats, but V1 does not.

Multi-channel audio (more than 2 channels). AWS Transcribe does not support audio with more than two channels. Descript automatically downmixes files with 3 or more channels to stereo on import. If you are recording in multi-channel (4-channel conference recorder, ambisonic microphone), downmix to stereo before uploading to any API-based service.
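
For WAV sources, the channel count can be checked with Python's standard wave module before deciding whether to downmix. This is a minimal sketch: wav_channel_count and needs_downmix are illustrative names, and the wave module only reads uncompressed PCM WAV files, so other containers need a tool such as ffprobe.

```python
import wave


def wav_channel_count(source) -> int:
    """Return the channel count of a PCM WAV file (path or binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnchannels()


def needs_downmix(source, max_channels: int = 2) -> bool:
    """True when the file has more channels than the target service allows."""
    return wav_channel_count(source) > max_channels
```

When needs_downmix returns True, adding -ac 2 (stereo) or -ac 1 (mono) to an FFmpeg conversion produces a file within the two-channel limit.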

PCM encoding inside WAV. Most WAV files use PCM encoding and work everywhere. WAV files with unusual encodings (GSM, ADPCM) may fail at some services even though the container is recognized. FFmpeg's ffprobe can tell you which encoding is inside your WAV file.
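
Where ffprobe is not available, the encoding can also be read straight from the WAV header: the "fmt " chunk starts with a 16-bit format tag. The sketch below is an illustrative alternative, not a replacement for ffprobe. The function name and the short tag table are assumptions for this example and cover only common values.

```python
import struct

# Common WAVE format tags; anything else is reported by number.
WAV_FORMAT_TAGS = {
    0x0001: "PCM",
    0x0002: "Microsoft ADPCM",
    0x0003: "IEEE float",
    0x0006: "A-law",
    0x0007: "mu-law",
    0x0011: "IMA ADPCM",
    0x0031: "GSM 6.10",
    0xFFFE: "WAVE_FORMAT_EXTENSIBLE",
}


def wav_encoding(data: bytes) -> str:
    """Read the format tag from the fmt chunk of a RIFF/WAVE header."""
    if data[:4] != b"RIFF" or data[8:12] != b"WAVE":
        return "not-wav"
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt ":
            if pos + 10 > len(data):
                return "unknown"
            (tag,) = struct.unpack_from("<H", data, pos + 8)
            return WAV_FORMAT_TAGS.get(tag, f"unknown (0x{tag:04X})")
        # Skip to the next chunk; chunks are padded to an even length.
        pos += 8 + chunk_size + (chunk_size & 1)
    return "unknown"
```

Passing the first few kilobytes of a file is usually enough, since the fmt chunk normally sits near the start. Anything other than "PCM" is a candidate for re-encoding before upload.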

For a broader look at format choice across the pipeline from recording to archive, see transcription file formats explained and WAV vs MP3 for transcription.

Converting to a Supported Format

If you have a file in an unsupported format or need to reduce file size before uploading, the audio to text tool accepts the file directly and handles conversion internally. For programmatic workflows, the one-liner below covers most cases:

ffmpeg -i input.anyformat -vn -ar 16000 -ac 1 -ab 64k output.mp3

This extracts audio, sets 16 kHz sample rate (sufficient for speech), converts to mono, and encodes at 64 kbps, which reduces a 1-hour WAV from roughly 600 MB to about 30 MB with no transcription accuracy loss.
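
For batch workflows, the same conversion can be assembled as an argument list and handed to subprocess. The sketch below only builds the command; build_speech_mp3_command and the file names are hypothetical, and it uses -b:a, which is FFmpeg's current spelling of the -ab audio bitrate option.

```python
def build_speech_mp3_command(src: str, dst: str, bitrate_kbps: int = 64) -> list[str]:
    """Build an FFmpeg argument list for a mono, 16 kHz, speech-oriented MP3."""
    return [
        "ffmpeg",
        "-i", src,                    # input file in any format FFmpeg can read
        "-vn",                        # drop any video stream
        "-ar", "16000",               # resample to 16 kHz
        "-ac", "1",                   # downmix to mono
        "-b:a", f"{bitrate_kbps}k",   # audio bitrate
        dst,
    ]


cmd = build_speech_mp3_command("interview.wav", "interview.mp3")
print(" ".join(cmd))
# To run it (requires ffmpeg on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.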

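FFmpeg never modifies the input file, so the original is preserved either way. If you want more quality headroom in the upload copy, use -ab 128k instead: that is about 58 MB per hour, which fits under every cap in the table above except the OpenAI Whisper API's 25 MB.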

If you just need a clean transcript without a meeting bot or video editor, ConvertAudioToText accepts all the formats in this table (except AIFF) and processes most files in under a minute.

Common Questions

What is the best audio format for transcription accuracy?

Any lossless format (WAV, FLAC) or high-quality lossy format (MP3 at 128 kbps or higher, M4A at standard quality) will produce the same transcription accuracy from any modern AI model. The recording environment and microphone quality matter far more than the audio format. Save WAV or FLAC for archiving; upload MP3 to keep file sizes manageable.

Why does the OpenAI Whisper API reject my file when other services accept it?

The Whisper API enforces a 25 MB file size cap, which is much lower than other services. A 1-hour MP3 at 128 kbps is about 57 MB and will fail. The solution is to compress to a lower bitrate (64 kbps is sufficient for speech), split the file into chunks, or use a service without a 25 MB cap. Note that at 64 kbps, 25 MB holds only about 52 minutes of audio, so a recording of an hour or more still has to be split.
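
If you go the chunking route, the number of pieces follows from the cap and the bitrate. A minimal sketch of that calculation, with illustrative function names and the assumptions that the audio is constant bitrate, 1 MB is 1,000,000 bytes, and container overhead is ignored (so leave some margin in practice):

```python
import math


def max_seconds_under_cap(cap_mb: float, bitrate_kbps: float) -> float:
    """Longest duration whose audio data fits in cap_mb at the given bitrate."""
    return cap_mb * 1_000_000 * 8 / (bitrate_kbps * 1000)


def chunks_needed(duration_s: float, cap_mb: float, bitrate_kbps: float) -> int:
    """Minimum number of equal-bitrate chunks needed to stay under the cap."""
    return math.ceil(duration_s / max_seconds_under_cap(cap_mb, bitrate_kbps))


print(max_seconds_under_cap(25, 128))  # -> 1562.5 seconds, about 26 minutes
print(chunks_needed(3600, 25, 128))    # one hour at 128 kbps -> 3
print(chunks_needed(3600, 25, 64))     # one hour at 64 kbps -> 2
```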

Can I upload a video file directly, or do I need to extract the audio first?

You can upload video directly to all the services in this table. They extract the audio track automatically. MP4 works everywhere; MOV and MKV have patchy support at API-level services like AWS Transcribe. If you are using an API directly and your video format is not in the supported list, extract the audio with FFmpeg first.

Do multi-channel (surround sound) recordings work?

Most transcription tools downmix to stereo automatically or explicitly do not support more than two channels. AWS Transcribe documents a hard limit of two channels. Descript downmixes on import. For reliable results, convert any multi-channel recording to mono or stereo before uploading.

Sources

Deepgram supported audio formats: developers.deepgram.com/docs/supported-audio-formats (checked July 2026)
Amazon Transcribe input requirements: docs.aws.amazon.com/transcribe/latest/dg/how-input.html (checked July 2026)
OpenAI Whisper API documentation: help.openai.com/en/articles/7031512-audio-api-faq (checked July 2026)
Google Cloud Speech-to-Text encodings: docs.cloud.google.com/speech-to-text/docs/encoding (checked July 2026)
Otter.ai import help: help.otter.ai (checked July 2026)
TurboScribe support: turboscribe.ai/support (checked July 2026)
Descript supported file types: help.descript.com (checked July 2026)
Rev.com accepted file formats: support.rev.com (checked July 2026)
Trint supported file specifications: info.trint.com/knowledge/supported-file-specifications (checked July 2026)
Fireflies upload guide: guide.fireflies.ai (checked July 2026)
Happy Scribe formats: happyscribe.com/formats (checked July 2026)
AssemblyAI file limits: assemblyai.com/docs/faq (checked July 2026)
