Transcription File Formats Explained: MP3, WAV, M4A, and the Rest

ConvertAudioToText Team · May 26, 2026 · 9 min read

Modern transcription tools accept almost any audio or video file you throw at them, but not every format is equal. Some are larger than they need to be. Some are compressed in ways that hurt accuracy. A few are still rejected by older tools. This post covers the formats you will encounter, which one to pick when you have a choice, and how to handle the awkward cases.

The Two Categories: Audio and Video

Transcription tools care about the audio. If you upload a video, the tool first extracts the audio track (typically with FFmpeg), then runs the transcription pipeline on that. The video itself is ignored.

This means a 1 GB MP4 with the same audio as a 50 MB MP3 produces the same transcript. The MP4 is just slower to upload. For pure transcription work, audio-only files are more efficient than video files. The exception is when you also want to keep the file as a video reference, for example for subtitling work.
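If you would rather upload only the audio, one way to pull it out of a video yourself is sketched below. It builds an FFmpeg command that drops the video stream and copies the audio untouched. The function name is hypothetical, and the .m4a output assumes the audio track is AAC, which is common for MP4 but not guaranteed.

```python
from pathlib import Path


def audio_extract_command(video_path: str) -> list[str]:
    """Build an FFmpeg command that copies the audio track out of a video."""
    src = Path(video_path)
    dst = src.with_suffix(".m4a")  # assumption: the audio track is AAC
    return [
        "ffmpeg",
        "-i", str(src),
        "-vn",           # ignore the video stream
        "-c:a", "copy",  # copy the audio as-is, no re-encoding
        str(dst),
    ]


# Print the command so it can be reviewed before running it.
print(" ".join(audio_extract_command("meeting.mp4")))
```

Copying the stream avoids a second round of lossy compression. If the audio track turns out not to be AAC, re-encoding to MP3 is the safer route.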

The video to text tool post covers video specifically. Here we focus on the format choices.

The Audio Formats You Will See

MP3

The default format for almost everything. Universal support, decent compression, small files.

  • Compression: Lossy. 128 kbps is acceptable, 192 kbps is good, 320 kbps is the practical max.
  • Sample rate: Usually 44.1 kHz, can be lower.
  • File size: 60-minute file at 192 kbps = 86 MB.
  • Best for: Podcasts, voice memos, almost any speech-only recording.

MP3 strips some high-frequency content during compression. At 192 kbps and above, this loss is negligible for transcription. At 64 kbps and below, fricatives ("f," "s," "sh") get fuzzy and accuracy drops.
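The file-size figures for lossy formats in this post follow directly from bitrate and duration. A minimal sketch of the arithmetic, assuming constant bitrate and ignoring tags and container overhead (the function name is illustrative):

```python
def lossy_size_mb(bitrate_kbps: float, minutes: float) -> float:
    """Approximate size in MB (10^6 bytes) of constant-bitrate audio."""
    bits = bitrate_kbps * 1000 * minutes * 60
    return bits / 8 / 1_000_000


for kbps in (64, 128, 192, 320):
    print(f"{kbps} kbps, 60 min: {lossy_size_mb(kbps, 60):.1f} MB")
# 64 kbps, 60 min: 28.8 MB
# 128 kbps, 60 min: 57.6 MB
# 192 kbps, 60 min: 86.4 MB
# 320 kbps, 60 min: 144.0 MB
```

The same formula applies to AAC and Opus, since it depends only on the bitrate, not the codec.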

The MP3 to text tool is the most-used ConvertAudioToText (CATT) endpoint because MP3 is the most common input.

WAV

Uncompressed PCM audio. The lossless original.

  • Compression: None. Every sample stored explicitly.
  • Sample rate: Anything from 8 kHz to 192 kHz. 16 kHz is the transcription standard.
  • File size: 60-minute 16-bit mono file at 16 kHz ≈ 110 MB. The same recording at 44.1 kHz stereo ≈ 600 MB.
  • Best for: Master recordings you will edit, professional audio work, situations where you want the highest possible accuracy.

WAV is overkill for upload. If your tool accepts it, fine; if you are worried about upload time, convert to a 192 kbps MP3 and lose essentially no accuracy.
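For uncompressed audio the size is simply sample rate × channels × bytes per sample × seconds. The sketch below assumes 16-bit samples, which the post does not state but which is the most common WAV bit depth, and ignores the small file header.

```python
def pcm_size_bytes(sample_rate_hz: int, channels: int, minutes: int,
                   bit_depth: int = 16) -> int:
    """Size of the raw PCM data in bytes (WAV header not included)."""
    return sample_rate_hz * channels * (bit_depth // 8) * minutes * 60


for label, rate, channels in (("16 kHz mono", 16_000, 1),
                              ("44.1 kHz stereo", 44_100, 2)):
    size = pcm_size_bytes(rate, channels, 60)
    print(f"{label}: {size / 1e6:.1f} MB ({size / 2**20:.1f} MiB)")
# 16 kHz mono: 115.2 MB (109.9 MiB)
# 44.1 kHz stereo: 635.0 MB (605.6 MiB)
```

The rounded figures quoted above correspond to the binary (MiB) values.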

The WAV to text tool handles WAV natively without re-encoding.

M4A

The default for Apple devices: AAC audio in an audio-only MP4 container.

  • Compression: Lossy (AAC codec). Slightly more efficient than MP3 at the same bitrate.
  • Sample rate: Usually 44.1 kHz.
  • File size: 60-minute file at 128 kbps = 58 MB.
  • Best for: iPhone Voice Memos, GarageBand exports, any Apple-origin audio.

M4A and AAC are closely related rather than identical: M4A is the container, AAC is the codec inside it. Most tools handle both interchangeably.

The M4A to text tool is the right entry point if your source is an iPhone Voice Memo or GarageBand export.

FLAC

Lossless compression. Same quality as WAV at roughly half the file size.

  • Compression: Lossless.
  • Sample rate: Typically 44.1 kHz or 48 kHz.
  • File size: 60-minute stereo file = 300 MB.
  • Best for: Archival, situations where you cannot lose any quality but want smaller files than WAV.

FLAC is rarely the right transcription format because the accuracy gain over MP3 192 kbps is essentially zero. It is mostly used in audiophile and archival workflows.

The FLAC to text tool accepts it directly.

OGG / Opus

Open-source alternatives. Opus in particular is the codec WhatsApp and many WebRTC apps use.

  • Compression: Lossy (Opus is very efficient, often beating MP3 and AAC).
  • File size: 60-minute file at 64 kbps Opus = 29 MB.
  • Best for: WhatsApp voice messages, WebRTC recordings, anything coming from open-source tooling.

The OGG to text tool handles both OGG/Vorbis and OGG/Opus.

AAC

The raw codec, usually inside an M4A container but sometimes alone.

  • Compression: Lossy. Slightly better than MP3 at the same bitrate.
  • File size: Similar to M4A.
  • Best for: Anything sourced from iTunes Store, Apple Podcasts, or modern broadcast workflows.

The AAC to text tool is essentially identical to the M4A path.

WMA

Windows Media Audio. Rare in 2026 but still encountered on legacy systems.

  • Compression: Lossy (Microsoft's own codec).
  • Best for: Legacy Windows recordings; convert to MP3 if you can.

The WMA to text tool handles it but conversion to MP3 first is often easier if you are processing many files.

The Video Formats You Will See

MP4

The default video container. The audio inside is usually AAC; the video can be H.264, H.265, or others.

  • Best for: YouTube downloads, screen recordings, smartphone video, Loom exports.
  • Audio quality: Usually 128-192 kbps AAC. Good for transcription.

The MP4 to text tool extracts the audio track and transcribes it.

MOV

Apple's video container. Functionally similar to MP4.

  • Best for: QuickTime recordings, iPhone video, ScreenFlow, Final Cut Pro.

The MOV to text tool handles MOV natively.

WebM

Open-source video format. Common for Google Meet recordings and some browser-based recording tools.

  • Audio: Usually Opus.
  • Best for: Google Meet recordings, web-based recording tools, Twitch downloads.

The WebM to text tool extracts and transcribes the audio.

MKV

Matroska container. Flexible, often used for high-quality video downloads.

  • Best for: Conference recordings, downloaded talks.

The MKV to text tool accepts MKV files directly.

AVI, WMV, FLV

Older formats. Still occasionally encountered.

If you have a choice, convert to MP4 before uploading. The files will be smaller and the transcription pipeline will run faster (no esoteric codec decoding).

Picking the Right Format When You Have a Choice

If you are recording new audio for transcription, the practical default is:

  • MP3 at 192 kbps mono for speech-only content.
  • M4A at 128 kbps AAC if you are on an Apple device and want native format.
  • WAV at 16 kHz mono if you want the highest possible accuracy and do not mind the file size.

For video work, MP4 with H.264 video and AAC audio is the safe default. It transcribes well and plays everywhere.

Mono vs. Stereo

For transcription, mono is almost always the right choice. Speech is mono by nature; the speaker's voice is the same on both channels of a stereo recording.

The exception is interviews recorded with one speaker on the left channel and another on the right (which some tools call "split-track recording"). In that case, stereo is useful for diarization, because the tool can use channel information to label speakers without doing acoustic clustering. Most tools that support this expose a "use channel info for diarization" option.
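If your tool has no channel-based diarization option, you can approximate it by splitting a split-track recording into two mono files and transcribing each separately. The following is a minimal sketch using only Python's standard wave module; it assumes an uncompressed stereo PCM WAV, and the function name is hypothetical.

```python
import wave


def split_stereo_wav(src_path: str, left_path: str, right_path: str) -> None:
    """Write the left and right channels of a stereo PCM WAV to two mono WAVs."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2:
            raise ValueError("expected a stereo file")
        width = src.getsampwidth()  # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    # Each frame holds one left sample followed by one right sample.
    frame_size = 2 * width
    left, right = bytearray(), bytearray()
    for i in range(0, len(frames), frame_size):
        left += frames[i:i + width]
        right += frames[i + width:i + frame_size]

    for path, data in ((left_path, left), (right_path, right)):
        with wave.open(path, "wb") as out:
            out.setnchannels(1)
            out.setsampwidth(width)
            out.setframerate(rate)
            out.writeframes(bytes(data))
```

The pure-Python loop is slow on hour-long recordings; it is meant to show the channel layout, not to replace an audio tool.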

For everything else, mono halves the size of an uncompressed file with no accuracy cost.

Sample Rate

The standard for transcription is 16 kHz, because most speech models are trained on 16 kHz audio. Higher rates (44.1 kHz, 48 kHz, 96 kHz) are common in music and broadcast but unnecessary for transcription.

If your tool resamples to 16 kHz internally (almost all do), uploading 44.1 kHz audio just means a longer upload. For low-bandwidth situations, resample to 16 kHz before uploading.
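To judge whether resampling before upload is worth the effort, you can read a WAV file's header and estimate how much smaller it would be at 16 kHz mono. This is a rough sketch with a hypothetical function name; it only handles PCM WAV files that Python's wave module can open.

```python
import wave

TARGET_RATE_HZ = 16_000  # the transcription standard cited above


def downsample_saving(path: str) -> float:
    """Return how many times smaller the PCM data would be at 16 kHz mono."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate <= TARGET_RATE_HZ:
        # Already at or below the target rate: only the downmix saves space.
        return float(channels)
    return (rate / TARGET_RATE_HZ) * channels
```

A 44.1 kHz stereo file gives a factor of about 5.5, so the upload is roughly five and a half times shorter after conversion.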

File Size Limits

Most transcription tools have a per-file size limit. Common ones in 2026:

  • Free tiers: 25-200 MB or 30-60 minutes.
  • Paid tiers: 1-5 GB or 4-8 hours.
  • API: Often per-call limits matching paid tiers; for very large files, chunk client-side.

If your file exceeds the limit, the practical options:

  1. Convert to a more efficient format. A 600 MB WAV becomes an MP3 of roughly 86 MB at 192 kbps.
  2. Split the file at natural pauses. Audacity, FFmpeg, or any audio editor.
  3. Use a tool with higher limits. Pricing and tier comparisons matter here.
  4. Use the URL upload feature. Many tools (including CATT) accept a URL and download server-side, which avoids your upload bandwidth.
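For the splitting option, it helps to know how many pieces you need before opening an editor. The sketch below plans equal-length chunks that each fit under a size limit, assuming constant bitrate and ignoring container overhead. The function name and the 25 MB example limit are illustrative, not tied to any particular tool.

```python
import math


def plan_chunks(duration_s: float, bitrate_kbps: float,
                limit_mb: float) -> list[tuple[float, float]]:
    """Return equal-length (start, end) spans, in seconds, that fit under limit_mb."""
    bytes_per_second = bitrate_kbps * 1000 / 8
    max_chunk_s = limit_mb * 1_000_000 / bytes_per_second
    count = max(1, math.ceil(duration_s / max_chunk_s))
    length = duration_s / count
    return [(i * length, min((i + 1) * length, duration_s))
            for i in range(count)]


# A 60-minute, 192 kbps file against a 25 MB limit needs four 15-minute chunks.
for start, end in plan_chunks(3600, 192, 25):
    print(f"{start:.0f}s to {end:.0f}s")
```

These boundaries are a starting point only. Nudge each cut to the nearest pause so that no word is split across two files.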

When Format Becomes a Problem

Three situations where the format actually causes trouble:

  1. DRM-protected files. iTunes purchases from before 2009, some audiobook files. You cannot transcribe these without removing the DRM first (which is often a licensing violation).
  2. Variable-bitrate edge cases. Some VBR-encoded files have malformed headers that confuse decoders. Re-encoding to constant bitrate fixes this.
  3. Multi-channel surround sound. A 5.1 audio track, such as one from a film recording, is an unusual input. Most tools downmix to mono automatically; some refuse.

If you are stuck on a problematic format, converting through FFmpeg to a clean MP3 or WAV is almost always the fix:

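ffmpeg -i input.weird-format -ac 1 -ar 16000 -b:a 128k output.mp3

That command produces a mono, 16 kHz, 128 kbps MP3 from anything FFmpeg can read. (At a 16 kHz sample rate the MP3 format tops out at 160 kbps, so a 192 kbps setting cannot be honored; drop -ar 16000 if you want a 192 kbps file.)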
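When a whole folder of awkward files needs the same treatment, a small wrapper saves retyping. The sketch below targets the WAV route (16 kHz mono, 16-bit PCM) rather than MP3; the function names are hypothetical, and it assumes ffmpeg is installed and on your PATH.

```python
import subprocess
from pathlib import Path


def clean_wav_command(src: Path, dst: Path) -> list[str]:
    """Build an FFmpeg command that converts src to 16 kHz mono 16-bit WAV."""
    return [
        "ffmpeg",
        "-i", str(src),
        "-ac", "1",           # downmix to mono
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # uncompressed 16-bit PCM
        str(dst),
    ]


def convert_folder(in_dir: str, out_dir: str) -> None:
    """Convert every file in in_dir, skipping outputs that already exist."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(in_dir).iterdir()):
        dst = out / (src.stem + ".wav")
        if src.is_file() and not dst.exists():
            # check=True raises if FFmpeg cannot read or convert the file.
            subprocess.run(clean_wav_command(src, dst), check=True)
```

Calling convert_folder("recordings", "clean") stops at the first file FFmpeg cannot decode, which is usually what you want when hunting for a corrupt input.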

A Quick Reference Table

Format        Type   Compression  Best Use
MP3           Audio  Lossy        Default for speech
WAV           Audio  None         Master recordings
M4A           Audio  Lossy (AAC)  Apple ecosystem
FLAC          Audio  Lossless     Archival
OGG/Opus      Audio  Lossy        WhatsApp, WebRTC
AAC           Audio  Lossy        iTunes, broadcast
WMA           Audio  Lossy        Legacy Windows
MP4           Video  Lossy        Default video
MOV           Video  Lossy        Apple video
WebM          Video  Lossy        Google Meet, browser
MKV           Video  Variable     High-quality downloads
AVI/WMV/FLV   Video  Lossy        Legacy

What to Do Next

Look at the format of the file you are about to transcribe. If it is MP3, WAV, M4A, or MP4, just upload it. If it is something obscure, convert to MP3 first and save yourself the headache.
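File extensions are sometimes wrong, so checking the first few bytes of a file is a more reliable way to see what you actually have. The sketch below matches well-known file signatures. It is a heuristic rather than a full parser: older MOV files, for instance, may not start with an ftyp box, and both function names are hypothetical.

```python
def sniff_format(header: bytes) -> str:
    """Guess the container from the first 12 bytes of a file."""
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "WAV"
    if header[:4] == b"RIFF" and header[8:12] == b"AVI ":
        return "AVI"
    if header[:4] == b"fLaC":
        return "FLAC"
    if header[:4] == b"OggS":
        return "OGG (Vorbis or Opus)"
    if header[4:8] == b"ftyp":
        return "MP4 family (MP4, M4A, MOV)"
    if header[:4] == b"\x1a\x45\xdf\xa3":
        return "Matroska family (MKV, WebM)"
    if header[:4] == b"\x30\x26\xb2\x75":
        return "ASF (WMA, WMV)"
    if header[:3] == b"FLV":
        return "FLV"
    if header[:3] == b"ID3":
        return "MP3 (with ID3 tag)"
    if len(header) >= 2 and header[0] == 0xFF:
        if (header[1] & 0xF6) == 0xF0:  # 12-bit sync word, layer bits 00
            return "AAC (ADTS)"
        if (header[1] & 0xE0) == 0xE0:  # 11-bit MPEG audio frame sync
            return "MP3"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 12 bytes of a file and guess its format."""
    with open(path, "rb") as f:
        return sniff_format(f.read(12))
```

Anything that comes back as "unknown" is a good candidate for the convert-first approach.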
