Transcription File Formats Explained: A 2026 Input Guide

By Mamane B. Moussa | May 26, 2026 | Updated July 2, 2026 | 11 min read

TL;DR

Audio and video transcription tools accept a wide range of formats by extracting the audio track first, so your container choice matters less than you might expect. For speech recordings, MP3 at 192 kbps mono is the practical default: small, universal, and accurate enough. WAV and FLAC preserve every sample but rarely improve transcript quality on clean recordings. When you have no choice of format, a single FFmpeg command can convert almost anything into a clean, 16 kHz mono MP3 before you upload.

The format of your audio file is almost never the reason a transcript comes out wrong. Most transcription tools accept MP3, WAV, M4A, FLAC, OGG, MP4, MOV, and a dozen more. This post maps each format, explains when the choice actually matters, and tells you what to do with the awkward cases.

For a quick yes/no compatibility table across tools, see the supported audio formats for transcription reference. The output side, SRT, VTT, and TXT, is covered in transcription export formats explained.

Audio vs. Video: What Transcription Tools Actually Process

When you upload a video file, the transcription tool extracts the audio track first (using FFmpeg or an equivalent), then runs the speech model on that audio. The video frames are ignored entirely.

A 1 GB MP4 and a 50 MB MP3 carrying identical audio produce identical transcripts. The MP4 just takes longer to upload. For pure transcription work, audio-only formats are always the more efficient choice. The exception is subtitle work, where you want to keep the video file as a reference.
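If you would rather do that extraction yourself before uploading, the step can be sketched in Python as a small command builder. This is a minimal sketch, not how any particular tool does it: the function name is illustrative, and it assumes the video's audio track is AAC so it can be copied into an .m4a file without re-encoding.

```python
from pathlib import Path


def extract_audio_command(video_path: str) -> list[str]:
    """Build an FFmpeg command that copies the audio track out of a video.

    Assumes the audio stream is AAC, so it can be stored in .m4a unchanged.
    """
    src = Path(video_path)
    dst = src.with_suffix(".m4a")
    return [
        "ffmpeg",
        "-i", str(src),
        "-vn",            # drop the video stream
        "-c:a", "copy",   # copy the audio stream without re-encoding
        str(dst),
    ]


print(" ".join(extract_audio_command("meeting.mp4")))
```

To actually run it, pass the list to subprocess.run(cmd, check=True) on a machine with FFmpeg on the PATH. If the audio track is not AAC, drop the copy option and let FFmpeg re-encode.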

The Audio Formats

MP3

The practical default for almost everything. Universal support, small files, no compatibility surprises.

Compression: Lossy. 128 kbps is acceptable; 192 kbps is the accuracy sweet spot for speech; 320 kbps is the practical ceiling.
Sample rate: Usually 44.1 kHz. Most transcription models resample to 16 kHz internally.
File size: 60-minute file at 192 kbps mono = roughly 86 MB.
Best for: Podcasts, voice memos, interviews, any speech-only recording where you want small files.

MP3 removes high-frequency content during encoding. At 192 kbps and above, the loss is negligible for transcription. At 64 kbps and below, fricatives ("f," "s," "sh") get muddy and accuracy drops noticeably.
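The file sizes quoted for lossy formats in this post follow from bitrate multiplied by duration. A minimal Python sketch of that arithmetic, assuming constant bitrate, decimal megabytes, and no container or metadata overhead (the function name is illustrative):

```python
def lossy_size_mb(bitrate_kbps: float, minutes: float) -> float:
    """Approximate size of a constant-bitrate file in decimal megabytes."""
    bits = bitrate_kbps * 1000 * minutes * 60
    return bits / 8 / 1_000_000


for kbps in (64, 128, 192, 320):
    print(f"{kbps} kbps, 60 min: {lossy_size_mb(kbps, 60):.1f} MB")
```

For a 60-minute file this gives 86.4 MB at 192 kbps, matching the figure above. Note that at a fixed bitrate the size does not depend on the channel count.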

WAV

Uncompressed PCM audio. Every sample is stored explicitly.

Compression: None.
Sample rate: 8 kHz to 192 kHz. 16 kHz is the speech-model standard.
File size: 60-minute mono file at 16 kHz = roughly 110 MB. The same recording at 44.1 kHz stereo balloons to around 600 MB.
Best for: Master recordings you plan to edit, professional broadcast work, or situations where you want no lossy steps in the chain.

WAV is often overkill for upload. If you are on a slow connection or near a size limit, converting to a 192 kbps MP3 loses essentially no accuracy. For a detailed head-to-head, see WAV vs. MP3 for transcription.
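Uncompressed sizes can be estimated the same way, from sample rate, channel count, and bit depth. The sketch below assumes 16-bit samples (the common default, though not stated above) and ignores the small WAV header. The function name is illustrative.

```python
def pcm_size_mb(sample_rate_hz: int, channels: int, minutes: float,
                bits_per_sample: int = 16) -> float:
    """Approximate size of uncompressed PCM audio in decimal megabytes."""
    bytes_per_second = sample_rate_hz * channels * bits_per_sample // 8
    return bytes_per_second * minutes * 60 / 1_000_000


print(pcm_size_mb(16_000, 1, 60))   # 60 minutes, 16 kHz mono
print(pcm_size_mb(44_100, 2, 60))   # 60 minutes, 44.1 kHz stereo
```

Under those assumptions the two cases come to about 115 MB and 635 MB, in line with the rough figures quoted above.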

M4A

The default format for Apple devices. Sometimes described as "AAC inside an MP4 container."

Compression: Lossy (AAC codec). Slightly more efficient than MP3 at the same bitrate.
Sample rate: Usually 44.1 kHz.
File size: 60-minute file at 128 kbps = roughly 58 MB.
Best for: iPhone Voice Memos, GarageBand exports, any Apple-origin audio.

M4A and AAC are closely related rather than identical: M4A is the container, AAC is the codec inside it. Most tools handle both interchangeably. There is no reason to convert an iPhone Voice Memo before uploading.

FLAC

Lossless compression. Same audio quality as WAV, smaller file.

Compression: Lossless (no audio data is removed).
Sample rate: Typically 44.1 kHz or 48 kHz.
File size: 60-minute stereo file = roughly 300 MB.
Best for: Archival workflows where you cannot afford any lossy step and want smaller files than WAV.

FLAC is rarely the right format for transcription upload. The accuracy gain over a clean 192 kbps MP3 is essentially zero for speech, and the files are still much larger than MP3. It is most useful in audiophile and archival pipelines where the audio will be processed further.

OGG / Opus

Open-source formats. Opus in particular is what WhatsApp, Discord, and most WebRTC apps use for voice messages.

Compression: Lossy. Opus is highly efficient at low bitrates, often outperforming MP3 and AAC.
File size: 60-minute file at 64 kbps Opus = roughly 29 MB.
Best for: WhatsApp voice messages, WebRTC recordings, open-source tooling.

OGG is a container (like MP4); Vorbis and Opus are codecs that can live inside it. At 64 kbps and above, Opus preserves speech clarity well.

AAC

The raw codec, usually delivered inside an M4A container but occasionally as a standalone .aac file.

Compression: Lossy. Slightly more efficient than MP3 at the same bitrate.
File size: Similar to M4A.
Best for: Content from Apple Podcasts, iTunes Store, or modern broadcast workflows.

AAC and M4A are processed identically by transcription tools. You will rarely need to distinguish them in practice.

WMA

Windows Media Audio. Uncommon in 2026 but still present on legacy systems.

Compression: Lossy (Microsoft's proprietary codec).
Best for: Legacy Windows recordings. Convert to MP3 if you are processing files in bulk.

WMA is accepted by most modern tools, but it is worth converting to MP3 first if you have many files: it avoids any codec decoding edge cases downstream.

The Video Formats

MP4

The dominant video container. Audio inside is usually AAC; video is typically H.264 or H.265.

Best for: YouTube downloads, screen recordings, smartphone video, Loom exports.
Audio quality: Usually 128-192 kbps AAC, which transcribes well.

MOV

Apple's video container. Functionally similar to MP4.

Best for: QuickTime recordings, iPhone video, ScreenFlow, Final Cut Pro exports.

WebM

Open-source video format. Common for Google Meet recordings and browser-based recording tools.

Audio: Usually Opus.
Best for: Google Meet recordings, web-based capture tools, Twitch downloads.

MKV

The Matroska container. Flexible, often used for high-quality downloaded video.

Best for: Conference recordings, downloaded talks, high-definition video with multiple audio tracks.

AVI, WMV, FLV

Older formats you will still encounter.

AVI: Legacy Windows video. Accepted by most tools; convert to MP4 if you have a choice.
WMV: Legacy Windows Media. Same advice.
FLV: Legacy Flash video. Rarely seen post-2020 but some older recordings still use it.

For any of these, converting to MP4 before uploading produces smaller files and avoids any esoteric codec decoding.

When Format Choice Actually Matters

Most of the time, format choice does not matter: any modern tool will handle MP3, WAV, M4A, and MP4 without complaint. The cases where it does matter:

Low bitrate lossy files. An MP3 at 32 or 48 kbps, or a heavily compressed phone recording, introduces enough smearing of consonants that even the best model struggles. If you can re-record at a higher bitrate, do it before transcribing.

Converting lossy to lossless. If you take an MP3 and convert it to WAV, you get a large WAV file with the same data loss as the original MP3. The loss is permanent; no conversion can recover the audio that the MP3 encoder removed. The transcript will be the same as from the original MP3.

DRM-protected files. iTunes purchases from before 2009 and some audiobook files carry DRM that prevents extraction. Transcribing them requires removing the DRM first, which may be a licensing violation depending on your use case.

Variable-bitrate edge cases. Some VBR-encoded files have malformed headers that confuse decoders. Re-encoding to constant bitrate (CBR) fixes this reliably.

Multi-channel surround sound. A 5.1 audio track from a film recording is unusual as transcription input. Most tools downmix to mono automatically; some refuse the file.
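Two of these cases, low bitrate and multi-channel audio, can be spotted before upload by inspecting the file's metadata. The sketch below parses the JSON that ffprobe prints when run as ffprobe -v error -show_streams -of json input.mp3. The function name, the codec list, and the 64 kbps threshold (borrowed from the MP3 section) are illustrative assumptions, not rules any transcription tool publishes.

```python
import json


def flag_risky_audio(ffprobe_json: str) -> list[str]:
    """Return warnings for the first audio stream in ffprobe JSON output."""
    streams = json.loads(ffprobe_json).get("streams", [])
    audio = next((s for s in streams if s.get("codec_type") == "audio"), None)
    if audio is None:
        return ["no audio stream found"]

    warnings = []
    # ffprobe reports bit_rate as a string and may omit it for some files.
    bit_rate = int(audio.get("bit_rate", 0))
    lossy = audio.get("codec_name") in {"mp3", "aac"}
    if lossy and 0 < bit_rate <= 64_000:
        warnings.append(f"low bitrate: {bit_rate // 1000} kbps")
    if audio.get("channels", 1) > 2:
        warnings.append(f"{audio['channels']} channels: may need a downmix")
    return warnings


sample = '{"streams": [{"codec_type": "audio", "codec_name": "mp3", "bit_rate": "48000", "channels": 2}]}'
print(flag_risky_audio(sample))
```

An empty list means neither condition was detected. It does not guarantee the file will transcribe well.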

Mono vs. Stereo

Mono halves the file size with no accuracy cost for single-speaker content. Speech is inherently mono: your voice is the same signal on both channels of a stereo recording.

The one exception is split-track interview recording, where one speaker is panned to the left channel and another to the right. Some tools can use that channel separation for speaker diarization, labeling speakers by channel without needing acoustic clustering. If your recorder supports split-track, and your transcription tool supports channel-based diarization, it is worth enabling. For everything else, record in mono.
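To make the split-track idea concrete, here is a minimal Python sketch that separates an interleaved stereo buffer into one mono stream per speaker. It assumes 16-bit PCM with left and right samples alternating, as in typical WAV data; how a given transcription tool implements channel-based diarization is not something this post can confirm.

```python
import struct


def split_stereo_pcm16(frames: bytes) -> tuple[bytes, bytes]:
    """Split interleaved 16-bit stereo PCM into (left, right) mono streams."""
    if len(frames) % 4:
        raise ValueError("expected whole 4-byte stereo frames")
    left = bytearray()
    right = bytearray()
    for i in range(0, len(frames), 4):
        left += frames[i:i + 2]       # first sample of each frame
        right += frames[i + 2:i + 4]  # second sample of each frame
    return bytes(left), bytes(right)


# Two frames: the left channel carries 100, 200 and the right carries -100, -200.
stereo = struct.pack("<4h", 100, -100, 200, -200)
left, right = split_stereo_pcm16(stereo)
print(struct.unpack("<2h", left), struct.unpack("<2h", right))
```

In practice the input bytes could come from the standard library's wave module (readframes), and each mono stream could then be transcribed separately and labeled by channel.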

Sample Rate

The speech-model standard is 16 kHz. A 16 kHz sample rate can represent frequencies up to 8 kHz (half the sample rate), which covers the range that carries nearly all the information speech recognition relies on. Higher sample rates (44.1 kHz from CD audio, 48 kHz from broadcast, 96 kHz from studio gear) add file size but no accuracy benefit for speech recognition.

Virtually all transcription tools resample internally to 16 kHz. Uploading 44.1 kHz audio just means a larger upload for the same result. On a slow connection, resampling before upload saves time:

ffmpeg -i input.wav -ac 1 -ar 16000 output_16k.wav
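How much time that saves depends on your connection. A rough worked example in Python, assuming a 10 Mbps uplink (an arbitrary illustrative figure), 16-bit WAV files, and no protocol overhead:

```python
def upload_seconds(size_mb: float, uplink_mbps: float) -> float:
    """Rough upload time: decimal megabytes over a link speed in megabits/s."""
    return size_mb * 8 / uplink_mbps


original = upload_seconds(635.0, 10)   # 60 min, 44.1 kHz stereo WAV
resampled = upload_seconds(115.2, 10)  # 60 min, 16 kHz mono WAV
print(f"{original:.0f} s vs {resampled:.0f} s")
```

On those assumptions the upload drops from roughly eight and a half minutes to about a minute and a half.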

File Size and the Practical Limits

Transcription tools cap file size or duration, with free tiers being more restrictive than paid tiers. The pattern is consistent across tools: free plans limit either by total minutes per month or per-file duration; paid plans raise those caps significantly or remove them.

If your file bumps against a limit, your options are:

Convert to a more efficient format. A 600 MB WAV (about an hour of 44.1 kHz stereo) becomes roughly 86 MB as a 192 kbps MP3, often enough to clear a size limit without any accuracy cost.
Split the file at a natural pause. Audacity and FFmpeg both handle this.
Use URL-based upload if available. Some tools, including ConvertAudioToText, accept a URL and download the file server-side, bypassing your upload bandwidth entirely.
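The splitting option can be sketched in Python as a planner that turns a duration limit into one FFmpeg command per chunk. This is a minimal sketch with illustrative function names. It cuts at fixed intervals rather than at natural pauses, so you would still nudge the boundaries by hand, and it uses stream copy so nothing is re-encoded.

```python
from pathlib import Path


def plan_chunks(total_seconds: int, max_chunk_seconds: int) -> list[tuple[int, int]]:
    """Return (start, duration) pairs covering the whole recording."""
    if max_chunk_seconds <= 0:
        raise ValueError("max_chunk_seconds must be positive")
    chunks = []
    start = 0
    while start < total_seconds:
        duration = min(max_chunk_seconds, total_seconds - start)
        chunks.append((start, duration))
        start += duration
    return chunks


def split_commands(src: str, total_seconds: int, max_chunk_seconds: int) -> list[list[str]]:
    """Build one stream-copy FFmpeg command per chunk (no re-encoding)."""
    path = Path(src)
    commands = []
    for n, (start, duration) in enumerate(plan_chunks(total_seconds, max_chunk_seconds), start=1):
        out = path.with_name(f"{path.stem}_part{n:02d}{path.suffix}")
        commands.append(["ffmpeg", "-i", str(path), "-ss", str(start),
                         "-t", str(duration), "-c", "copy", str(out)])
    return commands


for cmd in split_commands("talk.mp3", 4000, 1800):
    print(" ".join(cmd))
```

Each command can be run with subprocess.run(cmd, check=True). The 1800-second limit here is only an example; substitute whatever cap your tool enforces.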

The FFmpeg Escape Hatch

If you are stuck on a problematic format, converting to a clean mono MP3 via FFmpeg fixes almost every edge case:

ffmpeg -i input.weird-format -ac 1 -ar 16000 -b:a 192k output.mp3

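That command asks for a mono, 16 kHz, 192 kbps MP3 from anything FFmpeg can read (which is nearly everything). One caveat: the MP3 format caps 16 kHz audio at 160 kbps, so the encoder may settle on a lower bitrate than requested. That is still more than enough for speech.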
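For a folder of awkward files, the same conversion can be scripted. The sketch below only builds the commands, reusing the flags from the command above. The function names and the "converted" output folder are illustrative assumptions.

```python
from pathlib import Path


def normalize_command(src: Path, out_dir: Path) -> list[str]:
    """Build the mono 16 kHz MP3 conversion for one input file."""
    dst = out_dir / (src.stem + ".mp3")
    return [
        "ffmpeg",
        "-i", str(src),
        "-ac", "1",      # downmix to mono
        "-ar", "16000",  # resample to 16 kHz
        "-b:a", "192k",  # requested audio bitrate
        str(dst),
    ]


def batch_commands(paths: list[str], out_dir: str = "converted") -> list[list[str]]:
    """Build one conversion command per input path."""
    return [normalize_command(Path(p), Path(out_dir)) for p in paths]


for cmd in batch_commands(["memo.wma", "lecture.flv"]):
    print(" ".join(cmd))
```

To execute the batch, create the output folder first and pass each list to subprocess.run(cmd, check=True), with FFmpeg available on the PATH.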

Format Map: Quick Reference

Format	Type	Compression	Typical Use
MP3	Audio	Lossy	Default for speech
WAV	Audio	None	Master recordings
M4A	Audio	Lossy (AAC)	Apple devices
FLAC	Audio	Lossless	Archival
OGG / Opus	Audio	Lossy	WhatsApp, WebRTC
AAC	Audio	Lossy	Apple Podcasts, broadcast
WMA	Audio	Lossy	Legacy Windows
MP4	Video	Lossy	Default video
MOV	Video	Lossy	Apple video
WebM	Video	Lossy	Google Meet, browser tools
MKV	Video	Variable	High-quality downloads
AVI / WMV / FLV	Video	Lossy	Legacy

My take: the single most useful thing in this table is recognizing that the Video rows all reduce to "audio extraction first." Once you internalize that, worrying about container choice for video files largely disappears.

Picking a Format When You Have a Choice

If you are recording new audio specifically for transcription:

MP3 at 192 kbps mono for speech-only content. Small, universal, accurate.
M4A at 128 kbps AAC if you are on an Apple device and want the native format.
WAV at 16 kHz mono if you have storage space and want no lossy steps in the chain.

For video work, MP4 with H.264 video and AAC audio at 128 kbps or above is the safe default.

FAQ

Does format choice actually affect transcription accuracy?

Rarely in practice. The recording quality underneath the format matters far more than the container or codec. A clean, well-recorded MP3 at 192 kbps will produce a transcript nearly identical to a WAV of the same audio. The exceptions are very low bitrates (below 64 kbps), heavily compressed phone audio, and files with corrupted headers.

Should I convert my file to WAV before uploading?

Generally no. WAV only makes sense when your source is already WAV or another lossless format. Converting a lossy format like MP3 or M4A to WAV does not restore any audio data: the original compression decisions are permanent. The conversion just inflates the file size with no accuracy benefit.

What is the best format for transcribing an iPhone Voice Memo?

M4A, which is the native format iPhones use. There is no reason to convert it first. M4A uses the AAC codec and is accepted directly by all modern transcription tools.

Why do transcription models prefer 16 kHz audio?

Human speech sits below 8 kHz. A 16 kHz sample rate captures the full voice frequency range with no loss for speech recognition purposes. Higher rates (44.1 kHz, 48 kHz) contain frequencies the model ignores, so uploading them just means a slower upload without a transcript improvement.

What should I do if my file is too large to upload?

Three options: convert to a more efficient format (a 600 MB WAV holding about an hour of audio becomes roughly 86 MB as a 192 kbps MP3), split the file at a natural pause using Audacity or FFmpeg, or check whether your transcription tool accepts a URL so it can download the file server-side and bypass your upload bandwidth entirely.

Sources

AssemblyAI: Best Audio File Formats for Speech-to-Text - checked 2026-07-02
ConvertAudioToText Pricing Page - checked 2026-07-02
