Improve Audio Quality Before Transcription: What Helps

Mamane B. Moussa · April 14, 2026 · Updated July 2, 2026 · 12 min read

TL;DR

Processing existing audio before upload can recover 10 to 20 percentage points of transcription accuracy on problematic recordings. The chain is: trim, denoise, normalize, EQ, convert to mono, then export. The catch: clean audio needs none of this, and audio with severe clipping or heavy reverb resists it too. Know which of the three buckets your file falls into before you touch an editor.

Processing your audio before upload is worth doing only when the recording has an identifiable problem: constant hum, inconsistent volume, muddy low frequencies, or a stereo file where both channels carry the same voice. On moderately noisy recordings, the full processing chain can recover 10 to 20 percentage points of accuracy. On clean audio, it contributes nothing. This guide covers the chain in the order it must be applied, and flags the situations where you should skip it entirely.

Processed audio uploads the same way: the gains show in the transcript

For guidance on preventing these problems at recording time, see recording environment for best results. For dealing with very poor source material that pre-upload processing cannot fix, see transcribing with poor audio quality.

The Three Buckets First

Before opening an audio editor, sort your file:

Clean: You can hear every word clearly at normal listening volume. Transcribe directly. Processing will not help and may introduce artifacts.

Recoverable: Background hum, uneven speaker volumes, muffled speech, or a quiet recording with audible but not overwhelming noise. This is where the processing chain earns its keep.

Unrecoverable: You cannot make out significant portions even when listening carefully with headphones. Typical causes are heavy clipping (harsh, crackling distortion throughout), simultaneous speakers across the whole file, or a mic in another room. Re-record if you can. Processing will not produce a usable transcript.

Most recordings with obvious but tolerable noise fall in the recoverable bucket.
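Your ears remain the real test for sorting a file, but a quick level check can back up what you hear. The sketch below is a rough aid, not a classifier: it assumes samples are already decoded to floats between -1.0 and 1.0, and the function names and both thresholds (1 percent of samples near full scale, peak below -20 dBFS) are illustrative assumptions rather than established cutoffs.

```python
import math

def clipped_fraction(samples, margin=0.999):
    """Share of samples sitting at or near full scale (floats in -1.0..1.0)."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if abs(s) >= margin) / len(samples)

def peak_dbfs(samples):
    """Peak level in dB relative to full scale; -inf for pure silence."""
    peak = max((abs(s) for s in samples), default=0.0)
    return 20 * math.log10(peak) if peak > 0 else float("-inf")

def triage_hint(samples, clip_limit=0.01, quiet_dbfs=-20.0):
    """Rough hint only; both thresholds are illustrative assumptions."""
    if clipped_fraction(samples) > clip_limit:
        return "possible heavy clipping"
    if peak_dbfs(samples) < quiet_dbfs:
        return "quiet recording"
    return "no level problem detected"
```

A "possible heavy clipping" hint points toward the unrecoverable bucket, and "quiet recording" toward the recoverable one. Neither replaces listening with headphones.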

Step 1: Trim First

Before any audio processing, trim unnecessary sections: the pre-meeting silence, the "can everyone hear me" preamble, or the dead air at the end. Trimming serves two purposes. First, it removes sections that can skew the noise profile you will need in step 2. Second, it shortens the file, which cuts processing time for every step that follows.

Tools for trimming: Voice Memos on iOS, the built-in Sound Recorder on Windows, or Audacity on any platform. All three support simple cut-and-export without touching signal quality.

Do not split a long recording into many tiny fragments to "parallelize" transcription. Splitting at natural topic boundaries (every 20 to 30 minutes, at a clear pause) is reasonable. Splitting into 5-minute chunks introduces context loss at each boundary and can break speaker continuity.

Step 2: Noise Reduction

This step targets constant background noise: air conditioning, fan hum, HVAC rumble, fluorescent light buzz. These are called steady-state noises because they occupy predictable, consistent frequency bands. Noise reduction works by sampling a short segment of the noise alone, building a fingerprint, and subtracting it across the whole file.
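To make the "fingerprint and subtract" idea concrete, here is a minimal sketch of a spectral gate working on a single short frame. It is a simplified illustration, not Audacity's actual algorithm: the function names, the plain (slow) DFT, and the `sensitivity` factor of 2.0 are all assumptions chosen for readability. Frequency bins that look like the noise profile are turned down by `reduction_db`; bins that stand well above it are left alone.

```python
import cmath
import math

def dft(frame):
    """Plain discrete Fourier transform (fine for short demo frames)."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n)]

def inverse_dft(spectrum):
    """Inverse transform, returning the real part of each sample."""
    n = len(spectrum)
    return [(sum(c * cmath.exp(2j * math.pi * k * t / n)
                 for k, c in enumerate(spectrum)) / n).real
            for t in range(n)]

def noise_profile(noise_frame):
    """Fingerprint: magnitude per frequency bin of a noise-only frame."""
    return [abs(c) for c in dft(noise_frame)]

def reduce_noise(frame, profile, reduction_db=6.0, sensitivity=2.0):
    """Attenuate bins whose magnitude is within `sensitivity` x the profile."""
    floor_gain = 10 ** (-reduction_db / 20)
    cleaned = []
    for c, noise_mag in zip(dft(frame), profile):
        looks_like_noise = abs(c) <= sensitivity * noise_mag
        cleaned.append(c * floor_gain if looks_like_noise else c)
    return inverse_dft(cleaned)
```

Real tools run this over overlapping windows and smooth the gains across time and frequency, which is where most of the difference between clean results and robotic artifacts comes from.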

Audacity (free, audacityteam.org) is sufficient for most transcription prep. Here is the sequence:

Open your trimmed file in Audacity.
Find 1 to 2 seconds of audio where nobody is speaking but the background noise is present.
Select that section. Go to Effect, then Noise Reduction, then click "Get Noise Profile."
Select the entire file (Ctrl+A or Cmd+A).
Return to Effect, then Noise Reduction. The recommended defaults are: Noise Reduction 6 dB, Sensitivity 6, Frequency Smoothing 6. These are conservative settings that prioritize speech preservation over maximum noise removal.
Use the "Residue" toggle to hear what will be removed. If speech is audible in the residue, you are being too aggressive.
Preview, adjust if needed, then apply.

My take: start with 6 dB and increase in 3 dB increments if noise remains obvious. Anything above 15 dB starts producing robotic artifacts on the voice, which usually hurts transcription accuracy more than the original noise did.

Intermittent noises (a dog barking, a door slam, a phone notification) do not respond to this technique. You can manually select and silence those moments (Edit, Silence Audio in Audacity) if they are brief and surrounded by clear speech.

Alternatives to Audacity:

Adobe Podcast Enhance (podcast.adobe.com/enhance) runs entirely in the browser. Free with an Adobe account, it handles files up to 30 minutes and 500 MB per session, with a 1-hour daily cap. It applies AI-based noise removal in one click with no parameter tuning. The tradeoff is less control: you get the result Adobe's model produces, not the result you dial in.

iZotope RX is the professional option. RX 12 Elements starts at $99, Standard at $399, and Advanced at $1,399. For transcription prep, Elements handles everything you will realistically need: Dialogue Denoise, De-reverb, and Repair Assistant. The Advanced tier targets broadcast and post-production workflows that go beyond a transcription cleanup task.

Step 3: Normalize Volume

Uneven volume is the second most common problem. One speaker is close to the mic, another is across the table. The recording fades halfway through. Parts are fine but a few sections are nearly inaudible.

Normalization comes in two flavors:

Peak normalization raises the overall level so the loudest moment hits a target ceiling, typically -1 dB. In Audacity: Effect, then Normalize, then set peak amplitude to -1.0 dB. This is the right choice when the recording is consistently quiet.

Dynamic range compression reduces the gap between the loudest and quietest moments. It brings up the quiet speaker and reins in the loud one. In Audacity: Effect, then Compressor. A 3:1 ratio with a threshold around -20 dB is a reasonable starting point for speech with multiple participants at different volumes. Adjust by ear using the Preview button.

If your file is quiet because the microphone was far from the speaker, peak normalization will also amplify the background noise. In that case, run noise reduction first (step 2), then normalize the cleaner result.
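The arithmetic behind both flavors is short enough to sketch. The code below assumes float samples between -1.0 and 1.0; the function names are illustrative, and the compressor is only the static level curve (no attack, release, or make-up gain), so treat it as an explanation of the 3:1 ratio rather than a model of Audacity's Compressor.

```python
import math

def db_to_gain(db):
    """Convert a decibel value to a linear amplitude factor."""
    return 10 ** (db / 20)

def peak_normalize(samples, target_dbfs=-1.0):
    """Scale every sample by one gain so the loudest peak hits the target."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    gain = db_to_gain(target_dbfs) / peak
    return [s * gain for s in samples]

def compressed_level_db(level_db, threshold_db=-20.0, ratio=3.0):
    """Static compressor curve: above the threshold, level rises at 1/ratio."""
    if level_db <= threshold_db:
        return level_db
    return threshold_db + (level_db - threshold_db) / ratio

# A loud speaker peaking at -5 dB and a quiet one at -30 dB start 25 dB apart.
loud_after = compressed_level_db(-5.0)    # -20 + 15 / 3 = -15.0
quiet_after = compressed_level_db(-30.0)  # below threshold, unchanged: -30.0
```

Because peak normalization applies a single gain to everything, background noise rises by exactly the same amount as the speech, which is why noise reduction goes first.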

For command-line workflows, the ffmpeg-normalize tool (pip install ffmpeg-normalize) applies EBU R128 loudness normalization in one command. A comparable single-pass filter in raw ffmpeg is:

ffmpeg -i input.wav -af "loudnorm=I=-16:TP=-1.5:LRA=11" output.wav

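The loudnorm filter follows the EBU R128 loudness measurement. The I=-16 setting targets -16 LUFS, a common level for podcast delivery (the R128 broadcast target itself is -23 LUFS), and is a reasonable target for pre-transcription processing.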

Step 4: EQ for Speech Clarity

Equalization adjusts the balance of frequencies. For transcription prep, the goal is narrow: cut what is not speech, preserve what is.

Human speech intelligibility sits between roughly 300 Hz and 3,400 Hz. That is the frequency band phone calls have used for decades, and it is enough for understanding speech even without the full range.

A three-move EQ setup works for most recordings:

High-pass filter at 80 to 100 Hz. Cuts low-frequency rumble from traffic, HVAC, and microphone handling noise that does not carry any speech content.
Low-pass filter at 8,000 Hz. Reduces high-frequency hiss. Most speech energy is well below this; cutting above it removes noise with minimal effect on clarity.
Gentle boost of 1 to 3 dB in the 1,000 to 3,000 Hz range. This is the consonant-heavy region where intelligibility lives.

In Audacity, apply these via Effect, then Filter Curve EQ (for precise control) or Effect, then High-Pass Filter and Low-Pass Filter for the cutoff steps individually.
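If you are curious what a high-pass filter actually does to the samples, the sketch below implements a second-order one using the widely circulated Audio EQ Cookbook formulas. It is an illustration under assumptions (the function names, the 0.7071 Q value, and the 16 kHz test signals are mine), not the filter Audacity uses internally.

```python
import math

def highpass_coefficients(cutoff_hz, sample_rate, q=0.7071):
    """Normalized biquad coefficients for a second-order high-pass filter."""
    w0 = 2 * math.pi * cutoff_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    cos_w0 = math.cos(w0)
    a0 = 1 + alpha
    b = [(1 + cos_w0) / 2 / a0, -(1 + cos_w0) / a0, (1 + cos_w0) / 2 / a0]
    a = [-2 * cos_w0 / a0, (1 - alpha) / a0]
    return b, a

def apply_biquad(samples, b, a):
    """Run the filter sample by sample, keeping two samples of history."""
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def tone(freq_hz, sample_rate=16000, seconds=1.0):
    """Unit-amplitude sine wave used as a test signal."""
    count = int(sample_rate * seconds)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(count)]

b, a = highpass_coefficients(80, 16000)
rumble_out = apply_biquad(tone(30), b, a)     # 30 Hz rumble: strongly reduced
speech_out = apply_biquad(tone(1000), b, a)   # 1 kHz speech band: passes through
```

Comparing `rms(rumble_out)` with `rms(speech_out)` shows the point of the 80 to 100 Hz cutoff: rumble below it drops sharply while the speech band is essentially untouched.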

EQ is the most optional step here. If noise reduction and normalization produced a recording you can follow easily, skip EQ. It helps most on muffled recordings (weak high frequencies) or recordings with heavy low-end rumble.

Step 5: Convert to Mono

Stereo audio carries two channels. For single-source recordings such as a phone call, a solo podcast, or a one-microphone interview, both channels usually contain the same voice signal. Keeping it stereo doubles the file size with no accuracy benefit for transcription.

Converting to mono halves the file size and, in some cases, improves transcription because the model processes one consistent signal rather than two channels that may have slight phase differences or different signal levels.

In Audacity: Tracks, then Mix, then Mix Stereo Down to Mono. Or, at the command line:

ffmpeg -i input.wav -ac 1 output_mono.wav

Skip this step if your recording has genuinely different content on each channel (some dual-channel interview setups, broadcast recordings where the host and guest are on separate tracks). In those cases, mix the channels properly rather than just dropping to mono.
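A small sketch shows why a blind mixdown is safe for identical channels and risky otherwise. It assumes a plain average of left and right (one simple way to mix, not necessarily what any particular tool does), and the function names are illustrative.

```python
def split_interleaved(samples):
    """Separate interleaved stereo samples (L, R, L, R, ...) into two lists."""
    return samples[0::2], samples[1::2]

def downmix_to_mono(left, right):
    """Average the two channels sample by sample."""
    return [(l + r) / 2 for l, r in zip(left, right)]

# Same voice on both channels: the mono mix equals the original signal.
same = downmix_to_mono([0.5, -0.5, 0.25], [0.5, -0.5, 0.25])

# Channels with inverted polarity: the average cancels to silence.
cancelled = downmix_to_mono([0.5, -0.5, 0.25], [-0.5, 0.5, -0.25])
```

Real recordings rarely cancel completely, but partial phase differences thin out the voice in the same way. With separate host and guest tracks, checking each channel on its own before mixing avoids that surprise.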

Step 6: Export in the Right Format

The processing chain ends with an export. Format matters less than the quality of what you processed, but a few rules apply:

WAV or FLAC preserve everything the processing produced. WAV is uncompressed. FLAC is losslessly compressed, typically 40 to 60 percent smaller than WAV with identical quality. Either is the right choice if you want to archive the processed file or chain it into another tool.

MP3 at 128 kbps or above is practical for uploads when file size matters. A clean, processed recording at 128 kbps MP3 transcribes about as well as the same recording in WAV. Going below 128 kbps introduces audible compression artifacts that can confuse ASR models.

Avoid re-encoding between lossy formats. If your source is already an MP3, export as MP3 or convert to WAV/FLAC. Do not chain conversions such as MP3 to AAC (M4A). Each lossy re-encode adds quality loss.

Transcription services, including ConvertAudioToText, accept WAV, FLAC, MP3, M4A, and most other common formats. The format choice rarely changes the transcript; the recording quality does.
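The size tradeoff is plain arithmetic. The sketch below estimates payload sizes for a one-hour recording, assuming 44.1 kHz, 16-bit PCM for WAV and a constant 128 kbps for MP3; the function names are illustrative and file headers are ignored.

```python
def pcm_bytes(seconds, sample_rate=44100, bit_depth=16, channels=2):
    """Uncompressed PCM size: samples per second x bytes per sample x channels."""
    return seconds * sample_rate * (bit_depth // 8) * channels

def cbr_bytes(seconds, kbps=128):
    """Constant-bitrate size: kilobits per second converted to bytes."""
    return seconds * kbps * 1000 // 8

one_hour = 3600
stereo_wav = pcm_bytes(one_hour)             # 635,040,000 bytes (about 635 MB)
mono_wav = pcm_bytes(one_hour, channels=1)   # 317,520,000 bytes (about 318 MB)
mp3_128 = cbr_bytes(one_hour)                # 57,600,000 bytes (about 58 MB)
```

The same numbers show why mono conversion halves an uncompressed file, and why a 128 kbps MP3 is roughly a fifth the size of the mono WAV.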

The Full Chain, Ordered

Step	Tool	When to Apply
1. Trim	Any audio player, Audacity	Always
2. Noise reduction	Audacity, Adobe Podcast Enhance, iZotope RX	When steady-state noise is present
3. Normalize	Audacity, ffmpeg-normalize	When volume is inconsistent or consistently low
4. EQ	Audacity Filter Curve EQ	When rumble or muffled quality is audible
5. Convert to mono	Audacity, ffmpeg	When file is stereo and content is the same on both channels
6. Export	Audacity, ffmpeg	Always, as WAV/FLAC/MP3 128+ kbps

Order is not optional. Noise reduction before normalization prevents amplifying noise. Normalization before EQ gives the EQ a stable level to work with. Re-ordering the chain degrades the result.
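For a scripted version of the chain, the sketch below assembles one ffmpeg command with the filters in the order given above. Several parts are assumptions: the function names are mine, ffmpeg's `afftdn` denoiser stands in for Audacity's profile-based noise reduction (it works differently), and the EQ boost is approximated with a single `equalizer` band centered at 2,000 Hz. The code only builds the argument list; it does not run ffmpeg.

```python
def build_filters(denoise=False, normalize=False, eq=False):
    """Return ffmpeg audio filters in chain order: denoise, normalize, EQ."""
    filters = []
    if denoise:
        filters.append("afftdn=nr=6")  # stand-in denoiser, 6 dB reduction
    if normalize:
        filters.append("loudnorm=I=-16:TP=-1.5:LRA=11")
    if eq:
        filters += [
            "highpass=f=80",
            "lowpass=f=8000",
            "equalizer=f=2000:t=h:w=2000:g=2",  # gentle 2 dB presence boost
        ]
    return filters

def ffmpeg_command(src, dst, start=None, end=None, mono=False, **steps):
    """Build the argument list: trim, filter chain, mono, then export."""
    cmd = ["ffmpeg", "-i", src]
    if start is not None:
        cmd += ["-ss", str(start)]
    if end is not None:
        cmd += ["-to", str(end)]
    filters = build_filters(**steps)
    if filters:
        cmd += ["-af", ",".join(filters)]
    if mono:
        cmd += ["-ac", "1"]
    cmd.append(dst)
    return cmd

command = ffmpeg_command("input.wav", "output.wav",
                         denoise=True, normalize=True, mono=True)
# To execute: import subprocess; subprocess.run(command, check=True)
```

Leaving a flag off skips that step, which matches the "When to Apply" column: only enable the filters your recording actually needs.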

When Processing Is Not Worth the Time

Three situations where the effort does not pay off:

The audio is already clean. If it sounds clear to your ears, transcribe it directly. Processing clean audio risks introducing the very artifacts you are trying to avoid.

The damage is the wrong kind. Severe clipping (the distorted, crackling sound from recording too hot) cannot be meaningfully repaired by noise reduction or EQ. iZotope RX's Declip module recovers mild clipping, but genuine word-level distortion throughout the file is not recoverable at home.

Echo or reverb is heavy. Room reflections baked into a recording are extremely difficult to remove. De-reverb tools (iZotope RX's De-reverb, Adobe Audition's DeReverb effect) partially help, but heavy reverb typically reduces accuracy by 10 to 20 percentage points regardless of processing. If the recording sounds like it was made in a bathroom with the speaker 10 feet away, re-recording is faster than processing.

For recordings that fall into those failure modes, see transcribing with poor audio quality for what to do next, or how to improve transcription accuracy for non-audio approaches like choosing a better-suited model.

If you want clean transcripts without the processing overhead, ConvertAudioToText works well as a starting point: upload, transcribe, and see the baseline accuracy before committing to an editing pass.

For a deeper look at how background noise specifically interacts with AI transcription models, see dealing with background noise in transcription.

Frequently Asked Questions

How much can audio processing improve transcription accuracy?

On moderately noisy recordings (steady-state hum, low volume, muffled speech), the full chain (noise reduction, normalization, EQ) typically recovers 10 to 20 percentage points. Current AI transcription models reach 95 to 97 percent accuracy on clean studio audio and can fall below 60 percent on noisy real-world recordings. Processing moves recoverable audio up toward the clean end of that range. On already-clean audio, the improvement is essentially zero.

Does Audacity cost anything?

No. Audacity is free and open-source (audacityteam.org). It covers every step in this chain: noise reduction, normalization, compression, EQ, mono conversion, and export. Adobe Podcast Enhance is also free for files up to 30 minutes with a free Adobe account. iZotope RX starts at $99 for the Elements tier and is the right choice only when dealing with severely degraded recordings.

Should I always convert stereo to mono before transcribing?

Not always. Convert to mono when both channels carry the same content: a single speaker recorded in stereo, a phone call export, or a meeting recording where the stereo field adds nothing. Keep stereo (or properly mix the channels) when your source has different speakers or instruments on each side, as in some dual-track interview setups.

What is the correct order for noise reduction, normalization, and EQ?

Noise reduction first, then normalization, then EQ. Normalizing before noise reduction amplifies the background noise, making it harder to remove cleanly. Running EQ before normalization means the EQ is working against an unstable level. The order in this guide reflects how each step sets up the next one.

Does audio format (WAV vs MP3) affect transcription accuracy?

Mostly no, at 128 kbps or above. A clean, processed MP3 at 128 kbps transcribes about as well as the same audio in WAV or FLAC. Where format matters: do not re-encode between lossy formats (MP3 to AAC, for example), and avoid MP3 files below 128 kbps, where compression artifacts start affecting speech clarity. If you are archiving the processed file, WAV or FLAC prevents any additional quality loss.

Sources

Audacity Noise Reduction manual: https://manual.audacityteam.org/man/noise_reduction.html
Adobe Podcast Enhance: https://podcast.adobe.com/en/enhance
iZotope RX 12 pricing: https://www.izotope.com/en/products/rx/pricing-options
AssemblyAI on audio format for speech-to-text: https://www.assemblyai.com/blog/best-audio-file-formats-for-speech-to-text
AssemblyAI on transcription accuracy: https://www.assemblyai.com/blog/how-accurate-speech-to-text
ffmpeg-normalize: https://github.com/slhck/ffmpeg-normalize
Sonix AI transcription accuracy trends: https://sonix.ai/resources/ai-transcription-accuracy-trends/
