
Transcription vs. Captioning vs. Subtitles: What's Actually Different
Transcription, captioning, and subtitling are often treated as synonyms. They are not. Each one solves a different problem, ships in a different format, and has different rules about what should and should not be included. This post clears up the distinctions, covers the format requirements that bite, and helps you pick which one you actually need.
The Quick Definitions
- Transcription turns audio into text. Output is a document. The audience reads it.
- Captioning turns audio into on-screen text that syncs to video, written for viewers who cannot hear. Includes sound effects, speaker IDs, music cues.
- Subtitling turns dialog into on-screen text, often translating into another language. Assumes viewers can hear but cannot understand the spoken language.
A podcast transcript on a website is transcription. The "CC" button on YouTube is captioning. The English text under a French film is subtitling. Three things, three different outputs.
Where the Confusion Comes From
The terms blur because the underlying tech is similar (speech recognition + timed text) and because different industries use them differently:
- In the US, "captions" usually means closed captions (CC) made for accessibility, and "subtitles" usually means same-language or translated dialog tracks.
- In the UK and much of Europe, "subtitles" is used for both, and "captions" is rarer.
- Streaming platforms (Netflix, Disney+, Amazon Prime) often list both as "subtitles" in the UI but treat them differently internally.
Once you understand what each one actually contains, the terminology becomes secondary. You can usually figure out which one a platform means from context.
Captions: Made for Viewers Who Cannot Hear
A caption track is built for someone watching a video with no audio. That changes what goes in it:
- Speaker identification. "Alice:" and "Bob:" or "[Alice]" before each utterance.
- Sound effects. "[door slams]," "[phone ringing]," "[laughter]."
- Music cues. "[upbeat music]," "[ominous music]," "[song lyrics]."
- Off-screen speakers. Identified explicitly.
- Tone indicators when not obvious from text. "[sarcastically]," "[whispering]."
Closed captions (CC) can be turned on and off by the viewer. Open captions are burned into the video and cannot be removed. Closed is the standard for streaming; open is common on social media (TikTok, Instagram), where most viewers watch on mute.
US accessibility rules (the ADA and FCC regulations) require closed captions on many types of public-facing video, and quality standards apply. Unedited auto-generated captions are usually not legally sufficient if a deaf or hard-of-hearing viewer could miss meaning.
For social-media open captions (TikTok, Reels), the TikTok video transcription tool and Instagram Reel transcription tool produce the right format.
Subtitles: Made for Viewers Who Cannot Understand the Language
Subtitles assume the viewer can hear. They include only what is necessary to follow the dialog:
- Dialog text, typically translated.
- No sound effects. The viewer can hear the door slam.
- No music cues. The viewer can hear the music.
- No speaker IDs unless ambiguous on screen.
A Spanish-to-English subtitle of a film translates the dialog and that is it. The subtitle assumes you can hear the gunshot, the door slam, and the music swell.
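To make the difference concrete, here is a minimal sketch of turning caption text into subtitle-style text by dropping bracketed cues and leading speaker labels. The function name and both patterns are illustrative assumptions, not part of any tool mentioned in this post. The speaker-label pattern is a rough heuristic (one or two capitalized words followed by a colon), so it can misfire on dialog such as "Note: ...".

```python
import re

# Bracketed non-speech cues such as "[door slams]" or "[Alice]".
BRACKETED_CUE = re.compile(r"\[[^\]]*\]")
# Heuristic speaker label: one or two capitalized words, then a colon.
SPEAKER_LABEL = re.compile(r"^\s*[A-Z][\w'-]*(?: [A-Z][\w'-]*)?:\s+")


def caption_to_subtitle_text(text: str) -> str:
    """Drop sound cues and speaker labels, keeping only the dialog."""
    kept = []
    for line in text.splitlines():
        line = BRACKETED_CUE.sub("", line)
        line = SPEAKER_LABEL.sub("", line)
        # Collapse the gaps left behind by removed cues.
        line = re.sub(r"\s{2,}", " ", line).strip()
        if line:
            kept.append(line)
    return "\n".join(kept)
```

A cue that contains only a sound effect comes back empty, so it can be dropped from the subtitle track entirely. Translation would still be a separate step after this cleanup.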
For language work, the Spanish transcription tool, French transcription tool, and Portuguese transcription tool produce the original-language transcript. Translation is a separate step on top.
Transcription: Made for Reading
A transcript is a document. Sometimes it has timestamps for navigation, but the primary use is reading, not watching. That changes the format:
- Paragraph-level structure for readability.
- Speaker labels when there is more than one.
- No sound effects (unless they are load-bearing in the narrative).
- Punctuation and formatting optimized for reading, not for screen display.
- No 32-character line limits (it is a document, not a caption track).
If a viewer can choose between a transcript and a video, the transcript should read like a polished interview, not like a caption track copied into a text file.
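One way to get from timed cues to a readable document is to rejoin the short caption lines and start a new paragraph wherever the speaker pauses. The sketch below assumes cues arrive as `(start_seconds, end_seconds, text)` tuples and uses a hypothetical `pause_threshold` of two seconds. Pause-based paragraphing is only a starting point; a human edit pass is still what makes it read like a polished interview.

```python
from typing import List, Optional, Tuple

Cue = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def cues_to_paragraphs(cues: List[Cue], pause_threshold: float = 2.0) -> str:
    """Join cue text into paragraphs, breaking where the speaker pauses."""
    paragraphs: List[str] = []
    current: List[str] = []
    previous_end: Optional[float] = None
    for start, end, text in cues:
        # Caption cues are split across short lines; rejoin them into one run.
        sentence = " ".join(text.split())
        long_pause = previous_end is not None and start - previous_end >= pause_threshold
        if long_pause and current:
            paragraphs.append(" ".join(current))
            current = []
        if sentence:
            current.append(sentence)
        previous_end = end
    if current:
        paragraphs.append(" ".join(current))
    return "\n\n".join(paragraphs)
```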
The audio transcription guide and verbatim vs. clean transcription post cover the document-style conventions in detail.
File Format Comparison
Each one has its preferred file formats:
| Output | Common formats |
|---|---|
| Transcription | TXT, DOCX, PDF |
| Captioning | SRT, VTT, SCC, TTML |
| Subtitling | SRT, VTT, SBV |
There is overlap (SRT and VTT serve both captioning and subtitling), and many tools export to all of them. The post on SRT, VTT, TXT export formats breaks down what each format contains and how players consume them.
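The overlap between SRT and VTT is easy to see in code. A basic SRT file becomes valid WebVTT once it has the `WEBVTT` header and its timestamps use a period instead of a comma before the milliseconds. The sketch below covers only that basic case; it is a hypothetical helper and ignores styling, positioning, and any formatting tags in the cue text.

```python
import re

# SRT timestamps use a comma before milliseconds, e.g. 00:00:01,500.
SRT_TIMESTAMP = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert plain SRT text to WebVTT (header plus '.' millisecond separator)."""
    normalized = srt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    lines = []
    for line in normalized.split("\n"):
        # Only touch timing lines so commas in the dialog are left alone.
        if "-->" in line:
            line = SRT_TIMESTAMP.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"
```

The numeric cue indexes from the SRT are kept as-is, since WebVTT accepts them as optional cue identifiers.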
Closed Captions vs. Open Captions
Two more terms to know.
Closed captions are stored separately from the video and toggled on and off by the viewer. Examples include YouTube's CC button, Netflix subtitles, and an SRT track embedded in a video file. The video itself stays "clean," and the caption track is a separate layer that the viewer can show or hide.
Open captions are burned into the video pixels. They cannot be turned off because they are part of the image. TikTok, Instagram Reels, and most short-form social media use open captions, because:
- Most viewers watch on mute.
- The platforms do not always honor closed-caption tracks reliably.
- The captions can be styled freely without depending on player capabilities.
For social platforms where most viewing happens with sound off, burn the captions in. For YouTube, longer videos, and accessibility-first contexts, use closed captions. Many video creators produce both: open for the social cuts, closed for the long-form upload.
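For burning captions in, one common route is ffmpeg's `subtitles` video filter, which draws the cues onto each frame. The sketch below only builds the argument list, so nothing runs until you pass it to `subprocess.run`. It assumes an ffmpeg build with libass support and simple file names; paths containing colons, quotes, or backslashes need extra escaping inside the filter argument. The function name is hypothetical.

```python
from typing import List


def burn_in_command(video_path: str, srt_path: str, output_path: str) -> List[str]:
    """Build an ffmpeg argument list that renders an SRT file into the video frames."""
    return [
        "ffmpeg",
        "-i", video_path,
        # The subtitles filter rasterizes each cue onto the picture (open captions).
        "-vf", f"subtitles={srt_path}",
        # Audio is copied unchanged; only the video stream is re-encoded.
        "-c:a", "copy",
        output_path,
    ]
```

To run it, call `subprocess.run(burn_in_command("talk.mp4", "talk.srt", "talk_captioned.mp4"), check=True)`. Keep the original SRT around: it is still the closed-caption track for the long-form upload.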
Same-Language Captioning vs. Translated Captioning
A YouTube video in English can have English captions (same-language) and Spanish captions (translated). Same format (SRT/VTT), different content:
- Same-language captions are captioning proper: the dialog in the original language, plus speaker IDs and sound cues.
- Translated captions are subtitling: dialog translated, sound cues usually omitted.
The transcription pipeline produces the same-language version. Translation is a downstream step. Some tools chain both (transcribe + translate); others require you to do it in two steps.
When You Actually Need Each One
Map your use case to the right output:
| Goal | Right output |
|---|---|
| Blog post or article from an interview | Transcription |
| Podcast show notes | Transcription |
| YouTube video accessible to deaf viewers | Closed captions |
| YouTube video for non-English speakers | Translated subtitles |
| TikTok / Reels for viewers watching on mute | Open captions, burned in |
| Live meeting recording for the team | Transcription |
| Lecture or course video | Closed captions + transcript |
| Film for international release | Translated subtitles |
| Public-facing video for ADA compliance | Closed captions |
For most content creators, the answer is "I need a transcript AND captions." That is fine. Most modern tools produce both from a single transcription run; the free English tool and transcribe-platform tools export TXT for reading and SRT/VTT for embedding.
Practical Workflow for All Three
If you are producing a video that will also live as a blog post and also be subtitled:
- Transcribe the audio with sentence-level timestamps.
- Export TXT for the blog post version. Edit for readability.
- Export SRT for the same-language closed captions. Edit for cue length and readability.
- Translate the SRT for foreign-language subtitle tracks (a separate translation step, usually with a different tool or a human translator).
- For social cuts, burn the captions into the video using a tool that supports it.
This produces three to five deliverables from one transcription run. The cost is one transcribe call plus some editing time; the value is several different audiences served from the same source.
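The first few steps can be sketched as two exports from one list of timed segments. The segment shape here, `(start_seconds, end_seconds, text)`, is an assumption for illustration; real transcription tools each have their own output schema, so treat this as a sketch rather than any particular tool's export code.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_seconds, end_seconds, text)


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: List[Segment]) -> str:
    """Caption export: numbered cues with start and end times."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_srt_time(start)} --> {format_srt_time(end)}"
        blocks.append(f"{index}\n{timing}\n{text.strip()}")
    return "\n\n".join(blocks) + "\n"


def segments_to_txt(segments: List[Segment]) -> str:
    """Transcript export: the same words with the timing dropped."""
    return " ".join(text.strip() for _, _, text in segments)
```

Both exports read from the same segments, which is why a single transcription run can feed the blog post and the caption track. The editing passes then diverge: readability for the TXT, cue length for the SRT.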
Accessibility, Quality, and Legal Notes
If you are captioning for accessibility (ADA, Section 508, FCC, EU EAA), there are quality standards beyond "the text matches the audio":
- 99%+ accuracy is the benchmark most commonly cited for compliance.
- Synchronization within 1-2 seconds of the audio.
- Speaker identification when there is more than one.
- Sound effects and music cues when they convey meaning.
- Readable line length (usually 32-42 characters).
Auto-generated captions usually do not meet these standards out of the box. Plan for an edit pass. The "how to add subtitles to video" post covers the workflow in more depth.
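Two of the checks in the list above are easy to automate before the human edit pass: word accuracy against a corrected reference, and overlong lines. The sketch below measures accuracy as one minus the word error rate, using a word-level edit distance. That is one common way to measure it, not a definition taken from any regulation. Function names and the 42-character default are assumptions, and the comparison is deliberately simple: it lowercases the text but does not strip punctuation.

```python
from typing import List


def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - WER, where WER is word-level edit distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 1.0 if not hyp else 0.0
    # Dynamic-programming edit distance, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the captions
                current[j - 1] + 1,      # extra word in the captions
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return max(0.0, 1.0 - previous[-1] / len(ref))


def overlong_lines(cue_text: str, max_chars: int = 42) -> List[str]:
    """Return the lines of a cue that exceed the character limit."""
    return [line for line in cue_text.splitlines() if len(line) > max_chars]
```

For a sense of scale, a 1,500-word video at 99% accuracy still contains about 15 wrong words, so the edit pass matters even when the score looks high.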
What to Do Next
Identify what you are producing. If it is a document, you want transcription. If it is for deaf or hard-of-hearing viewers, you want captioning. If it is for viewers who do not understand the spoken language, you want subtitling. The same transcription run can produce all three; just pick the right export and editing pass for each.