Transcription vs Captioning vs Subtitles: 2026 Guide

By Mamane B. Moussa | May 26, 2026 | Updated July 2, 2026 | 11 min read

TL;DR

Transcription produces a readable document from audio. Captioning produces timed, on-screen text for viewers who cannot hear, including sound effects and speaker IDs. Subtitles translate or transcribe dialog for viewers who cannot understand the language, and omit sound cues the viewer can already hear. The same transcription run can produce all three outputs; you just need the right export and edit pass for each. Translated subtitles do not satisfy accessibility law: only true captions do.

A transcript, a caption track, and a subtitle file are three different things. All three start from the same audio, but each solves a different problem, ships in a different format, and carries different content. Picking the wrong one either fails your audience or fails a compliance check.

The Three Artifacts, Mapped

Attribute	Transcript	Closed Captions	Subtitles
Audience	Readers	Viewers who cannot hear	Viewers who cannot understand the spoken language
Sound effects included?	No (unless load-bearing)	Yes, required	No
Music cues included?	No	Yes	No
Speaker IDs included?	When multiple speakers	Yes, required	Only if ambiguous on screen
Translated?	Rarely	Same language or translated	Usually translated
Typical formats	TXT, DOCX, PDF	SRT, VTT, SCC, TTML	SRT, VTT, SBV
Where legally required	Audio-only content (Section 508)	Pre-recorded video (WCAG SC 1.2.2), live video (SC 1.2.4)	No accessibility law requires them

That last row is the one that surprises people most: translated subtitles do not satisfy accessibility law. Only true captions do.

Transcription: Output Is a Document

What is the difference between a transcript and closed captions?

A transcript is a document: no time codes, optimized for reading, with paragraph structure and no line-length limits. Closed captions are a timed text track synchronized to a video, stored separately from the video so viewers can toggle them on or off. Both start with the same speech recognition pass, but the output format, content, and editing conventions are different.

A transcript is meant to be read, not watched. That changes everything about how it is formatted.

Paragraph-level structure for readability, not 32-character line breaks.
Speaker labels when there is more than one voice, written out for readability.
No time codes required (optional timestamps for navigation are fine).
Punctuation and formatting optimized for a reader, not a video player.
No line-length cap. No cue-end markers.

If you take a caption file and paste it into a document, you get something that reads like a printed ticker tape. A proper transcript is edited to read like a polished interview or article. The two things are structurally different even when the words are the same.
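As a rough illustration of that structural difference, the sketch below turns SRT text into paragraph-style prose by dropping cue numbers and time codes and regrouping sentences. The function name `srt_to_transcript` and the `sentences_per_paragraph` parameter are hypothetical, and the sentence splitting is deliberately naive; a real transcript would still need a human edit pass.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500".
TIMING = re.compile(
    r"^\d{2}:\d{2}:\d{2}[,.]\d{3}\s+-->\s+\d{2}:\d{2}:\d{2}[,.]\d{3}"
)


def srt_to_transcript(srt_text: str, sentences_per_paragraph: int = 3) -> str:
    """Drop cue numbers and time codes, then regroup the text into paragraphs."""
    fragments = []
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue indexes, and timing lines.
        # Limitation: a caption line consisting only of digits is also skipped.
        if not line or line.isdigit() or TIMING.match(line):
            continue
        fragments.append(line)
    text = " ".join(fragments)

    # Naive sentence split: break after ., ?, or ! followed by whitespace.
    sentences = re.split(r"(?<=[.?!])\s+", text)
    paragraphs = [
        " ".join(sentences[i:i + sentences_per_paragraph])
        for i in range(0, len(sentences), sentences_per_paragraph)
    ]
    return "\n\n".join(paragraphs)
```

This only removes the caption scaffolding. Bracketed sound cues and speaker prefixes are left in place, so the result is a starting point for editing rather than a finished transcript.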

Common use cases: blog posts from interviews, podcast show notes, meeting records, academic papers, legal depositions, and content for search indexing.

Captions: Output Is a Synchronized Accessibility Track

Generate the caption track once, then export SRT or VTT per platform

What must closed captions include that subtitles do not?

Closed captions must include speaker identification (including off-screen speakers), sound effects (e.g., [door slams], [phone ringing], [audience applause]), music cues (e.g., [upbeat music]), and tone indicators when the text alone is ambiguous. Subtitles carry dialog only, sometimes translated, because the viewer can hear everything else.

Captioning is designed for a viewer watching with no audio at all. Everything they would hear must appear on screen, which is why captions carry content that transcripts and subtitles do not.

Required elements:

Speaker identification. "Alice:" or "[Bob]" before each utterance, especially when speakers are off-screen or not visually identifiable.
Sound effects. "[door slams]," "[phone ringing]," "[crowd applause]."
Music cues. "[upbeat music playing]," "[ominous strings]," "[song: lyrics here]."
Tone indicators when the emotional meaning is not clear from the words alone. "[sarcastically]," "[whispering]."

Captions are also time-coded to the video: each cue has a start time and end time, and the text appears on screen in sync with the audio. The quality of that synchronization matters legally.
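To make the cue structure concrete, here is a minimal sketch that represents a cue as a start time, an end time, and text, and serializes a list of cues as SRT. The `Cue` class and the helper names are illustrative, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str     # what appears on screen, e.g. "[door slams]"


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm form used in SRT timing lines."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: list[Cue]) -> str:
    """Serialize cues as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        timing = f"{srt_timestamp(cue.start)} --> {srt_timestamp(cue.end)}"
        blocks.append(f"{index}\n{timing}\n{cue.text}")
    return "\n\n".join(blocks) + "\n"
```

For example, `to_srt([Cue(1.0, 2.5, "[door slams]")])` yields a single block whose timing line reads `00:00:01,000 --> 00:00:02,500`.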

For social platforms, the open vs closed captions post covers the distinction in detail. The short version: closed captions are a toggle (the CC button on YouTube); open captions are burned into the video pixels and cannot be removed. Short-form social content (TikTok, Instagram Reels) typically uses open captions because most viewers watch on mute and platform closed-caption support is inconsistent. Learn more in the captioning for TikTok and Instagram guide.

Subtitles: Output Is a Dialog Track for Language Access

Do subtitles count as captions for accessibility compliance?

No. Subtitles assume the viewer can hear and therefore omit sound effects, music cues, and non-speech audio. Accessibility standards and laws (WCAG SC 1.2.2, ADA Title II, Section 508, EAA) require captions that include all audio information needed to understand the content. A subtitle track that only carries dialog does not satisfy these requirements.

Subtitles assume the viewer can hear. They include only what is necessary to follow the dialog in another language (or occasionally in the same language for viewers who need reading support).

Dialog text, usually translated into the viewer's language.
No sound effects. The viewer can hear the door slam.
No music cues. The viewer can hear the score.
No speaker IDs unless completely ambiguous on screen.

A Spanish-to-English subtitle of a film translates the dialog and stops there. The subtitle track trusts that the viewer heard the gunshot, the laughter, and the silence.

The practical implication: if you have a subtitle track on your video but no caption track, you have served non-native-language viewers but not deaf or hard-of-hearing viewers. These are two separate deliverables.
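The relationship between the two deliverables is one-directional, which a small sketch can show. Assuming captions follow the bracket and "Name:" conventions described earlier in this article, a dialog-only line can be derived from a caption line by stripping the non-speech parts. The function name and regular expressions below are hypothetical heuristics, not a standard.

```python
import re

# Bracketed non-speech content: [door slams], [Bob], [whispering].
BRACKETED = re.compile(r"\[[^\]]*\]")
# A "Name: " prefix at the start of a line. Heuristic only: it will also
# strip ordinary text that happens to look like a label, such as "Note: ".
SPEAKER_PREFIX = re.compile(r"^[A-Z][\w .'-]{0,30}:\s+")


def caption_to_dialog(caption_text: str) -> str:
    """Return the dialog-only text of one caption cue, or '' if nothing is left."""
    kept = []
    for line in caption_text.splitlines():
        line = BRACKETED.sub("", line).strip()
        line = SPEAKER_PREFIX.sub("", line)
        line = re.sub(r"\s{2,}", " ", line).strip()
        if line:
            kept.append(line)
    return "\n".join(kept)
```

Going the other way is not possible in code: a subtitle file never contained the sound cues, so a caption track cannot be recovered from it. Cues that come back empty (pure sound effects) would simply be dropped from the subtitle track.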

A Note on Terminology

The terms are not universal. In the US, "captions" usually means the accessibility track (sound cues, speaker IDs) and "subtitles" usually means the dialog-only track. In the UK and much of Europe, "subtitles" covers both, and "captions" is rarely used. Streaming platforms (Netflix, Disney+, Amazon Prime) list both types under the word "subtitles" in their UI but treat them differently internally.

Netflix, for example, uses the term SDH (Subtitles for the Deaf and Hard of Hearing) for the accessibility track that carries sound cues. Internally, Netflix requires all SDH and subtitle files in TTML1 format (.xml or .ttml), not SRT or VTT, for its delivery pipeline. That is the context where the terminology gap causes real production problems.

File Formats: What Goes Where

What file format should I use for captions vs subtitles?

SRT is the most portable and works on YouTube, Vimeo, Facebook, and most video players. VTT (WebVTT) adds styling and positioning for HTML5 players and is preferred for web-hosted video. TTML (and its profile IMSC1.1) is required for professional broadcast and OTT delivery such as Netflix. SCC is used for North American broadcast workflows. For most independent creators and online courses, generate SRT first and convert downstream as needed.

Format	Best for	Key platforms
SRT	Portability, most online platforms	YouTube, Vimeo, Facebook, most players
VTT (WebVTT)	HTML5 web players, styling control	Vimeo, web-hosted video, HTML5 `track` element
SCC	North American broadcast workflows	Legacy broadcast and professional editing
TTML / IMSC1.1	OTT and streaming delivery	Netflix, broadcast, Disney+, Amazon Prime
TXT / DOCX	Transcripts for reading	Blog posts, documents, search indexing

The practical rule: generate SRT first for maximum portability, convert to TTML downstream for professional OTT delivery. SRT and VTT serve both captioning and subtitling; the content determines which type you have, not the file extension.
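Converting SRT to WebVTT is the simplest downstream conversion: WebVTT needs a `WEBVTT` header line and uses a period instead of a comma before the milliseconds. The sketch below covers only those two differences; the function name is illustrative, and it does not handle SRT styling tags or any VTT positioning settings.

```python
import re

# An SRT timing line, capturing the clock part and milliseconds separately.
SRT_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3})\s+-->\s+(\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to WebVTT: add the header, switch ',' to '.' in timings."""
    # Normalize line endings and drop a leading byte-order mark if present.
    body = srt_text.replace("\r\n", "\n").lstrip("\ufeff").strip()
    body = SRT_TIMING.sub(r"\1.\2 --> \3.\4", body)
    return "WEBVTT\n\n" + body + "\n"
```

The numeric cue indexes from the SRT file are kept, since WebVTT allows an optional identifier line before each timing line. Conversion to TTML is a different job, because that format is XML and typically has delivery-specific requirements.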

For a deeper look at what each format contains and how players parse them, see how to create an SRT file.

Same-Language vs Translated Tracks

A YouTube video in English can have an English caption track (accessibility, includes sound cues) and a Spanish subtitle track (dialog translated, no sound cues). Same file format, different content:

Same-language captions: accessibility track with speaker IDs and sound effects.
Same-language subtitles: dialog-only, sometimes used for comprehension support or noisy environments; not an accessibility substitute.
Translated captions (SDH in another language): sound cues translated alongside dialog; required when the content is originally in one language and the target audience may be deaf or hard of hearing in that language.
Translated subtitles: dialog translated, no sound cues; serves language access, not accessibility.

The transcription pipeline produces the source-language text. Translation is a downstream step, applied either to captions or to subtitles depending on the audience need.

When You Need Each One

Goal	Right output
Blog post from a recorded interview	Transcript (TXT or DOCX)
Podcast show notes	Transcript
YouTube video: deaf and hard-of-hearing viewers	Same-language closed captions (SRT or VTT)
YouTube video: non-English-speaking viewers	Translated subtitles (SRT or VTT)
TikTok or Instagram Reels	Open captions burned into video
Live meeting record for the team	Transcript
University lecture or online course	Closed captions + separate transcript
Film for international release	Translated subtitles per language
Federal agency or contractor video (Section 508)	Closed captions + descriptive transcript
Public-facing video (ADA Title II, EAA)	Closed captions (subtitles alone do not satisfy this)

For most content creators, the answer is "transcript and captions." That is fine. A single transcription run can produce both: export TXT for reading, export SRT for embedding. The subtitle generator produces SRT and VTT files from audio or video in one step.

Legal Requirements: What the Law Actually Says

Are auto-generated captions good enough for accessibility compliance?

Usually not on their own. The industry benchmark for broadcast captions, commonly cited alongside the FCC's quality standards, is 99% word accuracy. Section 508 and W3C guidance state that auto-generated captions are insufficient unless confirmed fully accurate, because errors that alter meaning create an accessibility barrier. Plan for a human edit pass on auto-generated output before publishing for compliance purposes.

My take: the legal landscape is more specific than most posts acknowledge, and the distinctions between transcript, captions, and subtitles are exactly where organizations get it wrong.

WCAG (Web Content Accessibility Guidelines):

SC 1.2.2 (Level A): captions for pre-recorded audio in synchronized media.
SC 1.2.4 (Level AA): captions for live audio in synchronized media.
Transcripts alone satisfy the requirement for audio-only content but not for video with audio.
Subtitles do not satisfy either requirement.

ADA Title II (US): The DOJ's April 2024 final rule established WCAG 2.1 Level AA as the compliance standard for public entities. The first deadline hit April 24, 2026, for entities serving populations of 50,000 or more.

Section 508 (US federal): Synchronized media requires captions (WCAG SC 1.2.2) plus audio descriptions for significant visual content. Transcripts are required separately for audio-only media.

FCC (US broadcast): Quality standards adopted in 2014 frame accuracy, synchronicity, program completeness, and caption placement as the four principles. The industry benchmark is 99% word accuracy. Auto-generated captions are not considered sufficient without a human review pass.
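One way to get a rough number to compare against the 99% benchmark is to measure word accuracy as one minus the word error rate against a human-verified reference. The sketch below is an assumption about how to approximate that figure, not an official FCC measurement method: it lowercases, splits on whitespace, and counts word-level substitutions, insertions, and deletions. Punctuation attached to a word counts as part of the word.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - word error rate of hypothesis against reference, floored at 0."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # Dynamic-programming edit distance over words, one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # word missing from the hypothesis
                current[j - 1] + 1,      # extra word in the hypothesis
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current

    errors = previous[-1]
    return max(0.0, 1.0 - errors / len(ref))
```

A score below 0.99 on a representative sample suggests the auto-generated output needs more editing. A score above it still says nothing about synchronization, completeness, or placement, which are judged separately.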

European Accessibility Act (EAA): Became applicable on June 28, 2025, covering businesses operating in or selling to EU residents. Audiovisual content must include synchronized captions or subtitles meeting quality standards, with a commonly cited reading speed of approximately 160-180 words per minute.
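The reading-speed figure can be checked per cue with simple arithmetic: words divided by on-screen duration, scaled to a minute. The sketch below assumes cues are `(start_seconds, end_seconds, text)` tuples and uses 180 words per minute as the default ceiling; both the tuple shape and the function names are illustrative.

```python
def words_per_minute(text: str, start: float, end: float) -> float:
    """Reading speed a viewer needs to finish the cue before it disappears."""
    duration = end - start
    if duration <= 0:
        raise ValueError("cue must end after it starts")
    return len(text.split()) * 60.0 / duration


def too_fast(cues, limit_wpm: float = 180.0):
    """Return the (start, end, text) cues whose reading speed exceeds limit_wpm."""
    return [c for c in cues if words_per_minute(c[2], c[0], c[1]) > limit_wpm]
```

Three words shown for one second works out to 180 words per minute, right at the ceiling. Cues flagged by `too_fast` are candidates for a longer display time or a tighter edit.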

For a full breakdown of captioning compliance across jurisdictions, see WCAG compliance with transcripts and accessibility captions and ADA compliance.

This section is current as of mid-2026 and is provided for general information only, not legal advice. Verify requirements for your specific jurisdiction and use case.

What to Do Next

Identify your deliverable first. If the audience reads it, you need a transcript. If the audience watches with no audio, you need captions. If the audience watches but cannot understand the spoken language, you need subtitles. The same transcription run produces all three; pick the right export and edit pass for each.

If you need to generate a caption or subtitle file without building a full workflow, the subtitle generator at ConvertAudioToText handles audio and video files and exports SRT or VTT directly.

Sources

W3C WAI, "Captions/Subtitles": https://www.w3.org/WAI/media/av/captions/
W3C WAI, "Understanding SC 1.2.2 Captions (Prerecorded)": https://www.w3.org/WAI/WCAG22/Understanding/captions-prerecorded.html
Section508.gov, "Video and Other Synchronized Media": https://www.section508.gov/create/synchronized-media/
3Play Media, "FCC Closed Captioning Quality Standards": https://www.3playmedia.com/blog/fccs-new-quality-standards-closed-captioning-video-programming/
FCC, "FCC Adopts Closed Captioning Quality Standards for TV Programs": https://www.fcc.gov/fcc-adopts-closed-captioning-quality-standards-tv-programs
Netflix Partner Help Center, "Timed Text Style Guide: General Requirements": https://partnerhelp.netflixstudios.com/hc/en-us/articles/215758617-Timed-Text-Style-Guide-General-Requirements
Interprefy, "The European Accessibility Act 2025 Captioning Requirements": https://www.interprefy.com/resources/blog/european-accessibility-act-captioning-requirements
DOJ April 2024 Title II final rule, ADA.gov
