Batch Transcription for Large Projects: The 2026 Playbook

By Mamane B. Moussa | May 26, 2026 | Updated July 2, 2026 | 10 min read

TL;DR

Running batch transcription at scale requires a manifest-driven pipeline, not just a for-loop calling an API. This guide covers the full workflow for archive-scale jobs: file organization, API submission patterns, concurrency and rate limits, exponential-backoff retries, QC sampling, and cost comparison across Deepgram, AssemblyAI, AWS Transcribe, and OpenAI Whisper. The pilot-before-full-batch principle saves more time than any other single habit.

A 1,000-file archive is not a UI problem. The moment a project crosses roughly 50 hours of audio, the question stops being "which tool do I use" and starts being "how do I run this pipeline without losing track of what failed."

The manifest meets the uploader: batch jobs run file by file

This post is about building that pipeline. We will cover file organization, API submission patterns, parallelism, rate limits, failure handling, QC sampling, naming conventions, and cost math across the main APIs. The patterns apply whether you are migrating a podcast back-catalog, processing research interviews, or archiving depositions.

Before You Write a Single API Call: The Manifest

The biggest hidden cost in any large batch is reconstruction work: figuring out which files ran, which failed, and which outputs belong to which source after the job sprawls across hours or days.

A CSV manifest written before submission eliminates that cost. The columns that matter:

Column	Notes
`file_id`	Stable slug, not the raw filename (filenames drift)
`source_path`	Absolute path or storage URL
`duration_sec`	Pre-fill if you can; helps estimate cost
`language`	Set explicitly; do not rely on auto-detection at scale
`status`	`pending` / `submitted` / `completed` / `failed`
`job_id`	Returned by the API on submission
`output_path`	Where the transcript was saved
`error`	Error message or code if status is `failed`
`reviewed`	QC flag: blank / `pass` / `needs-review`

Update the manifest in place as the batch runs. A Python script that reads the manifest, skips completed rows, submits pending rows, and polls submitted rows is idempotent by construction, so you can restart it safely at any point.
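A minimal sketch of that restartable pass, assuming the manifest uses exactly the columns listed above. The `submit` callable is a placeholder for whatever function sends one file to your provider and returns a job ID; `run_pass`, `load_manifest`, and `save_manifest` are illustrative names, not part of any library.

```python
import csv
import os

FIELDS = ["file_id", "source_path", "duration_sec", "language", "status",
          "job_id", "output_path", "error", "reviewed"]

def load_manifest(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def save_manifest(path, rows):
    # Write to a temp file, then swap it in, so a crash never leaves a half-written manifest
    tmp = path + ".tmp"
    with open(tmp, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp, path)

def run_pass(path, submit):
    """Submit every pending row once; rows in any other state are left alone."""
    rows = load_manifest(path)
    for row in rows:
        if row["status"] != "pending":
            continue  # already submitted, completed, or failed
        try:
            row["job_id"] = submit(row["source_path"], row["language"])
            row["status"] = "submitted"
        except Exception as exc:  # record the failure and keep going
            row["status"] = "failed"
            row["error"] = str(exc)
        save_manifest(path, rows)  # persist after every row
    return rows
```

Running `run_pass` a second time submits nothing new, because every row it touched has already left the `pending` state. Rewriting the whole CSV after each row is fine for a few thousand rows; a much larger manifest would probably want SQLite instead.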

Directory Structure That Scales

/batch-project/
    manifest.csv
    source/
        2024-Q1/
            interview-001-smith-2024-03-12.mp3
            interview-002-jones-2024-03-14.mp3
        2024-Q2/
            ...
    transcripts/
        2024-Q1/
            interview-001-smith-2024-03-12.txt
            interview-001-smith-2024-03-12.srt
            interview-001-smith-2024-03-12.json
        ...

The naming pattern {id}-{speaker}-{date}.{ext} makes it possible to reconstruct the manifest from the filesystem if something goes wrong. Avoid spaces in filenames; unless they are percent-encoded, they break unsigned URLs in most storage systems.
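Here is one way the reconstruction could look. The regular expression is an assumption based on the example tree above (an ID shaped like `interview-001`, a lowercase speaker slug, an ISO date); adjust it to whatever convention you actually adopt. `parse_name` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed shape: {id}-{speaker}-{date}.{ext}, where id is "<word>-<number>"
# and date is YYYY-MM-DD, as in the example tree above.
NAME_RE = re.compile(
    r"^(?P<file_id>[a-z]+-\d+)-(?P<speaker>[a-z0-9]+)-(?P<date>\d{4}-\d{2}-\d{2})$"
)

def parse_name(path):
    """Split a source filename into manifest fields, or return None if it does not match."""
    p = Path(path)
    m = NAME_RE.match(p.stem)
    if m is None:
        return None
    return {**m.groupdict(), "ext": p.suffix.lstrip("."), "source_path": str(p)}

row = parse_name("source/2024-Q1/interview-001-smith-2024-03-12.mp3")
print(row["file_id"], row["speaker"], row["date"])
```

Files that return `None` are the ones worth renaming before the batch starts, not after.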

Choosing the Right API

For batch projects, three factors determine which API to reach for: per-minute cost, concurrent request ceiling, and whether you need speaker diarization or domain vocabulary. Here is what the current pricing looks like, verified against vendor pages in early July 2026.

Provider	Model	Base Rate	With Diarization	Concurrent Limit
AssemblyAI	Universal-2	$0.15/hr	$0.17/hr	200 jobs
AssemblyAI	Universal-3.5 Pro	$0.21/hr	$0.23/hr	200 jobs
Deepgram	Nova-3 Monolingual	$0.46/hr	$0.58/hr	50 requests
Deepgram	Nova-3 Multilingual	$0.55/hr	$0.67/hr	50 requests
AWS Transcribe	Standard (US East)	$0.36/hr	N/A built-in	Varies by account
OpenAI	GPT-4o-mini-transcribe	$0.18/hr	N/A	Per rate-limit tier

AssemblyAI is the cost leader for raw transcription volume. Deepgram Nova-3 costs more but shows strong performance on noisy multi-speaker audio and offers keyterm prompting for domain vocabulary. AWS Transcribe bills in one-second increments with a 15-second minimum per request; that minimum inflates effective costs when your archive has many short clips.

For a 1,000-hour archive using AssemblyAI Universal-2 with diarization, expect roughly $170. The same job on Deepgram Nova-3 with diarization runs around $580. On clear audio, both beat human transcription on cost by an order of magnitude.
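The arithmetic behind those figures, plus a rough look at how a per-request minimum inflates billing on short clips. The rates come from the table above; the 1,000 five-second clips are a made-up example, and `batch_cost` and `billed_seconds` are illustrative helpers.

```python
import math

def batch_cost(hours, rate_per_hour):
    """Cost in dollars at a flat per-hour rate."""
    return hours * rate_per_hour

def billed_seconds(durations_sec, minimum_sec=15):
    """Total billed seconds when each request is rounded up to whole seconds with a floor."""
    return sum(max(math.ceil(d), minimum_sec) for d in durations_sec)

print(round(batch_cost(1000, 0.17), 2))   # 170.0
print(round(batch_cost(1000, 0.58), 2))   # 580.0

# Hypothetical archive: 1,000 clips of five seconds each under a 15-second minimum
clips = [5] * 1000
print(billed_seconds(clips) / sum(clips))  # 3.0
```

In that hypothetical archive you pay for three times the audio you actually have, which is why pre-filling `duration_sec` in the manifest is worth the effort.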

See transcription pricing comparison 2026 and cost of transcription per hour for deeper breakdowns.

Submission Patterns: URL vs Upload

Use URL submission for files already in cloud storage. The API downloads from a signed URL; you never re-upload gigabytes of audio over your connection. Most storage systems (S3, R2, GCS) generate signed URLs with a straightforward SDK call.

Use direct multipart upload only for files that live locally and will not be stored elsewhere. At scale, the bandwidth and time cost of local-to-API uploads adds up fast.

The Submission Loop

The basic Python pattern, written to be restartable:

import csv
import time
import requests

API_KEY = "your_api_key"
BASE_URL = "https://api.assemblyai.com/v2"

def submit_job(url, language="en"):
    """Submit one audio URL for transcription and return the job ID."""
    resp = requests.post(
        f"{BASE_URL}/transcript",
        headers={"authorization": API_KEY},
        json={
            "audio_url": url,
            "language_code": language,
            "speaker_labels": True,  # enable diarization
        },
        timeout=30,  # do not hang forever on a stalled connection
    )
    resp.raise_for_status()
    return resp.json()["id"]

def poll_job(job_id, timeout=600):
    """Poll until the job completes, errors, or the timeout elapses."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(
            f"{BASE_URL}/transcript/{job_id}",
            headers={"authorization": API_KEY},
            timeout=30,
        )
        # Surface HTTP errors here instead of a KeyError on "status" below
        resp.raise_for_status()
        data = resp.json()
        if data["status"] == "completed":
            return data
        if data["status"] == "error":
            raise RuntimeError(data.get("error", "unknown"))
        time.sleep(15)
    raise TimeoutError(f"Job {job_id} did not complete in {timeout}s")

For a 1,000-file batch, do not poll synchronously in a single thread. Submit a window of jobs, collect the IDs, then poll concurrently, or switch to webhooks once your endpoint is stable.
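One way to poll a window of jobs concurrently is a thread pool from the standard library. This is a sketch, not the only option: `poll_many` is a hypothetical helper, and it takes the polling function as an argument so you can pass in `poll_job` from above or a stub while testing.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def poll_many(job_ids, poll, max_workers=20):
    """Poll every job in a thread pool; return (results, errors) keyed by job ID."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(poll, job_id): job_id for job_id in job_ids}
        for fut in as_completed(futures):
            job_id = futures[fut]
            try:
                results[job_id] = fut.result()
            except Exception as exc:  # e.g. RuntimeError or TimeoutError from the poller
                errors[job_id] = str(exc)
    return results, errors
```

Typical use would be `results, errors = poll_many(submitted_ids, poll_job)`, followed by writing `completed` or `failed` back to the matching manifest rows.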

Parallelism and Rate Limits

Throttling the Submission Loop

Even with high concurrency ceilings, submitting all 1,000 requests in 30 seconds will trigger rate limits on most providers. A semaphore keeps you inside the bounds:

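import asyncio
import aiohttp

CONCURRENCY = 50  # Stay under provider ceiling

async def submit_with_sem(sem, session, row):
    # API_KEY and BASE_URL are the constants from the submission loop;
    # assumes source_path holds a URL the provider can download from
    payload = {"audio_url": row["source_path"], "language_code": row["language"]}
    async with sem:  # at most CONCURRENCY requests in flight
        async with session.post(f"{BASE_URL}/transcript", json=payload) as resp:
            resp.raise_for_status()
            return (await resp.json())["id"]

async def submit_all(manifest_rows):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession(headers={"authorization": API_KEY}) as session:
        tasks = [submit_with_sem(sem, session, row) for row in manifest_rows]
        return await asyncio.gather(*tasks, return_exceptions=True)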

Add a small sleep between submissions (10-50 ms) if you see 429 responses during the initial burst.

For production systems, the webhook vs polling post lays out the tradeoff as a decision tree.

Failure Handling

Language Handling in Mixed-Language Batches

Custom Vocabulary: Worth Setting Up First

Quality Control: Sample Before You Call It Done

The temptation on a 1,000-file batch is to accept all transcripts as done. The sample-and-review step takes 5-10 percent of project time and catches the systemic issues that would otherwise propagate into every downstream use of the transcripts.

What to Sample

Pull a random 5 percent of completed transcripts. Include at least one file from each major source type in the batch (different recording environments, different speakers, different equipment).
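A sketch of that sampling rule. It assumes each manifest row carries a `source_type` value, which is not one of the manifest columns listed earlier; you would need to add it or derive it from the directory name. `qc_sample` is an illustrative helper.

```python
import math
import random
from collections import defaultdict

def qc_sample(rows, fraction=0.05, group_key="source_type", seed=0):
    """Random sample of completed rows with at least one row per group."""
    done = [r for r in rows if r["status"] == "completed"]
    if not done:
        return []
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for r in done:
        groups[r.get(group_key, "")].append(r)
    # One guaranteed pick per source type, then top up at random to the target size
    picked = [rng.choice(members) for members in groups.values()]
    target = max(len(picked), math.ceil(len(done) * fraction))
    picked_ids = {r["file_id"] for r in picked}
    rest = [r for r in done if r["file_id"] not in picked_ids]
    picked += rng.sample(rest, min(len(rest), target - len(picked)))
    return picked
```

Write the chosen `file_id` values back to the `reviewed` column as you work through them, so the sample survives a restart.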

Listen to 2-minute segments from each sampled file while reading the transcript. Flag:

Consistent misidentification of a recurring term (add it to the vocabulary list and re-run the affected files)
Speaker misattribution on multi-speaker files (check if diarization was enabled; check audio quality)
Sections where confidence dropped (often correlates with background noise or overlapping speech)

Spot Checks on High-Stakes Files

Some files matter more than others. The anchor interview in a research project. The key deposition in a legal archive. The founding episode of a podcast. Pull those for closer review regardless of what the sample says. AI confidence scores do not map perfectly onto "this file was important."

Confidence Thresholds as a Signal, Not a Gate

Some providers return word-level confidence scores in the JSON response. Flag transcripts where average confidence falls below about 0.80 as candidates for human review, but do not auto-discard them. Low confidence on clear audio usually means domain vocabulary is unfamiliar. Low confidence on noisy audio usually means the audio needs re-recording or the file is a write-off.
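A small sketch of the flagging step. It assumes the transcript JSON has a `words` list whose items each carry a numeric `confidence` field; field names differ between providers, so check your provider's response schema. Both function names are illustrative.

```python
def mean_confidence(transcript):
    """Average word-level confidence, or None when no words were returned."""
    words = transcript.get("words") or []
    if not words:
        return None
    return sum(w["confidence"] for w in words) / len(words)

def needs_review(transcript, threshold=0.80):
    """Flag for human review; this never discards the transcript."""
    score = mean_confidence(transcript)
    return score is None or score < threshold
```

An empty word list is flagged too, since a transcript with no words usually means a silent or broken source file.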

Cost Math for Common Project Sizes

Project	Hours	Best API	Estimated Cost
Research interviews	200 hours	AssemblyAI Universal-2 + diarization	~$34
Podcast back-catalog	1,000 hours	AssemblyAI Universal-2	~$150
Legal archive	500 hours	Deepgram Nova-3 + keyterms	~$270
Legacy archive migration	5,000 hours	AssemblyAI Universal-2	~$750

CATT's $9.99/month unlimited plan works well for recurring monthly volumes under about 100 hours. For one-time large migrations, metered API pricing is more flexible: you pay for what you run without a monthly commitment.

See unlimited vs metered transcription pricing for the breakeven analysis. If you are building a product that calls these APIs on behalf of users, cost-optimizing transcription API calls and caching transcription results cover the two highest-impact levers.

If you just need to transcribe files without building a pipeline (a one-off batch or a quick turnaround on a single project), ConvertAudioToText handles audio, video, and URL sources without setup at /tools/audio-to-text.

Post-Transcription Processing by Project Type

Once the transcripts exist, the downstream work is project-specific:

Research projects: Topic modeling on participant turns, thematic coding in MAXQDA or NVivo, quote extraction for citation databases. The JSON output from most APIs includes word-level timestamps, which makes timestamped quotes easy to reconstruct.

Media archives: Search index construction (Elasticsearch or Meilisearch work well), metadata enrichment, episode-level summarization via an LLM pass over each transcript.

Legal archives: Keyword indexing, privilege review workflow, citation extraction. Custom vocabulary pays back many times over here: case names and statutory references need to be exact.

Content production: Repurposing transcripts into articles, newsletters, and social posts. See how to transcribe interview recordings for the editorial workflow.
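For the timestamped quotes mentioned under research projects, a sketch of the reconstruction. It assumes each word object has a `text` field and a `start` offset in milliseconds; some providers report seconds instead, so confirm the unit before relying on it. `format_ts` and `quote_at` are illustrative helpers.

```python
def format_ts(ms):
    """Convert milliseconds to HH:MM:SS."""
    total = ms // 1000
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"

def quote_at(words, start_idx, end_idx):
    """Join words[start_idx:end_idx] and prefix the start time of the first word."""
    span = words[start_idx:end_idx]
    if not span:
        raise ValueError("empty word range")
    text = " ".join(w["text"] for w in span)
    return f"[{format_ts(span[0]['start'])}] {text}"

words = [{"text": "We", "start": 3723000}, {"text": "agreed.", "start": 3723400}]
print(quote_at(words, 0, 2))  # [01:02:03] We agreed.
```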

The Right Build Order

For any new batch project, this sequence saves the most time:

Define the naming convention and directory structure.
Build or populate the manifest CSV.
Write the submission script with manifest read/write.
Run a 10-file pilot. Review the output.
Fix any issues found in the pilot (vocabulary, language settings, format problems).
Run the full batch.
Sample-and-review 5 percent.
Retry the failure queue.
Hand off to downstream processing.

My take: the pilot step is the most important item on this list. Catching a workflow problem at 10 files is cheap. Catching it at file 800, after the run has already cost real money and time, is expensive. The 10 files also give you a real accuracy sample to show stakeholders before committing to the full job.

FAQ

How many files can I submit in parallel to transcription APIs?

AssemblyAI processes up to 200 jobs simultaneously by default. Deepgram's pre-recorded endpoint allows up to 50 concurrent requests on Pay As You Go and Growth plans. Both caps are at the project level. Contact sales to request higher limits for sustained large-volume work.

What is a normal failure rate on a first-pass batch?

Expect 3-5 percent of files to fail on the first pass. A retry pass with exponential backoff typically resolves 60-80 percent of those failures. Remaining failures usually have a root cause: corrupted source audio, an unsupported codec, or a truly silent file. The manifest lets you triage them without re-running the successful files.
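A retry pass with exponential backoff might look like the sketch below. `with_backoff` is a hypothetical helper; the delay schedule (1 s doubling up to a 60 s cap, with jitter) is an assumption to tune against your provider's rate limits.

```python
import random
import time

def with_backoff(fn, retries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (capped, jittered) and retry."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of attempts: let the caller mark the row failed
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay * random.uniform(0.5, 1.0))  # jitter spreads out retries
```

You would call it as `with_backoff(lambda: submit_job(url))`. This version retries on any exception for brevity; in practice you would likely retry only on 429 and 5xx responses, since a corrupted file fails the same way on every attempt.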

Should I use webhooks or polling for tracking batch job status?

For batches under a few hundred files, polling with a 15-30 second interval is simple and effective. For batches in the thousands, webhooks eliminate repeated HTTP overhead and reduce the chance of missing a completed job. The tradeoff: webhooks require a stable public endpoint, which adds infrastructure. See the webhook vs polling for transcripts post for the decision tree.

How do I handle a mixed-language archive?

Either pre-classify language per file before submission (a lightweight language-ID pass works), or use a multilingual model that handles all your languages on a single model tier. AssemblyAI Universal-2 covers 99 languages. Deepgram Nova-3 Multilingual covers 30-plus. Setting language explicitly on submission is always faster and more accurate than auto-detection at scale.

When does custom vocabulary matter enough to set up?

Any archive with domain-specific terminology (medical, legal, technical, or heavily branded) benefits from a vocabulary list. Both Deepgram (up to 100 keyterms per request) and AssemblyAI (up to 1,000 terms on Universal-3 Pro) support inline vocabulary hinting with no separate training step. Build the list once, use it across every file in the batch. The accuracy gain on known terms is significant.

Sources

Deepgram pricing, accessed July 2026: https://deepgram.com/pricing
Deepgram API rate limits, accessed July 2026: https://developers.deepgram.com/reference/api-rate-limits
Deepgram keyterm prompting docs, accessed July 2026: https://developers.deepgram.com/docs/keyterm
AssemblyAI pricing, accessed July 2026: https://www.assemblyai.com/pricing
AssemblyAI large-scale batch transcription guide, accessed July 2026: https://www.assemblyai.com/blog/large-scale-audio-transcription
Amazon Transcribe pricing, accessed July 2026: https://aws.amazon.com/transcribe/pricing/
OpenAI transcription pricing, accessed July 2026: https://developers.openai.com/api/docs/pricing
