Japanese Transcription with Kanji: A 2026 Guide

By Mamane B. Moussa | May 26, 2026 | Updated July 2, 2026 | 13 min read

TL;DR

Japanese transcription is harder than it looks because the writing system itself is the challenge. A correct transcript uses kanji, hiragana, and katakana in the right places, handles homophones through context, and preserves keigo verb forms that carry social meaning. Modern models like Whisper Large-v3, Deepgram Nova-3, and AssemblyAI Universal-3 Pro all support Japanese with proper mixed-script output. Kansai and other regional dialects reduce accuracy noticeably compared to Tokyo-standard speech.

Japanese audio transcribes into three writing systems simultaneously, and getting all three right is what separates a usable transcript from a merely decodable one. When a speaker says "kyou wa hareru deshou," the correct written form is "今日は晴れるでしょう," not "きょうははれるでしょう." Both represent the same sounds. Only one is how Japanese is actually written.

Japanese transcripts output mixed kanji and kana; translation is a separate step

This post covers how AI handles the script layer, the register layer, and the dialect layer for Japanese transcription in 2026, and what to look for when choosing a tool.

The Three-Script Problem

Japanese is written with three coexisting scripts:

Kanji are Chinese-origin characters used for content words, place names, and verbs (共有する, 東京, 行く).
Hiragana handles grammatical endings, particles, and function words (は, が, です, ている).
Katakana is used for loanwords and foreign proper nouns (マーケティング, アジェンダ, コーヒー).

A proper transcript uses all three in the right places. An all-hiragana transcript is the machine equivalent of writing English without any capital letters or punctuation, technically decodable but not production-ready.

Older or lighter speech-to-text systems sometimes output all-hiragana or all-katakana. This was common with early cloud APIs and still appears with small Whisper variants (tiny, base). Whisper Large-v3, Deepgram Nova-3, and AssemblyAI Universal-3 Pro all output standard mixed-script Japanese as of 2026, verified against their published language documentation.
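One way to catch all-kana output automatically is to measure how much of a transcript falls in each script. The sketch below is a minimal illustration, not any engine's built-in check: the function names and the 5% kanji threshold are assumptions, and it only looks at the basic Unicode blocks (it ignores rarer kanji extensions and the iteration mark 々).

```python
def script_of(ch: str) -> str:
    """Classify one character using the basic Unicode blocks only."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs
        return "kanji"
    return "other"


def script_profile(text: str) -> dict[str, float]:
    """Return the share of kanji, hiragana, and katakana among Japanese characters."""
    counts = {"kanji": 0, "hiragana": 0, "katakana": 0}
    for ch in text:
        script = script_of(ch)
        if script in counts:
            counts[script] += 1
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}


def looks_kana_only(text: str, min_kanji_share: float = 0.05) -> bool:
    """Flag transcripts whose kanji share is suspiciously low (threshold is an assumption)."""
    profile = script_profile(text)
    has_japanese = sum(profile.values()) > 0
    return has_japanese and profile["kanji"] < min_kanji_share


print(script_profile("今日は晴れるでしょう"))
print(looks_kana_only("きょうははれるでしょう"))
```

A very low kanji share on ordinary business or lecture audio is a strong hint that the wrong model size or language setting is in use.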

No Spaces and the Tokenization Layer

English words are separated by spaces. Japanese words are not. The sentence "今日は東京に行きます" has no word boundaries marked in the text at all. This creates a fundamental upstream challenge: before an AI can output the right kanji, it has to segment a continuous phoneme stream into the right morphemes and then assign each one the right written form.

Modern end-to-end models handle this implicitly, but it explains why Japanese ASR errors look different from English errors. An English mistake substitutes one word for another. A Japanese mistake often produces a plausible-looking character string that turns out to be a wrong kanji, a genuine word with a different meaning, or a combination that does not exist in the language.
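To make the segmentation problem concrete, here is a toy greedy longest-match segmenter. It is an illustration of the classic dictionary-based approach, not how end-to-end models work internally, and the six-entry lexicon is a hypothetical stand-in for a real morphological dictionary.

```python
# Hypothetical mini-lexicon; real systems use dictionaries with hundreds of thousands of entries.
LEXICON = {"今日", "は", "東京", "に", "行き", "ます"}


def segment(text: str, lexicon: set[str]) -> list[str]:
    """Greedy longest-match segmentation; unknown characters become single-character tokens."""
    max_len = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


print(segment("今日は東京に行きます", LEXICON))
```

Greedy matching fails as soon as two dictionary entries overlap ambiguously, which is one reason this step is hard and why errors at this layer produce wrong-but-plausible character strings.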

Homophone Disambiguation: The Hardest Subproblem

Japanese has a large number of homophones: words that sound identical but are written with different kanji and carry different meanings. The word "kikan" can be 期間 (period of time), 機関 (engine or institution), 器官 (organ), or 帰還 (return), among others. The speaker's meaning is obvious from context. The AI has to use that context to select the right kanji.

Large models resolve most common homophones correctly. Whisper Large-v3 fixes homophone errors that Whisper Small produces on the same input. The failure mode narrows to technical vocabulary, rare compound words, and proper nouns (especially Japanese family names, where the same spoken name can have several valid kanji spellings).

A practical fix: provide a glossary of domain-specific terms and proper names before transcribing. Most professional tools accept a vocabulary hint or prompt input that steers the model toward the correct kanji for recurring specialized terms.
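As a rough sketch of the glossary approach with the open-source openai-whisper package, the terms can be joined into a short prompt and passed through the `initial_prompt` argument of `transcribe`. The helper names and the 200-character cap are assumptions for illustration; the prompt biases the model toward these spellings but does not guarantee them.

```python
def build_glossary_prompt(terms: list[str], max_chars: int = 200) -> str:
    """Join unique glossary terms with the Japanese comma, stopping at a length cap."""
    unique_terms = []
    for term in terms:
        term = term.strip()
        if term and term not in unique_terms:
            unique_terms.append(term)

    prompt = ""
    for term in unique_terms:
        candidate = term if not prompt else prompt + "、" + term
        if len(candidate) > max_chars:  # cap is an arbitrary assumption
            break
        prompt = candidate
    return prompt


def transcribe_with_glossary(audio_path: str, terms: list[str]) -> str:
    """Transcribe Japanese audio with a glossary hint (needs openai-whisper installed)."""
    import whisper  # imported lazily so the prompt helper works without the package

    model = whisper.load_model("large-v3")
    result = model.transcribe(
        audio_path,
        language="ja",
        initial_prompt=build_glossary_prompt(terms),
    )
    return result["text"]


print(build_glossary_prompt(["渡邊", "機関", "渡邊"]))
```

Hosted tools expose the same idea under different names (custom vocabulary, keyterms, word boost), so check the specific vendor's documentation for the exact parameter.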

For a deeper look at why model size matters for languages like Japanese, see the discussion of transcription accuracy factors.

On'yomi, Kun'yomi, and Polyphony

Most kanji have at least two reading systems: on'yomi (the Chinese-derived pronunciation, typically used in compound words) and kun'yomi (the native Japanese pronunciation, typically used when the kanji appears alone or with hiragana endings). The kanji 水 reads "sui" in 水曜日 (Wednesday) and "mizu" when it stands alone meaning water. Context usually determines the reading unambiguously, but for the most ambiguous characters (researchers count several hundred with genuinely context-dependent readings), even large models occasionally select the wrong one.

This is a different problem from homophones. With homophones, one sound maps to several written forms. Here, one character maps to several sounds, so a correct transcript depends on the model having learned that the reading it heard belongs to that character in that context. These errors are comparatively rare in contemporary large models but remain more frequent for low-resource kanji compounds and personal names, where training data is thinner.

Keigo: The Register Layer

Japanese has three honorific registers used in business and formal contexts:

Sonkeigo raises the subject of an action (the other person does something important). "Irasshaimasu" instead of "kimasu" for "come."
Kenjougo lowers the speaker (the speaker does something humbly for the other person). "Itashimasu" instead of "shimasu" for "do."
Teineigo is the default polite form used in most business speech. "Desu" and "masu" endings.

A Japanese business meeting cycles through all three depending on speaker hierarchy and the action being described. This makes Japanese formal audio more varied in vocabulary than its English equivalent, even when the topic is the same.

AI transcription preserves keigo by preserving exactly what the speaker said. The model does not flatten "irasshaimasu" to "kimasu." The risk is the model misidentifying the verb form in noisy audio or unfamiliar contexts, not substituting a less formal equivalent. Standard polite speech (teineigo) is handled reliably by all three major engines. Complex sonkeigo constructions in highly formal ceremonial speech, academic ritual language, or certain regional dialects may surface occasional errors and benefit from a light editorial review.

For Japanese businesses, this matters because gijiroku (議事録), the formal meeting minutes expected in Japanese corporate culture, are legal-adjacent documents. A transcript that flattens the keigo layer loses information that Japanese readers extract from word choice to understand who was being deferential to whom.
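If you want to route only the honorific-heavy segments to a human reviewer, a simple marker scan is one possible starting point. This is a heuristic sketch: the marker lists are short, hand-picked assumptions and will miss many keigo constructions, so treat the output as a triage signal rather than a classifier.

```python
# Small, non-exhaustive marker lists chosen for illustration only.
SONKEIGO_MARKERS = ("いらっしゃ", "おっしゃ", "召し上が", "ご覧にな")
KENJOUGO_MARKERS = ("いたし", "申し上げ", "伺", "参り", "拝見")


def keigo_flags(segment: str) -> dict[str, list[str]]:
    """Return which honorific or humble markers appear in a transcript segment."""
    return {
        "sonkeigo": [m for m in SONKEIGO_MARKERS if m in segment],
        "kenjougo": [m for m in KENJOUGO_MARKERS if m in segment],
    }


def needs_review(segment: str) -> bool:
    """Flag a segment for editorial review if any marker is present."""
    flags = keigo_flags(segment)
    return bool(flags["sonkeigo"] or flags["kenjougo"])


for line in ["社長がいらっしゃいます。", "資料を共有いたします。", "今日は晴れです。"]:
    print(line, needs_review(line))
```

Plain teineigo lines pass through unflagged, which matches the point above that standard polite speech rarely needs correction.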

Katakana Loanwords and Mixed Latin Inline

Japanese business and technology audio constantly mixes native Japanese with loanwords and acronyms. The correct written form for these follows predictable rules:

English origin words become katakana: "marketing" becomes マーケティング, "agenda" becomes アジェンダ.
Inline Latin acronyms stay in Latin script: OKR, KPI, SaaS, PR.
Some terms float between systems (マーケター vs marketing manager, both appear in practice).

A correctly functioning transcription engine for Japanese business audio produces katakana for recognized loanwords and preserves Latin script for acronyms. Engines that transliterate everything into kana or that convert loanwords back to English produce non-standard output.

My take: the katakana handling is one of the clearest quick tests for a Japanese transcription tool. Upload 30 seconds of a tech startup meeting and check whether "platform" comes back as プラットフォーム or something non-standard. Most large-model tools pass this test. Smaller or older APIs do not.
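The quick test described above can be scripted once you know which loanwords and acronyms were spoken in the sample. The following is a minimal sketch with hypothetical function and key names; it checks for exact katakana spellings and for acronyms written in half-width Latin letters, so full-width Latin output (ＫＰＩ) would be reported as missing.

```python
import re


def loanword_report(
    transcript: str,
    katakana_terms: list[str],
    latin_acronyms: list[str],
) -> dict[str, list[str]]:
    """List expected katakana loanwords and Latin acronyms that the transcript lacks."""
    latin_runs = set(re.findall(r"[A-Za-z]+", transcript))
    return {
        "missing_katakana": [t for t in katakana_terms if t not in transcript],
        "missing_acronyms": [a for a in latin_acronyms if a not in latin_runs],
    }


good = "新しいプラットフォームのKPIを共有します。"
bad = "新しいぷらっとふぉーむのケーピーアイを共有します。"

print(loanword_report(good, ["プラットフォーム"], ["KPI"]))
print(loanword_report(bad, ["プラットフォーム"], ["KPI"]))
```

An empty report on a 30-second sample is a reasonable pass signal; any missing entry is worth a manual look before committing to the tool.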

Punctuation: Full-Width or Wrong

Japanese uses different punctuation characters than Western text:

、 (toten) is the comma.
。 (kuten) is the full stop.
「」 are quotation marks for speech.
『』 are used for nested quotations or titles.

A transcript that returns "今日は晴れです, 散歩しました." with Western commas and periods is incorrect by Japanese typographic standards. In the original vertical-writing tradition (tategaki, where text runs top-to-bottom in columns), these characters also rotate and position differently from horizontal text, but most digital transcripts are output in horizontal format (yokogaki).

Well-supported engines produce proper full-width Japanese punctuation automatically when the language is set to Japanese. If your transcript output uses Western punctuation, check your language selection setting before assuming a model limitation.
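When changing the language setting is not an option, Western punctuation can be repaired in post-processing. The sketch below is one possible approach under a simplifying assumption: it only converts a comma or period that directly follows a Japanese character, so decimals and version strings such as "v3.0" are left alone. The names are hypothetical.

```python
import re

# Hiragana, katakana, and the basic kanji block.
JP_CHAR = r"[\u3040-\u30FF\u4E00-\u9FFF]"
WESTERN_AFTER_JP = re.compile(rf"({JP_CHAR})([,.])\s*")
FULL_WIDTH = {",": "、", ".": "。"}


def has_western_punctuation(text: str) -> bool:
    """True if a Western comma or period directly follows a Japanese character."""
    return WESTERN_AFTER_JP.search(text) is not None


def normalize_punctuation(text: str) -> str:
    """Replace Western commas and periods after Japanese text with toten and kuten."""
    return WESTERN_AFTER_JP.sub(
        lambda m: m.group(1) + FULL_WIDTH[m.group(2)],
        text,
    )


print(normalize_punctuation("今日は晴れです, 散歩しました."))
```

The replacement also drops the space that Western typing habits leave after a comma, since Japanese text does not use one.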

Dialect Accuracy

Japanese has notable regional dialect variation. The engines are trained primarily on standard Tokyo Japanese (the dialect used in NHK broadcasts and most formal media).

Standard Tokyo Japanese and NHK-style formal speech transcribe most accurately across all major engines. Kansai dialect (Osaka, Kyoto, Kobe) has distinct pitch-accent patterns and vocabulary that differ from Tokyo standard, and accuracy drops relative to standard Japanese. Tohoku and Kyushu dialects present further challenges. None of the major international engines currently offer dialect-specific models for Japanese regional varieties.

For content in strong Kansai-ben, plan for more editing time than you would for a Tokyo-standard lecture or business meeting. The core kanji and grammar structures will generally be right; the errors cluster around dialect-specific vocabulary and the pitch-accent-derived phoneme boundaries where the AI's training priors are less dense.
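To estimate how much extra editing a dialect will need, you can score an engine on a few minutes of your own audio against a hand-corrected reference. Character error rate (CER) is a common choice for Japanese because the text has no word boundaries; it is not a metric this article reports, and the sketch below is a plain Levenshtein implementation rather than any vendor's scoring method.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(cer("ほんまにおおきに", "ほんまに大きに"))
```

Running the same comparison on a Tokyo-standard clip and a Kansai-ben clip from similar recording conditions gives a rough, like-for-like sense of the accuracy gap for your material.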

Which Engines Actually Support Japanese Well (2026)

Three engines have documented Japanese support with proper mixed-script output as of mid-2026:

Whisper Large-v3 (OpenAI): The most widely deployed model for Japanese transcription in open-source and hosted contexts. Outputs correct mixed-script Japanese. Homophone errors become more frequent with the smaller model variants (base, small, medium). Pass a language hint in the API call to skip the slower auto-detection step. See Whisper pricing and model comparison for cost context.

Deepgram Nova-3: Japanese is included in the Nova-3 multilingual model. Deepgram's published documentation confirms Nova-3 handles mixed kana, kanji, and loanword pronunciation, and tracks syllabic rhythm for Japanese specifically. Available via the nova-3 general endpoint with language=ja.

AssemblyAI Universal-3 Pro: Supports Japanese in async transcription with speaker diarization verified (diarization for Japanese was added alongside Chinese, Hindi, Korean, and Vietnamese). The Universal-3 Pro model covers 99 languages. Real-time streaming support for Japanese is more limited than async as of mid-2026. See best speech-to-text APIs for a fuller engine comparison.

All three handle katakana loanwords, Japanese punctuation, and keigo verb forms correctly. The differentiation comes in homophone accuracy at the tail end of distribution, dialect robustness, and pricing model.
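For reference, setting the language explicitly on a hosted API usually comes down to one query parameter. The sketch below builds (but does not send) a request for Deepgram's pre-recorded endpoint with `model=nova-3` and `language=ja`, using only the standard library. The helper name is hypothetical, and the endpoint, parameters, and auth header should be confirmed against Deepgram's current documentation before use.

```python
from urllib.parse import urlencode
from urllib.request import Request

DEEPGRAM_LISTEN_URL = "https://api.deepgram.com/v1/listen"


def build_deepgram_request(
    audio_bytes: bytes,
    api_key: str,
    content_type: str = "audio/wav",
) -> Request:
    """Build a POST request for Japanese transcription; the caller decides when to send it."""
    query = urlencode({"model": "nova-3", "language": "ja"})
    return Request(
        f"{DEEPGRAM_LISTEN_URL}?{query}",
        data=audio_bytes,
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": content_type,
        },
        method="POST",
    )


request = build_deepgram_request(b"", api_key="YOUR_API_KEY")
print(request.full_url)
```

Sending it is a matter of passing the object to `urllib.request.urlopen` and parsing the JSON response; the other engines follow the same pattern with their own parameter names.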

For speaker-labeled Japanese audio (meetings, interviews), see speaker diarization explained for how the underlying models approach Japanese turn-taking structure.

Comparing Tools for Japanese Transcription

Japanese formal conversation is more structured than English, with clearer speaker turns in formal settings. This generally helps diarization quality. Below is a comparison of tools that have documented Japanese support, with pricing current as of 2026-07.

Feature	Otter.ai	Notta	Trint	ConvertAudioToText (CATT)
Japanese transcription	Yes (added late 2025)	Yes	Not documented	Yes
Japanese AI summary	Limited (newer feature)	Yes (listed language)	No	Yes
Free tier	300 min/mo	120 min/mo (5 min/file cap)	None (7-day trial, 3 files)	10 min/mo
Paid entry price	$16.99/mo (or $8.33/mo annual)	$13.99/mo (or $8.17/mo annual)	~$80/seat/mo (annual)	$9.99/mo annual
Diarization in Japanese	Yes	Yes	Yes	Yes

Notta is the clearest documented competitor for Japanese-language AI summaries and has Japanese as a primary target market. Otter expanded to Japanese in late 2025 and has native-language chat in Japanese. Trint is a strong journalism tool for English but lacks documented Japanese summary output.

If you just need a clean transcript and structured output without committing to a full meeting-bot ecosystem, ConvertAudioToText processes Japanese audio files directly without installing any browser extension.

Speaker Diarization for Japanese

Japanese formal conversation is turn-structured. Business meetings with keigo follow clear protocol: speakers wait for turn completion, especially when addressing superiors. This predictable turn structure generally helps diarization accuracy for formal Japanese audio compared to highly overlapping casual conversation.

Two-speaker formal interviews and one-on-one business calls tend to diarize most accurately. Larger group meetings (four or more speakers) with overlapping voices or multiple people at similar vocal register present the same challenges as in any language.

Per-microphone recording (separate audio tracks per speaker) improves diarization substantially for all engines. Most professional Japanese studio recordings and many business conference tools offer per-track output.
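With per-track recordings, speaker labels come for free: transcribe each track separately, then interleave the segments by start time. The following is a minimal sketch assuming each track has already been transcribed into timestamped segments; the `Segment` type and function names are hypothetical, and the speaker names and timings are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the start of the recording
    end: float
    text: str


def merge_tracks(tracks: dict[str, list[Segment]]) -> list[tuple[str, Segment]]:
    """Interleave per-speaker segments into one timeline ordered by start time."""
    labeled = [
        (speaker, segment)
        for speaker, segments in tracks.items()
        for segment in segments
    ]
    return sorted(labeled, key=lambda item: (item[1].start, item[0]))


def render(merged: list[tuple[str, Segment]]) -> str:
    """Format the merged timeline as one speaker-labeled line per segment."""
    return "\n".join(
        f"[{segment.start:.1f}s] {speaker}: {segment.text}"
        for speaker, segment in merged
    )


tracks = {
    "田中": [
        Segment(0.0, 2.1, "本日の議題を共有いたします。"),
        Segment(5.0, 6.0, "ありがとうございます。"),
    ],
    "佐藤": [Segment(2.5, 4.8, "よろしくお願いします。")],
}
print(render(merge_tracks(tracks)))
```

This only works if all tracks share a common clock, which is normally the case when they come from the same recording session.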

Tips for Better Japanese Transcription Output

Set the language explicitly to ja (Japanese). Auto-detection works but adds latency and occasionally misidentifies Japanese with heavy English loanwords as partially English.
Provide a glossary of proper nouns. Many Japanese names have several valid kanji spellings for the same reading, and models guess from frequency. "Watanabe" can be 渡辺, 渡邊, or 渡部. Providing the correct form prevents the same kanji substitution from recurring every time the name is spoken.
Record in a quiet environment. Japanese voiceless fricatives and affricates (し, ち, つ) degrade in background noise more than voiced consonants do, so low-noise recordings improve kanji selection where similar-sounding words compete.
For podcast transcription specifically, output language matters for downstream use. Japanese podcast show notes and SEO perform better in native Japanese than in an English summary that requires manual translation.
For gijiroku (meeting minutes), plan for a light editorial pass on any sonkeigo-heavy segments, as the most formal honorific constructions still surface occasional model errors in multi-speaker overlap.

Frequently Asked Questions

Which AI engines output proper kanji in Japanese transcripts?

Whisper Large-v3, Deepgram Nova-3, and AssemblyAI Universal-3 Pro all produce mixed-script output with kanji, hiragana, and katakana. Older or smaller models sometimes fall back to all-hiragana output, which is technically readable but not how Japanese is actually written. If your tool returns phonetic-only output, switch to one of these three.

How does AI decide which kanji to use when multiple kanji share the same sound?

Context is the primary signal. The word "hashi" can mean bridge (橋), chopsticks (箸), or edge (端) depending on the surrounding words. Large models trained on enough Japanese text learn these collocations well. Smaller models make more homophone errors, especially on low-frequency kanji compounds. Providing a vocabulary glossary of names and technical terms reduces errors on terms the model may not have seen often.

Does AI transcription preserve keigo honorific forms?

AI transcription preserves what the speaker says, so sonkeigo and kenjougo verb forms appear in the transcript exactly as spoken. The risk is misidentification of the verb form itself, not substitution of a plain-form equivalent. Standard polite form (teineigo) is handled reliably. Complex honorific constructions in formal business or ceremonial speech may occasionally surface errors, so keigo-heavy content benefits from a light editorial pass.

How accurate is Japanese transcription for regional dialects like Kansai-ben?

Standard Tokyo Japanese (the register used in NHK broadcasts and most business settings) transcribes most accurately. Kansai dialect, with its distinct pitch-accent pattern and vocabulary, reduces accuracy compared to standard Japanese on most engines. Tohoku and Kyushu dialects present further challenges. No current engine has a Kansai-specific model. For dialect-heavy content, plan for more editing time than you would for standard Japanese.

What punctuation does a correct Japanese transcript use?

A correct Japanese transcript uses 、 (toten) as the comma, 。 (kuten) as the full stop, and 「」 for direct quotations. Western commas and periods in Japanese text are non-standard. If your tool returns "こんにちは, 今日は晴れです." with a Western comma and period, the punctuation is wrong. Well-supported engines produce the correct full-width Japanese punctuation automatically.

Can I get AI summaries in Japanese rather than English?

Not all tools produce native Japanese summaries. Notta supports Japanese-language summaries as a documented feature and has Japanese as one of its primary target markets. Otter expanded to Japanese in late 2025 but its summary depth in Japanese is newer than its core English offering. Trint does not produce Japanese summaries. If Japanese-language output for summaries and action items is a requirement, check vendor documentation before committing to a plan.

Sources

Otter.ai Japanese expansion announcement (Business Wire, October 2025): https://www.businesswire.com/news/home/20251022574971/en/Konnichiwa-Otter.ai-Expands-to-Japanese-Market-with-New-Japanese-Language-Support-for-AI-Meeting-Agent
Otter.ai pricing page: https://otter.ai/pricing
Notta pricing page: https://www.notta.ai/en/pricing
Notta language support documentation: https://support.notta.ai/hc/en-us/articles/4403155631131-What-languages-does-Notta-support
Trint pricing (Capterra): https://www.capterra.com/p/179896/Trint/pricing/
Deepgram models and languages overview: https://developers.deepgram.com/docs/models-languages-overview
Deepgram Nova-3 multilingual announcement: https://deepgram.com/learn/nova-3-multilingual-major-wer-improvements-across-languages
AssemblyAI supported languages: https://www.assemblyai.com/docs/supported-languages
AssemblyAI diarization in Japanese (Newsletter 37): https://www.assemblyai.com/blog/assemblyai-newsletter-37
OpenAI Whisper GitHub Japanese kanji discussion: https://github.com/openai/whisper/discussions/204
Kanji Dictionary Publishing Society on Japanese homophones: https://kanji.org/japanese/writing/japhom.htm
Gijiroku corporate culture: https://kimi.wiki/work/gijiroku
ConvertAudioToText pricing page: https://convertaudiototext.com/pricing
