§P-001 · Founder · Technical Lead

Metaphor — Japanese Conversation Tutor

2025 — Present

Stack: Flutter · Supabase · Claude (Sonnet + Haiku) · Whisper · ElevenLabs · FSRS-6

Status: Live · Active development

Role: Founder · Technical Lead

Demo

↗ Live · Demo

Adult Japanese learners hit the same wall. Vocab apps make you feel productive without making you conversational. Raw LLM chat is fluent but forgets you between sessions and has no pedagogical structure. Human tutors work, but cost $50/hr and need scheduling. Metaphor is the fourth option: a Japanese conversation tutor that gives you scenario-driven speaking practice on demand, remembers what you know across sessions, and corrects you in the moment.

You pick a scenario — cafe order, asking directions, self-introduction — and speak (or type) Japanese to AI characters who play through it with you. In-line error correction, vocabulary noticing, and grammar cards land mid-conversation. A per-element learner profile is rebuilt turn by turn behind the scenes, so the next session picks up exactly where the last one left off.

The architectural bet is that pedagogical decisions belong in a separate "Agenda" LLM call, not inside the conversation prompt. Each turn, a Haiku model assembles a single Markdown brief from the learner's profile snapshot, the active scenario, the character, and the latest analysis — and the Conversation LLM reads only that brief. Per-turn Agenda revision runs async with the character generation so it adds no user-facing latency.

Profile state spans Postgres element tables across vocabulary, grammar, kanji, kana, patterns, and discourse/pragmatic skills, all written from per-turn analysis blobs at session-end flush. FSRS-6 drives review scheduling on top; an EMA accuracy per element runs in parallel for the "skill" surface. Sonnet handles user-facing dialogue; Haiku 4.5 handles every synthesis call behind the scenes — error detection, response analysis, Agenda generation, post-session reflection — to keep per-turn cost realistic.
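The two proficiency signals can be sketched side by side. This is a minimal illustration, not the project's code: the smoothing factor `EMA_ALPHA`, the `decay` default, and both function names are assumptions. The retrievability curve follows the FSRS-6 power-law form, where the decay is a trainable parameter and the curve is normalized so that retrievability is 0.9 when the elapsed time equals the stability.

```python
# Hypothetical sketch of the dual per-element proficiency signals.
# EMA_ALPHA and the decay default are illustrative assumptions.

EMA_ALPHA = 0.3  # assumed smoothing factor; higher reacts faster to recent turns


def update_ema(prev: float, correct: bool, alpha: float = EMA_ALPHA) -> float:
    """Blend the latest outcome (1.0 or 0.0) into the running accuracy."""
    outcome = 1.0 if correct else 0.0
    return alpha * outcome + (1.0 - alpha) * prev


def retrievability(elapsed_days: float, stability_days: float,
                   decay: float = 0.1542) -> float:
    """FSRS-6 style forgetting curve: R = (1 + factor * t / S) ** -decay.

    factor is chosen so that R == 0.9 exactly when t == S.
    """
    factor = 0.9 ** (-1.0 / decay) - 1.0
    return (1.0 + factor * elapsed_days / stability_days) ** (-decay)
```

The EMA answers "how accurate has this learner been with this element lately", while retrievability answers "how likely is recall right now", which is why the two can disagree for an element that was used correctly but long ago.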

Highlights

01 Agenda-LLM architecture: a single Markdown brief is the only context the Conversation LLM receives
02 Two-model split: Sonnet for user-facing dialogue, Haiku 4.5 for every synthesis / analysis call behind the scenes
03 Dual per-element proficiency: EMA accuracy for the skill feel, FSRS-6 for retrievability and review scheduling
04 Closed-enum controlled vocabularies enforced prompt-side AND in the parser, plus a Japanese-codepoint validator on lemmas
05 Dual-provider STT chain (Whisper → ElevenLabs scribe_v2) with codepoint validation so neither provider can silently translate out of language
06 RLS-per-user on every user-state table; SECURITY DEFINER trigger bootstraps the profile row on signup, with no service-role in the live app
07 Validated against three persona expectation suites via an auto-converse harness

Architecture

A single learner turn fans out into parallel work. The mic captures PCM 16-bit @ 16 kHz mono; transcription tries Whisper first then falls back to ElevenLabs scribe_v2, validated against Japanese codepoints either way. Once the message is in, two LLM calls fire in parallel — Haiku checks for errors (max_tokens 500) while Sonnet generates the character response (max_tokens 512). The error correction UI animates during the Sonnet wait.
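The transcription fallback at the start of this turn can be sketched as a chain of interchangeable providers with a shared validator. This is a simplified illustration under stated assumptions: the provider callables are stand-ins for the real Whisper and ElevenLabs clients, and `looks_japanese`, `transcribe`, and the `min_ratio` threshold are hypothetical names and values.

```python
import re
from typing import Callable, Optional, Sequence

# Hiragana, katakana, and CJK unified ideographs.
_JAPANESE_CHAR = re.compile(r"[\u3040-\u309f\u30a0-\u30ff\u4e00-\u9fff]")

# A provider takes raw audio bytes and returns a transcript (stand-in type).
Transcriber = Callable[[bytes], str]


def looks_japanese(text: str, min_ratio: float = 0.5) -> bool:
    """True if enough non-space characters fall in Japanese codepoint ranges.

    min_ratio is an assumed threshold, not a value from the project.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    hits = sum(1 for c in chars if _JAPANESE_CHAR.match(c))
    return hits / len(chars) >= min_ratio


def transcribe(audio: bytes, providers: Sequence[Transcriber]) -> Optional[str]:
    """Try each provider in order; accept the first transcript that validates."""
    for provider in providers:
        try:
            text = provider(audio)
        except Exception:
            continue  # provider error: fall through to the next one
        if looks_japanese(text):
            return text
    return None  # nothing validated; the caller decides how to recover
```

Passing `[whisper_client, scribe_client]` (both hypothetical wrappers) reproduces the Whisper-first ordering described above.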

After the character speaks, Haiku runs response analysis (max_tokens 1024–2048) and an async Agenda revision (max_tokens 6000). All session-state writes are batched into a single flush at session end — that's where FSRS-6 review is applied, the element tables get upserted, and tracked-error transitions get promoted.
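The turn fan-out described in the two paragraphs above can be sketched with `asyncio`. Everything here is an assumption made for illustration: the four LLM calls are injected as plain async callables, `Session`, `run_turn`, and `end_session` are hypothetical names, and deferring the analysis to a background task (rather than awaiting it inline) is a guess at the intent, not a documented detail.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, List, Tuple

LLMCall = Callable[[str], Awaitable[str]]  # stand-in for a model call


@dataclass
class Session:
    brief: str                                          # current Agenda brief
    pending_writes: List[str] = field(default_factory=list)
    background: list = field(default_factory=list)      # in-flight tasks


async def run_turn(session: Session, message: str, check_errors: LLMCall,
                   character: LLMCall, analyze: LLMCall,
                   revise_agenda: LLMCall) -> Tuple[str, str]:
    # Error check and character reply run concurrently.
    errors, reply = await asyncio.gather(
        check_errors(message),
        character(session.brief + "\n\n" + message),
    )

    async def post_turn() -> None:
        # Analysis is buffered for the session-end flush, not written now.
        analysis = await analyze(message + "\n" + reply)
        session.pending_writes.append(analysis)
        session.brief = await revise_agenda(analysis)

    # Scheduled in the background so the reply is not held up.
    session.background.append(asyncio.create_task(post_turn()))
    return errors, reply


async def end_session(session: Session,
                      flush: Callable[[List[str]], Awaitable[None]]) -> None:
    await asyncio.gather(*session.background)   # let pending work finish
    writes = list(session.pending_writes)
    session.pending_writes.clear()
    await flush(writes)                          # one batched write
```

Keeping writes in `pending_writes` until `end_session` mirrors the single-flush design: a turn never blocks on the database.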

The whole stack is a Flutter client talking directly to Supabase Postgres, Anthropic, OpenAI (Whisper), and ElevenLabs. No custom backend service.

Engineering decisions

01
Agenda LLM as the sole input to the Conversation LLM
The Conversation LLM doesn't see the profile snapshot, scenario data, or analysis blobs directly. It sees one Markdown brief, regenerated each turn by Haiku. Two payoffs: debugging collapses to reading one file, and the conversation model's job becomes "execute this brief" instead of synthesizing context every turn. Revision runs async with the character call so it never adds user-facing latency.
02
Sonnet for user-facing dialogue, Haiku 4.5 for everything else
Error detection, response analysis, Agenda generation, post-session reflection — all Haiku. Only the character response runs on Sonnet. Per-turn cost stays realistic, and validation against the persona suites confirmed Haiku is sufficient for the structured-output tasks once the prompts are tight.
03
Dual proficiency tracking: EMA + FSRS-6
Per-element EMA accuracy gives the "skill" feel learners expect. FSRS-6 stability/difficulty gives retrievability for scheduling. Both run side by side per element and both surface in the profile snapshot.
04
Controlled vocabularies enforced prompt-side AND code-side
Grammar concepts, function ids, discourse skills, pragmatic skills, and tracked-error pattern ids are all closed enums. The LLM is told the list in its prompt, and the parser filters anything outside the set before it touches Postgres. Without this dual gate the element tables get polluted with invented concept names and English descriptive ids the moment output lengths grow.
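The code-side half of that dual gate can be sketched as a filter over the parsed analysis. The enum members, field names, and `filter_analysis` are hypothetical placeholders; only the technique (static set lookup plus a Japanese-codepoint scan on lemmas) comes from the text.

```python
import re

# Hypothetical closed enum; the real list would mirror what the prompt states.
GRAMMAR_CONCEPTS = frozenset({"te_form", "particle_wo", "polite_masu"})

# Hiragana and katakana (U+3040-U+30FF) plus CJK unified ideographs.
_JAPANESE_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def filter_analysis(raw: dict) -> dict:
    """Keep only in-set concept ids and lemmas that contain Japanese text.

    Out-of-set values are dropped so they never reach the database.
    """
    concepts = [
        c for c in raw.get("grammar_concepts", [])
        if isinstance(c, str) and c in GRAMMAR_CONCEPTS
    ]
    lemmas = [
        lemma for lemma in raw.get("lemmas", [])
        if isinstance(lemma, str) and _JAPANESE_CHAR.search(lemma)
    ]
    return {"grammar_concepts": concepts, "lemmas": lemmas}
```

For example, an invented id such as `"polite request softener"` and an English lemma such as `"to eat"` are both discarded, while `"te_form"` and `"食べる"` pass through.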

Notable challenges

01
Controlled-vocabulary hallucination at long output lengths
Once the analysis prompt grew to ~3000-token output, the LLM started inventing concept names and emitting English descriptive ids for Japanese lexical items. Fix: dual filter — closed enums in the prompt, static Set lookups in the parser, plus a codepoint check that scans for hiragana/katakana/CJK. Out-of-set values are silently dropped before reaching Postgres.
02
Japanese STT has two failure modes, not one
Whisper with language='ja' will sometimes translate spoken English into Japanese instead of returning the English. ElevenLabs scribe_v2 with language_code='ja' will sometimes return English for ambiguous audio. Fix: dual-provider chain that validates output against Japanese codepoints and falls through to the other provider with auto-detect if validation fails.
03
max_tokens silently truncating JSON mid-response
The analysis call started truncating mid-JSON once the prompt picked up pattern-detection guidance, and a naive brace matcher returned null. Fix: surface truncation in the parse path, and use a hand-rolled JSON extractor that tolerates code fences, prose preambles, and trailing junk by tracking brace depth and string state.
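An extractor of that kind can be sketched in a few lines. This is an illustrative version, not the project's parser: `extract_json` and its `(value, truncated)` return shape are assumptions. It scans from the first `{`, tracks brace depth while ignoring braces inside strings, and reports truncation when the text ends with the object still open.

```python
import json
from typing import Any, Optional, Tuple


def extract_json(text: str) -> Tuple[Optional[Any], bool]:
    """Return (parsed_value, truncated) for the first JSON object in text."""
    start = text.find("{")
    if start == -1:
        return None, False          # no object at all

    depth = 0
    in_string = False
    escaped = False
    for i in range(start, len(text)):
        ch = text[i]
        if in_string:
            if escaped:
                escaped = False     # this character was escaped; skip it
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:          # matching close of the outer object
                try:
                    return json.loads(text[start:i + 1]), False
                except json.JSONDecodeError:
                    return None, False
    # Ran out of text with the object still open: likely cut by max_tokens.
    return None, True
```

The `truncated` flag lets the caller distinguish "the model produced no JSON" from "the output was cut off". The Anthropic Messages API also reports a `stop_reason` of `max_tokens` in that case, which can serve as a second signal; whether the project checks it is not stated.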
