§P-001 · Founder · Technical Lead

Metaphor — Japanese Conversation Tutor

2025 — Present

Stack: Flutter · Supabase · Claude (Sonnet + Haiku) · Whisper · ElevenLabs · FSRS-6
Status: Live · Active development
Role: Founder · Technical Lead

Adult Japanese learners hit the same wall. Vocab apps make you feel productive without making you conversational. Raw LLM chat is fluent but forgets you between sessions and has no pedagogical structure. Human tutors work, but cost $50/hr and need scheduling. Metaphor is the fourth option: a Japanese conversation tutor that gives you scenario-driven speaking practice on demand, remembers what you know across sessions, and corrects you in the moment.

You pick a scenario — cafe order, asking directions, self-introduction — and speak (or type) Japanese to AI characters who play through it with you. In-line error correction, vocabulary noticing, and grammar cards land mid-conversation. A per-element learner profile is rebuilt turn by turn behind the scenes, so the next session picks up exactly where the last one left off.

The architectural bet is that pedagogical decisions belong in a separate "Agenda" LLM call, not inside the conversation prompt. Each turn, a Haiku model assembles a single Markdown brief from the learner's profile snapshot, the active scenario, the character, and the latest analysis — and the Conversation LLM reads only that brief. Per-turn Agenda revision runs async with the character generation so it adds no user-facing latency.
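
A minimal sketch of what assembling such a brief might look like, written in Python for brevity rather than the Flutter client's own language. In the system described above a Haiku call writes the brief; plain string assembly stands in for that call here. The `AgendaInputs` fields, the section headings, and the `weak_threshold` cutoff are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgendaInputs:
    """Hypothetical container for the four inputs named in the text."""
    profile_snapshot: dict[str, float]  # element id -> accuracy (assumed shape)
    scenario: str
    character: str
    latest_analysis: list[str] = field(default_factory=list)


def build_agenda_brief(inputs: AgendaInputs, weak_threshold: float = 0.6) -> str:
    """Render one Markdown brief; the conversation prompt would contain only this."""
    weak = sorted(
        element for element, accuracy in inputs.profile_snapshot.items()
        if accuracy < weak_threshold
    )
    lines = [
        "# Agenda",
        f"## Scenario\n{inputs.scenario}",
        f"## Character\n{inputs.character}",
        "## Elements to target",
        *(f"- {element}" for element in weak),
        "## Notes from last turn",
        *(f"- {note}" for note in inputs.latest_analysis),
    ]
    return "\n".join(lines)
```

Because the brief is an ordinary string, it can be logged per turn, which is what makes "debugging by reading one file" possible.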

Profile state spans Postgres element tables across vocabulary, grammar, kanji, kana, patterns, and discourse/pragmatic skills, all written from per-turn analysis blobs at session-end flush. FSRS-6 drives review scheduling on top; an EMA accuracy per element runs in parallel for the "skill" surface. Sonnet handles user-facing dialogue; Haiku 4.5 handles every synthesis call behind the scenes — error detection, response analysis, Agenda generation, post-session reflection — to keep per-turn cost realistic.
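
The two per-element signals can be sketched as follows. This is an illustration under stated assumptions, not the app's scheduler: the smoothing factor `alpha` is invented, and the `decay` default is only a placeholder for a weight that FSRS-6 treats as trainable. The forgetting curve is the FSRS-style power law, scaled so that retrievability is 0.9 when elapsed time equals stability.

```python
def update_ema(previous: float, correct: bool, alpha: float = 0.2) -> float:
    """Exponential moving average of accuracy; alpha is an assumed smoothing factor."""
    return (1.0 - alpha) * previous + alpha * (1.0 if correct else 0.0)


def retrievability(elapsed_days: float, stability: float, decay: float = 0.1542) -> float:
    """FSRS-style power-law forgetting curve.

    The factor is chosen so the result is 0.9 when elapsed_days == stability.
    The decay default is illustrative; FSRS-6 fits it per user.
    """
    factor = 0.9 ** (-1.0 / decay) - 1.0
    return (1.0 + factor * elapsed_days / stability) ** (-decay)
```

The EMA reacts to every attempt and ignores elapsed time; retrievability ignores individual attempts and depends only on time since review and stability. That difference is why the two are tracked side by side.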

Highlights

  • 01 Agenda-LLM architecture: a single Markdown brief is the only context the Conversation LLM receives
  • 02 Two-model split: Sonnet for user-facing dialogue, Haiku 4.5 for every synthesis / analysis call behind the scenes
  • 03 Dual per-element proficiency: EMA accuracy for the skill feel, FSRS-6 for retrievability and review scheduling
  • 04 Closed-enum controlled vocabularies enforced prompt-side AND in the parser, plus a Japanese-codepoint validator on lemmas
  • 05 Dual-provider STT chain (Whisper → ElevenLabs scribe_v2) with codepoint validation so neither provider can silently translate out of language
  • 06 RLS-per-user on every user-state table; a SECURITY DEFINER trigger bootstraps the profile row on signup, with no service-role in the live app
  • 07 Validated against three persona expectation suites via an auto-converse harness

Architecture

A single learner turn fans out into parallel work. The mic captures 16-bit PCM at 16 kHz mono; transcription tries Whisper first, then falls back to ElevenLabs scribe_v2, with the output validated against Japanese codepoints either way. Once the message is in, two LLM calls fire in parallel: Haiku checks for errors (max_tokens 500) while Sonnet generates the character response (max_tokens 512). The error correction UI animates during the Sonnet wait.
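
The transcription chain described above could be sketched like this. The providers are passed in as plain callables, so no real Whisper or ElevenLabs client is assumed. The `Transcriber` signature, the `min_ratio` threshold, and the choice to treat provider errors as validation failures are assumptions made for illustration.

```python
import re
from typing import Callable, Optional

# Hiragana, katakana, and CJK unified ideographs.
_JAPANESE = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")

# Hypothetical provider shape: (audio bytes, language hint or None for auto-detect) -> text.
Transcriber = Callable[[bytes, Optional[str]], str]


def japanese_ratio(text: str) -> float:
    """Share of non-whitespace characters that fall in the Japanese ranges."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if _JAPANESE.match(c)) / len(chars)


def transcribe_japanese(
    audio: bytes,
    primary: Transcriber,
    fallback: Transcriber,
    min_ratio: float = 0.5,
) -> Optional[str]:
    """Try the primary pinned to 'ja'; on failed validation, fall through to auto-detect."""
    for provider, language in ((primary, "ja"), (fallback, None)):
        try:
            text = provider(audio, language).strip()
        except Exception:
            continue  # assumption: a provider error is handled like a failed validation
        if japanese_ratio(text) >= min_ratio:
            return text
    return None  # caller decides what to do when neither result validates
```

A ratio threshold rather than an all-or-nothing check leaves room for punctuation, digits, and the occasional loanword written in Latin script.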

After the character speaks, Haiku runs response analysis (max_tokens 1024–2048) and an async Agenda revision (max_tokens 6000). All session-state writes are batched into a single flush at session end — that's where FSRS-6 review is applied, the element tables get upserted, and tracked-error transitions get promoted.
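
One way to picture the single flush at session end is an in-memory buffer that collects per-turn results and hands them to one write call. The row shape and the `upsert` callback below are hypothetical stand-ins for the real element-table write; the FSRS-6 review step and tracked-error promotion are left out.

```python
from collections import defaultdict
from typing import Callable


class SessionBuffer:
    """Accumulates per-turn observations in memory and writes them once at session end."""

    def __init__(self) -> None:
        self._observations: dict[str, list[bool]] = defaultdict(list)

    def record_turn(self, analysis: dict[str, bool]) -> None:
        """analysis maps element id -> whether the learner used it correctly this turn."""
        for element_id, correct in analysis.items():
            self._observations[element_id].append(correct)

    def flush(self, upsert: Callable[[list[dict]], None]) -> int:
        """Collapse the buffer to one row per element and make a single upsert call."""
        rows = [
            {"element_id": element_id, "attempts": len(results), "correct": sum(results)}
            for element_id, results in sorted(self._observations.items())
        ]
        if rows:
            upsert(rows)
        self._observations.clear()
        return len(rows)
```

Nothing touches the database during the conversation, so a turn's latency depends only on the model calls.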

The whole stack is a Flutter client talking directly to Supabase Postgres, Anthropic, OpenAI (Whisper), and ElevenLabs. No custom backend service.

Engineering decisions

  • 01

    Agenda LLM as the sole input to the Conversation LLM

    The Conversation LLM doesn't see the profile snapshot, scenario data, or analysis blobs directly. It sees one Markdown brief, regenerated each turn by Haiku. Two payoffs: debugging collapses to reading one file, and the conversation model's job becomes "execute this brief" instead of synthesizing context every turn. Revision runs async with the character call so it never adds user-facing latency.

  • 02

    Sonnet for user-facing dialogue, Haiku 4.5 for everything else

    Error detection, response analysis, Agenda generation, post-session reflection — all Haiku. Only the character response runs on Sonnet. Per-turn cost stays realistic, and validation against the persona suites confirmed Haiku is sufficient for the structured-output tasks once the prompts are tight.

  • 03

    Dual proficiency tracking: EMA + FSRS-6

    Per-element EMA accuracy gives the "skill" feel learners expect. FSRS-6 stability/difficulty gives retrievability for scheduling. Both run side by side per element and both surface in the profile snapshot.

  • 04

    Controlled vocabularies enforced prompt-side AND code-side

    Grammar concepts, function ids, discourse skills, pragmatic skills, and tracked-error pattern ids are all closed enums. The LLM is told the list in its prompt, and the parser filters anything outside the set before it touches Postgres. Without this dual gate the element tables get polluted with invented concept names and English descriptive ids the moment output lengths grow.
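
The dual gate in decision 04 can be sketched with one set of definitions feeding both the prompt and the parser. The enum members below are invented placeholders, since the real lists are not given here, and the function names are hypothetical.

```python
# Hypothetical closed sets; the real enums are not listed in this write-up.
GRAMMAR_CONCEPTS = frozenset({"te_form", "particle_wo", "polite_negative"})
PRAGMATIC_SKILLS = frozenset({"softening_request", "apology"})

ALLOWED = {"grammar": GRAMMAR_CONCEPTS, "pragmatic": PRAGMATIC_SKILLS}


def render_enum_for_prompt(category: str) -> str:
    """Prompt-side gate: list the only ids the model is allowed to emit."""
    return f"Allowed {category} ids: " + ", ".join(sorted(ALLOWED[category]))


def filter_to_enums(analysis: dict[str, list[str]]) -> dict[str, list[str]]:
    """Code-side gate: drop unknown categories and any id outside its closed set."""
    cleaned: dict[str, list[str]] = {}
    for category, ids in analysis.items():
        allowed = ALLOWED.get(category)
        if allowed is None:
            continue
        cleaned[category] = [value for value in ids if value in allowed]
    return cleaned
```

Deriving the prompt text and the parser check from the same sets keeps the two gates from drifting apart as the vocabularies change.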

Notable challenges

  • 01

    Controlled-vocabulary hallucination at long output lengths

    Once the analysis prompt grew to ~3000-token output, the LLM started inventing concept names and emitting English descriptive ids for Japanese lexical items. Fix: dual filter — closed enums in the prompt, static Set lookups in the parser, plus a codepoint check that scans for hiragana/katakana/CJK. Out-of-set values are silently dropped before reaching Postgres.

  • 02

    Japanese STT has two failure modes, not one

    Whisper with language='ja' will sometimes translate spoken English into Japanese instead of returning the English. ElevenLabs scribe_v2 with language_code='ja' will sometimes return English for ambiguous audio. Fix: dual-provider chain that validates output against Japanese codepoints and falls through to the other provider with auto-detect if validation fails.

  • 03

    max_tokens silently truncating JSON mid-response

    The analysis call started truncating mid-JSON once the prompt picked up pattern-detection guidance, and a naive brace matcher returned null with no indication of why. Fix: surface truncation explicitly in the parse path instead of returning null, and replace the matcher with a hand-rolled JSON extractor that tolerates code fences, prose preambles, and trailing junk by tracking brace depth and string state.
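
A rough sketch of an extractor in the spirit of challenge 03, assuming the payload is a single top-level JSON object. The names `extract_json_object` and `TruncatedJSON` are hypothetical. The scan starts at the first `{`, so a preamble that itself contains a brace would confuse this version.

```python
import json
from typing import Any, Optional


class TruncatedJSON(ValueError):
    """Raised when an object opens but never closes, e.g. after hitting max_tokens."""


def extract_json_object(text: str) -> Optional[Any]:
    """Return the first balanced top-level {...} object in text, or None if none starts."""
    start = text.find("{")
    if start == -1:
        return None
    depth = 0
    in_string = False
    escaped = False
    for index in range(start, len(text)):
        char = text[index]
        if in_string:
            # Inside a string, braces are data; only track escapes and the closing quote.
            if escaped:
                escaped = False
            elif char == "\\":
                escaped = True
            elif char == '"':
                in_string = False
        elif char == '"':
            in_string = True
        elif char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                # Everything after this point (closing fence, sign-off) is ignored.
                return json.loads(text[start:index + 1])
    raise TruncatedJSON(f"object opened at offset {start} never closed")
```

Raising a distinct exception for the unclosed case separates "the model returned no JSON" from "the response was cut off", so the caller can log the truncation or retry with a larger budget. The Anthropic Messages API also reports a `stop_reason` of `max_tokens`, which can be checked before parsing.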