Doctolingo
iOS-native Spanish pronunciation coach for clinicians — on-device transcription via SFSpeechRecognizer plus per-phoneme scoring via Azure, with bundled reference audio for offline-first use after install.
Overview
Doctolingo is an iOS-native pronunciation coach for clinicians learning to speak medical Spanish with patients. It pairs Apple's on-device SFSpeechRecognizer (free, offline transcription) with Azure Pronunciation Assessment (per-phoneme accuracy scores) to deliver word-level heatmaps, a simplified Leitner SRS, and a side-by-side playback loop — all in an offline-first iPhone app.
Recently Shipped
- Settings wired end-to-end: daily goal, playback speed, haptics toggle, reset progress, bundle version
- Inline per-phrase trend card with last-20-attempts sparkline on RecordView
- On-device phoneme score via custom Castilian G2P (SFSpeech + IPA Levenshtein)
- Side-by-side playback: native speaker vs your own recording
- Azure Pronunciation Assessment — per-phoneme score display with word-level heatmap
- History view with per-phrase sparklines across all attempts
- Persistent attempt storage via GRDB over SQLite
- Simplified Leitner SRS (3 boxes; ≥80 promotes, <80 demotes to Box 1)
- Streak counter + configurable daily goal + haptic feedback on score arrival
- Pre-generated reference audio (~200 WAVs, es-ES-AlvaroNeural) bundled for offline-first
- Clinical scenarios: chest-pain intake, abdominal-pain intake, headache intake (~50 phrases each)
- TestFlight beta distribution via Xcode Cloud with Discord build-status webhook
Architecture
Pure SwiftUI app, iOS 17+. No backend in V1 — everything runs on-device or against Azure's REST endpoint. Audio capture uses AVAudioEngine with a single .playAndRecord session activated lazily at point-of-use (activating in App.init() fails on real devices — UIApplication lifecycle isn't ready). Scored attempts persist to a local SQLite DB via GRDB with UUID primary keys, created_at / updated_at / deleted_at columns, and user_id = 'local' until Clerk auth ships in V2.
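The attempt schema implied above can be sketched in plain SQL. The UUID primary key, the three timestamp columns, and `user_id = 'local'` come from the description; the table name and the remaining columns (phrase_id, transcript, the two score columns) are assumptions for illustration:

```sql
-- Sketch of the GRDB-managed attempts table (column set partly assumed).
CREATE TABLE attempt (
    id          TEXT PRIMARY KEY,              -- UUID
    phrase_id   TEXT NOT NULL,                 -- references the bundled phrase id
    transcript  TEXT,                          -- on-device SFSpeech transcript
    apple_score REAL,                          -- on-device (1 - PER) * 100
    azure_score REAL,                          -- Azure PronScore
    user_id     TEXT NOT NULL DEFAULT 'local', -- until Clerk auth ships in V2
    created_at  TEXT NOT NULL,
    updated_at  TEXT NOT NULL,
    deleted_at  TEXT                           -- soft delete
);
```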
Scoring Pipeline
Two scorers run per recording. On-device (Apple): SFSpeechRecognizer transcribes the attempt; a custom Castilian G2P (SpanishG2P.swift, ~25 rules — ch→tʃ, ll→ʝ, rr→r, qu/gue/gui, güe/güi→ɡw, ce/ci→θ, ge/gi→x, ñ→ɲ, word-initial r vs flap ɾ) converts both reference and transcript to IPA, then computes (1 − PER) × 100 via phoneme Levenshtein. Free, instant, offline. Cloud (Azure): The WAV is sent to Azure's REST Pronunciation Assessment endpoint; response is decoded with flat-key custom CodingKeys (Azure's REST shape differs from the SDK's nested shape — AccuracyScore, FluencyScore, PronScore sit directly on each NBest entry, not under a PronunciationAssessment sub-object). Both scores are stored per attempt to enable a later correlation analysis — the goal is to phase out paid Azure once SFSpeech+G2P correlates tightly enough.
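The on-device score reduces to a phoneme-level edit distance. A minimal sketch, assuming the G2P has already produced arrays of IPA symbols; these helper names are illustrative, not the actual SpanishG2P.swift API:

```swift
import Foundation

// Classic two-row Levenshtein over phoneme arrays (IPA symbols as Strings).
func levenshtein(_ a: [String], _ b: [String]) -> Int {
    if a.isEmpty { return b.count }
    if b.isEmpty { return a.count }
    var prev = Array(0...b.count)
    var curr = [Int](repeating: 0, count: b.count + 1)
    for i in 1...a.count {
        curr[0] = i
        for j in 1...b.count {
            let cost = a[i - 1] == b[j - 1] ? 0 : 1
            curr[j] = min(prev[j] + 1,         // deletion
                          curr[j - 1] + 1,     // insertion
                          prev[j - 1] + cost)  // substitution
        }
        swap(&prev, &curr)
    }
    return prev[b.count]
}

// Score = (1 - PER) * 100, where PER = edit distance / reference length.
func phonemeScore(reference: [String], attempt: [String]) -> Double {
    guard !reference.isEmpty else { return attempt.isEmpty ? 100 : 0 }
    let per = Double(levenshtein(reference, attempt)) / Double(reference.count)
    return max(0, (1 - per) * 100)
}
```

For "ocho" → [o, tʃ, o] against an attempt of [o, s, o], one substitution gives PER = 1/3 and a score of about 66.7.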
Content & Audio Pipeline
Reference audio is pre-generated offline via voicebox.sh (edge-tts + Microsoft neural Spanish voices like es-ES-AlvaroNeural), bundled in the app's content/es/audio/ directory as ~200 WAVs. No per-user TTS cost; no network required after install. Each phrase carries an id, en, es, and category. V1 ships 3 scenarios × ~50 phrases each: chest-pain intake, abdominal-pain intake, headache intake — all written against USMLE OPQRST vignettes (not Canopy phrases, to avoid any derivative-content argument).
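The per-phrase record can be sketched as a Codable struct matching the four fields listed; the struct name and the sample values are illustrative:

```swift
import Foundation

// Field names (id, en, es, category) match the README; everything else
// here is an assumption for illustration.
struct Phrase: Codable {
    let id: String
    let en: String
    let es: String
    let category: String
}

let json = """
{"id": "chest-001",
 "en": "Where does the pain start?",
 "es": "¿Dónde empieza el dolor?",
 "category": "chest-pain"}
""".data(using: .utf8)!

let phrase = try! JSONDecoder().decode(Phrase.self, from: json)
```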
SRS & Progression
Simplified Leitner with 3 boxes — Box 1 (due +1 day), Box 2 (+3 days), Box 3 mastered (+30 days). Scoring ≥ 80 promotes to the next box; < 80 drops back to Box 1. Persisted as JSON in UserDefaults (demo-scale). StreakStore tracks the current streak, today's phrase count, and the daily goal; the streak increments when the last practice was yesterday, resets after a full missed day, and normalizes on app foreground. Haptic feedback (UIImpactFeedbackGenerator) fires on score arrival with style .medium on pass, .rigid on fail — a tactile delta before the user reads the number.
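The promotion and demotion rules fit in a few lines. A sketch under the stated rules; the LeitnerCard type and its field names are assumptions, not the app's actual model:

```swift
import Foundation

// 3-box Leitner: score >= 80 promotes (capped at Box 3), < 80 demotes to
// Box 1; the due date comes from the new box's interval (+1, +3, +30 days).
struct LeitnerCard {
    var box: Int = 1            // 1, 2, or 3 (mastered)
    var due: Date = .distantPast

    private static let intervalDays = [1: 1, 2: 3, 3: 30]

    mutating func apply(score: Double, now: Date = Date()) {
        box = score >= 80 ? min(box + 1, 3) : 1
        let days = Self.intervalDays[box] ?? 1
        due = now.addingTimeInterval(Double(days) * 86_400)
    }
}
```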
Performance & Build
Build number is derived from git rev-list --count HEAD via scripts/bump.sh so every commit that ships is monotonically higher in TestFlight. PerfLog.swift times every hot path (audio session activation, AVAudioPlayer init, reference-audio bundle lookups, Azure roundtrip) — critical for catching Debug-build slowness, since Swift's synthesized Codable decodes JSON ~50–100× slower in Debug than Release and can starve the main thread at launch. Reference-audio URL lookups are cached per phrase (150 phrases × many SwiftUI re-renders × recursive bundle walks added up to visible lag).
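The per-phrase URL cache described above amounts to memoizing Bundle.url(forResource:). A sketch; the AudioURLCache name is an assumption, while the content/es/audio subdirectory matches the bundled layout:

```swift
import Foundation

// Memoizes reference-audio lookups so repeated SwiftUI re-renders don't
// trigger repeated bundle walks. Caches misses (nil) as well as hits.
final class AudioURLCache {
    private var cache: [String: URL?] = [:]

    func url(forPhrase id: String, in bundle: Bundle = .main) -> URL? {
        if let hit = cache[id] { return hit }  // hit may itself be nil
        let found = bundle.url(forResource: id,
                               withExtension: "wav",
                               subdirectory: "content/es/audio")
        cache[id] = found
        return found
    }
}
```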
Privacy Posture
SFSpeech transcripts never leave the device. Recorded WAVs are sent to Azure solely for scoring and are not persisted by Microsoft per their Speech Service retention policy. The app stores attempts (transcripts, scores, timestamps) locally in SQLite at Application Support/Doctolingo/doctolingo.sqlite — explicitly outside Documents/ so recordings don't surface in iTunes file sharing. No analytics, no crash reporting, no third-party SDKs in V1.
What's Explicitly Out of V1
No Android. No free-translation drills, grammar exercises, or vocabulary flashcards — V1 is pronunciation-only. No social features, leaderboards, or friends. No AI-patient role-play (captured as V2). No Mandarin, Arabic, Vietnamese — Spanish-only ship. No Supabase sync, no RevenueCat subscription gating, no voice profile enrollment. Scope discipline is the feature; Canopy is enterprise-bloated and Duolingo teaches you to order coffee, not to ask about chest pain character.