What Shipped, What Broke, What's Next

Proactive vs. reactive split, honest failures, the hiring-manager pitch — Part 4 of a 4-part series on building a real-time voice AI coach into Wattlog.pro.

Part 1, Part 2, and Part 3 covered why I built a bidirectional voice loop, how it's architected, and the latency chase. This closing part is the honest retrospective: what actually shipped, the proactive half of the feature I haven't mentioned yet, what's still broken, and what I'd tell a hiring manager if they asked "what would you do differently?"

Two voice features, not one

Everything so far described the reactive path: rider presses a button, asks a question, gets a spoken answer. There's a second, cheaper, and in some ways more interesting half of this: proactive voice — the coach speaking training cues on its own, without being asked.

I split these deliberately, because they have almost opposite cost and risk profiles:

Reactive Q&A is unbounded — the LLM generates novel text every time, so every turn costs real money and has real (if now-optimized) latency.
Proactive cues (hydration reminders, "heart rate too high, ease off", interval transitions) are a known, small, fixed set of messages, so they don't need to be synthesized live at all. I pre-rendered them once, offline, across all four supported languages, and committed the resulting mp3 files straight into the repo. At runtime, playing a hydration cue is a static file fetch: zero LLM cost, zero TTS cost, no synthesis latency, and it works even if ElevenLabs is down.
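To make the "static file fetch" concrete, here is a minimal sketch of the lookup in Python. It is not the Wattlog.pro code: the cue ids, the four language codes, the `/static/voice` root, and the fallback-to-default-language behavior are all placeholders assumed for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical values: the post says "four supported languages" and a small
# fixed cue set, but does not name them. These are placeholders.
SUPPORTED_LANGS = ("en", "de", "fr", "es")
CUES = frozenset({"hydration", "hr_too_high", "hr_too_low", "interval_start"})
AUDIO_ROOT = PurePosixPath("/static/voice")


def cue_url(cue: str, lang: str, default_lang: str = "en") -> str:
    """Map a (cue, language) pair to the path of a pre-rendered mp3.

    No LLM or TTS call happens here; the file was rendered at build time.
    """
    if cue not in CUES:
        raise KeyError(f"unknown cue: {cue}")
    if lang not in SUPPORTED_LANGS:
        lang = default_lang  # assumed fallback, not confirmed by the post
    return str(AUDIO_ROOT / lang / f"{cue}.mp3")
```

The whole runtime cost of a proactive cue is one dictionary-style check and one string join; everything expensive happened once, offline.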

The one exception: user-authored TrainerMessages (custom text a rider or coach attaches to a specific point in a planned workout) are arbitrary, unknown-at-build-time text, so those do need a live, gated TTS call — but even that's cached by a content hash of the message text, so re-riding the same planned workout a second time doesn't re-pay ElevenLabs for identical audio.
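A content-hash cache of this kind can be sketched in a few lines. The version below is an assumption-heavy illustration, not the shipped implementation: the `synthesize` callable stands in for whatever gated ElevenLabs call the server makes, the on-disk cache directory is hypothetical, and including language and voice in the key is my addition (the post only says the hash covers the message text).

```python
import hashlib
from pathlib import Path
from typing import Callable

# Hypothetical signature for the live TTS call: (text, lang, voice) -> mp3 bytes.
Synthesize = Callable[[str, str, str], bytes]


def cache_key(text: str, lang: str, voice: str) -> str:
    """Stable key for identical audio: same text, language, and voice."""
    payload = "\x1f".join((lang, voice, text.strip()))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def get_or_synthesize(
    text: str, lang: str, voice: str, cache_dir: Path, synthesize: Synthesize
) -> Path:
    """Return the cached mp3 for this message, paying for TTS only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(text, lang, voice)}.mp3"
    if not path.exists():
        path.write_bytes(synthesize(text, lang, voice))
    return path
```

Re-riding the same planned workout hits `path.exists()` and never reaches `synthesize`, which is the property the paragraph above describes.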

The design principle underneath: push variability to build-time wherever the actual content isn't variable. A rider's HR going out of a training zone is not a novel event that needs a fresh LLM-generated sentence — it's one of exactly two states (too high / too low), known in advance, in four languages. Treating it like a live-generation problem would have been needless cost and an unnecessary failure surface for something that doesn't require intelligence, just a lookup.

One deliberately opinionated call here: the spoken heart-rate cue omits the actual bpm number ("heart rate too high, ease off", not "heart rate is 172, ease off"), because the audio is pre-rendered and a clip can't be recorded in advance for every possible reading. The visual toast on screen keeps the exact number. A rider mid-effort needs "ease off," not a number they'd have to do math on anyway.
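The split between a fixed spoken cue and a numeric toast might look like the sketch below. The zone bounds, the cue ids, and the toast wording are all assumed for illustration; only the idea (two possible spoken states, exact number kept on screen) comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HrCue:
    cue_id: str  # selects a pre-rendered clip; carries no number
    toast: str   # on-screen text; keeps the exact reading


def hr_cue(bpm: int, zone_low: int, zone_high: int) -> Optional[HrCue]:
    """Classify a heart-rate reading against the target zone.

    Returns None while the rider is inside the zone, otherwise one of
    exactly two cue ids plus a toast that includes the number.
    """
    if bpm > zone_high:
        return HrCue("hr_too_high", f"Heart rate {bpm} bpm, above {zone_high}")
    if bpm < zone_low:
        return HrCue("hr_too_low", f"Heart rate {bpm} bpm, below {zone_low}")
    return None
```

The spoken side is a lookup on `cue_id`, so an infinite range of readings collapses into two audio files per language.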

What broke, or almost did

Two things are worth naming honestly, because "nothing went wrong" isn't a credible engineering story:

The desktop MediaSource fallback (covered in Part 3) is a silent degradation, not a crash. It doesn't throw an error, doesn't log a warning by default, and the feature still works — it's just slower on one platform than the code appears to promise. That's the most dangerous kind of bug: the kind that passes every test that isn't specifically checking platform behavior, and would have shipped invisibly if I'd only tested in Chrome.

Barge-in cost-teardown needed to be treated as a correctness requirement, not a UX nicety, from day one. It would have been easy to ship "stop playing audio when the rider interrupts" as a purely client-side behavior — stop the speaker, and technically the feature "works." The bug that approach hides is a live upstream TTS (and, in the next iteration, an LLM) connection still generating content and accruing cost after the rider has stopped caring. I treated every barge-in path as a billing-correctness test case, not just a UX one, and it changed the actual server implementation (structured-concurrency teardown, not a client flag).
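As a rough illustration of what "teardown as a correctness requirement" means, here is a sketch using Python's asyncio. The post does not say what the server is written in or how its teardown is implemented, so everything here is assumed: `FakeTTSStream` is a stand-in for an upstream TTS connection (not a real SDK), and `run_turn` shows one way to guarantee that no task driving the upstream outlives the turn.

```python
import asyncio


class FakeTTSStream:
    """Hypothetical upstream connection that bills per generated chunk."""

    def __init__(self) -> None:
        self.chunks_generated = 0
        self.closed = False

    async def next_chunk(self) -> bytes:
        await asyncio.sleep(0.01)  # simulated network wait
        self.chunks_generated += 1
        return b"\x00"

    def close(self) -> None:
        self.closed = True


async def speak(stream: FakeTTSStream, play) -> None:
    """Pull audio from upstream until cancelled; always close the stream."""
    try:
        while True:
            await play(await stream.next_chunk())
    finally:
        stream.close()  # runs on cancellation too, so billing stops


async def run_turn(stream: FakeTTSStream, barge_in: asyncio.Event, play) -> None:
    """Speak until done or interrupted; no child task survives this call."""
    speaker = asyncio.create_task(speak(stream, play))
    interrupted = asyncio.create_task(barge_in.wait())
    try:
        await asyncio.wait(
            {speaker, interrupted}, return_when=asyncio.FIRST_COMPLETED
        )
    finally:
        for task in (speaker, interrupted):
            task.cancel()  # no-op for a task that already finished
        # A real server would inspect these results instead of discarding them.
        await asyncio.gather(speaker, interrupted, return_exceptions=True)


async def demo() -> tuple:
    stream, barge_in = FakeTTSStream(), asyncio.Event()

    async def play(chunk: bytes) -> None:
        pass  # a real client would enqueue audio here

    async def rider_interrupts() -> None:
        await asyncio.sleep(0.05)
        barge_in.set()

    await asyncio.gather(run_turn(stream, barge_in, play), rider_interrupts())
    at_teardown = stream.chunks_generated
    await asyncio.sleep(0.05)  # give a leaked task time to show itself
    return stream.closed, at_teardown, stream.chunks_generated
```

The billing-correctness check is the last line of `demo`: the chunk count read at teardown must equal the count read later. A client-side "mute the speaker" flag would pass a UX test and fail that one.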

What's genuinely deferred, and why that's a defensible call

Mobile (Capacitor/iOS/Android). The reactive push-to-talk path is currently web/desktop-only. Getting getUserMedia/MediaRecorder mic capture working inside a WKWebView, without an AVAudioSession category switch suspending the shared AudioContext that also drives the workout's beep cues, is a real spike of its own, and the demo GIF for the article looks identical whether it's recorded on desktop or a phone. I scoped it out of v1 on purpose rather than let it balloon the timeline; it's documented as a follow-up with the specific risk called out (Android first, because Chromium's getUserMedia support is much closer to desktop; iOS is the hard one).
LLM token-streaming into TTS (Part 3) — parked behind an explicit go/no-go measurement step, not shipped speculatively.
Always-listening / wake-word / VAD — deliberately out of scope. It solves a UX problem ("don't make me hold a button") I didn't have evidence anyone needed solved, at the cost of a much harder false-positive problem (the coach shouldn't start listening because you coughed).

Saying "out of scope, here's why, here's the documented follow-up" is itself part of the engineering story I wanted to be able to tell. A feature list with no boundaries usually means the boundaries weren't thought about, not that there aren't any.

What I'd tell a hiring manager

If someone asked me to summarize this project in the length of an elevator ride:

"I built a bidirectional voice interface into a live training session — speech-to-text, a tool-calling LLM agent reading real-time sensor state server-side with strict per-user isolation, and streaming text-to-speech, under a hard cost budget on a project with no revenue. I measured the pipeline before optimizing it, found the LLM completion (not the network) was the actual bottleneck, cut end-to-end latency from ~5-6s to ~2.5-3s with three targeted changes, and found — by testing on the real deployment target, not just Chrome — a platform-specific gap that only shows up on WebKit. I split the feature into a cheap, pre-rendered proactive path and an expensive, quota-gated reactive path, because treating a fixed message set like a live-generation problem would have been wasted cost. And I can tell you exactly what I chose not to build yet, and why, with the risk analysis that led to that call."

That's a longer answer than a headhunter wants, but it's the true one, and it's the one that survives a follow-up question. The whole point of writing this series instead of just shipping the feature quietly was to make that follow-up question easy to answer well.

This is Part 4 of a 4-part series. Part 1: Why I Built a Voice AI Coach. Part 2: Anatomy of a Live Voice Loop. Part 3: The Latency Engineering.