Why I Built a Voice AI Coach¶

Problem framing, portfolio angle, three architecture options — Part 1 of a 4-part series on building a real-time voice AI coach into Wattlog.pro.

I run a solo-built cycling training platform — think a leaner, self-hosted alternative to TrainingPeaks/Zwift, with live ANT+/BLE sensor ingestion, a training-load model (CTL/ATL/TSB), structured workouts driving a smart trainer in ERG mode, and an LLM-backed coach you can chat with about your plan. Zero funding, zero employees, real users, real hardware in my living room.

At some point I wanted to add voice: let a rider keep their eyes on the road (or the power meter) instead of the screen, and just talk to the coach mid-workout.

The first version of that idea was almost embarrassingly small: bolt text-to-speech onto the beep system that already told riders "interval starting" or "drink water." Ship it, move on.

I nearly built that. Then I asked myself the question that actually matters before writing code: who is this for, and what does "done" prove?

The honest answer¶

There is no registered user base clamoring for spoken beeps. The app is public with effectively no monetization. What I actually wanted — and said out loud once I stopped dressing it up — was:

"It was just to build my personal portfolio, that I can include this kind of AI agent that can 'speak' and understand the rider over the training."

Not "ship a feature." Prove, to an AI/ML or full-stack hiring manager reading my repo, that I can build real-time, voice-interactive AI systems — not that I can call a TTS API once.

That reframing changed everything downstream. A one-way "coach speaks at you" system is a nice-to-have. It is also, technically, boring: call an API, play an mp3. It would not survive five minutes of questions in an interview about "tell me about a hard system you built."

The interesting engineering problem — the one worth an article, and worth defending in an interview — was different:

Can the rider talk back? Can the coach understand live telemetry (current power, heart rate, which interval, seconds remaining) and answer a real spoken question with a real number, fast enough that it doesn't feel broken mid-effort?

That's not a TTS integration. That's speech-to-text, an LLM agent with live tool access to a running session, streaming text-to-speech, and a barge-in/interrupt model — stitched into a sub-2-second budget, on a cost model that can't bankrupt a project with no revenue.

Three ways to build it, and why I picked the hard one¶

I laid out three real architectures before touching code:

A — Streaming one-way narration. The coach speaks short lines triggered by workout events (interval start, HR drift), streamed into TTS so audio starts before the sentence is even fully generated. Low risk, low effort. Article-worthy for "first word in under 500ms," but it's still "smart beeps." It doesn't prove the coach understands the rider — because the rider never says anything.

B — Full bidirectional voice loop (what I built). Rider speaks → speech-to-text → the coach's existing tool-using LLM agent, now with a live-telemetry tool → streaming text-to-speech, with barge-in so the rider can interrupt mid-sentence. Every hop is mine to design, measure, and defend. Highest effort, highest risk, and the only option that actually delivers on "speaks and understands."

C — A realtime speech-to-speech API (OpenAI Realtime, Gemini Live) that handles STT+reasoning+TTS as one black box, fed live cycling data via function calls. Interesting, but it would have made the article about integrating an API, not about the latency and concurrency engineering I actually wanted on my resume. I noted it as "what I'd try next" instead — which turned out to be the right call, and something I can talk about knowledgeably in an interview without having built it.

I chose B. Not because it was the "portfolio-optimal" choice in some cynical sense, but because it was the only option where the interesting failure modes — barge-in cost leaks, streaming an LLM into a TTS websocket, WebKit's refusal to progressively play MP3 — were mine to actually solve.

What I already had, and what was genuinely new¶

I didn't start from zero. There was already a working, tool-using text coach (POST /api/coach/chat) with cost/abuse guards, an atomic daily quota, and a bounded agent loop calling read-only tools like get_training_plan. Reusing that agent — rather than building a second one for voice — was a deliberate constraint I set myself: the new work was wiring the existing brain to a live microphone and a live data feed, not reinventing the reasoning layer.

What was genuinely new, and where the real risk lived: - Microphone capture in the browser (getUserMedia/MediaRecorder) — nothing like it existed in the codebase yet. - Server-side streaming with client-disconnect teardown (StreamingResponse + is_disconnected polling in FastAPI) — needed for barge-in to actually stop billing the moment a rider interrupts. - A tool that resolves "what is this rider's power/HR/step/plan right now" from server-side session state — scoped strictly to the authenticated user, with no client-supplied session ID, closing an obvious cross-tenant read.

I scoped v1 deliberately narrow: web/desktop only (no mobile yet — WKWebView and Capacitor's audio stack is its own can of worms, covered in a later part), push-to-talk instead of always-listening (so speech-to-text gets one clean short clip instead of needing to stream), and no fixed latency target pulled from thin air — I measured the real bottleneck first and reported the honest number instead of hitting an arbitrary one.

That last point matters more than it sounds. It's easy to promise "sub-2-second voice AI" in a plan doc. It's harder, and more honest, to say: here's where the time actually goes, here's the bottleneck, here's what I did about it, and here's what's still open. Part 3 of this series is that latency story — the LLM completion, not the network, turned out to be the villain, and shaving it down (~5-6s → ~2.5-3s server-side) is where most of the real engineering happened.

Part 2 covers the architecture I landed on: the tool-calling agent with live-session access, the streaming TTS pipeline, and the cost/abuse guards that keep a solo-funded project from an ElevenLabs bill it can't pay.

This is Part 1 of a 4-part series on building a real-time, bidirectional voice AI coach into a production cycling training platform. Part 2: Anatomy of a Live Voice Loop. Part 3: The Latency Engineering. Part 4: What Shipped, What Broke, What's Next.