Anatomy of a Live Voice Loop¶
STT/agent/TTS design, security boundary, cost guards — Part 2 of a 4-part series on building a real-time voice AI coach into Wattlog.pro.
Part 1 covered why I built a bidirectional voice loop instead of one-way TTS. This part is the architecture: how a rider's spoken question, asked mid-workout, turns into a spoken answer that references a real, live number.
The shape of one voice turn¶
Rider push-to-talk clip
│ POST /api/coach/voice/stt (batch STT)
▼
transcript
│ POST /api/coach/chat-style agent turn (spotter loop)
│ ├─ inline live-session snapshot injected into context
│ └─ get_live_session_state tool available for anything the
│ snapshot doesn't cover
▼
reply text
│ POST /api/coach/voice/tts (streaming TTS)
▼
audio chunks → shared AudioContext → rider hears it
Two design decisions here are the ones I'd defend hardest in an interview.
Decision 1: the LLM never gets a session_id from the client¶
The agent's live-telemetry tool, get_live_session_state, takes no
session identifier as an argument. The obvious naive design — "the client
sends its session ID, the tool reads that session" — is a cross-tenant data
read waiting to happen: nothing stops a malicious or buggy client from
passing someone else's session ID and reading their live power/HR data.
Instead, the tool resolves state server-side, from the authenticated JWT
user, through a registry keyed by (user_id, athlete_id):
# services live-session registry — resolution, not client input
registry.get((user_id, athlete_id))
The trust boundary sits at authentication, not at a request parameter. This is the same instinct as "never trust a client-supplied user ID in a URL" — just applied to an LLM tool call instead of a REST endpoint. LLM tool arguments are still attacker-influenceable input; I treat them exactly like any other untrusted client input.
Decision 2: inline snapshot first, tool call only as fallback¶
The initial instinct is "the coach has a tool, so every voice question triggers a tool call to fetch live state." I measured that instead of assuming it, and it was the wrong default: most spotter questions ("how much longer?", "what's my power?") need the same handful of fields every time.
So a compact live snapshot — current power, HR, step, plan, time remaining —
gets injected directly into the turn's context up front. The
get_live_session_state tool stays registered for anything deeper, but the
common case now answers in one LLM round-trip instead of two. That one
change alone measurably cut latency, and it came from watching real
tool_calls counts in production logs rather than guessing.
One subtlety worth calling out: the snapshot's "time remaining" isn't the raw clock value. The whole voice pipeline (STT + LLM + TTS) takes real time — by the moment the rider hears "twenty-five seconds," a naive snapshot value would already be stale by however long the pipeline took to run. The snapshot subtracts a measured latency offset before it's spoken, so the number the rider hears matches the number on screen, not the number that was true when the turn started.
Decision 3: reuse the existing agent, don't fork it¶
There was already a working, tool-using coach agent handling text chat —
with its own careful [SUGGEST] / propose_workout parsing for "would you
like to start this workout?" flows. It would have been faster, short-term,
to copy that loop and hack in voice-specific behavior. I didn't, for one
concrete reason: the voice coach needed a completely different persona
from the text coach — terse, spoken-language replies (no Markdown, no bullet
lists, ideally under ~12 words), advice-only (no proposing workouts
mid-interval), and it needed to stay that way independent of whatever the
text coach's prompt evolved into.
So the voice path got its own system prompt (a "spotter" persona) and its own tool subset, but shares the same bounded tool-loop primitives underneath. One shared reasoning engine, two personas, zero duplicated control flow. If I'd forked the whole agent, a bug fix to the loop (say, in how it detects a refusal) would silently apply to only one of the two paths — the kind of drift that shows up as a confusing production bug three months later.
Cost is a first-class design constraint, not an afterthought¶
This is a solo, self-funded project. An uncapped voice AI feature is a denial-of-wallet vector, not just a technical risk. Every voice turn:
- requires a registered, verified user (no guest-account abuse) —
enforced by a shared FastAPI dependency (
resolve_coached_athlete) used by chat, STT, and TTS endpoints alike, so the gate can't be forgotten on one path, - counts against the same atomic daily quota as text chat, via
check_and_increment_quota, - is capped on input (audio clip duration for STT) and output (reply character count fed to TTS),
- is logged with a cost line item (
cost_log) so STT/LLM/TTS spend is auditable per turn, not just inferred from a monthly bill.
And critically: barge-in has to stop the meter, not just stop the
speaker. If a rider interrupts the coach mid-sentence, the audio has to
stop and the upstream ElevenLabs stream has to actually close — otherwise
you're paying to generate speech nobody hears. That's a StreamingResponse
+ is_disconnected teardown on the server, triggered by an AbortController
on the client. It sounds like a UX nicety. It's actually a billing
correctness requirement.
Provider choice: ElevenLabs, and why not build my own TTS¶
I picked ElevenLabs for both speech-to-text (their Scribe model) and text-to-speech (Flash v2.5) for three concrete reasons: the TTS model is genuinely low-latency (~75ms to first audio, which keeps TTS off the critical path — see Part 3 for why that matters), it's multilingual out of the box (the app already ships four languages: English, Polish, German, Dutch, and the coach has to reply in the rider's language), and — this one matters for a portfolio decision — building your own STT/TTS is not the interesting problem here. The interesting problem is the system around a good provider: the streaming pipeline, the cost guards, the barge-in teardown, the tool-calling over live state. I said no to a lower-leverage yak-shave so I could spend the effort on the part that actually differentiates the project.
Both paid calls run server-side only — the ElevenLabs API key never reaches the client. The endpoints are thin proxies with auth, quota, and caps wrapped around them, which is the only architecture I'd consider defensible for a client that's a public web/mobile app.
What this bought me — and what it cost¶
The result: a rider mid-interval can hold a button, ask "how much longer on this?", and hear back a real number pulled from the actual running session — not a canned response, not a hallucinated guess. Barge-in works: talking over the coach cuts its audio and its bill in the same motion.
What it didn't buy me, yet, was speed. The first working version of this pipeline took 5-6 seconds from "rider stops talking" to "first word heard." For something you're supposed to shout mid-interval without breaking cadence, that's not good enough. Part 3 is the story of chasing that number down — and the honest, unglamorous conclusion about exactly where the time was actually going.
This is Part 2 of a 4-part series. Part 1: Why I Built a Voice AI Coach. Part 3: The Latency Engineering. Part 4: What Shipped, What Broke, What's Next.