Skip to content

Anatomy of a Live Voice Loop

STT/agent/TTS design, security boundary, cost guards — Part 2 of a 4-part series on building a real-time voice AI coach into Wattlog.pro.


Part 1 covered why I built a bidirectional voice loop instead of one-way TTS. This part is the architecture: how a rider's spoken question, asked mid-workout, turns into a spoken answer that references a real, live number.

The shape of one voice turn

 Rider push-to-talk clip
     POST /api/coach/voice/stt   (batch STT)
   
 transcript
     POST /api/coach/chat-style agent turn (spotter loop)
       ├─ inline live-session snapshot injected into context
       └─ get_live_session_state tool available for anything the
          snapshot doesn't cover
   
 reply text
     POST /api/coach/voice/tts   (streaming TTS)
   
 audio chunks  shared AudioContext  rider hears it

Two design decisions here are the ones I'd defend hardest in an interview.

Decision 1: the LLM never gets a session_id from the client

The agent's live-telemetry tool, get_live_session_state, takes no session identifier as an argument. The obvious naive design — "the client sends its session ID, the tool reads that session" — is a cross-tenant data read waiting to happen: nothing stops a malicious or buggy client from passing someone else's session ID and reading their live power/HR data.

Instead, the tool resolves state server-side, from the authenticated JWT user, through a registry keyed by (user_id, athlete_id):

# services live-session registry — resolution, not client input
registry.get((user_id, athlete_id))

The trust boundary sits at authentication, not at a request parameter. This is the same instinct as "never trust a client-supplied user ID in a URL" — just applied to an LLM tool call instead of a REST endpoint. LLM tool arguments are still attacker-influenceable input; I treat them exactly like any other untrusted client input.

Decision 2: inline snapshot first, tool call only as fallback

The initial instinct is "the coach has a tool, so every voice question triggers a tool call to fetch live state." I measured that instead of assuming it, and it was the wrong default: most spotter questions ("how much longer?", "what's my power?") need the same handful of fields every time.

So a compact live snapshot — current power, HR, step, plan, time remaining — gets injected directly into the turn's context up front. The get_live_session_state tool stays registered for anything deeper, but the common case now answers in one LLM round-trip instead of two. That one change alone measurably cut latency, and it came from watching real tool_calls counts in production logs rather than guessing.

One subtlety worth calling out: the snapshot's "time remaining" isn't the raw clock value. The whole voice pipeline (STT + LLM + TTS) takes real time — by the moment the rider hears "twenty-five seconds," a naive snapshot value would already be stale by however long the pipeline took to run. The snapshot subtracts a measured latency offset before it's spoken, so the number the rider hears matches the number on screen, not the number that was true when the turn started.

Decision 3: reuse the existing agent, don't fork it

There was already a working, tool-using coach agent handling text chat — with its own careful [SUGGEST] / propose_workout parsing for "would you like to start this workout?" flows. It would have been faster, short-term, to copy that loop and hack in voice-specific behavior. I didn't, for one concrete reason: the voice coach needed a completely different persona from the text coach — terse, spoken-language replies (no Markdown, no bullet lists, ideally under ~12 words), advice-only (no proposing workouts mid-interval), and it needed to stay that way independent of whatever the text coach's prompt evolved into.

So the voice path got its own system prompt (a "spotter" persona) and its own tool subset, but shares the same bounded tool-loop primitives underneath. One shared reasoning engine, two personas, zero duplicated control flow. If I'd forked the whole agent, a bug fix to the loop (say, in how it detects a refusal) would silently apply to only one of the two paths — the kind of drift that shows up as a confusing production bug three months later.

Cost is a first-class design constraint, not an afterthought

This is a solo, self-funded project. An uncapped voice AI feature is a denial-of-wallet vector, not just a technical risk. Every voice turn:

  • requires a registered, verified user (no guest-account abuse) — enforced by a shared FastAPI dependency (resolve_coached_athlete) used by chat, STT, and TTS endpoints alike, so the gate can't be forgotten on one path,
  • counts against the same atomic daily quota as text chat, via check_and_increment_quota,
  • is capped on input (audio clip duration for STT) and output (reply character count fed to TTS),
  • is logged with a cost line item (cost_log) so STT/LLM/TTS spend is auditable per turn, not just inferred from a monthly bill.

And critically: barge-in has to stop the meter, not just stop the speaker. If a rider interrupts the coach mid-sentence, the audio has to stop and the upstream ElevenLabs stream has to actually close — otherwise you're paying to generate speech nobody hears. That's a StreamingResponse + is_disconnected teardown on the server, triggered by an AbortController on the client. It sounds like a UX nicety. It's actually a billing correctness requirement.

Provider choice: ElevenLabs, and why not build my own TTS

I picked ElevenLabs for both speech-to-text (their Scribe model) and text-to-speech (Flash v2.5) for three concrete reasons: the TTS model is genuinely low-latency (~75ms to first audio, which keeps TTS off the critical path — see Part 3 for why that matters), it's multilingual out of the box (the app already ships four languages: English, Polish, German, Dutch, and the coach has to reply in the rider's language), and — this one matters for a portfolio decision — building your own STT/TTS is not the interesting problem here. The interesting problem is the system around a good provider: the streaming pipeline, the cost guards, the barge-in teardown, the tool-calling over live state. I said no to a lower-leverage yak-shave so I could spend the effort on the part that actually differentiates the project.

Both paid calls run server-side only — the ElevenLabs API key never reaches the client. The endpoints are thin proxies with auth, quota, and caps wrapped around them, which is the only architecture I'd consider defensible for a client that's a public web/mobile app.

What this bought me — and what it cost

The result: a rider mid-interval can hold a button, ask "how much longer on this?", and hear back a real number pulled from the actual running session — not a canned response, not a hallucinated guess. Barge-in works: talking over the coach cuts its audio and its bill in the same motion.

What it didn't buy me, yet, was speed. The first working version of this pipeline took 5-6 seconds from "rider stops talking" to "first word heard." For something you're supposed to shout mid-interval without breaking cadence, that's not good enough. Part 3 is the story of chasing that number down — and the honest, unglamorous conclusion about exactly where the time was actually going.


This is Part 2 of a 4-part series. Part 1: Why I Built a Voice AI Coach. Part 3: The Latency Engineering. Part 4: What Shipped, What Broke, What's Next.