The Latency Engineering¶
5-6s → 2.5-3s, the real bottleneck, a WebKit gotcha, and a deferred streaming rewrite — Part 3 of a 4-part series on building a real-time voice AI coach into Wattlog.pro.
Part 2 covered the architecture: speech-to-text → tool-using LLM agent → streaming text-to-speech. This part is about the number that actually matters for a voice interface — how long between "rider stops talking" and "rider hears the first word back" — and the unglamorous, honest process of chasing it down.
Measure before you optimize (the boring rule that's always right)¶
The v1 build had no fixed latency target pulled from a spec doc. I deliberately refused to write down "sub-1.5s" before measuring anything, because a made-up number invites you to optimize the wrong hop. What I did instead: instrument every stage (STT duration, LLM completion time, TTS-first-chunk time, end-to-end) and let the real bottleneck declare itself.
The first honest measurement: ~5-6 seconds, mic-stop to first audio. Broken down:
- STT (batch transcription of the push-to-talk clip): small, fixed — the floor you can't stream away since push-to-talk sends one complete short clip, not a live stream.
- LLM completion: the dominant cost. 1-3 seconds for a short reply, and worse when the agent decided to call a tool first (a full extra round-trip before the model even starts generating the answer).
- TTS first-chunk: ElevenLabs Flash v2.5 streaming, measured directly — around 75ms to first audio byte once synthesis starts. Genuinely fast. Not the bottleneck.
- Client-side buffering: the original hook fetched the entire TTS
response as one
arraybufferand only started playback after the whole clip finished downloading — throwing away the streaming TTS provider's main advantage.
So the honest conclusion, which is less flattering than "I streamed everything and hit 500ms" but is the actually useful engineering insight: the LLM completion is the bottleneck, not the network, and not the TTS provider. Any fix that doesn't address that is polishing the wrong part of the pipeline.
Three cheap wins, shipped first¶
Before attempting the riskiest rewrite (streaming the LLM's output token-by-token into a TTS websocket — more on that below), I looked for wins that didn't require rearchitecting the agent loop:
- Cap the reply length and tighten the spotter's brevity in the prompt. A shorter target reply is fewer tokens to generate, which is directly proportional to LLM completion time. This is the least interesting fix and one of the most effective — a persona designed to be terse ("Twenty-five seconds. Hold two-eighty.") isn't just better spoken UX, it's a latency optimization in disguise.
- Progressive TTS playback with a buffered fallback, replacing the "download everything, then play" client behavior. On platforms that support it, audio starts playing as chunks arrive instead of waiting for the full clip.
- Skip the tool round-trip on the common path by relying on the inline
live-session snapshot (Part 2) instead of forcing every voice question
through
get_live_session_state. This is the change that mattered most in practice: it eliminated an entire extra LLM round-trip for the majority of real questions.
Measured result after these three: ~5-6s → ~2.5-3s server-side, with
tool_calls=0 on effectively every real turn (confirming the snapshot
covers the common case) and the worst-case ~6-second tool-call outliers
gone entirely.
The gap that's left, and why it's platform-specific, not effort-specific¶
2.5-3 seconds is "acceptable, still not great." Digging into why it wasn't reliably sub-2s surfaced something I hadn't expected: the remaining gap wasn't uniform across platforms. It was concentrated on desktop.
The progressive-playback win relies on MediaSource.isTypeSupported('audio/mpeg')
being true, so the client can append MP3 chunks to a SourceBuffer as they
arrive. That's true on Chrome/Blink — so web users on Chrome genuinely get
early playback. It's false on Safari/WebKit, which is what the desktop
shell (a pywebview/WKWebView-based app) runs on. On desktop, the same code
silently falls back to buffering the entire clip before playing anything —
the exact behavior the fix was supposed to eliminate, just for one platform
instead of all of them.
This is the kind of bug that's easy to miss if you only test in a normal
browser: the code path looks identical, the API surface (MediaSource,
SourceBuffer) is standard, and it silently degrades instead of throwing an
error. I only caught it by testing on the actual desktop shell, not just
Chrome — a reminder that "cross-platform" web APIs quietly aren't, and the
only way to know is to run the thing on the platform you're shipping to.
The next lever: streaming the LLM itself, and why I'm sequencing it carefully¶
The plan I wrote up for the next optimization pass is the riskiest remaining
diff: stream the LLM's tokens directly into ElevenLabs' streaming-input
TTS websocket, so audio synthesis starts before the model finishes
generating the reply, instead of waiting for the full completion. Projected
result: sub-2s, and on the currently-tool-free common path (since
tool_calls=0 in practice now), fully streamable end to end.
I have not shipped this yet, and I want to be specific about why, because the "why not" is itself useful engineering judgment for an interview conversation:
- The bounded tool-loop that classifies refusals, malformed replies, and cap-reached states currently only works on a complete message. Making it stream-aware means classifying a reply while it's still arriving — which means a refusal or truncation might only be detected after some audio has already been synthesized and is playing. That's not a bug to "just fix" — it's a real design tradeoff between latency and the ability to cleanly cancel a bad reply before the rider hears any of it. My plan accepts partial audio on the rare refusal path rather than buffering the whole turn (which would forfeit the entire latency win) — a decision I'd want to explain and defend, not paper over.
- Barge-in now has to tear down two independent upstream connections instead of one — the LLM stream and the TTS websocket — inside a single structured-concurrency scope, so a client disconnect, an LLM error, or a websocket error all cancel both cleanly. Get this wrong and you leak a paid connection that keeps generating (and billing for) speech nobody's listening to.
- A stalled reader shouldn't be able to hold two paid upstreams open indefinitely just because the daily quota counts turns, not concurrent duration. That needs its own per-user concurrency lease and an absolute wall-clock timeout on top of the existing quota — a gap I wouldn't have thought to close if I hadn't asked "what happens if a client just... doesn't read the response?"
I sequenced this work with an explicit go/no-go step first: measure the STT floor, the real tool-call rate, and whether the desktop shell's MediaSource gap even matters anymore, before committing to the riskiest rewrite. If STT alone already eats most of the budget, or if most real turns still hit a tool, streaming the LLM barely moves the needle and isn't worth the concurrency-safety surface area it adds. That's currently parked as a documented, scoped follow-up rather than shipped — a decision I stand behind: 2.5-3s is a real, measured, acceptable number for a v1; the remaining ~1s of headroom is diminishing returns weighed against other priorities on a solo project.
The actual lesson¶
The satisfying version of this story would be "I streamed everything and hit 400ms." The true version is more useful: most of the latency was in the LLM, not the network; the cheapest fixes (shorter output, skip the avoidable round-trip) bought the most; and the last mile is gated behind a genuine concurrency-safety design problem, not just more engineering effort. Knowing when not to ship the riskiest optimization — because you measured that it's not yet worth its risk — is itself the skill.
This is Part 3 of a 4-part series. Part 1: Why I Built a Voice AI Coach. Part 2: Anatomy of a Live Voice Loop. Part 4: What Shipped, What Broke, What's Next.