Observability for an LLM pipeline: what to log when "it just answered wrong" isn't enough¶

Per-turn cost lines, stage timing, and why "the coach said something odd" needs a trace, not a hunch.

A rider asks the Wattlog voice coach "how much longer on this?" and gets back a number that's slightly off. Was that a stale snapshot? A tool call that hit the wrong session? The LLM misreading the prompt? Without instrumentation, the honest answer is "no idea" — and that's not acceptable for a pipeline with four hops and real money attached to every turn.

Log the pipeline, not just the outcome¶

The latency chase for the voice coach only found its real bottleneck (LLM completion time, not the network) because every stage was instrumented before anything was optimized: STT duration, LLM completion time, TTS-first-chunk time, end-to-end. That same instrumentation, kept running in production rather than torn out after the optimization pass, is what observability actually is here — not a dashboard bolted on afterward, but the measurement that made the original fix possible, still running.

The minimum viable trace for one voice or chat turn:

turn_id, user_id, timestamp
  stt_duration_ms
  llm_completion_ms
  tool_calls: [ {name, duration_ms, result_size} ]
  tts_first_chunk_ms
  total_latency_ms
  tokens_in, tokens_out
  cost_usd

Every field here answers a question someone will actually ask: "why was this turn slow" (per-stage timing), "why did this cost more than usual" (tokens + tool calls), "did the snapshot optimization actually work" (tool_calls count — Wattlog's inline-snapshot fix was validated by watching this exact field drop to zero on the common path in production logs, not by trusting the design doc).

Cost is a first-class trace field, not a separate system¶

Wattlog logs a cost line item (cost_log) per voice turn — STT, LLM, and TTS spend, individually, per turn — specifically so a monthly bill is a sum of explainable rows, not a number you have to reverse-engineer after the fact. This matters more on a project with a hard per-user quota (check_and_increment_quota) than it would on an unlimited-budget system: if quota enforcement and cost logging live in different places, they drift, and you find out during a billing dispute instead of during code review.

The same instinct applies to barge-in: when a rider interrupts the coach mid-sentence, the trace has to show whether the upstream TTS stream actually closed, not just whether the client stopped playing audio. A UX-only view of "did the interruption work" misses the billing-correctness question entirely — the trace needs to log the teardown of the upstream connection as its own event, not infer it from silence.

Tracing at the tool-call boundary¶

The most useful single trace point in an agentic pipeline is around each tool call: name, arguments (redacted where sensitive — a live-session tool that resolves state server-side from an authenticated user shouldn't log the raw JWT, only the resolved athlete_id), duration, and result size. This is where most "the agent did something weird" bugs actually live — not in the model's language generation, but in a tool call that returned the wrong data, took an unexpected extra round-trip, or silently returned empty when the caller expected a snapshot.

Without this, debugging an agent means re-reading a wall of natural-language output and guessing which sentence corresponds to which internal decision. With it, "why did this answer take 6 seconds" is a SELECT against the trace table, not a re-read of the transcript.

What NOT to do¶

Don't build a bespoke telemetry format when the fields above map cleanly onto existing tracing tools (OpenTelemetry spans, or an LLM-specific layer like Langfuse/Helicone) — reinventing trace storage is wasted effort next to reinventing the actual coaching logic. And don't log full prompt/response pairs by default in a system that touches personal training and health data (HR, power output) without a retention and access policy — a trace table is still a database with the same obligations as any other one holding user data; treat it that way from the first row you write, not after the fact.