Structured output: why I stopped parsing [SUGGEST] tags out of free text
A real bug from the Wattlog coach agent, and the fix: Pydantic-typed model output instead of regex over prose.
A FastAPI endpoint expects predictable JSON. An LLM, left to its own devices, produces an essay. The gap between those two things is where a lot of production AI bugs live — and Wattlog's coach agent had a textbook example of it.
The bug
The text coach can propose a workout mid-conversation: "want to try this FTP session tomorrow?" The first version of that feature worked by having the model emit a marker in its reply — something like [SUGGEST]workout_id=42[/SUGGEST] — embedded inside otherwise normal prose, and a regex on the server side pulled the marker out to trigger the "would you like to start this workout?" UI flow.
It worked, most of the time. It also failed in exactly the ways you'd expect from parsing structure out of free text generated by a model that has no contract to follow:
- The model occasionally reformatted the tag ([Suggest], [suggest], extra whitespace inside the brackets): cosmetic to a human reader, invisible to a strict regex.
- Sometimes it described a suggestion in prose without emitting the tag at all. The intent was there, the structure wasn't, and the UI flow silently never triggered.
- Adding a second structured thing to the reply (say, a plan adjustment alongside the suggestion) meant inventing a second bespoke tag syntax and writing a second bespoke parser.
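To make those failure modes concrete, here is a minimal sketch of that kind of strict marker regex. The pattern, the `extract_workout_id` helper, and the sample replies are all hypothetical reconstructions for illustration; the original server-side code isn't shown in this post.

```python
import re
from typing import Optional

# Hypothetical strict pattern: exact casing, no whitespace tolerance.
SUGGEST_RE = re.compile(r"\[SUGGEST\]workout_id=(\d+)\[/SUGGEST\]")


def extract_workout_id(reply: str) -> Optional[int]:
    """Return the suggested workout id, or None if the marker is not found."""
    match = SUGGEST_RE.search(reply)
    return int(match.group(1)) if match else None


# Made-up replies covering the drift cases from the list above.
replies = {
    "exact": "Try this tomorrow. [SUGGEST]workout_id=42[/SUGGEST]",
    "recased": "Try this tomorrow. [Suggest]workout_id=42[/Suggest]",
    "spaced": "Try this tomorrow. [ SUGGEST ]workout_id=42[ /SUGGEST ]",
    "prose_only": "I'd suggest workout 42 tomorrow, it targets your FTP.",
}

for name, reply in replies.items():
    print(name, extract_workout_id(reply))
```

Only the "exact" reply yields an id. The other three return None, which is indistinguishable from "the model made no suggestion", so nothing downstream ever knows a suggestion was lost.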
None of this is a "the model is bad" problem. It's a "we asked for structure informally and are surprised it's informal" problem.
The fix: make structure a contract, not a convention
The shape that actually holds up is: define what a valid response looks like as a schema, and force the model's output to conform to it — rather than asking nicely in the prompt and hoping the regex on the other end doesn't drift.
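from typing import Literal

from pydantic import BaseModel


# Defined first so CoachReply can reference it in its annotation.
class WorkoutSuggestion(BaseModel):
    workout_id: int
    reason: str
    action: Literal["propose"] = "propose"


class CoachReply(BaseModel):
    message: str  # free-text reply shown to the user
    suggestion: WorkoutSuggestion | None = None  # None means "no suggestion"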
With a library like Instructor, which validates the response against the model and can re-ask on failure, or a provider's native structured-output / tool-calling mode, which can constrain the output to a schema, the result has to validate against CoachReply rather than being text that probably contains a recognizable pattern. If validation fails, that's a typed, catchable exception at the API boundary, not a silently-missed regex on a Friday afternoon.
The practical difference: suggestion is either a well-formed WorkoutSuggestion or None. There is no third state where the model "meant to suggest something" but the parser didn't catch it. The ambiguity that caused the original bug doesn't have anywhere to hide anymore.
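As a rough sketch of what "a typed, catchable exception at the API boundary" can look like, the snippet below validates raw JSON against the schema. It assumes Pydantic v2 (`model_validate_json`); the `parse_coach_reply` helper and the sample payloads are hypothetical, and the models are repeated only so the block runs on its own.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ValidationError


class WorkoutSuggestion(BaseModel):
    workout_id: int
    reason: str
    action: Literal["propose"] = "propose"


class CoachReply(BaseModel):
    message: str
    suggestion: Optional[WorkoutSuggestion] = None


def parse_coach_reply(raw_json: str) -> Optional[CoachReply]:
    """Validate raw model output; return None if it breaks the contract."""
    try:
        return CoachReply.model_validate_json(raw_json)
    except ValidationError as exc:
        # A real handler might log exc.errors() and retry or fall back.
        print(f"rejected: {len(exc.errors())} validation error(s)")
        return None


# Made-up payloads: a valid suggestion, a plain reply, and a malformed one.
good = parse_coach_reply(
    '{"message": "Want to try this tomorrow?",'
    ' "suggestion": {"workout_id": 42, "reason": "FTP focus"}}'
)
plain = parse_coach_reply('{"message": "Nice ride today."}')
bad = parse_coach_reply(
    '{"message": "Try this", "suggestion": {"workout_id": "soon"}}'
)
```

Here `good.suggestion` is a WorkoutSuggestion, `plain.suggestion` is None, and `bad` is rejected outright. The malformed case is visible as an error instead of quietly degrading into "no suggestion".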
Where this generalizes
The same shape applies anywhere an LLM's output feeds a downstream system that isn't itself a human reading prose:
- A tool-calling agent's decision about which tool to call and with what arguments. This is structured output whether or not you think of it that way: native tool calling typically takes a JSON Schema for the arguments, which is the same kind of contract a Pydantic model describes.
- Classification tasks (intent detection, sentiment, routing). The tempting shortcut is "ask for one word and match a string," which has the exact same drift problem as the [SUGGEST] tag.
- Any reply that has to satisfy both a human (readable prose) and a machine (a specific field) at once, the way CoachReply.message and CoachReply.suggestion do here. Don't try to encode the machine part inside the human part.
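For the classification case, one possible sketch is to treat the label as a member of a closed set rather than a string to match loosely. The `Intent` labels and `parse_intent` helper below are hypothetical, not taken from Wattlog. This standard-library version only shows the closed set failing loudly; in practice the same Enum could be used as a field type on a Pydantic model, so the schema itself carries the allowed values.

```python
from enum import Enum


class Intent(str, Enum):
    # Hypothetical labels for illustration only.
    START_WORKOUT = "start_workout"
    ADJUST_PLAN = "adjust_plan"
    SMALL_TALK = "small_talk"


def parse_intent(raw: str) -> Intent:
    """Map a model-produced label onto the closed set, or raise ValueError."""
    return Intent(raw)


print(parse_intent("adjust_plan"))

try:
    # A drifted label ("one word" turned into a short phrase) is rejected
    # instead of falling through to some default branch.
    parse_intent("Adjust plan.")
except ValueError as exc:
    print(f"rejected: {exc}")
```

The design choice is the same as with CoachReply: a value outside the contract becomes an error the caller has to handle, not a silent miss.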
The tradeoff worth naming
Structured output isn't free. Constraining generation can cost a small amount of quality on the free-text parts of a reply, and it adds a schema to maintain — every new field is a compatibility question for every existing caller. The trade is worth it specifically when a broken parse has a real downstream cost (in this case: a coaching feature that silently doesn't fire), and not worth the ceremony for a reply nobody but a human ever reads. The CoachReply.message field stays free text on purpose — only the part something else has to act on gets a schema.