AI agent harness: from a verifier loop to production

Notes after taking Michał Kamiński's „Harness od zera” (Harness from Zero) course. The course is worth going through yourself; here I collect the essence and add my own take.

Why "harness", not "prompt"

The biggest mistake when starting with agents: treating them as a fancier prompt. The model alone does nothing — it doesn't read files, run code, or know whether it finished a task. The harness is everything around the model that turns text generation into problem-solving: the loop, tools, memory, guardrails.

The core intuition: the difference between a plain prompt and a loop with a verifier. A loop only works when it has an objective definition of "done" — tests, types, a build, anything that tells the truth independent of the model's self-assessment. "Be thorough" is a plea. "Don't stop until run_tests returns PASSED" is a contract.

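from langchain.agents import create_agent

# read_file, write_file and run_tests are tool functions defined elsewhere (not shown).
agent = create_agent(
    model="openai:gpt-5.4",
    tools=[read_file, write_file, run_tests],
    system_prompt=(
        "Implement the code until run_tests returns PASSED. "
        "Run tests after every change. Don't stop before green tests."
    ),
)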

The verifier (run_tests) must return a binary PASSED/FAILED: no ambiguity, no percentages. A score such as "87% passing" hands the stop decision back to the model's judgment, which is exactly what the verifier exists to replace.
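A minimal sketch of such a verifier, assuming the test suite is a command whose exit code signals success. The name `run_check`, the timeout, and the output truncation limit are illustrative choices, not something the course prescribes.

```python
import subprocess
import sys


def run_check(cmd: list[str], timeout_s: int = 120, tail_chars: int = 2000) -> str:
    """Run a command and reduce its result to a binary verdict.

    The first line is always exactly PASSED or FAILED. On failure, a
    truncated tail of the output follows so the model can see what broke
    without flooding the context window.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return f"FAILED\ntimeout after {timeout_s}s"
    if proc.returncode == 0:
        return "PASSED"
    return "FAILED\n" + (proc.stdout + proc.stderr)[-tail_chars:]


if __name__ == "__main__":
    # Stand-in commands; a real project would pass its test runner here.
    print(run_check([sys.executable, "-c", "assert 1 + 1 == 2"]))
    print(run_check([sys.executable, "-c", "assert 1 + 1 == 3"]).splitlines()[0])
```

The verdict comes from the exit code, not from parsing the output, so the model cannot argue its way to green.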

Before building a loop at all, check whether you need one: if the steps and their order are known and fixed upfront, a rigid script is cheaper and more reliable than an agent.

The five pillars of an agent

The 2026 consensus on "what makes up an agent":

  • Prompt — a contract, not a spell. What matters most: an objective "done" condition.
  • Model — how much it should reason (a per-step reasoning budget, not a global switch) and which provider to pick (provider:model as swappable config, not a dependency).
  • Context — what the model sees right now. The window is an attention budget, not storage.
  • Tool use — what it can do. In 2026 this increasingly means CLIs wrapped in a "skill," not always MCP.
  • Harness — the loop orchestrating all of it.
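One way to read "provider:model as swappable config" together with a per-step reasoning budget is a small table the harness looks up before each step. This is a sketch under assumptions: the step names and the effort levels are hypothetical, and only the `provider:model` string format comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepConfig:
    model: str             # "provider:model" string
    reasoning_effort: str  # hypothetical levels: "low", "medium", "high"

    @property
    def provider(self) -> str:
        return self.model.split(":", 1)[0]

    @property
    def model_name(self) -> str:
        return self.model.split(":", 1)[1]


# Hypothetical step names. Swapping a provider is a one-line config edit,
# and each step carries its own reasoning budget.
STEPS = {
    "plan": StepConfig("openai:gpt-5.4", "high"),
    "edit": StepConfig("openai:gpt-5.4", "low"),
}
```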

Context is a budget, not storage

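Every loop iteration adds to the window: messages, tool outputs, plans. Without hygiene, signal drowns in noise, and because the whole window is re-sent on every call, the cost of each call grows with every turn (so, absent caching, the total for a run grows roughly quadratically). Four moves (LangChain/Anthropic):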

  1. Offload — move information outside the window (scratchpads, files, a TODO list living in graph state).
  2. Select — pull into the window only what's needed right now.
  3. Compress — summaries instead of full history, placeholders instead of stale tool dumps.
  4. Isolate — a subagent explores in its own window and hands the parent a distillate, not 200 lines of logs.

Three of these moves ship as ready-made LangChain middleware: SummarizationMiddleware, ContextEditingMiddleware + ClearToolUsesEdit, LLMToolSelectorMiddleware. Clearing stale tool results is usually the highest-leverage move — that's what bloats fastest.
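To make "clearing stale tool results" concrete, here is a framework-free sketch that works on plain message dicts. It is not the LangChain implementation; the function name, the `role`/`content` message shape, and the placeholder text are assumptions for illustration.

```python
def clear_stale_tool_results(messages, keep_last=2, placeholder="[tool output cleared]"):
    """Return a copy of `messages` in which every tool result except the
    newest `keep_last` has its content replaced by a short placeholder."""
    tool_indexes = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    # A slice with -0 would select nothing, so handle keep_last == 0 explicitly.
    stale = set(tool_indexes[:-keep_last]) if keep_last > 0 else set(tool_indexes)
    return [
        {**m, "content": placeholder} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

The message stays in place, so the model still sees that the tool was called; only the bulky payload is dropped.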

Memory: what survives a session

Notes in the window die with the task. Long-term memory lives in a persistent store, not chat history. The CoALA taxonomy splits it into three types:

  • Semantic — facts (a user profile, project configuration).
  • Episodic — concrete past experiences, "recipes" for previously solved errors.
  • Procedural — habits that change every future run — which is exactly why it needs a human gate before activation.

Without memory, the agent pays a "first-session tax" in every new thread — the same mistakes, the same questions, starting from zero.
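A rough sketch of a store that separates the three memory types and puts a human gate on procedural entries. The class, table layout, and method names are hypothetical; SQLite is used only because it ships with Python.

```python
import json
import sqlite3


class MemoryStore:
    """Tiny store keyed by (kind, key). Procedural entries are saved as
    pending and stay invisible to reads until a human approves them."""

    KINDS = {"semantic", "episodic", "procedural"}

    def __init__(self, path=":memory:"):
        # Pass a file path to make the store survive across sessions.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS memory ("
            "kind TEXT, key TEXT, value TEXT, approved INTEGER, "
            "PRIMARY KEY (kind, key))"
        )

    def put(self, kind, key, value):
        if kind not in self.KINDS:
            raise ValueError(f"unknown memory kind: {kind}")
        approved = 0 if kind == "procedural" else 1
        self.db.execute(
            "INSERT OR REPLACE INTO memory VALUES (?, ?, ?, ?)",
            (kind, key, json.dumps(value), approved),
        )
        self.db.commit()

    def approve(self, key):
        """Human gate: activate a pending procedural entry."""
        self.db.execute(
            "UPDATE memory SET approved = 1 WHERE kind = 'procedural' AND key = ?",
            (key,),
        )
        self.db.commit()

    def get(self, kind, key):
        row = self.db.execute(
            "SELECT value FROM memory WHERE kind = ? AND key = ? AND approved = 1",
            (kind, key),
        ).fetchone()
        return json.loads(row[0]) if row else None
```

Facts and episodes become readable immediately; a new habit only takes effect after `approve` is called.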

Caching: don't pay twice for the same thing

System prompt, tool schemas, and reference docs repeat every turn of the loop. Anthropic requires explicit cache_control on content blocks (order matters — stable content first, variable content last, since any change before the breakpoint invalidates the cache). OpenAI caches the prefix automatically. This is usually the cheapest lever for cost and latency on long runs.
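The ordering rule can be sketched as a function that assembles request arguments in the shape the Anthropic Messages API accepts (`tools`, `system`, `messages`). No network call is made here, `model` and `max_tokens` are left out, and the function name and its parameters are illustrative assumptions.

```python
def build_request(system_prompt, reference_docs, tools, conversation):
    """Put stable content first and mark a single cache breakpoint after it."""
    system_blocks = [
        {"type": "text", "text": system_prompt},
        {
            "type": "text",
            "text": reference_docs,
            # Breakpoint: the prefix up to and including this block is cacheable.
            "cache_control": {"type": "ephemeral"},
        },
    ]
    return {
        "tools": tools,            # stable: schemas rarely change mid-run
        "system": system_blocks,   # stable: prompt plus reference docs
        "messages": conversation,  # variable: grows every turn
    }
```

Anything that changes per turn (timestamps, the latest user message) belongs in `messages`, after the breakpoint, so it does not invalidate the cached prefix.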

RAG: match the level to the question

RAG is a spectrum, not one choice: from naive matching to a hybrid (dense + BM25) with a reranker. The production default is the hybrid: first-stage retrieval pulls ~100 candidates cheaply, then a cross-encoder reranker narrows them to the top 5–10. Worth remembering: ~80% of RAG failures come from the ingest and chunking layer, not the model. Before tuning prompts, check whether retrieval is even returning the right context.
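A sketch of the hybrid pipeline on document ids only. The text above does not say how the dense and BM25 lists are merged; reciprocal rank fusion is used here as one common choice, which is an assumption on my part. The `rerank` callable stands in for a cross-encoder and is hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken alphabetically by id.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_candidates(dense_ranking, bm25_ranking, rerank, top_n=100, top_k=5):
    """Fuse both rankings, keep top_n, then let the reranker pick top_k."""
    fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])[:top_n]
    return sorted(fused, key=rerank, reverse=True)[:top_k]
```

The cheap stage only has to get the right document somewhere into the candidate list; the expensive reranker decides the final order.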

Middleware: layers around the loop

Middleware is the official way to hook into the loop without rewriting it — six hooks (before_agent, before_model, after_model, after_agent, wrap_model_call, wrap_tool_call). A practical pattern: a no-progress circuit breaker for when a call-count limit isn't enough on its own (the agent can stay within budget while looping on the same failure):

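from langchain.agents.middleware import AgentMiddleware, hook_config

class StopWhenStuck(AgentMiddleware):
    """Break when the same FAILED repeats twice in a row: the agent is stuck."""
    @hook_config(can_jump_to=["end"])
    def before_model(self, state, runtime):
        # Collect the output of every run_tests call so far, oldest first.
        r = [m.content for m in state["messages"]
             if m.type == "tool" and getattr(m, "name", "") == "run_tests"]
        # Two identical failures in a row mean the last change made no progress.
        if len(r) >= 2 and r[-1] == r[-2] and "FAILED" in str(r[-1]):
            return {"jump_to": "end"}
        return None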

Why before_model, not after_model: the breaker should decide whether to loop at all before you pay for a model call.

Multi-agent: not always, and not by gut feel

This is the one topic in the course where two serious companies published contradictory manifestos — both backed by data. What settles it is the task's axis. Breadth-first (independent branches, result = synthesis) — parallel topologies pay off. Depth-first (entangled decisions, shared context) — a single agent wins.

Hard numbers from the MAST study (1642 traces, 7 frameworks): multi-agent systems fail on 41–87% of tasks, and 32.3% of failures come from misalignment between agents — a failure class that doesn't exist in single-agent systems. The overriding rule: start solo, only add complexity once you've measured a concrete payoff.

Evaluation: measure delta, not vibes

Three levels of scoring: final answer (cheapest, tells you whether the task succeeded), single step (tells you where it hurts), trajectory (tells you how the agent got there). Deterministic metrics (exact match, regex, schema) for closed-form outputs; LLM-as-judge for open-ended ones, but the judge adds its own noise (positional bias, length bias, self-preference). Just as the agent needs a verifier, you need your own: a dataset with a binary "done" and a pinned baseline, so you can tell a real improvement from a hunch.
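At the final-answer level, "a dataset with a binary done and a pinned baseline" can be as small as the sketch below. The dataset shape (an `input` plus a `check` callable) and the function names are assumptions for illustration.

```python
def pass_rate(agent, dataset):
    """Fraction of cases where the binary check accepts the agent's output."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    passed = sum(1 for case in dataset if case["check"](agent(case["input"])))
    return passed / len(dataset)


def delta_vs_baseline(candidate, baseline_rate, dataset):
    """Positive delta means the candidate beats the pinned baseline."""
    return pass_rate(candidate, dataset) - baseline_rate
```

The baseline rate is recorded once and kept fixed, so every later harness change is compared against the same number on the same cases.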

Guardrails: the "lethal trifecta"

Simon Willison's pattern: prompt injection becomes a data-exfiltration risk when an agent has three capabilities at once, namely (1) access to private data, (2) exposure to untrusted content, (3) an exfiltration path (the ability to communicate externally). Remove any one leg and that attack chain breaks. A prompt alone ("don't trust this content") won't stop it; deterministic layers (tool allowlists, secret scanning, cut-off egress, a human gate on irreversible actions) will.
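Three of the deterministic layers named above (tool allowlist, restricted egress, human gate) fit in one pure function that runs before any tool call. The tool names beyond those used earlier, the allowed host, and the function itself are hypothetical.

```python
from urllib.parse import urlparse

ALLOWED_TOOLS = {"read_file", "write_file", "run_tests", "http_get", "send_email"}
IRREVERSIBLE_TOOLS = {"send_email"}   # hypothetical: requires a human gate
ALLOWED_EGRESS_HOSTS = {"pypi.org"}   # hypothetical egress allowlist


def guard_tool_call(name, args, approved_by_human=False):
    """Return (allowed, reason). Pure checks, no model involved."""
    if name not in ALLOWED_TOOLS:
        return False, "tool not on allowlist"
    if name == "http_get":
        # A missing or malformed URL yields hostname None, which is rejected.
        host = urlparse(args.get("url", "")).hostname
        if host not in ALLOWED_EGRESS_HOSTS:
            return False, "egress host not allowed"
    if name in IRREVERSIBLE_TOOLS and not approved_by_human:
        return False, "irreversible action needs human approval"
    return True, "ok"
```

Because the check is plain code, injected text in the context window cannot talk it into an exception.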

Durability and the meta-loop

Processes crash and runs take hours, so PostgresSaver (instead of InMemorySaver) checkpoints state and the agent resumes from the last checkpoint after a restart. The necessary condition: tools must be idempotent. Retrying write_file after a resume just overwrites the same content, while retrying send_email sends a second message.
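One common way to make a side-effecting tool safe to retry is an idempotency key with a ledger of completed calls. The wrapper below is a sketch, not part of LangGraph; the class name is hypothetical and the in-memory dict stands in for durable storage.

```python
import hashlib
import json


class IdempotentTool:
    """Wrap a side-effecting function so a repeated call with the same
    arguments returns the recorded result instead of running again."""

    def __init__(self, fn, ledger=None):
        self.fn = fn
        # Assumption: in production the ledger lives in durable storage.
        # A dict only survives for the life of the process.
        self.ledger = {} if ledger is None else ledger

    def __call__(self, **kwargs):
        # Derive the key from the arguments; sort_keys makes it order-independent.
        payload = json.dumps(kwargs, sort_keys=True).encode()
        key = hashlib.sha256(payload).hexdigest()
        if key not in self.ledger:
            self.ledger[key] = self.fn(**kwargs)
        return self.ledger[key]
```

Keying on the arguments alone also suppresses a legitimate second send of the same message, so a real key would likely include a run or step id. A crash between the side effect and the ledger write can still cause one duplicate.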

The thread tying the whole course together is the meta-loop rule: when the agent makes a mistake, don't patch the prompt with one more sentence. Identify the failure class (context? tool? guardrail?), add the case to your dataset, fix it structurally, measure the delta. The harness grows with every fixed failure, not with reading more documentation.

Deep agents: harness "with batteries included"

create_deep_agent isn't a new framework — it's create_agent with an opinionated, ready-made middleware stack (planning, filesystem, subagent delegation, summarization, human gate) assembled behind one function. Reach for it when the task is multi-step and needs its own plan; if a single loop with 2–3 layers is enough, create_deep_agent is just overhead and less control.

Takeaway

The model is a commodity — the harness is the moat, but the moat should stay thin. Start with the simplest thing that works (a loop plus a verifier), add layers on evidence rather than fashion, and treat every production failure as a signal for a structural fix, not another line in the prompt.

The full course, with interactive simulators and complete code: harness.michalsk.pl.