June 6, 2026 · 12 min read

Why Most AI Agents Break in Production — And What Reliable Agents Actually Look Like

The failure modes that separate demo-grade agents from production-grade ones — and what production-grade agents actually require beneath the surface.

Most AI agents do not fail because the model failed. They fail because the surrounding system was never built to operate one.

A team builds an agent that handles a handful of tasks impressively. Leadership leans in. Scope expands. Six months later the agent is in production, the team is exhausted, and no one can confidently say whether the system is getting better or worse — only that it is more expensive than anyone projected.

The instinct at this point is to blame the model. Teams swap to whichever frontier model launched last month, turn on extended reasoning, or try a different framework. Accuracy moves by a few percentage points. Reliability does not.

Production-grade agents are not the same artifact as demo-grade agents. They are a different engineering discipline, and most teams have not built the surrounding infrastructure that discipline requires.

What follows is a practical view of why these systems fail in production — and what production-grade agents actually require beneath the surface.


1. The definition problem comes first

The word "agent" now describes everything from a single LLM call with a tool to a multi-step planner with memory, sub-agents, and an open-ended task surface. These systems do not have the same failure modes, do not require the same controls, and should not be evaluated the same way.

Before talking about reliability, a team has to be explicit about which kind of system it is actually shipping: a workflow, a bounded agent, or an open agent.

Treating an open agent as if it were a workflow leads to over-engineering. Treating a workflow as if it were an open agent leads to a system that hallucinates its way through structured tasks. The category matters.


2. Reliability is a long-tail problem

Demo evaluations cover a handful of representative tasks. Production traffic does not look like a handful of representative tasks. It looks like a long tail of unanticipated phrasings, edge cases, partial information, and tasks the agent was never designed to handle but is being asked to handle anyway.

Three failure modes dominate in real production traffic:

None of these are model failures in any pure sense. They are system failures that emerge from the interaction between the model, the tools, and the absence of bounds.


3. Observability is not optional

A traditional service is debuggable because it has logs, stack traces, and deterministic inputs. An agent has none of those by default. When something goes wrong, the team has a final response, a vague sense that the agent "got confused," and very little else.

The minimum useful trace per agent invocation:

Teams that ship agents without this are debugging by re-running prompts and hoping for the same failure. Teams that have it can isolate whether a failure was a planning problem, a tool problem, a context problem, or a budget problem — and fix the right thing.
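
As a rough illustration of what a per-invocation trace could look like, here is a minimal Python sketch. The class names, field names, and step labels are assumptions made for this example, not a prescribed schema. The point is only that every plan step and tool call is recorded with its input, output, error, token count, and latency, so a past invocation can be reconstructed later.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class Step:
    kind: str                     # hypothetical labels: "plan", "tool_call", "response"
    name: str                     # tool name, or a short label for the step
    input: Any
    output: Any = None
    error: Optional[str] = None   # None means the step succeeded
    tokens: int = 0
    latency_ms: float = 0.0


@dataclass
class InvocationTrace:
    agent_type: str
    user_input: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)
    steps: list[Step] = field(default_factory=list)

    def record(self, step: Step) -> None:
        self.steps.append(step)

    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.steps)

    def failed_steps(self) -> list[Step]:
        return [s for s in self.steps if s.error is not None]

    def to_json(self) -> str:
        # One JSON object per invocation, suitable for appending to a log file.
        return json.dumps(asdict(self), default=str)


trace = InvocationTrace(agent_type="support", user_input="Where is my order?")
trace.record(Step(kind="plan", name="decide_next_action",
                  input="Where is my order?", output="call lookup_order", tokens=120))
trace.record(Step(kind="tool_call", name="lookup_order",
                  input={"order_id": "A-17"}, error="timeout", latency_ms=5000.0))
```

With a record like this, `trace.failed_steps()` points at the tool call that timed out rather than at the final response, which is the difference between fixing the right thing and re-running prompts.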


4. Tool design is where most failures actually live

The single most underweighted skill in agent engineering is tool design. Teams spend weeks tuning prompts and switching frameworks while the tools themselves return ambiguous errors, take parameters the model misuses consistently, or expose surface area the model has no reason to navigate correctly.

Production-quality tools look different from prototype-quality tools:

In most engagements I have seen, a focused pass on tool design moves reliability more than swapping the model does.
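
To make the contrast concrete, here is a small Python sketch of a tool written for a model rather than for a human caller. Everything in it is hypothetical: the `lookup_order` tool, the order ID format, and the in-memory data exist only for illustration. What it shows is the shape of the result: parameters are validated up front, and every failure comes back with a stable error code and a hint the model can act on, instead of an ambiguous exception string.

```python
import re
from typing import Optional, TypedDict


class ToolResult(TypedDict):
    ok: bool
    data: Optional[dict]
    error_code: Optional[str]
    message: Optional[str]
    hint: Optional[str]


# Hypothetical ID format and data store, for illustration only.
ORDER_ID_PATTERN = re.compile(r"[A-Z]-\d+")
_FAKE_ORDERS = {"A-17": {"status": "shipped"}}


def _ok(data: dict) -> ToolResult:
    return {"ok": True, "data": data, "error_code": None, "message": None, "hint": None}


def _err(code: str, message: str, hint: str) -> ToolResult:
    return {"ok": False, "data": None, "error_code": code, "message": message, "hint": hint}


def lookup_order(order_id: str) -> ToolResult:
    """Look up one order by ID and return a structured result."""
    if not isinstance(order_id, str) or not ORDER_ID_PATTERN.fullmatch(order_id):
        return _err(
            "invalid_order_id",
            f"order_id {order_id!r} is not in the expected format.",
            "Use the form 'A-17': one uppercase letter, a hyphen, then digits.",
        )
    order = _FAKE_ORDERS.get(order_id)
    if order is None:
        return _err(
            "order_not_found",
            f"No order exists with ID {order_id!r}.",
            "Ask the user to confirm the order ID instead of guessing another one.",
        )
    return _ok(order)
```

The design choice worth noting is that the tool never raises on bad input. A malformed ID and a missing order are different situations with different next steps, and the result tells the model which one it is in.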


5. Cost is a sleeper failure mode

Agents do not fail loudly when they cost too much. They run, they return a result, and the bill arrives at the end of the month.

The compounding factors are well understood but rarely controlled in practice:

The controls are not exotic. Token budgets per invocation. Recursion depth limits. Sub-agent counts. Cost telemetry per agent type so the team can see which agents are economical and which are quietly burning money. Most teams add these only after the first uncomfortable bill.
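
A minimal sketch of those controls in Python might look like the following. The class names, limits, and the flat per-token price are assumptions for illustration; real pricing usually differs between input and output tokens. The sketch shows a per-invocation budget that refuses work once a limit is hit, plus a small tally of cost by agent type.

```python
from collections import defaultdict
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when an invocation tries to exceed one of its limits."""


@dataclass
class Budget:
    max_tokens: int
    max_depth: int
    max_sub_agents: int
    tokens_used: int = 0
    sub_agents_spawned: int = 0

    def charge(self, tokens: int) -> None:
        # Check before committing, so a rejected call does not consume budget.
        if self.tokens_used + tokens > self.max_tokens:
            raise BudgetExceeded(f"token budget of {self.max_tokens} would be exceeded")
        self.tokens_used += tokens

    def check_depth(self, depth: int) -> None:
        if depth > self.max_depth:
            raise BudgetExceeded(f"recursion depth {depth} exceeds limit {self.max_depth}")

    def spawn_sub_agent(self) -> None:
        if self.sub_agents_spawned >= self.max_sub_agents:
            raise BudgetExceeded(f"sub-agent limit of {self.max_sub_agents} reached")
        self.sub_agents_spawned += 1


class CostTelemetry:
    """Accumulates token usage per agent type and converts it to cost."""

    def __init__(self, usd_per_1k_tokens: float) -> None:
        self.usd_per_1k_tokens = usd_per_1k_tokens  # assumed flat rate
        self._tokens: dict[str, int] = defaultdict(int)

    def record(self, agent_type: str, tokens: int) -> None:
        self._tokens[agent_type] += tokens

    def cost_by_agent_type(self) -> dict[str, float]:
        return {
            agent: tokens / 1000 * self.usd_per_1k_tokens
            for agent, tokens in self._tokens.items()
        }
```

In use, the agent loop would call `budget.charge(...)` before each model call and `budget.spawn_sub_agent()` before each delegation, and treat `BudgetExceeded` as a normal, reportable outcome rather than a crash.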


6. Authorization scope is the new attack surface

A traditional application has credentials scoped to the action a specific endpoint needs to perform. An agent has credentials scoped to the entire surface its tools can reach — which is often broader than the system designers consciously chose.

The questions worth asking before an agent is deployed:

Most enterprises think about agent security after the first incident. The teams that do it before tend to scope tool permissions narrowly, require confirmation for irreversible actions, and design the agent so that a successful injection cannot reach consequential operations without crossing an explicit human gate.
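
One way to express that pattern in code is a gateway that sits between the agent and its tools. The Python sketch below is an assumption-laden illustration, not a security product: `ToolGateway`, the `irreversible` flag, and the `confirm` callback are names invented here. It shows two checks on every call: the tool must be on this agent's allowlist, and any irreversible tool must pass an explicit confirmation step that lives outside the model.

```python
from dataclasses import dataclass
from typing import Any, Callable


class PermissionDenied(Exception):
    """The agent asked for a tool outside its allowlist."""


class ConfirmationRequired(Exception):
    """An irreversible action was not approved by the human gate."""


@dataclass(frozen=True)
class Tool:
    name: str
    func: Callable[..., Any]
    irreversible: bool = False


class ToolGateway:
    def __init__(self, allowed: set[str], confirm: Callable[[str, dict], bool]) -> None:
        self._tools: dict[str, Tool] = {}
        self._allowed = set(allowed)
        self._confirm = confirm  # hypothetical hook, e.g. a ticket or approval UI

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, **kwargs: Any) -> Any:
        # Deny by default: unknown and non-allowlisted tools are treated the same.
        if name not in self._allowed or name not in self._tools:
            raise PermissionDenied(name)
        tool = self._tools[name]
        if tool.irreversible and not self._confirm(name, kwargs):
            raise ConfirmationRequired(name)
        return tool.func(**kwargs)


# Example wiring: reads pass through, refunds need approval, deletes are out of scope.
gateway = ToolGateway(allowed={"read_ticket", "refund"}, confirm=lambda name, args: False)
gateway.register(Tool("read_ticket", lambda ticket_id: {"id": ticket_id}))
gateway.register(Tool("refund", lambda amount: f"refunded {amount}", irreversible=True))
gateway.register(Tool("delete_account", lambda user: None, irreversible=True))
```

Because the confirmation decision is made by the callback and not by the model, a successful prompt injection can ask for a refund but cannot approve one.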


7. Evaluation has to measure process, not just outcomes

Evaluating an agent on task completion alone is misleading. Two agents can hit the same completion rate while differing meaningfully in cost, latency, tool-call accuracy, and the proportion of completions that involved a correctable error along the way.

A defensible eval covers at least:

Teams without process-level eval ship agents whose behavior drifts unobserved as prompts and models change. They cannot tell whether a "better" agent is genuinely better or only better at the small sample they manually reviewed.
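
A small Python sketch of process-level scoring is below. The record fields and metric names are assumptions chosen to mirror the dimensions named above (completion, cost, latency, tool-call accuracy, and completions that involved a corrected error); a real suite would define them against its own traces.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunRecord:
    completed: bool
    cost_usd: float
    latency_s: float
    tool_calls: int
    correct_tool_calls: int
    recovered_from_error: bool  # completed, but only after correcting a mistake


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate outcome and process metrics over a set of eval runs."""
    if not runs:
        raise ValueError("need at least one run")
    completed = [r for r in runs if r.completed]
    total_calls = sum(r.tool_calls for r in runs)
    correct_calls = sum(r.correct_tool_calls for r in runs)
    return {
        "completion_rate": len(completed) / len(runs),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
        "mean_latency_s": mean(r.latency_s for r in runs),
        # With no tool calls at all there is nothing to get wrong.
        "tool_call_accuracy": correct_calls / total_calls if total_calls else 1.0,
        "recovered_share": (
            sum(r.recovered_from_error for r in completed) / len(completed)
            if completed else 0.0
        ),
    }


# Two hypothetical agents with identical completion rates.
agent_a = [RunRecord(True, 0.02, 1.0, 2, 2, False), RunRecord(True, 0.02, 1.0, 2, 2, False)]
agent_b = [RunRecord(True, 0.10, 4.0, 6, 3, True), RunRecord(True, 0.10, 4.0, 6, 3, True)]
```

Both agents complete every task in this made-up data, so a completion-only eval cannot tell them apart. `summarize` shows that the second one costs more, takes longer, gets half its tool calls wrong, and only finishes by recovering from its own errors.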


8. The ownership gap

An agent's behavior is decided by the system prompt, the tool surface, the retrieval layer, the model choice, the eval suite, and the production telemetry. In most organizations, each of those is owned by a different team — and the user-visible behavior of the agent is owned by no one specifically.

The predictable failure mode is that each team optimizes the surface it controls, no one is accountable for end-to-end quality, and the agent stays at a reliability ceiling everyone can describe but no one can move.

This is not specific to agents. The same pattern shows up in production RAG systems and in AI security programs, for the same structural reason: a system spanning multiple disciplines without explicit end-to-end ownership accumulates local optimizations without moving the user-visible whole.

For agents specifically, a named owner with authority across prompt, tools, retrieval, and eval is the precondition for everything else in this list. Without it, the engineering work happens; the system does not improve.


What production-grade agents actually look like

Production-grade agents have less in common with their demo predecessors than teams expect.

They are explicitly categorized as workflow, bounded, or open — and operated accordingly. They have observability that lets a human reconstruct any past invocation. Their tools are designed for the model, not retrofitted from a human-facing API. They have token budgets, recursion limits, and cost telemetry by agent type. Their authorization scope is narrow, and consequential actions cross human gates. Their evaluation measures process and cost, not just task completion. And they have a named owner accountable for the user-visible behavior of the system as a whole.

None of this depends on a smarter model. It depends on whether the team is building a system or only demonstrating a capability — and most agents in production were shipped before that distinction was made.

The teams still chasing model swaps are chasing the wrong variable. The reliability ceiling in production agents is set by tool design, observability, budget enforcement, eval discipline, and ownership — not by which frontier model they happen to be using this month.

Reliable agents are mostly a matter of building the system around the model — not the model itself.

For the security-specific framing on agentic systems: How AI Systems Actually Fail — And How to Test Them for Security.

If your agents work in evaluation but break in production, or if cost and reliability are no longer predictable as you scale, I work with teams to diagnose where the system actually fails and build the surrounding infrastructure that production-grade agents require. Get in touch.
