Most AI agents do not fail because the model failed. They fail because the surrounding system was never built to operate one.
A team builds an agent that handles a handful of tasks impressively. Leadership leans in. Scope expands. Six months later the agent is in production, the team is exhausted, and no one can confidently say whether the system is getting better or worse — only that it is more expensive than anyone projected.
The instinct at this point is to blame the model. Teams swap to whichever frontier model launched last month, turn on a reasoning mode, or try a different framework. Accuracy moves by a few percentage points. Reliability does not.
Production-grade agents are not the same artifact as demo-grade agents. They are a different engineering discipline, and most teams have not built the surrounding infrastructure that discipline requires.
What follows is a practical view of why these systems fail in production — and what production-grade agents actually require beneath the surface.
1. The definition problem comes first
The word "agent" now describes everything from a single LLM call with a tool to a multi-step planner with memory, sub-agents, and an open-ended task surface. These systems do not have the same failure modes, do not require the same controls, and should not be evaluated the same way.
Before talking about reliability, a team has to be explicit about the system it is actually shipping:
- Workflow: a deterministic path through a small number of tools, with the model choosing branches but not the control flow. Most "agents" in production today are this.
- Bounded agent: the model controls the loop, but the tool surface, recursion depth, and budget are bounded. This is where most reliable enterprise agents live.
- Open agent: the model controls the loop and the planning. Genuinely useful but operationally hard. Rare in production for good reason.
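Treating an open agent as if it were a workflow leads to over-engineering: the team hard-codes branches for a task surface that cannot be enumerated in advance. Treating a workflow as if it were an open agent hands the model control it does not need, and leads to a system that hallucinates its way through structured tasks. The category matters.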
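One way to make the category explicit is to declare it next to the bounds it implies and check the two against each other. The sketch below is illustrative only: `AgentProfile`, `check_profile`, and every field name are hypothetical, and the checks encode nothing more than the three definitions above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class AgentKind(Enum):
    WORKFLOW = "workflow"
    BOUNDED = "bounded"
    OPEN = "open"


@dataclass(frozen=True)
class AgentProfile:
    """Hypothetical declaration of what is actually being shipped."""
    name: str
    kind: AgentKind
    model_controls_loop: bool
    tool_surface_bounded: bool
    max_depth: Optional[int]      # None means no recursion limit is enforced
    token_budget: Optional[int]   # None means no token budget is enforced


def check_profile(profile: AgentProfile) -> List[str]:
    """Return the mismatches between the declared kind and the declared controls."""
    problems: List[str] = []
    if profile.kind is AgentKind.WORKFLOW and profile.model_controls_loop:
        problems.append("a workflow should not let the model control the loop")
    if profile.kind is not AgentKind.WORKFLOW and not profile.model_controls_loop:
        problems.append("the model does not control the loop, so this is a workflow")
    if profile.kind is AgentKind.BOUNDED:
        if not profile.tool_surface_bounded:
            problems.append("a bounded agent needs a bounded tool surface")
        if profile.max_depth is None:
            problems.append("a bounded agent needs a recursion depth limit")
        if profile.token_budget is None:
            problems.append("a bounded agent needs a token budget")
    return problems
```

A profile that claims to be bounded but declares no depth limit fails the check before it ships, which is the point: the category becomes something a review can verify rather than a label.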
2. Reliability is a long-tail problem
Demo evaluations cover a handful of representative tasks. Production traffic does not look like a handful of representative tasks. It looks like a long tail of unanticipated phrasings, edge cases, partial information, and tasks the agent was never designed to handle but is being asked to handle anyway.
Three failure modes dominate in real production traffic:
- Tool-call drift. The agent calls the right tool with subtly wrong arguments — a misformatted parameter, a wrong identifier, a stale value pulled from earlier in the context. The tool returns an error or unexpected result, the agent attempts to recover, and the recovery often makes the original problem worse.
- Plan collapse on novel inputs. When the input does not match the patterns the agent was tuned on, planning quality degrades sharply. The agent commits to an early step it cannot back out of, then accumulates context trying to repair the consequences.
- Loops and runaway tasks. Without explicit budget enforcement, an agent that cannot reach a satisfying answer will keep trying — calling more tools, generating more reasoning, burning more tokens. The failure looks like a slow latency creep until someone notices the bill.
None of these are model failures in any pure sense. They are system failures that emerge from the interaction between the model, the tools, and the absence of bounds.
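The runaway-loop case is the easiest of the three to bound in code. What follows is a minimal sketch, assuming the agent can be modeled as a step function that returns its new state, the tokens it spent, and an answer once it has one. `Budget`, `BudgetExceeded`, and `run_bounded` are hypothetical names, not part of any framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


class BudgetExceeded(Exception):
    """Raised when an invocation runs past its step or token budget."""


@dataclass
class Budget:
    max_steps: int
    max_tokens: int
    steps_used: int = 0
    tokens_used: int = 0

    def start_step(self) -> None:
        # Checked before the step runs, so a step over the cap is never paid for.
        if self.steps_used >= self.max_steps:
            raise BudgetExceeded(f"step limit of {self.max_steps} reached")
        self.steps_used += 1

    def record_tokens(self, tokens: int) -> None:
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token limit of {self.max_tokens} exceeded")


# A step takes the running state and returns (new_state, tokens_spent, answer_or_None).
Step = Callable[[str], Tuple[str, int, Optional[str]]]


def run_bounded(step: Step, task: str, budget: Budget) -> str:
    """Run the agent loop until it answers or the budget stops it."""
    state = task
    while True:
        budget.start_step()
        state, tokens, answer = step(state)
        budget.record_tokens(tokens)
        if answer is not None:
            return answer
```

The design choice worth noting is that the budget fails loudly. An exception at step limit shows up in telemetry as a classifiable failure; a loop that quietly keeps going shows up as latency creep.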
3. Observability is not optional
A traditional service is debuggable because it has logs, stack traces, and deterministic inputs. An agent has none of those by default. When something goes wrong, the team has a final response, a vague sense that the agent "got confused," and very little else.
The minimum useful trace per agent invocation:
- The full prompt at every model call, with each appended context block attributed to its source
- Every tool call made, the arguments used, the response returned, and the latency of each
- The model's reasoning or planning output at each step, separate from the tool outputs
- Any guardrail, budget, or policy checks that fired
- The final response, plus a structured summary of the path taken
Teams that ship agents without this are debugging by re-running prompts and hoping for the same failure. Teams that have it can isolate whether a failure was a planning problem, a tool problem, a context problem, or a budget problem — and fix the right thing.
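As a rough illustration, the trace items listed above map onto a small set of record types. This is a sketch under stated assumptions: the class and field names are hypothetical, and a real system would likely emit these through whatever tracing backend it already runs rather than plain JSON.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class ContextBlock:
    source: str   # where the block came from, e.g. "system_prompt" or "retrieval"
    text: str


@dataclass
class ToolCall:
    name: str
    arguments: Dict[str, Any]
    response: str
    latency_ms: float


@dataclass
class StepTrace:
    prompt_blocks: List[ContextBlock]   # the full prompt, attributed block by block
    reasoning: str                      # model planning output, kept apart from tool output
    tool_calls: List[ToolCall] = field(default_factory=list)
    checks_fired: List[str] = field(default_factory=list)   # guardrail, budget, policy


@dataclass
class InvocationTrace:
    invocation_id: str
    steps: List[StepTrace] = field(default_factory=list)
    final_response: str = ""

    def path_summary(self) -> List[str]:
        """Tool names in call order: a compact view of the path taken."""
        return [call.name for step in self.steps for call in step.tool_calls]

    def to_json(self) -> str:
        record = asdict(self)
        record["path_summary"] = self.path_summary()
        return json.dumps(record)
```

Keeping `reasoning` and `tool_calls` in separate fields is what lets someone later ask whether a failure began in the plan or in the call.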
4. Tool design is where most failures actually live
The single most underweighted skill in agent engineering is tool design. Teams spend weeks tuning prompts and switching frameworks while the tools themselves return ambiguous errors, take parameters the model misuses consistently, or expose surface area the model has no reason to navigate correctly.
Production-quality tools look different from prototype-quality tools:
- Descriptions written for the model, not the human. The tool description is the contract. Vague descriptions produce wrong calls.
- Narrow, opinionated interfaces. A tool that does one thing reliably is better than a flexible tool the model will misuse.
- Structured errors that tell the model what to do next. "Invalid input" is a bad error. "Invalid input: field 'date' must be ISO-8601" is a useful one.
- Idempotency for any action with side effects. Agents retry. Tools must tolerate retries.
- Cost and latency visible in the tool result when the model is meant to reason about budget.
In most engagements I have seen, a focused pass on tool design moves reliability more than swapping the model does.
5. Cost is a sleeper failure mode
Agents do not fail loudly when they cost too much. They run, they return a result, and the bill arrives at the end of the month.
The compounding factors are well understood but rarely controlled in practice:
- Context growth per step. Each agent step appends to the prompt — prior reasoning, tool outputs, retrieved context. By step ten, the model is processing many times the tokens of step one.
- Tools that return large payloads. A verbose tool response inflates every subsequent step in the loop.
- Recursive or sub-agent spawning without explicit budget allocation. One outer invocation can fan out into dozens of sub-calls.
- Latency-driven retries. A timeout that retries the entire invocation at least doubles the cost of every slow call.
The controls are not exotic. Token budgets per invocation. Recursion depth limits. Sub-agent counts. Cost telemetry per agent type so the team can see which agents are economical and which are quietly burning money. Most teams add these only after the first uncomfortable bill.
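The context-growth effect is easy to put numbers on. The sketch below uses a deliberately simple model, assuming a fixed base prompt and a constant number of tokens appended per step; real agents append uneven amounts, and the function names are hypothetical.

```python
def input_tokens_at_step(step: int, base: int, appended_per_step: int) -> int:
    """Input tokens the model reads at a given 1-indexed step."""
    return base + (step - 1) * appended_per_step


def total_input_tokens(steps: int, base: int, appended_per_step: int) -> int:
    """Input tokens read across the whole invocation."""
    return sum(
        input_tokens_at_step(k, base, appended_per_step)
        for k in range(1, steps + 1)
    )


def tokens_with_full_retries(total_tokens: int, retries: int) -> int:
    """Tokens spent if each retry re-runs the entire invocation.

    Assumes the timed-out attempt had already consumed its full token cost.
    """
    return total_tokens * (1 + retries)
```

With a 1,000-token base prompt and 1,500 tokens appended per step, step ten reads 14,500 input tokens, 14.5 times what step one read, and the ten steps together read 77,500. One full retry on timeout takes that to 155,000. Total input grows quadratically with step count under this model, which is why a step limit is also a cost control.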
6. Authorization scope is the new attack surface
A traditional application has credentials scoped to the action a specific endpoint needs to perform. An agent has credentials scoped to the entire surface its tools can reach — which is often broader than the system designers consciously chose.
The questions worth asking before an agent is deployed:
- What is the worst-case action the agent could take if it followed an instruction it should have refused?
- How would that instruction reach the agent — direct user input, retrieved content, a tool result, a sub-agent?
- Which actions require human confirmation, and which do not?
- If the model is jailbroken or prompt-injected, what is the recovery path?
Most enterprises think about agent security after the first incident. The teams that do it before tend to scope tool permissions narrowly, require confirmation for irreversible actions, and design the agent so that a successful injection cannot reach consequential operations without crossing an explicit human gate.
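A minimal sketch of that gate, assuming each tool is registered with a required scope and an irreversibility flag. `ToolPolicy`, `authorize`, the scope strings, and the `confirm` callback (a stand-in for a real human approval step) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class ToolPolicy:
    required_scope: str
    irreversible: bool


class ActionDenied(Exception):
    """Raised when a tool call is not permitted to proceed."""


def authorize(
    tool: str,
    policies: Dict[str, ToolPolicy],
    granted_scopes: FrozenSet[str],
    confirm: Callable[[str], bool],
) -> None:
    """Raise ActionDenied unless the call is allowed; return None if it is."""
    policy = policies.get(tool)
    if policy is None:
        # Deny by default: a tool with no policy is not callable at all.
        raise ActionDenied(f"{tool}: no policy registered")
    if policy.required_scope not in granted_scopes:
        raise ActionDenied(f"{tool}: missing scope {policy.required_scope}")
    if policy.irreversible and not confirm(tool):
        raise ActionDenied(f"{tool}: human confirmation was not given")
```

The check runs outside the model, so a successful injection can change what the agent asks for but not what the gate allows.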
7. Evaluation has to measure process, not just outcomes
Evaluating an agent on task completion alone is misleading. Two agents can hit the same completion rate while differing meaningfully in cost, latency, tool-call accuracy, and the proportion of completions that involved a correctable error along the way.
A defensible eval covers at least:
- Task completion — did the agent achieve the user-visible goal?
- Path quality — did it take a reasonable route, or did it stumble through several wrong branches before recovering?
- Tool-call accuracy — did it call the right tools with the right arguments, or did it improvise?
- Cost and latency per task — at what economic envelope did the completion occur?
- Failure mode classification — when it failed, what type of failure was it? Without this, the team cannot prioritize fixes.
Teams without process-level eval ship agents whose behavior drifts unobserved as prompts and models change. They cannot tell whether a "better" agent is genuinely better or only better at the small sample they manually reviewed.
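To make the dimensions above concrete, here is a sketch of a per-task record and an aggregate over an eval run. The field names, the failure-mode labels, and the `summarize` function are hypothetical; how each field gets populated (for example, who judges a tool call "correct") is left open.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskResult:
    completed: bool
    tool_calls: int
    correct_tool_calls: int
    recovered_errors: int            # errors hit and corrected along the way
    cost_usd: float
    latency_s: float
    failure_mode: Optional[str] = None   # e.g. "tool_call_drift", "runaway_loop"


def summarize(results: List[TaskResult]) -> dict:
    """Aggregate outcome and process metrics over one eval run."""
    if not results:
        raise ValueError("no results to summarize")
    n = len(results)
    total_calls = sum(r.tool_calls for r in results)
    correct_calls = sum(r.correct_tool_calls for r in results)
    clean = sum(1 for r in results if r.completed and r.recovered_errors == 0)
    failures = Counter(r.failure_mode for r in results if not r.completed)
    return {
        "completion_rate": sum(1 for r in results if r.completed) / n,
        # Completions that needed no mid-task recovery: a proxy for path quality.
        "clean_completion_rate": clean / n,
        "tool_call_accuracy": correct_calls / total_calls if total_calls else None,
        "mean_cost_usd": sum(r.cost_usd for r in results) / n,
        "mean_latency_s": sum(r.latency_s for r in results) / n,
        "failure_modes": dict(failures),
    }
```

Two agents with the same `completion_rate` can differ sharply on `clean_completion_rate` and `mean_cost_usd`, which is the gap a completion-only eval hides.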
8. The ownership gap
An agent's behavior is decided by the system prompt, the tool surface, the retrieval layer, the model choice, the eval suite, and the production telemetry. In most organizations, each of those is owned by a different team — and the user-visible behavior of the agent is owned by no one specifically.
The predictable failure mode is that each team optimizes the surface it controls, no one is accountable for end-to-end quality, and the agent stays at a reliability ceiling everyone can describe but no one can move.
This is not specific to agents. The same pattern shows up in production RAG systems and in AI security programs, for the same structural reason: a system spanning multiple disciplines without explicit end-to-end ownership accumulates local optimizations without moving the user-visible whole.
For agents specifically, a named owner with authority across prompt, tools, retrieval, and eval is the precondition for everything else in this list. Without it, the engineering work happens; the system does not improve.
What production-grade agents actually look like
Production-grade agents have less in common with their demo predecessors than teams expect.
They are explicitly categorized as workflow, bounded, or open — and operated accordingly. They have observability that lets a human reconstruct any past invocation. Their tools are designed for the model, not retrofitted from a human-facing API. They have token budgets, recursion limits, and cost telemetry by agent type. Their authorization scope is narrow, and consequential actions cross human gates. Their evaluation measures process and cost, not just task completion. And they have a named owner accountable for the user-visible behavior of the system as a whole.
None of this depends on a smarter model. It depends on whether the team is building a system or only demonstrating a capability — and most agents in production were shipped before that distinction was made.
The teams still chasing model swaps are chasing the wrong variable. The reliability ceiling in production agents is set by tool design, observability, budget enforcement, eval discipline, and ownership — not by which frontier model they happen to be using this month.
Reliable agents are mostly a matter of the system built around the model, not of the model itself.
For the security-specific framing on agentic systems: How AI Systems Actually Fail — And How to Test Them for Security.
If your agents work in evaluation but break in production, or if cost and reliability are no longer predictable as you scale, I work with teams to diagnose where the system actually fails and build the surrounding infrastructure that production-grade agents require. Get in touch.