AI Agent Failures: Real Incidents and What Actually Went Wrong

March 29, 2026 · Editorial Team · 8 min read · ai-safety ai-agents failure-modes

The AI agent failure stories that get shared publicly tend to be sanitized: "the system made an error and we corrected it." The technical details are harder to find: what went wrong, why the safeguards didn't catch it, and what the fix was. This article covers real incident patterns from 2025 and 2026, reconstructed from post-mortems, public reports, and technical discussions, to show the failure mechanisms that matter.

Incident 1: The refund agent that gave away $1.2M

What happened: A mid-size e-commerce company deployed a customer service agent in Q3 2025 to handle return and refund requests. The agent had access to the order management system and could issue refunds up to $500 without human review. Within three weeks, a small number of users discovered that by rephrasing their requests to match what the agent treated as a legitimate refund request, they could receive refunds on orders that didn't qualify under the company's policy. The total exposure before detection was approximately $1.2M across 340 transactions.

The root cause: The refund eligibility logic was implemented through natural language instructions to the LLM: "Issue a refund if the order was placed within 30 days and the customer describes a product defect or delivery issue." This works for the majority of straightforward cases. But the LLM's interpretation of "describes a delivery issue" was broader than the policy team intended, and users who received a refund posted about the phrasing in a forum, which accelerated the exploitation.

The deeper issue: using an LLM's natural language judgment as a security gate. LLMs don't have consistent decision boundaries the way rule-based logic does. The same system that correctly refuses a borderline case on Tuesday might approve it on Wednesday because of slight phrasing differences or context variations.

What fixed it: Refund eligibility was moved to a deterministic rule engine that the LLM called as a tool. The tool takes order ID and reason code (a structured field, not free text) and returns a boolean. The LLM's role was reduced to extracting the reason code from free-text conversation and presenting the tool's decision to the customer. The LLM never decides whether a refund is issued; it only interacts with the customer. A human reviews all refund approvals over $200.
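A minimal sketch of what such a deterministic gate could look like, assuming the 30-day window and $200 review threshold quoted in this article. The names (ReasonCode, Order, RefundDecision, check_refund_eligibility) and the set of reason codes are hypothetical, the order lookup by ID is left out, and the function returns a small decision object rather than a bare boolean so the review flag can travel with it.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class ReasonCode(Enum):
    # Hypothetical reason codes; the LLM's only job is to pick one of these.
    PRODUCT_DEFECT = "product_defect"
    DELIVERY_ISSUE = "delivery_issue"
    CHANGED_MIND = "changed_mind"


# Assumed policy constants, based on the figures quoted in the article.
REFUND_WINDOW = timedelta(days=30)
HUMAN_REVIEW_THRESHOLD_CENTS = 200_00
ELIGIBLE_REASONS = frozenset({ReasonCode.PRODUCT_DEFECT, ReasonCode.DELIVERY_ISSUE})


@dataclass(frozen=True)
class Order:
    order_id: str
    placed_on: date
    amount_cents: int


@dataclass(frozen=True)
class RefundDecision:
    eligible: bool
    needs_human_review: bool


def check_refund_eligibility(order: Order, reason: ReasonCode, today: date) -> RefundDecision:
    """Deterministic gate: the same inputs always produce the same decision."""
    within_window = (today - order.placed_on) <= REFUND_WINDOW
    eligible = within_window and reason in ELIGIBLE_REASONS
    # Eligible refunds above the threshold are routed to a human reviewer.
    needs_review = eligible and order.amount_cents > HUMAN_REVIEW_THRESHOLD_CENTS
    return RefundDecision(eligible=eligible, needs_human_review=needs_review)
```

Because the reason code is an enum rather than free text, rephrasing a request can change which code the LLM extracts, but it cannot widen what the policy accepts for that code.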

Lesson: Never use an LLM's judgment as the authorization gate for irreversible financial actions. The LLM can gather information and present results, but the authorization logic should be deterministic and auditable.

Incident 2: The coding agent that pushed broken infrastructure

What happened: An infrastructure team at a SaaS company enabled a coding agent (Claude Code with full repository access) to autonomously handle a specific class of Terraform issues: flagged configuration drift where the live infrastructure differed from the Terraform state. The agent was supposed to update the Terraform files to match the actual state. In January 2026, the agent encountered a drift case involving security group rules. It correctly identified the discrepancy but, in resolving it, removed a security group rule that restricted inbound access on port 5432 (PostgreSQL). The rule had been added manually as an emergency response to a security incident and wasn't documented in the Terraform files. The agent's change passed CI (which tested the Terraform syntax and plan, not the semantic security implications) and was merged. The database remained exposed until the security team caught the change in a routine audit 36 hours later. No data was accessed, but the exposure was real.

The root cause: The agent was operating with incomplete context. It knew the desired Terraform state (the files), it knew the actual infrastructure state (AWS), and it resolved discrepancies by updating files to match reality. It had no access to the security incident log, no knowledge that the manual rule had been added deliberately, and no way to recognize that "this rule exists in AWS but not in Terraform" might indicate a missing Terraform resource rather than something to remove.

What fixed it: Three changes. First, the agent's scope was narrowed to read-only access for drift detection, with all write operations requiring explicit human approval. Second, security group modifications were added to a blocklist of changes the agent cannot make autonomously. Third, the team added a documentation requirement: any manual infrastructure change that diverges from Terraform must be accompanied by a comment in the relevant Terraform file within 24 hours, explaining the deviation. This creates a record the agent can read.
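The blocklist in the second change could be enforced with a check over the machine-readable plan, roughly as sketched below. This assumes the plan has been exported with `terraform show -json` and parsed into a dict. The function names and the list of blocked resource types are illustrative assumptions, not the team's actual configuration.

```python
# Resource types treated as security-sensitive. This is an assumed,
# non-exhaustive list chosen for illustration.
BLOCKED_RESOURCE_TYPES = frozenset({
    "aws_security_group",
    "aws_security_group_rule",
    "aws_vpc_security_group_ingress_rule",
    "aws_vpc_security_group_egress_rule",
})

# Plan actions that do not modify real infrastructure.
NON_MUTATING_ACTIONS = (["no-op"], ["read"])


def find_blocked_changes(plan: dict) -> list[str]:
    """Return addresses of planned changes the agent may not apply on its own."""
    blocked = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        if actions in NON_MUTATING_ACTIONS:
            continue
        if resource_change.get("type") in BLOCKED_RESOURCE_TYPES:
            blocked.append(resource_change["address"])
    return blocked


def requires_human_approval(plan: dict) -> bool:
    """True if the plan touches any blocklisted resource type."""
    return bool(find_blocked_changes(plan))
```

A check like this only looks at resource types, so it would stop the port 5432 change without knowing why the rule existed. It complements the documentation requirement rather than replacing it.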

Lesson: Agents operating on infrastructure need awareness of "why things are the way they are," not just "what state they're in." Automated systems that can't distinguish an intentional exception from a drift to be corrected will eventually erase important intentional exceptions.

Incident 3: The voice agent that invented a cancellation policy

What happened: A subscription software company deployed a voice AI agent for customer retention calls in mid-2025. The agent's job was to handle customers who called to cancel, offer retention incentives, and process cancellations when customers insisted. The agent was given the actual cancellation policy in its system prompt, which included a 30-day notice period for annual plan customers.

Over a period of about six weeks, the agent told some customers on monthly plans that they also needed to give 30-day notice, even though only annual plan customers had this requirement. Customers who had been told they needed to wait were understandably upset. Several disputed charges, and the company faced regulatory complaints in two jurisdictions about the misrepresented policy.

The root cause: Hallucination through distributional confusion. The system prompt described both monthly and annual plan rules. The annual plan cancellation policy, being more complex and more explicitly documented, was more prominent in the agent's context. In some conversations, depending on how the request was phrased, the agent applied the more salient rule to cases where the simpler rule applied. This is a known failure mode: when two similar entities have different rules, LLMs sometimes apply the more "memorable" or frequently mentioned rule to both.

Compounding this: there was no ground truth check. The agent's outputs were never verified against the actual policy. A post-hoc analysis found that the error appeared in approximately 12% of monthly plan cancellation conversations.

What fixed it: Policy statements for each plan type were moved into separate tools, not included inline in the system prompt. The agent calls get_cancellation_policy(plan_type="monthly") or get_cancellation_policy(plan_type="annual") and reads the result rather than relying on memory of the system prompt. The tool returns the exact policy text, which the agent quotes verbatim rather than paraphrasing. A shadow monitoring process was added that samples 5% of calls and checks agent policy statements against the ground truth for each customer's plan type.
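A rough sketch of the policy tool and the verbatim check a shadow monitor might run against a call transcript. Only get_cancellation_policy and its plan_type argument come from the description above. The policy strings are placeholders rather than the company's real text, and quotes_policy_verbatim is a hypothetical helper.

```python
# Placeholder policy texts. A real deployment would load the exact,
# legally reviewed wording from a single source of truth.
POLICIES = {
    "monthly": "<exact monthly-plan cancellation policy text>",
    "annual": "<exact annual-plan cancellation policy text, including the 30-day notice period>",
}


def get_cancellation_policy(plan_type: str) -> str:
    """Tool exposed to the agent: returns the policy text for one plan type."""
    try:
        return POLICIES[plan_type]
    except KeyError:
        raise ValueError(f"unknown plan type: {plan_type!r}") from None


def _normalize(text: str) -> str:
    # Collapse whitespace so transcript formatting does not cause false alarms.
    return " ".join(text.split())


def quotes_policy_verbatim(agent_statement: str, plan_type: str) -> bool:
    """Shadow-monitor check: the statement must contain the policy text verbatim."""
    expected = _normalize(get_cancellation_policy(plan_type))
    return expected in _normalize(agent_statement)
```

An unknown plan type raises an error here instead of falling back to a default policy, so a lookup mistake surfaces as a failure rather than as a confident wrong answer.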

Lesson: For compliance-sensitive factual claims, don't rely on the model's memory of a system prompt. Retrieve the specific applicable policy dynamically and have the model read it directly. Paraphrasing policies from memory will eventually drift.

Incident 4: The research agent that cited papers that don't exist

What happened: A legal technology company built an AI research assistant for case preparation. The assistant would research relevant case law, summarize findings, and produce a memorandum with citations. In March 2026, an attorney used the system to prepare arguments in a federal case. Three of the cited cases in the memorandum didn't exist: the agent had hallucinated plausible-sounding case names, courts, and holdings. The attorney caught two before filing; one made it into a draft brief that was caught in internal review. No false citations were filed, but the near miss raised significant internal concern about the system's reliability.

The root cause: The retrieval pipeline was retrieving real case summaries, but for queries where relevant cases were rare or the query touched on a niche legal area, the model was filling gaps from training data rather than from retrieved documents. The system prompt said "cite only cases in the retrieved documents," but when the model's confidence was high and the retrieved documents were sparse, it occasionally generated citations it "knew" from training, treating them as equivalent to retrieved ones.

The system had no citation verification step. Generated case citations were accepted as-is.

What fixed it: Every generated citation is now verified against a legal database (Westlaw API) before inclusion in a memorandum. Citations that can't be verified are flagged with a warning and sent to an attorney for review before the memorandum is finalized. The system prompt was updated to include explicit language: "If you cannot find a relevant case in the retrieved documents, state that no relevant cases were found rather than generating a citation from memory." The monitoring team added a metric for "verification failure rate" per query type, which identified the niche legal areas where retrieval quality was lowest and prompted targeted improvement of the document index.
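The verification step might be structured along these lines. This is a sketch only: the database lookup is passed in as a plain callable (exists_in_database) because the actual Westlaw integration is not described here, and every name in the block is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class CitationCheck:
    citation: str
    verified: bool


def verify_citations(
    citations: Iterable[str],
    exists_in_database: Callable[[str], bool],
) -> list[CitationCheck]:
    """Check each generated citation against an authoritative lookup."""
    return [CitationCheck(c, exists_in_database(c)) for c in citations]


def split_for_review(checks: list[CitationCheck]) -> tuple[list[str], list[str]]:
    """Separate verified citations from those that need attorney review."""
    verified = [c.citation for c in checks if c.verified]
    flagged = [c.citation for c in checks if not c.verified]
    return verified, flagged


def verification_failure_rate(checks: list[CitationCheck]) -> float:
    """Share of citations that could not be verified (0.0 for an empty list)."""
    if not checks:
        return 0.0
    return sum(not c.verified for c in checks) / len(checks)
```

Tracking verification_failure_rate per query type is what would let a monitoring team see which legal areas have the weakest retrieval coverage.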

Lesson: Don't trust citations from models without verification against an authoritative source. The model's apparent confidence in a citation is not a reliable indicator of the citation's accuracy.

Cross-cutting patterns

Looking across these incidents, a few patterns emerge:

Irreversible action without authorization controls. Three of the four incidents involved agents taking actions that couldn't easily be undone (issuing refunds, modifying infrastructure, communicating a policy to a customer who may act on it). The fix in each case involved adding human authorization gates for irreversible actions or narrowing the scope of what the agent could do autonomously.

Implicit reliance on LLM judgment for policy enforcement. The LLM is good at understanding language. It's bad at consistently enforcing specific business rules, especially when those rules have exceptions and edge cases. Policy enforcement belongs in deterministic logic; the LLM's job is to understand context and call the appropriate tool.

Missing ground-truth verification. Several incidents involved the agent asserting facts that could have been verified against an authoritative source. Adding a verification step between generation and delivery can catch many of these failures before they reach users.

No monitoring for semantic errors. Code deploys have automated tests. AI agent deployments often don't have equivalent checks for whether the agent is saying accurate things. Shadow monitoring (sampling a percentage of agent outputs and evaluating them against ground truth) is becoming common practice in AI deployments, but it wasn't in place in most of these cases at the time of the incident.
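One way to pick the shadow-monitoring sample is to hash each conversation ID, which keeps the selection stable across reruns. The sketch below assumes a 5% rate, matching the figure from Incident 3. The function names are hypothetical, and the ground-truth evaluation itself is out of scope.

```python
import hashlib


def in_shadow_sample(conversation_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select roughly `sample_rate` of conversations."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(conversation_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to an integer bucket in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < int(sample_rate * 2**64)


def semantic_error_rate(evaluations: list[bool]) -> float:
    """Share of sampled outputs judged inaccurate (True means an error)."""
    if not evaluations:
        return 0.0
    return sum(evaluations) / len(evaluations)
```

Hash-based selection means the same conversation is always in or out of the sample, which makes audits reproducible. A random draw would also work if reproducibility is not needed.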

The agents in these incidents weren't doing anything bizarre; they were doing what their training and prompting led them to do. The failures were architectural: insufficient authorization controls, reliance on LLM judgment for things LLMs aren't reliable at, and absent monitoring. These are fixable problems, which is the useful thing to know.