How to Debug AI Agents When They Go Wrong
Debugging AI agents is different from debugging regular software. In a conventional codebase, when something goes wrong, you have a deterministic execution path. You can set a breakpoint, inspect the state at each step, and trace exactly where the logic diverged from your expectation.
With agents, you're debugging a probabilistic system. The same input can produce different outputs on different runs. The "logic" is a model's learned weights, not a function you can read. And the failure modes are often subtle: not a crash or an exception, but a silently wrong answer, a misunderstood instruction, or a task that's 90% complete and quietly abandoned.
This means the debugging toolkit is different. Here's what actually works.
The failure modes you'll actually encounter
Before you can debug, you need to be able to name what went wrong. AI agent failures fall into a handful of recurring patterns.
Prompt misinterpretation. The agent understood something different from what you intended. This happens most often with ambiguous instructions, with instructions that conflict with the model's training priors, or with prompts that are clear to a human but contain edge-case language that the model pattern-matches to a different scenario.
Context bleed. The agent is being influenced by content in its context that you didn't intend as instruction. A document in the context that discusses a different approach to a problem can cause the model to drift toward that approach. The system prompt, user message, tool results, and earlier conversation turns all compete for the model's attention.
Tool call failures. The agent calls a tool with wrong arguments, or the tool returns an error that the agent doesn't handle correctly. This can cause silent fallback behavior (the agent proceeds as if the tool call succeeded) or catastrophic loops (the agent retries a failing tool call indefinitely).
Hallucinated tool calls. The agent invents a tool call to a tool that doesn't exist in its toolkit, or fabricates arguments to a tool based on what seems plausible rather than what the actual API requires.
Planning failures. In multi-step tasks, the agent's plan has a flaw in step 3 that only becomes apparent in step 7. By the time you see the wrong output, the agent has taken 4 more steps based on the flawed assumption. Unwinding this requires understanding the full reasoning chain.
Premature completion. The agent declares a task complete before it actually is. This is more common in agents with less capable models or with poorly specified completion criteria.
Logging: start here before anything else
The foundation of agent debugging is logging. Before you add any tracing infrastructure, you need to be able to see what your agent is actually doing.
At minimum, log:
- Every prompt sent to the model (the complete text, not a summary)
- Every model response (the complete text)
- Every tool call and its arguments
- Every tool response
- Timestamps for each of the above
This sounds obvious. Most agent frameworks log some subset of this by default. The problem is that the default logging is usually compressed or formatted in ways that make it hard to reconstruct exactly what the model saw. When debugging, you want raw text, not a pretty summary.
In Python with the Anthropic SDK (Claude 3.7 Sonnet as of this writing):
import json
import time
from pathlib import Path

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def log_interaction(event_type, data, session_id):
    """Append one event to this session's JSONL log."""
    log_path = Path(f"logs/{session_id}.jsonl")
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with open(log_path, "a") as f:
        # default=str keeps logging from crashing on values that are not plain JSON
        f.write(json.dumps({
            "timestamp": time.time(),
            "event": event_type,
            "data": data
        }, default=str) + "\n")


def create_message(messages, tools, session_id):
    # Log the full request, including complete tool definitions, so the
    # model's exact input can be reconstructed later.
    log_interaction("request", {
        "messages": messages,
        "tools": tools
    }, session_id)
    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=4096,
        tools=tools,
        messages=messages
    )
    # Log every content block (text and tool_use) in full.
    log_interaction("response", {
        "stop_reason": response.stop_reason,
        "content": [block.model_dump() for block in response.content]
    }, session_id)
    return response
This gives you a JSONL file per session that you can replay, inspect, and search through. JSONL is preferable to a flat log file because each line is a valid JSON object you can parse programmatically.
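To make "inspect and search" concrete, here is a minimal sketch of reading a session log back in. It assumes the one-object-per-line format with `event` and `data` keys described above; the helper names `load_session` and `find_events` are my own, not part of any SDK.

```python
import json


def load_session(log_path):
    """Read one JSONL session log into a list of event dicts."""
    events = []
    with open(log_path) as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                events.append(json.loads(line))
    return events


def find_events(events, event_type, needle=None):
    """Return events of one type, optionally only those whose data mentions `needle`."""
    matches = []
    for event in events:
        if event.get("event") != event_type:
            continue
        # Substring search over the serialized data; crude but enough for grepping
        if needle is not None and needle not in json.dumps(event.get("data")):
            continue
        matches.append(event)
    return matches
```

For example, `find_events(load_session("logs/abc123.jsonl"), "response", "tool_use")` would pull out every logged response in that (hypothetical) session that involved a tool call.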
Langfuse for production observability
Once you're beyond simple logging and running agents in production, you need a proper observability platform. Langfuse is the tool I'd recommend starting with.
Langfuse captures traces of your agent's execution, organizes them into hierarchical spans (you can see the full multi-step reasoning chain), and lets you search across traces by model, by prompt version, by output quality, or by any custom metadata you attach.
The integration is minimal:
from langfuse.decorators import observe, langfuse_context

# Assumes the Langfuse v2 Python SDK decorator API, with LANGFUSE_PUBLIC_KEY,
# LANGFUSE_SECRET_KEY, and LANGFUSE_HOST set in the environment.


@observe()
def research_agent(query: str) -> str:
    # update_current_observation is a plain call, not a context manager
    langfuse_context.update_current_observation(
        input={"query": query},
        metadata={"agent_version": "1.2.0"}
    )
    # your agent logic here; run_agent is a placeholder for your own function
    result = run_agent(query)
    langfuse_context.update_current_observation(output={"result": result})
    return result
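The @observe() decorator wraps the function in a Langfuse trace. Nested functions that are also decorated with @observe() show up as child spans, so tool calls you decorate appear in the hierarchy. Model calls are captured automatically only when they go through one of Langfuse's integrations (its OpenAI wrapper, for example) or through a decorated function; a plain Anthropic SDK call needs to be wrapped in a decorated function yourself.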
What makes Langfuse specifically useful for debugging is the ability to replay a trace. You can find a specific failing execution, see the complete state of the conversation at each step, and identify exactly where the reasoning went wrong. This is much faster than trying to reproduce an edge case from scratch.
Langfuse also supports human evaluation annotations, where you or your team can mark specific traces as correct or incorrect. Over time, this creates a labeled dataset of failure cases you can use to improve your prompts.
LangSmith for LangChain-based agents
If you're using LangChain or LangGraph (common choices as of early 2026), LangSmith is the equivalent observability platform. It integrates more deeply with the LangChain abstractions than Langfuse does, which is an advantage if you're using LangChain's agent executor, chain, or memory abstractions.
The setup is environment-variable based:
export LANGCHAIN_TRACING_V2=true
export LANGCHAIN_API_KEY=ls__yourkey
export LANGCHAIN_PROJECT=my-agent-project
Once those variables are set, LangChain automatically sends traces to LangSmith. No code changes needed.
LangSmith's debugging-specific feature worth knowing about is the "playground" view. For any captured trace, you can open the prompt in an interactive playground, modify it, and re-run it to see if your change fixes the issue. This dramatically speeds up the prompt iteration cycle because you're working with real failing examples rather than constructed test cases.
For agents with complex tool use, LangSmith also shows a tree view of all the tool calls and their nesting, which makes it easy to see when a tool chain went wrong and at which step.
Helicone for cost and performance monitoring
Helicone sits at a different layer than Langfuse and LangSmith. It's a proxy that intercepts your OpenAI or Anthropic API calls and captures metrics: latency, token counts, costs, error rates. If you're using the OpenAI or Anthropic SDK, it requires no changes to your agent logic, just a different base URL and an auth header on the client:
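import os

import anthropic

client = anthropic.Anthropic(
    base_url="https://anthropic.helicone.ai",
    default_headers={
        # Helicone key read from the environment (the variable name is your choice)
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}"
    }
)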
Helicone is less useful for debugging individual agent failures and more useful for identifying systemic problems: a spike in latency that correlates with a model update, an unexpected increase in token consumption after a prompt change, error rates by model or endpoint.
In my production agents, I use Helicone alongside Langfuse. Helicone gives me the aggregate view (is something systemically wrong?); Langfuse gives me the trace-level view (what exactly happened in this specific failing case?).
Prompt tracing: replaying what the model actually saw
One of the more powerful debugging techniques is prompt tracing: reconstructing the exact input the model received for a failing case, then running it through a prompt testing tool to understand why it produced the wrong output.
The Anthropic Console has a "prompt improver" and a workbench where you can paste in a prompt and run it with different models or modified versions. For any production failure, the workflow is:
- Find the failing trace in your logging/observability system
- Extract the complete prompt (system + conversation + tool definitions)
- Paste into the console workbench
- Reproduce the failure, then experiment with prompt modifications
This is slow but it's the most reliable way to understand model behavior. Automated prompt evaluation tools can speed this up: you capture a set of failing examples and then run them in a batch against different prompt variants.
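The batch idea can be sketched in a few lines. This is an illustration under assumptions, not a real evaluation framework: `run_batch`, `call_model`, and `passes` are hypothetical names, and `call_model` stands in for whatever function actually sends a system prompt and an input to your model.

```python
from typing import Callable, Dict, List


def run_batch(
    failing_cases: List[dict],
    prompt_variants: Dict[str, str],
    call_model: Callable[[str, str], str],
    passes: Callable[[dict, str], bool],
) -> Dict[str, float]:
    """Return the pass rate of each prompt variant over the captured failing cases."""
    if not failing_cases:
        raise ValueError("need at least one failing case")
    pass_rates = {}
    for name, system_prompt in prompt_variants.items():
        passed = 0
        for case in failing_cases:
            # Each case is assumed to carry the original input under "input"
            output = call_model(system_prompt, case["input"])
            if passes(case, output):
                passed += 1
        pass_rates[name] = passed / len(failing_cases)
    return pass_rates
```

Because the same input can produce different outputs on different runs, a single pass per case is only a rough signal; repeating each case a few times gives a steadier pass rate.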
Handling the most common failure patterns
For prompt misinterpretation: Add an explicit clarification step. Before the agent starts work, have it restate the task in its own words and wait for confirmation. This surfaces misinterpretations before they become wrong actions.
For context bleed: Audit what's in your system prompt and conversation context. Use a token counter to see how much of the context window is occupied by different sources. If documents in the context are influencing behavior unintentionally, move them to explicit tool calls (the agent retrieves them on demand) instead of putting them in the initial context.
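As a rough illustration of that audit, the sketch below estimates how much of the context each source occupies. The four-characters-per-token ratio is a crude assumption for English text, not a real tokenizer, and `context_breakdown` is a name I made up; swap in your provider's token counter for real numbers.

```python
def context_breakdown(sources, chars_per_token=4):
    """Estimate each source's share of the context.

    `sources` maps a label (e.g. "system", "documents") to its text.
    Returns (label, estimated_tokens, share) tuples, largest first.
    """
    estimates = {label: len(text) / chars_per_token for label, text in sources.items()}
    total = sum(estimates.values())
    report = []
    for label, tokens in sorted(estimates.items(), key=lambda kv: kv[1], reverse=True):
        share = tokens / total if total else 0.0
        report.append((label, round(tokens), share))
    return report
```

If one pasted document turns out to dominate the estimate, that is a good candidate for moving behind a retrieval tool.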
For tool call failures: Add explicit error handling in your tool wrapper functions. Return structured error messages that give the model enough information to decide what to do next. "Function failed: database connection error, connection refused on port 5432" is much more actionable than a raw exception traceback.
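One way to get there is a wrapper that catches exceptions and hands the model a small structured result. This is a sketch under assumptions: the `safe_tool` decorator, the result shape, and the `lookup_user` tool are all hypothetical, and you may want to catch narrower exception types in practice.

```python
import functools


def safe_tool(func):
    """Wrap a tool so failures come back as a structured result instead of raising."""
    @functools.wraps(func)
    def wrapper(**kwargs):
        try:
            return {"ok": True, "result": func(**kwargs)}
        except Exception as exc:
            # Short, structured, and readable by the model; no raw traceback
            return {
                "ok": False,
                "tool": func.__name__,
                "error_type": type(exc).__name__,
                "message": str(exc),
            }
    return wrapper


@safe_tool
def lookup_user(user_id):
    # Hypothetical tool that fails the way the example in the text describes
    raise ConnectionError("database connection error, connection refused on port 5432")
```

Calling `lookup_user(user_id="12345")` returns a dict with `ok` set to `False` and the error message intact, which you can serialize into the tool result the model sees.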
For hallucinated tool calls: This is usually a sign that your tool definitions are ambiguous or your system prompt is confusing. Clarify tool purposes in their descriptions. Add an "I will only call tools that are explicitly listed in my toolkit" line to your system prompt.
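A prompt line only reduces hallucinated calls, so it can help to also check each proposed call before executing it. The sketch below assumes tool definitions shaped like the Anthropic tools format (a `name` plus a JSON-schema `input_schema`); `validate_tool_call` and the example `TOOLS` list are hypothetical.

```python
# Hypothetical tool list in the Anthropic tool-definition shape
TOOLS = [{
    "name": "refresh_user_token",
    "description": "Refresh an expired session token for one user.",
    "input_schema": {
        "type": "object",
        "properties": {"user_id": {"type": "string"}},
        "required": ["user_id"],
    },
}]


def validate_tool_call(name, args, tools):
    """Return a list of problems with a proposed tool call (empty means OK)."""
    by_name = {t["name"]: t for t in tools}
    if name not in by_name:
        return [f"unknown tool '{name}'; available: {sorted(by_name)}"]
    schema = by_name[name].get("input_schema", {})
    allowed = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    problems = []
    for missing in sorted(required - set(args)):
        problems.append(f"missing required argument '{missing}'")
    for extra in sorted(set(args) - allowed):
        problems.append(f"unexpected argument '{extra}'")
    return problems
```

Returning the problem list to the model as the tool result, rather than raising, gives it a chance to correct the call on the next turn.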
For planning failures: In complex multi-step tasks, add checkpoints where the agent verifies its intermediate results before proceeding. This slows down execution but catches wrong-path behavior before it propagates through multiple steps.
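In code, a checkpoint can be as simple as pairing each step with a verification function. This is a bare sketch with made-up names (`run_with_checkpoints`, and steps expressed as `(name, action, check)` tuples); in a real agent the check might be a schema validation, a sanity query, or a second model call.

```python
def run_with_checkpoints(steps, state):
    """Run (name, action, check) steps in order, stopping at the first failed check."""
    for name, action, check in steps:
        state = action(state)
        if not check(state):
            # Fail at the step that went wrong instead of several steps later
            raise RuntimeError(f"checkpoint failed after step '{name}'")
    return state
```

The point of raising immediately is that the error names the step where the result first looked wrong, so you don't have to unwind the later steps to find it.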
Building a regression test suite
Once you've debugged a failure and fixed it, you want to make sure it doesn't come back. This means capturing the failing example as a test case.
The structure is simple: a test case is a record of the input (query, context, conversation history), the expected behavior (not necessarily the exact output, but a description of what a correct response looks like), and optionally a failing example to avoid.
You don't need a fancy framework. A directory of JSON files, one per test case, works (or a JSONL file with each case collapsed onto a single line):
{
  "id": "test_auth_token_refresh",
  "input": {
    "query": "The user's session has expired. Refresh their token.",
    "context": "User ID: 12345, token expires at: [past timestamp]"
  },
  "expected": {
    "calls_tool": "refresh_user_token",
    "tool_args_include": {"user_id": "12345"},
    "does_not": "return expired token unchanged"
  }
}
Running these periodically after prompt changes or model updates catches regressions early. Most teams don't do this systematically, which is why they keep re-encountering the same agent failures after making unrelated changes.
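A minimal sketch of a checker for the structured fields in the test case format above. It assumes you have already run the agent and collected its tool calls as dicts with `name` and `input` keys (the shape of a logged `tool_use` block); `check_case` is a hypothetical helper, and the free-text `does_not` field is deliberately left to a human or model grader.

```python
def check_case(expected, tool_calls):
    """Check recorded tool calls against a test case's `expected` block.

    Passes if some call to `calls_tool` includes every key/value pair
    listed in `tool_args_include`.
    """
    wanted = expected.get("calls_tool")
    if wanted is None:
        return True  # nothing structured to check
    required_args = expected.get("tool_args_include", {})
    for call in tool_calls:
        if call["name"] != wanted:
            continue
        args = call.get("input", {})
        if all(args.get(key) == value for key, value in required_args.items()):
            return True
    return False
```

Wiring this into a loop over your test case files, with one agent run per case, is enough for a first regression suite.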
Agent debugging is a skill that improves with practice. The first time an agent goes wrong in production, the problem feels opaque. After a few debugging sessions with proper logging and observability in place, the pattern recognition gets better. You start to recognize context bleed vs. tool call failure vs. planning breakdown from the symptoms alone.
The investment in logging and observability infrastructure pays off quickly. Start with structured JSONL logging for development. Add Langfuse or LangSmith once you're in production. Keep your prompt traces so you can replay failures. And build your regression test suite as you go.