Agent Handoff Patterns in 2026: How AI Agents Delegate to Each Other
When an agent finishes its piece of work and passes control to another agent, a lot can go wrong. The most common failure I see in multi-agent systems isn't individual agents doing their jobs badly; it's the transition points. The handoff drops context. The receiving agent doesn't understand the task state. The original intent gets diluted through three or four transfers, and the final output barely resembles what was asked for.
This guide covers the main handoff patterns, what goes wrong at each, and concrete examples of how to implement them in ways that actually work at production scale.
What a handoff is (and what it's not)
An agent handoff is the transfer of control from one agent to another, along with the context needed for the receiving agent to continue the work.
A handoff is not just calling a tool. When Agent A calls a web search tool, that's a tool call, not a handoff. Agent A stays in control throughout. A handoff happens when Agent A's participation in this workflow is done and Agent B takes over.
The distinction matters because handoffs have state implications. Tools return results to the calling agent. Handoffs change who is responsible for the ongoing task.
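To make the distinction concrete, here is a minimal sketch of how an orchestrator might treat the two cases. The ToolCall and Handoff types and the apply_action function are hypothetical names invented for this illustration, and the tool execution is a placeholder rather than a real search call.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ToolCall:
    tool: str
    argument: str


@dataclass
class Handoff:
    target_agent: str
    context: str


def apply_action(current_owner: str, action: Union[ToolCall, Handoff]) -> tuple[str, str]:
    """Return (owner_after_action, message_delivered_to_that_owner)."""
    if isinstance(action, ToolCall):
        # Placeholder for running the tool. The result goes back to the caller,
        # so ownership of the task does not change.
        result = f"result of {action.tool}({action.argument!r})"
        return current_owner, result
    if isinstance(action, Handoff):
        # Ownership moves to the target agent, together with the context.
        return action.target_agent, action.context
    raise TypeError(f"unsupported action: {action!r}")
```

A tool call leaves current_owner unchanged; a handoff replaces it. That single difference is what drives the state implications discussed in the rest of this guide.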
The three handoff models
Model 1: Linear sequential handoff
The simplest pattern. Agent A completes its task and explicitly passes the result to Agent B, which completes its task and passes to Agent C, and so on.
import json

from anthropic import Anthropic

client = Anthropic()  # Reads ANTHROPIC_API_KEY from the environment


def run_research_agent(topic: str) -> dict:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        system="""You are a research specialist. Your job is to gather facts
        about a given topic. Return a JSON object with keys:
        'summary', 'key_facts' (list), 'sources' (list).""",
        messages=[{"role": "user", "content": f"Research this topic: {topic}"}],
    )
    # Raises json.JSONDecodeError if the model wraps the JSON in extra text.
    return json.loads(response.content[0].text)


def run_writer_agent(research_output: dict, target_audience: str) -> str:
    # Hand over only the structured fields, not the researcher's raw output.
    context = f"""Research findings:
    Summary: {research_output['summary']}
    Key facts: {', '.join(research_output['key_facts'])}
    Sources: {', '.join(research_output['sources'])}"""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        system=f"""You are a content writer for a {target_audience} audience.
        Use the research provided to write a clear, accurate article.
        Do not add facts not present in the research.""",
        messages=[{"role": "user", "content": context}],
    )
    return response.content[0].text


def run_editor_agent(draft: str, style_guide: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=2048,
        system=f"""You are an editor. Review and improve the draft per
        this style guide: {style_guide}
        Return only the edited draft, no commentary.""",
        messages=[{"role": "user", "content": f"Edit this draft:\n\n{draft}"}],
    )
    return response.content[0].text


# The handoff chain
research = run_research_agent("climate change impacts on agriculture")
draft = run_writer_agent(research, "general public")
final = run_editor_agent(draft, "short sentences, active voice, no jargon")
The critical design choice here is that each agent receives structured context, not raw previous output. The writer agent doesn't get the full research scratchpad; it gets the structured summary, key facts, and sources. The editor doesn't get the research at all; it just gets the draft to edit.
This matters because passing raw previous output bloats context and introduces noise. The research agent's scratchpad might include 400 tokens of reasoning about which sources to trust. None of that is useful to the writer.
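One way to enforce this at the handoff boundary is to project the previous agent's result onto an explicit allow-list of fields before forwarding it. This is a sketch under assumptions: project_for_handoff is a hypothetical helper, the scratchpad key is an invented example of noise, and the sample values are placeholders.

```python
def project_for_handoff(raw_result: dict, allowed_keys: tuple) -> dict:
    """Keep only the fields the next agent is supposed to see."""
    missing = [key for key in allowed_keys if key not in raw_result]
    if missing:
        # Fail loudly at the boundary instead of forwarding a partial payload.
        raise KeyError(f"handoff payload is missing: {missing}")
    return {key: raw_result[key] for key in allowed_keys}


# Placeholder data standing in for a research agent's full result.
raw = {
    "summary": "placeholder summary",
    "key_facts": ["placeholder fact 1", "placeholder fact 2"],
    "sources": ["placeholder source"],
    "scratchpad": "long reasoning about which sources to trust...",
}
payload = project_for_handoff(raw, ("summary", "key_facts", "sources"))
```

The writer only ever sees payload, so anything the researcher adds to its raw result later cannot leak downstream unless someone deliberately extends the allow-list.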
Model 2: Supervisor-directed handoff
A supervisor agent decides which agent should handle each step. Rather than a predetermined sequence, the supervisor reads the task state and routes to the appropriate specialist.
import json

from anthropic import Anthropic

client = Anthropic()

AGENTS = {
    "classifier": "You classify customer support requests into categories: billing, technical, refund, general.",
    "billing_specialist": "You resolve billing questions. You have access to subscription pricing info.",
    "tech_specialist": "You resolve technical issues. You troubleshoot product problems step by step.",
    "escalation_agent": "You handle requests that need human escalation. You gather details and create a ticket summary.",
}


def run_agent(agent_name: str, context: str) -> str:
    system_prompt = AGENTS[agent_name]
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": context}],
    )
    return response.content[0].text


def supervisor(task: str, agent_results: list[dict]) -> dict:
    # Truncate each prior output so the routing prompt stays small.
    history = "\n".join(f"{r['agent']}: {r['output'][:200]}..." for r in agent_results)
    prompt = f"""Task: {task}
Work done so far:
{history if history else "None yet"}
Decide what to do next. Respond with JSON only:
{{"next_agent": "agent_name_or_DONE", "context_for_agent": "what to tell them", "reasoning": "brief"}}
Available agents: {list(AGENTS.keys())}"""
    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    # Raises json.JSONDecodeError if the reply is not bare JSON.
    return json.loads(response.content[0].text)


def run_pipeline(customer_request: str) -> str:
    agent_results = []
    for _ in range(5):  # Max 5 steps; kill condition
        decision = supervisor(customer_request, agent_results)
        agent_name = decision["next_agent"]
        if agent_name == "DONE":
            return agent_results[-1]["output"] if agent_results else "No result."
        if agent_name not in AGENTS:
            # The supervisor named an agent that does not exist; stop instead of crashing.
            return f"Escalated: unknown agent {agent_name!r}."
        output = run_agent(agent_name, decision["context_for_agent"])
        agent_results.append({"agent": agent_name, "output": output})
    return "Escalated: maximum steps reached."
The max step limit (the for _ in range(5) loop) is not optional. Supervisor agents can get stuck in loops where they keep requesting more work that isn't necessary. Without a hard kill condition, you'll burn through API budget on endless refinement loops.
Model 3: Peer handoff (collaborative)
In peer handoff, agents pass work back and forth without a supervisor. Agent A works on a problem, passes to Agent B for critique or a second pass, and Agent B either finishes or passes back.
This pattern is used for quality improvement: writer/editor pairs, coder/reviewer pairs, or researcher/fact-checker pairs.
# Assumes run_agent from the supervisor example, with "writer" and "editor"
# system prompts added to AGENTS.
def run_collaborative_pair(task: str, max_rounds: int = 3) -> str:
    draft = None
    for _ in range(max_rounds):
        if draft is None:
            # First round: generate
            draft = run_agent("writer", task)
        else:
            # Critique round
            critique_prompt = f"""Review this draft for accuracy and clarity.
If it's acceptable, respond with "APPROVED: " followed by the final draft.
If it needs changes, respond with "REVISE: " followed by specific feedback.
Draft:
{draft}"""
            critique = run_agent("editor", critique_prompt).strip()
            if critique.startswith("APPROVED:"):
                return critique[len("APPROVED:"):].strip()
            # Revision round. removeprefix leaves the text intact if the
            # editor forgot the "REVISE:" prefix.
            feedback = critique.removeprefix("REVISE:").strip()
            revision_prompt = f"""Revise this draft based on the feedback.
Original draft:
{draft}
Feedback:
{feedback}"""
            draft = run_agent("writer", revision_prompt)
    return draft  # Latest revision after max rounds (not re-reviewed)
The APPROVED: prefix is a deterministic exit condition. Without a clear way for the reviewing agent to signal "this is done," the loop will always run to max_rounds. Structured prefixes in agent outputs are a general pattern worth adopting: they let you parse agent intent without another LLM call.
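The prefix check can be pulled into a small parser so that every caller handles the unexpected case the same way. This is a sketch: parse_verdict and the UNKNOWN label are hypothetical names, and the accepted prefixes are the two used in the example above.

```python
def parse_verdict(reply: str) -> tuple[str, str]:
    """Split a reviewer reply into (verdict, body) without another LLM call."""
    text = reply.strip()
    for prefix in ("APPROVED:", "REVISE:"):
        if text.startswith(prefix):
            return prefix.rstrip(":"), text[len(prefix):].strip()
    # The reviewer ignored the protocol; let the caller decide what to do.
    return "UNKNOWN", text
```

Treating UNKNOWN as a request for revision is the conservative choice, since it never approves a draft the reviewer did not explicitly sign off on.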
Context passing: the hard part
The handoff code above handles control flow, but the real complexity is context. What does the receiving agent need to know to continue the work well?
What to pass:
- The original task or user intent (the goal, not just the previous output)
- The key results from previous steps (structured, not raw)
- Relevant constraints that should carry through the pipeline
- The current task state (what's been done, what's left)
What not to pass:
- Intermediate reasoning from previous agents (unless it's specifically relevant)
- Error messages that have already been handled
- The entire conversation history when a summary would do
Here's a pattern for structured context handoff:
from dataclasses import dataclass, field
from typing import Any


@dataclass
class HandoffContext:
    original_task: str
    intent: str  # One-line summary of what the user actually wants
    completed_steps: list[str] = field(default_factory=list)
    outputs: dict[str, Any] = field(default_factory=dict)
    constraints: list[str] = field(default_factory=list)

    def record(self, step: str, output_key: str, output: Any) -> None:
        """Writes an agent's result back so later agents can see it."""
        self.completed_steps.append(step)
        self.outputs[output_key] = output

    def for_agent(self, agent_role: str) -> str:
        """Formats context for a specific receiving agent."""
        # All output keys are listed here; filter by agent_role if some
        # outputs are irrelevant to a given agent.
        return f"""Task: {self.original_task}
Intent: {self.intent}
Constraints: {'; '.join(self.constraints) if self.constraints else 'None'}
Prior work: {'; '.join(self.completed_steps) if self.completed_steps else 'None'}
Available outputs: {list(self.outputs)}
Your role: {agent_role}"""
The HandoffContext object travels with the task. Each agent reads it and writes back to it. No agent needs to reconstruct what has already been done.
When handoffs fail
Handoffs fail in specific, reproducible ways. Knowing the failure modes makes them easier to prevent.
Context loss. The receiving agent doesn't understand what it needs to do because the handoff didn't include enough of the original task context. The fix: always include the original intent, not just the previous output.
Goal drift. After three or four handoffs, the task the last agent is working on bears little resemblance to the original request. Each agent slightly reinterprets the task. The fix: pass the original task explicitly to every agent in the chain, not just the next agent.
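A sketch of that fix: build every prompt in the chain from the same original task string, so no agent has to infer the goal from the previous agent's output. The names build_handoff_prompt and run_chain are hypothetical, and each agent is modeled as a plain callable from prompt to text so the example runs without an API.

```python
from typing import Callable


def build_handoff_prompt(original_task: str, previous_output: str, instruction: str) -> str:
    """Every prompt restates the original task verbatim."""
    return (
        f"Original task (do not reinterpret): {original_task}\n\n"
        f"Output from the previous step:\n{previous_output or 'None yet'}\n\n"
        f"Your step: {instruction}"
    )


def run_chain(original_task: str, steps: list[tuple[str, Callable[[str], str]]]) -> str:
    """Run (instruction, agent) pairs in order, re-anchoring on the task each time."""
    output = ""
    for instruction, agent in steps:
        output = agent(build_handoff_prompt(original_task, output, instruction))
    return output
```

The third agent in the chain sees exactly the same task wording as the first, regardless of how the intermediate agents phrased their outputs.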
Compounding errors. Agent A makes a mistake. Agent B receives that mistake as ground truth and builds on it. Agent C is now working with compounded errors. The fix: validate each agent's output before passing it forward. For numerical or factual claims, automated validation against a schema or fact-check layer is worth building.
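As a minimal sketch of the schema half of that fix, the validator below checks a research agent's reply before it is forwarded. The field names follow the research example earlier in this guide; HandoffValidationError and validate_research_output are hypothetical names, and a real system might use a schema library instead of hand-written checks.

```python
import json

REQUIRED_FIELDS = {"summary": str, "key_facts": list, "sources": list}


class HandoffValidationError(ValueError):
    """Raised when an agent's output is not safe to pass forward."""


def validate_research_output(raw_text: str) -> dict:
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise HandoffValidationError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise HandoffValidationError("expected a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            raise HandoffValidationError(f"missing field: {name}")
        if not isinstance(data[name], expected_type):
            raise HandoffValidationError(f"field {name} has the wrong type")
    if not data["key_facts"]:
        # An empty fact list would leave the writer with nothing to ground on.
        raise HandoffValidationError("key_facts is empty")
    return data
```

This catches structural mistakes only. Whether the facts themselves are true still needs the separate fact-check layer mentioned above.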
Loop detection failure. A supervisor routes to Agent A, which routes back to the supervisor, which routes to Agent A again. Without a step counter and kill condition, this runs indefinitely. The fix: implement a max_steps kill condition in every orchestrator, and log every routing decision.
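A sketch of such a guard, assuming the orchestrator asks it for permission before every routing decision. RoutingGuard and the per-agent visit limit are hypothetical additions for illustration; the limit catches a supervisor bouncing to the same agent before the global step budget runs out.

```python
import logging
from collections import Counter

logger = logging.getLogger("orchestrator")


class RoutingGuard:
    def __init__(self, max_steps: int = 5, max_visits_per_agent: int = 2):
        self.max_steps = max_steps
        self.max_visits_per_agent = max_visits_per_agent
        self.steps = 0
        self.visits = Counter()

    def allow(self, agent_name: str) -> bool:
        """Log the routing decision and return False once a limit is exceeded."""
        self.steps += 1
        self.visits[agent_name] += 1
        logger.info("step=%d route=%s visits=%d",
                    self.steps, agent_name, self.visits[agent_name])
        if self.steps > self.max_steps:
            return False
        if self.visits[agent_name] > self.max_visits_per_agent:
            return False
        return True
```

When allow returns False, the orchestrator should stop routing and fall back, for example by escalating to a human, rather than retrying the same decision.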
Context window overflow. The cumulative context from multiple agents fills the receiving agent's context window, causing it to ignore or truncate important earlier content. The fix: summarize, don't concatenate. Before passing context forward, compress prior agent outputs to the essential facts.
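The sketch below shows the shape of that compression step under a simple character budget. compress_outputs is a hypothetical helper, and truncate_summarizer is only a stand-in that keeps the head of the text; in a real pipeline the summarize callable would be a model call that extracts the essential facts.

```python
from typing import Callable


def truncate_summarizer(text: str, limit: int) -> str:
    """Stand-in summarizer: keeps the first `limit` characters."""
    return text[:limit]


def compress_outputs(
    outputs: dict[str, str],
    max_chars: int,
    summarize: Callable[[str, int], str] = truncate_summarizer,
) -> dict[str, str]:
    """Shrink prior agent outputs so they fit a total character budget."""
    if not outputs:
        return {}
    per_output = max(1, max_chars // len(outputs))  # Equal share per output
    return {
        name: text if len(text) <= per_output else summarize(text, per_output)
        for name, text in outputs.items()
    }
```

Characters are a rough proxy for tokens. The point is that the budget is enforced at the handoff, before the receiving agent's context window becomes the limiting factor.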
Testing handoffs in isolation
One debugging technique that saves significant time: test each handoff in isolation before testing the full pipeline.
For each handoff, create a set of representative inputs (what Agent N-1 typically produces) and validate that Agent N produces acceptable outputs from those inputs. Test edge cases: what happens when Agent N-1 produces malformed output? What happens when the context is empty?
Testing the full pipeline end-to-end is still necessary, but on its own it makes root cause analysis harder when something fails. If you already know each individual handoff works correctly across the expected input range, a full-pipeline failure points to the orchestration layer.
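A minimal sketch of an isolated handoff test: feed one agent a set of fixtures that mimic what the previous agent produces, including malformed and empty inputs, and record which cases behave unexpectedly. check_handoff and stub_writer are hypothetical names, and the stub replaces a real model call so the harness itself can be exercised offline.

```python
from typing import Callable


def check_handoff(
    agent: Callable[[dict], str],
    cases: list[tuple[str, dict, bool]],
) -> list[str]:
    """Return the names of cases whose outcome did not match the expectation."""
    failures = []
    for name, handoff_input, should_succeed in cases:
        try:
            agent(handoff_input)
            succeeded = True
        except (KeyError, TypeError, ValueError):
            succeeded = False
        if succeeded != should_succeed:
            failures.append(name)
    return failures


def stub_writer(research: dict) -> str:
    """Stand-in for the writer agent; rejects inputs it cannot write from."""
    if not research.get("key_facts"):
        raise ValueError("no facts to write from")
    return f"{research['summary']} ({len(research['key_facts'])} facts)"


CASES = [
    ("typical", {"summary": "s", "key_facts": ["a", "b"]}, True),
    ("missing summary", {"key_facts": ["a"]}, False),
    ("empty context", {}, False),
]
failures = check_handoff(stub_writer, CASES)
```

An empty failures list means the handoff behaves as specified on every fixture, including the ones that are supposed to be rejected.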
Production checklist
Before deploying a multi-agent system with handoffs:
- Every orchestrator has a maximum step count with a fallback behavior
- Agent outputs are validated against expected schemas before being passed to the next agent
- Every handoff includes the original task/intent, not just the previous output
- Context objects are compressed before forwarding when they exceed a defined size
- All agent calls and handoff decisions are logged with timestamps and step counts
- There is a fallback for every agent (what happens if the API call fails?)
- The pipeline has been tested on a representative sample of real inputs, including edge cases
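For the fallback item, a minimal sketch of a retry wrapper is shown below. call_with_fallback is a hypothetical helper; it catches Exception broadly to stay library-agnostic, whereas production code would normally catch only the SDK's transient error types. The sleep parameter exists so tests can skip the real delay.

```python
import logging
import time
from typing import Callable

logger = logging.getLogger("handoff")


def call_with_fallback(
    agent_call: Callable[[], str],
    fallback: str,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry an agent call with exponential backoff, then return the fallback."""
    for attempt in range(1, max_attempts + 1):
        try:
            return agent_call()
        except Exception as exc:  # Assumption: narrow this to transient errors
            logger.warning("attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    return fallback
```

The fallback value should be something the next stage can recognize, such as an escalation message, so a failed agent call does not silently look like a normal result.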
The handoff layer is infrastructure. It doesn't get the attention that agent capabilities do, but it's where production systems break in practice. Getting it right is the difference between a multi-agent system that works reliably and one that works only in the demos.