AI Agent Token Costs: How to Cut LLM Spend in Production

April 5, 2026 · Editorial Team · 7 min read · ai-engineering token-costs prompt-caching

Running an AI agent in production costs more than people expect when they're prototyping. In development, you're making a few dozen calls per day. In production, the same agent might make tens of thousands. The token math that seemed irrelevant at small scale becomes the biggest line item in your infrastructure budget.

I've worked through cost optimization on several production agent deployments, and the reductions are usually larger than expected. A well-optimized agent can cost 70-85% less to run than an unoptimized one doing the same work. Here's how.

Understanding where the tokens actually go

Before optimizing, measure. Most teams that haven't instrumented their agents are surprised when they break down actual token usage by category.
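One way to get that breakdown is to keep a running ledger of token counts per prompt category. The sketch below is a minimal illustration, not a finished instrumentation layer: `TokenLedger` and the category names are hypothetical, and it assumes you already have a per-category count for each call (from your provider's tokenizer or token-counting endpoint).

```python
from collections import Counter


class TokenLedger:
    """Accumulates token counts per prompt category across agent steps."""

    def __init__(self):
        self.totals = Counter()
        self.steps = 0

    def record_step(self, counts):
        # counts maps a category name to the tokens measured for this call
        self.totals.update(counts)
        self.steps += 1

    def breakdown(self):
        # Returns (category, tokens, share of total), largest category first
        total = sum(self.totals.values())
        if total == 0:
            return []
        return [(name, n, n / total) for name, n in self.totals.most_common()]


# Usage with made-up per-step counts; replace with your measured values.
ledger = TokenLedger()
ledger.record_step({"system_prompt": 2400, "history": 900, "output": 300})
ledger.record_step({"system_prompt": 2400, "history": 2100, "output": 350})
for name, tokens, share in ledger.breakdown():
    print(f"{name:<16}{tokens:>8,d}{share:>8.1%}")
```

Sorting by share makes it obvious which category to attack first.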

A typical poorly-optimized agent might look like this for a single task completion:

System prompt: 2,400 tokens (sent on every call)
Tool definitions: 1,800 tokens (sent on every call)
Conversation history and observations: 4,200 tokens (grows per step)
User input: 150 tokens
Output: 800 tokens

That's 9,350 tokens for a single step. If the task takes 12 steps to complete, that's roughly 112,000 tokens per task: about 102,600 input and 9,600 output. At Claude 4 Sonnet pricing ($3 input / $15 output per million tokens), that is roughly $0.31 per task in input costs plus $0.14 in output costs, or about $0.45 per task completion. Run that 5,000 times per month and you're at roughly $2,250/month for one agent.
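The per-task arithmetic is easy to get wrong by hand, so it helps to keep it in a small function. This is a sketch under the same simplification the example uses (history treated as a flat per-step average); `task_cost` is a hypothetical helper, and the default prices are the Sonnet rates quoted above.

```python
def task_cost(input_tokens_per_step, output_tokens_per_step, steps,
              input_price_per_m=3.00, output_price_per_m=15.00):
    """Return (input_cost, output_cost, total_cost) in dollars for one task."""
    input_cost = input_tokens_per_step * steps * input_price_per_m / 1_000_000
    output_cost = output_tokens_per_step * steps * output_price_per_m / 1_000_000
    return input_cost, output_cost, input_cost + output_cost


# Input side of the breakdown: system prompt + tools + history + user input.
input_per_step = 2_400 + 1_800 + 4_200 + 150
input_cost, output_cost, total = task_cost(input_per_step, 800, steps=12)
monthly = total * 5_000
print(f"input ${input_cost:.2f}, output ${output_cost:.2f}, "
      f"total ${total:.2f}, monthly ${monthly:,.0f}")
```

Note that output tokens are priced only at the output rate; counting them again as input inflates the estimate.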

The question is: which of those token categories can be reduced without degrading performance?

Prompt caching: the highest-impact optimization

Anthropic's prompt caching is the single most impactful optimization for agents that send a large, consistent system prompt on every call.

How it works: you mark sections of your prompt with a cache control flag. Anthropic caches the prompt prefix up to that marker for 5 minutes by default, refreshed each time it is read, with a 1-hour option available at a higher write price. The call that writes the cache pays a premium (25% over the normal input rate for the 5-minute cache). On subsequent calls that include the same cached prefix, you're charged 10% of the normal input token rate for the cached portion.

The effective discount on cached tokens is 90%. That's not a small number.

In the example above, the system prompt (2,400 tokens) and tool definitions (1,800 tokens) are the same on every call. Total: 4,200 tokens per call, all cacheable. Without caching: 4,200 tokens at $3/M = $0.0126 per call. With caching after the first call: $0.00126 per call for those tokens. Across 12 steps that saves up to about $0.13 per task completion (slightly less in practice, since the first call writes the cache instead of reading it), or a little under 30% of total cost.
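To see how the cache write affects that estimate, here is a rough sketch of the comparison. It assumes the first call writes the cache and every later call in the task hits it; the 1.25x write and 0.10x read multipliers are the 5-minute cache rates as I understand them at the time of writing, so treat them as parameters to check, and `cached_prefix_cost` is a hypothetical helper.

```python
def cached_prefix_cost(prefix_tokens, calls, input_price_per_m=3.00,
                       write_multiplier=1.25, read_multiplier=0.10):
    """Compare the cost of a repeated prompt prefix with and without caching.

    Assumes one cache write on the first call and a cache hit on every
    later call. Returns (uncached_cost, cached_cost, savings) in dollars.
    """
    base = prefix_tokens * input_price_per_m / 1_000_000  # one uncached pass
    uncached = base * calls
    cached = base * write_multiplier + base * read_multiplier * (calls - 1)
    return uncached, cached, uncached - cached


uncached, cached, saved = cached_prefix_cost(4_200, calls=12)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}, saved ${saved:.4f}")
```

The result lands slightly below the simple 90% estimate because the first call pays the write premium. If the cache expires between calls, every call becomes a write and caching costs more than it saves.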

The implementation for Anthropic's API looks like this:

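# Placeholder values so the snippet runs on its own; substitute your real text.
LARGE_SYSTEM_CONTEXT = "<long, stable instructions and reference material>"
user_message = "<the request for this call>"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": LARGE_SYSTEM_CONTEXT,
                # Cache breakpoint: the prompt up to and including this block is cached
                "cache_control": {"type": "ephemeral"}
            },
            {
                # Changes on every call, so it goes after the breakpoint
                "type": "text",
                "text": user_message
            }
        ]
    }
]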
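The snippet above caches a block inside a user message. For the case this section describes, a stable system prompt plus tool definitions, a sketch along the following lines may fit better. It only builds the request arguments and makes no network call. `build_cached_request` is a hypothetical helper, the default model ID is an assumption to replace with whatever you deploy, and it relies on the documented prefix order (tools, then system, then messages), so one breakpoint on the system block should cover both.

```python
def build_cached_request(system_prompt, tools, user_message,
                         model="claude-sonnet-4-20250514", max_tokens=1024):
    """Build keyword arguments for a Messages API call with a cached prefix."""
    return {
        "model": model,  # assumed model ID; use the one you actually deploy
        "max_tokens": max_tokens,
        "tools": tools,  # tool definitions come first in the cached prefix
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                # Prefix order is tools, system, messages, so this single
                # breakpoint covers the tool definitions and the system prompt.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Per-call content stays outside the cached prefix.
        "messages": [{"role": "user", "content": user_message}],
    }


request = build_cached_request("You are an invoice assistant.", [], "Hello")
print(sorted(request))
```

With the official `anthropic` Python SDK these arguments would be passed as `client.messages.create(**request)`. The usage object on the response reports cache reads and writes separately, which is how you confirm the cache is being hit.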

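The cache hit rate depends on call frequency. For agents making multiple calls per minute, you'll see hit rates above 90%. For agents that run once per hour, the default 5-minute cache will have expired between runs, so you either need the 1-hour cache option or should expect to pay for a cache write on most runs. Caching also only applies above a minimum prompt length that varies by model (1,024 tokens for most Sonnet and Opus models and 2,048 or more for Haiku models at the time of writing), so check the current documentation for the model you use.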

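OpenAI offers an equivalent feature called Prompt Caching (rolled out in late 2024). It is applied automatically to prompts of 1,024 tokens or more, with no cache flag to set. The discount was 50% on cached input tokens at launch rather than Anthropic's 90%, and it varies by model, so check current pricing.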

Context compression: keeping the conversation window small

In multi-step agentic loops, the conversation history grows with every tool call and observation. By step 10, you're including the output of 9 tool calls in your context, even if most of that information is no longer relevant to the current step.

Context compression means periodically summarizing older conversation history to reduce its token footprint. Instead of including the full output of a web search from step 3, you include a 2-sentence summary of what was found and what decision it informed.

Two implementation approaches:

Rolling summary: After every N steps (typically 3-5), have the model produce a concise summary of what's happened so far and what still needs to happen. Replace the detailed step history with this summary. The main risk: important details in early steps get lost in summarization. Mitigate by keeping a separate structured record of key decisions (e.g., "file path identified: /src/components/Auth.tsx") that doesn't get summarized away.

Selective pruning: Keep full detail for recent steps (last 3-4) and prune older steps down to one line each. Less aggressive than rolling summarization but still meaningful savings. Easier to implement without losing important context.
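As a rough illustration of selective pruning, the sketch below keeps recent steps in full, collapses older ones to a single truncated line, and carries key decisions in a separate pinned list so they are never pruned. The step shape (dicts with "action" and "observation" strings) and the name `prune_history` are assumptions for the example. Truncating to the first line is the crudest possible reduction; a model-written one-line summary would slot into the same place.

```python
def prune_history(steps, keep_recent=3, max_chars=120, pinned_facts=()):
    """Selective pruning: full detail for recent steps, one line for older ones.

    `steps` is a list of dicts with "action" and "observation" strings.
    `pinned_facts` holds key decisions that must survive pruning.
    """
    cutoff = max(len(steps) - keep_recent, 0)
    lines = []
    if pinned_facts:
        # Key decisions live outside the prunable history
        lines.append("Key facts: " + "; ".join(pinned_facts))
    for i, step in enumerate(steps):
        observation = step["observation"]
        if i < cutoff:
            # Older step: keep only the first line, truncated
            stripped = observation.strip()
            first_line = stripped.splitlines()[0] if stripped else ""
            if len(first_line) > max_chars:
                first_line = first_line[:max_chars] + "..."
            observation = first_line
        lines.append(f"Step {i + 1}: {step['action']} -> {observation}")
    return "\n".join(lines)


history = [
    {"action": "search_code", "observation": "Found 14 matches\n" + "detail " * 200},
    {"action": "read_file", "observation": "export function login() { /* body */ }"},
]
print(prune_history(history, keep_recent=1,
                    pinned_facts=["file path identified: /src/components/Auth.tsx"]))
```

A rolling summary would replace the loop over older steps with a single summarization call, at the cost of one extra model request every N steps.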

Typical savings: reducing 8,000 tokens of accumulated history to 1,500 tokens of compressed context saves 6,500 tokens per subsequent call. If compression kicks in at step 5 of a 15-step task, the 10 remaining steps save 65,000 input tokens. At $3/M, that's about $0.20 per task, before subtracting the cost of any summarization calls.

Model routing: use cheaper models for simpler steps

Not every step in an agent's loop needs a frontier model. Many steps are straightforward: call a tool based on an obvious condition, extract a specific field from structured output, decide between two clearly differentiated options. These don't need Claude 4 Opus or GPT-5.

Model routing means classifying each step by complexity and routing it to the cheapest model that can handle it reliably.

A practical three-tier routing setup:

Claude 3.5 Haiku ($0.80 input / $4 output per million tokens): Tool call formatting, structured data extraction, simple classification, regex-like tasks where the output format is fully defined.
Claude 3.5 Sonnet ($3 / $15 per million tokens): Most standard agent tasks: understanding natural language inputs, generating structured outputs, making moderate reasoning decisions.
Claude 4 Opus ($15 / $75 per million tokens): Complex reasoning steps, ambiguous situations requiring judgment, long-document analysis, tasks where Sonnet has shown poor performance in testing.

The routing logic can itself be a small, cheap model call that classifies the current step as "simple," "medium," or "complex" and returns the model name to use. Or it can be rule-based: certain tool types always go to Haiku; steps after an error go to Sonnet or above.
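A rule-based router can be very small. The following is a sketch of the idea rather than a recommended policy: the tool names, the `needs_judgment` flag, and the tier labels are all hypothetical, and the tier-to-model mapping uses placeholder strings rather than real model IDs.

```python
# Hypothetical tool names whose outputs are fully specified
SIMPLE_TOOLS = {"extract_field", "format_tool_call", "classify_document"}

# Placeholder labels; map these to the model IDs you actually deploy
TIER_TO_MODEL = {"small": "haiku", "medium": "sonnet", "large": "opus"}


def choose_tier(tool_name, previous_step_failed=False, needs_judgment=False):
    """Return "small", "medium", or "large" for the next agent step."""
    if needs_judgment:
        # Ambiguous or high-stakes steps go straight to the top tier
        return "large"
    if previous_step_failed:
        # After an error, never use the cheapest tier
        return "medium"
    if tool_name in SIMPLE_TOOLS:
        return "small"
    # Default conservatively to the middle tier
    return "medium"


tier = choose_tier("extract_field")
print(tier, TIER_TO_MODEL[tier])
```

Defaulting unknown steps to the middle tier keeps mistakes in the routing rules from silently sending hard problems to the cheapest model.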

In a typical agent workload, if 40% of steps can be handled by Haiku, 50% by Sonnet, and 10% require Opus, you're paying for very little Opus while maintaining Opus-level quality where it actually matters. Assuming steps of similar size, the blended rate comes to a little over $3 per million input tokens against $15 for Opus alone, so compared to running everything on Opus this could reduce cost by 75-80%.

Output token caps

Output tokens cost more than input tokens on every major API. Claude 4 Opus output is $75/M, five times the input rate. GPT-5 output is $10/M, eight times its $1.25/M input rate. Reducing unnecessary verbosity in outputs is direct savings.

Two levers:

System prompt instruction on output length: "Be concise. Respond with exactly what's needed, no preamble or recap. Maximum 200 words unless more is explicitly required." This sounds trivial but makes a real difference. Unconstrained models add preamble ("Certainly! Let me help you with that..."), recap previous steps, and pad responses. In my experience a simple length constraint can reduce output tokens by 30-40%, usually with no noticeable quality loss, though it is worth verifying on your own tasks.

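Max tokens parameter: Every API lets you set a maximum output token count. If your use case requires outputs under 500 tokens, set max_tokens: 600. This prevents runaway outputs from edge cases. It is a hard cutoff rather than a hint: the model is not told the limit, so a response that reaches it is truncated rather than made shorter. Pair the cap with the conciseness instruction above and watch for truncated responses.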
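Because the cap truncates rather than shortens, it is worth checking each response for truncation. The sketch below assumes an Anthropic-style response represented as a dict with a "stop_reason" field whose value is "max_tokens" when the cap was hit. `output_cap` and `was_truncated` are hypothetical helpers, and the 20% headroom is just the margin implied by the 500-to-600 example.

```python
def output_cap(expected_max_tokens, headroom=0.2):
    """Pick a max_tokens value slightly above the longest output you expect."""
    return round(expected_max_tokens * (1 + headroom))


def was_truncated(response):
    """True if the response stopped because it hit the max_tokens cap.

    Assumes a response dict with an Anthropic-style "stop_reason" field.
    """
    return response.get("stop_reason") == "max_tokens"


cap = output_cap(500)
fake_response = {"stop_reason": "max_tokens", "content": []}  # stand-in for a real reply
if was_truncated(fake_response):
    print(f"Output hit the {cap}-token cap; consider retrying with a higher limit")
```

Tracking how often this fires tells you whether the cap is set too low for real traffic.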

For tool call outputs specifically: the model often produces verbose reasoning before the actual tool call JSON. You can instruct it to produce the tool call directly without preceding reasoning for simple steps. Save the reasoning output for complex decisions where it actually helps.

Putting it together: a real before/after

An agent I optimized was running invoice processing: receive invoice PDFs, extract line items, validate against purchase orders, flag discrepancies, and update a database. Before optimization, the agent was costing $0.82 per invoice processed.

Changes made:

Cached the system prompt (800 tokens) and tool definitions (1,200 tokens): cache hit rate 95%. Savings: $0.03/invoice.
Compressed tool call history after step 3: average history reduced from 6,200 tokens to 1,100 tokens per call. Savings: $0.11/invoice.
Routed extraction steps to Claude 3.5 Haiku and both validation and discrepancy reasoning to Sonnet. Only escalated to Opus for invoices flagged as complex (about 8% of volume). Savings: $0.31/invoice.
Added max_tokens: 400 and a conciseness instruction. Output token reduction: ~35%. Savings: $0.09/invoice.

Total after optimization: $0.28/invoice. A 66% reduction. At 10,000 invoices per month, that went from $8,200/month to $2,800/month.
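For anyone who wants to rerun this kind of before/after tally with their own numbers, a small sketch follows. `summarize_savings` is a hypothetical helper; the inputs in the usage example are the per-invoice figures listed above.

```python
def summarize_savings(cost_before, savings_by_change, monthly_volume):
    """Return (cost_after, fractional_reduction, monthly_before, monthly_after)."""
    cost_after = cost_before - sum(savings_by_change.values())
    reduction = 1 - cost_after / cost_before
    return (cost_after, reduction,
            cost_before * monthly_volume, cost_after * monthly_volume)


savings = {
    "prompt caching": 0.03,
    "context compression": 0.11,
    "model routing": 0.31,
    "output limits": 0.09,
}
after, reduction, monthly_before, monthly_after = summarize_savings(0.82, savings, 10_000)
print(f"${after:.2f}/invoice, {reduction:.0%} lower, "
      f"${monthly_before:,.0f} -> ${monthly_after:,.0f} per month")
```

Keeping the savings itemized makes it easy to see which change carried most of the reduction.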

The order of priority: model routing and prompt caching have the most impact per hour of engineering work. Context compression is the next priority. Output constraints are quick wins that are worth doing early.

What not to optimize prematurely

One warning: cost optimization can degrade agent quality in ways that aren't obvious until they fail in production.

Model routing introduces the risk of routing a hard problem to a cheap model. If your routing logic is wrong about what's "simple," you'll see agent failures on cases that seem like they should work. Start with conservative routing thresholds and measure quality metrics alongside cost metrics.
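Measuring quality alongside cost can start as simply as recording, per routing tier, whether each step succeeded and what it cost. This is a minimal sketch with hypothetical names (`RouteStats`, the tier labels). What counts as "succeeded" is up to your own evaluation and is assumed to be supplied by the caller.

```python
from collections import defaultdict


class RouteStats:
    """Tracks cost and success rate per routing tier."""

    def __init__(self):
        self._runs = defaultdict(lambda: {"count": 0, "successes": 0, "cost": 0.0})

    def record(self, tier, succeeded, cost):
        stats = self._runs[tier]
        stats["count"] += 1
        stats["successes"] += int(succeeded)
        stats["cost"] += cost

    def report(self):
        # Per-tier success rate and average cost per step
        return {
            tier: {
                "success_rate": s["successes"] / s["count"],
                "avg_cost": s["cost"] / s["count"],
            }
            for tier, s in self._runs.items()
        }

    def tiers_below(self, min_success_rate):
        """Tiers whose success rate suggests routing is too aggressive."""
        return [tier for tier, r in self.report().items()
                if r["success_rate"] < min_success_rate]


stats = RouteStats()
stats.record("small", succeeded=True, cost=0.001)
stats.record("small", succeeded=False, cost=0.001)
stats.record("medium", succeeded=True, cost=0.012)
print(stats.tiers_below(0.9))
```

A tier that falls below your threshold is a signal to tighten the routing rules for it before chasing further savings.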

Aggressive context compression loses information. I've seen agents that compress context too aggressively end up forgetting that a constraint was specified in the first few steps, producing an output that violates it. Measure task completion quality, not just cost, after implementing compression.

The right sequence: first make the agent work reliably, then optimize cost while monitoring quality metrics. Cost optimization that breaks the agent isn't optimization, it's a different kind of expensive.