Prompt Caching Deep Dive: How to Cut Anthropic API Costs by 90%

April 18, 2026 · Editorial Team · 7 min read · anthropic claude api

Prompt caching is probably the most underused cost optimization available to teams building on Anthropic's API right now. The discount on cached tokens is 90% compared to standard input prices, and for applications with large system prompts or repeated context, the savings can be substantial enough to change whether a product is economically viable.

This post covers exactly how it works, how to implement it correctly, what cache hit rates are realistic, and what the actual dollar numbers look like for common use cases.

The basic mechanics

When you send a request to the Anthropic API, the model has to process every token in your input, including system prompt tokens, user message tokens, and any context you've injected. Processing tokens costs money. If you're sending the same large system prompt with every request (which most applications do), you're paying full price for those same tokens every single time.

Prompt caching lets you tell Anthropic "cache this portion of the prompt after you process it the first time." On subsequent requests that start with that same cached prefix, Anthropic reads the cached KV states instead of recomputing them from scratch. You pay a fraction of the cost for those tokens.

The pricing breakdown on Claude 3.5 Sonnet as of April 2026:

Standard input tokens: $3.00 per million
Cache write tokens (first time a prefix is cached): $3.75 per million (25% premium)
Cache read tokens (subsequent hits): $0.30 per million (90% discount vs standard)
Output tokens: $15.00 per million (no change)

The cache write premium is 25% over standard input. You pay slightly more the first time, then a fraction on every subsequent hit. For the economics to work, you need a sufficient number of cache hits per cache write.
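To make "a sufficient number of cache hits" concrete, here is a minimal sketch built on the prices listed above. The function name and the model of one write followed by a run of reads inside the cache lifetime are illustrative assumptions, not part of any SDK.

```python
# Prices in USD per million tokens, taken from the list above.
STANDARD_INPUT = 3.00
CACHE_WRITE = 3.75
CACHE_READ = 0.30


def prefix_cost(prefix_tokens: int, hits: int, cached: bool) -> float:
    """Cost of sending one prefix (1 + hits) times, with or without caching."""
    millions = prefix_tokens / 1_000_000
    if cached:
        # One cache write, then `hits` cache reads.
        return millions * (CACHE_WRITE + hits * CACHE_READ)
    # Every request pays the standard input price.
    return millions * STANDARD_INPUT * (1 + hits)


for hits in (0, 1, 5, 20):
    without = prefix_cost(10_000, hits, cached=False)
    with_cache = prefix_cost(10_000, hits, cached=True)
    print(f"{hits:>2} hits: ${without:.4f} uncached vs ${with_cache:.4f} cached")
```

At these prices a write that is never read costs 25% more than not caching, while a single hit inside the cache lifetime is already enough to come out ahead.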

How to implement it

Caching is controlled via the cache_control parameter in your API request. You add it to content blocks in your tool definitions, system prompt, or messages to mark where a cache boundary should be.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant for Acme Corp. [... 10,000 tokens of company docs ...]",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "What is our refund policy?"}
    ]
)

The "cache_control": {"type": "ephemeral"} marker tells Anthropic to cache everything up to and including that content block. On the next request, if the prompt is identical up to that marker, the cached KV states are used.

Two important things about how cache boundaries work:

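Caching applies to a prefix, not arbitrary sections. The cached portion always starts at the beginning of the prompt, which is assembled in the order tools, then system, then messages, and runs through the block that carries the marker. You can't cache a section in the middle while leaving what comes before it uncached.
Cache duration is 5 minutes by default, and the timer is refreshed each time the cached prefix is read. If more than 5 minutes pass without a request using the same prefix, the cache expires and the next request pays the cache write price again to recreate it.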

Multiple cache breakpoints

You can have up to four cache breakpoints in a single request. This is useful when you have a layered prompt structure: a stable system prompt, a somewhat-stable context document, and a frequently changing user-specific portion.

system=[
    {
        "type": "text",
        "text": STATIC_SYSTEM_PROMPT,
        "cache_control": {"type": "ephemeral"}  # Breakpoint 1: rarely changes
    },
    {
        "type": "text",
        "text": PRODUCT_CATALOG_CONTEXT,
        "cache_control": {"type": "ephemeral"}  # Breakpoint 2: updated daily
    },
    {
        "type": "text",
        "text": USER_SPECIFIC_PREFS,  # No cache_control: changes per user
    }
]

In this structure, if the static system prompt and product catalog are both cached, you're paying cache read prices for those two sections and full input prices only for the user-specific portion.
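One way to keep this layering consistent across code paths is to build the system blocks from a single helper. The sketch below is an assumption about how you might organize it: build_system_blocks is a hypothetical name, the placeholder strings stand in for your own prompt text, and the check only counts breakpoints in the system prompt (markers on tools or messages would count toward the same limit of four).

```python
MAX_BREAKPOINTS = 4  # per-request limit mentioned above


def build_system_blocks(sections):
    """Turn (text, cacheable) pairs into system content blocks.

    Sections should be ordered most-stable first, because each breakpoint
    caches everything that precedes it as well.
    """
    blocks = []
    for text, cacheable in sections:
        block = {"type": "text", "text": text}
        if cacheable:
            block["cache_control"] = {"type": "ephemeral"}
        blocks.append(block)

    used = sum("cache_control" in b for b in blocks)
    if used > MAX_BREAKPOINTS:
        raise ValueError(
            f"{used} cache breakpoints requested, limit is {MAX_BREAKPOINTS}"
        )
    return blocks


system = build_system_blocks([
    ("<static system prompt>", True),        # rarely changes
    ("<product catalog context>", True),     # updated daily
    ("<user-specific preferences>", False),  # changes per user
])
```

The resulting list can be passed as the system argument of the request shown earlier.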

What makes a good cache boundary

Caching works when the text prefix is identical between requests. A single character difference breaks the cache match. This sounds finicky but in practice it's manageable.

Good cache candidates:

System prompt instructions that don't change between requests
RAG context documents loaded at application startup and used for many queries
Long few-shot example sets included in every prompt
Large tool definitions for function-calling applications
Static code files included in code review prompts

Poor cache candidates:

System prompts with dynamically injected timestamps or user names
Content that changes per request
Very short content (the savings don't justify the complexity below ~2,000 tokens)

The minimum cacheable prefix on Claude 3.5 Sonnet is 1,024 tokens. Anthropic won't cache a shorter prefix even if you add the cache_control marker; the request is processed without caching.
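Because a single changed character breaks the match, it can help to fingerprint the cached part of the prompt locally and log the value with each request. The sketch below is illustrative only: prefix_fingerprint and build_system are hypothetical helpers, and the hash is a local diagnostic, not the key Anthropic uses internally. It also shows the usual fix for dynamic values, which is to place them after the last marker.

```python
import hashlib
import json
from datetime import datetime, timezone


def prefix_fingerprint(blocks):
    """Hash every block up to and including the last cache_control marker."""
    last = max(
        (i for i, b in enumerate(blocks) if "cache_control" in b),
        default=-1,
    )
    cached_part = blocks[: last + 1]
    payload = json.dumps(cached_part, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def build_system(static_prompt, now):
    """Keep the static prompt cacheable; put the timestamp after the marker."""
    return [
        {
            "type": "text",
            "text": static_prompt,
            "cache_control": {"type": "ephemeral"},
        },
        {"type": "text", "text": f"Current time: {now.isoformat()}"},
    ]


a = build_system(
    "You are a helpful assistant.",
    datetime(2026, 4, 18, 9, 0, tzinfo=timezone.utc),
)
b = build_system(
    "You are a helpful assistant.",
    datetime(2026, 4, 18, 9, 5, tzinfo=timezone.utc),
)
# The timestamps differ, but they sit outside the cached prefix.
print(prefix_fingerprint(a) == prefix_fingerprint(b))  # True
```

If the logged fingerprint changes between requests that should share a cache entry, something dynamic has leaked into the cached section.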

Real cost calculations

Let's put concrete numbers to this. Consider a customer support chatbot with:

System prompt: 4,000 tokens of instructions and company policies
RAG context: 6,000 tokens of relevant documents injected per query
Average user message: 50 tokens
Average output: 200 tokens
Volume: 10,000 queries per day

Without caching, per query:

Input: (4,000 + 6,000 + 50) tokens at $3.00/M = $0.03015
Output: 200 tokens at $15.00/M = $0.003
Total per query: $0.03315
Daily cost (10,000 queries): $331.50
Monthly cost (30 days): ~$9,945

With caching on the 10,000-token system + RAG context (assume the same documents are reused across queries and a 95% cache hit rate):

Per cache write (5% of queries):

Write tokens: 10,000 at $3.75/M = $0.0375
User message input: 50 at $3.00/M = $0.00015
Output: $0.003
Total: $0.04065

Per cache read (95% of queries):

Read tokens: 10,000 at $0.30/M = $0.003
User message input: 50 at $3.00/M = $0.00015
Output: $0.003
Total: $0.00615

Weighted average per query: (0.05 x $0.04065) + (0.95 x $0.00615) = $0.007875
Daily cost (10,000 queries): $78.75
Monthly cost (30 days): ~$2,363

Monthly savings: ~$7,580. That's a 76% reduction in total API cost. It falls short of the 90% discount on cached tokens because output tokens, cache writes, and the uncached user message are all billed at their usual rates. The saving would be even larger for applications with higher cached token ratios.
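The same calculation can be reproduced in a few lines, which makes it easy to plug in your own token counts and hit rate. This is a sketch that recomputes everything from the stated prices and token counts; the function names and the 30-day month are assumptions for illustration.

```python
# USD per million tokens, from the pricing list earlier in the post.
PRICE = {"input": 3.00, "write": 3.75, "read": 0.30, "output": 15.00}


def usd(tokens, kind):
    """Dollar cost of `tokens` billed at the given price category."""
    return tokens * PRICE[kind] / 1_000_000


def cost_per_query(cached_tokens, uncached_tokens, output_tokens, hit_rate=None):
    """Average cost of one query; hit_rate=None means caching is disabled."""
    # The user message and the output are billed the same way in every case.
    tail = usd(uncached_tokens, "input") + usd(output_tokens, "output")
    if hit_rate is None:
        return usd(cached_tokens, "input") + tail
    write = usd(cached_tokens, "write") + tail
    read = usd(cached_tokens, "read") + tail
    return (1 - hit_rate) * write + hit_rate * read


baseline = cost_per_query(10_000, 50, 200)
cached = cost_per_query(10_000, 50, 200, hit_rate=0.95)

for label, per_query in (("no caching", baseline), ("95% hit rate", cached)):
    monthly = per_query * 10_000 * 30  # 10,000 queries/day, 30-day month
    print(f"{label}: ${per_query:.5f} per query, ${monthly:,.0f} per month")
print(f"reduction: {1 - cached / baseline:.0%}")
```

Rerunning it with a lower hit_rate is a quick way to see how sensitive the savings are to cache misses.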

Realistic cache hit rates

The 95% hit rate I used above is achievable but not guaranteed. Cache hit rates depend on your traffic patterns and implementation.

Factors that increase hit rate:

High query volume sustained over time (cache refreshes stay warm)
System prompts with no dynamic elements
Application architecture that routes related queries together

Factors that decrease hit rate:

Low volume (cache expires between queries)
Dynamic content injected into cached sections
Diverse query patterns where different system prompts are used for different features

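In practice, applications I've seen with well-implemented caching and moderate-to-high query volume hit 80-95% cache read rates on their static content. Low-volume applications (under 100 queries per hour) struggle to maintain warm caches and see much lower hit rates, especially when that traffic arrives in bursts or is split across several different prefixes, so that gaps between requests sharing a prefix exceed 5 minutes.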
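To get a feel for how request spacing affects hit rate before you have production data, a toy simulation can help. The sketch below assumes a single shared prefix, randomly spaced (Poisson) arrivals, and a 5-minute lifetime that is reset by every write or read; all function names are hypothetical and real traffic is usually burstier than this model.

```python
import random

TTL_SECONDS = 5 * 60  # cache lifetime discussed earlier


def simulate_hit_rate(arrival_times):
    """Fraction of requests that find a warm cache for one shared prefix."""
    if not arrival_times:
        return 0.0
    hits = 0
    expires_at = float("-inf")
    for t in arrival_times:
        if t <= expires_at:
            hits += 1
        # A miss writes the cache and a hit refreshes it; both reset the timer.
        expires_at = t + TTL_SECONDS
    return hits / len(arrival_times)


def poisson_arrivals(per_hour, hours, seed=0):
    """Random arrival times in seconds at an average rate of `per_hour`."""
    rng = random.Random(seed)
    t, out = 0.0, []
    while True:
        t += rng.expovariate(per_hour / 3600)
        if t > hours * 3600:
            return out
        out.append(t)


for rate in (2, 12, 60, 600):
    arrivals = poisson_arrivals(rate, hours=24)
    print(f"{rate:>3} requests/hour -> {simulate_hit_rate(arrivals):.0%} simulated hit rate")
```

What drives the result is the gap between consecutive requests that share a prefix, so traffic split across many distinct prefixes behaves like much lower volume per prefix.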

The API response includes usage statistics that tell you how many tokens were cache reads vs. cache writes vs. standard inputs. Use this to calculate your actual hit rate and adjust your architecture accordingly.

Tool definitions: an underused caching opportunity

One caching opportunity that gets overlooked is tool/function definitions. If you're building an agentic application with dozens of tool definitions, those definitions can be thousands of tokens that are included in every request. They change rarely, if ever.

Adding cache_control to the last tool definition caches the whole tools array and keeps it warm across requests. For a complex agentic application with, say, 40 tools averaging 300 tokens each (12,000 tokens total), caching the tool definitions alone saves about $0.0324 per cache read compared to standard input pricing on those 12,000 tokens (12,000 x ($3.00 - $0.30) per million). At 1,000 agentic calls per day, and assuming nearly every call is a cache hit, that's about $32/day or roughly $970/month just from caching tool definitions.
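A minimal sketch of what that can look like follows. The two tool definitions are made up for illustration and with_cached_tools is a hypothetical helper; the point is that one marker on the final tool covers every definition before it, so the order of the list needs to stay stable between requests.

```python
def with_cached_tools(tools):
    """Return a copy of `tools` with a cache breakpoint on the last definition."""
    if not tools:
        return []
    cached = [dict(t) for t in tools]  # shallow copies; originals untouched
    cached[-1]["cache_control"] = {"type": "ephemeral"}
    return cached


# Hypothetical tool definitions for illustration.
TOOLS = [
    {
        "name": "get_order_status",
        "description": "Look up the status of an order by its ID.",
        "input_schema": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "name": "issue_refund",
        "description": "Issue a refund for an order.",
        "input_schema": {
            "type": "object",
            "properties": {
                "order_id": {"type": "string"},
                "amount_cents": {"type": "integer"},
            },
            "required": ["order_id", "amount_cents"],
        },
    },
]

request_tools = with_cached_tools(TOOLS)
# Pass the result as the tools argument: client.messages.create(..., tools=request_tools)
```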

Multi-turn conversation caching

Caching in multi-turn conversations requires thinking about how the conversation history grows. The common pattern is to cache the system prompt and any injected context as stable prefixes, then include the growing conversation history as uncached turns.

You can also mark earlier conversation turns with cache_control if the conversation has reached a stable checkpoint. For long-running agent sessions that might span hundreds of turns, periodically caching the conversation history up to a certain point prevents the uncached portion from growing unboundedly.

One approach for long agent sessions:

System prompt: cached (stable)
Initial context/documents: cached (stable)
First N turns of conversation: cached (after session reaches N turns)
Recent uncached turns: standard input pricing

The tradeoff is that you're paying cache write prices when you move turns from uncached to cached, but this is almost always worth it for long sessions.
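The checkpoint idea can be sketched as a small helper that places a single marker at the end of the N-th message. This is one possible shape, not a prescribed pattern: cache_history_up_to is a hypothetical name, and the example conversation is invented. It converts string content into content blocks so the marker has a block to attach to, and clears older markers so breakpoints do not accumulate.

```python
import copy


def cache_history_up_to(messages, n_messages):
    """Return a copy of `messages` with one breakpoint after message n_messages."""
    out = copy.deepcopy(messages)
    for msg in out:
        # Normalize plain-string content into a list of content blocks.
        if isinstance(msg["content"], str):
            msg["content"] = [{"type": "text", "text": msg["content"]}]
        # Drop any earlier checkpoint marker.
        for block in msg["content"]:
            block.pop("cache_control", None)
    if 0 < n_messages <= len(out):
        out[n_messages - 1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return out


history = [
    {"role": "user", "content": "Where is my order?"},
    {"role": "assistant", "content": "Could you share the order ID?"},
    {"role": "user", "content": "It is 1234."},
    {"role": "assistant", "content": "Thanks, it shipped yesterday."},
    {"role": "user", "content": "When will it arrive?"},
]

# Cache the first four messages; the latest user turn stays uncached.
checkpointed = cache_history_up_to(history, 4)
```

Each time the checkpoint moves, the next request pays a cache write on the newly covered turns, so advancing it every N turns rather than on every request keeps writes infrequent.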

Checking your usage

The API response includes a usage object with cache_creation_input_tokens and cache_read_input_tokens alongside standard input_tokens. Log these to track your actual cache efficiency:

print(f"Cache reads: {response.usage.cache_read_input_tokens}")
print(f"Cache writes: {response.usage.cache_creation_input_tokens}")
print(f"Standard input: {response.usage.input_tokens}")

cache_read_rate = (
    response.usage.cache_read_input_tokens /
    (response.usage.cache_read_input_tokens +
     response.usage.cache_creation_input_tokens +
     response.usage.input_tokens)
)
print(f"Cache read rate: {cache_read_rate:.1%}")

Build this into your monitoring from day one. If your cache read rate drops, it means something changed that's invalidating your cache prefix. Common culprits: a dynamic timestamp in the system prompt, a user ID accidentally injected into the cached section, or a change to the prompt that wasn't reflected in all code paths.
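A per-response rate is noisy, so for monitoring it is usually more useful to aggregate over many requests. The sketch below assumes you feed it the usage object from each response; CacheStats is a hypothetical helper, and the SimpleNamespace objects are stand-ins so the example runs without calling the API.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class CacheStats:
    read: int = 0
    written: int = 0
    uncached: int = 0

    def record(self, usage):
        """Add one response's token counts to the running totals."""
        # Treat missing or None cache fields as zero.
        self.read += getattr(usage, "cache_read_input_tokens", 0) or 0
        self.written += getattr(usage, "cache_creation_input_tokens", 0) or 0
        self.uncached += getattr(usage, "input_tokens", 0) or 0

    @property
    def read_rate(self):
        """Share of all input tokens that were served from cache."""
        total = self.read + self.written + self.uncached
        return self.read / total if total else 0.0


stats = CacheStats()
# Stand-ins for response.usage; real code would call stats.record(response.usage).
stats.record(SimpleNamespace(
    input_tokens=50, cache_creation_input_tokens=10_000, cache_read_input_tokens=0))
stats.record(SimpleNamespace(
    input_tokens=50, cache_creation_input_tokens=0, cache_read_input_tokens=10_000))
print(f"Aggregate cache read rate: {stats.read_rate:.1%}")
```

Emitting read_rate as a metric per deployment or per prompt version makes it easier to spot the moment a change starts invalidating the prefix.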

Prompt caching isn't complicated to implement. The main work is identifying which portions of your prompt are stable enough to cache, restructuring your prompt to put those stable portions first, and adding the cache_control markers. For most applications, that's a few hours of work. The cost reduction at moderate-to-high volumes more than justifies the time.

If you're building anything on the Anthropic API with a system prompt over 2,000 tokens and more than a few hundred daily API calls, you should be using prompt caching.