AI Agent Caching Strategies: Prompt Cache, Semantic Cache, Real Numbers

April 12, 2026 · Editorial Team · 8 min read · ai-agents caching cost-optimization

AI API costs add up fast in production. A naive implementation that sends the full prompt on every request, with no caching, can easily spend 10x more than a well-optimized one doing the same work. Three caching layers are available to production agents, and they're not mutually exclusive.

This guide covers all three with real numbers and real code.

The three caching layers

Prompt caching (provider-side). Anthropic and Google both offer this. You mark parts of your prompt as cacheable. On the first request, the provider processes and caches those tokens. On subsequent requests that reuse the same cached content, you pay a fraction of the normal input token cost. At Anthropic's Sonnet-class pricing, cache reads cost $0.30 per million input tokens versus $3.00 for uncached input. That's 90% off.

Semantic caching (application-side). Store the responses to previous requests. When a new request is semantically similar to a previous one, return the cached response without calling the LLM at all. The only cost is one inexpensive embedding request plus a similarity lookup.

Response caching (application-side). A simpler variant: cache exact responses to exact inputs. When the inputs are identical, skip the API call entirely. This works for deterministic use cases where the same question always gets the same answer.

Each layer has different tradeoffs on setup cost, hit rate, and freshness requirements.

Anthropic prompt caching: the 90% discount

Anthropic's prompt cache is the most impactful single optimization for agents that use long system prompts, large document contexts, or frequently repeated tool definitions.

Here's how it works in practice. You add "cache_control": {"type": "ephemeral"} to specific content blocks. The first request processes and caches the prompt prefix up to and including each marked block. Requests within the cache TTL (5 minutes by default, refreshed each time the cache is read) that send an identical prefix hit the cache. At Sonnet pricing, you pay $0.30/MTok for cache reads versus $3.00/MTok for regular input tokens.

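The minimum cacheable prompt length is 1,024 tokens on Sonnet-class models (some other models require more). Shorter prompts are still processed, but they are not cached even when marked with cache_control.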
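To get a feel for how the TTL interacts with traffic, the sketch below replays a list of request timestamps and counts how many would be billed as cache writes versus cache reads. It assumes every request sends an identical prefix, that the TTL window restarts on each read, and that the prefix is long enough to be cacheable. The function name `simulate_prompt_cache` is hypothetical, not part of any SDK.

```python
def simulate_prompt_cache(
    request_times: list[float],
    ttl_seconds: float = 300.0,
) -> dict[str, int]:
    """Count cache writes and reads for sorted request timestamps (in seconds)."""
    writes = 0
    reads = 0
    last_used = None  # time the cached prefix was last written or read

    for now in request_times:
        if last_used is not None and now - last_used < ttl_seconds:
            reads += 1   # cache still warm: billed at the cache-read rate
        else:
            writes += 1  # cold or expired: billed at the cache-write rate
        last_used = now  # assumption: both writes and reads restart the TTL

    return {"writes": writes, "reads": reads}


# Two bursts separated by quiet gaps longer than five minutes
result = simulate_prompt_cache([0, 100, 350, 700, 1001])
```

Steady traffic keeps the cache warm indefinitely, while gaps longer than the TTL turn the next request into a fresh write, so bursty agents see lower read rates than their request volume suggests.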

A real example with the system prompt:

import anthropic

client = anthropic.AsyncAnthropic()

LARGE_SYSTEM_PROMPT = """
You are an expert code reviewer...
[2,000+ words of detailed instructions, examples, and guidelines]
"""

async def review_code(code: str) -> str:
    response = await client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1000,
        system=[
            {
                "type": "text",
                "text": LARGE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"}  # Mark for caching
            }
        ],
        messages=[
            {"role": "user", "content": f"Review this code:\n\n{code}"}
        ]
    )
    
    # Check cache performance in the response
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache creation tokens: {getattr(usage, 'cache_creation_input_tokens', 0)}")
    print(f"Cache read tokens: {getattr(usage, 'cache_read_input_tokens', 0)}")
    
    return response.content[0].text

On the first call, cache_creation_input_tokens equals the token count of your system prompt (cache writes cost 25% more than regular input, so this call is slightly more expensive). On subsequent calls within 5 minutes, those tokens are reported as cache_read_input_tokens instead, at 90% off.
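One way to turn those two usage fields into a per-request label for logging is sketched below. It only reads the `cache_creation_input_tokens` and `cache_read_input_tokens` attributes used above. The helper name `cache_status` and its four labels are assumptions made for this example, and a `SimpleNamespace` stands in for a real response's usage object.

```python
from types import SimpleNamespace


def cache_status(usage) -> str:
    """Label a request as 'hit', 'write', 'partial', or 'uncached'."""
    # The fields may be missing or None, so fall back to 0 in both cases
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0

    if read > 0 and created > 0:
        return "partial"   # part of the prefix was read, a new part was written
    if read > 0:
        return "hit"
    if created > 0:
        return "write"
    return "uncached"      # nothing was cacheable, or the prompt was too short


# Stand-in usage objects for a first call and a follow-up call
first_call = SimpleNamespace(input_tokens=40, cache_creation_input_tokens=2000,
                             cache_read_input_tokens=0)
second_call = SimpleNamespace(input_tokens=40, cache_creation_input_tokens=0,
                              cache_read_input_tokens=2000)
labels = [cache_status(first_call), cache_status(second_call)]
```

A steady stream of "write" labels for the same prompt usually means the prefix is changing between requests or the traffic gaps exceed the TTL.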

Real numbers from a production code review agent:

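Scenario	Without caching	With prompt cache	Savings
Price per system prompt token	$3.00/MTok	$0.30/MTok	90%
2,000-token system prompt, 1,000 requests/day	~$6.00/day	~$0.60/day	~$1,970/year
8,000-token system prompt, 1,000 requests/day	~$24.00/day	~$2.40/day	~$7,880/year

These are steady-state figures for the system prompt tokens only. They do not include the 25% cache-write premium paid whenever the cache has to be rebuilt.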

For any agent that uses a detailed system prompt and handles volume, this is the highest-ROI optimization available.
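The annual figures in the table come from simple arithmetic, which the sketch below makes explicit so you can plug in your own prompt size and volume. It assumes Sonnet-class prices ($3.00 and $0.30 per MTok), a 365-day year, and a cache that stays warm, so cache-write costs are left out. The function name is hypothetical.

```python
def annual_prompt_cache_savings(
    cached_tokens_per_request: int,
    requests_per_day: int,
    base_per_mtok: float = 3.00,
    read_per_mtok: float = 0.30,
    days: int = 365,
) -> float:
    """Steady-state yearly savings in USD, ignoring cache-write premiums."""
    tokens_per_day = cached_tokens_per_request * requests_per_day
    daily_savings = tokens_per_day * (base_per_mtok - read_per_mtok) / 1_000_000
    return daily_savings * days


small_prompt = annual_prompt_cache_savings(2_000, 1_000)   # about 1,971 USD
large_prompt = annual_prompt_cache_savings(8_000, 1_000)   # about 7,884 USD
```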

Caching document context

When your agent processes the same documents repeatedly (RAG contexts, code files, reference documents), cache those too:

async def analyze_document_repeatedly(
    document_content: str,
    questions: list[str]
) -> list[str]:
    responses = []
    
    for question in questions:
        # Requests run sequentially so the first call can write the cache
        # before the remaining calls try to read it
        
        response = await client.messages.create(
            model="claude-3-7-sonnet-20250219",
            max_tokens=500,
            messages=[
                {
                    "role": "user",
                    "content": [
                        # Cache the document on first use
                        {
                            "type": "text",
                            "text": f"Document:\n{document_content}",
                            "cache_control": {"type": "ephemeral"}
                        },
                        {
                            "type": "text",
                            "text": f"\nQuestion: {question}"
                        }
                    ]
                }
            ]
        )
        responses.append(response.content[0].text)
    
    return responses

On the first question, the document content gets written to the cache. On questions 2 through N, the document content hits the cache. If your document is 10,000 tokens and you ask 20 questions, you pay the cache-write price (1.25x the normal input price) for one read and the cache-read price for 19 reads.

Without caching: 200,000 input tokens at $3.00/MTok = $0.60
With caching: 10,000 cache-write tokens at $3.75/MTok + 190,000 cache-read tokens at $0.30/MTok = $0.0375 + $0.057 = $0.0945

That's an 84% cost reduction on the document tokens for this specific use case.
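The same arithmetic generalizes to any document size and question count. The sketch below charges the first request at the cache-write rate (1.25x the base input price) and every later request at the cache-read rate (0.10x), so its total for the 20-question example includes the write premium. It assumes all requests arrive within the cache TTL and counts only the document tokens. The function name and return shape are made up for this example.

```python
def batch_input_cost(
    doc_tokens: int,
    n_requests: int,
    base_per_mtok: float = 3.00,
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.10,
) -> dict[str, float]:
    """Compare document-token cost with and without prompt caching."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")

    per_token = base_per_mtok / 1_000_000
    uncached = doc_tokens * n_requests * per_token
    # One cache write, then (n_requests - 1) cache reads
    cached = doc_tokens * per_token * (
        write_multiplier + (n_requests - 1) * read_multiplier
    )
    return {
        "uncached": uncached,
        "cached": cached,
        "reduction": 1 - cached / uncached,
    }


twenty_questions = batch_input_cost(10_000, 20)
single_question = batch_input_cost(10_000, 1)
```

With these multipliers, caching loses money on a single request (the write premium is never recovered) and already wins at two requests, so marking a document as cacheable only pays off when you expect at least one follow-up inside the TTL.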

Caching tool definitions

Agents with many tools pay the tool definition cost on every request. These definitions are often thousands of tokens. Cache them:

TOOLS = [
    {
        "name": "search_database",
        "description": "Search the customer database...",
        # [Long description with examples, 500+ tokens total per tool]
    },
    # ... 10 more tools
]

async def agent_call_with_cached_tools(
    conversation_history: list[dict]
) -> anthropic.types.Message:
    
    # Mark the last tool as the cache point
    # Anthropic caches everything up to the last cache_control marker
    tools_with_cache = TOOLS.copy()
    if tools_with_cache:
        tools_with_cache[-1] = {
            **tools_with_cache[-1],
            "cache_control": {"type": "ephemeral"}
        }
    
    return await client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1000,
        tools=tools_with_cache,
        messages=conversation_history
    )

If your tools total 5,000 tokens, caching them saves $0.0135 per request on cache hits at Sonnet pricing (5,000 tokens x $2.70/MTok saved). At 10,000 requests per day, that's $135/day, or $49,275/year.

Semantic caching

Semantic caching skips the LLM call entirely when a similar question has been asked before. Two requests count as "similar" when the cosine similarity of their embedding vectors meets a threshold.

The tradeoff: slightly stale or approximate answers in exchange for no LLM cost and much lower latency. Each lookup still pays for one inexpensive embedding request.

import json
import hashlib
import numpy as np
from typing import Optional

class SemanticCache:
    def __init__(
        self,
        embedding_client,
        redis_client,
        similarity_threshold: float = 0.92,
        ttl_seconds: int = 3600
    ):
        self.embedding_client = embedding_client
        self.redis = redis_client
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds
    
    async def get_embedding(self, text: str) -> list[float]:
        response = await self.embedding_client.embeddings.create(
            model="text-embedding-3-small",  # $0.02/MTok, very cheap
            input=text
        )
        return response.data[0].embedding
    
    async def lookup(self, query: str) -> Optional[str]:
        query_embedding = await self.get_embedding(query)
        
        # Search for similar cached queries
        # In production, use a vector database (Pinecone, pgvector, Qdrant)
        # This simplified version scans all cached embeddings; KEYS is O(N)
        # and blocks Redis while it runs, so keep it to small demos
        cache_keys = await self.redis.keys("semantic_cache:embedding:*")
        
        best_similarity = 0.0
        best_response_key = None
        
        for key in cache_keys:
            raw_embedding = await self.redis.get(key)
            if raw_embedding is None:
                continue  # Entry expired between the KEYS call and this GET
            stored_embedding = json.loads(raw_embedding)
            similarity = self._cosine_similarity(query_embedding, stored_embedding)
            
            if similarity > best_similarity:
                best_similarity = similarity
                best_response_key = key.decode().replace(
                    "semantic_cache:embedding:",
                    "semantic_cache:response:"
                )
        
        if best_similarity >= self.threshold and best_response_key:
            cached_response = await self.redis.get(best_response_key)
            if cached_response:
                return cached_response.decode()
        
        return None
    
    async def store(self, query: str, response: str) -> None:
        query_hash = hashlib.sha256(query.encode()).hexdigest()[:16]
        embedding = await self.get_embedding(query)
        
        embedding_key = f"semantic_cache:embedding:{query_hash}"
        response_key = f"semantic_cache:response:{query_hash}"
        
        await self.redis.setex(embedding_key, self.ttl, json.dumps(embedding))
        await self.redis.setex(response_key, self.ttl, response)
    
    def _cosine_similarity(
        self,
        a: list[float],
        b: list[float]
    ) -> float:
        a_arr = np.array(a)
        b_arr = np.array(b)
        return float(
            np.dot(a_arr, b_arr) / 
            (np.linalg.norm(a_arr) * np.linalg.norm(b_arr))
        )

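# Usage (client setup is shown here so the example is self-contained):
import anthropic
import openai
import redis.asyncio as aioredis

redis_client = aioredis.Redis()  # Defaults to localhost:6379 and returns bytes
anthropic_client = anthropic.AsyncAnthropic()

cache = SemanticCache(
    embedding_client=openai.AsyncOpenAI(),
    redis_client=redis_client,
    similarity_threshold=0.92
)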

async def answer_with_semantic_cache(question: str) -> str:
    # Check cache first
    cached = await cache.lookup(question)
    if cached:
        return cached
    
    # Cache miss: call the API
    response = await anthropic_client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=500,
        messages=[{"role": "user", "content": question}]
    )
    answer = response.content[0].text
    
    # Store in cache
    await cache.store(question, answer)
    return answer

Choosing the similarity threshold. 0.92 is a reasonable starting point for question-answering use cases. Higher (0.95+) means fewer cache hits but higher accuracy. Lower (0.85) means more hits but potentially returning a response from a question that's similar but not identical in meaning.

For customer support agents where accuracy matters, use 0.93+. For FAQ-style content where questions are genuinely repetitive, 0.88-0.90 gives better cache hit rates without significant accuracy loss.
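Rather than picking a threshold by feel, you can measure it against a small hand-labeled sample. The sketch below takes pairs of (similarity score, whether the two questions really deserve the same answer) and reports, for each candidate threshold, how often the cache would hit and how many of those hits would be wrong. The function name and the sample scores are invented for illustration and are not measurements from a real system.

```python
def sweep_thresholds(
    labeled_pairs: list[tuple[float, bool]],
    thresholds: list[float],
) -> list[dict[str, float]]:
    """For each threshold, compute the hit rate and the share of wrong hits."""
    if not labeled_pairs:
        raise ValueError("labeled_pairs must not be empty")

    rows = []
    for threshold in thresholds:
        # A pair is a cache hit when its similarity meets the threshold
        hits = [same for sim, same in labeled_pairs if sim >= threshold]
        wrong_hits = sum(1 for same in hits if not same)
        rows.append({
            "threshold": threshold,
            "hit_rate": len(hits) / len(labeled_pairs),
            "false_hit_rate": wrong_hits / len(hits) if hits else 0.0,
        })
    return rows


# Hypothetical labeled sample: (cosine similarity, same intended answer?)
sample = [
    (0.97, True), (0.94, True), (0.93, False), (0.91, True),
    (0.89, False), (0.86, False), (0.84, True), (0.80, False),
]
report = sweep_thresholds(sample, [0.85, 0.92, 0.95])
```

On this toy sample, lowering the threshold from 0.95 to 0.85 raises the hit rate from 12.5% to 75% but makes half of the hits wrong, which is the tradeoff described above. A few hundred labeled pairs from your own traffic give a far more reliable picture than any default.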

In production, use a proper vector database. The Redis scan above is for illustration. For thousands of cached entries, use pgvector (if you're already on Postgres), Qdrant ($0/month self-hosted, $25/month cloud), or Pinecone.

Response caching (exact match)

For the simplest case: when the same exact input always gets the same output, just cache the whole thing.

import hashlib
import json
from typing import Optional

class ResponseCache:
    def __init__(self, redis_client, default_ttl: int = 3600):
        self.redis = redis_client
        self.default_ttl = default_ttl
    
    def _cache_key(self, model: str, messages: list, **kwargs) -> str:
        # Create a deterministic hash of the full request
        request_data = {
            "model": model,
            "messages": messages,
            **kwargs
        }
        request_str = json.dumps(request_data, sort_keys=True)
        return f"response_cache:{hashlib.sha256(request_str.encode()).hexdigest()}"
    
    async def get(self, model: str, messages: list, **kwargs) -> Optional[str]:
        key = self._cache_key(model, messages, **kwargs)
        cached = await self.redis.get(key)
        return json.loads(cached) if cached else None
    
    async def set(
        self,
        model: str,
        messages: list,
        response: str,
        ttl: Optional[int] = None,
        **kwargs
    ) -> None:
        key = self._cache_key(model, messages, **kwargs)
        await self.redis.setex(
            key,
            ttl or self.default_ttl,
            json.dumps(response)
        )

Exact-match response caching is most useful for:

Content generation with fixed templates (email responses, report sections)
Classification tasks where the same text gets the same label
Translation of known strings
Any deterministic transformation

It's not useful for conversational agents where every response is contextual, or for tasks where freshness matters.
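Exact-match caching only works if identical requests always produce identical keys. The standalone sketch below mirrors the key derivation in ResponseCache and shows what does and does not count as "the same input": dictionary key order is normalized by sort_keys=True, while any change in parameters or text produces a different key. The function name `request_cache_key` and the model string "example-model" are placeholders.

```python
import hashlib
import json


def request_cache_key(model: str, messages: list, **params) -> str:
    """Hash the full request into a stable cache key."""
    payload = json.dumps(
        {"model": model, "messages": messages, **params},
        sort_keys=True,  # makes the key independent of dict ordering
    )
    return "response_cache:" + hashlib.sha256(payload.encode()).hexdigest()


key_a = request_cache_key(
    "example-model",
    [{"role": "user", "content": "Translate 'hello' to French"}],
    temperature=0, max_tokens=100,
)
# Same request with dict keys and keyword arguments in a different order
key_b = request_cache_key(
    "example-model",
    [{"content": "Translate 'hello' to French", "role": "user"}],
    max_tokens=100, temperature=0,
)
# Same request with a trailing space in the text
key_c = request_cache_key(
    "example-model",
    [{"role": "user", "content": "Translate 'hello' to French "}],
    temperature=0, max_tokens=100,
)
same_key = key_a == key_b          # True: ordering does not matter
whitespace_miss = key_a != key_c   # True: a single extra space is a miss
```

Whether to trim whitespace or lowercase input before hashing is an application decision. It raises the hit rate but is only safe when those differences cannot change the correct answer.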

Combining all three layers

The full caching stack for a production agent:

async def production_agent_call(
    conversation_history: list[dict],
    response_cache: ResponseCache,
    semantic_cache: SemanticCache,
    anthropic_client
) -> str:
    current_question = conversation_history[-1]["content"]
    
    # Layer 1: Exact response cache
    exact_cached = await response_cache.get(
        model="claude-3-7-sonnet-20250219",
        messages=conversation_history
    )
    if exact_cached:
        return exact_cached  # Free, <1ms
    
    # Layer 2: Semantic cache (single-turn questions only)
    if len(conversation_history) == 1:
        semantic_cached = await semantic_cache.lookup(current_question)
        if semantic_cached:
            return semantic_cached  # $0.00002 for embedding, <50ms
    
    # Layer 3: API call with prompt cache
    response = await anthropic_client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1000,
        system=[{
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # The system prompt defined in the first example
            "cache_control": {"type": "ephemeral"}  # Provider-side cache
        }],
        messages=conversation_history
    )
    
    answer = response.content[0].text
    
    # Store results for future requests
    await response_cache.set(
        model="claude-3-7-sonnet-20250219",
        messages=conversation_history,
        response=answer
    )
    if len(conversation_history) == 1:
        await semantic_cache.store(current_question, answer)
    
    return answer

This gives you three tiers of decreasing speed and increasing cost: exact cache (free), semantic cache (fraction of a cent), API with prompt cache (90% off the prompt tokens).

Measuring cache effectiveness

Track these metrics to understand your cache ROI:

from dataclasses import dataclass
from typing import Optional

import anthropic

@dataclass
class CacheMetrics:
    exact_hits: int = 0
    semantic_hits: int = 0
    prompt_cache_hits_tokens: int = 0
    api_calls: int = 0
    total_input_tokens_saved: int = 0
    estimated_cost_saved_usd: float = 0.0

async def update_metrics(
    metrics: CacheMetrics,
    api_response: Optional[anthropic.types.Message]
) -> None:
    # api_response is None when no API call was made (for example, a cache hit)
    if api_response:
        metrics.api_calls += 1
        usage = api_response.usage
        # The field can be absent or None, so fall back to 0
        cache_read = getattr(usage, 'cache_read_input_tokens', 0) or 0
        metrics.prompt_cache_hits_tokens += cache_read
        
        # $3.00/MTok uncached vs $0.30/MTok cached = $2.70 savings per MTok
        metrics.estimated_cost_saved_usd += cache_read * 2.70 / 1_000_000

A well-tuned caching layer in a production agent typically delivers:

Exact cache: 20-40% hit rate for repetitive tasks
Semantic cache: 15-30% hit rate for FAQ-style agents
Prompt cache: 70-95% cache read rate for large system prompts

Combined, these can reduce per-request costs by 60-80% versus no caching. For an agent spending $1,000/month on API calls today, that's $600-$800/month in savings with roughly 2-4 days of implementation work.
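To see how the individual hit rates combine into an overall reduction, here is a rough cost model. It assumes the exact cache is checked first, every exact miss pays for one embedding lookup, the semantic hit rate applies to those remaining requests, and prompt caching discounts only the cacheable share of each API call by 90% on cache reads. Cache-write premiums are ignored. All parameter names are made up for this sketch, and the input values are illustrative, not measurements.

```python
def blended_cost_per_request(
    api_cost: float,            # cost of one fully uncached API call, in USD
    exact_hit_rate: float,      # share of all requests served by the exact cache
    semantic_hit_rate: float,   # share of exact misses served by the semantic cache
    cacheable_share: float,     # share of api_cost that is cacheable prompt input
    cache_read_rate: float,     # share of cacheable tokens billed as cache reads
    embedding_cost: float = 0.00002,
    read_discount: float = 0.90,
) -> float:
    """Expected cost per incoming request with all three layers enabled."""
    discounted_api_cost = api_cost * (
        1 - cacheable_share * cache_read_rate * read_discount
    )
    reaches_semantic = 1 - exact_hit_rate
    return reaches_semantic * (
        embedding_cost + (1 - semantic_hit_rate) * discounted_api_cost
    )


baseline = 0.01  # hypothetical uncached cost per request
with_caching = blended_cost_per_request(
    api_cost=baseline,
    exact_hit_rate=0.30,
    semantic_hit_rate=0.20,
    cacheable_share=0.50,
    cache_read_rate=0.90,
)
reduction = 1 - with_caching / baseline  # roughly 0.67 for these inputs
```

With mid-range values from the list above, the model lands at about a two-thirds reduction, consistent with the 60-80% range. The result is most sensitive to the exact hit rate and to how much of each call's cost is cacheable prompt input.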

For the rate limiting patterns that work alongside caching (caching reduces volume, but you still need rate limiting for spikes), the rate limiting guide covers token buckets and adaptive concurrency. For the complete prompt caching reference, including the extended 1-hour cache TTL (which carries a higher cache-write price than the 5-minute default), see the Anthropic docs.