AI Agent Caching Strategies: Prompt Cache, Semantic Cache, Real Numbers
AI API costs add up fast in production. A naive implementation that sends the full prompt on every request, with no caching, can easily spend 10x more than a well-optimized one doing the same work. Three caching layers are available to production agents, and they're not mutually exclusive.
This guide covers all three with real numbers and real code.
The three caching layers
Prompt caching (provider-side). Anthropic and Google both offer this. You mark parts of your prompt as cacheable. On the first request, the provider processes and caches those tokens. On subsequent requests that reuse the same cached content, you pay a fraction of the normal input token cost. At Anthropic's Claude Sonnet prices, cache reads cost $0.30 per million input tokens versus $3.00 uncached, a 90% discount.
Semantic caching (application-side). Store the responses to previous requests. When a new request is semantically similar to a previous one, return the cached response without calling the model at all. The only remaining cost is an embedding lookup, and latency drops to the time that lookup takes.
Response caching (application-side). A simpler variant: cache exact responses to exact inputs. When the inputs are identical, skip the API call entirely. This works for deterministic use cases where the same question always gets the same answer.
Each layer has different tradeoffs on setup cost, hit rate, and freshness requirements.
Anthropic prompt caching: the 90% discount
Anthropic's prompt cache is the most impactful single optimization for agents that use long system prompts, large document contexts, or frequently repeated tool definitions.
Here's how it works in practice. You add cache_control: {"type": "ephemeral"} to specific content blocks. The first request processes and caches those blocks. Requests within the cache TTL (5 minutes for ephemeral caches) that send the same content hit the cache. You pay $0.30/MTok for cache reads versus $3.00/MTok for regular input tokens.
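The minimum cacheable prompt length is 1,024 tokens on Sonnet models (some models have a higher minimum). Prompts shorter than the minimum are processed normally, without caching, even if you mark them.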
A real example with the system prompt:
import anthropic

client = anthropic.AsyncAnthropic()

LARGE_SYSTEM_PROMPT = """
You are an expert code reviewer...
[2,000+ words of detailed instructions, examples, and guidelines]
"""

async def review_code(code: str) -> str:
    response = await client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1000,
        system=[
            {
                "type": "text",
                "text": LARGE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # Mark for caching
            }
        ],
        messages=[
            {"role": "user", "content": f"Review this code:\n\n{code}"}
        ],
    )
    # Check cache performance in the response
    usage = response.usage
    print(f"Input tokens: {usage.input_tokens}")
    print(f"Cache creation tokens: {getattr(usage, 'cache_creation_input_tokens', 0)}")
    print(f"Cache read tokens: {getattr(usage, 'cache_read_input_tokens', 0)}")
    return response.content[0].text
On the first call, cache_creation_input_tokens is roughly the token count of your system prompt, and those tokens are billed at a 25% premium over the normal input price for writing the cache. On subsequent calls within 5 minutes, the same tokens show up under cache_read_input_tokens instead, billed at 90% off.
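To make those two usage fields easier to read in logs, a small helper can label each response. This is only a sketch: `describe_cache_usage` is a hypothetical name, and the `SimpleNamespace` objects stand in for a real `response.usage`.

```python
from types import SimpleNamespace


def describe_cache_usage(usage) -> str:
    """Label a usage block as a cache write, a cache read, both, or neither."""
    # The fields may be missing or None, so fall back to 0 in both cases.
    created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    if read > 0 and created > 0:
        return "read+write"  # a cached prefix was read and a longer one was written
    if read > 0:
        return "read"
    if created > 0:
        return "write"
    return "uncached"


# Stand-in usage objects (hypothetical values); real ones come from response.usage.
first_call = SimpleNamespace(
    input_tokens=50, cache_creation_input_tokens=2000, cache_read_input_tokens=0
)
second_call = SimpleNamespace(
    input_tokens=50, cache_creation_input_tokens=0, cache_read_input_tokens=2000
)

print(describe_cache_usage(first_call))   # write
print(describe_cache_usage(second_call))  # read
```

If a call you expected to be a read keeps reporting "write" or "uncached", the cached prefix probably changed between requests or the 5-minute TTL lapsed.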
Real numbers from a production code review agent:
| Scenario | Without caching | With prompt cache | Savings |
|---|---|---|---|
| Price per MTok of system prompt input | $3.00 | $0.30 (cache read) | 90% |
| 2,000-token system prompt, 1,000 requests/day | ~$6.00/day | ~$0.60/day | ~$1,971/year |
| 8,000-token system prompt, 1,000 requests/day | ~$24.00/day | ~$2.40/day | ~$7,884/year |
For any agent that uses a detailed system prompt and handles volume, this is the highest-ROI optimization available.
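The daily figures assume every request is a cache read. As a rough sketch of how the numbers shift when some requests miss (for example, when traffic gaps exceed the 5-minute TTL), the function below models the system prompt cost for a given cache read rate. The function name and the `cache_read_rate` parameter are illustrative assumptions; the prices are the ones quoted in this guide, with cache writes at a 25% premium.

```python
BASE_PRICE_PER_MTOK = 3.00    # uncached input price quoted in this guide
CACHE_READ_PER_MTOK = 0.30    # cache read price quoted in this guide
CACHE_WRITE_PER_MTOK = 3.75   # base price plus the 25% cache-write premium


def daily_system_prompt_cost(
    prompt_tokens: int,
    requests_per_day: int,
    cache_read_rate: float = 1.0,
) -> dict:
    """Estimate daily system prompt cost with and without prompt caching.

    cache_read_rate is the fraction of requests served from the cache.
    Requests that miss are assumed to pay the cache-write price.
    """
    total_tokens = prompt_tokens * requests_per_day
    uncached = total_tokens * BASE_PRICE_PER_MTOK / 1_000_000
    read_tokens = total_tokens * cache_read_rate
    write_tokens = total_tokens - read_tokens
    cached = (
        read_tokens * CACHE_READ_PER_MTOK + write_tokens * CACHE_WRITE_PER_MTOK
    ) / 1_000_000
    return {
        "uncached_per_day": round(uncached, 2),
        "cached_per_day": round(cached, 2),
        "annual_savings": round((uncached - cached) * 365, 2),
    }


print(daily_system_prompt_cost(2_000, 1_000))                       # every request hits
print(daily_system_prompt_cost(2_000, 1_000, cache_read_rate=0.8))  # 20% of requests miss
```

Under these assumptions, an 80% read rate raises the cached cost from about $0.60/day to about $1.98/day, so steady traffic matters as much as the discount itself.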
Caching document context
When your agent processes the same documents repeatedly (RAG contexts, code files, reference documents), cache those too:
async def analyze_document_repeatedly(
    document_content: str,
    questions: list[str]
) -> list[str]:
    responses = []
    # Requests run sequentially so each one can read the cache written by the first
    for question in questions:
        response = await client.messages.create(
            model="claude-3-7-sonnet-20250219",
            max_tokens=500,
            messages=[
                {
                    "role": "user",
                    "content": [
                        # Written to the cache on first use, read from it afterwards
                        {
                            "type": "text",
                            "text": f"Document:\n{document_content}",
                            "cache_control": {"type": "ephemeral"},
                        },
                        {
                            "type": "text",
                            "text": f"\nQuestion: {question}",
                        },
                    ],
                }
            ],
        )
        responses.append(response.content[0].text)
    return responses
On the first question, the document content is written to the cache. On questions 2 through N, it is read from the cache. If your document is 10,000 tokens and you ask 20 questions, you pay the cache-write price once and the cache-read price 19 times.
Without caching: 200,000 input tokens at $3.00/MTok = $0.60
With caching: 10,000 tokens written at $3.75/MTok + 190,000 tokens read at $0.30/MTok = $0.0375 + $0.057 = $0.0945
That's roughly an 84% cost reduction on the document tokens for this specific use case.
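Because the first request pays the 25% cache-write premium, caching only pays off once the content is reused. The sketch below works that out for a batch of sequential requests over the same document; `batch_cost` and its parameter names are illustrative, and the multipliers are the ones quoted in this guide (1.25x to write, 0.10x to read).

```python
def batch_cost(
    doc_tokens: int,
    n_requests: int,
    base_price_per_mtok: float = 3.00,
    write_multiplier: float = 1.25,
    read_multiplier: float = 0.10,
) -> tuple[float, float]:
    """Return (uncached_cost, cached_cost) in USD for the document tokens only."""
    if n_requests <= 0:
        return 0.0, 0.0
    price_per_token = base_price_per_mtok / 1_000_000
    uncached = doc_tokens * n_requests * price_per_token
    # One cache write, then (n_requests - 1) cache reads.
    cached = doc_tokens * price_per_token * (
        write_multiplier + read_multiplier * (n_requests - 1)
    )
    return uncached, cached


for n in (1, 2, 20):
    uncached, cached = batch_cost(10_000, n)
    print(f"{n:>2} requests: uncached=${uncached:.4f}, cached=${cached:.4f}")
```

With these multipliers a single request costs slightly more with caching ($0.0375 versus $0.03 for 10,000 tokens), and the cache is already ahead by the second request. At 20 requests the cached total is $0.0945 against $0.60 uncached.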
Caching tool definitions
Agents with many tools pay the tool definition cost on every request. These definitions are often thousands of tokens. Cache them:
TOOLS = [
    {
        "name": "search_database",
        "description": "Search the customer database...",
        # [Long description with examples, 500+ tokens total per tool]
        # The API requires an input_schema on every tool; this one is illustrative
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    # ... 10 more tools
]

async def agent_call_with_cached_tools(
    conversation_history: list[dict]
) -> anthropic.types.Message:
    # Mark the last tool as the cache point
    # Anthropic caches everything up to the last cache_control marker
    tools_with_cache = TOOLS.copy()  # Shallow copy, so TOOLS itself is not modified
    if tools_with_cache:
        tools_with_cache[-1] = {
            **tools_with_cache[-1],
            "cache_control": {"type": "ephemeral"},
        }
    return await client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1000,
        tools=tools_with_cache,
        messages=conversation_history,
    )
If your 10 tools total 5,000 tokens, caching them saves $0.0135 per request on cache hits at the Sonnet prices above ($3.00 versus $0.30 per MTok). At 10,000 requests per day, that's $135/day, or $49,275/year.
Semantic caching
Semantic caching is for when you want to skip the API call entirely when a similar question has been asked before. Two requests are "similar" if their embedding vectors are close in semantic space.
The tradeoff: slightly stale or approximate answers in exchange for skipping the model call. You still pay for one embedding per lookup, and latency drops to roughly the time that lookup takes.
import hashlib
import json
from typing import Optional

import anthropic
import numpy as np
import openai
import redis.asyncio as redis

class SemanticCache:
    def __init__(
        self,
        embedding_client,
        redis_client,
        similarity_threshold: float = 0.92,
        ttl_seconds: int = 3600,
    ):
        self.embedding_client = embedding_client
        self.redis = redis_client
        self.threshold = similarity_threshold
        self.ttl = ttl_seconds

    async def get_embedding(self, text: str) -> list[float]:
        response = await self.embedding_client.embeddings.create(
            model="text-embedding-3-small",  # $0.02/MTok, very cheap
            input=text,
        )
        return response.data[0].embedding

    async def lookup(self, query: str) -> Optional[str]:
        query_embedding = await self.get_embedding(query)
        # Search for similar cached queries
        # In production, use a vector database (Pinecone, pgvector, Qdrant)
        # This simplified version scans all cached embeddings
        best_similarity = 0.0
        best_response_key = None
        # SCAN iterates incrementally; KEYS would block Redis on a large keyspace
        async for key in self.redis.scan_iter(match="semantic_cache:embedding:*"):
            raw = await self.redis.get(key)
            if raw is None:
                continue  # Entry expired between the scan and the read
            stored_embedding = json.loads(raw)
            similarity = self._cosine_similarity(query_embedding, stored_embedding)
            if similarity > best_similarity:
                best_similarity = similarity
                # Keys are bytes unless the client was created with decode_responses=True
                key_str = key.decode() if isinstance(key, bytes) else key
                best_response_key = key_str.replace(
                    "semantic_cache:embedding:",
                    "semantic_cache:response:",
                )
        if best_similarity >= self.threshold and best_response_key:
            cached_response = await self.redis.get(best_response_key)
            if cached_response is not None:
                if isinstance(cached_response, bytes):
                    return cached_response.decode()
                return cached_response
        return None

    async def store(self, query: str, response: str) -> None:
        query_hash = hashlib.sha256(query.encode()).hexdigest()[:16]
        embedding = await self.get_embedding(query)
        embedding_key = f"semantic_cache:embedding:{query_hash}"
        response_key = f"semantic_cache:response:{query_hash}"
        await self.redis.setex(embedding_key, self.ttl, json.dumps(embedding))
        await self.redis.setex(response_key, self.ttl, response)

    def _cosine_similarity(
        self,
        a: list[float],
        b: list[float]
    ) -> float:
        a_arr = np.array(a)
        b_arr = np.array(b)
        return float(
            np.dot(a_arr, b_arr) /
            (np.linalg.norm(a_arr) * np.linalg.norm(b_arr))
        )

# Usage (the client setup below is an assumption; adapt it to your environment):
redis_client = redis.Redis()
anthropic_client = anthropic.AsyncAnthropic()
cache = SemanticCache(
    embedding_client=openai.AsyncOpenAI(),
    redis_client=redis_client,
    similarity_threshold=0.92,
)

async def answer_with_semantic_cache(question: str) -> str:
    # Check cache first
    cached = await cache.lookup(question)
    if cached is not None:
        return cached
    # Cache miss: call the API
    response = await anthropic_client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=500,
        messages=[{"role": "user", "content": question}],
    )
    answer = response.content[0].text
    # Store in cache
    await cache.store(question, answer)
    return answer
Choosing the similarity threshold. 0.92 is a reasonable starting point for question-answering use cases. Higher (0.95+) means fewer cache hits but higher accuracy. Lower (0.85) means more hits but potentially returning a response from a question that's similar but not identical in meaning.
For customer support agents where accuracy matters, use 0.93+. For FAQ-style content where questions are genuinely repetitive, 0.88-0.90 gives better cache hit rates without significant accuracy loss.
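To see how the threshold changes the outcome for a single borderline query, here is a minimal sketch using toy 3-dimensional vectors in place of real embeddings. The vectors and the `best_match` helper are made up for illustration; real embeddings have hundreds or thousands of dimensions, and their similarity distribution depends on the embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query_vec, cached_vecs, threshold):
    """Return (index, similarity) of the closest cached vector.

    The index is None when the best similarity falls below the threshold.
    """
    sims = [cosine_similarity(query_vec, v) for v in cached_vecs]
    best_index = max(range(len(sims)), key=sims.__getitem__)
    best_sim = sims[best_index]
    return (best_index if best_sim >= threshold else None), best_sim


# Toy "embeddings" of two cached questions and one new query (hypothetical values).
cached = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
query = [0.9, 0.3, 0.0]

for threshold in (0.88, 0.92, 0.95):
    index, sim = best_match(query, cached, threshold)
    print(f"threshold={threshold}: similarity={sim:.3f}, hit={index is not None}")
```

The same query, at a similarity of about 0.949, is a hit at 0.88 and 0.92 and a miss at 0.95. It helps to log similarities for real traffic and review borderline pairs by hand before settling on a value.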
In production, use a proper vector database. The Redis scan above is for illustration. For thousands of cached entries, use pgvector (if you're already on Postgres), Qdrant ($0/month self-hosted, $25/month cloud), or Pinecone.
Response caching (exact match)
For the simplest case: when the same exact input always gets the same output, just cache the whole thing.
import hashlib
import json
from typing import Optional

class ResponseCache:
    def __init__(self, redis_client, default_ttl: int = 3600):
        self.redis = redis_client
        self.default_ttl = default_ttl

    def _cache_key(self, model: str, messages: list, **kwargs) -> str:
        # Create a deterministic hash of the full request
        request_data = {
            "model": model,
            "messages": messages,
            **kwargs
        }
        request_str = json.dumps(request_data, sort_keys=True)
        return f"response_cache:{hashlib.sha256(request_str.encode()).hexdigest()}"

    async def get(self, model: str, messages: list, **kwargs) -> Optional[str]:
        key = self._cache_key(model, messages, **kwargs)
        cached = await self.redis.get(key)
        return json.loads(cached) if cached else None

    async def set(
        self,
        model: str,
        messages: list,
        response: str,
        ttl: Optional[int] = None,
        **kwargs
    ) -> None:
        key = self._cache_key(model, messages, **kwargs)
        await self.redis.setex(
            key,
            ttl or self.default_ttl,
            json.dumps(response)
        )
Exact-match response caching is most useful for:
- Content generation with fixed templates (email responses, report sections)
- Classification tasks where the same text gets the same label
- Translation of known strings
- Any deterministic transformation
It's not useful for conversational agents where every response is contextual, or for tasks where freshness matters.
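An exact-match cache only works if identical requests always produce identical keys. The standalone sketch below shows the idea behind the key: serialize the request canonically with `sort_keys=True`, then hash it. The `request_cache_key` name and the `temperature` parameter are illustrative; include in the key whichever request parameters affect the output.

```python
import hashlib
import json


def request_cache_key(model: str, messages: list, **params) -> str:
    """Build a deterministic cache key from the full request."""
    payload = {"model": model, "messages": messages, **params}
    # sort_keys makes the serialization independent of dict insertion order,
    # including inside nested dicts.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return "response_cache:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same request, with dict keys in a different order.
key_a = request_cache_key(
    "claude-3-7-sonnet-20250219",
    [{"role": "user", "content": "Classify: great product"}],
    temperature=0,
)
key_b = request_cache_key(
    "claude-3-7-sonnet-20250219",
    [{"content": "Classify: great product", "role": "user"}],
    temperature=0,
)
# Same messages, different sampling parameter.
key_c = request_cache_key(
    "claude-3-7-sonnet-20250219",
    [{"role": "user", "content": "Classify: great product"}],
    temperature=1,
)

print(key_a == key_b)  # True
print(key_a == key_c)  # False
```

Anything left out of the key, such as the system prompt or sampling parameters, can change without invalidating old entries, so it is safer to hash everything that shapes the response.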
Combining all three layers
The full caching stack for a production agent:
async def production_agent_call(
    conversation_history: list[dict],
    response_cache: ResponseCache,
    semantic_cache: SemanticCache,
    anthropic_client
) -> str:
    current_question = conversation_history[-1]["content"]
    # Layer 1: Exact response cache
    exact_cached = await response_cache.get(
        model="claude-3-7-sonnet-20250219",
        messages=conversation_history
    )
    if exact_cached is not None:
        return exact_cached  # Free, <1ms
    # Layer 2: Semantic cache (single-turn questions only)
    if len(conversation_history) == 1:
        semantic_cached = await semantic_cache.lookup(current_question)
        if semantic_cached is not None:
            return semantic_cached  # $0.00002 for embedding, <50ms
    # Layer 3: API call with prompt cache
    response = await anthropic_client.messages.create(
        model="claude-3-7-sonnet-20250219",
        max_tokens=1000,
        system=[{
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # The long system prompt defined earlier
            "cache_control": {"type": "ephemeral"}  # Provider-side cache
        }],
        messages=conversation_history
    )
    answer = response.content[0].text
    # Store results for future requests
    await response_cache.set(
        model="claude-3-7-sonnet-20250219",
        messages=conversation_history,
        response=answer
    )
    if len(conversation_history) == 1:
        await semantic_cache.store(current_question, answer)
    return answer
This gives you three tiers, ordered from fastest and cheapest to slowest and most expensive: exact cache (free), semantic cache (a fraction of a cent for the embedding), and the API call with prompt caching (90% off the cached prompt tokens).
Measuring cache effectiveness
Track these metrics to understand your cache ROI:
from dataclasses import dataclass
from typing import Optional

import anthropic

@dataclass
class CacheMetrics:
    exact_hits: int = 0
    semantic_hits: int = 0
    prompt_cache_hits_tokens: int = 0
    api_calls: int = 0
    total_input_tokens_saved: int = 0
    estimated_cost_saved_usd: float = 0.0

async def update_metrics(
    metrics: CacheMetrics,
    api_response: Optional[anthropic.types.Message]
) -> None:
    if api_response is not None:
        metrics.api_calls += 1
        usage = api_response.usage
        # The field can be missing or None, so fall back to 0
        cache_read = getattr(usage, 'cache_read_input_tokens', 0) or 0
        metrics.prompt_cache_hits_tokens += cache_read
        # $3.00/MTok uncached vs $0.30/MTok cached = $2.70 savings per MTok
        metrics.estimated_cost_saved_usd += cache_read * 2.70 / 1_000_000
A well-tuned caching layer in a production agent typically delivers:
- Exact cache: 20-40% hit rate for repetitive tasks
- Semantic cache: 15-30% hit rate for FAQ-style agents
- Prompt cache: 70-95% cache read rate for large system prompts
Combined, these can reduce per-request costs by 60-80% versus no caching. For an agent spending $1,000/month on API calls today, that's $600-$800/month in savings with roughly 2-4 days of implementation work.
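As a back-of-the-envelope check on how the three layers combine, the sketch below multiplies the effects together. It is a simplification under stated assumptions, not a measurement: the semantic hit rate is applied only to requests that miss the exact cache, embedding costs and the cache-write premium are ignored, and `cacheable_share` (the share of each API call's cost that comes from cacheable prompt tokens) is a hypothetical parameter you would estimate from your own traffic.

```python
def blended_cost_reduction(
    exact_hit_rate: float,
    semantic_hit_rate: float,
    cacheable_share: float,
    prompt_cache_read_rate: float,
    read_discount: float = 0.90,
) -> float:
    """Estimate the overall fractional cost reduction from all three layers."""
    # Fraction of requests that miss both application-side caches.
    reach_api = (1 - exact_hit_rate) * (1 - semantic_hit_rate)
    # Relative cost of each remaining API call after the prompt cache discount.
    api_cost_multiplier = 1 - cacheable_share * prompt_cache_read_rate * read_discount
    return 1 - reach_api * api_cost_multiplier


reduction = blended_cost_reduction(
    exact_hit_rate=0.30,
    semantic_hit_rate=0.20,
    cacheable_share=0.50,
    prompt_cache_read_rate=0.90,
)
print(f"Estimated cost reduction: {reduction:.1%}")  # Estimated cost reduction: 66.7%
```

With mid-range values from the list above and half of each call's cost coming from cacheable prompt tokens, the estimate lands inside the 60-80% range. It also shows that the result is sensitive to how much of each call is cacheable.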
For the rate limiting patterns that work alongside caching (caching reduces volume, but you still need rate limiting for spikes), the rate limiting guide covers token buckets and adaptive concurrency. For the complete prompt caching reference, including the longer cache TTL option, see the Anthropic docs.