Prompt Caching
Prompt caching reuses computation on repeated prompt prefixes (system instructions, large documents, tool definitions), so subsequent requests cost less and return faster.
Last reviewed: June 2026
Cache pricing, TTLs, and API fields change. Check the Anthropic prompt caching docs and the OpenAI pricing page for current rates.
The problem
Every chat request resends the full context: system prompt, RAG chunks, tool schemas, conversation history. Long static prefixes dominate token cost. Caching stores the model's key-value (KV) state for a prefix, so a later request that starts with the identical prefix can skip recomputing it.
High cache hit rate when:
- System prompt is stable across requests
- Large reference docs prepended unchanged
- Tool definitions fixed per deployment
- Multi-turn chat reuses same system + tools block
Low hit rate when:
- Prefix changes every request (dynamic timestamps, random IDs)
- User-specific docs in the system block
- Frequent tool schema edits
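To see why a changing prefix lowers the hit rate, the sketch below models prefix matching as a plain character comparison. It is an illustration only: providers match on tokens and on their own block boundaries, and `sharedPrefixLength` and the sample prompts are hypothetical, not part of any provider API.

```typescript
// Hypothetical helper: length of the common leading run of two prompts.
// Real providers compare tokens, not characters; this only illustrates the idea.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

const instructions = "You are a support assistant. Follow the policy below.\n";

// Stable prefix: only the trailing user question differs between requests.
const stableA = instructions + "User: How do I reset my password?";
const stableB = instructions + "User: Where is my invoice?";

// Unstable prefix: a timestamp at the very start differs on every request.
const unstableA = "Now: 2026-06-01T10:00:00Z\n" + stableA;
const unstableB = "Now: 2026-06-01T10:00:07Z\n" + stableB;

const stableShared = sharedPrefixLength(stableA, stableB);
const unstableShared = sharedPrefixLength(unstableA, unstableB);

console.log({ instructionsLength: instructions.length, stableShared, unstableShared });
```

With the stable layout the shared run covers the whole instruction block; with the timestamp first, the prompts diverge before the instructions even begin, so nothing after that point can be reused.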
What belongs in the cached prefix
| Cache (static) | Do not cache (dynamic) |
|---|---|
| System instructions | User message |
| Tool JSON schemas | Conversation tail |
| Product docs / policies | RAG chunks that change per query |
| Few-shot examples (fixed) | Current timestamp (include only if needed) |
Structure prompts: static block first, dynamic content last.
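A minimal sketch of that ordering, assuming a generic chat message shape. The `Message`, `StaticContext`, and `DynamicContext` types and the `buildMessages` helper are hypothetical names for illustration and are not tied to any provider SDK.

```typescript
// Minimal message shape for illustration; not a provider SDK type.
type Message = { role: "system" | "user" | "assistant"; content: string };

interface StaticContext {
  instructions: string; // system instructions
  policies: string;     // product docs / policies
  fewShot: string[];    // fixed few-shot examples
}

interface DynamicContext {
  history: Message[];        // conversation tail
  retrievedChunks: string[]; // RAG chunks that change per query
  userMessage: string;
}

// Static content goes first and is joined deterministically so its bytes never vary;
// everything request-specific is appended after it.
function buildMessages(s: StaticContext, d: DynamicContext): Message[] {
  const staticBlock = [s.instructions, s.policies, ...s.fewShot].join("\n\n");
  const dynamicBlock = [...d.retrievedChunks, d.userMessage].join("\n\n");
  return [
    { role: "system", content: staticBlock },
    ...d.history,
    { role: "user", content: dynamicBlock },
  ];
}

const shared: StaticContext = {
  instructions: "You are a support assistant.",
  policies: "Refunds are handled by the billing team.",
  fewShot: ["Q: Hi\nA: Hello, how can I help?"],
};

const first = buildMessages(shared, {
  history: [],
  retrievedChunks: ["Doc: password reset steps"],
  userMessage: "How do I reset my password?",
});
const second = buildMessages(shared, {
  history: [],
  retrievedChunks: ["Doc: invoice download steps"],
  userMessage: "Where is my invoice?",
});

console.log(first[0].content === second[0].content);
```

Both requests begin with a byte-identical system message, which is the property prefix matching depends on; only the final user message differs.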
Anthropic cache_control
Mark cacheable content blocks with cache_control:
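import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

// LONG_SYSTEM_PROMPT and userMessage are assumed to be defined elsewhere.
const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LONG_SYSTEM_PROMPT,
      // Marks the end of the cacheable prefix; content up to this block is cached.
      cache_control: { type: "ephemeral" },
    },
  ],
  // The user message comes after the cached block, so it can vary freely.
  messages: [{ role: "user", content: userMessage }],
});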
Monitor usage.cache_creation_input_tokens and usage.cache_read_input_tokens in responses: the first counts tokens written to the cache, the second counts tokens served from it.
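The sketch below turns those two usage fields into a hit ratio. It assumes, as the Anthropic docs describe at the time of writing, that `input_tokens` counts only the uncached part of the prompt; verify this before relying on the totals. `AnthropicUsage`, `CacheSummary`, and `summarizeCache` are hypothetical local names, not SDK exports.

```typescript
// Usage fields relevant to caching on a Messages API response.
// The cache fields are treated as optional here in case a response omits them.
interface AnthropicUsage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
}

interface CacheSummary {
  totalInputTokens: number;
  cacheWriteTokens: number;
  cacheReadTokens: number;
  hitRatio: number; // share of input tokens served from cache, 0..1
}

function summarizeCache(usage: AnthropicUsage): CacheSummary {
  const cacheReadTokens = usage.cache_read_input_tokens ?? 0;
  const cacheWriteTokens = usage.cache_creation_input_tokens ?? 0;
  // Assumption: input_tokens excludes cached tokens, so the total is the sum of all three.
  const totalInputTokens = usage.input_tokens + cacheReadTokens + cacheWriteTokens;
  const hitRatio = totalInputTokens === 0 ? 0 : cacheReadTokens / totalInputTokens;
  return { totalInputTokens, cacheWriteTokens, cacheReadTokens, hitRatio };
}

// Illustrative numbers: the first request writes the cache, the second reads it.
const coldSummary = summarizeCache({
  input_tokens: 50,
  output_tokens: 200,
  cache_creation_input_tokens: 4000,
  cache_read_input_tokens: 0,
});
const warmSummary = summarizeCache({
  input_tokens: 50,
  output_tokens: 200,
  cache_creation_input_tokens: 0,
  cache_read_input_tokens: 4000,
});

console.log(coldSummary, warmSummary);
```

A sustained hit ratio near zero on a prompt that should be stable usually means something in the prefix is changing between requests.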
Details: Anthropic API.
OpenAI cached tokens
OpenAI automatically caches repeated prompt prefixes on supported models once the prompt reaches a minimum length (often about 1,024 tokens; verify the current threshold in the docs).
| Practice | Effect |
|---|---|
| Keep system prompt identical byte-for-byte | Higher cache hits |
| Put stable content at start of messages array | Matches provider prefix matching |
| Avoid tiny changes (version strings) in system block | One char change busts cache |
Check cached_tokens in API responses (under usage.prompt_tokens_details in Chat Completions) and in the usage dashboard.
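A small sketch for reading that value from a response. It assumes the Chat Completions usage shape, where `cached_tokens` sits under `prompt_tokens_details` and `prompt_tokens` already includes the cached tokens; other OpenAI endpoints may name these fields differently. `OpenAIUsage` and `cachedFraction` are hypothetical local names.

```typescript
// Subset of the Chat Completions usage object relevant to caching (assumed shape).
interface OpenAIUsage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number } | null;
}

// Share of prompt tokens that were served from cache, 0..1.
function cachedFraction(usage: OpenAIUsage): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  // Assumption: prompt_tokens already includes cached tokens, so no extra sum is needed.
  return usage.prompt_tokens === 0 ? 0 : cached / usage.prompt_tokens;
}

const halfCached = cachedFraction({
  prompt_tokens: 2048,
  completion_tokens: 100,
  prompt_tokens_details: { cached_tokens: 1024 },
});
const noDetails = cachedFraction({ prompt_tokens: 500, completion_tokens: 20 });

console.log({ halfCached, noDetails });
```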
RAG + caching
| Pattern | Approach |
|---|---|
| Fixed corpus per tenant | Cache system + corpus in prefix |
| Per-query retrieval | Cache system + tools only; append retrieved chunks after |
| Frequently asked docs | Cache doc block with cache_control |
See RAG for Codebases for retrieval design.
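For the per-query retrieval row, a request could be assembled roughly as below: the system block carries `cache_control`, and the retrieved chunks travel in the user message, after the cached prefix. This is a sketch using locally defined types that mirror the request shape in the Anthropic example earlier; `buildRagRequest` is a hypothetical helper, and a `tools` array is omitted for brevity.

```typescript
type CacheControl = { type: "ephemeral" };
type TextBlock = { type: "text"; text: string; cache_control?: CacheControl };

// Local stand-in for the request parameters; not an SDK type.
interface RequestParams {
  model: string;
  max_tokens: number;
  system: TextBlock[];
  messages: { role: "user" | "assistant"; content: TextBlock[] }[];
}

// Hypothetical helper: cached system prompt, uncached per-query chunks.
function buildRagRequest(systemPrompt: string, chunks: string[], question: string): RequestParams {
  return {
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    // Static block: identical on every request, so it is marked cacheable.
    system: [{ type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } }],
    messages: [
      {
        role: "user",
        // Dynamic blocks: retrieved chunks first, then the question, none marked for caching.
        content: [
          ...chunks.map((c): TextBlock => ({ type: "text", text: c })),
          { type: "text", text: question },
        ],
      },
    ],
  };
}

const systemPrompt = "Answer using only the provided documents.";
const reqA = buildRagRequest(systemPrompt, ["Doc A1", "Doc A2"], "How do refunds work?");
const reqB = buildRagRequest(systemPrompt, ["Doc B1"], "How do I change my plan?");

console.log(JSON.stringify(reqA.system) === JSON.stringify(reqB.system));
```

The system blocks serialize identically across queries even though the retrieved chunks differ, so the cached prefix is unaffected by retrieval.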
Production concerns
| Concern | What to do |
|---|---|
| Cost | Measure cache read vs creation tokens in staging |
| Latency | Cache reads reduce TTFT on long prefixes |
| TTL | Anthropic ephemeral cache entries expire after a short idle period (about 5 minutes by default; verify), so steady traffic keeps them warm |
| Invalidation | Any change to the prefix misses the old entry and writes a new one; version the system prompt so the resulting cache-write spike is traceable |
| Security | Do not cache user-specific secrets in shared prefix |
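One way to make prefix changes visible, relevant to the invalidation row, is to log a short fingerprint of the static prefix with each request. The sketch below uses a non-cryptographic FNV-1a variant computed over UTF-16 code units; `fingerprint` is a hypothetical helper, and any stable hash would serve the same purpose.

```typescript
// FNV-1a (32-bit) over UTF-16 code units. Non-cryptographic: a change detector only,
// not suitable for security purposes.
function fingerprint(text: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  // Convert to unsigned and render as 8 hex digits.
  return (hash >>> 0).toString(16).padStart(8, "0");
}

const promptV1 = "You are a support assistant. Policy version 1.";
const promptV2 = "You are a support assistant. Policy version 2.";

const fpV1 = fingerprint(promptV1);
const fpV1Again = fingerprint(promptV1);
const fpV2 = fingerprint(promptV2);

console.log({ unchanged: fpV1 === fpV1Again, changed: fpV1 !== fpV2 });
```

Logging the fingerprint next to the cache token counts lets a drop in hit rate be matched to the deploy that changed the prompt.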
Log cache metrics in LLM Observability.
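To compare cache reads against cache creation in cost terms, a rough estimate can weight each token class by a multiplier. All prices and multipliers below are placeholder assumptions for illustration, not current rates; substitute the values from the provider's pricing page. `CachePricing`, `InputUsage`, and `inputCost` are hypothetical names.

```typescript
// Placeholder pricing model: multipliers are relative to the base input-token price.
interface CachePricing {
  basePerMTok: number;     // price per million uncached input tokens
  writeMultiplier: number; // assumed surcharge for writing to the cache
  readMultiplier: number;  // assumed discount for reading from the cache
}

interface InputUsage {
  uncached: number;
  cacheWrite: number;
  cacheRead: number;
}

// Estimated input cost for one request under the assumed pricing.
function inputCost(u: InputUsage, p: CachePricing): number {
  const weightedTokens =
    u.uncached + u.cacheWrite * p.writeMultiplier + u.cacheRead * p.readMultiplier;
  return (weightedTokens / 1_000_000) * p.basePerMTok;
}

// ASSUMED values for illustration only.
const pricing: CachePricing = { basePerMTok: 3, writeMultiplier: 1.25, readMultiplier: 0.1 };

// A 10,000-token static prefix plus a 100-token user message.
const noCacheCost = inputCost({ uncached: 10_100, cacheWrite: 0, cacheRead: 0 }, pricing);
const coldCost = inputCost({ uncached: 100, cacheWrite: 10_000, cacheRead: 0 }, pricing);
const warmCost = inputCost({ uncached: 100, cacheWrite: 0, cacheRead: 10_000 }, pricing);

console.log({ noCacheCost, coldCost, warmCost });
```

Under these assumptions the first (cold) request costs more than an uncached one and every warm request costs less, so caching pays off only when a prefix is reused before it expires.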
Coding agents (IDE)
IDE agents cache less predictably — rules and open files change per session. Prompt caching matters most for shipped product features and batch API jobs, not single Cursor sessions.