1. ai
  2. /building
  3. /prompt-caching

Prompt Caching

Prompt caching reuses computation on repeated prompt prefixes — system instructions, large docs, tool definitions — so you pay less and respond faster on subsequent requests.

Last reviewed: June 2026

Cache pricing, TTLs, and API fields change. Verify Anthropic prompt caching and OpenAI pricing for current rates.

The problem

Every chat request resends the full context: system prompt, RAG chunks, tool schemas, conversation history. Long static prefixes dominate token cost. Caching stores the KV state for identical prefix segments.

High cache hit rate when:

  • System prompt is stable across requests
  • Large reference docs prepended unchanged
  • Tool definitions fixed per deployment
  • Multi-turn chat reuses same system + tools block

Low hit rate when:

  • Prefix changes every request (dynamic timestamps, random IDs)
  • User-specific docs in the system block
  • Frequent tool schema edits

What belongs in the cached prefix

Cache (static)Do not cache (dynamic)
System instructionsUser message
Tool JSON schemasConversation tail
Product docs / policiesRAG chunks that change per query
Few-shot examples (fixed)Current timestamp unless needed

Structure prompts: static block first, dynamic content last.

Anthropic cache_control

Mark cacheable content blocks with cache_control:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LONG_SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [{ role: "user", content: userMessage }],
});

Monitor usage.cache_creation_input_tokens and cache_read_input_tokens in responses.

Details: Anthropic API.

OpenAI cached tokens

OpenAI automatically caches repeated prompt prefixes on supported models (verify minimum token threshold in docs — often ~1,024 tokens).

PracticeEffect
Keep system prompt identical byte-for-byteHigher cache hits
Put stable content at start of messages arrayMatches provider prefix matching
Avoid tiny changes (version strings) in system blockOne char change busts cache

Check usage dashboards for cached_tokens in API responses.

RAG + caching

PatternApproach
Fixed corpus per tenantCache system + corpus in prefix
Per-query retrievalCache system + tools only; append retrieved chunks after
Frequently asked docsCache doc block with cache_control

See RAG for Codebases for retrieval design.

Production concerns

ConcernWhat to do
CostMeasure cache read vs creation tokens in staging
LatencyCache reads reduce TTFT on long prefixes
TTLAnthropic ephemeral cache expires (~5 min default — verify)
InvalidationBump cache key version when system prompt changes
SecurityDo not cache user-specific secrets in shared prefix

Log cache metrics in LLM Observability.

Coding agents (IDE)

IDE agents cache less predictably — rules and open files change per session. Prompt caching matters most for shipped product features and batch API jobs, not single Cursor sessions.