Prompt Caching for LLM APIs

Prompt caching reuses computation on repeated prompt prefixes (system instructions, large docs, tool definitions), so subsequent requests cost less and return faster.

Last reviewed: June 2026

Cache pricing, TTLs, and API fields change. Verify Anthropic prompt caching and OpenAI pricing for current rates.

The problem

Every chat request resends the full context: system prompt, RAG chunks, tool schemas, conversation history. Long static prefixes dominate token cost. Caching stores the key-value (KV) state computed for a prefix, so a later request that starts with the identical prefix skips that recomputation.

High cache hit rate when:

System prompt is stable across requests
Large reference docs prepended unchanged
Tool definitions fixed per deployment
Multi-turn chat reuses same system + tools block

Low hit rate when:

Prefix changes every request (dynamic timestamps, random IDs)
User-specific docs in the system block
Frequent tool schema edits
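
To see why a dynamic value near the top of the prompt hurts, the sketch below compares how much leading text two prompts share. `sharedPrefixLength` is a hypothetical helper, and it counts characters while providers match on tokens, so treat it as a rough proxy rather than how any provider computes cache hits.

```typescript
// Hypothetical helper: number of leading characters two prompts have in common.
// Providers match on tokens, not characters, so this is only an approximation.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a[i] === b[i]) i++;
  return i;
}

const instructions = "You are a support assistant. Follow the refund policy.";

// Timestamp first: the two prompts diverge inside the first line.
const early1 = `Now: 2026-06-01T10:00:00Z\n${instructions}`;
const early2 = `Now: 2026-06-01T10:05:00Z\n${instructions}`;

// Timestamp last: the whole instruction block remains a shared prefix.
const late1 = `${instructions}\nNow: 2026-06-01T10:00:00Z`;
const late2 = `${instructions}\nNow: 2026-06-01T10:05:00Z`;

console.log("timestamp first:", sharedPrefixLength(early1, early2));
console.log("timestamp last:", sharedPrefixLength(late1, late2));
```

With the timestamp first, the shared span ends before the instructions even begin. With it last, the full instruction block stays reusable.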

What belongs in the cached prefix

Cache (static)	Do not cache (dynamic)
System instructions	User message
Tool JSON schemas	Conversation tail
Product docs / policies	RAG chunks that change per query
Few-shot examples (fixed)	Current timestamp unless needed

Structure prompts: static block first, dynamic content last.
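
A minimal sketch of that ordering, assuming a chat-style messages array. `STATIC_SYSTEM`, `buildMessages`, and the Acme policy text are hypothetical names and content used only for illustration; the point is that the system block is built from constants while anything request-specific goes in the final message.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Static block: identical for every request in a deployment (hypothetical content).
const STATIC_SYSTEM = [
  "You are a support assistant for Acme.",
  "Policy: refunds are accepted within 30 days.",
].join("\n");

// Hypothetical builder: static block first, history next, dynamic content last.
function buildMessages(
  history: ChatMessage[],
  userInput: string,
  nowIso?: string,
): ChatMessage[] {
  // The timestamp, if needed at all, travels with the user turn, not the system block.
  const dynamicTail = nowIso
    ? `${userInput}\n\n(Current time: ${nowIso})`
    : userInput;
  return [
    { role: "system", content: STATIC_SYSTEM },
    ...history,
    { role: "user", content: dynamicTail },
  ];
}

const first = buildMessages([], "Where is my order?", "2026-06-01T10:00:00Z");
const second = buildMessages([], "Can I get a refund?", "2026-06-01T10:05:00Z");
console.log(first[0].content === second[0].content); // the system block does not vary
```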

Anthropic cache_control

Mark cacheable content blocks with cache_control:

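import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

// Placeholders for illustration: supply your own static prompt and user input.
// Prefixes shorter than the model's minimum cacheable length are not cached (verify in docs).
const LONG_SYSTEM_PROMPT = "...long, stable system instructions...";
const userMessage = "How do I reset my password?";

const response = await client.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  system: [
    {
      type: "text",
      text: LONG_SYSTEM_PROMPT,
      // Marks the end of the cacheable prefix.
      cache_control: { type: "ephemeral" },
    },
  ],
  // Dynamic content stays after the cached prefix.
  messages: [{ role: "user", content: userMessage }],
});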

Monitor usage.cache_creation_input_tokens and cache_read_input_tokens in responses.
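
One way to turn those two fields into a single number for logging is sketched below. `summarizeCacheUsage` and the `CacheUsage` type are hypothetical; the sketch assumes `input_tokens` counts only the uncached part of the prompt, so the three fields sum to the full input. Confirm that against the current API reference before relying on the ratio.

```typescript
// Only the usage fields this sketch reads; the SDK's own type has more,
// and the cache fields may be absent or null on some responses.
type CacheUsage = {
  input_tokens: number;
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

// Hypothetical helper: share of input tokens served from cache.
function summarizeCacheUsage(usage: CacheUsage) {
  const written = usage.cache_creation_input_tokens ?? 0;
  const read = usage.cache_read_input_tokens ?? 0;
  // Assumption: input_tokens excludes cached tokens, so these add up to the whole prompt.
  const totalInput = usage.input_tokens + written + read;
  const readRatio = totalInput === 0 ? 0 : read / totalInput;
  return { written, read, totalInput, readRatio };
}

// Illustrative numbers only: first request writes the cache, second reads it.
console.log(summarizeCacheUsage({ input_tokens: 50, cache_creation_input_tokens: 950, cache_read_input_tokens: 0 }));
console.log(summarizeCacheUsage({ input_tokens: 50, cache_creation_input_tokens: 0, cache_read_input_tokens: 950 }));
```

In application code, the argument would be `response.usage` from the request above. A ratio that stays near zero across repeated requests usually means the prefix is changing or is too short to cache.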

Details: Anthropic API.

OpenAI cached tokens

OpenAI automatically caches repeated prompt prefixes on supported models, with no request changes needed. Verify the minimum prompt length in the docs (often about 1,024 tokens).

Practice	Effect
Keep system prompt identical byte-for-byte	Higher cache hits
Put stable content at start of messages array	Matches provider prefix matching
Avoid tiny changes (version strings) in system block	One char change busts cache

Check cached_tokens in API responses (under usage.prompt_tokens_details in Chat Completions) and track it on your usage dashboards.
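
A small sketch for reading that count, assuming the Chat Completions usage shape where it appears as `usage.prompt_tokens_details.cached_tokens`. `cachedFraction` and the `OpenAIUsage` type are hypothetical, and other endpoints may name the field differently, so check the response schema you actually receive.

```typescript
// Only the fields this sketch reads from a Chat Completions usage object.
type OpenAIUsage = {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number } | null;
};

// Hypothetical helper: fraction of prompt tokens that were served from cache.
function cachedFraction(usage: OpenAIUsage): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return usage.prompt_tokens === 0 ? 0 : cached / usage.prompt_tokens;
}

// Illustrative numbers only.
console.log(cachedFraction({ prompt_tokens: 2048, prompt_tokens_details: { cached_tokens: 1024 } }));
console.log(cachedFraction({ prompt_tokens: 300 })); // details missing: treated as no cache hit
```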

RAG + caching

Pattern	Approach
Fixed corpus per tenant	Cache system + corpus in prefix
Per-query retrieval	Cache system + tools only; append retrieved chunks after
Frequently asked docs	Cache doc block with `cache_control`

See RAG for Codebases for retrieval design.
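
For the per-query retrieval row, a request could be assembled roughly as follows. This is a sketch, not a prescribed layout: `buildRagRequest` and the `<doc>` wrapper format are hypothetical, and the returned object mirrors the parameters passed to `client.messages.create` earlier on this page.

```typescript
type TextBlock = {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
};

// Hypothetical builder: stable system prompt is cached, retrieved chunks are not.
function buildRagRequest(systemPrompt: string, chunks: string[], question: string) {
  const system: TextBlock[] = [
    // Cache breakpoint on the static block only.
    { type: "text", text: systemPrompt, cache_control: { type: "ephemeral" } },
  ];
  // Retrieved chunks change per query, so they go after the cached prefix.
  const context = chunks
    .map((chunk, i) => `<doc index="${i + 1}">\n${chunk}\n</doc>`)
    .join("\n");
  return {
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system,
    messages: [
      { role: "user" as const, content: `${context}\n\nQuestion: ${question}` },
    ],
  };
}

const ragRequest = buildRagRequest(
  "Answer using only the provided documents.",
  ["Refunds are accepted within 30 days."],
  "What is the refund window?",
);
console.log(JSON.stringify(ragRequest, null, 2));
```

Keeping retrieved text out of the system block means two different queries still share the same cached prefix.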

Production concerns

Concern	What to do
Cost	Measure cache read vs creation tokens in staging
Latency	Cache reads reduce TTFT on long prefixes
TTL	Anthropic ephemeral cache expires after about 5 minutes by default (verify)
Invalidation	Any edit to the cached prefix starts a new cache entry; version the system prompt so changes ship deliberately
Security	Do not cache user-specific secrets in shared prefix

Log cache metrics in LLM Observability.
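
To reason about the cost row before measuring, the arithmetic can be sketched as below. Everything here is an assumption: `prefixCost` is a hypothetical helper, the rates are caller-supplied placeholders rather than current pricing, and the model assumes the first request writes the cache and every later request hits it before the TTL expires.

```typescript
type CacheRates = {
  basePerMTok: number; // price per million uncached input tokens
  writeMultiplier: number; // cache write price relative to base
  readMultiplier: number; // cache read price relative to base
};

// Hypothetical estimate of what a static prefix costs with and without caching.
function prefixCost(prefixTokens: number, requests: number, rates: CacheRates) {
  const perToken = rates.basePerMTok / 1e6;
  const uncached = requests * prefixTokens * perToken;
  // One write on the first request, then reads on the remaining requests.
  const cached =
    requests === 0
      ? 0
      : prefixTokens *
        perToken *
        (rates.writeMultiplier + (requests - 1) * rates.readMultiplier);
  return { uncached, cached, savings: uncached - cached };
}

// Placeholder rates for illustration only; look up real prices before using this.
const exampleRates: CacheRates = { basePerMTok: 3, writeMultiplier: 1.25, readMultiplier: 0.1 };
console.log(prefixCost(10_000, 1, exampleRates)); // a single request pays the write premium
console.log(prefixCost(10_000, 100, exampleRates)); // repeated requests amortize it
```

Whenever the write multiplier is above 1, a prefix that is only ever sent once costs more with caching than without, which is why measuring real read versus creation tokens in staging matters.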

Coding agents (IDE)

IDE agents cache less predictably because rules and open files change per session. Prompt caching matters most for shipped product features and batch API jobs, not for a single Cursor session.
