Cost, Latency, and Tokens
Tokens are the metered unit for cloud LLMs. Ignoring cost works until the invoice arrives.
Token Basics
| Term | Meaning |
|---|---|
| Input tokens | Everything you send — system prompt, history, RAG chunks |
| Output tokens | Model response (often priced higher) |
| Context window | Max input + output combined |
| Cached tokens | Re-used input at reduced cost (provider-dependent) |
Rough rule: 1 token ≈ 4 characters in English (varies by language and tokenizer).
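The 4-characters rule is good enough for a quick budget check before a request goes out. A minimal sketch, assuming English text: `estimateTokens`, `estimateCostUsd`, and the prices in the usage lines are hypothetical placeholders, not any provider's real rates. For exact counts, use the provider's own tokenizer.

```typescript
// Rough heuristic from the rule above: ~4 characters per token in English.
// Real counts depend on the tokenizer; treat this as a budget estimate only.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Prices are expressed in USD per million tokens.
interface PricePerMillion {
  input: number;
  output: number;
}

function estimateCostUsd(
  inputTokens: number,
  outputTokens: number,
  price: PricePerMillion
): number {
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Usage: a 4,000-character prompt with a 500-token reply.
// The { input: 3, output: 15 } prices are placeholders, not real rates.
const estimatedPromptTokens = estimateTokens("a".repeat(4000));
const estimatedCost = estimateCostUsd(estimatedPromptTokens, 500, {
  input: 3,
  output: 15,
});
console.log(estimatedPromptTokens, estimatedCost.toFixed(4));
```

Because output tokens are often priced higher, the reply can dominate the cost even when the prompt is longer.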
Model Selection
| Tier | Examples (2026) | Use when |
|---|---|---|
| Fast / cheap | Haiku, GPT-4o-mini, small open models | Classification, routing, simple edits |
| Balanced | Sonnet, GPT-4o | Most coding and chat features |
| Premium | Opus, o3, o4-mini (reasoning) | Hard bugs, architecture, low-volume critical tasks |
For dev tooling: use premium models for Plan/review, cheaper models for boilerplate if your tool supports it.
For production chat: default to balanced; escalate to premium only when needed.
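The tier table can be expressed as a small routing function. This is a sketch under assumptions: the `TaskKind` labels, the `DEFAULT_TIER` mapping, and the one-step escalation rule are illustrative choices, not part of any SDK.

```typescript
type Tier = "fast" | "balanced" | "premium";

type TaskKind =
  | "classification"
  | "routing"
  | "simple-edit"
  | "chat"
  | "coding"
  | "hard-bug"
  | "architecture";

// Hypothetical task-to-tier mapping that mirrors the table above.
const DEFAULT_TIER: Record<TaskKind, Tier> = {
  classification: "fast",
  routing: "fast",
  "simple-edit": "fast",
  chat: "balanced",
  coding: "balanced",
  "hard-bug": "premium",
  architecture: "premium",
};

// Pick the default tier, moving up one step only when the caller signals
// that the cheaper tier was not good enough (for example, a failed check).
function pickTier(task: TaskKind, options: { escalate?: boolean } = {}): Tier {
  const base = DEFAULT_TIER[task];
  if (!options.escalate) return base;
  return base === "fast" ? "balanced" : "premium";
}

console.log(pickTier("classification")); // default tier for a cheap task
console.log(pickTier("chat", { escalate: true })); // escalated one step
```

Map each tier to a concrete model ID in one place so that a model upgrade is a single-line change.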
Prompt Caching
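When the same large system prompt (docs, context, instructions) repeats across requests, caching it cuts the cost of those input tokens by up to ~90% on cache hits. The discount is provider-dependent, and some providers charge a premium on the request that first writes the cache.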
Anthropic — mark stable content with cacheControl:
import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";
// The first request writes the system prompt to the cache (~5 min TTL,
// refreshed on each hit). Later requests with the same prefix read from it.
const result = streamText({
  model: anthropic("claude-sonnet-4-20250514"),
  messages: [
    {
      role: "system",
      content: `You are a support agent.\n\n${longProductDocs}`, // stable, cache candidate
      // The AI SDK docs set the cache breakpoint on the message, not on the call.
      providerOptions: {
        anthropic: { cacheControl: { type: "ephemeral" } },
      },
    },
    ...messages,
  ],
});
OpenAI — prompt caching is automatic for prompts ≥1,024 tokens that share a common prefix. The API response includes usage.prompt_tokens_details.cached_tokens:
import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";
const result = streamText({
  model: openai("gpt-4o"),
  // OpenAI caches automatically when the same prefix recurs. Cached prefixes
  // usually expire after a few minutes of inactivity (about an hour at most).
  system: `You are a support agent.\n\n${longProductDocs}`,
  messages,
});
// Inspect cached token counts from provider metadata (non-streaming generateText):
// result.providerMetadata?.openai?.cachedPromptTokens
Reference: Anthropic prompt caching · OpenAI prompt caching.
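Whether caching pays off depends on the hit rate, because a cache write can cost more than a normal input token. The arithmetic can be sketched as below. The multipliers (1.25 for a write, 0.1 for a read, relative to the base input price) are assumptions for illustration; check your provider's current pricing before relying on them.

```typescript
// Assumed multipliers relative to the base input price.
// These are illustrative values, not a guaranteed price list.
interface CacheMultipliers {
  write: number; // cost of a request that writes the prefix to the cache
  read: number; // cost of a request that reads the prefix from the cache
}

const ASSUMED_MULTIPLIERS: CacheMultipliers = { write: 1.25, read: 0.1 };

// Expected cost of the cached prefix per request, as a fraction of what the
// same prefix would cost with no caching at all.
function cachedPrefixCostFactor(
  hitRate: number,
  m: CacheMultipliers = ASSUMED_MULTIPLIERS
): number {
  if (hitRate < 0 || hitRate > 1) {
    throw new RangeError("hitRate must be between 0 and 1");
  }
  return hitRate * m.read + (1 - hitRate) * m.write;
}

// Hit rate above which caching is cheaper than sending the prefix uncached.
function breakEvenHitRate(m: CacheMultipliers = ASSUMED_MULTIPLIERS): number {
  return (m.write - 1) / (m.write - m.read);
}

console.log(cachedPrefixCostFactor(0.9).toFixed(3));
console.log(breakEvenHitRate().toFixed(3));
```

Under these assumptions, caching only loses money when most requests miss, which usually means traffic is too sparse for the cache TTL.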
History Truncation
Every message in the conversation history adds input tokens. Without truncation, a long chat session becomes expensive and slow.
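// lib/truncate-messages.ts
import type { CoreMessage } from "ai";
/**
 * Keep any system messages plus the last N conversation messages.
 * For production: summarize dropped turns instead of discarding them.
 */
export function truncateHistory(
  messages: CoreMessage[],
  maxMessages = 20
): CoreMessage[] {
  if (messages.length <= maxMessages) return messages;
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  // Math.max guards against slice(-0), which would return the whole array
  const recent = rest.slice(-Math.max(maxMessages, 1));
  // Start the window on a user turn so user/assistant pairs stay intact
  const firstUser = recent.findIndex((m) => m.role === "user");
  return [...system, ...recent.slice(Math.max(firstUser, 0))];
}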
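// app/api/chat/route.ts
import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";
// Path alias assumed; adjust the import to match your project layout
import { truncateHistory } from "@/lib/truncate-messages";
export async function POST(req: Request) {
  const { messages } = await req.json();
  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    messages: truncateHistory(messages, 20),
    maxTokens: 1024,
  });
  return result.toDataStreamResponse();
}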
For long-running assistants, summarize the dropped history rather than discarding it: "Summarize the first 10 turns in 3 sentences" → inject summary as a system message.
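The summarize-then-inject approach can be sketched as follows. The `ChatMessage` type, `compactHistory`, and the `summarize` callback are hypothetical names. In practice `summarize` would call a cheap model with a prompt like the one quoted above, but it is injected here so the sketch stays independent of any SDK.

```typescript
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical callback: turns the dropped messages into a short summary.
type Summarizer = (dropped: ChatMessage[]) => Promise<string>;

// Replace everything except the last `keepRecent` messages with one
// system message that carries a summary of the dropped turns.
async function compactHistory(
  messages: ChatMessage[],
  keepRecent: number,
  summarize: Summarizer
): Promise<ChatMessage[]> {
  // Nothing to compact (also avoids slice(-0), which returns the whole array).
  if (keepRecent <= 0 || messages.length <= keepRecent) return messages;

  const dropped = messages.slice(0, -keepRecent);
  const recent = messages.slice(-keepRecent);
  const summary = await summarize(dropped);

  return [
    { role: "system", content: `Summary of earlier conversation: ${summary}` },
    ...recent,
  ];
}
```

Summarizing costs one extra model call, so it is worth caching the summary per session and regenerating it only when more turns are dropped.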
Capping Output
Always set maxTokens on user-facing routes. Uncapped streaming can run for minutes and drive up costs.
const result = streamText({
  model: openai("gpt-4o"),
  messages,
  maxTokens: 1024, // hard cap on response length
  temperature: 0.7,
});
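For reasoning models (o3, o4-mini, Claude with extended thinking), budget thinking tokens separately: they are billed as output tokens, so a cap sized only for the visible answer can be used up by reasoning before any answer text is produced.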
Request Logging Middleware
Log token counts per request to track spend and detect runaway loops or abuse:
// lib/llm-logger.ts
export interface LLMLogEntry {
  timestamp: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  cachedTokens?: number;
  latencyMs: number;
  userId?: string;
  sessionId?: string;
}
export function logLLMUsage(entry: LLMLogEntry) {
  // Replace with your observability stack (Datadog, Axiom, custom DB)
  console.log(JSON.stringify({ event: "llm_request", ...entry }));
}
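// app/api/chat/route.ts: log usage in onFinish, once the stream completes
import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";
// Path aliases assumed; adjust the imports to match your project layout
import { logLLMUsage } from "@/lib/llm-logger";
import { truncateHistory } from "@/lib/truncate-messages";
export async function POST(req: Request) {
  const { messages, sessionId } = await req.json();
  const start = Date.now();
  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    messages: truncateHistory(messages),
    maxTokens: 1024,
    onFinish: ({ usage }) => {
      logLLMUsage({
        timestamp: new Date().toISOString(),
        model: "claude-sonnet-4-20250514",
        inputTokens: usage.promptTokens,
        outputTokens: usage.completionTokens,
        latencyMs: Date.now() - start, // total time until the stream finished
        sessionId,
      });
    },
  });
  return result.toDataStreamResponse();
}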
Alert on input token spikes (retry loops) or output token spikes (runaway generation).
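One simple way to flag such spikes is to compare each request against a rolling mean of recent requests. The sketch below is an assumption-laden starting point: `SpikeDetector`, its window size, the 3x threshold, and the minimum sample count are all hypothetical defaults to tune against your own traffic.

```typescript
// Flags a value as a spike when it exceeds `factor` times the rolling mean
// of the previous samples. Keeps only the most recent `windowSize` samples.
class SpikeDetector {
  private readonly samples: number[] = [];
  private readonly windowSize: number;
  private readonly factor: number;
  private readonly minSamples: number;

  constructor(windowSize = 50, factor = 3, minSamples = 10) {
    this.windowSize = windowSize;
    this.factor = factor;
    this.minSamples = minSamples;
  }

  // Returns true if `value` is a spike relative to earlier samples,
  // then records it so later calls see it in the rolling window.
  observe(value: number): boolean {
    const count = this.samples.length;
    const mean =
      count > 0 ? this.samples.reduce((sum, v) => sum + v, 0) / count : 0;
    // Do not alert until there is enough history to form a baseline.
    const isSpike = count >= this.minSamples && value > mean * this.factor;

    this.samples.push(value);
    if (this.samples.length > this.windowSize) this.samples.shift();
    return isSpike;
  }
}

// Usage: one detector per metric, fed from the logging callback.
const inputTokenSpikes = new SpikeDetector();
if (inputTokenSpikes.observe(1200)) {
  console.warn("input token spike");
}
```

Running separate detectors for input and output tokens keeps the two failure modes (retry loops and runaway generation) distinguishable in alerts.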
Cost Control Tactics
| Tactic | Savings |
|---|---|
| Prompt caching | ~90% on cached input tokens for stable system prompts |
| Truncate history | Reduces input tokens per turn in long sessions |
| Smaller RAG chunks | Fewer input tokens per query |
| maxTokens cap | Prevents runaway outputs |
| Rate limits per user | Prevent abuse before provider org limits kick in |
| Batch API for offline jobs | ~50% cheaper than interactive endpoint (Anthropic, OpenAI) |
| Route to cheaper model | Use haiku/mini for classification; escalate to sonnet only when needed |
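Per-user rate limiting from the table can be sketched with a fixed-window counter. `FixedWindowLimiter` is a hypothetical in-memory helper that only works within a single process; a production deployment would more likely keep the counters in a shared store. The `now` parameter is injectable so the behavior can be tested without waiting.

```typescript
// Allows at most `limit` requests per user in each window of `windowMs`.
class FixedWindowLimiter {
  private readonly counters = new Map<
    string,
    { windowStart: number; used: number }
  >();
  private readonly limit: number;
  private readonly windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true if the request may proceed, false if the user is over limit.
  allow(userId: string, now: number = Date.now()): boolean {
    if (this.limit <= 0) return false;

    const entry = this.counters.get(userId);
    // Start a fresh window on first use or once the old window has elapsed.
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.counters.set(userId, { windowStart: now, used: 1 });
      return true;
    }
    if (entry.used >= this.limit) return false;
    entry.used += 1;
    return true;
  }
}

// Usage: 20 requests per user per minute, checked before calling the model.
const chatLimiter = new FixedWindowLimiter(20, 60_000);
if (!chatLimiter.allow("user-123")) {
  console.warn("rate limit exceeded for user-123");
}
```

Rejecting the request before the model call is what saves money; a route would typically respond with HTTP 429 at this point.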
Latency vs Quality
| Approach | Latency | Quality |
|---|---|---|
| Streaming | Feels fast (first token ~200ms–2s) | Same final quality |
| Smaller model | Lower | May miss nuance |
| Edge deployment | Lower for global users | Cold start considerations |
| No RAG | Faster | More hallucination on internal facts |
| Heavy RAG (top 20 chunks) | Slower | Better grounding |
Measure p95 time-to-first-token and p95 total response time in production. For reasoning models, add p95 thinking time.
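Computing p95 from logged latencies takes only a few lines. The sketch below uses the nearest-rank method, which is one of several common percentile definitions; `percentile` and the sample values are illustrative, and most observability stacks provide an equivalent aggregation out of the box.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it. `p` is in the range (0, 100].
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile needs at least one sample");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");

  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to keep the rank exact for whole-number inputs.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Usage: time-to-first-token samples in milliseconds (made-up values).
const ttftSamplesMs = [220, 310, 280, 1900, 260, 240, 300, 270, 250, 290];
console.log("p95 TTFT (ms):", percentile(ttftSamplesMs, 95));
console.log("p50 TTFT (ms):", percentile(ttftSamplesMs, 50));
```

A single slow outlier barely moves the median but dominates p95, which is why p95 is the more useful number for user-facing latency.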
Dev Tool Usage Costs
Cursor, Copilot, and Claude Code bundle or meter premium requests separately from raw API pricing. Track:
- Requests per developer per day
- Premium model vs default usage
- Whether agents re-read entire files each turn (context cost)
Related
- Model Picker Cheat Sheet — task-to-model quick reference
- LLM APIs
- Anthropic API for Web Developers
- OpenAI API for Web Developers
- Context Engineering
- Choosing a Tool