Cost, Latency, and Tokens

Tokens are the metered unit for cloud LLMs. Ignoring cost works until the invoice arrives.

Token Basics

Term	Meaning
Input tokens	Everything you send — system prompt, history, RAG chunks
Output tokens	Model response (often priced higher)
Context window	Max input + output combined
Cached tokens	Re-used input at reduced cost (provider-dependent)

Rough rule: 1 token ≈ 4 characters in English (varies by language and tokenizer).
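The rule of thumb above can be turned into a quick pre-flight estimate. This is a minimal sketch, assuming English text and the 4-characters-per-token heuristic; `estimateTokens` and `fitsBudget` are hypothetical helper names, and a real tokenizer will give different counts.

```typescript
// Rough heuristic only: ~4 characters per token for English text.
// Actual counts depend on the model's tokenizer and the language.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Hypothetical guard: check an input against a token budget before sending it.
function fitsBudget(text: string, maxInputTokens: number): boolean {
  return estimateTokens(text) <= maxInputTokens;
}
```

Treat the result as an order-of-magnitude check; for billing-accurate numbers, read the usage fields the provider returns.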

Model Selection

Tier	Examples (2026)	Use when
Fast / cheap	Haiku, GPT-4o-mini, small open models	Classification, routing, simple edits
Balanced	Sonnet, GPT-4o	Most coding and chat features
Premium	Opus, o3, o4-mini (reasoning)	Hard bugs, architecture, low-volume critical tasks

For dev tooling: use premium models for Plan/review, cheaper models for boilerplate if your tool supports it.

For production chat: default to balanced; escalate to premium only when needed.
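One way to encode the tiering above is a small lookup that maps a task type to a tier and only escalates on request. This is a sketch under assumptions: the task names, the `pickModel` helper, and the placeholder model IDs are all hypothetical, so substitute the model IDs your provider actually offers.

```typescript
type Task =
  | "classification"
  | "routing"
  | "simple-edit"
  | "chat"
  | "coding"
  | "hard-bug"
  | "architecture";

type Tier = "fast" | "balanced" | "premium";

// Default tier per task, following the table above.
const TIER_FOR_TASK: Record<Task, Tier> = {
  classification: "fast",
  routing: "fast",
  "simple-edit": "fast",
  chat: "balanced",
  coding: "balanced",
  "hard-bug": "premium",
  architecture: "premium",
};

// Placeholder IDs (assumption): replace with real model identifiers.
const MODEL_FOR_TIER: Record<Tier, string> = {
  fast: "fast-model-id",
  balanced: "balanced-model-id",
  premium: "premium-model-id",
};

// Pick the default model for a task; pass escalate=true to force premium.
function pickModel(task: Task, escalate = false): string {
  const tier: Tier = escalate ? "premium" : TIER_FOR_TASK[task];
  return MODEL_FOR_TIER[tier];
}
```

Keeping the mapping in one place makes it easy to audit which features can reach the premium tier.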

Prompt Caching

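When the same large system prompt (docs, context, instructions) repeats across requests, caching it cuts the cost of those input tokens by up to ~90% on cache hits. The exact discount depends on the provider and model: Anthropic bills cache reads at about 10% of the base input price and charges a premium on the initial cache write, while OpenAI's cached-input discount varies by model.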
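To make the savings concrete, the arithmetic can be sketched as a small function. The price and multiplier below are illustrative assumptions rather than quoted rates, `inputCostUSD` is a hypothetical helper, and cache-write surcharges are deliberately left out.

```typescript
interface CachePricing {
  inputPerMTok: number; // base input price in USD per million tokens (assumed value)
  cacheReadMultiplier: number; // e.g. 0.1 means cached reads cost 10% of the base price
}

// Input cost of one request, split into uncached and cached (cache-hit) tokens.
function inputCostUSD(
  uncachedTokens: number,
  cachedTokens: number,
  pricing: CachePricing
): number {
  const perToken = pricing.inputPerMTok / 1_000_000;
  return (
    uncachedTokens * perToken +
    cachedTokens * perToken * pricing.cacheReadMultiplier
  );
}

// Illustrative numbers: a 10,000-token system prompt plus 500 tokens of new input.
const pricing: CachePricing = { inputPerMTok: 3, cacheReadMultiplier: 0.1 };
const cold = inputCostUSD(10500, 0, pricing); // nothing cached
const warm = inputCostUSD(500, 10000, pricing); // system prompt served from cache
```

With these assumed figures, the uncached request costs about $0.0315 in input and the cache-hit request about $0.0045. The saving applies only to the cached prefix, which is why the overall reduction is smaller than the headline discount.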

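Anthropic: mark stable content with cacheControl. In the AI SDK the marker is set on the message to cache (here, the system message) through that message's providerOptions:

import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";

// `longProductDocs` (string) and `messages` (chat history) are defined elsewhere.
// The first request writes the system prompt to the cache (~5 min TTL, refreshed on each hit).
// Subsequent requests with an identical prefix read from the cache.
const result = streamText({
  model: anthropic("claude-sonnet-4-20250514"),
  messages: [
    {
      role: "system",
      content: `You are a support agent.\n\n${longProductDocs}`, // stable, cache candidate
      providerOptions: {
        anthropic: { cacheControl: { type: "ephemeral" } },
      },
    },
    ...messages,
  ],
});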

OpenAI: prompt caching is automatic for prompts of 1,024 tokens or more that share a common prefix. The API response reports hits in usage.prompt_tokens_details.cached_tokens:

import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";

const result = streamText({
  model: openai("gpt-4o"),
  // OpenAI caches automatically when the same prefix recurs. Cached prefixes
  // usually expire after several minutes of inactivity (up to ~1 hour off-peak).
  system: `You are a support agent.\n\n${longProductDocs}`,
  messages,
});

// Inspect cache hits via provider metadata (non-streaming generateText):
// result.providerMetadata?.openai?.cachedPromptTokens

Reference: Anthropic prompt caching · OpenAI prompt caching.

History Truncation

Every message in the conversation history adds input tokens. Without truncation, a long chat session becomes expensive and slow.

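// lib/truncate-messages.ts
import type { CoreMessage } from "ai";

/**
 * Keep any system messages plus the last N conversation messages.
 * For production: summarize dropped turns instead of discarding them.
 */
export function truncateHistory(
  messages: CoreMessage[],
  maxMessages = 20
): CoreMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  if (rest.length <= maxMessages) return messages;
  if (maxMessages <= 0) return system;

  const recent = rest.slice(-maxMessages);
  // Start the window on a user turn so the history never opens with an
  // orphaned assistant or tool message.
  const firstUser = recent.findIndex((m) => m.role === "user");
  return [...system, ...recent.slice(firstUser === -1 ? 0 : firstUser)];
}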

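// app/api/chat/route.ts
import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";
// Adjust the import path (or alias) to match your project layout.
import { truncateHistory } from "../../../lib/truncate-messages";

export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    messages: truncateHistory(messages, 20), // bound input tokens per turn
    maxTokens: 1024,
  });

  return result.toDataStreamResponse();
}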

For long-running assistants, summarize the dropped history rather than discarding it: prompt the model with something like "Summarize the first 10 turns in 3 sentences", then inject the summary as a system message ahead of the remaining turns.
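The summarize-and-inject step can be sketched as two pure helpers around whatever model call you use. This is a minimal sketch, not the SDK's API: `ChatMessage` is a simplified local type, and `splitForSummary` and `injectSummary` are hypothetical names.

```typescript
// Simplified message shape for illustration; real SDK message types are richer.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Split history into the turns to summarize and the recent turns to keep verbatim.
function splitForSummary(
  messages: ChatMessage[],
  keepRecent: number
): { dropped: ChatMessage[]; recent: ChatMessage[] } {
  const cut = Math.max(0, messages.length - keepRecent);
  return { dropped: messages.slice(0, cut), recent: messages.slice(cut) };
}

// Prepend the summary as a system message; skip it if the summary is empty.
function injectSummary(summary: string, recent: ChatMessage[]): ChatMessage[] {
  if (!summary.trim()) return recent;
  return [
    { role: "system", content: `Summary of earlier conversation: ${summary}` },
    ...recent,
  ];
}
```

Between the two calls, send `dropped` to a model with a summarization prompt and pass the resulting text to `injectSummary`. Keeping the model call outside these helpers leaves them easy to unit test.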

Capping Output

Always set maxTokens on user-facing routes. Uncapped streaming can run for minutes and drive up costs.

const result = streamText({
  model: openai("gpt-4o"),
  messages,
  maxTokens: 1024,      // hard cap on response length
  temperature: 0.7,
});

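For reasoning models (o3, o4-mini, Claude with extended thinking), budget separately: thinking tokens are billed as output tokens and typically count against the output cap, so a cap sized only for the visible answer can cut the response short.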

Request Logging Middleware

Log token counts per request to track spend and detect runaway loops or abuse:

// lib/llm-logger.ts
export interface LLMLogEntry {
  timestamp: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  cachedTokens?: number;
  latencyMs: number;
  userId?: string;
  sessionId?: string;
}

export function logLLMUsage(entry: LLMLogEntry) {
  // Replace with your observability stack (Datadog, Axiom, custom DB)
  console.log(JSON.stringify({ event: "llm_request", ...entry }));
}

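// app/api/chat/route.ts: log usage from onFinish, which fires once the stream completes
import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";
// Adjust the import paths (or alias) to match your project layout.
import { logLLMUsage } from "../../../lib/llm-logger";
import { truncateHistory } from "../../../lib/truncate-messages";

export async function POST(req: Request) {
  const { messages, sessionId } = await req.json();
  const start = Date.now();

  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    messages: truncateHistory(messages),
    maxTokens: 1024,
    onFinish: ({ usage }) => {
      logLLMUsage({
        timestamp: new Date().toISOString(),
        model: "claude-sonnet-4-20250514",
        inputTokens: usage.promptTokens,
        outputTokens: usage.completionTokens,
        latencyMs: Date.now() - start, // total time until the stream finished
        sessionId,
      });
    },
  });

  return result.toDataStreamResponse();
}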

Alert on input token spikes (retry loops) or output token spikes (runaway generation).
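A simple starting point for such alerts is a static threshold check on each logged request. This is a sketch under assumptions: `detectSpikes`, the alert labels, and the threshold values are hypothetical, and a production setup would more likely compare against a rolling baseline in the observability stack.

```typescript
interface UsageSample {
  inputTokens: number;
  outputTokens: number;
}

interface SpikeThresholds {
  maxInputTokens: number; // above this, suspect retry loops or bloated context
  maxOutputTokens: number; // above this, suspect runaway generation
}

// Return the alert labels triggered by a single request's token usage.
function detectSpikes(entry: UsageSample, limits: SpikeThresholds): string[] {
  const alerts: string[] = [];
  if (entry.inputTokens > limits.maxInputTokens) alerts.push("input_spike");
  if (entry.outputTokens > limits.maxOutputTokens) alerts.push("output_spike");
  return alerts;
}
```

Calling this on each log entry lets the same entry feed both the log and the alerting path.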

Cost Control Tactics

Tactic	Savings
Prompt caching	~90% on cached input tokens for stable system prompts
Truncate history	Reduces input tokens per turn in long sessions
Smaller RAG chunks	Fewer input tokens per query
`maxTokens` cap	Prevents runaway outputs
Rate limits per user	Prevent abuse before provider org limits kick in
Batch API for offline jobs	~50% cheaper than interactive endpoint (Anthropic, OpenAI)
Route to cheaper model	Use Haiku/mini-class models for classification; escalate to Sonnet-class models only when needed
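The per-user rate limit in the table can be sketched as a fixed-window counter. This is an illustrative in-memory version, assuming a single server process; `FixedWindowLimiter` is a hypothetical class, and a multi-instance deployment would need a shared store instead.

```typescript
// Fixed-window limiter: at most `limit` requests per user per `windowMs`.
class FixedWindowLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private limit: number;
  private windowMs: number;
  private now: () => number;

  // `now` is injectable so the limiter can be tested without real time passing.
  constructor(limit: number, windowMs: number, now: () => number = () => Date.now()) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  // Returns true if the request is allowed and counts it against the window.
  allow(userId: string): boolean {
    const t = this.now();
    let w = this.windows.get(userId);
    if (!w || t - w.start >= this.windowMs) {
      w = { start: t, count: 0 }; // start a fresh window for this user
      this.windows.set(userId, w);
    }
    if (w.count >= this.limit) return false;
    w.count += 1;
    return true;
  }
}
```

In a route handler, a denied `allow(userId)` would typically map to an HTTP 429 response before any model call is made, so rejected requests cost no tokens.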

Latency vs Quality

Approach	Latency	Quality
Streaming	Feels fast (first token ~200ms–2s)	Same final quality
Smaller model	Lower	May miss nuance
Edge deployment	Lower for global users	Cold start considerations
No RAG	Faster	More hallucination on internal facts
Heavy RAG (top 20 chunks)	Slower	Better grounding

Measure p95 time-to-first-token and p95 total response time in production. For reasoning models, add p95 thinking time.
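If your observability stack does not compute percentiles for you, p95 can be derived from raw latency samples. This is a minimal sketch using the nearest-rank method; `percentile` is a hypothetical helper, and other percentile definitions (for example, interpolated ones) give slightly different values.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("percentile of empty sample set");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, copy first
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}
```

Feed it the time-to-first-token and total-latency values separately, since a route can have a healthy first-token time and still a poor total response time.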

Dev Tool Usage Costs

Cursor, Copilot, and Claude Code bundle or meter premium requests separately from raw API pricing. Track:

Requests per developer per day
Premium model vs default usage
Whether agents re-read entire files each turn (context cost)

Related

Model Picker Cheat Sheet: task-to-model quick reference
LLM APIs
Anthropic API for Web Developers
OpenAI API for Web Developers
Context Engineering
Choosing a Tool
