1. ai
  2. /web apps
  3. /cost-and-tokens

Cost, Latency, and Tokens

Tokens are the metered unit for cloud LLMs. Ignoring cost works until the invoice arrives.

Token Basics

TermMeaning
Input tokensEverything you send — system prompt, history, RAG chunks
Output tokensModel response (often priced higher)
Context windowMax input + output combined
Cached tokensRe-used input at reduced cost (provider-dependent)

Rough rule: 1 token ≈ 4 characters in English (varies by language and tokenizer).

Model Selection

TierExamples (2026)Use when
Fast / cheapHaiku, GPT-4o-mini, small open modelsClassification, routing, simple edits
BalancedSonnet, GPT-4oMost coding and chat features
PremiumOpus, o3, o4-mini (reasoning)Hard bugs, architecture, low-volume critical tasks

For dev tooling: use premium models for Plan/review, cheaper models for boilerplate if your tool supports it.

For production chat: default to balanced; escalate to premium only when needed.

Prompt Caching

When the same large system prompt (docs, context, instructions) repeats across requests, caching it cuts input token cost by ~90% on cache hits.

Anthropic — mark stable content with cacheControl:

import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";

// First request indexes the system prompt (~5 min TTL).
// Subsequent requests with the same prompt hit the cache.
const result = streamText({
  model: anthropic("claude-sonnet-4-20250514"),
  system: `You are a support agent.\n\n${longProductDocs}`, // stable, cache candidate
  messages,
  providerOptions: {
    anthropic: { cacheControl: { type: "ephemeral" } },
  },
});

OpenAI — prompt caching is automatic for prompts ≥1,024 tokens that share a common prefix. The API response includes usage.prompt_tokens_details.cached_tokens:

import { openai } from "@ai-sdk/openai";
import { streamText } from "ai";

const result = streamText({
  model: openai("gpt-4o"),
  // OpenAI caches automatically when the same prefix recurs within ~1 hour
  system: `You are a support agent.\n\n${longProductDocs}`,
  messages,
});

// Inspect cache hit rate from usage (non-streaming generateText):
// result.usage.providerMetadata?.openai?.cachedPromptTokens

Reference: Anthropic prompt caching · OpenAI prompt caching.

History Truncation

Every message in the conversation history adds input tokens. Without truncation, a long chat session becomes expensive and slow.

// lib/truncate-messages.ts
import type { CoreMessage } from "ai";

/**
 * Keep the last N messages plus any system context.
 * For production: summarize dropped turns instead of discarding them.
 */
export function truncateHistory(
  messages: CoreMessage[],
  maxMessages = 20
): CoreMessage[] {
  if (messages.length <= maxMessages) return messages;

  // Always keep an even number so user/assistant turns stay paired
  const keep = maxMessages % 2 === 0 ? maxMessages : maxMessages - 1;
  return messages.slice(-keep);
}

// app/api/chat/route.ts
export async function POST(req: Request) {
  const { messages } = await req.json();

  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    messages: truncateHistory(messages, 20),
    maxTokens: 1024,
  });

  return result.toDataStreamResponse();
}

For long-running assistants, summarize the dropped history rather than discarding it: "Summarize the first 10 turns in 3 sentences" → inject summary as a system message.

Capping Output

Always set maxTokens on user-facing routes. Uncapped streaming can run for minutes and drive up costs.

const result = streamText({
  model: openai("gpt-4o"),
  messages,
  maxTokens: 1024,      // hard cap on response length
  temperature: 0.7,
});

For reasoning models (o3, o4-mini, Claude with extended thinking), budget separately — thinking tokens count toward output cost.

Request Logging Middleware

Log token counts per request to track spend and detect runaway loops or abuse:

// lib/llm-logger.ts
export interface LLMLogEntry {
  timestamp: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  cachedTokens?: number;
  latencyMs: number;
  userId?: string;
  sessionId?: string;
}

export function logLLMUsage(entry: LLMLogEntry) {
  // Replace with your observability stack (Datadog, Axiom, custom DB)
  console.log(JSON.stringify({ event: "llm_request", ...entry }));
}

// app/api/chat/route.ts — wire up after streamText resolves
export async function POST(req: Request) {
  const { messages, sessionId } = await req.json();
  const start = Date.now();

  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    messages: truncateHistory(messages),
    maxTokens: 1024,
    onFinish: ({ usage }) => {
      logLLMUsage({
        timestamp: new Date().toISOString(),
        model: "claude-sonnet-4-20250514",
        inputTokens: usage.promptTokens,
        outputTokens: usage.completionTokens,
        latencyMs: Date.now() - start,
        sessionId,
      });
    },
  });

  return result.toDataStreamResponse();
}

Alert on input token spikes (retry loops) or output token spikes (runaway generation).

Cost Control Tactics

TacticSavings
Prompt caching~90% on cached input tokens for stable system prompts
Truncate historyReduces input tokens per turn in long sessions
Smaller RAG chunksFewer input tokens per query
maxTokens capPrevents runaway outputs
Rate limits per userPrevent abuse before provider org limits kick in
Batch API for offline jobs~50% cheaper than interactive endpoint (Anthropic, OpenAI)
Route to cheaper modelUse haiku/mini for classification; escalate to sonnet only when needed

Latency vs Quality

ApproachLatencyQuality
StreamingFeels fast (first token ~200ms–2s)Same final quality
Smaller modelLowerMay miss nuance
Edge deploymentLower for global usersCold start considerations
No RAGFasterMore hallucination on internal facts
Heavy RAG (top 20 chunks)SlowerBetter grounding

Measure p95 time-to-first-token and p95 total response time in production. For reasoning models, add p95 thinking time.

Dev Tool Usage Costs

Cursor, Copilot, and Claude Code bundle or meter premium requests separately from raw API pricing. Track:

  • Requests per developer per day
  • Premium model vs default usage
  • Whether agents re-read entire files each turn (context cost)