
Reasoning Models

Reasoning models spend extra compute on an internal chain of thought before they answer. That makes them stronger on hard problems and a poor fit for latency-sensitive work such as autocomplete.

Last reviewed: June 2026

Model families and IDs change quarterly. Check the OpenAI models page, the Anthropic models page, and the Ollama library before making production decisions.

Model categories

| Category | Examples (2026) | Best for |
| --- | --- | --- |
| Fast chat | GPT-4o-mini, Claude Haiku, Gemini Flash | High-volume chat, classification, routing |
| Balanced | GPT-4o, Claude Sonnet, Gemini Pro | Daily coding agents, tool calling |
| Reasoning / thinking | OpenAI o3/o4-mini, Claude extended thinking | Hard bugs, architecture, multi-step analysis |
| Local open weights | Llama 3.x, Qwen, DeepSeek via Ollama | Offline dev, privacy, token cost experiments |

See Model Picker Cheat Sheet for task mapping.

When reasoning models help

| Task | Reasoning model | Fast model |
| --- | --- | --- |
| Intermittent CI failure with race condition | Yes | Often misses timing |
| "Add a button" UI component | Overkill | Yes |
| Security audit of auth flow | Yes | Misses subtle gaps |
| Boilerplate CRUD | Overkill | Yes |
| Complex SQL query optimization | Yes | May hallucinate plans |
| Inline tab completion | Too slow | Yes |

Rule: Use reasoning models for low-volume, high-stakes analysis. Use fast models for high-volume, low-latency loops.
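The rule above can be sketched as a small routing function. This is only an illustration: the `TaskProfile` fields, the two thresholds, and the tier names are assumptions made up for the example, not values from any vendor, so tune them against your own traffic.

```typescript
type Tier = "fast" | "balanced" | "reasoning";

// Hypothetical task description; the field names are illustrative assumptions.
interface TaskProfile {
  requestsPerHour: number; // expected call volume
  latencyBudgetMs: number; // how long the caller can wait for an answer
  highStakes: boolean;     // true when a wrong answer is expensive (security, data loss)
}

// Assumed thresholds, chosen only to make the example concrete.
const HIGH_VOLUME_PER_HOUR = 1000;
const INTERACTIVE_BUDGET_MS = 2000;

function pickTier(task: TaskProfile): Tier {
  // A tight latency budget or high volume rules out reasoning models, whatever the stakes.
  if (
    task.latencyBudgetMs <= INTERACTIVE_BUDGET_MS ||
    task.requestsPerHour >= HIGH_VOLUME_PER_HOUR
  ) {
    return "fast";
  }
  // Low volume, relaxed latency, and a costly failure mode: spend the extra compute.
  if (task.highStakes) {
    return "reasoning";
  }
  return "balanced";
}

// A security audit run a few times an hour versus inline completion.
console.log(pickTier({ requestsPerHour: 5, latencyBudgetMs: 120000, highStakes: true }));
console.log(pickTier({ requestsPerHour: 50000, latencyBudgetMs: 300, highStakes: false }));
```

Checking latency and volume before stakes is a deliberate choice in this sketch: a reasoning model that answers after the caller has given up helps nobody, even on an important task.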

Extended thinking (Anthropic)

Claude supports extended thinking with a configurable token budget: the model spends up to that many tokens on internal reasoning before it writes the visible answer.

import { anthropic } from "@ai-sdk/anthropic";
import { generateText } from "ai";

// Requires ANTHROPIC_API_KEY in the environment.
const { text } = await generateText({
  model: anthropic("claude-sonnet-4-20250514"),
  prompt: "Analyze this deadlock stack trace and propose a fix...",
  providerOptions: {
    // budgetTokens caps the internal reasoning tokens spent before the visible answer.
    anthropic: { thinking: { type: "enabled", budgetTokens: 8000 } },
  },
});

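Verify the current API shape in the Anthropic extended thinking docs. Thinking tokens are billed in addition to the visible answer (on Anthropic's API they count as output tokens), so a large budget raises cost even when the answer itself is short.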

OpenAI reasoning family

OpenAI o-series models trade latency for depth. Use them through the API or Codex for hard refactors, not for streaming chat at scale.

| Signal | Action |
| --- | --- |
| Fast model fails twice on same bug | Escalate to reasoning model with full error context |
| User-facing chat | Stay on GPT-4o class |
| Batch offline analysis | Reasoning models + structured output |

Details: OpenAI API.
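The "fails twice, then escalate" signal can be sketched as a provider-neutral helper. Everything named here (`ModelCall`, `Check`, `solveWithEscalation`) is hypothetical: the model calls are passed in as plain functions so the sketch does not assume any particular SDK, and `passes` stands in for whatever check you trust, such as a test run.

```typescript
// A model call is abstracted as a function so the sketch stays provider-neutral.
type ModelCall = (prompt: string) => Promise<string>;
// Hypothetical acceptance check, e.g. "do the tests pass with this patch?"
type Check = (answer: string) => Promise<boolean>;

interface EscalationResult {
  answer: string;
  tier: "fast" | "reasoning";
  fastAttempts: number;
}

async function solveWithEscalation(
  prompt: string,
  errorContext: string,
  fast: ModelCall,
  reasoning: ModelCall,
  passes: Check,
  maxFastAttempts = 2,
): Promise<EscalationResult> {
  // Try the cheap model first, up to the attempt limit.
  for (let attempt = 1; attempt <= maxFastAttempts; attempt++) {
    const answer = await fast(prompt);
    if (await passes(answer)) {
      return { answer, tier: "fast", fastAttempts: attempt };
    }
  }
  // The fast model used up its attempts: escalate and attach the full error context.
  const answer = await reasoning(`${prompt}\n\nFull error context:\n${errorContext}`);
  // The escalated answer is returned unchecked; the caller is expected to verify it.
  return { answer, tier: "reasoning", fastAttempts: maxFastAttempts };
}
```

The reasoning call receives the error context that the fast attempts did not, which matches the table's "with full error context" action and keeps the cheap calls short.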

Local models with Ollama

Run models on your machine for privacy experiments and offline dev:

# Install Ollama — see ollama.com for your OS
ollama pull llama3.1
ollama run llama3.1

Connect from tools:

| Tool | Configuration |
| --- | --- |
| Continue.dev | provider: ollama in config.yaml |
| Cursor | Custom OpenAI-compatible endpoint (if supported in your version) |
| Aider | --model ollama/llama3.1 (verify model string) |
| Cline | Local provider settings |
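Besides the tools listed above, a script can call the local Ollama server over HTTP. The sketch below assumes the default address `http://localhost:11434`, Ollama's `/api/generate` endpoint, and a runtime with a global `fetch` (Node 18 or later). The helper names `buildGenerateRequest` and `generateLocal` are made up for this example.

```typescript
// Ollama listens on localhost:11434 by default; adjust if your setup differs.
const OLLAMA_URL = "http://localhost:11434/api/generate";

interface GenerateRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build the request separately so it can be inspected without a running server.
function buildGenerateRequest(model: string, prompt: string): GenerateRequest {
  return {
    url: OLLAMA_URL,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // stream: false asks for a single JSON object instead of streamed chunks.
      body: JSON.stringify({ model, prompt, stream: false }),
    },
  };
}

// Send the prompt to the local model and return the generated text.
async function generateLocal(model: string, prompt: string): Promise<string> {
  const { url, init } = buildGenerateRequest(model, prompt);
  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`Ollama returned HTTP ${res.status}`);
  }
  const data = (await res.json()) as { response: string };
  return data.response;
}
```

Calling `generateLocal("llama3.1", "Explain this stack trace")` should work once the model has been pulled with `ollama pull llama3.1` and the Ollama server is running. Nothing leaves the machine, which is the point of the local setup.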

Local model tradeoffs

| Pro | Con |
| --- | --- |
| No cloud data transfer | Weaker on complex refactors vs frontier APIs |
| No per-token cloud bill | GPU/RAM requirements; laptop thermal throttling |
| Works offline | You maintain model updates |
| Good for learning / prototyping | Not for production user-facing features |

For production local inference, plan GPU capacity and model versioning — see AI Platforms.

Agentic loops and reasoning

Reasoning models fit the plan phase of Agentic Workflows:

  1. Plan (reasoning model) — analyze failure, list files
  2. Implement (fast model) — apply scoped edits
  3. Verify (CI + human) — tests and review

Splitting models by phase controls cost better than one premium model for everything.
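The three phases can be sketched as one loop that takes a different model per phase. This is a minimal illustration under stated assumptions: `PhaseModels`, `runAgentLoop`, and the convention that the plan lists one file path per line are all hypothetical, and `verify` stands in for CI plus human review.

```typescript
type ModelCall = (prompt: string) => Promise<string>;

interface PhaseModels {
  plan: ModelCall;      // reasoning model: analyze the failure, list files
  implement: ModelCall; // fast model: apply one scoped edit per file
}

interface RunResult {
  plan: string;
  edits: string[];
  verified: boolean;
}

async function runAgentLoop(
  failure: string,
  models: PhaseModels,
  verify: (edits: string[]) => Promise<boolean>, // stand-in for CI and review
): Promise<RunResult> {
  // 1. Plan: a single expensive call. Assumes the reply lists one file path per line.
  const plan = await models.plan(
    `Analyze this failure and list the files to change, one per line:\n${failure}`,
  );
  const files = plan
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  // 2. Implement: many cheap calls, each scoped to one file.
  const edits: string[] = [];
  for (const file of files) {
    edits.push(await models.implement(`Apply the planned fix to ${file}.\nFailure:\n${failure}`));
  }

  // 3. Verify: the models do not grade their own work.
  const verified = await verify(edits);
  return { plan, edits, verified };
}
```

The expensive model is called once per failure and the cheap model once per file, so cost scales with the number of edits at the fast tier's rate.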

Cost and latency

| Model tier | Latency | Relative cost |
| --- | --- | --- |
| Fast chat | 1–3 s TTFT | $ |
| Balanced | 2–8 s | $$ |
| Reasoning | 10–60+ s | $$$ |
| Local 7B | Hardware-bound | Electricity + hardware |

Track usage with LLM Observability and Cost, Latency, and Tokens.
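To turn the relative cost column into a rough number, a back-of-the-envelope estimate can help. The prices below are placeholders invented for this example, not real vendor prices, and the sketch assumes thinking tokens are billed at the output rate. Substitute the figures from your provider's pricing page.

```typescript
// Placeholder prices in USD per million tokens. These are NOT real vendor prices.
interface Price {
  inputPerM: number;
  outputPerM: number;
}

interface Usage {
  inputTokens: number;
  outputTokens: number;    // visible answer
  thinkingTokens?: number; // internal reasoning; assumed billed at the output rate
}

function estimateCostUsd(usage: Usage, price: Price): number {
  const billedOutput = usage.outputTokens + (usage.thinkingTokens ?? 0);
  return (usage.inputTokens * price.inputPerM + billedOutput * price.outputPerM) / 1e6;
}

const fastPrice: Price = { inputPerM: 1, outputPerM: 4 };        // hypothetical "$" tier
const reasoningPrice: Price = { inputPerM: 10, outputPerM: 40 }; // hypothetical "$$$" tier

// One planning call with an 8,000-token thinking budget, then ten scoped edits.
const planUsage: Usage = { inputTokens: 20000, outputTokens: 1000, thinkingTokens: 8000 };
const editUsage: Usage = { inputTokens: 5000, outputTokens: 1000 };

const planCost = estimateCostUsd(planUsage, reasoningPrice);
const splitTotal = planCost + 10 * estimateCostUsd(editUsage, fastPrice);
const allReasoningTotal = planCost + 10 * estimateCostUsd(editUsage, reasoningPrice);

console.log(`split by phase: $${splitTotal.toFixed(2)}`);
console.log(`reasoning tier for everything: $${allReasoningTotal.toFixed(2)}`);
```

With these placeholder numbers the split run costs about $0.65 against about $1.46 for running every step on the reasoning tier. The ratio depends entirely on the assumed prices and token counts.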