Reasoning Models
Reasoning models spend extra compute on internal chain-of-thought before answering — better on hard problems, worse on latency-sensitive autocomplete.
Last reviewed: June 2026
Model families and IDs change quarterly. Check the OpenAI models page, the Anthropic models page, and the Ollama library before making production decisions.
Model categories
| Category | Examples (2026) | Best for |
|---|---|---|
| Fast chat | GPT-4o-mini, Claude Haiku, Gemini Flash | High-volume chat, classification, routing |
| Balanced | GPT-4o, Claude Sonnet, Gemini Pro | Daily coding agents, tool calling |
| Reasoning / thinking | OpenAI o3/o4-mini, Claude extended thinking | Hard bugs, architecture, multi-step analysis |
| Local open weights | Llama 3.x, Qwen, DeepSeek via Ollama | Offline dev, privacy, token cost experiments |
See Model Picker Cheat Sheet for task mapping.
When reasoning models help
| Task | Reasoning model | Fast model |
|---|---|---|
| Intermittent CI failure with race condition | Yes | Often misses timing |
| "Add a button" UI component | Overkill | Yes |
| Security audit of auth flow | Yes | Misses subtle gaps |
| Boilerplate CRUD | Overkill | Yes |
| Complex SQL query optimization | Yes | May hallucinate plans |
| Inline tab completion | Too slow | Yes |
Rule: Use reasoning models for low-volume, high-stakes analysis. Use fast models for high-volume, low-latency loops.
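The rule above can be sketched as a small routing function. This is a minimal illustration, not a prescribed policy: the `TaskProfile` fields and both thresholds are assumptions made up for the example (the 10-second cutoff only mirrors the lower bound of typical reasoning latency).

```typescript
// Hypothetical task descriptor; field names are illustrative, not from any SDK.
type TaskProfile = {
  requestsPerDay: number; // expected call volume
  latencyBudgetMs: number; // how long the caller can wait for an answer
  highStakes: boolean; // e.g. security review, production incident
};

type ModelTier = "fast" | "balanced" | "reasoning";

// Placeholder thresholds (assumptions); tune them against your own traffic.
const MAX_REASONING_VOLUME = 1_000;
const MIN_REASONING_LATENCY_MS = 10_000;

function pickTier(task: TaskProfile): ModelTier {
  const canWait = task.latencyBudgetMs >= MIN_REASONING_LATENCY_MS;
  const lowVolume = task.requestsPerDay <= MAX_REASONING_VOLUME;

  // Low-volume, high-stakes analysis that can tolerate the wait.
  if (task.highStakes && canWait && lowVolume) return "reasoning";
  // High stakes, but too frequent or too latency-sensitive for reasoning.
  if (task.highStakes) return "balanced";
  // Everything else stays on the fast, cheap loop.
  return "fast";
}
```

For example, a weekly security audit with a two-minute latency budget routes to `"reasoning"`, while inline completion with a 300 ms budget routes to `"fast"`.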
Extended thinking (Anthropic)
Claude supports extended thinking with a configurable token budget: the model spends up to that many tokens on internal reasoning before it writes the visible answer.
    import { anthropic } from "@ai-sdk/anthropic";
    import { generateText } from "ai";

    // Top-level await requires an ES module context.
    const { text } = await generateText({
      model: anthropic("claude-sonnet-4-20250514"),
      prompt: "Analyze this deadlock stack trace and propose a fix...",
      providerOptions: {
        // Cap internal reasoning at 8,000 tokens before the visible answer.
        anthropic: { thinking: { type: "enabled", budgetTokens: 8000 } },
      },
    });
Verify the current API shape in the Anthropic extended thinking docs. Thinking tokens are billed as output tokens, so a large budget raises the cost of every call.
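Choosing a budget usually means keeping it inside the limits the API enforces. The helper below is a sketch under two assumptions that should be checked against the current docs: that the minimum thinking budget is 1,024 tokens, and that the budget must stay below the response's maximum output tokens. `thinkingOptions` and `reservedForAnswer` are hypothetical names, not part of any SDK.

```typescript
// Assumed minimum thinking budget; verify against the current Anthropic docs.
const MIN_THINKING_BUDGET = 1024;

type ThinkingOptions = { type: "enabled"; budgetTokens: number };

// Hypothetical helper: clamps a requested budget into the assumed valid range
// and leaves room for the visible answer. Returns undefined when the output
// limit is too small for thinking, in which case the option should be omitted.
function thinkingOptions(
  requestedBudget: number,
  maxOutputTokens: number,
  reservedForAnswer = 1024,
): ThinkingOptions | undefined {
  const ceiling = maxOutputTokens - reservedForAnswer;
  if (ceiling < MIN_THINKING_BUDGET) return undefined;

  const budgetTokens = Math.min(
    Math.max(Math.floor(requestedBudget), MIN_THINKING_BUDGET),
    ceiling,
  );
  return { type: "enabled", budgetTokens };
}
```

The returned object has the same shape as the `thinking` value passed under `providerOptions.anthropic` in the call above.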
OpenAI reasoning family
OpenAI o-series models trade latency for depth. Use them through the API or Codex for hard refactors, not for streaming chat at scale.
| Signal | Action |
|---|---|
| Fast model fails twice on same bug | Escalate to reasoning model with full error context |
| User-facing chat | Stay on GPT-4o class |
| Batch offline analysis | Reasoning models + structured output |
Details: OpenAI API.
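The first row of the signal table (escalate after the fast model fails twice on the same bug) can be sketched as a wrapper around two model calls. Everything here is illustrative: `AttemptFn` is a hypothetical adapter that you would implement around your own SDK call plus whatever check decides that a fix passed, such as a test run.

```typescript
// Hypothetical result of one attempt: did the fix pass, and what was logged?
type Attempt = { passed: boolean; log: string };

// Hypothetical adapter around a model call plus verification.
type AttemptFn = (prompt: string) => Promise<Attempt>;

async function solveWithEscalation(
  task: string,
  fast: AttemptFn,
  reasoning: AttemptFn,
  maxFastAttempts = 2,
): Promise<{ tier: "fast" | "reasoning"; attempt: Attempt }> {
  const failures: string[] = [];

  for (let i = 0; i < maxFastAttempts; i++) {
    // Retries include the most recent failure so the fast model can adjust.
    const prompt =
      failures.length === 0
        ? task
        : `${task}\n\nPrevious attempt failed:\n${failures[failures.length - 1]}`;
    const attempt = await fast(prompt);
    if (attempt.passed) return { tier: "fast", attempt };
    failures.push(attempt.log);
  }

  // Escalate with the full error context from every failed fast attempt.
  const context = failures
    .map((log, i) => `Attempt ${i + 1} failed:\n${log}`)
    .join("\n\n");
  const attempt = await reasoning(`${task}\n\n${context}`);
  return { tier: "reasoning", attempt };
}
```

Keeping the escalation threshold as a parameter makes it easy to compare the cost of extra fast retries against one reasoning call.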
Local models with Ollama
Run models on your machine for privacy experiments and offline dev:
    # Install Ollama (see ollama.com for your OS)
    ollama pull llama3.1
    ollama run llama3.1
Connect from tools:
| Tool | Configuration |
|---|---|
| Continue.dev | provider: ollama in config.yaml |
| Cursor | Custom OpenAI-compatible endpoint (if supported in your version) |
| Aider | --model ollama/llama3.1 (verify model string) |
| Cline | Local provider settings |
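The tools above reach Ollama through its local HTTP server, and the same endpoint can be called directly. The sketch below assumes the default address `http://localhost:11434`, a model that has already been pulled, and a runtime with a global `fetch` (Node 18+); the function names are illustrative.

```typescript
// Default local Ollama endpoint (assumption: server runs on the default port).
const OLLAMA_URL = "http://localhost:11434/api/generate";

type GenerateRequest = {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
};

// Builds a non-streaming request so the reply arrives as one JSON object.
function buildGenerateRequest(model: string, prompt: string): GenerateRequest {
  return {
    url: OLLAMA_URL,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: false }),
    },
  };
}

// Sends the prompt to the local server and returns the generated text.
// Fails with a network error if Ollama is not running.
async function generateLocal(model: string, prompt: string): Promise<string> {
  const { url, init } = buildGenerateRequest(model, prompt);
  const res = await fetch(url, init);
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = (await res.json()) as { response: string };
  return data.response;
}
```

A call such as `generateLocal("llama3.1", "Explain this stack trace")` keeps the prompt and the response on the local machine.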
Local model tradeoffs
| Pro | Con |
|---|---|
| No cloud data transfer | Weaker on complex refactors vs frontier APIs |
| No per-token cloud bill | GPU/RAM requirements; laptop thermal throttling |
| Works offline | You maintain model updates |
| Good for learning / prototyping | Not for production user-facing features |
For production local inference, plan GPU capacity and model versioning — see AI Platforms.
Agentic loops and reasoning
Reasoning models fit the plan phase of Agentic Workflows:
- Plan (reasoning model) — analyze failure, list files
- Implement (fast model) — apply scoped edits
- Verify (CI + human) — tests and review
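Splitting models by phase controls cost better than running one premium model for every phase: the expensive model is called once per task, and the cheap model handles the many scoped edits.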
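The plan and implement phases above can be sketched as one reasoning call followed by several fast calls. `CallModel` is a hypothetical adapter over whichever SDK you use, and the one-file-per-line plan format is an assumption made to keep the example short.

```typescript
type Phase = "plan" | "implement";

// Hypothetical adapter: sends a prompt to the given tier and returns the text.
type CallModel = (tier: "reasoning" | "fast", prompt: string) => Promise<string>;

// Phase-to-tier mapping from the list above. Verify has no entry because it
// runs CI and human review rather than a model.
const PHASE_TIER: Record<Phase, "reasoning" | "fast"> = {
  plan: "reasoning",
  implement: "fast",
};

async function planThenImplement(
  failureReport: string,
  callModel: CallModel,
): Promise<string[]> {
  // One expensive call: analyze the failure and list the files to change.
  const plan = await callModel(
    PHASE_TIER.plan,
    `Analyze this failure and list the files to change, one per line:\n${failureReport}`,
  );
  const files = plan
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  // Several cheap calls: one scoped edit per file.
  const edits: string[] = [];
  for (const file of files) {
    edits.push(
      await callModel(
        PHASE_TIER.implement,
        `Apply the planned fix to ${file} only.\nPlan:\n${plan}`,
      ),
    );
  }
  return edits; // hand these to CI and a human reviewer
}
```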
Cost and latency
| Model tier | Latency | Relative cost |
|---|---|---|
| Fast chat | 1–3 s TTFT | $ |
| Balanced | 2–8 s | $$ |
| Reasoning | 10–60+ s | $$$ |
| Local 7B | Hardware-bound | Electricity + hardware |
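To put rough numbers behind the relative-cost column, a per-request estimate can be computed from token counts. The prices below are placeholders invented for this example, not real quotes, and the function assumes thinking tokens are billed at the output rate; check your provider's pricing page for both.

```typescript
// Price per million tokens in USD.
type TierPrice = { inputPerMTok: number; outputPerMTok: number };

// PLACEHOLDER prices (assumptions for illustration only, not real quotes).
const PLACEHOLDER_PRICES: Record<"fast" | "balanced" | "reasoning", TierPrice> = {
  fast: { inputPerMTok: 0.5, outputPerMTok: 2 },
  balanced: { inputPerMTok: 3, outputPerMTok: 15 },
  reasoning: { inputPerMTok: 3, outputPerMTok: 15 },
};

// Estimates the cost of one request. Thinking tokens are added to the output
// count, on the assumption that they are billed at the output rate.
function estimateCostUsd(
  price: TierPrice,
  inputTokens: number,
  visibleOutputTokens: number,
  thinkingTokens = 0,
): number {
  const outputTokens = visibleOutputTokens + thinkingTokens;
  return (
    (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) /
    1_000_000
  );
}
```

With these placeholder prices, the same 2,000-token prompt and 500-token answer costs about ten times more once an 8,000-token thinking budget is fully used, even though the per-token price is unchanged.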
Track usage with LLM Observability and Cost, Latency, and Tokens.