Reasoning Models and Local LLMs

Reasoning models spend extra compute on internal chain-of-thought before answering. That makes them stronger on hard problems and a poor fit for latency-sensitive work such as autocomplete.

Last reviewed: June 2026

Model families and IDs change quarterly. Check the OpenAI models page, the Anthropic models page, and the Ollama library before making production decisions.

Model categories

Category	Examples (2026)	Best for
Fast chat	GPT-4o-mini, Claude Haiku, Gemini Flash	High-volume chat, classification, routing
Balanced	GPT-4o, Claude Sonnet, Gemini Pro	Daily coding agents, tool calling
Reasoning / thinking	OpenAI o3/o4-mini, Claude extended thinking	Hard bugs, architecture, multi-step analysis
Local open weights	Llama 3.x, Qwen, DeepSeek via Ollama	Offline dev, privacy, token cost experiments

See Model Picker Cheat Sheet for task mapping.

When reasoning models help

Task	Reasoning model	Fast model
Intermittent CI failure with race condition	Yes	Often misses timing
"Add a button" UI component	Overkill	Yes
Security audit of auth flow	Yes	Misses subtle gaps
Boilerplate CRUD	Overkill	Yes
Complex SQL query optimization	Yes	May hallucinate plans
Inline tab completion	Too slow	Yes

Rule: Use reasoning models for low-volume, high-stakes analysis. Use fast models for high-volume, low-latency loops.
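
The rule above can be sketched as a small routing function. This is an illustration only: the `TaskProfile` fields, the `pickTier` name, the 1,000 calls-per-day cutoff, and the "balanced" fallback for high-stakes, high-volume work are all assumptions, not part of any vendor API.

```typescript
// Hypothetical tier names and thresholds, chosen for illustration only.
type Tier = "fast" | "balanced" | "reasoning";

interface TaskProfile {
  /** True when a human or an editor loop is waiting on each response. */
  latencySensitive: boolean;
  /** Rough cost of a wrong answer. */
  stakes: "low" | "high";
  /** Expected number of calls per day for this task type. */
  callsPerDay: number;
}

// Assumed cutoff: above this volume, reasoning-model cost and latency dominate.
const HIGH_VOLUME_CALLS_PER_DAY = 1000;

function pickTier(task: TaskProfile): Tier {
  // Autocomplete-style loops cannot absorb multi-second thinking time.
  if (task.latencySensitive) return "fast";
  // Low-volume, high-stakes analysis is where extra reasoning pays off.
  if (task.stakes === "high" && task.callsPerDay < HIGH_VOLUME_CALLS_PER_DAY) {
    return "reasoning";
  }
  // Assumption: high-stakes work at high volume falls back to the balanced tier.
  if (task.stakes === "high") return "balanced";
  return "fast";
}
```

For example, a security audit run a few times a week would map to `"reasoning"`, while inline tab completion maps to `"fast"` regardless of stakes.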

Extended thinking (Anthropic)

Claude supports extended thinking with a configurable token budget: the model spends up to that many tokens on internal reasoning before producing the visible answer.

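// Requires ANTHROPIC_API_KEY in the environment.
import { anthropic } from "@ai-sdk/anthropic";
import { generateText } from "ai";

const { text } = await generateText({
  model: anthropic("claude-sonnet-4-20250514"),
  prompt: "Analyze this deadlock stack trace and propose a fix...",
  providerOptions: {
    // Upper bound on internal reasoning tokens spent before the visible answer.
    anthropic: { thinking: { type: "enabled", budgetTokens: 8000 } },
  },
});

// `text` holds only the visible answer, not the internal reasoning.
console.log(text);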

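Verify the current API shape in the Anthropic extended thinking docs. At the time of writing, thinking tokens are billed as output tokens, so a large budget raises cost even when the visible answer is short.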

OpenAI reasoning family

OpenAI o-series models trade latency for depth. Use them via the API or Codex for hard refactors, not for streaming chat at scale.

Signal	Action
Fast model fails twice on same bug	Escalate to reasoning model with full error context
User-facing chat	Stay on GPT-4o class
Batch offline analysis	Reasoning models + structured output
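
The first row of the table can be sketched as an escalation helper. This is a minimal sketch under stated assumptions: `FAST_MODEL` and `REASONING_MODEL` are placeholder strings rather than real model IDs, and the `Attempt` shape, `nextModel`, and `escalationPrompt` are hypothetical names introduced here for illustration.

```typescript
// One attempt at fixing a single bug. Shape is hypothetical.
interface Attempt {
  model: string;
  ok: boolean;
  error?: string;
}

const FAST_MODEL = "fast-model"; // placeholder, not a real model ID
const REASONING_MODEL = "reasoning-model"; // placeholder, not a real model ID
const MAX_FAST_FAILURES = 2; // "fails twice on same bug"

// Pick the model for the next attempt, given the history for one bug.
function nextModel(history: Attempt[]): string {
  const fastFailures = history.filter(
    (a) => a.model === FAST_MODEL && !a.ok,
  ).length;
  return fastFailures >= MAX_FAST_FAILURES ? REASONING_MODEL : FAST_MODEL;
}

// Build the escalation prompt so the reasoning model sees the full error context.
function escalationPrompt(task: string, history: Attempt[]): string {
  const errors = history
    .filter((a) => !a.ok && a.error !== undefined)
    .map((a, i) => `Attempt ${i + 1} (${a.model}): ${a.error}`);
  return [task, "Previous failed attempts:", ...errors].join("\n");
}
```

Keeping the history per bug, rather than per session, stops one stubborn failure from escalating every unrelated request.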

Details: OpenAI API.

Local models with Ollama

Run models on your machine for privacy experiments and offline dev:

# Install Ollama — see ollama.com for your OS
ollama pull llama3.1
ollama run llama3.1
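
Once a model is pulled, Ollama also serves a local HTTP API, by default at `http://localhost:11434`. The sketch below builds a non-streaming request for the `/api/generate` endpoint and reads the `response` field of the reply. The helper names `buildGenerateRequest` and `extractResponseText` are hypothetical; check the request and response fields against the Ollama API docs for your installed version.

```typescript
// Body for a non-streaming generate call.
interface OllamaGenerateRequest {
  model: string;
  prompt: string;
  stream: false;
}

// Plain description of the HTTP call, independent of any HTTP client.
interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

const OLLAMA_BASE_URL = "http://localhost:11434"; // Ollama's default local address

function buildGenerateRequest(
  model: string,
  prompt: string,
  baseUrl: string = OLLAMA_BASE_URL,
): HttpRequest {
  const payload: OllamaGenerateRequest = { model, prompt, stream: false };
  return {
    url: `${baseUrl}/api/generate`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}

// Pull the generated text out of the parsed JSON reply, failing loudly on surprises.
function extractResponseText(json: unknown): string {
  if (typeof json === "object" && json !== null && "response" in json) {
    const value = (json as { response: unknown }).response;
    if (typeof value === "string") return value;
  }
  throw new Error("Unexpected Ollama response shape");
}
```

To use it, pass `req.url` and the remaining fields of `req` to `fetch`, then hand the parsed JSON to `extractResponseText`. Splitting request construction from the network call keeps both helpers testable without a running Ollama server.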

Connect from tools:

Tool	Configuration
Continue.dev	`provider: ollama` in config.yaml
Cursor	Custom OpenAI-compatible endpoint (if supported in your version)
Aider	`--model ollama/llama3.1` (verify model string)
Cline	Local provider settings

Local model tradeoffs

Pro	Con
No cloud data transfer	Weaker on complex refactors vs frontier APIs
No per-token cloud bill	GPU/RAM requirements; laptop thermal throttling
Works offline	You maintain model updates
Good for learning / prototyping	Not suited to production user-facing features without dedicated serving capacity

For production local inference, plan GPU capacity and model versioning — see AI Platforms.

Agentic loops and reasoning

Reasoning models fit the plan phase of Agentic Workflows:

Plan (reasoning model) — analyze failure, list files
Implement (fast model) — apply scoped edits
Verify (CI + human) — tests and review

Splitting models by phase controls cost better than one premium model for everything.
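
A rough way to see the effect is to price each phase separately. The sketch below is an assumption-heavy illustration: the per-million-token prices and token counts are made up, not vendor rates, and `PhaseUsage`, `costUsd`, and `withTier` are hypothetical names. The verify phase is left out because CI and human review consume no model tokens.

```typescript
type Tier = "fast" | "reasoning";

// Hypothetical prices in USD per million tokens. Not real vendor rates.
const PRICE_PER_MTOK: Record<Tier, number> = { fast: 1, reasoning: 10 };

interface PhaseUsage {
  phase: "plan" | "implement";
  tier: Tier;
  tokens: number;
}

// Total cost of a run: sum of tokens times the price of the tier used per phase.
function costUsd(usage: PhaseUsage[]): number {
  return usage.reduce(
    (sum, u) => sum + (u.tokens / 1_000_000) * PRICE_PER_MTOK[u.tier],
    0,
  );
}

// Same token counts, but every phase forced onto one tier.
function withTier(usage: PhaseUsage[], tier: Tier): PhaseUsage[] {
  return usage.map((u): PhaseUsage => ({ ...u, tier }));
}

// Made-up workload: a short plan and a much larger implementation phase.
const splitRun: PhaseUsage[] = [
  { phase: "plan", tier: "reasoning", tokens: 20_000 },
  { phase: "implement", tier: "fast", tokens: 200_000 },
];

const splitCost = costUsd(splitRun);
const allReasoningCost = costUsd(withTier(splitRun, "reasoning"));
console.log({ splitCost, allReasoningCost });
```

Under these invented numbers the split run costs about $0.40 and the all-reasoning run about $2.20, because the bulk of the tokens sit in the implement phase.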

Cost and latency

Model tier	Latency	Relative cost
Fast chat	1–3 s TTFT	$
Balanced	2–8 s	$$
Reasoning	10–60+ s	$$$
Local 7B	Hardware-bound	Electricity + hardware
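
The latency column can be turned into a simple budget check. This sketch copies the ranges from the table above and treats the upper figure as the worst case, which is an assumption: the reasoning row is listed as "60+", so real calls can run longer. The local row is omitted because its latency depends on hardware. `TierLatency` and `tiersWithinBudget` are hypothetical names.

```typescript
interface TierLatency {
  tier: string;
  minSeconds: number;
  maxSeconds: number;
}

// Ranges taken from the table; 60 s for reasoning is the listed figure, not a hard cap.
const LATENCY: TierLatency[] = [
  { tier: "fast", minSeconds: 1, maxSeconds: 3 },
  { tier: "balanced", minSeconds: 2, maxSeconds: 8 },
  { tier: "reasoning", minSeconds: 10, maxSeconds: 60 },
];

// Return the tiers whose listed worst-case latency fits inside the budget.
function tiersWithinBudget(budgetSeconds: number): string[] {
  return LATENCY.filter((t) => t.maxSeconds <= budgetSeconds).map((t) => t.tier);
}
```

A 5-second budget leaves only the fast tier, which matches the earlier guidance to keep user-facing chat off reasoning models.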

Track usage with LLM Observability and Cost, Latency, and Tokens.
