
How LLMs Work

Large language models (LLMs) predict the next token in a sequence. That simple mechanism, scaled up, produces code that often works — and often does not.

What an LLM Actually Does

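Given the text so far, the model outputs a probability distribution over possible next tokens. One token is sampled and appended, and the process repeats until a stop condition is met (an end-of-sequence token or a length limit). There is no built-in compiler, test runner, or connection to your repo unless a tool provides one.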

Implication: Fluency ≠ correctness. The model optimizes for plausible text, not runnable code.
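The generation loop described above can be sketched with a toy model. This is a minimal illustration, not how a real LLM is implemented: the lookup table, the token names, and the `<start>`/`<stop>` markers are all invented here, and a real model computes its distribution with a neural network conditioned on the whole context rather than on the previous token alone.

```typescript
// Toy "model": maps the previous token to a probability distribution over
// next tokens. The table is hypothetical and exists only for illustration.
type Distribution = Record<string, number>;

const toyModel: Record<string, Distribution> = {
  "<start>": { const: 0.7, let: 0.3 },
  const: { x: 0.6, y: 0.4 },
  let: { x: 0.5, y: 0.5 },
  x: { "=": 1.0 },
  y: { "=": 1.0 },
  "=": { "1": 0.5, "2": 0.5 },
  "1": { "<stop>": 1.0 },
  "2": { "<stop>": 1.0 },
};

// Pick a token by walking the cumulative distribution with a number in [0, 1).
function sampleToken(dist: Distribution, rand: () => number): string {
  const r = rand();
  const tokens = Object.keys(dist);
  let cumulative = 0;
  for (const token of tokens) {
    cumulative += dist[token];
    if (r < cumulative) return token;
  }
  // Fallback for floating-point rounding: return the last candidate.
  return tokens[tokens.length - 1];
}

// Sample, append, repeat, until the stop token or the length cap is reached.
function generate(maxTokens: number, rand: () => number = Math.random): string[] {
  const output: string[] = [];
  let previous = "<start>";
  while (output.length < maxTokens) {
    const dist = toyModel[previous];
    if (!dist) break; // unknown token: nothing to sample from
    const next = sampleToken(dist, rand);
    if (next === "<stop>") break;
    output.push(next);
    previous = next;
  }
  return output;
}
```

Calling `generate(10, () => 0)` always takes the first candidate and yields `const x = 1`; with the default `Math.random` the result varies between runs. Nothing in the loop checks whether the output compiles, which is the point of the implication above.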

Tokens and Context Windows

Text is split into tokens (word pieces). Each model has a context window: the maximum number of tokens in one request, counting input and output together.

Concept | Developer takeaway
Long files eat context | Attach only relevant files
Long chat history fills window | Summarize or start fresh sessions
Output length is capped | Set max_tokens; stream for UX

See Cost, Latency, and Tokens.
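The budgeting idea in the table can be sketched as a small helper. This is a rough sketch under stated assumptions: the figure of about 4 characters per token is a common rule of thumb for English text, not an exact count, and `ContextFile`, `estimateTokens`, and `selectFiles` are hypothetical names introduced here. Use your provider's tokenizer when you need real numbers.

```typescript
// Assumption: roughly 4 characters per token. Real tokenizers vary by model
// and by content (code, non-English text, and whitespace all differ).
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

interface ContextFile {
  path: string;
  content: string;
}

// Keep files, in the order given (most relevant first), while they still fit
// in the window after reserving room for the prompt and the model's output.
function selectFiles(
  files: ContextFile[],
  prompt: string,
  contextWindow: number,
  maxOutputTokens: number
): ContextFile[] {
  let remaining = contextWindow - maxOutputTokens - estimateTokens(prompt);
  const selected: ContextFile[] = [];
  for (const file of files) {
    const cost = estimateTokens(file.content);
    if (cost <= remaining) {
      selected.push(file);
      remaining -= cost;
    }
  }
  return selected;
}
```

Reserving `maxOutputTokens` up front reflects the fact that input and output share the same window: a prompt that fills the window leaves the model no room to answer.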

Transformers and Attention (Intuition)

Modern LLMs use transformer architecture with attention — each token can "look at" other tokens in the context to resolve meaning.

You do not need to implement attention. You need to know:

  • Order and proximity matter — put important instructions early
  • Distant context may get less weight in very long prompts
  • Code far from the current task may be ignored unless @-mentioned

Training and Knowledge Cutoffs

Models learn from large text corpora (code, docs, forums) up to a training cutoff date. They do not know:

  • APIs released after cutoff (unless tools fetch live docs)
  • Your private codebase (unless you provide context)
  • Current package versions unless stated in prompt/rules

This drives hallucinated APIs: the model fills gaps with plausible fiction.

What hallucinated output looks like

// AI-generated — looks correct, does not exist
import { useFormValidation } from "react-hook-form";

const { validateAsync, fieldErrors } = useFormValidation({
  schema: signupSchema,
  onSuccess: handleSignup,
});

useFormValidation is not a real react-hook-form export. The model assembled something plausible from its training data. Running npm test or checking the docs catches it; reading the code by eye does not.

Another common form: the model uses a real package but invents method names:

// Anthropic SDK hallucination — real package, fake method
const stream = await anthropic.messages.streamWithRetry({ ... });
// Actual method: anthropic.messages.stream()

Fix: Verify unfamiliar imports against official docs. Check method signatures in your IDE's type definitions after generation.
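Part of that verification can be automated. The helper below is a minimal sketch, and `missingExports` is a hypothetical name introduced here: given a loaded module object, it reports which of the names the generated code relies on are not actually exported. It only checks that a name exists, not that its signature matches, so it complements the type checker and the docs rather than replacing them.

```typescript
// Return the names that are NOT present on the loaded module object.
function missingExports(moduleNamespace: object, names: string[]): string[] {
  return names.filter((name) => !(name in moduleNamespace));
}

// Possible usage in a Node project where the package is installed:
//
//   const mod = await import("react-hook-form");
//   const missing = missingExports(mod, ["useForm", "useFormValidation"]);
//   if (missing.length > 0) {
//     throw new Error("Not exported: " + missing.join(", "));
//   }
```

For the example above, such a check should flag `useFormValidation`, since the package does not export it. A TypeScript build catches the same problem at compile time, provided the package ships type definitions.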

Fine-Tuning vs Prompting vs RAG

Approach | What it changes | When
Prompting + rules | Behavior per session | Default for coding tools
RAG | Facts retrieved at query time | Internal docs, large codebases
Fine-tuning | Model weights | Domain-specific language; rare for app devs

Most teams should exhaust prompting, IDE indexing, and MCP before considering fine-tuning.
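To make "facts retrieved at query time" concrete, here is a deliberately simplified RAG sketch. It ranks documents by how many query words they share and pastes the best matches into the prompt. Production systems typically use embeddings and a vector index instead of word overlap; the `Doc` shape, the function names, and the prompt wording are all assumptions made for illustration.

```typescript
interface Doc {
  id: string;
  text: string;
}

// Lowercase, split on non-alphanumerics, and drop duplicates.
function distinctWords(text: string): string[] {
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  return words.filter((w, i) => words.indexOf(w) === i);
}

// Score each doc by the number of distinct query words it contains,
// then return the top k docs with a non-zero score.
function retrieve(query: string, docs: Doc[], k: number): Doc[] {
  const queryWords = distinctWords(query);
  return docs
    .map((doc) => {
      const docWords = distinctWords(doc.text);
      const score = queryWords.filter((w) => docWords.indexOf(w) !== -1).length;
      return { doc, score };
    })
    .filter((scored) => scored.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((scored) => scored.doc);
}

// The retrieved text goes into the prompt; the model's weights are untouched.
function buildPrompt(query: string, docs: Doc[]): string {
  const context = docs.map((d) => `[${d.id}] ${d.text}`).join("\n");
  return `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${query}`;
}
```

The contrast with fine-tuning is visible in `buildPrompt`: updating the model's knowledge means editing the document store, not retraining.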

Standard Models vs Reasoning Models

Modern providers offer two distinct model families with different tradeoffs:

Aspect | Standard (e.g. GPT-4o, Claude Sonnet) | Reasoning (e.g. o3, o4-mini, Claude with thinking)
How it works | Predicts tokens directly | Internal "scratchpad" before answering
Speed | Fast (seconds) | Slower (tens of seconds to minutes)
Best for | Chat, completions, boilerplate, tool calling | Hard bugs, algorithm design, multi-step logic
Cost | Lower (standard input/output pricing) | Higher (thinking tokens billed as output)
Temperature | Adjustable | Often fixed at 1 for reasoning

When to use reasoning models:

  • Debugging a race condition or memory leak with no obvious cause
  • Designing a schema migration with complex dependencies
  • Analyzing a security vulnerability across multiple files
  • Any task where you'd want a senior engineer to "think it through"

When to stick with standard models:

  • Inline completions and chat replies
  • Boilerplate generation from a clear pattern
  • Classification, routing, form extraction
  • Anything latency-sensitive

You can mix models in the same workflow: use a reasoning model for Plan, a standard model for Agent implementation.
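One way to encode the guidance above is a small routing function. This is a sketch only: the task categories are a simplification of the two lists, and the model IDs are placeholders, not real identifiers. Substitute the pinned IDs your provider documents.

```typescript
type TaskKind =
  | "inline-completion"
  | "boilerplate"
  | "classification"
  | "debugging"
  | "schema-migration"
  | "security-review";

type ModelFamily = "standard" | "reasoning";

interface Task {
  kind: TaskKind;
  latencySensitive: boolean;
}

// Task kinds where extra "thinking" time tends to pay off.
const REASONING_TASKS: TaskKind[] = ["debugging", "schema-migration", "security-review"];

function chooseFamily(task: Task): ModelFamily {
  // Latency wins: anything latency-sensitive stays on a standard model.
  if (task.latencySensitive) return "standard";
  return REASONING_TASKS.indexOf(task.kind) !== -1 ? "reasoning" : "standard";
}

// Placeholder IDs (hypothetical): replace with real, pinned model IDs.
const MODEL_IDS: Record<ModelFamily, string> = {
  standard: "your-standard-model-id",
  reasoning: "your-reasoning-model-id",
};

function chooseModel(task: Task): string {
  return MODEL_IDS[chooseFamily(task)];
}
```

Checking `latencySensitive` first is a design choice: it keeps interactive paths fast even when the task would otherwise qualify for a reasoning model.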

Capabilities and Hard Limits

Good at:

  • Boilerplate matching existing patterns
  • Explaining familiar code structures
  • Regex, SQL, config from examples
  • Refactors with clear instructions

Bad at:

  • Guaranteed correctness — Example: a generated auth middleware that checks the wrong field name will look exactly like a correct one
  • Novel architecture — The model will produce a design, not necessarily the right design for your constraints
  • Pixel-perfect UI — Generated Tailwind className stacks rarely match a comp without visual iteration
  • Secrets and security — SQL injection, hardcoded keys, and missing auth checks appear in plausible, compiling code; see Security Anti-patterns
  • Package version accuracy — A model trained in late 2024 knows nothing about a package update from 2025

Temperature and Determinism

Higher temperature → more creative, less reproducible. For code generation, lower temperature (or provider "deterministic" settings) reduces randomness.

Run the same prompt twice and you may get different diffs. That is expected. For CI automation, use a lower temperature and pin model IDs so output is more consistent; even at temperature 0, identical output is not guaranteed.
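The effect of temperature can be shown with the standard temperature-scaled softmax: each raw score (logit) is divided by the temperature before being converted to a probability. The logits below are made up for illustration, and providers may apply further sampling steps (such as top-p) that this sketch leaves out.

```typescript
// Convert logits to probabilities after dividing by the temperature.
// Lower temperature concentrates probability on the highest-scoring token.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature <= 0) {
    throw new Error("temperature must be greater than 0");
  }
  const scaled = logits.map((logit) => logit / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Hypothetical logits for three candidate tokens.
const exampleLogits = [2, 1, 0];
const cold = softmaxWithTemperature(exampleLogits, 0.2);
const neutral = softmaxWithTemperature(exampleLogits, 1);
const hot = softmaxWithTemperature(exampleLogits, 2);
```

For these logits the top token's probability is roughly 0.99 at temperature 0.2, 0.67 at 1, and 0.51 at 2. At low temperature almost every sample picks the same token, which is why repeated runs look more alike.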
