How LLMs Work
Large language models (LLMs) predict the next token in a sequence. That simple mechanism, scaled up, produces code that often works — and often does not.
What an LLM Actually Does
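Given the text so far, the model outputs a probability distribution over possible next tokens. One token is chosen and appended, and the process repeats until a stop condition is met (an end-of-sequence token or an output limit). There is no built-in compiler, test runner, or connection to your repo unless a tool provides one.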
Implication: Fluency ≠ correctness. The model optimizes for plausible text, not runnable code.
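The loop described above can be sketched with a toy "model". Everything here is illustrative: the `toyModel` lookup table, its probabilities, and the `<start>`/`<stop>` markers are made up for the example, and a real LLM computes its distribution with a neural network over the whole context rather than only the previous token.

```typescript
// Toy next-token loop. The lookup table is a hypothetical stand-in for a real model.
type Distribution = Record<string, number>;

const STOP = "<stop>";

// Hypothetical "model": maps the previous token to next-token probabilities.
const toyModel: Record<string, Distribution> = {
  "<start>": { "const": 0.7, "let": 0.3 },
  "const": { "x": 0.6, "y": 0.4 },
  "let": { "x": 0.5, "y": 0.5 },
  "x": { "=": 0.9, [STOP]: 0.1 },
  "y": { "=": 0.9, [STOP]: 0.1 },
  "=": { "1": 0.8, "2": 0.2 },
  "1": { [STOP]: 1.0 },
  "2": { [STOP]: 1.0 },
};

// Greedy decoding: always take the most probable next token.
function pickMostLikely(dist: Distribution): string {
  let best = STOP;
  let bestP = -1;
  for (const token of Object.keys(dist)) {
    if (dist[token] > bestP) {
      best = token;
      bestP = dist[token];
    }
  }
  return best;
}

// Repeat "predict, append" until the stop token or the output cap is reached.
function generate(maxTokens: number): string[] {
  const output: string[] = [];
  let previous = "<start>";
  while (output.length < maxTokens) {
    const dist = toyModel[previous] ?? { [STOP]: 1.0 };
    const next = pickMostLikely(dist);
    if (next === STOP) break;
    output.push(next);
    previous = next;
  }
  return output;
}

console.log(generate(10).join(" ")); // prints "const x = 1"
```

Nothing in this loop checks that the output compiles or runs. The result is simply the most probable sequence under the table, which is the point of the implication above.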
Tokens and Context Windows
Text is split into tokens (word pieces). Each model has a context window: the maximum number of tokens in one request, input and output combined.
| Concept | Developer takeaway |
|---|---|
| Long files eat context | Attach only relevant files |
| Long chat history fills window | Summarize or start fresh sessions |
| Output length is capped | Set max_tokens; stream for UX |
See Cost, Latency, and Tokens.
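A rough budget check makes the table concrete. This is a minimal sketch under stated assumptions: the 4-characters-per-token ratio is a common rule of thumb for English text, not a real tokenizer, and `checkBudget`, `contextWindow`, and `maxOutputTokens` are hypothetical names that are not tied to any provider's API.

```typescript
// Assumption: roughly 4 characters per token. Real tokenizers vary by model,
// so use the provider's tokenizer when you need exact counts.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

interface BudgetReport {
  inputTokens: number;       // estimated tokens across all input parts
  remainingForInput: number; // negative when the input is over budget
  fits: boolean;
}

// The window covers input and output, so reserve the output cap first.
function checkBudget(
  parts: string[],
  contextWindow: number,
  maxOutputTokens: number
): BudgetReport {
  const inputTokens = parts.reduce((sum, part) => sum + estimateTokens(part), 0);
  const inputBudget = contextWindow - maxOutputTokens;
  return {
    inputTokens,
    remainingForInput: inputBudget - inputTokens,
    fits: inputTokens <= inputBudget,
  };
}

// Two attached "files" of 4000 and 2000 characters against a small window.
const report = checkBudget(["a".repeat(4000), "b".repeat(2000)], 2000, 500);
console.log(report); // inputTokens: 1500, remainingForInput: 0, fits: true
```

When `fits` is false, the takeaways in the table apply: attach fewer files, summarize the history, or lower the reserved output length.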
Transformers and Attention (Intuition)
Modern LLMs use transformer architecture with attention — each token can "look at" other tokens in the context to resolve meaning.
You do not need to implement attention. You need to know:
- Order and proximity matter — put important instructions early
- Distant context may get less weight in very long prompts
- Code far from the current task may be ignored unless @-mentioned
Training and Knowledge Cutoffs
Models learn from large text corpora (code, docs, forums) up to a training cutoff date. They do not know:
- APIs released after cutoff (unless tools fetch live docs)
- Your private codebase (unless you provide context)
- Current package versions unless stated in prompt/rules
This drives hallucinated APIs: the model fills gaps with plausible fiction.
What hallucinated output looks like
// AI-generated — looks correct, does not exist
import { useFormValidation } from "react-hook-form";
const { validateAsync, fieldErrors } = useFormValidation({
schema: signupSchema,
onSuccess: handleSignup,
});
useFormValidation is not a real react-hook-form export. The model assembled something plausible from its training data. Running npm test or checking the docs catches it; a visual review does not.
Another common form: the model uses a real package but invents method names:
// Anthropic SDK hallucination — real package, fake method
const stream = await anthropic.messages.streamWithRetry({ ... });
// Actual method: anthropic.messages.stream()
Fix: Verify unfamiliar imports against official docs. Check method signatures in your IDE's type definitions after generation.
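The check can also be automated. The sketch below compares the names generated code wants to import against what a module object actually exposes. `findMissingExports` is a hypothetical helper, and `fakeLibrary` is a stand-in object so the example runs without installing anything; in a real project you would pass the imported module namespace instead.

```typescript
// Returns the requested names that the module object does not actually export.
function findMissingExports(
  moduleExports: Record<string, unknown>,
  names: string[]
): string[] {
  // Own-property check so inherited names like "toString" do not count as exports.
  return names.filter(
    (name) => !Object.prototype.hasOwnProperty.call(moduleExports, name)
  );
}

// Hypothetical stand-in for a real module's exports (illustration only).
const fakeLibrary: Record<string, unknown> = {
  useForm: () => ({}),
  useWatch: () => ({}),
};

const missing = findMissingExports(fakeLibrary, ["useForm", "useFormValidation"]);
console.log(missing); // prints [ 'useFormValidation' ]
```

A TypeScript compile step gives the same signal for typed packages. This kind of runtime check is mainly useful for untyped dependencies or in a CI script.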
Fine-Tuning vs Prompting vs RAG
| Approach | What it changes | When |
|---|---|---|
| Prompting + rules | Behavior per session | Default for coding tools |
| RAG | Facts retrieved at query time | Internal docs, large codebases |
| Fine-tuning | Model weights | Domain-specific language; rare for app devs |
Most teams should exhaust prompting, IDE indexing, and MCP (Model Context Protocol) tools before considering fine-tuning.
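To show what "facts retrieved at query time" means in the RAG row, here is a deliberately simplified sketch. It scores documents by keyword overlap, whereas production systems typically use embeddings and a vector index. The `Doc` shape, the sample documents, and the prompt wording are all invented for illustration.

```typescript
interface Doc {
  id: string;
  text: string;
}

// Lowercase the text and split it into a set of distinct words.
function toWordSet(text: string): Set<string> {
  return new Set(
    text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0)
  );
}

// Score = number of distinct query words that also appear in the document.
function retrieve(query: string, docs: Doc[], topK: number): Doc[] {
  const queryWords = Array.from(toWordSet(query));
  return docs
    .map((doc) => {
      const words = toWordSet(doc.text);
      const score = queryWords.filter((w) => words.has(w)).length;
      return { doc, score };
    })
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((entry) => entry.doc);
}

// Retrieved text is placed in the prompt; the model weights are untouched.
function buildPrompt(query: string, docs: Doc[]): string {
  const context = docs.map((d) => `[${d.id}] ${d.text}`).join("\n");
  return `Use only the context below to answer.\n\nContext:\n${context}\n\nQuestion: ${query}`;
}

// Made-up internal docs for the example.
const internalDocs: Doc[] = [
  { id: "auth", text: "The auth middleware reads the session token from the Authorization header." },
  { id: "billing", text: "Billing retries failed invoices three times before alerting." },
  { id: "deploy", text: "Deploys run through the staging pipeline first." },
];

const question = "Where does the auth middleware read the token?";
const hits = retrieve(question, internalDocs, 1);
console.log(buildPrompt(question, hits));
```

The design point is that updating the documents changes the answers immediately, with no retraining, which is why RAG suits internal docs and large codebases.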
Standard Models vs Reasoning Models
Modern providers offer two distinct model families with different tradeoffs:
| | Standard (e.g. GPT-4o, Claude Sonnet) | Reasoning (e.g. o3, o4-mini, Claude with thinking) |
|---|---|---|
| How it works | Predicts tokens directly | Internal "scratchpad" before answering |
| Speed | Fast (seconds) | Slower (tens of seconds to minutes) |
| Best for | Chat, completions, boilerplate, tool calling | Hard bugs, algorithm design, multi-step logic |
| Cost | Lower (standard input/output pricing) | Higher (thinking tokens billed as output) |
| Temperature | Adjustable | Often fixed at 1 for reasoning |
When to use reasoning models:
- Debugging a race condition or memory leak with no obvious cause
- Designing a schema migration with complex dependencies
- Analyzing a security vulnerability across multiple files
- Any task where you'd want a senior engineer to "think it through"
When to stick with standard models:
- Inline completions and chat replies
- Boilerplate generation from a clear pattern
- Classification, routing, form extraction
- Anything latency-sensitive
You can mix models in the same workflow: use a reasoning model for Plan, a standard model for Agent implementation.
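The guidance above can be written down as a small routing rule. This is one possible sketch, not a recommended policy: the `TaskKind` categories and the rule that latency-sensitive work always goes to a standard model are assumptions drawn from the two lists, and mapping a family to a concrete model ID is left to your own configuration.

```typescript
type TaskKind =
  | "completion"
  | "boilerplate"
  | "classification"
  | "debugging"
  | "design"
  | "security-review";

type ModelFamily = "standard" | "reasoning";

interface Task {
  kind: TaskKind;
  latencySensitive: boolean;
}

// Task kinds that benefit from a model "thinking it through" (assumed grouping).
const NEEDS_DELIBERATION: TaskKind[] = ["debugging", "design", "security-review"];

function chooseModelFamily(task: Task): ModelFamily {
  // Latency-sensitive work stays on standard models regardless of difficulty.
  if (task.latencySensitive) return "standard";
  return NEEDS_DELIBERATION.indexOf(task.kind) >= 0 ? "reasoning" : "standard";
}

console.log(chooseModelFamily({ kind: "debugging", latencySensitive: false })); // prints "reasoning"
console.log(chooseModelFamily({ kind: "completion", latencySensitive: true })); // prints "standard"
```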
Capabilities and Hard Limits
Good at:
- Boilerplate matching existing patterns
- Explaining familiar code structures
- Regex, SQL, config from examples
- Refactors with clear instructions
Bad at:
- Guaranteed correctness — Example: a generated auth middleware that checks the wrong field name will look exactly like a correct one
- Novel architecture — The model will produce a design, not necessarily the right design for your constraints
- Pixel-perfect UI — Generated Tailwind className stacks rarely match a comp without visual iteration
- Secrets and security — SQL injection, hardcoded keys, and missing auth checks appear in plausible, compiling code; see Security Anti-patterns
- Package version accuracy — A model trained in late 2024 knows nothing about a package update from 2025
Temperature and Determinism
Higher temperature → more creative, less reproducible. For code generation, lower temperature (or provider "deterministic" settings) reduces randomness.
Run the same prompt twice and you may get different diffs. That is expected. For CI automation, use lower temperature and pin model IDs so output is more consistent; even then, do not assume byte-identical results.
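The effect of temperature is easier to see numerically. The sketch below applies the standard softmax-with-temperature formula to made-up scores for three candidate tokens. The `logits` values are invented for illustration, and the function only handles temperatures above 0; how a provider treats a setting of exactly 0 is outside this sketch.

```typescript
// Convert raw scores (logits) into probabilities, scaled by temperature.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature <= 0) throw new Error("temperature must be greater than 0");
  const scaled = logits.map((x) => x / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((x) => Math.exp(x - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// Made-up scores for three candidate next tokens.
const logits = [2.0, 1.0, 0.5];

for (const t of [0.2, 1.0, 2.0]) {
  const probs = softmaxWithTemperature(logits, t).map((p) => p.toFixed(3));
  console.log(`temperature ${t}:`, probs.join(", "));
}
```

At low temperature almost all probability mass lands on the top-scoring token, so sampling rarely picks anything else. At high temperature the distribution flattens and repeated runs diverge more often.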
Next Steps
- Tokens and Context — practical context budgeting
- Evaluating Model Output — when to trust vs verify
- Common Mistakes in AI-Generated Code: failures you will see in practice
- Prompting for Code — how to steer models
- Context Engineering — what to show the model