How LLMs Work (For Developers)

Large language models (LLMs) predict the next token in a sequence. That simple mechanism, scaled up, produces code that often works — and often does not.

What an LLM Actually Does

Given the text so far, the model outputs a probability distribution over possible next tokens. One token is picked and appended, and the process repeats until a stop condition is reached. There is no built-in compiler, test runner, or connection to your repo unless a tool provides one.

Implication: Fluency ≠ correctness. The model optimizes for plausible text, not runnable code.
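The generation loop can be sketched in a few lines. This is a toy, not how any real model is implemented: the probability table is invented, it conditions only on the previous token (a real LLM conditions on the whole context), and it always picks the most likely token (greedy decoding) instead of sampling.

```typescript
// Toy "model": for each previous token, a probability distribution over next tokens.
// The table is invented for illustration; a real LLM computes it with a neural network.
type Distribution = Record<string, number>;

const toyModel: Record<string, Distribution> = {
  "<start>": { "const": 0.6, "let": 0.4 },
  "const": { "x": 0.7, "y": 0.3 },
  "let": { "x": 0.6, "y": 0.4 },
  "x": { "=": 0.9, "<eos>": 0.1 },
  "y": { "=": 0.9, "<eos>": 0.1 },
  "=": { "1": 0.8, "2": 0.2 },
  "1": { "<eos>": 1.0 },
  "2": { "<eos>": 1.0 },
};

// Greedy decoding: return the most probable next token.
function nextToken(previous: string): string {
  const dist = toyModel[previous];
  if (dist === undefined) return "<eos>"; // unknown context: stop
  let best = "<eos>";
  let bestProb = -1;
  for (const token of Object.keys(dist)) {
    const prob = dist[token] ?? 0;
    if (prob > bestProb) {
      best = token;
      bestProb = prob;
    }
  }
  return best;
}

// Repeat until a stop condition: the end-of-sequence token or the token cap.
function generate(maxTokens: number): string[] {
  const output: string[] = [];
  let previous = "<start>";
  while (output.length < maxTokens) {
    const token = nextToken(previous);
    if (token === "<eos>") break;
    output.push(token);
    previous = token;
  }
  return output;
}

console.log(generate(10).join(" ")); // const x = 1
```

Nothing in this loop checks whether the output compiles or runs. It only follows probabilities, which is why fluent output still needs verification.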

Tokens and Context Windows

Text is split into tokens (word pieces). Each model has a context window: the maximum number of tokens in one request (input + output).

Concept	Developer takeaway
Long files eat context	Attach only relevant files
Long chat history fills window	Summarize or start fresh sessions
Output length is capped	Set `max_tokens`; stream for UX

See Cost, Latency, and Tokens.
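A rough way to budget context before sending a request is sketched below. Everything here is an assumption for illustration: the 4-characters-per-token ratio is only a rule of thumb, the `relevance` score is hypothetical, and real counts depend on the model's tokenizer, so use your provider's token counter when accuracy matters.

```typescript
// Rough heuristic (assumption): about 4 characters per token.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

interface SourceFile {
  path: string;
  content: string;
  relevance: number; // hypothetical score, higher means more relevant to the task
}

// Pick the most relevant files that fit, leaving room for the prompt and the reply.
function selectFiles(
  files: SourceFile[],
  contextWindow: number,
  promptTokens: number,
  maxOutputTokens: number,
): SourceFile[] {
  let remaining = contextWindow - promptTokens - maxOutputTokens;
  const chosen: SourceFile[] = [];
  const byRelevance = files.slice().sort((a, b) => b.relevance - a.relevance);
  for (const file of byRelevance) {
    const cost = estimateTokens(file.content);
    if (cost <= remaining) {
      chosen.push(file);
      remaining -= cost;
    }
  }
  return chosen;
}
```

The output budget is subtracted up front because input and output share the same window. A prompt that fills the window leaves no room for the answer.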

Transformers and Attention (Intuition)

Modern LLMs use transformer architecture with attention — each token can "look at" other tokens in the context to resolve meaning.

You do not need to implement attention. You need to know:

Order and proximity matter — put important instructions early
Distant context may get less weight in very long prompts
Code far from the current task may be ignored unless you reference it explicitly (for example, by @-mentioning the file in tools that support it)
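For intuition only, here is a minimal sketch of how attention turns similarity into weights, using scaled dot-product attention for a single query. The 2-dimensional vectors are invented. Real models use learned, high-dimensional vectors and many attention heads, so treat this as an illustration of the idea, not of any specific model.

```typescript
// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] ?? 0) * (b[i] ?? 0);
  return sum;
}

// Softmax: turns arbitrary scores into positive weights that sum to 1.
function softmax(scores: number[]): number[] {
  const max = Math.max(...scores); // subtract the max for numerical stability
  const exps = scores.map((s) => Math.exp(s - max));
  const total = exps.reduce((acc, e) => acc + e, 0);
  return exps.map((e) => e / total);
}

// Scaled dot-product attention weights for one query against every key.
function attentionWeights(query: number[], keys: number[][]): number[] {
  const scale = Math.sqrt(query.length);
  return softmax(keys.map((k) => dot(query, k) / scale));
}

// Toy vectors (invented): the query is most similar to the first key.
const toyQuery = [1, 0];
const toyKeys = [
  [1, 0], // closely related token
  [0, 1], // unrelated token
  [0.5, 0.5], // partially related token
];
const weights = attentionWeights(toyQuery, toyKeys);
console.log(weights.map((w) => w.toFixed(2)).join(" "));
```

The weights always sum to 1, so every extra token in the context competes for a share. That is one way to picture why irrelevant context can dilute the parts that matter.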

Training and Knowledge Cutoffs

Models learn from large text corpora (code, docs, forums) up to a training cutoff date. They do not know:

APIs released after cutoff (unless tools fetch live docs)
Your private codebase (unless you provide context)
Current package versions unless stated in prompt/rules

This drives hallucinated APIs: the model fills gaps with plausible fiction.
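One way to close the version gap is to state installed versions in the prompt or rules file. The helper below is a hypothetical sketch: the function name, the wording of the header line, and the example versions are all placeholders, and the input is assumed to be shaped like the `dependencies` field of a package.json.

```typescript
// Build a short context block listing the exact dependency versions in use,
// so the model does not have to guess from its training data.
function describeDependencies(deps: Record<string, string>): string {
  const lines = Object.keys(deps)
    .sort()
    .map((pkg) => `- ${pkg}@${deps[pkg]}`);
  const header = "Installed package versions (use only APIs available in these):";
  return [header].concat(lines).join("\n");
}

// Example input; the versions are placeholders for illustration, not recommendations.
const exampleDeps: Record<string, string> = {
  "react-hook-form": "7.0.0",
  "react": "18.0.0",
};
console.log(describeDependencies(exampleDeps));
```

This does not teach the model anything new about those versions. It only narrows what the model should assume, so verification against the docs is still needed.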

What hallucinated output looks like

// AI-generated — looks correct, does not exist
import { useFormValidation } from "react-hook-form";

const { validateAsync, fieldErrors } = useFormValidation({
  schema: signupSchema,
  onSuccess: handleSignup,
});

useFormValidation is not a real react-hook-form export. The model assembled something plausible from its training data. Running npm test, type-checking, or reading the docs catches it; a visual review usually does not.

Another common form: the model uses a real package but invents method names:

// Anthropic SDK hallucination — real package, fake method
const stream = await anthropic.messages.streamWithRetry({ ... });
// Actual method: anthropic.messages.stream()

Fix: Verify unfamiliar imports against official docs. Check method signatures in your IDE's type definitions after generation.
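The import check can also be automated. The sketch below compares the names generated code wants to import against what a module actually exports. `findMissingExports` and `fakeModule` are hypothetical names, and the stand-in object exists only so the example runs without installing anything.

```typescript
// Return the requested names that the loaded module does not actually export.
function findMissingExports(
  moduleExports: Record<string, unknown>,
  requested: string[],
): string[] {
  return requested.filter(
    (exportName) => !Object.prototype.hasOwnProperty.call(moduleExports, exportName),
  );
}

// Stand-in for a loaded module (assumption for the sketch).
// In a real project you might pass the result of: await import("react-hook-form")
const fakeModule: Record<string, unknown> = {
  useForm: () => ({}),
  useWatch: () => ({}),
};

// Reports useFormValidation as missing, because the stand-in does not export it.
console.log(findMissingExports(fakeModule, ["useForm", "useFormValidation"]));
```

In a typed project the compiler already does this for you. A runtime check like this is mainly useful for untyped packages or plain JavaScript.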

Fine-Tuning vs Prompting vs RAG

Approach	What it changes	When
Prompting + rules	Behavior per session	Default for coding tools
RAG	Facts retrieved at query time	Internal docs, large codebases
Fine-tuning	Model weights	Domain-specific language; rare for app devs

Most teams should exhaust prompting, IDE indexing, and MCP tools before considering fine-tuning.
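To make the RAG row concrete, here is a minimal sketch of the retrieve-then-prompt flow. The scoring is a deliberately naive word-overlap count chosen so the example stays self-contained; production systems typically use embeddings or a search index. All names and the prompt wording are illustrative.

```typescript
interface DocChunk {
  id: string;
  text: string;
}

// Lowercase alphanumeric words in a piece of text.
function words(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
}

// Score = number of distinct query words that also appear in the chunk.
function overlapScore(queryText: string, chunk: DocChunk): number {
  const chunkWords = words(chunk.text);
  const unique = words(queryText).filter((w, i, all) => all.indexOf(w) === i);
  return unique.filter((w) => chunkWords.indexOf(w) !== -1).length;
}

// Keep the topK chunks with a non-zero score, best first.
function retrieve(queryText: string, chunks: DocChunk[], topK: number): DocChunk[] {
  return chunks
    .map((chunk) => ({ chunk, s: overlapScore(queryText, chunk) }))
    .filter((entry) => entry.s > 0)
    .sort((a, b) => b.s - a.s)
    .slice(0, topK)
    .map((entry) => entry.chunk);
}

// Facts are fetched at query time and placed in the prompt; model weights never change.
function buildRagPrompt(question: string, chunks: DocChunk[], topK: number): string {
  const context = retrieve(question, chunks, topK)
    .map((c) => `[${c.id}] ${c.text}`)
    .join("\n");
  return `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`;
}
```

The retrieved text still counts against the context window, so `topK` is a budget decision as much as a relevance one.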

Standard Models vs Reasoning Models

Modern providers offer two distinct model families with different tradeoffs:

Aspect	Standard (e.g. GPT-4o, Claude Sonnet)	Reasoning (e.g. o3, o4-mini, Claude with `thinking`)
How it works	Predicts tokens directly	Internal "scratchpad" before answering
Speed	Fast (seconds)	Slower (tens of seconds to minutes)
Best for	Chat, completions, boilerplate, tool calling	Hard bugs, algorithm design, multi-step logic
Cost	Lower (standard input/output pricing)	Higher (thinking tokens billed as output)
Temperature	Adjustable	Often fixed at 1 for reasoning

When to use reasoning models:

Debugging a race condition or memory leak with no obvious cause
Designing a schema migration with complex dependencies
Analyzing a security vulnerability across multiple files
Any task where you'd want a senior engineer to "think it through"

When to stick with standard models:

Inline completions and chat replies
Boilerplate generation from a clear pattern
Classification, routing, form extraction
Anything latency-sensitive

You can mix models in the same workflow: use a reasoning model for Plan, a standard model for Agent implementation.
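A routing rule like this can live in code. The sketch below is one possible shape, under stated assumptions: the task categories mirror the lists above, and the model IDs are placeholders rather than real provider identifiers.

```typescript
type TaskKind =
  | "completion"
  | "boilerplate"
  | "classification"
  | "debugging"
  | "migration-design"
  | "security-review";

interface ModelChoice {
  family: "standard" | "reasoning";
  model: string; // placeholder ID; substitute whatever your provider offers
}

// Hypothetical model IDs, not real provider identifiers.
const STANDARD_MODEL = "standard-model-placeholder";
const REASONING_MODEL = "reasoning-model-placeholder";

function pickModel(kind: TaskKind, latencySensitive: boolean): ModelChoice {
  const needsReasoning =
    kind === "debugging" || kind === "migration-design" || kind === "security-review";
  // Latency-sensitive work stays on the standard family even for hard tasks.
  if (needsReasoning && !latencySensitive) {
    return { family: "reasoning", model: REASONING_MODEL };
  }
  return { family: "standard", model: STANDARD_MODEL };
}
```

For example, `pickModel("debugging", false)` selects the reasoning family, while `pickModel("debugging", true)` falls back to the standard one.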

Capabilities and Hard Limits

Good at:

Boilerplate matching existing patterns
Explaining familiar code structures
Regex, SQL, config from examples
Refactors with clear instructions

Bad at:

Guaranteed correctness — Example: a generated auth middleware that checks the wrong field name will look exactly like a correct one
Novel architecture — The model will produce a design, not necessarily the right design for your constraints
Pixel-perfect UI — Generated Tailwind className stacks rarely match a comp without visual iteration
Secrets and security — SQL injection, hardcoded keys, and missing auth checks appear in plausible, compiling code; see Security Anti-patterns
Package version accuracy — A model trained in late 2024 knows nothing about a package update from 2025
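The wrong-field-name case is easy to miss in review and easy to catch with a test. In this hypothetical sketch, the token claims store the role under `role`, while the generated check reads `roles`. Both function names and the claim layout are invented for illustration.

```typescript
type Claims = Record<string, unknown>;

// Generated version: reads "roles", but this (hypothetical) token format
// stores the value under "role". It compiles and reads naturally.
function isAdminGenerated(claims: Claims): boolean {
  return claims["roles"] === "admin";
}

// Intended version: reads the field the token actually contains.
function isAdmin(claims: Claims): boolean {
  return claims["role"] === "admin";
}

const adminClaims: Claims = { sub: "user-1", role: "admin" };

console.log(isAdminGenerated(adminClaims)); // false: real admins are rejected
console.log(isAdmin(adminClaims)); // true
```

The two functions differ by one character. A single assertion against a known admin token exposes the difference immediately.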

Temperature and Determinism

Higher temperature → more creative, less reproducible. For code generation, lower temperature (or provider "deterministic" settings) reduces randomness.

Run the same prompt twice and you may get different diffs. That is expected. For CI automation, use a lower temperature and pin model IDs so output is more consistent. Even at temperature 0, most providers do not guarantee identical output across runs.
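The effect of temperature can be seen in the math. In the common formulation, the model's raw scores (logits) are divided by the temperature before softmax. The sketch below uses invented logits for three candidate tokens. Providers may apply further sampling controls on top, so this shows the principle, not any specific API's behavior.

```typescript
// Convert raw scores (logits) to probabilities, scaled by temperature.
function softmaxWithTemperature(logits: number[], temperature: number): number[] {
  if (temperature <= 0) throw new Error("temperature must be > 0 in this sketch");
  const scaled = logits.map((l) => l / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const total = exps.reduce((acc, e) => acc + e, 0);
  return exps.map((e) => e / total);
}

const toyLogits = [2.0, 1.0, 0.5]; // invented scores for three candidate tokens
for (const t of [0.2, 1.0, 2.0]) {
  const probs = softmaxWithTemperature(toyLogits, t);
  console.log(`T=${t}:`, probs.map((p) => p.toFixed(3)).join(" "));
}
```

At low temperature nearly all probability lands on the top token, so repeated runs tend to agree. At high temperature the distribution flattens and less likely tokens get sampled more often.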

Next Steps

Tokens and Context — practical context budgeting
Evaluating Model Output — when to trust vs verify
Common Mistakes in AI-Generated Code: failures you will see in practice
Prompting for Code — how to steer models
Context Engineering — what to show the model
