Fine-Tuning vs RAG
Most teams need RAG or better prompts, not fine-tuning. Fine-tuning pays off when you need consistent style or domain format at scale — not when you need fresh facts.
Last reviewed: June 2026
Fine-tuning APIs and pricing change by provider. Verify OpenAI, Anthropic, and Google fine-tuning docs before committing.
Decision matrix
| Need | Best approach |
|---|---|
| Answer from private docs (policies, code, tickets) | RAG |
| Stable output format (JSON, tone, template) | Structured outputs first; fine-tune if insufficient |
| Up-to-date facts (pricing, APIs) | RAG or tool calling — fine-tune goes stale |
| Cheap high-volume classification | Fine-tune small model or structured outputs on mini model |
| Coding agent over monorepo | RAG / codebase search / MCP — not fine-tune |
| Proprietary medical/legal phrasing | Fine-tune + RAG + human review |
Comparison
| RAG | Fine-tuning | Prompt + rules | |
|---|---|---|---|
| Setup effort | Index pipeline | Dataset + training jobs | Low |
| Fresh data | Re-index | Retrain | Edit prompt |
| Cost model | Retrieval + inference | Training + inference | Inference only |
| Hallucination risk | Grounded if cited | Can still hallucinate | Highest |
| Maintenance | Index drift | Dataset drift | Prompt drift |
Full RAG guide: RAG for Codebases.
When fine-tuning makes sense
- Thousands of examples of desired input→output pairs
- Style/format consistency matters more than factual retrieval
- Latency budget requires smaller fine-tuned model vs large prompt
- Legal/compliance approved training data pipeline
When fine-tuning is the wrong tool
- Knowledge changes weekly (product catalog, API docs)
- Small team without eval harness
- "Make it know our codebase" — use RAG, MCP, or
@codebaseagents instead - Prototype phase — prompt until metrics plateau
Hybrid pattern
System prompt (stable) + RAG chunks (fresh) + fine-tuned model (tone/format)
Evaluate each layer separately — LLM Observability.
Production concerns
| Concern | RAG | Fine-tuning |
|---|---|---|
| Cost | Index storage + embedding calls | Training run + hosted model |
| Latency | Retrieval step added | Usually lower at inference |
| Failure modes | Bad retrieval → wrong answer | Overfit → brittle outputs |
| Compliance | Control what is indexed | Control training data provenance |