RAG for Codebases
RAG (retrieval-augmented generation) feeds relevant snippets into the model at query time so answers are grounded in your docs and code, not in stale training data.
Last reviewed: June 2026
Embedding models and vector DB APIs change often. Verify model IDs and pricing against the OpenAI embeddings docs and your provider's documentation before going to production.
Prerequisites
You understand How LLMs Work and token budgets. For IDE-only retrieval, start with Context Engineering before building custom pipelines.
When You Need RAG
| Situation | RAG helps? |
|---|---|
| Large monorepo; model misses obscure modules | Yes |
| Internal APIs not on the public web | Yes |
| Small repo; @Files works fine | Probably not |
| Bleeding-edge public docs | @Web or MCP may suffice |
| User-facing product chat over your KB | Yes (production RAG pipeline) |
Build vs Buy
| Approach | Pros | Cons |
|---|---|---|
| IDE indexing (Cursor, etc.) | Zero setup; respects .gitignore | Vendor-specific; limited customization |
| MCP doc server | You control sources; works across clients | You maintain the server |
| Custom vector DB | Full control; multi-tenant products | Infra cost; chunking tuning |
| Hosted (Pinecone, etc.) | Managed scaling | Cost; data residency questions |
For personal coding: start with IDE indexing + MCP. Build custom RAG when shipping a product feature.
Embedding Pipeline (Custom RAG)
flowchart TD
docs[Docs and code files] --> chunk[Chunk text]
chunk --> embed[Embedding API]
embed --> store[Vector store]
query[User question] --> qembed[Embed query]
qembed --> search[Similarity search]
store --> search
search --> prompt[Inject into prompt]
prompt --> llm[LLM response]
Primary references: OpenAI embeddings guide, OpenAI API for Web Developers, Anthropic docs on retrieval, Pinecone documentation.
Minimal RAG Example (In-Memory)
This runnable pattern mirrors the diagram above using the Vercel AI SDK and OpenAI embeddings. For production, swap the in-memory store for Pinecone, pgvector, or another vector DB.
// lib/rag.ts — index docs at startup or on webhook
import { embed, embedMany, cosineSimilarity } from "ai";
import { openai } from "@ai-sdk/openai";
const embeddingModel = openai.embedding("text-embedding-3-small");
type Chunk = { id: string; text: string; source: string; embedding: number[] };
const store: Chunk[] = [];
export async function indexDocuments(docs: { id: string; text: string; source: string }[]) {
const { embeddings } = await embedMany({
model: embeddingModel,
values: docs.map((d) => d.text),
});
docs.forEach((doc, i) => {
store.push({ ...doc, embedding: embeddings[i] });
});
}
export async function retrieve(query: string, topK = 5): Promise<string[]> {
const { embedding } = await embed({ model: embeddingModel, value: query });
const ranked = store
.map((chunk) => ({
chunk,
score: cosineSimilarity(embedding, chunk.embedding),
}))
.sort((a, b) => b.score - a.score)
.slice(0, topK);
return ranked.map(({ chunk }) => `[${chunk.source}]\n${chunk.text}`);
}
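The ranking step in `retrieve` depends on cosine similarity. As a rough illustration of what that score measures, here is a hand-rolled version. It is a sketch for intuition, not the AI SDK's implementation, and the name `cosine` is made up for this example.

```typescript
// Hand-rolled cosine similarity: dot(a, b) / (|a| * |b|).
// Illustrative only; the example above uses the SDK's cosineSimilarity helper.
export function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat a zero vector as having no similarity to anything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Scores near 1 mean the query and chunk point in the same direction in embedding space; scores near 0 mean they are unrelated.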
Use retrieved chunks in your chat route:
// app/api/chat/route.ts
import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";
import { retrieve } from "@/lib/rag";
export async function POST(req: Request) {
const { messages } = await req.json();
const lastUser = messages.at(-1)?.content ?? "";
const context = (await retrieve(lastUser)).join("\n\n---\n\n");
const result = streamText({
model: anthropic("claude-sonnet-4-20250514"),
messages,
system: `Answer using only the context below. If the answer is not in context, say so.\n\n${context}`,
});
return result.toDataStreamResponse();
}
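The route above injects every retrieved chunk into the system prompt. If chunks are long, it may help to cap the injected context at a token budget. The sketch below assumes roughly 4 characters per token, which is a crude estimate rather than a real tokenizer; `fitToBudget` and `estimateTokens` are hypothetical helper names.

```typescript
// Rough token estimate: ~4 characters per token for English text.
// This is an assumption, not a tokenizer; swap in a real one for accurate budgets.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Keep chunks in ranked order until the next one would exceed the budget.
export function fitToBudget(chunks: string[], maxTokens: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = estimateTokens(chunk);
    if (used + cost > maxTokens) break;
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

In the chat route this could wrap the retrieval call, for example `fitToBudget(await retrieve(lastUser), 2000)`, before joining the chunks.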
See Streaming Chat Tutorial for a full Next.js walkthrough and LLM APIs for streaming patterns.
Chunking Walkthrough
Before: one 4,000-token markdown file embedded as a single chunk. It eats context budget on every hit, and one embedding spread across many topics dilutes retrieval.
After: split on ## headings with metadata:
| Chunk | Source | Tokens (approx) |
|---|---|---|
| # Auth\n\nOAuth flow… | docs/auth.md | ~400 |
| ## Refresh tokens\n\n… | docs/auth.md#refresh | ~350 |
| ## Session storage\n\n… | docs/auth.md#session | ~420 |
For source code, chunk on function boundaries and prefix each chunk with the file path:
src/lib/billing.ts :: function calculateProration(...)
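One way to approximate function-boundary chunking is a line-based splitter. This is a deliberately naive sketch: it assumes top-level `function` declarations start at column 0 and ignores classes, arrow functions, and nested declarations. A real pipeline would likely use a parser. `chunkByFunction` and `CodeChunk` are hypothetical names.

```typescript
type CodeChunk = { filePath: string; symbol: string; text: string };

// Naive splitter: starts a new chunk at each top-level `function` declaration.
// Assumes declarations start at column 0; a real pipeline would use a parser.
export function chunkByFunction(filePath: string, source: string): CodeChunk[] {
  const declaration = /^(?:export\s+)?(?:async\s+)?function\s+([A-Za-z_$][\w$]*)/;
  const chunks: CodeChunk[] = [];
  let current: { symbol: string; lines: string[] } | null = null;

  const flush = () => {
    if (!current) return;
    const body = current.lines.join("\n").trimEnd();
    // Prefix with the file path so the embedding carries location context.
    chunks.push({
      filePath,
      symbol: current.symbol,
      text: `${filePath} :: function ${current.symbol}\n${body}`,
    });
  };

  for (const line of source.split("\n")) {
    const match = declaration.exec(line);
    if (match) {
      flush();
      current = { symbol: match[1], lines: [line] };
    } else if (current) {
      current.lines.push(line);
    }
    // Lines before the first function (imports, constants) are skipped here.
  }
  flush();
  return chunks;
}
```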
Bad vs good chunking
Bad: Fixed 512-character splits mid-sentence — retrieval returns fragments without meaning.
Good: Semantic splits (headings, functions) with file_path and section_title metadata — users and models can cite sources.
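A heading-based split with metadata might look like the sketch below. It handles only `#` and `##` headings and does not skip `#` lines inside fenced code blocks, so treat it as a starting point. `chunkByHeading` and `DocChunk` are hypothetical names; the field names follow the metadata keys mentioned above.

```typescript
type DocChunk = { file_path: string; section_title: string; text: string };

// Split markdown at `#` and `##` headings; each section keeps its heading line.
// Simplification: does not skip `#` lines inside fenced code blocks.
export function chunkByHeading(filePath: string, markdown: string): DocChunk[] {
  const chunks: DocChunk[] = [];
  let title = "(preamble)";
  let lines: string[] = [];

  const flush = () => {
    const text = lines.join("\n").trim();
    if (text) chunks.push({ file_path: filePath, section_title: title, text });
  };

  for (const line of markdown.split("\n")) {
    const heading = /^#{1,2}\s+(.+)$/.exec(line);
    if (heading) {
      flush(); // close the previous section before starting a new one
      title = heading[1].trim();
      lines = [line];
    } else {
      lines.push(line);
    }
  }
  flush();
  return chunks;
}
```

Because every chunk carries `file_path` and `section_title`, the UI can show a citation next to each answer.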
Chunking Strategies
| Content type | Chunk approach |
|---|---|
| Markdown docs | Split on headings; ~500–1000 tokens |
| Source code | Split on functions/classes; include file path in metadata |
| API references | One endpoint per chunk |
| Logs / tickets | Split on timestamp or entry |
Include metadata: file_path, section_title, last_updated.
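Heading-based sections can still exceed the target size. A possible follow-up pass packs paragraphs into pieces under a token limit. The sketch assumes about 4 characters per token (an estimate, not a tokenizer) and keeps an oversized single paragraph whole rather than cutting it mid-sentence. `splitOversized` is a hypothetical helper.

```typescript
// Rough estimate (~4 characters per token); an assumption, not a tokenizer.
const approxTokens = (text: string): number => Math.ceil(text.length / 4);

// Pack paragraphs into pieces of at most maxTokens (estimated).
// A single paragraph larger than the limit is kept whole rather than cut mid-sentence.
export function splitOversized(section: string, maxTokens = 1000): string[] {
  if (approxTokens(section) <= maxTokens) return [section];
  const pieces: string[] = [];
  let current = "";
  for (const paragraph of section.split(/\n{2,}/)) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (current && approxTokens(candidate) > maxTokens) {
      pieces.push(current); // current piece is full; start a new one
      current = paragraph;
    } else {
      current = candidate;
    }
  }
  if (current) pieces.push(current);
  return pieces;
}
```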
Evaluating Retrieval Quality
Bad RAG is worse than no RAG — irrelevant chunks confuse the model.
Spot-check set
Build 10–20 questions with known answers in your docs. For each question:
- Run retrieval with topK = 5
- Ask: does any chunk contain the answer verbatim or by paraphrase?
- Record pass/fail
Target ≥80% recall@5 on your spot-check set before shipping.
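The spot check can be partly automated. The sketch below computes recall@k against any retriever with a `(query, topK)` signature. It only detects verbatim (case-insensitive) matches, so paraphrased answers still need a manual look. `recallAtK`, `SpotCheck`, and `Retriever` are hypothetical names.

```typescript
type SpotCheck = { question: string; expected: string };
type Retriever = (query: string, topK: number) => Promise<string[]>;

// Fraction of questions whose expected snippet appears in any of the top-k chunks.
// Substring matching only catches verbatim answers; paraphrases need a manual check.
export async function recallAtK(
  cases: SpotCheck[],
  retrieve: Retriever,
  k = 5
): Promise<{ recall: number; failures: string[] }> {
  const failures: string[] = [];
  for (const { question, expected } of cases) {
    const chunks = await retrieve(question, k);
    const needle = expected.toLowerCase();
    const hit = chunks.some((chunk) => chunk.toLowerCase().includes(needle));
    if (!hit) failures.push(question);
  }
  const recall = cases.length === 0 ? 0 : (cases.length - failures.length) / cases.length;
  return { recall, failures };
}
```

The `failures` list gives the questions to inspect first when recall falls below the target.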
Failure modes
| Symptom | Likely cause | Fix |
|---|---|---|
| Model says "I don't know" but doc exists | Chunk too large or poorly split | Re-chunk; add headings |
| Wrong module cited | Ambiguous names in code | Add path metadata to every chunk |
| Stale answers | Index not updated | Webhook re-index on merge to main |
| Confident wrong answer | Irrelevant chunk ranked high | Drop top-K results below a minimum similarity score; add hybrid search (keyword + vector) |
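For the last row, two small building blocks are sketched here: reciprocal rank fusion (RRF) to merge a keyword ranking with a vector ranking, and a score floor to drop weak vector matches. RRF is one common fusion method, not the only option. The constant `k = 60` and the `0.3` threshold are placeholder assumptions to calibrate on your own data, and both function names are hypothetical.

```typescript
// Reciprocal rank fusion (RRF): each list contributes 1 / (k + rank) per document.
// k = 60 is a commonly used default; treat it as a tunable assumption.
export function fuseRankings(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Drop weak vector matches before they reach the prompt.
// The 0.3 threshold is a placeholder; calibrate it on your spot-check set.
export function dropWeakMatches<T extends { score: number }>(hits: T[], minScore = 0.3): T[] {
  return hits.filter((hit) => hit.score >= minScore);
}
```

Pass the chunk IDs from the keyword search and the vector search as two ranked lists; documents that rank well in both rise to the top.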
Production monitoring
Log: query, top chunk IDs, similarity scores, user thumbs-down. Review weekly. Exclude archived paths at index time.
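A log record for those fields, plus a filter for the weekly review, could be as small as the sketch below. The record shape and the `0.3` score floor are assumptions for illustration; `RetrievalLog` and `flagForReview` are hypothetical names.

```typescript
type RetrievalLog = {
  query: string;
  chunkIds: string[]; // top chunk IDs, best first
  scores: number[]; // similarity scores aligned with chunkIds
  thumbsDown: boolean;
};

// Pick entries worth a weekly look: explicit thumbs-down, or a weak best match.
// The 0.3 floor is a placeholder assumption; tune it against your own data.
export function flagForReview(logs: RetrievalLog[], minTopScore = 0.3): RetrievalLog[] {
  return logs.filter((log) => {
    const topScore = log.scores.length > 0 ? log.scores[0] : 0;
    return log.thumbsDown || topScore < minTopScore;
  });
}
```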
Operational checklist:
- Evaluate retrieval — do top-5 chunks contain the answer?
- Refresh index on doc changes (webhook on merge to main)
- Filter stale — exclude archived docs
- Cite sources in UI so users verify
pgvector (Production Vector Store)
For most Next.js apps with Postgres, pgvector is the lowest-friction production upgrade from in-memory RAG — no extra infra, same database you already have.
npm install @vercel/postgres drizzle-orm drizzle-kit
# Enable pgvector in your Postgres instance (Neon, Supabase, RDS, etc.)
# CREATE EXTENSION IF NOT EXISTS vector;
Schema (Drizzle)
// lib/db/schema.ts
import { pgTable, text, varchar, vector, index } from "drizzle-orm/pg-core";
export const docChunks = pgTable(
"doc_chunks",
{
id: varchar("id", { length: 128 }).primaryKey(),
source: text("source").notNull(), // file path or URL
sectionTitle: text("section_title"),
content: text("content").notNull(),
embedding: vector("embedding", { dimensions: 1536 }).notNull(),
},
(t) => ({
// HNSW index — faster queries at slight recall tradeoff
embeddingIdx: index("doc_chunks_embedding_idx").using(
"hnsw",
t.embedding.op("vector_cosine_ops")
),
})
);
Index documents
// lib/rag-pg.ts
import { db } from "@/lib/db";
import { docChunks } from "@/lib/db/schema";
import { embed, embedMany } from "ai"; // embed is used by retrieve() below
import { openai } from "@ai-sdk/openai";
import { sql } from "drizzle-orm";
const embeddingModel = openai.embedding("text-embedding-3-small");
export async function indexDocuments(
docs: { id: string; source: string; sectionTitle?: string; content: string }[]
) {
const { embeddings } = await embedMany({
model: embeddingModel,
values: docs.map((d) => d.content),
});
const rows = docs.map((doc, i) => ({ ...doc, embedding: embeddings[i] }));
// Upsert so re-runs don't duplicate
await db
.insert(docChunks)
.values(rows)
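.onConflictDoUpdate({
target: docChunks.id,
// Refresh every mutable column so moved files and renamed sections stay in sync
set: {
source: sql`excluded.source`,
sectionTitle: sql`excluded.section_title`,
content: sql`excluded.content`,
embedding: sql`excluded.embedding`,
},
});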
}
Retrieve similar chunks
export async function retrieve(query: string, topK = 5) {
const { embedding } = await embed({ model: embeddingModel, value: query });
// Cosine similarity search via pgvector operator
const results = await db.execute(sql`
SELECT id, source, section_title, content,
1 - (embedding <=> ${JSON.stringify(embedding)}::vector) AS score
FROM doc_chunks
ORDER BY embedding <=> ${JSON.stringify(embedding)}::vector
LIMIT ${topK}
`);
return results.rows.map((r) => `[${r.source}]\n${r.content}`);
}
Webhook re-index on deploy
Trigger re-indexing when docs change — a Next.js route handler that a CI step or GitHub webhook can call:
// app/api/reindex/route.ts
import { NextRequest } from "next/server";
import { indexDocuments } from "@/lib/rag-pg";
import { loadAndChunkDocs } from "@/lib/docs-loader";
export async function POST(req: NextRequest) {
// Verify shared secret to prevent unauthorized re-indexing
const secret = req.headers.get("x-reindex-secret");
if (secret !== process.env.REINDEX_SECRET) {
return new Response("Unauthorized", { status: 401 });
}
const docs = await loadAndChunkDocs(); // load markdown from /content or a CMS
await indexDocuments(docs);
return Response.json({ indexed: docs.length });
}
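The handler compares the shared secret with `!==`. If timing side channels are a concern, a constant-time comparison is one possible hardening step. This sketch uses Node's `crypto.timingSafeEqual`, hashing both values first so the buffers have equal length. `secretsMatch` is a hypothetical helper, and it assumes the route runs on the Node.js runtime rather than an edge runtime.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Compare secrets in constant time. Hashing first gives both buffers the same
// length, which timingSafeEqual requires.
export function secretsMatch(provided: string | null, expected: string | undefined): boolean {
  if (!provided || !expected) return false; // reject when either side is missing
  const a = createHash("sha256").update(provided).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}
```

In the route handler, the check would then read `if (!secretsMatch(secret, process.env.REINDEX_SECRET))`.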
Call the re-index route from your CI/CD pipeline after the docs deploy:
# .github/workflows/deploy.yml — after docs build step
# --fail makes the step exit non-zero on an HTTP error such as 401
curl --fail -X POST https://yourapp.com/api/reindex \
-H "x-reindex-secret: $REINDEX_SECRET"
pgvector docs: pgvector on GitHub · Neon pgvector guide · Supabase Vector.
MCP as Lightweight RAG
An MCP server can expose search_internal_docs(query) that runs your search backend — simpler than full vector infra for small teams. See Building MCP Servers and MCP Server Tutorial.
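The search backend behind such a tool does not need embeddings to be useful. As a minimal sketch, the function below ranks documents by how many distinct query terms they contain. The MCP server wiring is omitted, and the `Doc` shape, the extra `docs` parameter, and the `limit` default are assumptions for illustration.

```typescript
type Doc = { source: string; text: string };

// Score each doc by how many distinct query terms it contains (case-insensitive).
// Docs that match no term are excluded.
export function searchInternalDocs(query: string, docs: Doc[], limit = 3): Doc[] {
  const terms = Array.from(new Set(query.toLowerCase().split(/\W+/).filter(Boolean)));
  return docs
    .map((doc) => {
      const haystack = doc.text.toLowerCase();
      const score = terms.filter((term) => haystack.includes(term)).length;
      return { doc, score };
    })
    .filter(({ score }) => score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map(({ doc }) => doc);
}
```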
For Teams
| Concern | Guidance |
|---|---|
| Data residency | Confirm where embeddings and chunks are stored (region, vendor SOC2) |
| Secrets in repos | Never index .env, credentials, or customer PII — use .cursorignore and index allowlists |
| Access control | Filter retrieval by user role; do not embed docs the requester cannot read |
| Approved pipelines | Document which vector DB and embedding models are allowed in Team AI Policy |
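The secrets and access-control rows can be enforced at two points: when choosing what to index, and when filtering what a given user may retrieve. The sketch below shows one guard for each. The deny patterns, the `allowedRoles` field, and both function names are hypothetical; adapt them to your repo layout and auth model.

```typescript
type SecuredChunk = { id: string; source: string; allowedRoles: string[]; text: string };

// Hypothetical deny patterns; extend for your repo layout.
const DENY_PATTERNS = [/(^|\/)\.env(\.|$)/, /credentials/i, /\.pem$/];

// Index-time guard: the path must sit under an allowlisted prefix
// and must not match any deny pattern.
export function isIndexable(path: string, allowPrefixes: string[]): boolean {
  const allowed = allowPrefixes.some((prefix) => path.startsWith(prefix));
  const denied = DENY_PATTERNS.some((pattern) => pattern.test(path));
  return allowed && !denied;
}

// Query-time guard: drop chunks the requester's roles cannot read, before ranking.
export function visibleTo(chunks: SecuredChunk[], userRoles: string[]): SecuredChunk[] {
  return chunks.filter((chunk) => chunk.allowedRoles.some((role) => userRoles.includes(role)));
}
```

Filtering before ranking means a restricted chunk never reaches the prompt, even when it is the closest match.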