RAG for Codebases

RAG (retrieval-augmented generation) feeds relevant snippets into the model at query time so that answers are grounded in your docs and code rather than in stale training data.

Last reviewed: June 2026

Embedding models and vector DB APIs change often. Verify model IDs and pricing against OpenAI embeddings docs and your provider before production.

Prerequisites

You understand How LLMs Work and token budgets. For IDE-only retrieval, start with Context Engineering before building custom pipelines.

When You Need RAG

Situation	RAG helps?
Large monorepo; model misses obscure modules	Yes
Internal APIs not on the public web	Yes
Small repo; @Files works fine	Probably not
Bleeding-edge public docs	@Web or MCP may suffice
User-facing product chat over your KB	Yes (production RAG pipeline)

Build vs Buy

Approach	Pros	Cons
IDE indexing (Cursor, etc.)	Zero setup; respects .gitignore	Vendor-specific; limited customization
MCP doc server	You control sources; works across clients	You maintain the server
Custom vector DB	Full control; multi-tenant products	Infra cost; chunking tuning
Hosted (Pinecone, etc.)	Managed scaling	Cost; data residency questions

For personal coding: start with IDE indexing + MCP. Build custom RAG when shipping a product feature.

Embedding Pipeline (Custom RAG)

flowchart TD
    docs[Docs and code files] --> chunk[Chunk text]
    chunk --> embed[Embedding API]
    embed --> store[Vector store]
    query[User question] --> qembed[Embed query]
    qembed --> search[Similarity search]
    store --> search
    search --> prompt[Inject into prompt]
    prompt --> llm[LLM response]

Primary references: OpenAI embeddings guide, OpenAI API for Web Developers, Anthropic docs on retrieval, Pinecone documentation.
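The similarity search step in the diagram can be sketched in plain TypeScript, with no SDK. This is a minimal illustration, assuming embeddings are plain number arrays; the names cosine, topKIds, and StoredChunk are hypothetical.

```typescript
// Sketch of the "similarity search" step, assuming embeddings are number arrays.
// All names here are illustrative, not part of any SDK.

type StoredChunk = { id: string; embedding: number[] };

/** Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros. */
export function cosine(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

/** Score every stored chunk against the query vector and keep the ids of the best topK. */
export function topKIds(query: number[], chunks: StoredChunk[], topK: number): string[] {
  return chunks
    .map((c) => ({ id: c.id, score: cosine(query, c.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map((r) => r.id);
}
```

A linear scan like this is fine for a few thousand chunks. A vector store replaces the scan with an index once the corpus grows.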

Minimal RAG Example (In-Memory)

This runnable pattern mirrors the diagram above using the Vercel AI SDK and OpenAI embeddings. For production, swap the in-memory store for Pinecone, pgvector, or another vector DB.

// lib/rag.ts — index docs at startup or on webhook
import { embed, embedMany, cosineSimilarity } from "ai";
import { openai } from "@ai-sdk/openai";

const embeddingModel = openai.embedding("text-embedding-3-small");

type Chunk = { id: string; text: string; source: string; embedding: number[] };

const store: Chunk[] = [];

export async function indexDocuments(docs: { id: string; text: string; source: string }[]) {
  const { embeddings } = await embedMany({
    model: embeddingModel,
    values: docs.map((d) => d.text),
  });
  docs.forEach((doc, i) => {
    store.push({ ...doc, embedding: embeddings[i] });
  });
}

export async function retrieve(query: string, topK = 5): Promise<string[]> {
  const { embedding } = await embed({ model: embeddingModel, value: query });
  const ranked = store
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(embedding, chunk.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
  return ranked.map(({ chunk }) => `[${chunk.source}]\n${chunk.text}`);
}

Use retrieved chunks in your chat route:

// app/api/chat/route.ts
import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";
import { retrieve } from "@/lib/rag";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const lastUser = messages.at(-1)?.content ?? "";
  const context = (await retrieve(lastUser)).join("\n\n---\n\n");

  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    messages,
    system: `Answer using only the context below. If the answer is not in context, say so.\n\n${context}`,
  });

  return result.toDataStreamResponse();
}

See Streaming Chat Tutorial for a full Next.js walkthrough and LLM APIs for streaming patterns.
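The chat route above joins every retrieved chunk into the system prompt, whatever its size. One way to respect a token budget is to pack chunks in rank order until the budget is spent. The sketch below assumes roughly 4 characters per token, which is a crude heuristic rather than a real tokenizer; estimateTokens and packContext are hypothetical helper names.

```typescript
/** Rough token estimate: about 4 characters per token (an assumption, not a tokenizer). */
export function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

/**
 * Keep chunks in ranked order until the next one would exceed maxTokens.
 * The separator between chunks is counted against the budget as well.
 */
export function packContext(
  rankedChunks: string[],
  maxTokens: number,
  separator = "\n\n---\n\n"
): string {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of rankedChunks) {
    const separatorCost = kept.length > 0 ? estimateTokens(separator) : 0;
    const cost = estimateTokens(chunk) + separatorCost;
    if (used + cost > maxTokens) break; // stop at the first chunk that does not fit
    kept.push(chunk);
    used += cost;
  }
  return kept.join(separator);
}
```

Stopping at the first chunk that does not fit keeps the highest-ranked material and avoids skipping ahead to lower-ranked chunks just because they are short.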

Chunking Walkthrough

Before: one 4,000-token markdown file embedded as a single chunk. Every hit spends the whole file's worth of context, and one embedding averaged over unrelated topics dilutes retrieval.

After: split on ## headings with metadata:

Chunk	Source	Tokens (approx)
`# Auth\n\nOAuth flow…`	`docs/auth.md`	~400
`## Refresh tokens\n\n…`	`docs/auth.md#refresh`	~350
`## Session storage\n\n…`	`docs/auth.md#session`	~420

For source code, chunk on function boundaries and prefix each chunk with the file path:

src/lib/billing.ts :: function calculateProration(...)
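A minimal sketch of that idea for TypeScript or JavaScript sources follows. It is a line-based heuristic, not a parser: it only recognizes unindented function and class declarations, so arrow functions assigned to constants and nested declarations stay inside the enclosing chunk. The names chunkSourceByDeclaration and CodeChunk are hypothetical; a production pipeline would more likely use a real parser.

```typescript
type CodeChunk = { header: string; text: string };

// Matches unindented `function` / `class` declarations, optionally exported, default, or async.
const DECLARATION =
  /^(?:export\s+)?(?:default\s+)?(?:async\s+)?(function|class)\s+([A-Za-z_$][\w$]*)/;

export function chunkSourceByDeclaration(filePath: string, source: string): CodeChunk[] {
  const chunks: CodeChunk[] = [];
  let header = `${filePath} :: preamble`; // imports and constants before the first declaration
  let buffer: string[] = [];

  // Emit the buffered lines as one chunk, prefixed with the current header.
  const flush = () => {
    const body = buffer.join("\n").trim();
    if (body) chunks.push({ header, text: `${header}\n${body}` });
    buffer = [];
  };

  for (const line of source.split("\n")) {
    const match = DECLARATION.exec(line);
    if (match) {
      flush(); // close the previous chunk before starting a new one
      header = `${filePath} :: ${match[1]} ${match[2]}`;
    }
    buffer.push(line);
  }
  flush();
  return chunks;
}
```

Because the header is part of the embedded text, two functions with the same name in different files produce different vectors and can be cited separately.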

Bad vs good chunking

Bad: Fixed 512-character splits mid-sentence — retrieval returns fragments without meaning.

Good: Semantic splits (headings, functions) with file_path and section_title metadata — users and models can cite sources.
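The heading-based split can be sketched as below. This is an illustrative helper, not a library API: chunkMarkdownByHeading and DocChunk are hypothetical names, only # and ## headings start a new chunk, and deeper headings stay with their parent section. Lines inside fenced code blocks are never treated as headings.

```typescript
type DocChunk = { file_path: string; section_title: string; text: string };

const FENCE = "`".repeat(3); // markdown code-fence marker

export function chunkMarkdownByHeading(filePath: string, markdown: string): DocChunk[] {
  const chunks: DocChunk[] = [];
  let title = "(intro)"; // used for any text before the first heading
  let buffer: string[] = [];
  let inFence = false;

  // Emit the buffered lines as one chunk carrying its metadata.
  const flush = () => {
    const text = buffer.join("\n").trim();
    if (text) chunks.push({ file_path: filePath, section_title: title, text });
    buffer = [];
  };

  for (const line of markdown.split("\n")) {
    if (line.trimStart().startsWith(FENCE)) inFence = !inFence;
    // Only level-1 and level-2 headings outside code fences open a new chunk.
    const heading = inFence ? null : /^(#{1,2})\s+(.+?)\s*$/.exec(line);
    if (heading) {
      flush();
      title = heading[2];
    }
    buffer.push(line);
  }
  flush();
  return chunks;
}
```

If a section still exceeds your token target after this pass, split it again on paragraph boundaries and keep the same section_title on each piece.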

Chunking Strategies

Content type	Chunk approach
Markdown docs	Split on headings; ~500–1000 tokens
Source code	Split on functions/classes; include file path in metadata
API references	One endpoint per chunk
Logs / tickets	Split on timestamp or entry

Include metadata: file_path, section_title, last_updated.

Evaluating Retrieval Quality

Bad RAG is worse than no RAG — irrelevant chunks confuse the model.

Spot-check set

Build 10–20 questions with known answers in your docs. For each question:

Run retrieval with topK = 5
Ask: does any chunk contain the answer verbatim or by paraphrase?
Record pass/fail

Target ≥80% recall@5 on your spot-check set before shipping.
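A small scorer for the spot-check set might look like the sketch below. It assumes you have already run retrieval and stored the returned chunks next to each question, and it uses a case-insensitive substring match, so it only detects verbatim answers; paraphrased answers still need a manual pass. SpotCheck and recallAtK are hypothetical names.

```typescript
type SpotCheck = {
  question: string;
  expected: string;     // phrase that a correct chunk must contain
  retrieved: string[];  // chunks returned by your retriever, best first
};

/** Fraction of questions whose top-k chunks contain the expected phrase. */
export function recallAtK(cases: SpotCheck[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) =>
    c.retrieved
      .slice(0, k)
      .some((chunk) => chunk.toLowerCase().includes(c.expected.toLowerCase()))
  ).length;
  return hits / cases.length;
}
```

With 20 questions, recallAtK(cases, 5) >= 0.8 means at least 16 of them had the answer somewhere in the top five chunks.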

Failure modes

Symptom	Likely cause	Fix
Model says "I don't know" but doc exists	Chunk too large or poorly split	Re-chunk; add headings
Wrong module cited	Ambiguous names in code	Add path metadata to every chunk
Stale answers	Index not updated	Webhook re-index on merge to main
Confident wrong answer	Irrelevant chunk ranked high	Drop low-similarity chunks (score threshold or lower `topK`); add hybrid search (keyword + vector)
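Hybrid search needs a way to merge a keyword ranking with a vector ranking. One common choice is reciprocal rank fusion (RRF), sketched below as a swapped-in technique that the table itself does not prescribe. Each list contributes 1 / (k + rank) per chunk id; k = 60 is a conventional default, and reciprocalRankFusion is a hypothetical helper name.

```typescript
/**
 * Merge several ranked lists of chunk ids into one ranking.
 * Each list contributes 1 / (k + rank) to an id's score, with rank starting at 1.
 */
export function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}
```

RRF only looks at rank positions, so it works even though keyword scores and cosine similarities are on different scales.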

Production monitoring

Log: query, top chunk IDs, similarity scores, user thumbs-down. Review weekly. Exclude archived paths at index time.
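One possible shape for those log records, with a helper for the weekly review, is sketched here. The field names, the weeklySummary function, and the 0.3 low-score cutoff are all assumptions for illustration; pick a cutoff from your own score distribution.

```typescript
type RetrievalLog = {
  query: string;
  chunkIds: string[];   // ids of the chunks injected into the prompt, best first
  scores: number[];     // similarity scores aligned with chunkIds
  thumbsDown: boolean;  // user feedback on the answer
};

type WeeklySummary = { total: number; thumbsDownRate: number; weakQueries: string[] };

/** Summarize a batch of logs: feedback rate plus queries whose best chunk scored poorly. */
export function weeklySummary(logs: RetrievalLog[], lowScore = 0.3): WeeklySummary {
  const thumbsDown = logs.filter((l) => l.thumbsDown).length;
  const weakQueries = logs
    .filter((l) => l.scores.length === 0 || l.scores[0] < lowScore)
    .map((l) => l.query);
  return {
    total: logs.length,
    thumbsDownRate: logs.length === 0 ? 0 : thumbsDown / logs.length,
    weakQueries,
  };
}
```

Queries that land in weakQueries are good candidates for the spot-check set, since they point at content that is missing or badly chunked.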

Operational checklist:

Evaluate retrieval — do top-5 chunks contain the answer?
Refresh index on doc changes (webhook on merge to main)
Filter stale — exclude archived docs
Cite sources in UI so users verify

pgvector (Production Vector Store)

For most Next.js apps with Postgres, pgvector is the lowest-friction production upgrade from in-memory RAG — no extra infra, same database you already have.

npm install @vercel/postgres drizzle-orm drizzle-kit
# Enable pgvector in your Postgres instance (Neon, Supabase, RDS, etc.)
# CREATE EXTENSION IF NOT EXISTS vector;

Schema (Drizzle)

// lib/db/schema.ts
import { pgTable, text, varchar, vector, index } from "drizzle-orm/pg-core";

export const docChunks = pgTable(
  "doc_chunks",
  {
    id: varchar("id", { length: 128 }).primaryKey(),
    source: text("source").notNull(),       // file path or URL
    sectionTitle: text("section_title"),
    content: text("content").notNull(),
    embedding: vector("embedding", { dimensions: 1536 }).notNull(),
  },
  (t) => ({
    // HNSW index — faster queries at slight recall tradeoff
    embeddingIdx: index("doc_chunks_embedding_idx").using(
      "hnsw",
      t.embedding.op("vector_cosine_ops")
    ),
  })
);

Index documents

// lib/rag-pg.ts
import { db } from "@/lib/db";
import { docChunks } from "@/lib/db/schema";
import { embed, embedMany } from "ai"; // embed is used by retrieve() below
import { openai } from "@ai-sdk/openai";
import { sql } from "drizzle-orm";

const embeddingModel = openai.embedding("text-embedding-3-small");

export async function indexDocuments(
  docs: { id: string; source: string; sectionTitle?: string; content: string }[]
) {
  const { embeddings } = await embedMany({
    model: embeddingModel,
    values: docs.map((d) => d.content),
  });

  const rows = docs.map((doc, i) => ({ ...doc, embedding: embeddings[i] }));

  // Upsert so re-runs don't duplicate
  await db
    .insert(docChunks)
    .values(rows)
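    .onConflictDoUpdate({
      target: docChunks.id,
      // Refresh every mutable column so moved files and renamed sections do not go stale
      set: {
        source: sql`excluded.source`,
        sectionTitle: sql`excluded.section_title`,
        content: sql`excluded.content`,
        embedding: sql`excluded.embedding`,
      },
    });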
}

Retrieve similar chunks

// lib/rag-pg.ts (continued) — reuses db, sql, and embeddingModel from above
export async function retrieve(query: string, topK = 5) {
  const { embedding } = await embed({ model: embeddingModel, value: query });

  // Cosine similarity search via pgvector operator
  const results = await db.execute(sql`
    SELECT id, source, section_title, content,
           1 - (embedding <=> ${JSON.stringify(embedding)}::vector) AS score
    FROM doc_chunks
    ORDER BY embedding <=> ${JSON.stringify(embedding)}::vector
    LIMIT ${topK}
  `);

  return results.rows.map((r) => `[${r.source}]\n${r.content}`);
}

Webhook re-index on deploy

Trigger re-indexing when docs change — a Next.js route handler that a CI step or GitHub webhook can call:

// app/api/reindex/route.ts
import { NextRequest } from "next/server";
import { indexDocuments } from "@/lib/rag-pg";
import { loadAndChunkDocs } from "@/lib/docs-loader";

export async function POST(req: NextRequest) {
  // Verify shared secret to prevent unauthorized re-indexing
  const secret = req.headers.get("x-reindex-secret");
  if (secret !== process.env.REINDEX_SECRET) {
    return new Response("Unauthorized", { status: 401 });
  }

  const docs = await loadAndChunkDocs(); // load markdown from /content or a CMS
  await indexDocuments(docs);

  return Response.json({ indexed: docs.length });
}

Call it from your CI/CD pipeline after docs deploy:

# .github/workflows/deploy.yml — after docs build step
curl -X POST https://yourapp.com/api/reindex \
  -H "x-reindex-secret: $REINDEX_SECRET"

pgvector docs: pgvector on GitHub · Neon pgvector guide · Supabase Vector.

MCP as Lightweight RAG

An MCP server can expose search_internal_docs(query) that runs your search backend — simpler than full vector infra for small teams. See Building MCP Servers and MCP Server Tutorial.
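The search backend behind such a tool can be very small. The sketch below is a dependency-free keyword scorer that a search_internal_docs handler could call; wiring it into an MCP server is left out, and the Doc and SearchHit shapes, the tokenize helper, and the 200-character snippet length are assumptions for illustration.

```typescript
type Doc = { source: string; text: string };
type SearchHit = { source: string; score: number; snippet: string };

// Lowercase and split into alphanumeric/underscore tokens.
const tokenize = (s: string): string[] => s.toLowerCase().match(/[a-z0-9_]+/g) ?? [];

/** Score each doc by how many of its tokens appear in the query; return the best matches. */
export function searchInternalDocs(query: string, docs: Doc[], limit = 5): SearchHit[] {
  const terms = new Set(tokenize(query));
  return docs
    .map((doc) => ({
      source: doc.source,
      score: tokenize(doc.text).filter((t) => terms.has(t)).length,
      snippet: doc.text.slice(0, 200),
    }))
    .filter((hit) => hit.score > 0) // drop docs with no matching term
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Plain term counting favors long documents, so it works best when the docs are already chunked to similar sizes.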

For Teams

Concern	Guidance
Data residency	Confirm where embeddings and chunks are stored (region, vendor SOC2)
Secrets in repos	Never index `.env`, credentials, or customer PII — use `.cursorignore` and index allowlists
Access control	Filter retrieval by user role; never return chunks from docs the requester cannot read
Approved pipelines	Document which vector DB and embedding models are allowed in Team AI Policy
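Two of the rows above translate directly into code: a path check at index time and a role filter at query time. The sketch below is illustrative only. The denylist patterns, the allowedRoles field, and both function names are hypothetical and should be adapted to your repo layout and permission model.

```typescript
type IndexedChunk = { source: string; text: string; allowedRoles: string[] };

// Hypothetical denylist: .env files, PEM keys, and anything under a secret(s)/ directory.
const DENY_PATTERNS: RegExp[] = [/(^|\/)\.env(\.|$)/, /\.pem$/, /(^|\/)secrets?\//i];

/** Index-time check: returns false for paths that must never be embedded. */
export function isIndexable(path: string): boolean {
  return !DENY_PATTERNS.some((pattern) => pattern.test(path));
}

/** Query-time check: keep only chunks that at least one of the user's roles may read. */
export function filterByRole(chunks: IndexedChunk[], userRoles: string[]): IndexedChunk[] {
  return chunks.filter((chunk) => chunk.allowedRoles.some((role) => userRoles.includes(role)));
}
```

Apply filterByRole before the chunks reach the prompt. Filtering the model's answer afterwards is too late, because the restricted text has already been sent to the model.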
