RAG for Codebases

RAG (retrieval-augmented generation) retrieves relevant snippets and feeds them into the model at query time, so answers are grounded in your docs and code rather than stale training data.

Last reviewed: June 2026

Embedding models and vector DB APIs change often. Verify model IDs and pricing against OpenAI embeddings docs and your provider before production.

Prerequisites

You understand How LLMs Work and token budgets. For IDE-only retrieval, start with Context Engineering before building custom pipelines.

When You Need RAG

| Situation | RAG helps? |
| --- | --- |
| Large monorepo; model misses obscure modules | Yes |
| Internal APIs not on the public web | Yes |
| Small repo; @Files works fine | Probably not |
| Bleeding-edge public docs | @Web or MCP may suffice |
| User-facing product chat over your KB | Yes (production RAG pipeline) |

Build vs Buy

| Approach | Pros | Cons |
| --- | --- | --- |
| IDE indexing (Cursor, etc.) | Zero setup; follows git ignore | Vendor-specific; limited customization |
| MCP doc server | You control sources; works across clients | You maintain the server |
| Custom vector DB | Full control; multi-tenant products | Infra cost; chunking tuning |
| Hosted (Pinecone, etc.) | Managed scaling | Cost; data residency questions |

For personal coding: start with IDE indexing + MCP. Build custom RAG when shipping a product feature.

Embedding Pipeline (Custom RAG)

flowchart TD
    docs[Docs and code files] --> chunk[Chunk text]
    chunk --> embed[Embedding API]
    embed --> store[Vector store]
    query[User question] --> qembed[Embed query]
    qembed --> search[Similarity search]
    store --> search
    search --> prompt[Inject into prompt]
    prompt --> llm[LLM response]

Primary references: OpenAI embeddings guide, OpenAI API for Web Developers, Anthropic docs on retrieval, Pinecone documentation.

Minimal RAG Example (In-Memory)

This runnable pattern mirrors the diagram above using the Vercel AI SDK and OpenAI embeddings. For production, swap the in-memory store for Pinecone, pgvector, or another vector DB.

// lib/rag.ts — index docs at startup or on webhook
import { embed, embedMany, cosineSimilarity } from "ai";
import { openai } from "@ai-sdk/openai";

const embeddingModel = openai.embedding("text-embedding-3-small");

type Chunk = { id: string; text: string; source: string; embedding: number[] };

const store: Chunk[] = [];

export async function indexDocuments(docs: { id: string; text: string; source: string }[]) {
  const { embeddings } = await embedMany({
    model: embeddingModel,
    values: docs.map((d) => d.text),
  });
  docs.forEach((doc, i) => {
    store.push({ ...doc, embedding: embeddings[i] });
  });
}

export async function retrieve(query: string, topK = 5): Promise<string[]> {
  const { embedding } = await embed({ model: embeddingModel, value: query });
  const ranked = store
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(embedding, chunk.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
  return ranked.map(({ chunk }) => `[${chunk.source}]\n${chunk.text}`);
}

Use retrieved chunks in your chat route:

// app/api/chat/route.ts
import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";
import { retrieve } from "@/lib/rag";

export async function POST(req: Request) {
  const { messages } = await req.json();
  const lastUser = messages.at(-1)?.content ?? "";
  const context = (await retrieve(lastUser)).join("\n\n---\n\n");

  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    messages,
    system: `Answer using only the context below. If the answer is not in context, say so.\n\n${context}`,
  });

  return result.toDataStreamResponse();
}

See Streaming Chat Tutorial for a full Next.js walkthrough and LLM APIs for streaming patterns.

Chunking Walkthrough

Before: one 4,000-token markdown file pasted whole — blows context and dilutes retrieval.

After: split on ## headings with metadata:

| Chunk | Source | Tokens (approx) |
| --- | --- | --- |
| `# Auth\n\nOAuth flow…` | docs/auth.md | ~400 |
| `## Refresh tokens\n\n…` | docs/auth.md#refresh | ~350 |
| `## Session storage\n\n…` | docs/auth.md#session | ~420 |
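The heading-based split in the table above could be sketched roughly as follows. This is a minimal illustration, not a production chunker: the `DocChunk` shape, `slugify`, and `chunkMarkdownByHeading` are hypothetical names, and the anchors it produces (for example `#refresh-tokens`) are derived from the heading text, so they will not necessarily match the shortened anchors shown in the table.

```typescript
// Hypothetical chunk shape; field names are illustrative.
type DocChunk = { id: string; source: string; sectionTitle: string; text: string };

// Turn a heading into a URL-style anchor, e.g. "Refresh tokens" -> "refresh-tokens".
function slugify(title: string): string {
  return title
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Split markdown into one chunk per "#" or "##" heading.
// Caveat: headings inside fenced code blocks are not skipped in this sketch.
export function chunkMarkdownByHeading(filePath: string, markdown: string): DocChunk[] {
  const chunks: DocChunk[] = [];
  let title = "";
  let level = 0;
  let buffer: string[] = [];

  const flush = (): void => {
    const text = buffer.join("\n").trim();
    if (text.length === 0) return; // skip empty sections
    // Top-level sections cite the file; "##" sections cite file#anchor.
    const source = level >= 2 ? `${filePath}#${slugify(title)}` : filePath;
    // Positional id: simple, but it shifts if sections are reordered.
    chunks.push({ id: `${filePath}::${chunks.length}`, source, sectionTitle: title, text });
  };

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,2})\s+(.*)$/.exec(line);
    if (match) {
      flush(); // close the previous section
      level = match[1].length;
      title = match[2].trim();
      buffer = [line]; // keep the heading inside the chunk text
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

Keeping the heading line inside each chunk gives the embedding some topical context even when the body text is terse.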

For source code, chunk on function boundaries and prefix each chunk with the file path:

src/lib/billing.ts :: function calculateProration(...)

Bad vs good chunking

Bad: Fixed 512-character splits mid-sentence — retrieval returns fragments without meaning.

Good: Semantic splits (headings, functions) with file_path and section_title metadata — users and models can cite sources.

Chunking Strategies

| Content type | Chunk approach |
| --- | --- |
| Markdown docs | Split on headings; ~500–1000 tokens |
| Source code | Split on functions/classes; include file path in metadata |
| API references | One endpoint per chunk |
| Logs / tickets | Split on timestamp or entry |

Include metadata: file_path, section_title, last_updated.
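For source code, one rough way to get function-level chunks with this metadata is a line-based scan for top-level declarations. The sketch below is an assumption-heavy illustration: `CodeChunk` and `chunkSourceByDeclaration` are hypothetical names, the regex only recognizes unindented `function` and `class` declarations (not arrow functions assigned to constants), and anything before the first declaration, such as imports, is dropped. A real pipeline would more likely use a parser.

```typescript
// Hypothetical chunk shape carrying the metadata fields named above.
type CodeChunk = {
  file_path: string;
  section_title: string; // declaration name
  last_updated: string; // ISO date supplied by the caller (e.g. from git)
  text: string; // "path :: signature" prefix plus the declaration body
};

// Matches unindented declarations such as "export async function foo" or "class Bar".
const DECLARATION = /^(?:export\s+)?(?:default\s+)?(?:async\s+)?(function|class)\s+([A-Za-z_$][\w$]*)/;

export function chunkSourceByDeclaration(
  filePath: string,
  source: string,
  lastUpdated: string
): CodeChunk[] {
  const lines = source.split("\n");

  // First pass: record where each top-level declaration starts.
  const starts: { index: number; kind: string; name: string }[] = [];
  lines.forEach((line, index) => {
    const match = DECLARATION.exec(line);
    if (match) starts.push({ index, kind: match[1], name: match[2] });
  });

  // Second pass: each chunk runs from its declaration to the next one (or end of file).
  return starts.map((start, i) => {
    const end = i + 1 < starts.length ? starts[i + 1].index : lines.length;
    const body = lines.slice(start.index, end).join("\n").trimEnd();
    const signature =
      start.kind === "function" ? `function ${start.name}(...)` : `class ${start.name}`;
    return {
      file_path: filePath,
      section_title: start.name,
      last_updated: lastUpdated,
      text: `${filePath} :: ${signature}\n${body}`,
    };
  });
}
```

Prefixing the path into the chunk text, and not only into metadata, means the path is embedded too, which may help when two modules define similarly named functions.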

Evaluating Retrieval Quality

Bad RAG is worse than no RAG — irrelevant chunks confuse the model.

Spot-check set

Build 10–20 questions with known answers in your docs. For each question:

  1. Run retrieval with topK = 5
  2. Ask: does any chunk contain the answer verbatim or by paraphrase?
  3. Record pass/fail

Target ≥80% recall@5 on your spot-check set before shipping.
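The spot-check loop above can be automated along these lines. This is a sketch under stated assumptions: `SpotCheck`, `Retriever`, and `recallAtK` are hypothetical names, the retriever is any function with the same shape as a `retrieve(query, topK)` helper, and the hit test is a case-insensitive substring match, so it only catches verbatim answers. Paraphrased answers still need a manual pass.

```typescript
// Hypothetical types for a spot-check set.
type SpotCheck = { question: string; expected: string }; // expected = phrase that must appear
type Retriever = (query: string, topK: number) => Promise<string[]>;

export async function recallAtK(
  cases: SpotCheck[],
  retrieve: Retriever,
  k = 5
): Promise<{ recall: number; failures: string[] }> {
  const failures: string[] = [];

  for (const spotCheck of cases) {
    const chunks = await retrieve(spotCheck.question, k);
    const needle = spotCheck.expected.toLowerCase();
    // Pass if any of the top-k chunks contains the expected phrase verbatim.
    const hit = chunks.some((chunk) => chunk.toLowerCase().includes(needle));
    if (!hit) failures.push(spotCheck.question);
  }

  const recall = cases.length === 0 ? 0 : (cases.length - failures.length) / cases.length;
  return { recall, failures };
}
```

Returning the failed questions alongside the score makes it easier to see which chunks need re-splitting.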

Failure modes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Model says "I don't know" but doc exists | Chunk too large or poorly split | Re-chunk; add headings |
| Wrong module cited | Ambiguous names in code | Add path metadata to every chunk |
| Stale answers | Index not updated | Webhook re-index on merge to main |
| Confident wrong answer | Irrelevant chunk ranked high | Filter top-K results by a minimum similarity score; add hybrid search (keyword + vector) |
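The source does not prescribe how to combine keyword and vector results. One common option is reciprocal rank fusion (RRF), sketched below together with a simple score cutoff. `reciprocalRankFusion` and `filterByScore` are hypothetical helper names, `k = 60` is a conventional RRF default, and the `0.3` cutoff is a placeholder to tune against your own spot-check set.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per id, rank starting at 1.
// Ids that rank well in several lists float to the top.
export function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + index + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([id]) => id);
}

// Drop weak vector matches so they never reach the prompt.
// minScore is an assumed placeholder, not a recommended value.
export function filterByScore<T extends { score: number }>(hits: T[], minScore = 0.3): T[] {
  return hits.filter((hit) => hit.score >= minScore);
}
```

A typical use would be to pass the chunk ids from a vector search and a keyword search as two rankings, then load the fused top results for the prompt.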

Production monitoring

Log: query, top chunk IDs, similarity scores, user thumbs-down. Review weekly. Exclude archived paths at index time.
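A log record with the fields listed above might look like the following sketch. The `RetrievalLog` shape and both helper names are hypothetical, and the `0.3` review threshold is an assumed placeholder. Where the records are stored (database, log drain) is left to your stack.

```typescript
// Hypothetical log record; field names are illustrative.
type RetrievalLog = {
  timestamp: string; // ISO 8601
  query: string;
  chunkIds: string[]; // top chunk ids, best first
  scores: number[]; // similarity scores aligned with chunkIds
  thumbsDown: boolean;
};

export function buildRetrievalLog(
  query: string,
  hits: { id: string; score: number }[],
  thumbsDown = false,
  now: Date = new Date()
): RetrievalLog {
  return {
    timestamp: now.toISOString(),
    query,
    chunkIds: hits.map((hit) => hit.id),
    scores: hits.map((hit) => hit.score),
    thumbsDown,
  };
}

// Weekly review helper: surface queries that were downvoted
// or whose best similarity score fell below the threshold.
export function flagForReview(logs: RetrievalLog[], minTopScore = 0.3): RetrievalLog[] {
  return logs.filter((log) => {
    const topScore = log.scores.length > 0 ? Math.max(...log.scores) : 0;
    return log.thumbsDown || topScore < minTopScore;
  });
}
```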

Operational checklist:

  • Evaluate retrieval — do top-5 chunks contain the answer?
  • Refresh index on doc changes (webhook on merge to main)
  • Filter stale — exclude archived docs
  • Cite sources in UI so users verify
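Excluding archived docs at index time can be as simple as a path filter applied before embedding. The sketch below assumes archived material lives under `archive/` or `deprecated/` directories. Those path conventions, the `SourceDoc` shape, and `excludeArchived` are all hypothetical, so adjust the patterns to your repo layout.

```typescript
// Hypothetical input shape: whatever your loader produces before embedding.
type SourceDoc = { source: string; text: string };

// Assumed path conventions for archived material; adjust to your repo.
const ARCHIVED_PATTERNS: RegExp[] = [/(^|\/)archive\//, /(^|\/)deprecated\//];

// Keep only docs whose path does not match any archived pattern.
export function excludeArchived(docs: SourceDoc[]): SourceDoc[] {
  return docs.filter((doc) => !ARCHIVED_PATTERNS.some((pattern) => pattern.test(doc.source)));
}
```

Filtering before the embedding call also avoids paying to embed content that should never be retrieved.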

pgvector (Production Vector Store)

For most Next.js apps with Postgres, pgvector is the lowest-friction production upgrade from in-memory RAG — no extra infra, same database you already have.

npm install @vercel/postgres drizzle-orm drizzle-kit
# Enable pgvector in your Postgres instance (Neon, Supabase, RDS, etc.)
# CREATE EXTENSION IF NOT EXISTS vector;

Schema (Drizzle)

// lib/db/schema.ts
import { pgTable, text, varchar, vector, index } from "drizzle-orm/pg-core";

export const docChunks = pgTable(
  "doc_chunks",
  {
    id: varchar("id", { length: 128 }).primaryKey(),
    source: text("source").notNull(),       // file path or URL
    sectionTitle: text("section_title"),
    content: text("content").notNull(),
    embedding: vector("embedding", { dimensions: 1536 }).notNull(),
  },
  (t) => ({
    // HNSW index — faster queries at slight recall tradeoff
    embeddingIdx: index("doc_chunks_embedding_idx").using(
      "hnsw",
      t.embedding.op("vector_cosine_ops")
    ),
  })
);

Index documents

// lib/rag-pg.ts
import { db } from "@/lib/db";
import { docChunks } from "@/lib/db/schema";
import { embed, embedMany } from "ai"; // embed is needed by retrieve() below
import { openai } from "@ai-sdk/openai";
import { sql } from "drizzle-orm";

const embeddingModel = openai.embedding("text-embedding-3-small");

export async function indexDocuments(
  docs: { id: string; source: string; sectionTitle?: string; content: string }[]
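) {
  // Nothing to embed or insert; an insert with an empty values array would fail
  if (docs.length === 0) return;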
  const { embeddings } = await embedMany({
    model: embeddingModel,
    values: docs.map((d) => d.content),
  });

  const rows = docs.map((doc, i) => ({ ...doc, embedding: embeddings[i] }));

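  // Upsert so re-runs don't duplicate; refresh every mutable column,
  // including source and section title, so moved or renamed sections stay current
  await db
    .insert(docChunks)
    .values(rows)
    .onConflictDoUpdate({
      target: docChunks.id,
      set: {
        source: sql`excluded.source`,
        sectionTitle: sql`excluded.section_title`,
        content: sql`excluded.content`,
        embedding: sql`excluded.embedding`,
      },
    });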
}

Retrieve similar chunks

export async function retrieve(query: string, topK = 5) {
  const { embedding } = await embed({ model: embeddingModel, value: query });

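  // `<=>` is pgvector's cosine distance operator; 1 - distance gives cosine similarity.
  // Ordering by distance ascending returns the most similar chunks first.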
  const results = await db.execute(sql`
    SELECT id, source, section_title, content,
           1 - (embedding <=> ${JSON.stringify(embedding)}::vector) AS score
    FROM doc_chunks
    ORDER BY embedding <=> ${JSON.stringify(embedding)}::vector
    LIMIT ${topK}
  `);

  return results.rows.map((r) => `[${r.source}]\n${r.content}`);
}

Webhook re-index on deploy

Trigger re-indexing when docs change — a Next.js route handler that a CI step or GitHub webhook can call:

// app/api/reindex/route.ts
import { NextRequest } from "next/server";
import { indexDocuments } from "@/lib/rag-pg";
import { loadAndChunkDocs } from "@/lib/docs-loader";

export async function POST(req: NextRequest) {
  // Verify shared secret to prevent unauthorized re-indexing
  const secret = req.headers.get("x-reindex-secret");
  if (secret !== process.env.REINDEX_SECRET) {
    return new Response("Unauthorized", { status: 401 });
  }

  const docs = await loadAndChunkDocs(); // load markdown from /content or a CMS
  await indexDocuments(docs);

  return Response.json({ indexed: docs.length });
}
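The plain `!==` comparison in the handler works, but string comparison can leak timing information. A hedged alternative, assuming the route runs on the Node.js runtime (not Edge), is to hash both values and compare them with `timingSafeEqual` from `node:crypto`. The helper name `secretsMatch` is hypothetical.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

// Constant-time comparison of a provided secret against the expected one.
// Hashing first gives both buffers the same length, which timingSafeEqual requires.
export function secretsMatch(provided: string | null, expected: string | undefined): boolean {
  if (!provided || !expected) return false; // missing header or unset env var
  const providedHash = createHash("sha256").update(provided).digest();
  const expectedHash = createHash("sha256").update(expected).digest();
  return timingSafeEqual(providedHash, expectedHash);
}
```

In the handler, the check would then read `if (!secretsMatch(secret, process.env.REINDEX_SECRET))`.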

Call it from your CI/CD pipeline after docs deploy:

# .github/workflows/deploy.yml — after docs build step
curl -X POST https://yourapp.com/api/reindex \
  -H "x-reindex-secret: $REINDEX_SECRET"

pgvector docs: pgvector on GitHub · Neon pgvector guide · Supabase Vector.

MCP as Lightweight RAG

An MCP server can expose search_internal_docs(query) that runs your search backend — simpler than full vector infra for small teams. See Building MCP Servers and MCP Server Tutorial.
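As a rough idea of what the search backend behind such a tool might do, here is a keyword-only sketch with no vector store. Everything here is an assumption for illustration: the `Doc` shape, the scoring (count of query terms found), and the `ToolResult` shape, which loosely follows the text-content results MCP tools return. Registering the function with an MCP SDK is not shown; check the SDK docs for the exact call.

```typescript
// Hypothetical document shape for a small in-memory corpus.
type Doc = { source: string; text: string };
// Loosely modeled on MCP tool results: a list of text content items.
type ToolResult = { content: { type: "text"; text: string }[] };

export function searchInternalDocs(query: string, docs: Doc[], limit = 3): ToolResult {
  // Tokenize the query and ignore very short terms such as "a" or "of".
  const terms = query
    .toLowerCase()
    .split(/\W+/)
    .filter((term) => term.length > 2);

  const ranked = docs
    .map((doc) => {
      const haystack = doc.text.toLowerCase();
      // Score = number of distinct query terms that appear in the doc.
      const score = terms.filter((term) => haystack.includes(term)).length;
      return { doc, score };
    })
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);

  if (ranked.length === 0) {
    return { content: [{ type: "text", text: "No matching docs found." }] };
  }
  return {
    content: ranked.map(({ doc }) => ({
      type: "text" as const,
      text: `[${doc.source}]\n${doc.text}`,
    })),
  };
}
```

For a few hundred documents this kind of matching may be enough; the vector pipeline described earlier becomes worthwhile once paraphrased queries start missing.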

For Teams

| Concern | Guidance |
| --- | --- |
| Data residency | Confirm where embeddings and chunks are stored (region, vendor SOC2) |
| Secrets in repos | Never index .env, credentials, or customer PII; use .cursorignore and index allowlists |
| Access control | Filter retrieval by user role; do not embed docs the requester cannot read |
| Approved pipelines | Document which vector DB and embedding models are allowed in Team AI Policy |
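Two of these rows translate fairly directly into code: an index allowlist with a secrets denylist, and a role filter applied to retrieved chunks. The sketch below is illustrative only. The allowed prefixes, denied patterns, `allowedRoles` field, and both function names are assumptions, and pattern matching on paths will not catch secrets or PII embedded inside otherwise ordinary files.

```typescript
// Assumed allowlist: only these path prefixes are ever indexed.
const ALLOWED_PREFIXES = ["docs/", "src/"];
// Assumed denylist, applied even inside allowed prefixes.
const DENIED_PATTERNS: RegExp[] = [/(^|\/)\.env(\.|$)/, /credentials/i, /\.(pem|key)$/i];

export function isIndexablePath(path: string): boolean {
  const allowed = ALLOWED_PREFIXES.some((prefix) => path.startsWith(prefix));
  const denied = DENIED_PATTERNS.some((pattern) => pattern.test(path));
  return allowed && !denied;
}

// Hypothetical chunk shape with an access-control list stored as metadata.
type SecuredChunk = { id: string; text: string; allowedRoles: string[] };

// Keep only chunks the requesting user may read.
// Apply this before chunks are injected into the prompt.
export function filterByRole(chunks: SecuredChunk[], userRoles: string[]): SecuredChunk[] {
  return chunks.filter((chunk) => chunk.allowedRoles.some((role) => userRoles.includes(role)));
}
```

With a database-backed store, the role check is usually better expressed as a WHERE clause in the similarity query, so restricted chunks never leave the database.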