AI in Web Apps
Using AI to write code is one thing. Shipping AI inside your product is another — keys, latency, streaming, auth, and session memory all land on you.
Last reviewed: June 2026
Patterns below assume Next.js or similar full-stack JS; principles apply to any backend.
Client vs Server
| Run AI on | Good for | Never do |
|---|---|---|
| Server | Production chat, RAG, tool calling, billing | — |
| Client (browser) | Demos, local models via WebGPU | Expose API secret keys |
| Edge | Low-latency streaming close to users | Heavy RAG without caching |
Rule: API keys live on the server only. The browser talks to your API route; your route talks to OpenAI/Anthropic.
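A minimal sketch of enforcing that rule in code follows. It assumes a hypothetical getServerSecret helper, which is not part of Next.js or the AI SDK. The helper refuses names with the NEXT_PUBLIC_ prefix, because Next.js inlines those variables into client bundles. It also refuses to run where a browser `document` global exists. The environment map is passed in so the helper stays easy to test.

```typescript
// Hypothetical helper: read a secret only in server code.
export function getServerSecret(
  name: string,
  env: Record<string, string | undefined>
): string {
  // Next.js inlines NEXT_PUBLIC_* variables into the client bundle
  if (name.startsWith("NEXT_PUBLIC_")) {
    throw new Error(`${name}: secrets must not use the NEXT_PUBLIC_ prefix`);
  }
  // Browsers expose a global document; server runtimes do not
  if (typeof (globalThis as { document?: unknown }).document !== "undefined") {
    throw new Error(`${name} must only be read on the server`);
  }
  const value = env[name];
  if (!value) {
    throw new Error(`${name} is not set`);
  }
  return value;
}
```

On the server you would call `getServerSecret("ANTHROPIC_API_KEY", process.env)`. The @ai-sdk/anthropic provider already reads ANTHROPIC_API_KEY from the environment by default, so a guard like this matters most for keys you handle yourself. Next.js also documents the server-only package, which catches server modules imported into client code at build time.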
Architecture Pattern
```text
Browser → POST /api/chat → Your server → LLM provider
                                ↑
                    auth, rate limit, logging
```
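One way to keep the server-side checks in a fixed order is to run them as a list of guards. Any guard can short-circuit the request before tokens are spent. The Guard type and runGuards function are hypothetical, a sketch rather than a framework API. Each guard resolves to a rejection (status and body), or to null to let the request continue.

```typescript
// A rejection short-circuits the request before the LLM call.
export type Rejection = { status: number; body: string };

// Hypothetical guard type: resolve to a Rejection to stop, or null to continue.
export type Guard<Ctx> = (ctx: Ctx) => Promise<Rejection | null>;

// Run guards in order (e.g. auth, then rate limit, then quota) and
// return the first rejection, or null if every guard passed.
export async function runGuards<Ctx>(
  ctx: Ctx,
  guards: Guard<Ctx>[]
): Promise<Rejection | null> {
  for (const guard of guards) {
    const rejection = await guard(ctx);
    if (rejection) return rejection;
  }
  return null;
}
```

In a route handler, a non-null result would become `new Response(rejection.body, { status: rejection.status })`. Only a null result proceeds to the LLM call.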
Auth: Require a Session Before the LLM Call
Never expose an unauthenticated route that calls an LLM provider. Every request should verify identity before spending tokens.
```ts
// app/api/chat/route.ts
import { auth } from "@/lib/auth";
import { anthropic } from "@ai-sdk/anthropic";
import { streamText } from "ai";

export async function POST(req: Request) {
  // Auth check first: before parsing the body or calling the LLM
  const session = await auth();
  if (!session?.user) {
    return new Response("Unauthorized", { status: 401 });
  }

  const { messages } = await req.json();

  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    messages,
    maxTokens: 1024,
  });

  return result.toDataStreamResponse();
}
```
For paid features, also enforce your billing/quota check before the LLM call — not after.
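Because maxTokens caps the output, you can bound the worst-case cost of a call before making it: an input-token estimate plus maxTokens. Below is a minimal reserve-then-settle sketch. It assumes a hypothetical Quota record and helper names. In production the reservation would be an atomic database update rather than an in-memory mutation.

```typescript
// Hypothetical per-user token quota for one billing period.
export interface Quota {
  limit: number; // tokens allowed in the period
  used: number; // tokens already charged
  reserved: number; // tokens held for requests still in flight
}

// Before the LLM call: hold the worst case (input estimate + maxTokens).
// Returns false, and reserves nothing, if the request could exceed the limit.
export function tryReserve(q: Quota, worstCaseTokens: number): boolean {
  if (q.used + q.reserved + worstCaseTokens > q.limit) return false;
  q.reserved += worstCaseTokens;
  return true;
}

// After the call (e.g. in onFinish): release the hold, charge actual usage.
export function settle(q: Quota, reservedTokens: number, actualTokens: number): void {
  q.reserved -= reservedTokens;
  q.used += actualTokens;
}
```

Take maxTokens: 1024 and an input estimate of 1,000 tokens. You would call `tryReserve(quota, 2024)` before streaming, then settle with the usage the SDK reports once the response finishes. Settle on failures too, so aborted requests do not hold quota indefinitely.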
Session Memory and Multi-Turn Chat
A stateless API route receives the full conversation history on each request. For short sessions this is fine; for long-running assistants you need to manage history size.
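A fixed slice such as slice(-20) counts messages, not turns, so it can cut a user/assistant exchange in half. Below is a minimal sketch of turn-based trimming. It assumes a simplified, hypothetical ChatMessage type with string content. It keeps system messages and always starts the window at a user message.

```typescript
// Simplified message shape for this sketch (string content only).
export type ChatMessage = {
  role: "system" | "user" | "assistant";
  content: string;
};

// Keep system messages plus the last `maxTurns` turns, where a turn
// begins at a user message. The window never starts mid-turn.
export function trimToLastTurns(
  messages: ChatMessage[],
  maxTurns: number
): ChatMessage[] {
  const system = messages.filter((m) => m.role === "system");
  const convo = messages.filter((m) => m.role !== "system");
  if (maxTurns <= 0) return system;

  let turns = 0;
  let start = convo.length; // nothing kept if there is no user message
  for (let i = convo.length - 1; i >= 0; i--) {
    if (convo[i]?.role === "user") {
      turns++;
      start = i;
      if (turns === maxTurns) break;
    }
  }
  return [...system, ...convo.slice(start)];
}
```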
Client-managed history (simple)
The useChat hook from @ai-sdk/react holds history in React state. This is the default pattern: history lives in the browser and resets on page reload.

```tsx
"use client";

import { useChat } from "@ai-sdk/react";

export function Chat() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } =
    useChat({ api: "/api/chat" });

  return (
    <form onSubmit={handleSubmit}>
      {messages.map((m) => (
        <div key={m.id}>
          {m.role}: {m.content}
        </div>
      ))}
      <input value={input} onChange={handleInputChange} disabled={isLoading} />
      <button type="submit" disabled={isLoading}>Send</button>
    </form>
  );
}
```
Server-persisted history (production)
For history that survives page reloads and works across devices, persist messages server-side:
```ts
// lib/chat-store.ts: save and load conversation history
import { db } from "@/lib/db";
import type { CoreMessage } from "ai";

// The caller (an authenticated route) passes in the user's ID
export async function saveMessages(
  sessionId: string,
  userId: string,
  messages: CoreMessage[]
) {
  await db.chatSessions.upsert({
    where: { id: sessionId },
    update: { messages: JSON.stringify(messages), updatedAt: new Date() },
    create: { id: sessionId, userId, messages: JSON.stringify(messages) },
  });
}

// Returns [] for a new session, or null if another user owns it
export async function loadMessages(
  sessionId: string,
  userId: string
): Promise<CoreMessage[] | null> {
  const row = await db.chatSessions.findUnique({ where: { id: sessionId } });
  if (!row) return [];
  if (row.userId !== userId) return null;
  return JSON.parse(row.messages);
}
```

```ts
// app/api/chat/route.ts: load, append, save
import { auth } from "@/lib/auth";
import { anthropic } from "@ai-sdk/anthropic";
import { streamText, type CoreMessage } from "ai";
import { loadMessages, saveMessages } from "@/lib/chat-store";

export async function POST(req: Request) {
  const session = await auth();
  const userId = session?.user?.id;
  if (!userId) return new Response("Unauthorized", { status: 401 });

  const { sessionId, newMessage } = await req.json();
  // Accept only user-role messages from the client
  if (newMessage?.role !== "user") {
    return new Response("Bad request", { status: 400 });
  }

  // Load this user's history, append the new user message
  const history = await loadMessages(sessionId, userId);
  if (history === null) return new Response("Not found", { status: 404 });
  const messages: CoreMessage[] = [...history, newMessage];

  // Send only the last 20 messages (not turns) to control token cost
  const trimmed = messages.slice(-20);

  const result = streamText({
    model: anthropic("claude-sonnet-4-20250514"),
    messages: trimmed,
    maxTokens: 1024,
    onFinish: async ({ text }) => {
      // Persist the full history (not just the window) plus the reply
      await saveMessages(sessionId, userId, [
        ...messages,
        { role: "assistant", content: text },
      ]);
    },
  });

  return result.toDataStreamResponse();
}
```
See Cost, Latency, and Tokens for history truncation patterns.
Streaming UX
Users abandon chat UIs that make them wait 10 seconds for a full response.
- Stream tokens with the Vercel AI SDK's useChat()
- Show a typing indicator until the first token arrives
- Allow cancellation (AbortController)
- Handle errors gracefully: partial responses happen
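```tsx
// In the Chat component: stop() aborts the in-flight request
const { stop, isLoading } = useChat({ api: "/api/chat" });

// Cancel button, rendered inside the component's JSX
{isLoading && (
  <button onClick={() => stop()} type="button">Cancel</button>
)}
```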
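useChat already tracks loading and error state for you. If you build your own streaming client, the bullets above reduce to a small state machine. This reducer is a hypothetical sketch; the names do not come from the AI SDK. It shows a typing indicator only before the first token, and it keeps partial text when a stream is cancelled or fails.

```typescript
// Hypothetical UI state for one streamed reply.
export type StreamState = {
  status: "waiting" | "streaming" | "done" | "aborted" | "error";
  text: string; // everything received so far, kept even on abort or error
  error?: string;
};

export type StreamEvent =
  | { type: "token"; value: string }
  | { type: "finish" }
  | { type: "abort" }
  | { type: "error"; message: string };

export const initialStreamState: StreamState = { status: "waiting", text: "" };

export function streamReducer(state: StreamState, event: StreamEvent): StreamState {
  // Once the stream has ended, ignore late events
  if (state.status !== "waiting" && state.status !== "streaming") return state;
  switch (event.type) {
    case "token":
      return { status: "streaming", text: state.text + event.value };
    case "finish":
      return { ...state, status: "done" };
    case "abort":
      return { ...state, status: "aborted" };
    case "error":
      return { ...state, status: "error", error: event.message };
  }
}

// Show the typing indicator only until the first token arrives
export function showTypingIndicator(state: StreamState): boolean {
  return state.status === "waiting";
}
```

In React this would pair with `useReducer(streamReducer, initialStreamState)`. The abort event would be dispatched from the same handler that calls `abort()` on your AbortController.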
Rate Limiting and Abuse Prevention
Every AI endpoint needs per-user limits:
```ts
// lib/rate-limit.ts: in-memory (use Redis / Upstash in production)
// Maps each key to the timestamps of its recent requests
const hits = new Map<string, number[]>();

export function checkRateLimit(
  key: string,
  max = 20,
  windowMs = 60_000
): boolean {
  const now = Date.now();
  // Keep only timestamps that fall inside the window
  const recent = (hits.get(key) ?? []).filter((t) => now - t < windowMs);
  if (recent.length >= max) return false;
  recent.push(now);
  hits.set(key, recent);
  return true;
}
```
```ts
// In your route handler
const userId = session.user.id;
if (!checkRateLimit(userId)) {
  return new Response("Too many requests", { status: 429 });
}
```
For production, use Upstash Rate Limit or a similar Redis-backed solution that works across serverless instances.
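An in-memory Map fails on serverless because each instance keeps its own counts. A shared store fixes that. Below is a minimal fixed-window counter that assumes a hypothetical CounterStore interface shaped after the Redis INCR and PEXPIRE commands. The in-memory implementation is for tests only. Production libraries typically make the increment and the expiry atomic (for example with a Lua script), which this sketch does not.

```typescript
// Hypothetical interface a shared store such as Redis would back.
export interface CounterStore {
  // Atomically increment key and return the new count
  incr(key: string): Promise<number>;
  // Set a time-to-live on key, in milliseconds
  pexpire(key: string, ttlMs: number): Promise<void>;
}

export async function checkFixedWindow(
  store: CounterStore,
  userId: string,
  max = 20,
  windowMs = 60_000,
  now = Date.now()
): Promise<boolean> {
  // One counter per user per window; the key changes every windowMs
  const windowId = Math.floor(now / windowMs);
  const key = `ratelimit:${userId}:${windowId}`;
  const count = await store.incr(key);
  // First hit in this window: let the key expire with the window
  if (count === 1) await store.pexpire(key, windowMs);
  return count <= max;
}

// In-memory stand-in for tests only; it is not shared across instances
export class MemoryCounterStore implements CounterStore {
  private counts = new Map<string, number>();

  async incr(key: string): Promise<number> {
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    return next;
  }

  async pexpire(_key: string, _ttlMs: number): Promise<void> {
    // Expiry is ignored here; window IDs already change every windowMs
  }
}
```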
Client-Side ML (Browser)
For on-device inference (TensorFlow.js, transformers.js):
- Models are large — lazy-load
- Privacy-friendly (data stays local)
- Quality tradeoff vs cloud LLMs
Use for classification/embeddings in the browser, not as a replacement for server chat in most products.
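Lazy-loading usually means starting the download on first use and reusing the result after that. Below is a minimal sketch built around a hypothetical lazyLoad helper. The load callback would wrap whatever your library uses to fetch model weights, for example creating a transformers.js pipeline. Concurrent callers share one in-flight promise. A failed download is retried on the next call instead of staying cached.

```typescript
// Hypothetical helper: start a large load on first use, then reuse it.
export function lazyLoad<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (!pending) {
      // Concurrent callers share this single in-flight promise
      pending = load().catch((err) => {
        // Clear the cache so the next call retries the download
        pending = null;
        throw err;
      });
    }
    return pending;
  };
}
```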