Mack Grissom


RAG Explained: How to Make AI Actually Useful With Your Data

2 min read
AI, RAG, Development, Tutorial

The biggest complaint about AI in business? "It makes stuff up." And yeah, that's a real problem. But it's a solvable one. Most implementations fail because they use an LLM without grounding it in real data. RAG (Retrieval-Augmented Generation) fixes this, and it's easier to build than most people think.

The Problem With Vanilla LLMs

Ask ChatGPT about your company's return policy and it doesn't know. It'll either refuse to answer or confidently make one up. This is why naive AI integrations feel useless. The model simply doesn't have your data.

How RAG Works

The concept is straightforward:

  1. Chunk your data: Break your documents, knowledge base, or database into small, meaningful pieces
  2. Create embeddings: Convert each chunk into a vector (basically a mathematical fingerprint of its meaning)
  3. Store in a vector database: Index these embeddings for fast retrieval
  4. At query time: Convert the user's question into a vector, find the most relevant chunks, and include them in the LLM prompt
  5. Generate a grounded response: The LLM answers from your actual data instead of relying only on what it picked up in training
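To make steps 2 through 4 concrete, here's a minimal sketch in plain Python. Big caveat: `embed` is a toy bag-of-words counter standing in for a real embedding model, the "vector database" is just a list, and the sample chunks are made up for illustration. The function names are mine, not from any library. The shape of the flow is the point, not the quality of the search.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Steps 2-3: embed every chunk once and keep it next to its text."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 3) -> list[str]:
    """Step 4: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Hypothetical knowledge-base chunks, invented for this example.
chunks = [
    "Returns are accepted within 30 days of delivery.",
    "Standard shipping takes 3 to 5 business days.",
    "Support is available by email on weekdays.",
]
index = build_index(chunks)
top = retrieve(index, "How many days do I have for returns?", k=1)
print(top)
```

In a real build you'd swap `embed` for a call to your embedding provider and `build_index`/`retrieve` for your vector store's insert and similarity query. The control flow stays the same.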

A Real-World Setup

Here's the simplified architecture I use for client projects:

User Question
  > Generate embedding (OpenAI text-embedding-3-small)
  > Vector similarity search (Supabase pgvector)
  > Retrieve top 5 relevant chunks
  > Construct prompt: System instructions + Retrieved context + User question
  > Send to Claude for response generation
  > Stream response to frontend
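The "construct prompt" step is where a lot of the grounding actually happens, so here's a rough sketch of it. This is an assumption about one reasonable layout, not the exact template from my projects: `build_prompt` is a hypothetical helper, and the numbered `[1]`, `[2]` labels are just a convention that makes it easier to ask the model to cite which chunk it used. It returns the system and user text separately so you can hand them to whatever chat API you're calling.

```python
def build_prompt(system_instructions: str, chunks: list[str], question: str) -> dict[str, str]:
    """Combine system instructions, retrieved context, and the user question."""
    # Number each chunk so the model (and you) can refer back to a source.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return {"system": system_instructions, "user": user}


prompt = build_prompt(
    "Answer only from the context. If the answer is not there, say you don't know.",
    ["Returns are accepted within 30 days of delivery."],  # made-up sample chunk
    "What is the return policy?",
)
print(prompt["user"])
```

Telling the model to admit when the context doesn't contain the answer is the cheap part of the fix for "it makes stuff up". Retrieval does the rest.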

Key Decisions

Chunk Size

Too small and you lose context. Too large and you dilute relevance. I typically use 500-1000 tokens with 100-token overlap between chunks.
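Here's a minimal sketch of fixed-size chunking with overlap. It assumes you already have a list of tokens from somewhere (a real tokenizer, or whitespace-split words as a rough approximation), and `chunk_tokens` is a name I made up for the example. It deliberately ignores document structure, so treat it as a starting point rather than a finished chunker.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 100) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# 1,200 placeholder tokens standing in for a real document.
tokens = [f"t{i}" for i in range(1200)]
pieces = chunk_tokens(tokens, size=500, overlap=100)
print([len(p) for p in pieces])
```

With these numbers each window starts 400 tokens after the previous one, so the last 100 tokens of one chunk are repeated as the first 100 of the next. That repetition is what keeps a sentence near a boundary from losing its surrounding context.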

Embedding Model

OpenAI's text-embedding-3-small is the sweet spot for most use cases. Fast, cheap, accurate enough. Only upgrade to the large model if search quality is make-or-break.

Vector Database

For most projects, Supabase pgvector is my go-to. Free to start, runs alongside your existing Postgres data, zero additional infrastructure. Consider Pinecone or Weaviate if you're operating at larger scale.

Retrieval Count

Start with 3-5 chunks. More context helps accuracy but increases token costs and can confuse the model if chunks contradict each other. Test and measure.
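One way to keep the cost side of that tradeoff under control is to cap retrieval by both count and size. The sketch below is an illustration under my own assumptions: `select_chunks` is a hypothetical helper, the input is a list of `(text, score)` pairs already sorted best-first, and word count is used as a crude stand-in for real token counts.

```python
def select_chunks(ranked: list[tuple[str, float]], k: int = 5, max_tokens: int = 3000) -> tuple[list[str], int]:
    """Take up to k top-ranked chunks, stopping early if the token budget would be exceeded."""
    selected: list[str] = []
    used = 0
    for text, _score in ranked[:k]:
        cost = len(text.split())  # rough token estimate: one token per word
        if used + cost > max_tokens:
            break
        selected.append(text)
        used += cost
    return selected, used


# Made-up search results, best match first.
ranked = [("a b c", 0.9), ("d e", 0.8), ("f g h i", 0.7), ("j", 0.6)]
chosen, used = select_chunks(ranked, k=3, max_tokens=6)
print(chosen, used)
```

Both knobs are worth including in whatever you test and measure: try a few values of `k` against the same question set and watch accuracy and cost together.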

Common Pitfalls

  • Stale data: Set up a pipeline to re-index when your source data changes
  • Bad chunking: Don't split mid-sentence. Respect document structure like headings and paragraphs.
  • No evaluation: Build a test set of questions with known-good answers and measure retrieval accuracy
  • Ignoring metadata: Filter by category, date, or user permissions before doing similarity search
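For the evaluation point, even a tiny harness beats eyeballing results. Here's a sketch of a hit-rate check: for each test question, did the chunk you know contains the answer show up in the top k? Everything here is assumed for illustration: `hit_rate_at_k` is my own helper name, `retrieve_ids` is whatever function wraps your search and returns chunk IDs best-first, and the test set and fake results are invented.

```python
from typing import Callable


def hit_rate_at_k(
    test_set: list[tuple[str, str]],
    retrieve_ids: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose known-good chunk ID appears in the top k results."""
    if not test_set:
        return 0.0
    hits = sum(1 for question, expected_id in test_set if expected_id in retrieve_ids(question)[:k])
    return hits / len(test_set)


# Invented retrieval results standing in for a real vector search.
fake_results = {
    "return policy?": ["returns-1", "shipping-2"],
    "shipping time?": ["shipping-2", "returns-1"],
    "support hours?": ["returns-1", "shipping-2", "support-3"],
    "warranty?": ["shipping-2", "returns-1"],
}
test_set = [
    ("return policy?", "returns-1"),
    ("shipping time?", "shipping-2"),
    ("support hours?", "support-3"),
    ("warranty?", "warranty-4"),
]
score = hit_rate_at_k(test_set, lambda q: fake_results[q], k=3)
print(score)
```

Rerun the same test set whenever you change chunk size, embedding model, or retrieval count. If the number drops, you know which change did it.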

The Result

A well-built RAG system turns a generic AI chatbot into a domain expert that actually knows your business. Customers get accurate answers, support teams handle fewer tickets, and your AI feature goes from "interesting demo" to something people rely on daily.