Playbooks · AI Concepts

What is RAG?

A simple explanation of Retrieval-Augmented Generation: the technique that gives AI access to your own data, so it stops making things up.

📖

Think of it like an open-book exam

Imagine you're taking an exam. In a closed-book exam, you rely only on what you memorized; sometimes you're right, sometimes you confidently write something wrong. That's how a regular LLM works: it only knows what it learned during training.

Now imagine an open-book exam. Before answering, you quickly flip to the right page in your textbook, read the relevant paragraph, and then write your answer. That's RAG: it lets the AI look things up before answering, so the response is grounded in real, up-to-date information.

So what exactly is RAG?

RAG stands for Retrieval-Augmented Generation. It's a technique that combines two things:

1. Retrieval: Before the AI generates an answer, it first searches through a knowledge base (your documents, a database, a website) to find the most relevant pieces of information.

2. Generation: The AI then uses those retrieved pieces as context to write a response that's actually grounded in your data, not just its training memory.

The magic is in the combination: the AI still writes fluently like a language model, but now it has receipts (real sources to back up what it says).

How it works, step by step


1
💬

You ask a question

"What's our refund policy for digital products?"

Your question is first turned into an embedding: a list of numbers (a vector) that captures the meaning of what you asked, not just the exact words.

Think of it like translating your sentence into coordinates on a map of meaning. Questions about similar topics end up close together on this map, even if the words are completely different.

Example: "refund policy" and "money-back guarantee" are far apart as text, but their embeddings are almost identical, because they mean the same thing.

This embedding is generated by a specialized model (like OpenAI's text-embedding-3-small, or open-source alternatives like all-MiniLM-L6-v2). It typically produces a vector of 384–1536 numbers.
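
The "map of meaning" idea can be sketched in plain Python. The vectors below are hand-made toy stand-ins for real embeddings (a real model produces hundreds of dimensions), but the comparison they illustrate, cosine similarity, is exactly what production systems use:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means
    'pointing the same way', i.e. similar meaning."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made 4-dimensional toy vectors standing in for real embeddings.
# An embedding model would assign nearby vectors to phrases with similar meaning.
embeddings = {
    "refund policy":        [0.90, 0.10, 0.00, 0.20],
    "money-back guarantee": [0.85, 0.15, 0.05, 0.25],
    "shipping times":       [0.10, 0.90, 0.30, 0.00],
}

print(cosine_similarity(embeddings["refund policy"], embeddings["money-back guarantee"]))  # ~0.99
print(cosine_similarity(embeddings["refund policy"], embeddings["shipping times"]))        # ~0.20
```

The two phrasings of "give me my money back" score near 1.0 even though they share no words, while the off-topic phrase scores low.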

2
🔍

Retriever searches your knowledge base

Finds the 3–5 most relevant chunks from your documents

This is where vector search happens. Your documents were already split into small chunks (paragraphs or sections) and each chunk was turned into an embedding, stored in a vector database like Pinecone, Weaviate, Chroma, or pgvector.

Now the system compares your question's embedding to every chunk's embedding using cosine similarity, essentially a measure of how closely two vectors point in the same direction.

Analogy: Imagine a library where every book has a GPS coordinate based on its topic. Instead of searching by title, you say "find books near coordinate X" and instantly get the most relevant ones, even if they use different words.

The top-k results (usually 3–5 chunks) are returned, ranked by relevance. This is sometimes called semantic search because it matches by meaning, not keywords.
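
The retrieval step itself is short to sketch. Assuming the chunks are already embedded (here as toy 3-number vectors), top-k search is just "score every chunk against the query, sort, keep the best k"; real vector databases do the same thing with faster index structures:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy "vector database": each chunk stored alongside its precomputed embedding.
chunk_store = [
    ("Refund requests must be submitted within 30 days of purchase.", [0.90, 0.10, 0.10]),
    ("Digital products (e-books, courses) are non-refundable.",       [0.80, 0.20, 0.10]),
    ("Our office is closed on public holidays.",                      [0.10, 0.10, 0.90]),
]

def retrieve(query_embedding, store, k=2):
    """Rank every chunk by cosine similarity to the query; return the top-k chunk texts."""
    ranked = sorted(store, key=lambda item: cosine(query_embedding, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

query_embedding = [0.85, 0.15, 0.10]  # toy embedding of "What's your refund policy?"
for chunk in retrieve(query_embedding, chunk_store):
    print(chunk)  # the two refund-related chunks; the holidays chunk is filtered out
```

A linear scan like this works fine for a few thousand chunks; at larger scale, vector databases swap the `sorted` call for an approximate nearest-neighbor index.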

3
📄

Context is assembled into a prompt

Retrieved text is combined with your question and instructions

The system builds a prompt that looks something like this:

System: Answer the user's question using ONLY the context below. If the answer isn't in the context, say you don't know.

Context:
[Chunk 1] "Refund requests must be submitted within 30 days of purchase..."
[Chunk 2] "Digital products (e-books, courses) are non-refundable..."
[Chunk 3] "To request a refund, email support@company.com..."

User: What's your refund policy for digital products?

This is called prompt augmentation, the "A" in RAG. The retrieved chunks give the LLM a focused "cheat sheet" so it doesn't have to rely on its training data. The system prompt also tells it to stay grounded in what it was given.
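
Assembling that prompt is plain string formatting. A minimal sketch (the exact wording of the instructions varies from system to system):

```python
def build_prompt(question, chunks):
    """Combine the grounding instructions, the retrieved chunks,
    and the user's question into one augmented prompt."""
    context = "\n".join(f'[Chunk {i}] "{text}"' for i, text in enumerate(chunks, start=1))
    return (
        "System: Answer the user's question using ONLY the context below. "
        "If the answer isn't in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"User: {question}"
    )

prompt = build_prompt(
    "What's your refund policy for digital products?",
    [
        "Refund requests must be submitted within 30 days of purchase...",
        "Digital products (e-books, courses) are non-refundable...",
    ],
)
print(prompt)
```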

4
🤖

The LLM generates an answer

Reads the context and writes a coherent, grounded response

The language model (GPT-4, Claude, Gemini, Llama, etc.) processes the full prompt (your question + the retrieved context + the system instructions) and generates a response token by token.

Because the relevant information is right there in the prompt, the model is far less likely to hallucinate. It's essentially doing reading comprehension rather than recall from memory.

Key insight: The LLM itself doesn't "know" it's part of a RAG pipeline. It just sees a well-structured prompt with relevant context. The magic is in what you feed it, not in a special mode.

Some advanced setups also ask the LLM to cite its sources, e.g. "[Source: Refund Policy, Section 3.2]", so users can verify the answer.

5
✅

You get an accurate, sourced answer

Grounded in your actual data, not a guess

The final response is delivered to the user, often with source references so they can click through and verify. In production systems, this step can also include:

• Guardrails: Checking the answer doesn't contain harmful or off-topic content.

• Confidence scoring: Flagging low-confidence answers for human review.

• Feedback loops: Tracking which answers users found helpful to improve retrieval over time.

Result: "Digital products are non-refundable according to our policy. For physical items, refunds can be requested within 30 days by emailing support@company.com. [Source: Refund Policy, updated Jan 2026]"
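
A confidence check like the one above can be as simple as a threshold on the retriever's best similarity score. The 0.75 cutoff below is an arbitrary illustration; real systems tune it against feedback data:

```python
def needs_human_review(top_similarity, answer, threshold=0.75):
    """Flag an answer for review when retrieval confidence was low,
    or when the model said the context didn't contain the answer."""
    if top_similarity < threshold:
        return True
    if "don't know" in answer.lower():
        return True
    return False

print(needs_human_review(0.93, "Digital products are non-refundable."))  # False: confident, grounded
print(needs_human_review(0.41, "Refunds take 60 days."))                 # True: weak retrieval match
```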

Why does this matter?

Regular LLMs have a big problem: they can hallucinate, confidently making up facts that sound true but aren't. RAG dramatically reduces this by anchoring every answer in real documents.

โŒ Without RAG

"Your company offers a 60-day money-back guarantee on all products."

Sounds confident but made up: your actual policy is 30 days, digital products excluded.

✅ With RAG

"According to your refund policy (last updated Jan 2026), customers can request a full refund within 30 days of purchase. Digital products are non-refundable."

Accurate, sourced from the actual document, with the date included.

The 3 key ingredients

Every RAG system has three main building blocks:

📚

Knowledge base

Your documents, PDFs, wikis, databases โ€” anything the AI should know about. These get split into small chunks and stored.

🧲

Retriever

A search engine that finds the most relevant chunks for each question. Usually powered by embeddings (a way to compare meaning, not just keywords).

🤖

Generator (LLM)

The language model (like GPT, Claude, or Gemini) that reads the retrieved context and writes a coherent, natural answer.
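Wired together, the three ingredients form a pipeline of only a few lines. In this sketch, `generate` is a stub standing in for a real LLM call (in production it would be an API request); the knowledge base and retriever reuse the toy embeddings shown earlier:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Ingredient 1: knowledge base, already chunked and embedded (toy vectors).
knowledge_base = [
    ("Refund requests must be submitted within 30 days of purchase.", [0.9, 0.1, 0.1]),
    ("Digital products (e-books, courses) are non-refundable.",       [0.8, 0.2, 0.1]),
    ("Our office is closed on public holidays.",                      [0.1, 0.1, 0.9]),
]

# Ingredient 2: retriever.
def retrieve(query_embedding, k=2):
    ranked = sorted(knowledge_base, key=lambda c: cosine(query_embedding, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Ingredient 3: generator. A stub here; a real system would call an LLM API.
def generate(prompt):
    return f"(answer grounded in {prompt.count('[Chunk')} retrieved chunks)"

def rag_answer(question, question_embedding):
    chunks = retrieve(question_embedding)
    context = "\n".join(f"[Chunk {i}] {text}" for i, text in enumerate(chunks, start=1))
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)

print(rag_answer("What's your refund policy?", [0.85, 0.15, 0.1]))
# (answer grounded in 2 retrieved chunks)
```

Swapping in a real embedding model and a real LLM changes only the two stubbed pieces; the retrieve-augment-generate shape stays the same.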

When should you use RAG?

RAG is especially useful when:

๐Ÿข

Internal knowledge

You want AI to answer questions about your company's docs, policies, or data.

📅

Up-to-date info

The information changes often and the LLM's training data is outdated.

🎯

Accuracy matters

You can't afford hallucinations โ€” medical, legal, financial, or customer-facing use cases.

💰

Cost-effective

Fine-tuning a model is expensive. RAG lets you add knowledge without retraining.

💡

Quick recap

RAG = Search first, then answer. Instead of relying on memory alone, the AI looks up relevant information from your data before generating a response. It reduces hallucinations, keeps answers current, and works with any knowledge base you give it. Think of it as giving your AI a reference library instead of asking it to guess.

