RAG: Open-Book LLMs

Part 17 of the AI/LLM mastery series — the biggest practical fix for hallucination and stale knowledge. Retrieval-Augmented Generation explained end to end: closed-book vs open-book, Retrieve+Augment+Generate, building the vector-DB index (chunk, embed, store), semantic search to fetch the nearest chunks, grounding the answer with citations, the full pipeline, and where RAG still breaks.

AI/LLM Mastery · Part 17 of 20 — the single biggest fix for hallucination and stale knowledge. Stop asking the model to recall facts from frozen memory; fetch the right documents and hand them to it at question-time. The open-book exam, explained end to end.

The ceiling that prompting cannot break

Part 16 ended on a hard ceiling: no prompt, however clever, can give a model facts it never learned. Its knowledge is frozen at a cutoff (Part 9), it has never seen your private documents, and at the edges of what it knows it confidently makes things up (Part 15). Fine-tuning, we saw in Part 12, is the wrong tool for facts. So how do you get an LLM to answer reliably about current events, or your company’s data? You stop relying on its memory entirely.

The answer is Retrieval-Augmented Generation — RAG — and it is the dominant pattern for serious LLM applications. The idea is simple enough to state in a sentence, and it leans entirely on the embeddings from Part 3. By the end you will understand the whole pipeline, why it slashes hallucination, and exactly where it still goes wrong.

Closed-book vs open-book

Picture the model as a student in an exam.

By default it sits a closed-book exam — answering purely from memory, which is frozen, incomplete, and prone to confident invention. The fix is almost embarrassingly obvious once you see it: turn it into an open-book exam. Fetch the documents relevant to the question and put them right into the prompt, so the model answers from the text in front of it rather than from fuzzy recall. It stops recalling and starts reading. That single shift — from memory to provided context — is the whole of RAG, and it is the most effective practical defence against hallucination there is.

RAG = Retrieve + Augment + Generate

The name spells out the recipe.

For every question, three steps run in order. Retrieve the documents most relevant to the question. Augment the prompt by pasting them in as context. Generate the answer from that provided text — grounded, and able to cite where each fact came from. The clever, non-obvious part is the first step: how do you find the relevant documents out of thousands, when the user’s wording may not match the documents’ wording at all? That is where Part 3 comes back to do the heavy lifting.

Offline: chunk, embed, store

Retrieval needs preparation. Before any questions are asked, you build a searchable index of your knowledge — a one-time offline job.

Take your documents, whether PDFs, a wiki, support tickets or a codebase, and split them into small chunks of a paragraph or a few each. Chunking matters because at answer-time you want to retrieve and insert only the relevant bits, not whole books. Then embed each chunk into a vector using an embedding model (Part 3): its coordinates in meaning-space, where similar text lands close together. Finally, store all those vectors in a vector database, a store built to search across millions of vectors at speed using approximate-nearest-neighbour algorithms. Names you will hear: FAISS (a similarity-search library), pgvector (a PostgreSQL extension), and Pinecone and Weaviate (dedicated vector databases). Crucially, this happens ahead of time rather than per question, and you simply re-run it when your documents change. No model is retrained, ever.
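The offline steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production indexer: `split_into_chunks`, `toy_embed` and `build_index` are hypothetical names, the word-window sizes are arbitrary, and `toy_embed` is a hashed bag-of-words stand-in that does not capture meaning the way a real embedding model does. A plain list stands in for the vector database.

```python
import hashlib
import math
import re


def split_into_chunks(text, max_words=40, overlap=10):
    """Split text into word windows; neighbouring windows share `overlap` words."""
    if max_words <= overlap:
        raise ValueError("max_words must be larger than overlap")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text, dims=64):
    """Stand-in for a real embedding model (assumption): hashed word counts, L2-normalised."""
    vec = [0.0] * dims
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # md5 gives a stable bucket; Python's built-in hash() is salted per process
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def build_index(documents, max_words=40, overlap=10):
    """documents: dict of doc_id -> text. Returns a list acting as the vector store."""
    index = []
    for doc_id, text in documents.items():
        for n, chunk in enumerate(split_into_chunks(text, max_words, overlap)):
            index.append({
                "source": f"{doc_id}#{n}",  # kept so answers can cite the chunk later
                "text": chunk,
                "vector": toy_embed(chunk),
            })
    return index
```

Storing a source label next to each vector is what later makes citations possible. The overlap is one common way to avoid cutting a sentence's context in half at a chunk boundary; whether it helps depends on your documents.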

Online: find the nearest chunks

Now a question arrives, and the online half begins.

First, embed the question into a vector with the same model, placing it on the very same meaning-map as your chunks. Then search the vector database for the chunks whose vectors sit closest to the question’s vector: nearest neighbours in meaning-space, where a smaller distance means greater similarity, exactly as in Part 3. This is semantic search, and its superpower is that it matches by meaning, not exact keywords: ask about a “car” and it happily finds a chunk about “automobiles,” because the two sit close in embedding space even with no shared word. (In practice it is often blended with old-fashioned keyword search, known as “hybrid” retrieval, for the best of both.) Grab the top-k closest chunks, say three to five, and those are your candidate facts.
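To make "nearest neighbours in meaning-space" concrete, here is a small sketch of top-k search by cosine similarity. The three-dimensional vectors are invented by hand for illustration (a real embedding model produces hundreds or thousands of dimensions), and the brute-force loop stands in for the approximate index a vector database would use. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query_vec, index, k=3):
    """Brute-force top-k: score every chunk, sort by similarity, keep the best k."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-made vectors (assumption): the first axis loosely means "vehicles",
# the second "cooking", the third "HR policy".
index = [
    ("Automobiles should be serviced regularly.", [0.9, 0.1, 0.0]),
    ("Knead the dough before baking.", [0.1, 0.9, 0.1]),
    ("Leave requests go through the HR portal.", [0.0, 0.2, 0.9]),
]

car_question_vec = [0.8, 0.2, 0.1]  # pretend embedding of a question about a "car"
top = nearest(car_question_vec, index, k=2)
for score, text in top:
    print(f"{score:.3f}  {text}")
```

The question never uses the word "automobiles", yet that chunk ranks first because its vector points in nearly the same direction as the question's vector.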

Augment and generate, grounded

With the relevant chunks in hand, the last two steps are quick.

Augment the prompt by pasting the retrieved chunks in as context, then ask the question — using the anti-hallucination instruction from Part 16:

the augmented prompt
Use ONLY the context below to answer. If the answer is not in
the context, say "I don't know". Cite the source for each claim.

Context:
  [chunk 1]  ... retrieved passage ...
  [chunk 2]  ... retrieved passage ...
  [chunk 3]  ... retrieved passage ...

Question: How many vacation days do new employees get?

Now the model generates from the supplied passages — answering about your documents, with current information, not its frozen memory. And because every fact traces to a known chunk, the answer can cite its source, so a human can verify it. Hallucination drops sharply: the model is reading, the “say I don’t know” clause lets it abstain when the context is silent, and citations make any remaining errors easy to catch.
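Assembling that prompt is plain string formatting. The sketch below is one possible `build_prompt`, assuming each retrieved chunk arrives as a `(source, text)` pair; the source labels and the sample passages are invented for illustration.

```python
def build_prompt(chunks, question):
    """chunks: list of (source, text) pairs returned by retrieval."""
    lines = [
        "Use ONLY the context below to answer. If the answer is not in",
        'the context, say "I don\'t know". Cite the source for each claim.',
        "",
        "Context:",
    ]
    for source, text in chunks:
        # The label in brackets is what the model is asked to cite
        lines.append(f"  [{source}]  {text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Invented sample chunks, standing in for real retrieval results
retrieved = [
    ("handbook#2", "Placeholder passage about vacation policy."),
    ("handbook#7", "Placeholder passage about public holidays."),
]
prompt = build_prompt(retrieved, "How many vacation days do new employees get?")
print(prompt)
```

Labelling each passage with its source rather than a bare number means a cited answer points straight back to a document a human can open and check.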

The whole RAG system

Here is the whole system in one place.

RAG, end to end
# OFFLINE (once, re-run when docs change)
for doc in documents:
    for chunk in split(doc):
        vector_db.add(embed(chunk), chunk)

# ONLINE (per question)
def answer(question):
    q_vec  = embed(question)
    chunks = vector_db.nearest(q_vec, k=5)     # semantic search
    prompt = build_prompt(chunks, question)    # augment
    return llm(prompt)                         # generate, grounded + cited

Two lanes: an offline lane that builds the index, and an online lane that answers each question by searching it. The payoff is the line you will repeat to every stakeholder: to update what the system knows, you just re-index the documents — no retraining, no fine-tuning. The model’s knowledge becomes something you can edit like a database, which is exactly why RAG took over production AI.
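The "edit it like a database" point can be shown with a tiny in-memory index whose `upsert` replaces every chunk of a document when that document changes. Everything here is a hypothetical sketch: `TinyIndex`, `split_sentences` and `fake_embed` are illustrative names, the embedding is a meaningless placeholder, and the policy text is invented.

```python
class TinyIndex:
    """In-memory stand-in for a vector database, keyed by document id."""

    def __init__(self, embed, split):
        self.embed = embed
        self.split = split
        self.rows = []  # each row: (doc_id, chunk_text, vector)

    def upsert(self, doc_id, text):
        # Drop the document's old chunks, then index the new version
        self.rows = [row for row in self.rows if row[0] != doc_id]
        for chunk in self.split(text):
            self.rows.append((doc_id, chunk, self.embed(chunk)))

    def texts(self):
        return [chunk for _, chunk, _ in self.rows]


def split_sentences(text):
    """Naive sentence splitter, good enough for the demo."""
    return [s.strip() for s in text.split(".") if s.strip()]


def fake_embed(text):
    """Placeholder vector (assumption); a real system calls an embedding model here."""
    return [float(len(text))]


index = TinyIndex(embed=fake_embed, split=split_sentences)
index.upsert("leave-policy", "New staff get 20 days. Unused days expire in March.")
index.upsert("travel-policy", "Book trains where possible.")

# The policy document changes: re-index it, and nothing else is touched
index.upsert("leave-policy", "New staff get 25 days.")
print(index.texts())
```

After the second `upsert` the stale "20 days" chunk is gone and the travel policy is untouched. No model weights were involved at any point.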

RAG is not magic: where it breaks

RAG is powerful, but it is not a magic wand, and the failure modes are worth knowing before you ship one.

The number-one rule: garbage retrieved is garbage answered. RAG’s output is only as good as its retrieval — pull the wrong chunks and the model grounds on the wrong facts and answers confidently wrong. Retrieval quality is the whole game. Chunking is a real tuning problem: too big wastes context and dilutes the signal, too small loses context. Even with good chunks, the model can still ignore or misread them, long contexts suffer “lost in the middle” (Part 15), and you can only stuff so much before hitting the context window (Part 14). So production RAG adds layers: a reranker (a second model that re-scores the retrieved chunks), hybrid keyword-plus-semantic search, and rigorous evaluation (Part 19). The honest framing: RAG sharply reduces hallucination — it does not erase it.
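As an illustration of the hybrid layer, one widely used way to merge a keyword ranking with a semantic ranking is reciprocal rank fusion: each chunk earns `1 / (k + rank)` from every list it appears in, and the sums decide the final order. The sketch below is an assumption about how you might combine the two lists, not something this series prescribes; the chunk ids are invented, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


keyword_hits = ["c2", "c1", "c4"]   # invented ranking from keyword search
semantic_hits = ["c1", "c3", "c2"]  # invented ranking from vector search

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Chunks that both searches agree on (`c1`, `c2`) rise above chunks that only one search found. A reranker would typically then re-score the top of this fused list before the chunks go into the prompt.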

RAG — and from reading to acting

RAG, in one view:

The model’s memory is frozen, private-blind and hallucination-prone, so instead of asking it to recall, you Retrieve relevant text, Augment the prompt with it, and Generate a grounded, cited answer. Offline you chunk, embed and store your documents in a vector database; online you embed the question, fetch the nearest chunks, and feed them in. The benefits — fresh and private facts with no retraining, citations, far less hallucination — are why it is everywhere. And remember the division of labour from Part 12: fine-tune to change behaviour; use RAG to supply knowledge. Most real systems use both.

But notice what RAG can and cannot do. It gives the model knowledge it can read. Some jobs, though, need action: do exact arithmetic (which Part 15 showed it is bad at), run a piece of code, search the live web right now, query a database, or take a multi-step task and actually carry it out. For that the model has to use tools and take steps — deciding what to do, doing it, looking at the result, and continuing. That is the leap from a question-answerer to an agent, and it is Part 18.
