Retrieval-Augmented Generation —
teaching LLMs to look things up before they speak
~15 min walk-through
02 / 12 — the problem
LLMs are frozen in time.
A language model learns from a training dataset with a cutoff date. After that — it knows nothing new. Ask it about last week's news, your internal docs, or anything private? It'll confidently make something up.
Knowledge cutoff → stale answers
No access to private / proprietary data
Hallucinations when context is missing
Fine-tuning is expensive and slow to update
We needed a way to give models fresh, relevant context at inference time — without retraining them.
03 / 12 — definition
RAG in one line
"Before answering, go find the relevant docs — then use them."
RAG = Retrieve relevant chunks from a knowledge base, Augment the prompt with them, then let the LLM Generate a grounded response.
Introduced by Lewis et al. (Meta AI, 2020). Now practically everywhere.
04 / 12 — architecture
The pipeline
📄 Docs → ✂️ Chunk → 🔢 Embed → 🗄️ Vector DB
❓ User Query → 🔢 Embed Query → 🔍 Retrieve Top-K → 🤖 LLM + Context → 💬 Answer
Two phases: offline indexing (happens once) and online retrieval (happens per query).
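The two phases can be sketched in a few lines of Python. This is a toy: `embed()` is a stand-in bag-of-words counter, not a real embedding model, and the "similarity" is a plain dot product — both are illustrative assumptions, not how production systems work.

```python
def embed(text):
    # Toy stand-in for a real embedding model (which would return
    # a dense 768-3072 dim vector). Here: word counts over a tiny vocab.
    words = text.lower().split()
    vocab = ["refund", "shipping", "days", "policy"]
    return [words.count(w) for w in vocab]

def similarity(a, b):
    # Dot product is enough for this toy example.
    return sum(x * y for x, y in zip(a, b))

# --- Phase 1: offline indexing (happens once) ---
docs = ["refund policy: refund within 30 days",
        "shipping policy: shipping takes 5 days"]
index = [(d, embed(d)) for d in docs]

# --- Phase 2: online retrieval (happens per query) ---
query = "how many days until my refund arrives"
best_doc, _ = max(index, key=lambda item: similarity(embed(query), item[1]))
prompt = f"Context: {best_doc}\n\nQuestion: {query}"
# `prompt` now goes to the LLM for grounded generation
```

The shape is the whole point: indexing runs once per corpus, retrieval runs on every query, and only the prompt-building step touches the LLM.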
05 / 12 — embeddings
What's an embedding?
An embedding is a dense vector — a list of numbers — that captures the semantic meaning of text. Similar meaning → vectors that are close in space.
"dog" and "puppy" are close
"king" − "man" + "woman" ≈ "queen"
Typically 768 – 3072 dimensions
Models: text-embedding-3-small, BGE, etc.
vectors cluster by meaning
06 / 12 — vector database
The Vector DB
A database optimized for storing and searching vectors by similarity — not by exact key matches.
search type
ANN
Approximate Nearest Neighbor — fast enough for prod
metric
Cosine similarity
How aligned two vectors are (−1 to 1; text embeddings usually land near 0–1)
popular options
Pinecone / Weaviate
Also: pgvector, Chroma, Qdrant, FAISS
You embed the query → search the DB → get the Top-K most semantically similar chunks back.
07 / 12 — chunking
Chunking matters a lot.
Before embedding, docs are split into smaller pieces. How you chunk changes retrieval quality significantly.
Fixed-size chunking — simple, 512 tokens with overlap. Fast but ignores structure.
Semantic chunking — split on meaning breaks, not character count.
Recursive splitting — try paragraph → sentence → word boundaries.
Context-aware — keep headings attached to their content.
A bad chunk strategy is one of the most common reasons RAG feels broken. Garbage in, garbage out.
08 / 12 — augmentation
Stuffing context into the prompt
The retrieved chunks get injected into the prompt alongside the user's question. The LLM is told to answer using only this context.
// simplified prompt template
system: "Answer only using the context below. If the answer isn't there, say you don't know."
context: [chunk_1, chunk_2, chunk_3]
question: "<user's query>"
This grounds the model. It's citing real content, not recalling from weights.
09 / 12 — comparison
RAG vs Fine-tuning
RAG
✓ Knowledge updated by swapping docs
✓ Cites sources, more auditable
✓ No GPU training needed
✓ Works well for factual Q&A
✗ Slower at inference (retrieval step)
✗ Context window is a constraint
vs
Fine-tuning
✓ Model "learns" style / tone / format
✓ No retrieval latency
✓ Good for behavior adaptation
✗ Expensive to train & update
✗ Can still hallucinate
✗ Opaque — hard to audit
10 / 12 — use cases
Where it's actually used
📚 Enterprise
Internal KB bots
HR docs, runbooks, Confluence Q&A
⚖️ Legal / Medical
Document analysis
Case law, clinical guidelines, research
💻 Dev tools
Code search
GitHub Copilot context, codebase Q&A
🛒 E-commerce
Product assistants
Answer from live catalog & policies
Any domain where you have a corpus of docs and need accurate, citable answers.
11 / 12 — limitations
It's not magic.
Retrieval can fail — if the right chunk isn't fetched, the answer is wrong.
Context window limits — you can only stuff so many chunks.