// a thing I explored

RAG.

Retrieval-Augmented Generation —
teaching LLMs to look things up before they speak

~15 min walk-through
02 / 12 — the problem

LLMs are frozen in time.

A language model learns from a training dataset with a cutoff date. After that — it knows nothing new. Ask it about last week's news, your internal docs, or anything private? It'll confidently make something up.

We needed a way to give models fresh, relevant context at inference time — without retraining them.

03 / 12 — definition
RAG in one line
"Before answering, go find the relevant docs — then use them."

RAG = Retrieve relevant chunks from a knowledge base, Augment the prompt with them, then let the LLM Generate a grounded response.

Introduced by Lewis et al. (Meta AI, 2020). Now practically everywhere.

04 / 12 — architecture

The pipeline

offline indexing:
📄 Docs → ✂️ Chunk → 🔢 Embed → 🗄️ Vector DB

online retrieval:
User Query → 🔢 Embed Query → 🔍 Retrieve Top-K → 🤖 LLM + Context → 💬 Answer

Two phases: offline indexing (happens once) and online retrieval (happens per query).
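The two phases can be sketched in a few lines of Python. This is a toy stand-in: a bag-of-words counter plays the embedding model, a plain list plays the vector DB, and the documents are made up.

```python
# Toy RAG pipeline: offline indexing + online retrieval.
# The bag-of-words "embedding" stands in for a real embedding model.
from collections import Counter
import math

def embed(text):
    # hypothetical stand-in: word counts as a sparse vector
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# --- offline indexing (happens once) ---
docs = ["cats are small pets", "the moon orbits the earth"]
index = [(d, embed(d)) for d in docs]

# --- online retrieval (happens per query) ---
def retrieve(query, k=1):
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

print(retrieve("which pets are small?"))  # → ['cats are small pets']
```

A real system swaps in an embedding model and a vector DB, but the shape stays the same.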

05 / 12 — embeddings

What's an embedding?

An embedding is a dense vector — a list of numbers — that captures the semantic meaning of text. Similar meaning → vectors that are close in space.

  • "dog" and "puppy" are close
  • "king" − "man" + "woman" ≈ "queen"
  • Typically 768 – 3072 dimensions
  • Models: text-embedding-3-small, BGE, etc.
[figure: 2-D projection with "animals", "royalty", and "tech" clusters — vectors cluster by meaning]
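"Close in space" can be checked directly. Here are hand-made 3-dimensional vectors (real embeddings have hundreds of dimensions; these numbers come from no actual model):

```python
# Illustrative 3-d "embeddings" — the numbers are invented,
# not output from any real embedding model.
import math

vecs = {
    "dog":   [0.90, 0.80, 0.10],
    "puppy": [0.85, 0.75, 0.15],
    "car":   [0.10, 0.20, 0.90],
}

def dist(a, b):
    # Euclidean distance between two labeled vectors
    return math.dist(vecs[a], vecs[b])

print(dist("dog", "puppy") < dist("dog", "car"))  # → True
```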
06 / 12 — vector database

The Vector DB

A database optimized for storing and searching vectors by similarity — not by exact key matches.

  • Search type: ANN (Approximate Nearest Neighbor) — fast enough for prod
  • Metric: cosine similarity — how "parallel" two vectors are (−1 to 1; 1 = same direction)
  • Popular options: Pinecone, Weaviate — also pgvector, Chroma, Qdrant, FAISS

You embed the query → search the DB → get the Top-K most semantically similar chunks back.
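Brute-force exact Top-K looks like this — it's what an ANN index (HNSW, IVF, …) approximates at scale. The chunk IDs and vectors below are invented:

```python
# Exact top-k by cosine similarity over an in-memory "vector DB".
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

db = {  # chunk_id -> stored embedding (made-up values)
    "chunk_a": [0.9, 0.1, 0.0],
    "chunk_b": [0.1, 0.9, 0.1],
    "chunk_c": [0.8, 0.2, 0.1],
}

def top_k(query_vec, k=2):
    ranked = sorted(db, key=lambda cid: cosine(query_vec, db[cid]), reverse=True)
    return ranked[:k]

print(top_k([1.0, 0.0, 0.0]))  # → ['chunk_a', 'chunk_c']
```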

07 / 12 — chunking

Chunking matters a lot.

Before embedding, docs are split into smaller pieces. How you chunk changes retrieval quality significantly.

A bad chunk strategy is one of the most common reasons RAG feels broken. Garbage in, garbage out.
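One common baseline is fixed-size chunks with overlap, so text cut at a boundary survives into the next chunk. A minimal sketch (real systems often split on sentences or headings instead):

```python
# Fixed-size chunking with overlap — the simplest baseline strategy.
def chunk(text, size=50, overlap=10):
    step = size - overlap  # how far the window slides each time
    return [text[i:i + size] for i in range(0, len(text), step)]

pieces = chunk("a" * 120, size=50, overlap=10)
print([len(p) for p in pieces])  # → [50, 50, 40]
```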

08 / 12 — augmentation

Stuffing context into the prompt

The retrieved chunks get injected into the prompt alongside the user's question. The LLM is told to answer using only this context.

// simplified prompt template

system: "Answer only using the context below.
  If the answer isn't there, say you don't know."


context: [chunk_1, chunk_2, chunk_3]

question: "<user's query>"

This grounds the model. It's citing real content, not recalling from weights.
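A minimal prompt builder along those lines — the template wording is illustrative, not a canonical format:

```python
# Assemble the augmented prompt from retrieved chunks + the question.
def build_prompt(question, chunks):
    # number each chunk so the model can cite it
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer only using the context below. "
        "If the answer isn't there, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt("What is RAG?", ["RAG retrieves docs before generating."]))
```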

09 / 12 — comparison

RAG vs Fine-tuning

RAG
  • ✓ Knowledge updated by swapping docs
  • ✓ Cites sources, more auditable
  • ✓ No GPU training needed
  • ✓ Works well for factual Q&A
  • ✗ Slower at inference (retrieval step)
  • ✗ Context window is a constraint
vs
Fine-tuning
  • ✓ Model "learns" style / tone / format
  • ✓ No retrieval latency
  • ✓ Good for behavior adaptation
  • ✗ Expensive to train & update
  • ✗ Can still hallucinate
  • ✗ Opaque — hard to audit
10 / 12 — use cases

Where it's actually used

📚 Enterprise
Internal KB bots
HR docs, runbooks, Confluence Q&A
⚖️ Legal / Medical
Document analysis
Case law, clinical guidelines, research
💻 Dev tools
Code search
GitHub Copilot context, codebase Q&A
🛒 E-commerce
Product assistants
Answer from live catalog & policies

Any domain where you have a corpus of docs and need accurate, citable answers.

11 / 12 — limitations

It's not magic.

Retrieval quality is the bottleneck: if the right chunk isn't retrieved, the model can't use it. And the context window caps how much you can stuff in.

Active research area: HyDE, FLARE, multi-hop RAG, re-ranking, GraphRAG.

12 / 12 — wrap-up

TL;DR.

Problem
LLMs lack fresh / private knowledge
Solution
Retrieve relevant docs at query time
Key components
Embeddings + Vector DB + LLM
Chunking
How you split docs changes everything
vs Fine-tuning
RAG for facts; FT for behavior
Limitations
Retrieval quality is the bottleneck

thanks for listening — questions?
