Retrieval-Augmented Generation —
teaching LLMs to look things up before they speak
~15 min walk-through
02 / 12 — the problem
LLMs are frozen in time.
A language model learns from a training dataset with a cutoff date. After that — it knows nothing new. Ask it about last week's news, your internal docs, or anything private? It'll confidently make something up.
Knowledge cutoff → stale answers
No access to private / proprietary data
Hallucinations when context is missing
Fine-tuning is expensive and slow to update
We needed a way to give models fresh, relevant context at inference time — without retraining them.
03 / 12 — definition
RAG in one line
"Before answering, go find the relevant docs — then use them."
RAG = Retrieve relevant chunks from a knowledge base, Augment the prompt with them, then let the LLM Generate a grounded response.
Introduced by Lewis et al. (Meta AI, 2020). Now practically everywhere.
04 / 12 — architecture
The pipeline
📄 Docs → ✂️ Chunk → 🔢 Embed → 🗄️ Vector DB
❓ User Query → 🔢 Embed Query → 🔍 Retrieve Top-K → 🤖 LLM + Context → 💬 Answer
Two phases: offline indexing (happens once) and online retrieval (happens per query).
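The two phases can be sketched in a few lines of Python. This is a toy: `embed()` is a stand-in bag-of-words counter, not a real embedding model, and the "similarity" is a plain dot product — both are illustrative assumptions, not how production systems work.

```python
def embed(text):
    # Toy stand-in for a real embedding model (which would return
    # a dense 768-3072 dim vector). Here: word counts over a tiny vocab.
    words = text.lower().split()
    vocab = ["refund", "shipping", "days", "policy"]
    return [words.count(w) for w in vocab]

def similarity(a, b):
    # Dot product is enough for this toy example.
    return sum(x * y for x, y in zip(a, b))

# --- Phase 1: offline indexing (happens once) ---
docs = ["refund policy: refund within 30 days",
        "shipping policy: shipping takes 5 days"]
index = [(d, embed(d)) for d in docs]

# --- Phase 2: online retrieval (happens per query) ---
query = "how many days until my refund arrives"
best_doc, _ = max(index, key=lambda item: similarity(embed(query), item[1]))
prompt = f"Context: {best_doc}\n\nQuestion: {query}"
# `prompt` now goes to the LLM for grounded generation
```

The shape is the whole point: indexing runs once per corpus, retrieval runs on every query, and only the prompt-building step touches the LLM.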
05 / 12 — embeddings
What's an embedding?
An embedding is a dense vector — a list of numbers — that captures the semantic meaning of text. Similar meaning → vectors that are close in space.
"dog" and "puppy" are close
"king" − "man" + "woman" ≈ "queen"
Typically 768 – 3072 dimensions
Models: text-embedding-3-small, BGE, etc.
vectors cluster by meaning
06 / 12 — vector database
The Vector DB
A database optimized for storing and searching vectors by similarity — not by exact key matches.
search type
ANN
Approximate Nearest Neighbor — fast enough for prod
metric
Cosine similarity
How aligned two vectors are (−1 to 1; text embeddings usually land near 0–1)
popular options
Pinecone / Weaviate
Also: pgvector, Chroma, Qdrant, FAISS
You embed the query → search the DB → get the Top-K most semantically similar chunks back.
07 / 12 — chunking
Chunking matters a lot.
Before embedding, docs are split into smaller pieces. How you chunk changes retrieval quality significantly.
Fixed-size chunking — simple, 512 tokens with overlap. Fast but ignores structure.
Semantic chunking — split on meaning breaks, not character count.
Recursive splitting — try paragraph → sentence → word boundaries.
Context-aware — keep headings attached to their content.
A bad chunk strategy is one of the most common reasons RAG feels broken. Garbage in, garbage out.
08 / 12 — augmentation
Stuffing context into the prompt
The retrieved chunks get injected into the prompt alongside the user's question. The LLM is told to answer using only this context.
// simplified prompt template
system: "Answer only using the context below. If the answer isn't there, say you don't know."
context: [chunk_1, chunk_2, chunk_3]
question: "<user's query>"
This grounds the model. It's citing real content, not recalling from weights.
09 / 12 — comparison
RAG vs Fine-tuning
RAG
✓ Knowledge updated by swapping docs
✓ Cites sources, more auditable
✓ No GPU training needed
✓ Works well for factual Q&A
✗ Slower at inference (retrieval step)
✗ Context window is a constraint
vs
Fine-tuning
✓ Model "learns" style / tone / format
✓ No retrieval latency
✓ Good for behavior adaptation
✗ Expensive to train & update
✗ Can still hallucinate
✗ Opaque — hard to audit
10 / 12 — use cases
Where it's actually used
📚 Enterprise
Internal KB bots
HR docs, runbooks, Confluence Q&A
⚖️ Legal / Medical
Document analysis
Case law, clinical guidelines, research
💻 Dev tools
Code search
GitHub Copilot context, codebase Q&A
🛒 E-commerce
Product assistants
Answer from live catalog & policies
Any domain where you have a corpus of docs and need accurate, citable answers.
11 / 12 — limitations
It's not magic.
Retrieval can fail — if the right chunk isn't fetched, the answer is wrong.
Context window limits — you can only stuff so many chunks.