Lab 006: What is RAG?¶
What You'll Learn¶
- Why LLMs need external knowledge (and why training data alone isn't enough)
- The complete RAG pipeline: ingest → chunk → embed → store → retrieve → generate
- The difference between keyword search, semantic search, and hybrid search
- When to use RAG vs. fine-tuning vs. just a bigger context window
- Real-world RAG architectures
Introduction¶
Imagine you've built an AI agent for your company. It answers questions beautifully – until a user asks about a product launched last month, or a policy updated last week.
The LLM doesn't know. Its training data has a cutoff date. And even if the information existed in training data, the model may not have memorized it accurately.
RAG (Retrieval-Augmented Generation) solves this by connecting the LLM to your own, up-to-date knowledge at query time – without retraining the model.
📦 Supporting Files¶
Download these files before starting the lab.
Save all files to a lab-006/ folder in your working directory.
| File | Description | Download |
|---|---|---|
| faq_backpacks.txt | FAQ knowledge base file | 📥 Download |
| faq_clothing.txt | FAQ knowledge base file | 📥 Download |
| faq_footwear.txt | FAQ knowledge base file | 📥 Download |
| faq_sleeping_bags.txt | FAQ knowledge base file | 📥 Download |
| faq_tents.txt | FAQ knowledge base file | 📥 Download |
Part 1: The Core Problem¶
LLMs have three knowledge limitations:
| Limitation | Description | Example |
|---|---|---|
| Training cutoff | Knowledge stops at a date | "What happened last week?" |
| Private data | Never saw your documents | "What's our refund policy?" |
| Hallucination risk | May confabulate when uncertain | Invents plausible-sounding but wrong answer |
The naive solution – "just put all your documents in the prompt" – doesn't scale. A 500-page manual is ~375,000 tokens. Most LLMs cap at 128,000 tokens, and even if they didn't, you'd pay for all those tokens on every single query.
RAG's answer: Only retrieve the relevant pieces, right when you need them.
Part 2: The RAG Pipeline¶
RAG has two distinct phases:
Phase 1 – Ingestion (runs once, or on schedule)¶
Load documents → Chunk → Embed → Store in vector database
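As a rough sketch of this phase, here is a minimal, self-contained version. The `embed()` function is a toy character-frequency stand-in so the example runs without any model (a real pipeline would call an embedding model), and a plain list stands in for the vector database; the function names are illustrative, not part of any library.

```python
def embed(text: str) -> list[float]:
    # Toy character-frequency "embedding" (26 letter counts); a real
    # system would call an embedding model here instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def ingest(documents: list[str], chunk_size: int = 200) -> list[dict]:
    store = []  # stands in for a vector database
    for doc in documents:                                   # Load
        chunks = [doc[i:i + chunk_size]                     # Chunk (fixed-size)
                  for i in range(0, len(doc), chunk_size)]
        for chunk in chunks:
            store.append({"text": chunk, "vector": embed(chunk)})  # Embed + Store
    return store

store = ingest(["Our tents are waterproof up to 3000mm.",
                "Backpacks ship within 2 business days."])
print(len(store))  # 2: one chunk per short document
```

This runs once per document set (or on a refresh schedule), not per query.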
Phase 2 – Retrieval + Generation (runs on every query)¶
Embed query → Search → Augment prompt → Generate answer
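A minimal sketch of the query-time side, using the same kind of toy character-frequency embedding (a stand-in for a real embedding model) and cosine similarity for the vector search. All names here are illustrative.

```python
import math

def embed(text: str) -> list[float]:
    # Toy character-frequency "embedding"; a real system would call an
    # embedding model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(query: str, store: list[dict], k: int = 2) -> list[str]:
    qvec = embed(query)                                   # Embed the query
    ranked = sorted(store, key=lambda c: cosine(qvec, c["vector"]), reverse=True)
    return [c["text"] for c in ranked[:k]]                # Top-k chunks

# A tiny pre-ingested store, then one query:
store = [{"text": t, "vector": embed(t)} for t in
         ["Tents are waterproof up to 3000mm.", "Backpacks ship in 2 days."]]
context = retrieve("Is the tent waterproof?", store, k=1)[0]
prompt = f"Answer using only this context:\n{context}\n\nQuestion: Is the tent waterproof?"
print(context)  # the tent chunk is ranked above the backpack chunk
```

The final `prompt` string is what the Augment and Generate steps would send to the LLM.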
🤔 Check Your Understanding
In the RAG pipeline, what is the difference between the ingestion phase and the retrieval phase, and how often does each run?
Answer
The ingestion phase (Load → Chunk → Embed → Store) runs once (or on a schedule) to prepare your documents. The retrieval phase (Embed query → Search → Augment → Generate) runs on every user query to find relevant chunks and generate an answer. Ingestion is a batch process; retrieval is real-time.
Part 3: Chunking Strategies¶
How you split documents matters enormously.
| Strategy | How | Best for |
|---|---|---|
| Fixed-size | Split every N tokens | Simple, fast, works for most cases |
| Sentence/paragraph | Split at natural boundaries | Better context preservation |
| Semantic | Split when topic changes | Best quality, more complex |
| Recursive | Try paragraph → sentence → word | Good default for mixed content |
Overlap is important. If you split at exactly 512 tokens, relevant information that spans a boundary gets lost. Add 50–100 token overlap between chunks.
Chunk 1: tokens 1–512
Chunk 2: tokens 463–974 ← 50-token overlap
Chunk 3: tokens 925–1436 ← 50-token overlap
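The sliding-window scheme above can be sketched in a few lines. This is illustrative only: the "tokens" here are list items, whereas a real chunker counts tokens from the model's tokenizer.

```python
def chunk_with_overlap(tokens: list[str], size: int = 512,
                       overlap: int = 50) -> list[list[str]]:
    # Slide a window of `size` tokens, stepping by (size - overlap) so each
    # chunk shares `overlap` tokens with the previous one. The final chunk
    # may be shorter than `size`.
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = [f"t{i}" for i in range(1, 1437)]   # t1 .. t1436
chunks = chunk_with_overlap(tokens)
print(chunks[0][0], chunks[0][-1])           # t1 t512
print(chunks[1][0], chunks[1][-1])           # t463 t974
print(chunks[2][0], chunks[2][-1])           # t925 t1436
```

Note that the last 50 tokens of each chunk reappear as the first 50 tokens of the next, so boundary-spanning content survives in at least one chunk.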
🤔 Check Your Understanding
Why is it important to add overlap between chunks when splitting documents?
Answer
Without overlap, relevant information that spans a chunk boundary gets split across two chunks and may be lost during retrieval. Adding 50–100 token overlap ensures that context at the edges is preserved in both adjacent chunks, improving retrieval quality.
Part 4: Search Types¶
Keyword Search (BM25)¶
Traditional search – matches exact words. Fast, interpretable, but misses synonyms and intent.
Query: "waterproof jacket"
Finds: documents containing exactly "waterproof" and "jacket"
Misses: "rain-resistant coat", "weatherproof outerwear"
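The miss above can be demonstrated with a deliberately naive matcher (real keyword search ranks documents with BM25 scoring rather than a boolean all-words test; this sketch only shows why exact-word matching fails on synonyms):

```python
def keyword_match(query: str, doc: str) -> bool:
    # Naive exact-word matching: every query word must appear verbatim
    # in the document. No stemming, no synonyms, no ranking.
    doc_words = set(doc.lower().split())
    return all(word in doc_words for word in query.lower().split())

print(keyword_match("waterproof jacket", "lightweight waterproof hiking jacket"))  # True
print(keyword_match("waterproof jacket", "rain-resistant coat for wet weather"))   # False
```

The second document is exactly what the user wants, but exact-word matching cannot see it.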
Semantic Search (Vector)¶
Compares meaning, not words. Finds conceptually similar content.
Query: "waterproof jacket"
Finds: "rain-resistant coat", "all-weather outerwear", "waterproof jacket"
Based on: vector similarity in embedding space
Hybrid Search (BM25 + Vector)¶
Best of both worlds – combines keyword and semantic scores.
Most production RAG systems use hybrid search because it handles both exact lookups ("SKU-12345") and semantic queries ("something for camping in the rain").
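One common way to combine the two scores is a weighted blend. The sketch below is a toy: `keyword_score` is word overlap rather than real BM25, `vector_score` uses a character-frequency embedding rather than a real model, and the `alpha` weight is an arbitrary choice (production systems often use BM25 plus real embeddings, combined with techniques like Reciprocal Rank Fusion).

```python
import math

def keyword_score(query: str, doc: str) -> float:
    # Crude keyword score: fraction of query words present in the document.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q)

def vector_score(query: str, doc: str) -> float:
    # Toy semantic score: cosine similarity of character-frequency vectors.
    def embed(text):
        v = [0.0] * 26
        for ch in text.lower():
            if ch.isalpha():
                v[ord(ch) - ord("a")] += 1.0
        return v
    a, b = embed(query), embed(doc)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def hybrid_score(query: str, doc: str, alpha: float = 0.5) -> float:
    # Weighted blend of keyword and semantic scores.
    return alpha * keyword_score(query, doc) + (1 - alpha) * vector_score(query, doc)

docs = ["SKU-12345 waterproof jacket", "rain-resistant coat for wet weather"]
best = max(docs, key=lambda d: hybrid_score("waterproof jacket", d))
print(best)  # the exact-match document wins this query
```

An exact lookup like "SKU-12345" would be carried by the keyword term, while a paraphrased query would be carried by the vector term.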
🤔 Check Your Understanding
A user searches for "rain-resistant coat" but the document only contains the phrase "waterproof jacket." Will keyword search find it? Will semantic search? Why?
Answer
Keyword search will miss it – there are no matching words between "rain-resistant coat" and "waterproof jacket." Semantic search will find it because it compares meaning via vector similarity, not exact words. Both phrases have very similar meanings and will have similar vector representations. This is why hybrid search (combining both) is preferred in production.
Part 5: RAG vs. Fine-tuning vs. Big Context¶
A common question: "Why not just fine-tune the model on my data?"
| Approach | Cost | Freshness | Best for |
|---|---|---|---|
| RAG | Low | ✓ Real-time | Dynamic data, documents, Q&A |
| Fine-tuning | High | ✗ Static | Tone, style, domain vocabulary |
| Big context | Medium | ✓ Per-request | Small datasets that fit in context |
| RAG + Fine-tuning | High | ✓ Real-time | Production systems needing both |
Rule of thumb: Use RAG for knowledge (facts, documents). Use fine-tuning for behavior (tone, style, format). They're complementary, not competing.
Part 6: RAG Quality Metrics¶
A RAG system can fail in two places:
| Failure | Symptom | Fix |
|---|---|---|
| Bad retrieval | Retrieved chunks aren't relevant | Better chunking, hybrid search, re-ranking |
| Bad generation | LLM ignores or misuses retrieved content | Stronger system prompt, citation enforcement |
Key metrics used in Lab 035 and Lab 042:
- Groundedness: Is the answer supported by retrieved documents?
- Relevance: Are the retrieved chunks actually relevant to the question?
- Coherence: Is the answer well-structured and readable?
- Faithfulness: Does the answer stay true to the source material?
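As a very rough illustration of the groundedness idea, here is a word-overlap heuristic. This is an assumption-laden sketch: real evaluators (like those used in the labs referenced above) typically use an LLM judge, which handles paraphrase and negation far better than word counting.

```python
def groundedness(answer: str, retrieved_chunks: list[str]) -> float:
    # Rough heuristic: fraction of answer words that appear somewhere in
    # the retrieved chunks. 1.0 = every word is supported by the sources.
    source_words = set(" ".join(retrieved_chunks).lower().split())
    answer_words = [w.strip(".,!?").lower() for w in answer.split()]
    if not answer_words:
        return 0.0
    supported = sum(1 for w in answer_words if w in source_words)
    return supported / len(answer_words)

chunks = ["Our tents are waterproof up to a 3000mm rating."]
print(groundedness("Tents are waterproof up to 3000mm", chunks))  # 1.0: fully grounded
print(groundedness("We offer free shipping", chunks))             # 0.0: unsupported claim
```

A low score flags a likely hallucination: the generation step produced claims the retrieval step never supplied.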
Real-World RAG Architectures¶
Basic RAG¶
User → Embed query → Vector search → Augment prompt → Generate → Answer
Agentic RAG (covered in Lab 026)¶
User → Agent decides: search? what query? how many chunks?
→ Multiple targeted searches
→ Agent synthesizes results
→ Answer with citations
Corrective RAG¶
User → Retrieve → Grade relevance → If poor: web search fallback
→ Augment → Generate → Self-check → Answer
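The corrective flow above can be sketched as plain control flow. Every component name here (`retrieve`, `grade`, `web_search`, `generate`) is a hypothetical stand-in, wired up with stubs so the routing logic is visible.

```python
def corrective_rag(question, retrieve, grade, web_search, generate,
                   threshold=0.5):
    chunks = retrieve(question)                  # normal retrieval
    if grade(question, chunks) < threshold:      # grade relevance of results
        chunks = web_search(question)            # fallback when retrieval is poor
    return generate(question, chunks)            # augment + generate

# Tiny demo with stub components: retrieval finds nothing, so the
# relevance grade fails and the web-search fallback supplies context.
answer = corrective_rag(
    "What's the tent warranty?",
    retrieve=lambda q: [],
    grade=lambda q, c: 1.0 if c else 0.0,
    web_search=lambda q: ["Tents carry a 2-year warranty."],
    generate=lambda q, c: f"Based on sources: {c[0]}",
)
print(answer)
```

The self-check step from the diagram would wrap `generate` in a similar grade-and-retry loop.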
🧠 Knowledge Check¶
Q1 (Multiple Choice): What does RAG stand for, and what problem does it primarily solve?
- A) Recursive Augmented Graph – it solves multi-step reasoning problems
- B) Retrieval-Augmented Generation – it grounds LLM answers in private or up-to-date data without retraining
- C) Randomized Agent Generation – it makes agent responses less deterministic
- D) Ranked Answer Generation – it improves ranking of search results
▶ Reveal Answer
Correct: B – Retrieval-Augmented Generation
RAG connects the LLM to your own knowledge at query time. Instead of the model relying on training data (which has a cutoff date and doesn't include your private documents), RAG retrieves the most relevant chunks from your data store and includes them in the prompt. No retraining needed.
Q2 (Multiple Choice): In the RAG ingestion pipeline, which step comes immediately BEFORE storing vectors in the database?
- A) Chunking
- B) Loading documents
- C) Generating embeddings
- D) Semantic reranking
▶ Reveal Answer
Correct: C – Generating embeddings
The ingestion order is: Load → Chunk → Embed → Store. You first load raw documents, split them into smaller chunks (~512 tokens with overlap), then convert each chunk to a vector embedding using the embedding model, and finally store those vectors in the vector database. Semantic reranking is a retrieval step, not an ingestion step.
Q3 (Run the Lab): Open the file lab-006/faq_tents.txt. How many Q&A pairs does it contain, and what is the topic of the LAST question?
Open lab-006/faq_tents.txt and count the Q&A pairs. The last question starts with "Q:".
▶ Reveal Answer
5 Q&A pairs. The last question is: "Can I use a 2-person tent solo?"
📥 faq_tents.txt contains exactly 5 Q&A pairs covering: solo backpacking tent choice, 3-season vs 4-season, waterproofing, pole materials, and using a 2P tent solo. This is the kind of knowledge base a RAG system would ingest – each Q&A pair is a natural chunk for embedding.
Summary¶
| Concept | Key takeaway |
|---|---|
| Why RAG | LLMs don't know your data or recent events – RAG fixes this |
| Ingestion | Load → Chunk → Embed → Store (runs once) |
| Retrieval | Embed query → Vector search → Top-k chunks |
| Chunking | Size + overlap matter; ~512 tokens with 50-token overlap |
| Search | Hybrid (keyword + semantic) beats either alone |
| Evaluation | Measure groundedness and relevance – catch failures in both retrieval and generation |
Next Steps¶
- Understand embedding vectors: → Lab 007 – What are Embeddings?
- Build a RAG app for free: → Lab 022 – RAG with GitHub Models + pgvector
- Production RAG on Azure: → Lab 031 – pgvector Semantic Search on Azure