Lab 026: Agentic RAG PatternΒΆ
What You'll LearnΒΆ
- Why naive RAG fails on complex questions
- Query rewriting β improving retrieval before searching
- Hypothetical Document Embeddings (HyDE) β generate to retrieve
- Multi-hop RAG β iterative retrieval for multi-part questions
- Self-reflection β agent evaluates its own answer quality
IntroductionΒΆ
Naive RAG: embed query β search β generate. Works fine for simple questions. Fails for:
- Vague questions: "Tell me about it" (what is "it"?)
- Multi-hop: "What's cheaper β camping or climbing gear? What would I save?"
- Knowledge gaps: "What's the newest product?" (may need to know current date)
- Hallucination: model invents facts not in context
Agentic RAG adds reasoning loops around retrieval. The agent decides how to search, evaluates results, and retries if needed.
PrerequisitesΒΆ
- Completed Lab 022 (pgvector running + documents ingested)
GITHUB_TOKENset- pgvector container running:
docker start pgvector-rag
Sample data already loaded?
If you ran Lab 022's ingest step with the sample dataset, you already have 42 documents in pgvector ready for this lab. If not, run Lab 022's Step 3 first.
Lab ExerciseΒΆ
Step 1: Query rewritingΒΆ
Before searching, ask the LLM to rewrite the user's question into better search queries.
import os
from openai import OpenAI
from search import search # from Lab 022
client = OpenAI(
base_url="https://models.inference.ai.azure.com",
api_key=os.environ["GITHUB_TOKEN"],
)
def rewrite_query(original: str) -> list[str]:
"""Generate 3 search query variations for better recall."""
response = client.chat.completions.create(
model="gpt-4o-mini",
temperature=0,
messages=[
{
"role": "system",
"content": (
"You generate search queries. Given a user question, produce 3 different "
"search queries that would find relevant information. "
"Return one query per line, no numbering or bullets."
)
},
{"role": "user", "content": f"Question: {original}"}
]
)
queries = response.choices[0].message.content.strip().split("\n")
return [q.strip() for q in queries if q.strip()]
# Test
question = "Is this tent good for rain?"
queries = rewrite_query(question)
print("Rewritten queries:")
for q in queries:
print(f" β’ {q}")
# Example output:
# β’ Summit Pro Tent waterproof rating
# β’ tent performance in wet weather conditions
# β’ camping shelter rain protection features
Step 2: Retrieve with multiple queries and deduplicateΒΆ
def retrieve_with_rewriting(question: str, top_k: int = 3) -> list[dict]:
queries = [question] + rewrite_query(question)
seen_titles = set()
all_docs = []
for q in queries:
results = search(q, top_k=top_k)
for doc in results:
if doc["title"] not in seen_titles and doc["similarity"] > 0.70:
seen_titles.add(doc["title"])
all_docs.append(doc)
# Sort by best similarity across queries
all_docs.sort(key=lambda d: d["similarity"], reverse=True)
return all_docs[:top_k]
Step 3: HyDE β Hypothetical Document EmbeddingsΒΆ
Instead of embedding the question, generate a hypothetical answer and embed that. This often matches the real document better.
def hyde_search(question: str, top_k: int = 3) -> list[dict]:
# Generate a hypothetical answer
hyp_response = client.chat.completions.create(
model="gpt-4o-mini",
temperature=0.3,
messages=[
{"role": "system", "content": "Write a short, factual answer to this question as if you knew the answer. Be specific."},
{"role": "user", "content": question}
]
)
hypothetical_answer = hyp_response.choices[0].message.content
print(f" [HyDE] Generated hypothesis: {hypothetical_answer[:80]}...")
# Search using the hypothetical answer (more content = better embedding match)
return search(hypothetical_answer, top_k=top_k)
Step 4: Multi-hop RAGΒΆ
For complex questions, retrieve β generate partial answer β retrieve again.
def multi_hop_answer(question: str) -> str:
max_hops = 3
context_docs = []
current_question = question
for hop in range(max_hops):
print(f"\n [Hop {hop+1}] Searching: {current_question}")
new_docs = retrieve_with_rewriting(current_question)
context_docs.extend(new_docs)
context = "\n\n".join([
f"**{d['title']}**\n{d['content']}" for d in context_docs
])
# Ask the model: can I answer, or do I need more info?
check_response = client.chat.completions.create(
model="gpt-4o-mini",
temperature=0,
messages=[
{
"role": "system",
"content": (
"Given context and a question, respond with JSON: "
'{"can_answer": true/false, "answer": "...", "follow_up_query": "..."}\n'
"Set can_answer=true if context fully answers the question. "
"Otherwise, set follow_up_query to what you still need."
)
},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
]
)
import json
result = json.loads(check_response.choices[0].message.content)
if result["can_answer"]:
print(f" β
Answered after {hop+1} hop(s)")
return result["answer"]
current_question = result.get("follow_up_query", question)
print(f" π Need more info: {current_question}")
# Final attempt with all gathered context
return answer_with_context(question, context_docs)
def answer_with_context(question: str, docs: list[dict]) -> str:
context = "\n\n".join([f"**{d['title']}**\n{d['content']}" for d in docs])
response = client.chat.completions.create(
model="gpt-4o-mini",
temperature=0,
messages=[
{"role": "system", "content": "Answer based only on the provided context. Be honest if information is incomplete."},
{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
]
)
return response.choices[0].message.content
# Test multi-hop
print(multi_hop_answer("What's the cheapest product suitable for a Rainier climb?"))
Step 5: Self-reflection (answer quality check)ΒΆ
from pydantic import BaseModel
class AnswerQuality(BaseModel):
is_grounded: bool # Is the answer supported by the context?
has_hallucination: bool # Did the model add facts not in context?
confidence: float # 0.0 to 1.0
issues: list[str] # List of specific problems, if any
def check_answer_quality(question: str, context: str, answer: str) -> AnswerQuality:
response = client.beta.chat.completions.parse(
model="gpt-4o-mini",
temperature=0,
messages=[
{
"role": "system",
"content": "You are a RAG quality evaluator. Check if the answer is grounded in the context."
},
{
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer}"
}
],
response_format=AnswerQuality,
)
return response.choices[0].message.parsed
Agentic RAG Architecture SummaryΒΆ
User Question
β
βΌ
Query Rewriting βββΊ 3 query variants
β
βΌ
Parallel Search βββΊ deduplicated docs
β
βΌ
Can I answer? ββNoβββΊ Follow-up search (multi-hop)
βYes
βΌ
Generate Answer
β
βΌ
Self-Reflection βββΊ Is it grounded?
β
βΌ
Return Answer (or retry)
Next StepsΒΆ
- Agent memory across sessions: β Lab 027 β Agent Memory Patterns
- Evaluate RAG quality at scale: β Lab 035 β Agent Evaluation