Lab 026: Agentic RAG Pattern¶
What You'll Learn¶
- Why naive RAG fails on complex questions
- Query rewriting: improving retrieval before the search
- Hypothetical Document Embeddings (HyDE): generate in order to retrieve
- Multi-hop RAG: iterative retrieval for multi-part questions
- Self-reflection: the agent evaluates the quality of its own answer
Introduction¶
Naive RAG: embed the query → search → generate. This works well for simple questions. It breaks down on:
- Vague questions: "Tell me about it" (what is "it"?)
- Multi-hop: "Which is cheaper, camping gear or climbing gear? How much would I save?"
- Knowledge gaps: "What's the newest product?" (may require knowing the current date)
- Hallucination: the model invents facts that are not in the context
Agentic RAG wraps reasoning loops around retrieval. The agent decides how to search, evaluates the results, and retries when needed.
Prerequisites¶
- Completed Lab 022 (pgvector running + documents ingested)
- GITHUB_TOKEN set
- pgvector container running:

```bash
docker start pgvector-rag
```
Sample data already loaded?
If you ran the Lab 022 ingestion step with the sample dataset, you already have 42 documents in pgvector ready for this lab. Otherwise, run Step 3 of Lab 022 first.
Lab Exercise¶
Step 1: Query rewriting¶
Before searching, ask the LLM to rewrite the user's question into better search queries.
```python
import os
from openai import OpenAI
from search import search  # from Lab 022

client = OpenAI(
    base_url="https://models.inference.ai.azure.com",
    api_key=os.environ["GITHUB_TOKEN"],
)

def rewrite_query(original: str) -> list[str]:
    """Generate 3 search query variations for better recall."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": (
                    "You generate search queries. Given a user question, produce 3 different "
                    "search queries that would find relevant information. "
                    "Return one query per line, no numbering or bullets."
                )
            },
            {"role": "user", "content": f"Question: {original}"}
        ]
    )
    queries = response.choices[0].message.content.strip().split("\n")
    return [q.strip() for q in queries if q.strip()]

# Test
question = "Is this tent good for rain?"
queries = rewrite_query(question)
print("Rewritten queries:")
for q in queries:
    print(f" • {q}")

# Example output:
# • Summit Pro Tent waterproof rating
# • tent performance in wet weather conditions
# • camping shelter rain protection features
```
Step 2: Retrieve with multiple queries and deduplicate¶
```python
def retrieve_with_rewriting(question: str, top_k: int = 3) -> list[dict]:
    queries = [question] + rewrite_query(question)
    seen_titles = set()
    all_docs = []
    for q in queries:
        results = search(q, top_k=top_k)
        for doc in results:
            if doc["title"] not in seen_titles and doc["similarity"] > 0.70:
                seen_titles.add(doc["title"])
                all_docs.append(doc)
    # Sort by best similarity across queries
    all_docs.sort(key=lambda d: d["similarity"], reverse=True)
    return all_docs[:top_k]
```
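A quick sanity check (the titles and similarity scores depend on the documents you ingested in Lab 022):

```python
# Try multi-query retrieval on the same question from Step 1
docs = retrieve_with_rewriting("Is this tent good for rain?")
for d in docs:
    print(f"{d['similarity']:.2f}  {d['title']}")
```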
Step 3: HyDE (Hypothetical Document Embeddings)¶
Instead of embedding the question, generate a hypothetical answer and embed that. The hypothetical answer often matches the real document better than the short question does.
```python
def hyde_search(question: str, top_k: int = 3) -> list[dict]:
    # Generate a hypothetical answer
    hyp_response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.3,
        messages=[
            {"role": "system", "content": "Write a short, factual answer to this question as if you knew the answer. Be specific."},
            {"role": "user", "content": question}
        ]
    )
    hypothetical_answer = hyp_response.choices[0].message.content
    print(f" [HyDE] Generated hypothesis: {hypothetical_answer[:80]}...")
    # Search using the hypothetical answer (more content = better embedding match)
    return search(hypothetical_answer, top_k=top_k)
```
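To see whether HyDE actually changes what comes back, compare it with a plain query search on the same question (output depends on your index):

```python
# Side by side: plain query embedding vs. HyDE embedding
question = "Is this tent good for rain?"
print("Plain:", [d["title"] for d in search(question, top_k=3)])
print("HyDE: ", [d["title"] for d in hyde_search(question, top_k=3)])
```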
Step 4: Multi-hop RAG¶
For complex questions, retrieve → generate a partial answer → retrieve again.
```python
import json

def multi_hop_answer(question: str) -> str:
    max_hops = 3
    context_docs = []
    current_question = question
    for hop in range(max_hops):
        print(f"\n [Hop {hop+1}] Searching: {current_question}")
        new_docs = retrieve_with_rewriting(current_question)
        context_docs.extend(new_docs)
        context = "\n\n".join([
            f"**{d['title']}**\n{d['content']}" for d in context_docs
        ])
        # Ask the model: can I answer, or do I need more info?
        check_response = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,
            response_format={"type": "json_object"},  # guarantees parseable JSON output
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Given context and a question, respond with JSON: "
                        '{"can_answer": true/false, "answer": "...", "follow_up_query": "..."}\n'
                        "Set can_answer=true if context fully answers the question. "
                        "Otherwise, set follow_up_query to what you still need."
                    )
                },
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ]
        )
        result = json.loads(check_response.choices[0].message.content)
        if result["can_answer"]:
            print(f" ✅ Answered after {hop+1} hop(s)")
            return result["answer"]
        current_question = result.get("follow_up_query", question)
        print(f" 🔄 Need more info: {current_question}")
    # Final attempt with all gathered context
    return answer_with_context(question, context_docs)

def answer_with_context(question: str, docs: list[dict]) -> str:
    context = "\n\n".join([f"**{d['title']}**\n{d['content']}" for d in docs])
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": "Answer based only on the provided context. Be honest if information is incomplete."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ]
    )
    return response.choices[0].message.content

# Test multi-hop
print(multi_hop_answer("What's the cheapest product suitable for a Rainier climb?"))
```
Step 5: Self-reflection (answer quality check)¶
```python
from pydantic import BaseModel

class AnswerQuality(BaseModel):
    is_grounded: bool        # Is the answer supported by the context?
    has_hallucination: bool  # Did the model add facts not in context?
    confidence: float        # 0.0 to 1.0
    issues: list[str]        # List of specific problems, if any

def check_answer_quality(question: str, context: str, answer: str) -> AnswerQuality:
    response = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "You are a RAG quality evaluator. Check if the answer is grounded in the context."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer: {answer}"
            }
        ],
        response_format=AnswerQuality,
    )
    return response.choices[0].message.parsed
```
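One way to wire the check into the pipeline, shown as a sketch (the fallback policy here is illustrative, not prescribed by the lab):

```python
# Generate an answer, grade it, and flag ungrounded results
question = "Is this tent good for rain?"
docs = retrieve_with_rewriting(question)
context = "\n\n".join(f"**{d['title']}**\n{d['content']}" for d in docs)
answer = answer_with_context(question, docs)

quality = check_answer_quality(question, context, answer)
print(f"grounded={quality.is_grounded} confidence={quality.confidence:.2f}")
if not quality.is_grounded:
    print("Issues:", quality.issues)  # e.g. fall back to multi_hop_answer(question)
```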
Agentic RAG Architecture Summary¶
```
User Question
     │
     ▼
Query Rewriting ──► 3 query variants
     │
     ▼
Parallel Search ──► deduplicated docs
     │
     ▼
Can I answer? ──No──► Follow-up search (multi-hop)
     │ Yes
     ▼
Generate Answer
     │
     ▼
Self-Reflection ──► Is it grounded?
     │
     ▼
Return Answer (or retry)
```
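Putting it all together, one possible top-level entry point is sketched below; the confidence threshold and the HyDE fallback are illustrative choices, not part of the lab's required solution:

```python
def agentic_answer(question: str) -> str:
    """End-to-end agentic RAG: rewrite → retrieve → answer → self-check → retry."""
    docs = retrieve_with_rewriting(question)
    if not docs:  # weak recall? try HyDE as a fallback (illustrative choice)
        docs = hyde_search(question)
    answer = answer_with_context(question, docs)

    context = "\n\n".join(f"**{d['title']}**\n{d['content']}" for d in docs)
    quality = check_answer_quality(question, context, answer)
    if quality.is_grounded and quality.confidence >= 0.7:
        return answer
    # Escalate to the heavier multi-hop path when the self-check fails
    return multi_hop_answer(question)

print(agentic_answer("What's the cheapest product suitable for a Rainier climb?"))
```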
Next Steps¶
- Agent memory across sessions: → Lab 027 — Agent Memory Patterns
- Evaluate RAG quality at scale: → Lab 035 — Agent Evaluation