Lab 068: Hybrid Search - Vector + BM25 + Semantic Ranker
What You'll Learn
- The differences between BM25 (keyword), vector (semantic), and hybrid search
- How Reciprocal Rank Fusion (RRF) combines BM25 and vector rankings
- How a semantic ranker (cross-encoder reranker) improves precision
- How to measure retrieval quality with recall and precision metrics
- How to compare search strategies using a benchmark dataset
- Which query types benefit most from hybrid + reranking
Prerequisite
Complete Lab 009: Retrieval-Augmented Generation first. This lab assumes familiarity with embedding-based retrieval and basic search concepts.
Introduction
RAG pipelines depend on retrieval quality: if you retrieve the wrong chunks, even the best LLM will produce wrong answers. Modern search combines multiple strategies to maximize recall and precision:
| Strategy | How It Works | Strengths | Weaknesses |
|---|---|---|---|
| BM25 | TF-IDF-style keyword scoring with term-frequency saturation and document-length normalization | Exact matches, rare terms | No semantic understanding |
| Vector | Cosine similarity on embeddings | Semantic similarity, synonyms | Can miss exact keywords and identifiers |
| Hybrid (RRF) | Combines BM25 + vector rankings via rank fusion | Captures both exact and semantic matches | Higher latency |
| Hybrid + Rerank | Hybrid + cross-encoder reranking | Highest quality results | Highest latency and cost |
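To make the BM25 row concrete, here is a minimal sketch of Okapi BM25 scoring over a toy corpus. The three documents, the whitespace tokenizer, the `bm25_scores` helper, the smoothed IDF variant, and the parameter values `k1=1.5` and `b=0.75` are all illustrative assumptions; they are not the configuration used to produce the lab's dataset.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a toy Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # a missing keyword contributes nothing
            # Smoothed IDF variant (an assumption; other variants exist).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "error code 0x8004 raised during install",
    "application crashes on startup",
    "software failure when the program launches",
]
scores = bm25_scores("application crashes", docs)
print([round(s, 3) for s in scores])
```

Only the second document shares keywords with the query, so it is the only one with a non-zero score. The third document is a paraphrase of the same problem but scores zero, which is the "no semantic understanding" weakness listed in the table.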
The Scenario
You have 20 search queries with known relevant documents (ground truth). Each query has been executed against all four strategies, with recall and precision recorded. Your job: analyze which strategy delivers the best retrieval quality and understand when each approach shines.
Prerequisites
| Requirement | Why |
|---|---|
| Python 3.10+ | Run analysis scripts |
| `pandas` | Analyze search comparison data |
📦 Supporting Files
Download these files before starting the lab
Save all files to a lab-068/ folder in your working directory.
| File | Description | Download |
|---|---|---|
| `broken_search.py` | Bug-fix exercise (3 bugs + self-tests) | 📥 Download |
| `search_comparison.csv` | Dataset | 📥 Download |
Step 1: Understanding Search Strategies
Each search strategy processes queries differently:
Query ─┬─ [BM25 Index]   → keyword matches  ─┐
       │                                      ├─ [RRF Fusion] → Hybrid Results
       └─ [Vector Index] → semantic matches ─┘         │
                                               [Semantic Ranker]
                                                       ↓
                                            Hybrid + Rerank Results
Key metrics:
- Recall: What fraction of relevant documents were retrieved? (higher = fewer misses)
- Precision: What fraction of retrieved documents are relevant? (higher = less noise)
- RRF Score: Reciprocal Rank Fusion combines rankings: 1/(k + rank) summed across strategies
- Rerank Score: Cross-encoder relevance score applied to hybrid results
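The recall and precision definitions above can be sketched for a single query as follows. The `recall_and_precision` helper and the document IDs are hypothetical, chosen only to show the arithmetic.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one query's retrieved document IDs."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Recall: share of the relevant documents that were retrieved.
    recall = hits / len(relevant_set) if relevant_set else 0.0
    # Precision: share of the retrieved documents that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    return recall, precision

# Hypothetical query: 4 documents retrieved, 3 documents are truly relevant.
recall, precision = recall_and_precision(
    retrieved=["d1", "d2", "d5", "d9"],
    relevant=["d1", "d2", "d3"],
)
print(f"recall={recall:.2f} precision={precision:.2f}")
# prints: recall=0.67 precision=0.50
```

Two of the three relevant documents are found (recall 2/3), and two of the four retrieved documents are relevant (precision 2/4).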
Why Hybrid Beats Both
BM25 excels at exact keyword matches ("error code 0x8004") while vector search excels at semantic meaning ("application crashes on startup"). Hybrid search fuses both, capturing exact matches that vector search misses AND semantic matches that BM25 misses. The semantic ranker then reorders results using a more expensive but more accurate cross-encoder model.
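The fusion step can be sketched as a few lines of Python. This is a minimal illustration of the 1/(k + rank) formula, not the implementation behind the lab's dataset: the `rrf_fuse` helper, the two example rankings, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for the documents it returns.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

bm25_ranking = ["d3", "d1", "d7"]    # hypothetical keyword results
vector_ranking = ["d1", "d4", "d3"]  # hypothetical semantic results

fused = rrf_fuse([bm25_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
```

Documents `d1` and `d3` appear in both lists, so they outrank `d4` and `d7`, which each appear in only one. Because RRF uses ranks rather than raw scores, the BM25 and cosine-similarity scales never need to be normalized against each other.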
Step 2: Load and Explore Search Results
The dataset contains 20 queries with results from all four strategies:
import pandas as pd
results = pd.read_csv("lab-068/search_comparison.csv")
print(f"Total queries: {len(results)}")
print(f"Columns: {sorted(results.columns)}")
print(f"\nFirst 5 queries:")
print(results[["query_id", "query_text", "bm25_recall", "vector_recall",
"hybrid_recall", "hybrid_rerank_recall"]].head().to_string(index=False))
Expected:
Step 3: Recall Comparison
Compare recall across all four strategies:
print("Average Recall by Strategy:")
print(f" BM25: {results['bm25_recall'].mean():.2f}")
print(f" Vector: {results['vector_recall'].mean():.2f}")
print(f" Hybrid: {results['hybrid_recall'].mean():.2f}")
print(f" Hybrid + Rerank: {results['hybrid_rerank_recall'].mean():.2f}")
perfect_recall = results[results["hybrid_rerank_recall"] == 1.0]
print(f"\nQueries with perfect hybrid+rerank recall: {len(perfect_recall)} / {len(results)}")
Expected:
Key Insight
Hybrid + Rerank achieves perfect recall (1.00): every relevant document is retrieved for every query. BM25 alone retrieves less than half the relevant documents on average. On this benchmark, that gap is the case for using hybrid search with reranking in RAG pipelines whenever the added latency and cost are acceptable.
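The four per-strategy prints can also be collapsed into one summary table. The sketch below runs on made-up rows that only mimic the lab's column names (they are not taken from search_comparison.csv); swapping `toy` for `results` should give the real figures.

```python
import pandas as pd

# Made-up rows that mimic the lab's column layout; NOT the real dataset.
toy = pd.DataFrame({
    "query_id": ["q1", "q2", "q3"],
    "bm25_recall": [0.5, 0.0, 1.0],
    "vector_recall": [0.5, 1.0, 0.5],
    "hybrid_recall": [1.0, 1.0, 1.0],
    "hybrid_rerank_recall": [1.0, 1.0, 1.0],
})

# Select every recall column by suffix so new strategies are picked up automatically.
recall_cols = [col for col in toy.columns if col.endswith("_recall")]

summary = pd.DataFrame({
    "mean_recall": toy[recall_cols].mean(),
    "perfect_queries": (toy[recall_cols] == 1.0).sum(),
})
print(summary.round(2))
```

Reporting the count of perfect-recall queries next to the mean helps distinguish a strategy that is moderately good everywhere from one that is perfect on some queries and fails on others.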
Step 4: Precision Analysis
Recall without precision means retrieving too much noise. Analyze precision:
print("Average Precision by Strategy:")
print(f" BM25: {results['bm25_precision'].mean():.2f}")
print(f" Vector: {results['vector_precision'].mean():.2f}")
print(f" Hybrid: {results['hybrid_precision'].mean():.2f}")
print(f" Hybrid + Rerank: {results['hybrid_rerank_precision'].mean():.2f}")
Expected:
Precision vs Recall Trade-off
Even hybrid + rerank achieves only 0.57 average precision, meaning 43% of retrieved documents are not relevant. High recall ensures no relevant documents are missed, but the LLM must filter noise from the context. Consider using a stricter rerank threshold to improve precision at the cost of some recall.
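The effect of a rerank threshold can be sketched with a small sweep. The scores, document IDs, and the `precision_recall_at_threshold` helper are hypothetical; real cross-encoder scores depend on the model and are not part of the lab's dataset.

```python
def precision_recall_at_threshold(scored_docs, relevant, threshold):
    """Keep documents scoring >= threshold, then measure precision and recall."""
    kept = {doc_id for doc_id, score in scored_docs if score >= threshold}
    hits = len(kept & relevant)
    precision = hits / len(kept) if kept else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical reranker output for one query: (doc_id, relevance score).
scored_docs = [("d1", 0.92), ("d2", 0.81), ("d3", 0.40), ("d4", 0.35), ("d5", 0.10)]
relevant = {"d1", "d2", "d4"}

for threshold in (0.0, 0.3, 0.5):
    p, r = precision_recall_at_threshold(scored_docs, relevant, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f}")
# prints:
# threshold=0.0 precision=0.60 recall=1.00
# threshold=0.3 precision=0.75 recall=1.00
# threshold=0.5 precision=1.00 recall=0.67
```

In this toy case a moderate threshold removes noise for free, while a stricter one starts dropping a relevant document (`d4`), which is the trade-off to tune against your own data.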
Step 5: Query-Level Analysis
Identify which queries benefit most from hybrid search:
results["hybrid_lift"] = results["hybrid_rerank_recall"] - results["bm25_recall"]
biggest_lift = results.sort_values("hybrid_lift", ascending=False).head(5)
print("Queries with biggest recall lift (hybrid+rerank vs BM25):")
print(biggest_lift[["query_id", "query_text", "bm25_recall", "hybrid_rerank_recall", "hybrid_lift"]]
.to_string(index=False))
Queries with the biggest lift are typically semantic in nature: paraphrases, synonyms, or conceptual queries where BM25's keyword matching fails but vector similarity succeeds.
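One way to check that claim is to label each query by which single strategy did better. The sketch below uses made-up rows and a hypothetical `vector_gap` column and `label_query` helper; the three labels are a rough heuristic for exploration, not a definition from the lab.

```python
import pandas as pd

# Made-up rows that mimic the lab's columns; NOT the real dataset.
toy = pd.DataFrame({
    "query_id": ["q1", "q2", "q3"],
    "query_text": ["error code 0x8004", "app keeps dying at launch", "reset password steps"],
    "bm25_recall": [1.0, 0.0, 0.5],
    "vector_recall": [0.5, 1.0, 0.5],
    "hybrid_rerank_recall": [1.0, 1.0, 1.0],
})

toy["hybrid_lift"] = toy["hybrid_rerank_recall"] - toy["bm25_recall"]
# Positive gap: vector search beat BM25 on this query; negative: the reverse.
toy["vector_gap"] = toy["vector_recall"] - toy["bm25_recall"]

def label_query(gap):
    """Heuristic label based on which single strategy retrieved more."""
    if gap > 0:
        return "semantic-leaning"
    if gap < 0:
        return "keyword-leaning"
    return "balanced"

toy["query_type"] = toy["vector_gap"].apply(label_query)
print(toy.sort_values("hybrid_lift", ascending=False)[
    ["query_id", "query_text", "hybrid_lift", "query_type"]
].to_string(index=False))
```

If the queries with the largest `hybrid_lift` are mostly labeled semantic-leaning in the real data, that supports the observation above.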
Step 6: Search Strategy Recommendation Engine
Build a recommendation based on the analysis:
summary = f"""
╔══════════════════════════════════════════════════════╗
║ Hybrid Search - Strategy Comparison Report           ║
╠══════════════════════════════════════════════════════╣
║ Queries Evaluated:            {len(results):>5}                  ║
║ BM25 Avg Recall:              {results['bm25_recall'].mean():>5.2f}                  ║
║ Vector Avg Recall:            {results['vector_recall'].mean():>5.2f}                  ║
║ Hybrid Avg Recall:            {results['hybrid_recall'].mean():>5.2f}                  ║
║ Hybrid+Rerank Avg Recall:     {results['hybrid_rerank_recall'].mean():>5.2f}                  ║
║ Hybrid+Rerank Avg Precision:  {results['hybrid_rerank_precision'].mean():>5.2f}                  ║
╚══════════════════════════════════════════════════════╝
"""
print(summary)
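The report can be turned into an actual recommendation with a simple rule: pick the cheapest strategy that meets a recall target. This is a sketch under stated assumptions. The `recommend` helper and `COST_ORDER` are hypothetical, and the cost ordering only partly follows the strategy table, which says hybrid adds latency and reranking adds the most; ranking BM25 as cheaper than vector search is an extra assumption.

```python
# Assumed cost ordering, cheapest first (BM25 before vector is an assumption).
COST_ORDER = ["bm25", "vector", "hybrid", "hybrid_rerank"]

def recommend(mean_recall, target_recall):
    """Return the cheapest strategy whose mean recall meets the target."""
    for strategy in COST_ORDER:
        if mean_recall.get(strategy, 0.0) >= target_recall:
            return strategy
    # Nothing meets the target: fall back to the strategy with the best recall.
    return max(COST_ORDER, key=lambda name: mean_recall.get(name, 0.0))

# Placeholder averages for illustration only. In the lab, build this dict from
# results[f"{name}_recall"].mean() for each strategy name.
example_means = {"bm25": 0.40, "vector": 0.70, "hybrid": 0.90, "hybrid_rerank": 1.00}

print(recommend(example_means, target_recall=0.85))  # prints: hybrid
print(recommend(example_means, target_recall=0.60))  # prints: vector
```

A rule like this makes the latency and cost trade-off explicit: reranking is only recommended when the cheaper strategies cannot reach the recall you need.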
🐛 Bug-Fix Exercise
The file lab-068/broken_search.py has 3 bugs in how it calculates search metrics:
| Test | What it checks | Hint |
|---|---|---|
| Test 1 | Average recall calculation | Should use mean(), not sum() |
| Test 2 | Precision column name | Should use hybrid_rerank_precision, not hybrid_precision |
| Test 3 | Recall comparison | Should compare hybrid_rerank_recall >= bm25_recall, not <= |
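For reference, the three hints correspond to patterns like the ones below. The function names and the toy data are hypothetical, since the actual structure of broken_search.py is not shown here; treat this as a sketch of the corrected logic rather than the file's solution.

```python
import pandas as pd

def average_recall(df, column):
    """Hint 1: average a recall column with mean(), not sum()."""
    return df[column].mean()

def rerank_precision(df):
    """Hint 2: read hybrid_rerank_precision, not hybrid_precision."""
    return df["hybrid_rerank_precision"].mean()

def rerank_at_least_bm25(df):
    """Hint 3: count queries where hybrid + rerank recall >= BM25 recall."""
    return int((df["hybrid_rerank_recall"] >= df["bm25_recall"]).sum())

# Made-up rows for a quick sanity check; NOT the real dataset.
toy = pd.DataFrame({
    "bm25_recall": [0.5, 0.0],
    "hybrid_rerank_recall": [1.0, 1.0],
    "hybrid_precision": [0.25, 0.25],
    "hybrid_rerank_precision": [0.5, 0.75],
})

print(average_recall(toy, "bm25_recall"))  # prints: 0.25
print(rerank_precision(toy))               # prints: 0.625
print(rerank_at_least_bm25(toy))           # prints: 2
```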
🧠 Knowledge Check
Q1 (Multiple Choice): What does Reciprocal Rank Fusion (RRF) do in hybrid search?
- A) Replaces the vector index with a keyword index
- B) Combines rankings from multiple search strategies into a single unified ranking
- C) Trains a new embedding model on the query
- D) Reduces the number of documents in the index
Reveal Answer
Correct: B) Combines rankings from multiple search strategies into a single unified ranking
RRF merges result rankings from BM25 and vector search using the formula 1/(k + rank) summed across strategies. Documents ranked highly by both strategies get boosted, while documents ranked highly by only one strategy still appear. This produces a unified ranking that captures both keyword and semantic relevance.
Q2 (Multiple Choice): Why does a semantic ranker (cross-encoder) improve results over hybrid search alone?
- A) It is faster than BM25
- B) It re-scores each candidate by jointly encoding the query and document together, capturing deeper relevance signals
- C) It removes all irrelevant documents perfectly
- D) It generates new documents to fill gaps
Reveal Answer
Correct: B) It re-scores each candidate by jointly encoding the query and document together, capturing deeper relevance signals
A cross-encoder takes both the query and a candidate document as input and produces a relevance score. Unlike bi-encoders (used for vector search), cross-encoders capture fine-grained interactions between query and document tokens. This is more accurate but too expensive to apply to the entire index, so it is used as a reranker on the top-N hybrid results.
Q3 (Run the Lab): What is the average recall of the hybrid + rerank strategy?
Compute results['hybrid_rerank_recall'].mean().
Reveal Answer
1.00 (perfect recall)
Hybrid + rerank achieves perfect recall across all 20 queries, meaning every relevant document is retrieved for every query. This is a significant improvement over BM25 alone (0.47) and demonstrates the value of combining keyword and semantic search with cross-encoder reranking.
Q4 (Run the Lab): What is the average recall of BM25 search alone?
Compute results['bm25_recall'].mean().
Reveal Answer
0.47 average recall
BM25 retrieves less than half of the relevant documents on average. This is because BM25 relies on keyword matching and cannot handle synonyms, paraphrases, or conceptual queries. For example, a query about "application crashes" would miss documents that discuss "software failures" or "system instability."
Q5 (Run the Lab): What is the average precision of the hybrid + rerank strategy?
Compute results['hybrid_rerank_precision'].mean().
Reveal Answer
0.57 average precision
While hybrid + rerank achieves perfect recall, its precision is 0.57, meaning 43% of retrieved documents are not relevant. This is the recall-precision trade-off: maximizing recall ensures no relevant documents are missed, but includes some noise. The LLM must be robust enough to ignore irrelevant context when generating answers.
Summary
| Topic | What You Learned |
|---|---|
| BM25 Search | Keyword-based retrieval using TF-IDF-style scoring with term-frequency saturation and length normalization |
| Vector Search | Semantic retrieval using embedding cosine similarity |
| Hybrid Search | Combining BM25 + vector via Reciprocal Rank Fusion |
| Semantic Ranker | Cross-encoder reranking for higher-quality result ordering |
| Recall & Precision | Measuring retrieval quality with complementary metrics |
| Strategy Selection | Choosing the right search strategy based on query characteristics |