
Lab 068: Hybrid Search β€” Vector + BM25 + Semantic RankerΒΆ

Level: L200 · Path: All paths · Time: ~60 min · 💰 Cost: Free — pre-computed search results (no Azure AI Search required)

What You'll LearnΒΆ

  • The differences between BM25 (keyword), vector (semantic), and hybrid search
  • How Reciprocal Rank Fusion (RRF) combines BM25 and vector rankings
  • How a semantic ranker (cross-encoder reranker) improves precision
  • How to measure retrieval quality with recall and precision metrics
  • How to compare search strategies using a benchmark dataset
  • How to identify which query types benefit most from hybrid + reranking

Prerequisite

Complete Lab 009: Retrieval-Augmented Generation first. This lab assumes familiarity with embedding-based retrieval and basic search concepts.

IntroductionΒΆ

RAG pipelines depend on retrieval quality β€” if you retrieve the wrong chunks, even the best LLM will produce wrong answers. Modern search combines multiple strategies to maximize recall and precision:

| Strategy | How It Works | Strengths | Weaknesses |
|---|---|---|---|
| BM25 | TF-IDF-style keyword matching | Exact matches, rare terms | No semantic understanding |
| Vector | Cosine similarity on embeddings | Semantic similarity, synonyms | Misses exact keywords |
| Hybrid (RRF) | Combines BM25 + vector via rank fusion | Best of both worlds | Higher latency |
| Hybrid + Rerank | Hybrid + cross-encoder reranking | Highest quality results | Highest latency and cost |

The ScenarioΒΆ

You have 20 search queries with known relevant documents (ground truth). Each query has been executed against all four strategies, with recall and precision recorded. Your job: analyze which strategy delivers the best retrieval quality and understand when each approach shines.


PrerequisitesΒΆ

| Requirement | Why |
|---|---|
| Python 3.10+ | Run analysis scripts |
| pandas | Analyze search comparison data |
pip install pandas

Quick Start with GitHub Codespaces

Open in GitHub Codespaces

All dependencies are pre-installed in the devcontainer.

πŸ“¦ Supporting FilesΒΆ

Download these files before starting the lab

Save all files to a lab-068/ folder in your working directory.

| File | Description | Download |
|---|---|---|
| broken_search.py | Bug-fix exercise (3 bugs + self-tests) | 📥 Download |
| search_comparison.csv | Dataset | 📥 Download |

Step 1: Understanding Search StrategiesΒΆ

Each search strategy processes queries differently:

Query β†’ ┬─ [BM25 Index]      β†’ keyword matches   ─┐
        β”‚                                          β”œβ”€ [RRF Fusion] β†’ Hybrid Results
        └─ [Vector Index]     β†’ semantic matches   β”€β”˜                      ↓
                                                                  [Semantic Ranker]
                                                                         ↓
                                                               Hybrid + Rerank Results

Key metrics:

  1. Recall β€” What fraction of relevant documents were retrieved? (higher = fewer misses)
  2. Precision β€” What fraction of retrieved documents are relevant? (higher = less noise)
  3. RRF Score β€” Reciprocal Rank Fusion combines rankings: 1/(k + rank) summed across strategies
  4. Rerank Score β€” Cross-encoder relevance score applied to hybrid results
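The two set-based metrics can be computed directly from retrieved and relevant document-ID sets. A minimal sketch (the function names and document IDs are illustrative, not part of the lab files):

```python
def recall(retrieved: set, relevant: set) -> float:
    """Fraction of relevant documents that were retrieved (misses hurt recall)."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved documents that are relevant (noise hurts precision)."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc1", "doc3", "doc5"}
print(recall(retrieved, relevant))     # 2 of 3 relevant docs found -> 0.666...
print(precision(retrieved, relevant))  # 2 of 4 retrieved docs relevant -> 0.5
```

The dataset's per-strategy columns (e.g. `bm25_recall`) are exactly these numbers, pre-computed per query against the ground truth.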

Why Hybrid Beats Both

BM25 excels at exact keyword matches ("error code 0x8004") while vector search excels at semantic meaning ("application crashes on startup"). Hybrid search fuses both β€” capturing exact matches that vector search misses AND semantic matches that BM25 misses. The semantic ranker then reorders results using a more expensive but more accurate cross-encoder model.
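The RRF fusion step can be sketched in a few lines of plain Python (a sketch only; `k=60` is the commonly used default constant, and the document IDs are made up):

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc accumulates 1/(k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["doc_a", "doc_b", "doc_c"]    # keyword matches
vector_top = ["doc_b", "doc_d", "doc_a"]  # semantic matches
print(rrf_fuse([bm25_top, vector_top]))   # doc_b first: ranked highly by both lists
```

Note that a document only needs to appear in one list to be fused in, so exact-keyword hits that vector search missed (and vice versa) survive into the hybrid results.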


Step 2: Load and Explore Search ResultsΒΆ

The dataset contains 20 queries with results from all four strategies:

import pandas as pd

results = pd.read_csv("lab-068/search_comparison.csv")
print(f"Total queries: {len(results)}")
print(f"Columns: {sorted(results.columns)}")
print(f"\nFirst 5 queries:")
print(results[["query_id", "query_text", "bm25_recall", "vector_recall",
               "hybrid_recall", "hybrid_rerank_recall"]].head().to_string(index=False))

Expected:

Total queries: 20

Step 3: Recall ComparisonΒΆ

Compare recall across all four strategies:

print("Average Recall by Strategy:")
print(f"  BM25:            {results['bm25_recall'].mean():.2f}")
print(f"  Vector:          {results['vector_recall'].mean():.2f}")
print(f"  Hybrid:          {results['hybrid_recall'].mean():.2f}")
print(f"  Hybrid + Rerank: {results['hybrid_rerank_recall'].mean():.2f}")

perfect_recall = results[results["hybrid_rerank_recall"] == 1.0]
print(f"\nQueries with perfect hybrid+rerank recall: {len(perfect_recall)} / {len(results)}")

Expected:

Average Recall by Strategy:
  BM25:            0.47
  Vector:          0.62
  Hybrid:          0.85
  Hybrid + Rerank: 1.00

Key Insight

Hybrid + Rerank achieves perfect recall (1.00) β€” every relevant document is retrieved for every query. BM25 alone retrieves less than half the relevant documents on average. This demonstrates why modern RAG pipelines should use hybrid search with reranking whenever possible.


Step 4: Precision AnalysisΒΆ

Recall without precision means retrieving too much noise. Analyze precision:

print("Average Precision by Strategy:")
print(f"  BM25:            {results['bm25_precision'].mean():.2f}")
print(f"  Vector:          {results['vector_precision'].mean():.2f}")
print(f"  Hybrid:          {results['hybrid_precision'].mean():.2f}")
print(f"  Hybrid + Rerank: {results['hybrid_rerank_precision'].mean():.2f}")

Expected:

Average Precision by Strategy:
  BM25:            0.40
  Vector:          0.48
  Hybrid:          0.52
  Hybrid + Rerank: 0.57

Precision vs Recall Trade-off

Even hybrid + rerank achieves only 0.57 average precision β€” meaning 43% of retrieved documents are not relevant. High recall ensures no relevant documents are missed, but the LLM must filter noise from the context. Consider using a stricter rerank threshold to improve precision at the cost of some recall.
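That threshold trade-off can be illustrated with a toy cutoff over rerank scores (entirely synthetic values; the lab dataset does not include per-document rerank scores):

```python
# Toy candidates: (doc_id, rerank_score, is_relevant) -- synthetic data
candidates = [
    ("d1", 0.95, True), ("d2", 0.90, True), ("d3", 0.70, False),
    ("d4", 0.65, True), ("d5", 0.40, False), ("d6", 0.30, False),
]
relevant_total = sum(1 for _, _, rel in candidates if rel)

for threshold in (0.2, 0.6, 0.8):
    kept = [c for c in candidates if c[1] >= threshold]
    hits = sum(1 for _, _, rel in kept if rel)
    p, r = hits / len(kept), hits / relevant_total
    print(f"threshold {threshold}: precision={p:.2f} recall={r:.2f}")
```

Raising the cutoff from 0.2 to 0.6 drops only irrelevant documents (precision rises, recall holds), but pushing it to 0.8 starts discarding relevant ones too.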


Step 5: Query-Level AnalysisΒΆ

Identify which queries benefit most from hybrid search:

results["hybrid_lift"] = results["hybrid_rerank_recall"] - results["bm25_recall"]
biggest_lift = results.sort_values("hybrid_lift", ascending=False).head(5)
print("Queries with biggest recall lift (hybrid+rerank vs BM25):")
print(biggest_lift[["query_id", "query_text", "bm25_recall", "hybrid_rerank_recall", "hybrid_lift"]]
      .to_string(index=False))

Queries with the biggest lift are typically semantic in nature β€” paraphrases, synonyms, or conceptual queries where BM25's keyword matching fails but vector similarity succeeds.


Step 6: Search Strategy Recommendation EngineΒΆ

Build a recommendation based on the analysis:

summary = f"""
╔════════════════════════════════════════════════════════╗
β•‘     Hybrid Search β€” Strategy Comparison Report         β•‘
╠════════════════════════════════════════════════════════╣
β•‘ Queries Evaluated:           {len(results):>5}                     β•‘
β•‘ BM25 Avg Recall:             {results['bm25_recall'].mean():>5.2f}                     β•‘
β•‘ Vector Avg Recall:           {results['vector_recall'].mean():>5.2f}                     β•‘
β•‘ Hybrid Avg Recall:           {results['hybrid_recall'].mean():>5.2f}                     β•‘
β•‘ Hybrid+Rerank Avg Recall:    {results['hybrid_rerank_recall'].mean():>5.2f}                     β•‘
β•‘ Hybrid+Rerank Avg Precision: {results['hybrid_rerank_precision'].mean():>5.2f}                     β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•
"""
print(summary)

πŸ› Bug-Fix ExerciseΒΆ

The file lab-068/broken_search.py has 3 bugs in how it calculates search metrics:

python lab-068/broken_search.py
| Test | What it checks | Hint |
|---|---|---|
| Test 1 | Average recall calculation | Should use `mean()`, not `sum()` |
| Test 2 | Precision column name | Should use `hybrid_rerank_precision`, not `hybrid_precision` |
| Test 3 | Recall comparison | Should compare `hybrid_rerank_recall >= bm25_recall`, not `<=` |
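Once all three bugs are fixed, the corrected logic should look roughly like this (a sketch using a tiny inline DataFrame standing in for search_comparison.csv; broken_search.py's actual structure may differ):

```python
import pandas as pd

# Tiny synthetic stand-in for lab-068/search_comparison.csv
results = pd.DataFrame({
    "bm25_recall": [0.5, 0.4],
    "hybrid_rerank_recall": [1.0, 1.0],
    "hybrid_rerank_precision": [0.6, 0.5],
})

# Bug 1 fix: average recall uses mean(), not sum()
avg_recall = results["hybrid_rerank_recall"].mean()

# Bug 2 fix: precision reads the hybrid_rerank_precision column
avg_precision = results["hybrid_rerank_precision"].mean()

# Bug 3 fix: hybrid+rerank recall should be >= BM25 recall per query
improved = (results["hybrid_rerank_recall"] >= results["bm25_recall"]).all()

print(avg_recall, avg_precision, improved)
```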

🧠 Knowledge Check¢

Q1 (Multiple Choice): What does Reciprocal Rank Fusion (RRF) do in hybrid search?
  • A) Replaces the vector index with a keyword index
  • B) Combines rankings from multiple search strategies into a single unified ranking
  • C) Trains a new embedding model on the query
  • D) Reduces the number of documents in the index
βœ… Reveal Answer

Correct: B) Combines rankings from multiple search strategies into a single unified ranking

RRF merges result rankings from BM25 and vector search using the formula 1/(k + rank) summed across strategies. Documents ranked highly by both strategies get boosted, while documents ranked highly by only one strategy still appear. This produces a unified ranking that captures both keyword and semantic relevance.

Q2 (Multiple Choice): Why does a semantic ranker (cross-encoder) improve results over hybrid search alone?
  • A) It is faster than BM25
  • B) It re-scores each candidate by jointly encoding the query and document together, capturing deeper relevance signals
  • C) It removes all irrelevant documents perfectly
  • D) It generates new documents to fill gaps
βœ… Reveal Answer

Correct: B) It re-scores each candidate by jointly encoding the query and document together, capturing deeper relevance signals

A cross-encoder takes both the query and a candidate document as input and produces a relevance score. Unlike bi-encoders (used for vector search), cross-encoders capture fine-grained interactions between query and document tokens. This is more accurate but too expensive to apply to the entire index β€” so it is used as a reranker on the top-N hybrid results.
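The rerank-the-top-N pattern can be sketched with a stand-in scorer (a real pipeline would call an actual cross-encoder model here; the token-overlap function below is only a placeholder, and all names and documents are made up):

```python
def toy_relevance(query: str, doc: str) -> float:
    """Placeholder for a cross-encoder score: fraction of query tokens in the doc."""
    q_tokens, d_tokens = set(query.lower().split()), set(doc.lower().split())
    return len(q_tokens & d_tokens) / len(q_tokens)

def rerank(query: str, candidates: list[str], top_n: int = 2) -> list[str]:
    """Re-score the hybrid candidates jointly with the query, keep the best top_n."""
    return sorted(candidates, key=lambda d: toy_relevance(query, d), reverse=True)[:top_n]

query = "application crashes on startup"
candidates = [
    "billing FAQ and refunds",
    "troubleshooting crashes when the application starts",
    "the application crashes on startup after the latest update",
]
print(rerank(query, candidates))  # the two startup-crash docs, most relevant first
```

The key structural point survives even in this toy: the scorer sees query and document together, and it only ever runs over the short candidate list, never the whole index.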

Q3 (Run the Lab): What is the average recall of the hybrid + rerank strategy?

Compute results['hybrid_rerank_recall'].mean().

βœ… Reveal Answer

1.00 (perfect recall)

Hybrid + rerank achieves perfect recall across all 20 queries, meaning every relevant document is retrieved for every query. This is a significant improvement over BM25 alone (0.47) and demonstrates the value of combining keyword and semantic search with cross-encoder reranking.

Q4 (Run the Lab): What is the average recall of BM25 search alone?

Compute results['bm25_recall'].mean().

βœ… Reveal Answer

0.47 average recall

BM25 retrieves less than half of the relevant documents on average. This is because BM25 relies on keyword matching and cannot handle synonyms, paraphrases, or conceptual queries. For example, a query about "application crashes" would miss documents that discuss "software failures" or "system instability."

Q5 (Run the Lab): What is the average precision of the hybrid + rerank strategy?

Compute results['hybrid_rerank_precision'].mean().

βœ… Reveal Answer

0.57 average precision

While hybrid + rerank achieves perfect recall, its precision is 0.57 β€” meaning 43% of retrieved documents are not relevant. This is the recall-precision trade-off: maximizing recall ensures no relevant documents are missed, but includes some noise. The LLM must be robust enough to ignore irrelevant context when generating answers.


SummaryΒΆ

| Topic | What You Learned |
|---|---|
| BM25 Search | Keyword-based retrieval using TF-IDF scoring |
| Vector Search | Semantic retrieval using embedding cosine similarity |
| Hybrid Search | Combining BM25 + vector via Reciprocal Rank Fusion |
| Semantic Ranker | Cross-encoder reranking for higher-quality result ordering |
| Recall & Precision | Measuring retrieval quality with complementary metrics |
| Strategy Selection | Choosing the right search strategy based on query characteristics |

Next StepsΒΆ

  • Lab 009 β€” RAG Basics (foundational retrieval patterns)
  • Lab 067 β€” GraphRAG (cross-document synthesis with knowledge graphs)
  • Lab 065 β€” Purview DSPM for AI (governance for search pipelines)