# Lab 067: GraphRAG – Knowledge Graphs for Cross-Document Retrieval
## What You'll Learn
- What GraphRAG is and how it differs from traditional vector-only RAG
- Build a knowledge graph from entity and relationship extraction
- Detect communities using graph clustering algorithms
- Execute global queries that synthesize across all documents
- Execute local queries that follow entity-centric subgraphs
- Evaluate retrieval quality using importance scoring and community coverage
**Prerequisite:** Complete Lab 009: Retrieval-Augmented Generation first. This lab assumes familiarity with basic RAG concepts including chunking, embedding, and vector search.
## Introduction
Traditional RAG retrieves individual chunks by semantic similarity. This works well for local queries ("What is the return policy?") but fails for global queries that require synthesizing information across many documents ("What are the major themes in Q3 earnings across all portfolio companies?").
GraphRAG solves this by building a knowledge graph from extracted entities and relationships, then clustering the graph into communities that represent thematic groups:
| Approach | Retrieval Method | Best For | Weakness |
|---|---|---|---|
| Vector RAG | Cosine similarity on embeddings | Local, specific queries | Cannot synthesize across documents |
| GraphRAG Local | Entity-centric subgraph traversal | Queries about specific entities | Misses global themes |
| GraphRAG Global | Community summaries + map-reduce | Broad, cross-document queries | Higher latency and cost |
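Because each mode has different strengths, a production system usually decides per query which path to take. A minimal heuristic router sketch (the cue-word list and function name are illustrative assumptions, not part of any GraphRAG API; real systems often use an LLM classifier instead):

```python
# Hypothetical heuristic: route broad "synthesis" wording to global search,
# everything else to local entity-centric search.
GLOBAL_CUES = {"themes", "trends", "overall", "across", "summary", "major"}

def route_query(query: str) -> str:
    """Return 'global' for broad synthesis queries, else 'local'."""
    words = set(query.lower().replace("?", "").split())
    return "global" if words & GLOBAL_CUES else "local"

print(route_query("What are the major themes in Q3 earnings "
                  "across all portfolio companies?"))  # global
print(route_query("What is the return policy?"))       # local
```

A keyword router is crude, but it makes the trade-off in the table concrete: only queries that genuinely need cross-document synthesis should pay the latency and token cost of the global path.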
## The Scenario
You are building a market intelligence system for an outdoor gear company. Your corpus contains product reviews, supplier reports, and competitor analysis documents. You will extract entities, build a knowledge graph, detect communities, and compare local vs global query performance.
The knowledge graph contains 15 entities organized into 8 communities.
## Prerequisites

| Requirement | Why |
|---|---|
| Python 3.10+ | Run analysis scripts |
| `pandas` | Analyze knowledge graph data |
## 📦 Supporting Files

Download these files before starting the lab. Save all files to a `lab-067/` folder in your working directory.

| File | Description | Download |
|---|---|---|
| `broken_graphrag.py` | Bug-fix exercise (3 bugs + self-tests) | 📥 Download |
| `knowledge_graph.csv` | Dataset | 📥 Download |
## Step 1: Understanding GraphRAG Architecture
GraphRAG extends the RAG pipeline with graph construction and community detection:
```
Documents → [Entity Extraction] → [Relationship Extraction] → Knowledge Graph
                                                                     ↓
Query → [Community Detection] → [Community Summaries] → [Map-Reduce Answer]
                                                                     ↓
                               [Local Subgraph] → [Entity-Centric Answer]
```
Key concepts:
- **Entities** – People, organizations, products, and concepts extracted from text
- **Relationships** – Connections between entities (e.g., "CompanyA supplies CompanyB")
- **Communities** – Clusters of densely connected entities discovered by graph algorithms
- **Community Summaries** – LLM-generated descriptions of each community's theme
- **Importance Score** – Centrality metric (0–1) indicating an entity's significance
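The concepts above map onto simple data structures. A minimal sketch, modeled on this lab's CSV schema (the "TrailPeak" entity and the `manufactures` relation are invented for illustration; only OutdoorGear Inc's 0.98 score comes from the lab dataset):

```python
# Entities carry a type, a community assignment, and a centrality-based
# importance score; relationships are directed, labeled edges.
entities = [
    {"entity_id": "E1", "entity_name": "OutdoorGear Inc", "entity_type": "organization",
     "importance_score": 0.98, "community_id": 0},
    {"entity_id": "E2", "entity_name": "TrailPeak", "entity_type": "product",
     "importance_score": 0.55, "community_id": 0},  # hypothetical entity
]
relationships = [
    # (source, relation, target): "OutdoorGear Inc manufactures TrailPeak"
    ("E1", "manufactures", "E2"),
]

# An adjacency list makes the subgraph traversal used by local queries cheap.
adjacency = {}
for src, rel, dst in relationships:
    adjacency.setdefault(src, []).append((rel, dst))

print(adjacency["E1"])  # [('manufactures', 'E2')]
```

Local search walks this adjacency structure outward from a matched entity; global search ignores individual edges and works from community-level summaries instead.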
**Why Communities Matter**

Communities group related entities that frequently co-occur. A global query like "What are the market trends?" can be answered by synthesizing community summaries rather than scanning every document chunk, dramatically reducing token usage while improving coverage.
## Step 2: Load and Explore the Knowledge Graph
The dataset contains 15 entities with relationships and community assignments:
```python
import pandas as pd

kg = pd.read_csv("lab-067/knowledge_graph.csv")

print(f"Total entities: {len(kg)}")
print(f"Entity types: {sorted(kg['entity_type'].unique())}")
print(f"Communities: {sorted(kg['community_id'].unique())}")
print(f"Number of communities: {kg['community_id'].nunique()}")

print("\nEntities per community:")
print(kg.groupby("community_id")["entity_id"].count().sort_values(ascending=False))
```
Expected:
## Step 3: Entity Importance Analysis
Analyze entity importance scores to identify key nodes in the graph:
```python
print("Top entities by importance score:")
top_entities = kg.sort_values("importance_score", ascending=False).head(5)
print(top_entities[["entity_id", "entity_name", "entity_type",
                    "importance_score", "community_id"]].to_string(index=False))

print(f"\nHighest importance entity: {kg.loc[kg['importance_score'].idxmax(), 'entity_name']} "
      f"({kg['importance_score'].max():.2f})")
print(f"Average importance score: {kg['importance_score'].mean():.2f}")
```
Expected:
**Centrality and Importance**

The importance score reflects how central an entity is in the knowledge graph. Entities with high scores (like OutdoorGear Inc at 0.98) connect many other entities and communities. Queries that involve these hub entities will traverse more of the graph, providing richer context.
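The lab's CSV ships precomputed scores, but a score like this can be derived from graph structure alone. A minimal degree-centrality sketch over a toy edge list (the supplier/competitor/product names here are invented; production GraphRAG implementations may use richer measures such as weighted degree or PageRank):

```python
# Degree centrality: the fraction of other nodes each node connects to.
edges = [
    ("OutdoorGear Inc", "SupplierA"),
    ("OutdoorGear Inc", "CompetitorX"),
    ("OutdoorGear Inc", "TrailBoots"),
    ("SupplierA", "TrailBoots"),
]

nodes = {n for edge in edges for n in edge}
degree = {n: 0 for n in nodes}
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# Normalize by (n - 1) so scores fall in the 0-1 range, as in the lab's CSV.
centrality = {n: d / (len(nodes) - 1) for n, d in degree.items()}
print(max(centrality, key=centrality.get))  # the hub: 'OutdoorGear Inc'
```

Here the hub touches all three other nodes, so its centrality is 1.0; leaf nodes score much lower, mirroring the spread you see in `importance_score`.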
## Step 4: Community Structure Analysis
Examine the community structure and themes:
```python
print(f"Total communities: {kg['community_id'].nunique()}")

print("\nCommunity sizes:")
community_sizes = kg.groupby("community_id").agg(
    entity_count=("entity_id", "count"),
    avg_importance=("importance_score", "mean"),
    entities=("entity_name", lambda x: ", ".join(x)),
).sort_values("entity_count", ascending=False)
print(community_sizes.to_string())
```
Expected:
**Community Detection**

Communities are detected using the Leiden algorithm, which identifies densely connected subgraphs. Each community represents a thematic cluster; for example, one community might contain supplier-related entities while another groups competitor entities. The number and size of communities depend on the graph's connectivity structure.
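Running Leiden itself requires a graph library (e.g. `igraph` with `leidenalg`), but the core idea, grouping nodes that are connected to each other, can be sketched with connected components as a crude stand-in (the toy edges below are invented for illustration; Leiden additionally splits large components by edge density):

```python
# Connected components as a simplified stand-in for community detection:
# every node reachable from another ends up in the same community.
edges = [("SupplierA", "SupplierB"), ("SupplierB", "FabricCo"),
         ("CompetitorX", "CompetitorY")]

adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

communities = []
unvisited = set(adjacency)
while unvisited:
    start = unvisited.pop()
    component, frontier = {start}, [start]
    while frontier:
        node = frontier.pop()
        for neighbor in adjacency[node] - component:
            component.add(neighbor)
            frontier.append(neighbor)
    unvisited -= component
    communities.append(component)

print(len(communities))  # 2 thematic clusters
```

The supplier-related nodes land in one cluster and the competitor nodes in another, which is exactly the kind of thematic grouping the community summaries are built from.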
## Step 5: Local vs Global Query Simulation
Simulate how local and global queries traverse the graph differently:
```python
# Local query: find entities related to a specific entity
target_entity = kg.loc[kg["importance_score"].idxmax(), "entity_name"]
target_community = kg.loc[kg["importance_score"].idxmax(), "community_id"]
local_results = kg[kg["community_id"] == target_community]

print(f"Local query for '{target_entity}':")
print(f"  Community {target_community} has {len(local_results)} entities")
print(f"  Entities: {', '.join(local_results['entity_name'].tolist())}")

# Global query: summarize across all communities
print("\nGlobal query – all communities:")
for cid in sorted(kg["community_id"].unique()):
    community = kg[kg["community_id"] == cid]
    print(f"  Community {cid}: {len(community)} entities – "
          f"{', '.join(community['entity_name'].tolist())}")
```
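In a full GraphRAG system, the global path is implemented as map-reduce over community summaries: the map step produces a partial answer per community, and the reduce step merges the partials into one answer. A sketch with placeholder functions standing in for LLM calls (the summaries, helper names, and output format are invented for illustration):

```python
# Placeholder community summaries (in real GraphRAG these are LLM-generated).
community_summaries = {
    0: "Suppliers face rising fabric costs.",
    1: "Competitors are discounting aggressively.",
}

def map_step(query: str, summary: str) -> str:
    # Stand-in for an LLM call that answers the query from one summary.
    return f"From this community: {summary}"

def reduce_step(partials: list[str]) -> str:
    # Stand-in for an LLM call that merges partial answers into one.
    return " ".join(partials)

query = "What are the market trends?"
partials = [map_step(query, s) for s in community_summaries.values()]
answer = reduce_step(partials)
print(answer)
```

Note that the query touches every community summary but no raw document chunks, which is why global search scales with the number of communities rather than the corpus size.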
## Step 6: Graph Quality Metrics
Evaluate the quality of the knowledge graph:
```python
total_entities = len(kg)
total_communities = kg["community_id"].nunique()
avg_community_size = total_entities / total_communities
max_importance = kg["importance_score"].max()
min_importance = kg["importance_score"].min()

report = f"""
╔════════════════════════════════════════════════╗
║   GraphRAG – Knowledge Graph Quality Report    ║
╠════════════════════════════════════════════════╣
║ Total Entities:       {total_entities:>5}                    ║
║ Total Communities:    {total_communities:>5}                    ║
║ Avg Community Size:   {avg_community_size:>5.1f}                    ║
║ Max Importance Score: {max_importance:>5.2f}                    ║
║ Min Importance Score: {min_importance:>5.2f}                    ║
║ Entity Types:         {kg['entity_type'].nunique():>5}                    ║
╚════════════════════════════════════════════════╝
"""
print(report)
```
## 🐛 Bug-Fix Exercise

The file `lab-067/broken_graphrag.py` has 3 bugs in how it processes the knowledge graph:

| Test | What it checks | Hint |
|---|---|---|
| Test 1 | Entity count | Should count all rows in the DataFrame, not unique community IDs |
| Test 2 | Community count | Should use `nunique()` on `community_id`, not `count()` |
| Test 3 | Highest importance entity | Should use `idxmax()` on `importance_score`, not `idxmin()` |
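For reference, the corrected patterns the three tests expect look like this on a miniature DataFrame (the sample rows are invented; your actual fixes go in `broken_graphrag.py`, not here):

```python
import pandas as pd

kg = pd.DataFrame({
    "entity_id": ["E1", "E2", "E3"],
    "entity_name": ["OutdoorGear Inc", "SupplierA", "CompetitorX"],
    "importance_score": [0.98, 0.40, 0.55],
    "community_id": [0, 1, 1],
})

entity_count = len(kg)                         # Test 1: count all rows
community_count = kg["community_id"].nunique() # Test 2: distinct community IDs
top_entity = kg.loc[kg["importance_score"].idxmax(),  # Test 3: idxmax, not idxmin
                    "entity_name"]

print(entity_count, community_count, top_entity)  # 3 2 OutdoorGear Inc
```

Each line matches one test's hint: `len()` over rows, `nunique()` over community IDs, and `idxmax()` to locate the highest-importance entity.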
## 🧠 Knowledge Check
Q1 (Multiple Choice): What problem does GraphRAG solve that traditional vector RAG cannot?
- A) Faster embedding generation
- B) Cross-document synthesis for global queries by using community-level summaries
- C) Lower storage costs for embeddings
- D) Better tokenization of source documents
✅ Reveal Answer
Correct: B) Cross-document synthesis for global queries by using community-level summaries
Traditional vector RAG retrieves individual chunks by similarity, which works for local queries but fails when the answer requires synthesizing information scattered across many documents. GraphRAG builds a knowledge graph, detects communities of related entities, and uses community summaries to answer global queries via map-reduce.
Q2 (Multiple Choice): What is a 'community' in the context of GraphRAG?
- A) A group of users who share the same agent
- B) A cluster of densely connected entities in the knowledge graph that represents a thematic group
- C) A type of vector index partition
- D) A chat thread with multiple participants
✅ Reveal Answer
Correct: B) A cluster of densely connected entities in the knowledge graph that represents a thematic group
Communities are discovered by graph clustering algorithms like Leiden. Entities within a community are more densely connected to each other than to entities outside the community. Each community gets an LLM-generated summary that captures its theme, enabling efficient global query answering.
Q3 (Run the Lab): How many total entities are in the knowledge graph?
Load the knowledge graph CSV and count the total rows.
✅ Reveal Answer
15 entities
The knowledge graph contains 15 entities spanning multiple types (organizations, products, people, concepts). These entities are connected through relationships extracted from the source documents.
Q4 (Run the Lab): How many communities were detected in the knowledge graph?
Use nunique() on the community_id column.
✅ Reveal Answer
8 communities
The 15 entities are organized into 8 communities by the Leiden clustering algorithm. Each community represents a thematic group of related entities, for example supplier relationships, the competitor landscape, or product categories.
Q5 (Run the Lab): Which entity has the highest importance score, and what is the score?
Sort by importance_score descending and check the top entity.
✅ Reveal Answer
OutdoorGear Inc with an importance score of 0.98
OutdoorGear Inc is the most central entity in the knowledge graph, connecting to entities across multiple communities. Its high importance score (0.98) reflects its role as a hub: queries involving this entity will traverse more of the graph and provide richer cross-document context.
## Summary
| Topic | What You Learned |
|---|---|
| GraphRAG | Extends RAG with knowledge graphs for cross-document synthesis |
| Entity Extraction | Identify people, organizations, and concepts from documents |
| Community Detection | Cluster related entities to discover thematic groups |
| Local Queries | Traverse entity-centric subgraphs for specific answers |
| Global Queries | Synthesize community summaries via map-reduce for broad answers |
| Importance Scoring | Rank entities by graph centrality to identify key nodes |