Lab 067: GraphRAG - Knowledge Graphs for Cross-Document Retrieval

Level: L300 | Path: All paths | Time: ~90 min | 💰 Cost: Free (mock data; no Azure OpenAI or graph DB required)

What You'll Learn

  • What GraphRAG is and how it differs from traditional vector-only RAG
  • Build a knowledge graph from entity and relationship extraction
  • Detect communities using graph clustering algorithms
  • Execute global queries that synthesize across all documents
  • Execute local queries that follow entity-centric subgraphs
  • Evaluate retrieval quality using importance scoring and community coverage

Prerequisite

Complete Lab 009: Retrieval-Augmented Generation first. This lab assumes familiarity with basic RAG concepts including chunking, embedding, and vector search.

Introduction

Traditional RAG retrieves individual chunks by semantic similarity. This works well for local queries ("What is the return policy?") but fails for global queries that require synthesizing information across many documents ("What are the major themes in Q3 earnings across all portfolio companies?").

GraphRAG solves this by building a knowledge graph from extracted entities and relationships, then clustering the graph into communities that represent thematic groups:

Approach | Retrieval Method | Best For | Weakness
Vector RAG | Cosine similarity on embeddings | Local, specific queries | Cannot synthesize across documents
GraphRAG Local | Entity-centric subgraph traversal | Queries about specific entities | Misses global themes
GraphRAG Global | Community summaries + map-reduce | Broad, cross-document queries | Higher latency and cost
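
To make the first row of the table concrete, here is a minimal sketch of vector-only retrieval: rank chunks by cosine similarity and keep the top k. The three-dimensional vectors and chunk texts are made-up stand-ins for real embeddings, not data from this lab.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k=2):
    """Return the k chunk texts most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["text"] for c in ranked[:k]]

# Hypothetical toy "embeddings"; a real system would get these from an embedding model.
chunks = [
    {"text": "Returns are accepted within 30 days.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Supplier lead times rose in Q3.", "vec": [0.1, 0.9, 0.2]},
    {"text": "Competitor launched a new tent line.", "vec": [0.0, 0.3, 0.9]},
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "What is the return policy?"
print(top_k(query_vec, chunks, k=1))
```

Each chunk is scored on its own, so nothing in this procedure combines evidence from several chunks. That is the gap the two GraphRAG rows aim to close.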

The Scenario

You are building a market intelligence system for an outdoor gear company. Your corpus contains product reviews, supplier reports, and competitor analysis documents. You will extract entities, build a knowledge graph, detect communities, and compare local vs global query performance.

The knowledge graph contains 15 entities organized into 8 communities.


Prerequisites

Requirement | Why
Python 3.10+ | Run analysis scripts
pandas | Analyze knowledge graph data

pip install pandas

Quick Start with GitHub Codespaces

All dependencies are pre-installed in the devcontainer.

📦 Supporting Files

Download these files before starting the lab

Save all files to a lab-067/ folder in your working directory.

File | Description | Download
broken_graphrag.py | Bug-fix exercise (3 bugs + self-tests) | 📥 Download
knowledge_graph.csv | Dataset | 📥 Download

Step 1: Understanding GraphRAG Architecture

GraphRAG extends the RAG pipeline with graph construction and community detection:

Documents → [Entity Extraction] → [Relationship Extraction] → Knowledge Graph
                                                                     ↓
Query → [Community Detection] → [Community Summaries] → [Map-Reduce Answer]
                                        ↓
                              [Local Subgraph] → [Entity-Centric Answer]

Key concepts:

  1. Entities: People, organizations, products, and concepts extracted from text
  2. Relationships: Connections between entities (e.g., "CompanyA supplies CompanyB")
  3. Communities: Clusters of densely connected entities discovered by graph algorithms
  4. Community Summaries: LLM-generated descriptions of each community's theme
  5. Importance Score: Centrality metric (0–1) indicating an entity's significance
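
The first two concepts can be sketched in a few lines: given extracted (subject, relation, object) triples, build an adjacency map that serves as the knowledge graph. The triples below are hypothetical examples in the spirit of the scenario, and `build_graph` is an illustrative helper, not part of the lab files.

```python
from collections import defaultdict

# Hypothetical extracted triples (subject, relation, object); names are illustrative only.
triples = [
    ("OutdoorGear Inc", "sells", "TrailTent"),
    ("SupplierA", "supplies", "OutdoorGear Inc"),
    ("CompetitorB", "competes_with", "OutdoorGear Inc"),
]

def build_graph(triples):
    """Return an undirected adjacency map: entity -> set of neighboring entities."""
    adjacency = defaultdict(set)
    for subject, _relation, obj in triples:
        adjacency[subject].add(obj)
        adjacency[obj].add(subject)
    return dict(adjacency)

graph = build_graph(triples)
for entity in sorted(graph):
    print(entity, "->", sorted(graph[entity]))
```

The sketch drops relation labels and edge direction for brevity. A fuller graph would likely keep both, since "supplies" and "competes_with" carry different meaning at query time.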

Why Communities Matter

Communities group related entities that frequently co-occur. A global query like "What are the market trends?" can be answered by synthesizing community summaries rather than scanning every document chunk, which reduces token usage while improving coverage.
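
As a rough illustration of the map-reduce idea, the sketch below scores each community summary against the question (map) and then joins the most relevant ones (reduce). Both steps are simple word-overlap stand-ins for what would be LLM calls in a real pipeline, and the summaries are invented for this example.

```python
import re

def tokenize(text):
    """Lowercase a string and split it into alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def map_step(question, summary):
    """Stand-in for a per-community LLM call: score a summary by word overlap."""
    return {"summary": summary, "score": len(tokenize(question) & tokenize(summary))}

def reduce_step(partials, top_n=2):
    """Stand-in for the final LLM synthesis: join the best-scoring summaries."""
    useful = sorted((p for p in partials if p["score"] > 0),
                    key=lambda p: p["score"], reverse=True)
    return " ".join(p["summary"] for p in useful[:top_n])

# Hypothetical community summaries keyed by community id.
community_summaries = {
    0: "Supplier costs are rising across the market.",
    1: "Customers praise tent durability.",
    2: "Competitors are following market trends toward lighter gear.",
}
question = "What are the market trends?"
partials = [map_step(question, s) for s in community_summaries.values()]
print(reduce_step(partials))
```

The map step touches one short summary per community rather than every chunk, so the amount of text processed grows with the number of communities instead of the size of the corpus.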


Step 2: Load and Explore the Knowledge Graph

The dataset contains 15 entities with relationships and community assignments:

import pandas as pd

kg = pd.read_csv("lab-067/knowledge_graph.csv")
print(f"Total entities: {len(kg)}")
print(f"Entity types: {sorted(kg['entity_type'].unique())}")
print(f"Communities: {sorted(kg['community_id'].unique())}")
print(f"Number of communities: {kg['community_id'].nunique()}")
print(f"\nEntities per community:")
print(kg.groupby("community_id")["entity_id"].count().sort_values(ascending=False))

Expected:

Total entities: 15
Communities: [0, 1, 2, 3, 4, 5, 6, 7]
Number of communities: 8
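
Before running the later steps, it can help to confirm that the CSV has the columns this lab reads. The sketch below does that with the standard library only. The column names are the ones used in the lab's code, while the inline sample row (including the `E01` id and `organization` type) is a made-up stand-in for the real file.

```python
import csv
import io

# Columns referenced by the code in this lab.
REQUIRED_COLUMNS = {"entity_id", "entity_name", "entity_type",
                    "importance_score", "community_id"}

def missing_columns(csv_file):
    """Return the required columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    return REQUIRED_COLUMNS - set(reader.fieldnames or [])

# Inline stand-in for lab-067/knowledge_graph.csv; the row values are hypothetical.
sample = io.StringIO(
    "entity_id,entity_name,entity_type,importance_score,community_id\n"
    "E01,OutdoorGear Inc,organization,0.98,0\n"
)
print(missing_columns(sample))  # an empty set means every column is present
```

To check the real dataset, pass `open("lab-067/knowledge_graph.csv", newline="")` instead of the inline sample.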

Step 3: Entity Importance Analysis

Analyze entity importance scores to identify key nodes in the graph:

print("Top entities by importance score:")
top_entities = kg.sort_values("importance_score", ascending=False).head(5)
print(top_entities[["entity_id", "entity_name", "entity_type", "importance_score", "community_id"]]
      .to_string(index=False))

print(f"\nHighest importance entity: {kg.loc[kg['importance_score'].idxmax(), 'entity_name']} "
      f"({kg['importance_score'].max():.2f})")
print(f"Average importance score: {kg['importance_score'].mean():.2f}")

Expected:

Highest importance entity: OutdoorGear Inc (0.98)

Centrality and Importance

The importance score reflects how central an entity is in the knowledge graph. Entities with high scores (like OutdoorGear Inc at 0.98) connect many other entities and communities. Queries that involve these hub entities will traverse more of the graph, providing richer context.
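
The lab does not say which centrality measure produced `importance_score`. As one simple possibility, the sketch below computes normalized degree centrality (degree divided by n - 1), which also falls between 0 and 1. The edges are hypothetical and the function is an illustration, not the method used to build the dataset.

```python
def degree_centrality(edges):
    """Map each node to degree / (n - 1), so scores fall between 0 and 1."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

# Hypothetical undirected edges between entities.
edges = [
    ("OutdoorGear Inc", "SupplierA"),
    ("OutdoorGear Inc", "CompetitorB"),
    ("OutdoorGear Inc", "TrailTent"),
    ("SupplierA", "TrailTent"),
]
scores = degree_centrality(edges)
for node, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{node}: {score:.2f}")
```

In this toy graph the hub entity touches every other node and scores 1.00, while the entity with a single connection scores 0.33.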


Step 4: Community Structure Analysis

Examine the community structure and themes:

print(f"Total communities: {kg['community_id'].nunique()}")
print(f"\nCommunity sizes:")
community_sizes = kg.groupby("community_id").agg(
    entity_count=("entity_id", "count"),
    avg_importance=("importance_score", "mean"),
    entities=("entity_name", lambda x: ", ".join(x))
).sort_values("entity_count", ascending=False)
print(community_sizes.to_string())

Expected:

Total communities: 8

Community Detection

Communities are detected using the Leiden algorithm, which identifies densely connected subgraphs. Each community represents a thematic cluster; for example, one community might contain supplier-related entities while another groups competitor entities. The number and size of communities depend on the graph's connectivity structure.
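
Leiden itself needs a third-party library, so the sketch below swaps in a much simpler technique, connected components found by breadth-first search, purely to show how a `community_id` can be derived from graph structure. It is not Leiden: it never splits a connected graph, whereas Leiden optimizes for dense clusters inside one. The node and edge names are hypothetical.

```python
from collections import deque

def connected_components(nodes, edges):
    """Assign a community id to every node; nodes joined by any path share an id.

    Every edge endpoint must appear in `nodes`.
    """
    neighbors = {node: set() for node in nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    community_of = {}
    next_id = 0
    for start in nodes:
        if start in community_of:
            continue
        # Breadth-first search labels everything reachable from `start`.
        community_of[start] = next_id
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for other in neighbors[current]:
                if other not in community_of:
                    community_of[other] = next_id
                    queue.append(other)
        next_id += 1
    return community_of

nodes = ["OutdoorGear Inc", "SupplierA", "CompetitorB", "CompetitorC", "TrailTent"]
edges = [("OutdoorGear Inc", "SupplierA"), ("OutdoorGear Inc", "TrailTent"),
         ("CompetitorB", "CompetitorC")]
communities = connected_components(nodes, edges)
print(communities)
```

Here the two competitor entities end up in their own community because no edge links them to the rest of the graph.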


Step 5: Local vs Global Query Simulation

Simulate how local and global queries traverse the graph differently:

# Local query: approximate one entity's neighborhood by its community members
top_idx = kg["importance_score"].idxmax()
target_entity = kg.loc[top_idx, "entity_name"]
target_community = kg.loc[top_idx, "community_id"]
local_results = kg[kg["community_id"] == target_community]
print(f"Local query for '{target_entity}':")
print(f"  Community {target_community} has {len(local_results)} entities")
print(f"  Entities: {', '.join(local_results['entity_name'].tolist())}")

# Global query: summarize across all communities
print("\nGlobal query - all communities:")
for cid in sorted(kg["community_id"].unique()):
    community = kg[kg["community_id"] == cid]
    print(f"  Community {cid}: {len(community)} entities - "
          f"{', '.join(community['entity_name'].tolist())}")
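
The simulation above uses community membership as a proxy for a local query. When explicit edges are available, an entity-centric subgraph can instead be collected by walking outward a fixed number of hops. The sketch below shows that traversal with hypothetical edges. `k_hop_neighborhood` is an illustrative helper, not a function from the lab files.

```python
from collections import deque

def k_hop_neighborhood(edges, start, k=1):
    """Return every entity reachable from `start` in at most k hops, including start."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    depth = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if depth[current] == k:
            continue  # do not expand beyond the hop limit
        for other in neighbors.get(current, ()):
            if other not in depth:
                depth[other] = depth[current] + 1
                queue.append(other)
    return set(depth)

# Hypothetical undirected edges between entities.
edges = [
    ("OutdoorGear Inc", "SupplierA"),
    ("SupplierA", "FabricMill"),
    ("OutdoorGear Inc", "TrailTent"),
]
print(sorted(k_hop_neighborhood(edges, "OutdoorGear Inc", k=1)))
print(sorted(k_hop_neighborhood(edges, "OutdoorGear Inc", k=2)))
```

Raising `k` widens the context handed to the model at the cost of more tokens, which is the main tuning choice for local queries.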

Step 6: Graph Quality Metrics

Evaluate the quality of the knowledge graph:

total_entities = len(kg)
total_communities = kg["community_id"].nunique()
avg_community_size = total_entities / total_communities
max_importance = kg["importance_score"].max()
min_importance = kg["importance_score"].min()

report = f"""
╔════════════════════════════════════════════════════════╗
║     GraphRAG - Knowledge Graph Quality Report          ║
╠════════════════════════════════════════════════════════╣
║ Total Entities:              {total_entities:>5}                     ║
║ Total Communities:           {total_communities:>5}                     ║
║ Avg Community Size:          {avg_community_size:>5.1f}                     ║
║ Max Importance Score:        {max_importance:>5.2f}                     ║
║ Min Importance Score:        {min_importance:>5.2f}                     ║
║ Entity Types:                {kg['entity_type'].nunique():>5}                     ║
╚════════════════════════════════════════════════════════╝
"""
print(report)
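
The learning goals mention community coverage, but the lab does not define it. One plausible definition, assumed here, is the fraction of all communities that contain at least one retrieved entity. The sketch below computes it for a hypothetical local and global result set. The entity-to-community mapping is invented for illustration.

```python
def community_coverage(retrieved_entities, community_of):
    """Fraction of all communities that contain at least one retrieved entity."""
    all_communities = set(community_of.values())
    if not all_communities:
        return 0.0
    hit = {community_of[e] for e in retrieved_entities if e in community_of}
    return len(hit) / len(all_communities)

# Hypothetical mapping of entity name -> community id.
community_of = {"OutdoorGear Inc": 0, "SupplierA": 1, "CompetitorB": 2, "TrailTent": 0}
local_hits = ["OutdoorGear Inc", "TrailTent"]
global_hits = ["OutdoorGear Inc", "SupplierA", "CompetitorB"]
print(f"Local coverage:  {community_coverage(local_hits, community_of):.2f}")
print(f"Global coverage: {community_coverage(global_hits, community_of):.2f}")
```

With the real dataset, the same mapping could be built from the `entity_name` and `community_id` columns, for example with `dict(zip(kg["entity_name"], kg["community_id"]))`.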

🐛 Bug-Fix Exercise

The file lab-067/broken_graphrag.py has 3 bugs in how it processes the knowledge graph:

python lab-067/broken_graphrag.py

Test | What it checks | Hint
Test 1 | Entity count | Should count all rows in the DataFrame, not unique community IDs
Test 2 | Community count | Should use nunique() on community_id, not count()
Test 3 | Highest importance entity | Should use idxmax() on importance_score, not idxmin()

🧠 Knowledge Check

Q1 (Multiple Choice): What problem does GraphRAG solve that traditional vector RAG cannot?
  • A) Faster embedding generation
  • B) Cross-document synthesis for global queries by using community-level summaries
  • C) Lower storage costs for embeddings
  • D) Better tokenization of source documents
✅ Reveal Answer

Correct: B) Cross-document synthesis for global queries by using community-level summaries

Traditional vector RAG retrieves individual chunks by similarity, which works for local queries but fails when the answer requires synthesizing information scattered across many documents. GraphRAG builds a knowledge graph, detects communities of related entities, and uses community summaries to answer global queries via map-reduce.

Q2 (Multiple Choice): What is a 'community' in the context of GraphRAG?
  • A) A group of users who share the same agent
  • B) A cluster of densely connected entities in the knowledge graph that represent a thematic group
  • C) A type of vector index partition
  • D) A chat thread with multiple participants
✅ Reveal Answer

Correct: B) A cluster of densely connected entities in the knowledge graph that represent a thematic group

Communities are discovered by graph clustering algorithms like Leiden. Entities within a community are more densely connected to each other than to entities outside the community. Each community gets an LLM-generated summary that captures its theme, enabling efficient global query answering.

Q3 (Run the Lab): How many total entities are in the knowledge graph?

Load the knowledge graph CSV and count the total rows.

✅ Reveal Answer

15 entities

The knowledge graph contains 15 entities spanning multiple types (organizations, products, people, concepts). These entities are connected through relationships extracted from the source documents.

Q4 (Run the Lab): How many communities were detected in the knowledge graph?

Use nunique() on the community_id column.

✅ Reveal Answer

8 communities

The 15 entities are organized into 8 communities by the Leiden clustering algorithm. Each community represents a thematic group of related entities, such as supplier relationships, competitor landscape, or product categories.

Q5 (Run the Lab): Which entity has the highest importance score, and what is the score?

Sort by importance_score descending and check the top entity.

✅ Reveal Answer

OutdoorGear Inc with an importance score of 0.98

OutdoorGear Inc is the most central entity in the knowledge graph, connecting to entities across multiple communities. Its high importance score (0.98) reflects its role as a hub: queries involving this entity will traverse more of the graph and provide richer cross-document context.


Summary

Topic | What You Learned
GraphRAG | Extends RAG with knowledge graphs for cross-document synthesis
Entity Extraction | Identify people, organizations, and concepts from documents
Community Detection | Cluster related entities to discover thematic groups
Local Queries | Traverse entity-centric subgraphs for specific answers
Global Queries | Synthesize community summaries via map-reduce for broad answers
Importance Scoring | Rank entities by graph centrality to identify key nodes

Next Steps

  • Lab 009: RAG Basics (foundational retrieval patterns)
  • Lab 068: Hybrid Search (complementary retrieval strategies)
  • Lab 065: Purview DSPM for AI (governance for RAG pipelines)