Lab 050: Multi-Agent Observability with GenAI Semantic Conventions
What You'll Learn
- Apply GenAI semantic conventions to multi-agent systems: agent spans, model spans, tool spans
- Trace agent-to-agent handoffs, routing decisions, and retry patterns
- Distinguish `INTERNAL` (agent logic) vs `CLIENT` (LLM/tool calls) span kinds
- Analyze quality scores, token costs, and latency across a multi-agent pipeline
- Build observability dashboard metrics from raw span data
- Understand how conventions standardize telemetry across Foundry, Semantic Kernel, LangChain, AutoGen
Prerequisite
Complete Lab 049: Foundry IQ - Agent Tracing first. This lab assumes familiarity with OpenTelemetry spans, attributes, and GenAI conventions.
Introduction
Single-agent tracing is hard. Multi-agent tracing is harder, because every handoff adds spans that must be correlated. When a Router hands off to a Specialist, which calls tools and passes results to a Reviewer, you need a standard way to capture every step so you can reconstruct the full execution flow.
The OpenTelemetry GenAI semantic conventions solve this with three span types:
| Span Type | Kind | Key Attributes | Example |
|---|---|---|---|
| Agent span | `INTERNAL` | `gen_ai.agent.name`, `gen_ai.agent.id` | Router, ProductSpec, Reviewer |
| Model span | `CLIENT` | `gen_ai.request.model`, `gen_ai.usage.*_tokens` | `chat gpt-4o` |
| Tool span | `CLIENT` | `gen_ai.tool.name` | `search_products` |
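As a rough illustration of the table above, the sketch below classifies a span from its kind and attributes. The `classify_span` helper and the dict-based span representation are assumptions made for this example and are not part of any OpenTelemetry SDK; the attribute names follow the table, and the token value is made up.

```python
def classify_span(kind: str, attributes: dict) -> str:
    """Return 'agent', 'model', 'tool', or 'unknown' (illustrative helper)."""
    if kind == "INTERNAL" and "gen_ai.agent.name" in attributes:
        return "agent"
    if kind == "CLIENT" and "gen_ai.tool.name" in attributes:
        return "tool"
    if kind == "CLIENT" and "gen_ai.request.model" in attributes:
        return "model"
    return "unknown"


# One example span per row of the table; values are illustrative.
examples = [
    ("INTERNAL", {"gen_ai.agent.name": "Router"}),
    ("CLIENT", {"gen_ai.request.model": "gpt-4o", "gen_ai.usage.input_tokens": 120}),
    ("CLIENT", {"gen_ai.tool.name": "search_products"}),
]
labels = [classify_span(kind, attrs) for kind, attrs in examples]
print(labels)  # ['agent', 'model', 'tool']
```

A span that matches none of the three rows comes back as `unknown`, which is a convenient signal that instrumentation is missing an attribute.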
The Scenario
OutdoorGear Inc. has upgraded to a multi-agent system with 4 specialist agents orchestrated by a Router:
- Router Agent: classifies incoming queries and dispatches to the right specialist
- Product Specialist: handles product search and recommendations
- Order Specialist: processes order status and shipping queries
- Support Specialist: handles complaints and sensitive issues
- Reviewer Agent: checks every response for quality and policy compliance
You have 5 complex traces with 46 spans showing the full agent pipeline, including a trace with a failed review and retry.
Prerequisites
| Requirement | Why |
|---|---|
| Python 3.10+ | Run analysis scripts |
| `pandas` | Analyze span data |
| Lab 049 completed | Understanding of OpenTelemetry basics |
📦 Supporting Files
Download these files before starting the lab
Save all files to a lab-050/ folder in your working directory.
| File | Description | Download |
|---|---|---|
| `broken_conventions.py` | Bug-fix exercise (3 bugs + self-tests) | 📥 Download |
| `dashboard_builder.py` | Starter script with TODOs | 📥 Download |
| `multi_agent_spans.csv` | Dataset | 📥 Download |
Step 1: Understanding Multi-Agent Trace Structure
In a multi-agent system, the trace forms a tree:
root: router_agent (INTERNAL)
├── classify_query (CLIENT, gpt-4o-mini)
├── product_specialist (INTERNAL)
│   ├── search_reasoning (CLIENT, gpt-4o)
│   ├── search_products (CLIENT, tool)
│   └── format_response (CLIENT, gpt-4o)
└── reviewer (INTERNAL)
    └── quality_check (CLIENT, gpt-4o-mini)
Key conventions:
- Agent spans are `INTERNAL`: they represent the agent's own logic and orchestration
- LLM calls are `CLIENT`: outgoing requests to model endpoints
- Tool calls are `CLIENT`: outgoing requests to tools/APIs
- Parent-child relationships show the handoff chain
- `gen_ai.agent.name` is set ONLY on agent spans, not on LLM/tool spans
Why INTERNAL for Agents?
An agent's decision-making happens locally (routing, planning, memory retrieval). It doesn't cross a network boundary, so it's INTERNAL. The LLM call that the agent makes is CLIENT because it goes over the network to an API.
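The trace tree from this step can be rebuilt from flat span records by following parent links. The sketch below is a minimal version of that idea; the span IDs and the `parent_id` field name are assumptions for illustration and may not match the column names in the lab's CSV.

```python
from collections import defaultdict

# Illustrative records mirroring the Step 1 tree; IDs and `parent_id` are assumed.
SPANS = [
    {"span_id": "s1", "parent_id": None, "name": "router_agent", "kind": "INTERNAL"},
    {"span_id": "s2", "parent_id": "s1", "name": "classify_query", "kind": "CLIENT"},
    {"span_id": "s3", "parent_id": "s1", "name": "product_specialist", "kind": "INTERNAL"},
    {"span_id": "s4", "parent_id": "s3", "name": "search_reasoning", "kind": "CLIENT"},
    {"span_id": "s5", "parent_id": "s3", "name": "search_products", "kind": "CLIENT"},
    {"span_id": "s6", "parent_id": "s3", "name": "format_response", "kind": "CLIENT"},
    {"span_id": "s7", "parent_id": "s1", "name": "reviewer", "kind": "INTERNAL"},
    {"span_id": "s8", "parent_id": "s7", "name": "quality_check", "kind": "CLIENT"},
]


def render_tree(spans):
    """Return indented lines, one per span, in depth-first order."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children[parent_id]:
            lines.append(f"{'  ' * depth}{span['name']} ({span['kind']})")
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


tree_lines = render_tree(SPANS)
print("\n".join(tree_lines))
```

Each indentation level corresponds to one handoff or outgoing call, so the depth of a CLIENT span shows which agent issued it.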
Step 2: Load and Explore the Trace Data
The dataset has 46 spans across 5 traces:
import pandas as pd
spans = pd.read_csv("lab-050/multi_agent_spans.csv")
print(f"Total spans: {len(spans)}")
print(f"Traces: {spans['trace_id'].nunique()}")
print(f"\nSpans per trace:")
print(spans.groupby("trace_id")["span_id"].count())
Expected:
| Trace | Spans | Scenario |
|---|---|---|
| A001 | 8 | Product search (simple) |
| A002 | 10 | Complex order query |
| A003 | 9 | Complaint handling |
| A004 | 5 | FAQ (no reviewer) |
| A005 | 14 | Refund with failed review + retry |
Step 3: Agent Span Analysis
Extract and analyze agent spans:
agent_spans = spans[(spans["kind"] == "INTERNAL") & (spans["agent_name"].notna())]
print(f"Total agent spans: {len(agent_spans)}")
print(f"Unique agents: {sorted(agent_spans['agent_name'].unique())}")
print(f"\nSpans per agent:")
print(agent_spans["agent_name"].value_counts().sort_index())
Expected:
Total agent spans: 16
Unique agents: ['FAQSpec', 'OrderSpec', 'ProductSpec', 'RefundSpec', 'Reviewer', 'Router', 'SupportSpec']
Reviewer 5
Router 5
RefundSpec 2
...
Insight
Router appears in all 5 traces because it's the entry point. Reviewer appears in 4 traces (not A004, the simple FAQ). RefundSpec appears twice in trace A005 because the first attempt failed review and was retried.
Step 4: LLM Token Usage Analysis
Analyze token consumption across all model calls:
llm_spans = spans[spans["model"].notna()]
print(f"Total LLM calls: {len(llm_spans)}")
by_model = llm_spans.groupby("model").agg(
    calls=("span_id", "count"),
    total_input=("input_tokens", "sum"),
    total_output=("output_tokens", "sum"),
).reset_index()
by_model["total_tokens"] = by_model["total_input"] + by_model["total_output"]
print(by_model.to_string(index=False))
total_tokens = int(llm_spans["input_tokens"].sum() + llm_spans["output_tokens"].sum())
print(f"\nGrand total: {total_tokens:,} tokens")
Expected:
| Model | Calls | Input | Output | Total |
|---|---|---|---|---|
| gpt-4o | 12 | 3,830 | 1,890 | 5,720 |
| gpt-4o-mini | 10 | 1,045 | 177 | 1,222 |
| Total | 22 | 4,875 | 2,067 | 6,942 |
Cost Insight
gpt-4o handles the heavy reasoning (82% of tokens) while gpt-4o-mini does lightweight classification and quality checks (18%). This is a cost-efficient pattern: use expensive models only for complex reasoning.
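The 82% / 18% split quoted above follows directly from the per-model totals. A small worked check, using only the totals from the expected table in this step:

```python
# Totals taken from the expected table in Step 4.
tokens_by_model = {"gpt-4o": 5_720, "gpt-4o-mini": 1_222}

grand_total = sum(tokens_by_model.values())
shares = {
    model: round(100 * count / grand_total, 1)
    for model, count in tokens_by_model.items()
}

for model, pct in shares.items():
    print(f"{model}: {pct}% of {grand_total:,} tokens")
```

Token share is only a proxy for cost share; a real cost estimate would weight input and output tokens by each model's current prices, which are not part of this dataset.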
Step 5: Tool Call Analysis
tool_spans = spans[spans["tool_name"].notna()]
print(f"Total tool calls: {len(tool_spans)}")
print(f"\nTools used:")
print(tool_spans["tool_name"].value_counts())
trace_tools = tool_spans.groupby("trace_id").size()
print(f"\nTrace with most tool calls: {trace_tools.idxmax()} ({trace_tools.max()} calls)")
Expected:
Total tool calls: 8
search_products 1
get_order_status 1
get_shipping_info 1
calculate_eta 1
get_customer_history 1
search_faq 1
get_order_details 1
check_refund_policy 1
Trace with most tool calls: A002 (3 calls)
Step 6: Quality Score Analysis
Reviewer agents assign quality scores. Analyze them:
quality_spans = spans[spans["quality_score"].notna()]
print(f"Quality assessments: {len(quality_spans)}")
print(f"Average quality: {quality_spans['quality_score'].mean():.3f}")
print(f"Min quality: {quality_spans['quality_score'].min():.2f}")
print(f"Max quality: {quality_spans['quality_score'].max():.2f}")
# Traces that fell below the quality threshold
below_threshold = quality_spans[quality_spans["quality_score"] < 0.8]
print(f"\nTraces below 0.8 threshold: {below_threshold['trace_id'].unique().tolist()}")
Expected:
Quality assessments: 5
Average quality: 0.790
Min quality: 0.45
Max quality: 0.95
Traces below 0.8 threshold: ['A003', 'A005']
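The expected values can be verified by hand from the five reviewer scores listed in this lab. The sketch below does that with the standard library only; the `THRESHOLD` name is an assumption for the example.

```python
from statistics import mean

# Reviewer scores as listed in this lab (A004 has no reviewer span).
scores = [("A001", 0.95), ("A002", 0.92), ("A003", 0.78), ("A005", 0.45), ("A005", 0.85)]
THRESHOLD = 0.8

average = mean(score for _, score in scores)
flagged_traces = sorted({trace for trace, score in scores if score < THRESHOLD})

print(f"Average quality: {average:.3f}")
print(f"Traces with a score below {THRESHOLD}: {flagged_traces}")
```

A005 is flagged because of its first attempt, even though the retry passed at 0.85. Whether a recovered trace should still count as below threshold is a dashboard design choice.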
Investigating the Failed Review (Trace A005)
a005 = spans[spans["trace_id"] == "A005"].sort_values("span_id")
print(a005[["span_id", "span_name", "agent_name", "kind", "quality_score", "status"]]
      .to_string(index=False))
This shows the retry pattern: the first reviewer check (s40) scored 0.45 with status ERROR. The Refund Specialist was re-invoked (s42), produced a revised response, and the second reviewer check (s45) passed at 0.85.
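One simple way to surface retries automatically is to look for agents that run more than once within a trace. The sketch below applies that heuristic to a handful of illustrative agent spans (not the full dataset); `find_retries` is a hypothetical helper name.

```python
from collections import Counter, defaultdict

# Illustrative agent spans as (trace_id, agent_name); not the full dataset.
AGENT_SPANS = [
    ("A001", "Router"), ("A001", "ProductSpec"), ("A001", "Reviewer"),
    ("A005", "Router"), ("A005", "RefundSpec"), ("A005", "Reviewer"),
    ("A005", "RefundSpec"), ("A005", "Reviewer"),
]


def find_retries(agent_spans):
    """Map each trace_id to the agents that ran more than once in that trace."""
    per_trace = defaultdict(Counter)
    for trace_id, agent in agent_spans:
        per_trace[trace_id][agent] += 1
    return {
        trace_id: {agent: n for agent, n in counts.items() if n > 1}
        for trace_id, counts in per_trace.items()
        if any(n > 1 for n in counts.values())
    }


retries = find_retries(AGENT_SPANS)
print(retries)  # {'A005': {'RefundSpec': 2, 'Reviewer': 2}}
```

This heuristic assumes an agent normally runs once per trace. A pipeline that legitimately calls the same agent several times would need an explicit retry attribute instead.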
Step 7: Build Dashboard Metrics
Combine everything into a dashboard summary:
# Overall metrics
total_traces = spans["trace_id"].nunique()
total_spans = len(spans)
total_agent_spans = len(agent_spans)
total_llm_calls = len(llm_spans)
total_tools = len(tool_spans)
error_spans = spans[spans["status"] == "ERROR"]
avg_quality = quality_spans["quality_score"].mean()
# Count distinct traces (not rows) that had any score below the threshold
dashboard = f"""
╔══════════════════════════════════════════════════╗
║       Multi-Agent Observability Dashboard        ║
╠══════════════════════════════════════════════════╣
║ Traces:          {total_traces:>5}                           ║
║ Total Spans:     {total_spans:>5}                           ║
║ Agent Spans:     {total_agent_spans:>5} (INTERNAL)                ║
║ LLM Calls:       {total_llm_calls:>5} (CLIENT)                  ║
║ Tool Calls:      {total_tools:>5} (CLIENT)                  ║
║ Error Spans:     {len(error_spans):>5}                           ║
║ Total Tokens:    {total_tokens:>5,}                           ║
║ Avg Quality:     {avg_quality:>5.3f}                           ║
║ Below Threshold: {below_threshold['trace_id'].nunique():>5} traces                    ║
╚══════════════════════════════════════════════════╝
"""
print(dashboard)
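Before publishing dashboard numbers, it can help to sanity-check that they are mutually consistent and to export them in a machine-readable form. The sketch below uses the counts from the expected outputs in Steps 2 to 5; the assumption that every span is exactly one of agent, LLM, or tool is inferred from those counts adding up to 46, and the derived metric names are illustrative.

```python
import json

# Counts taken from the expected outputs in Steps 2 to 5.
metrics = {
    "traces": 5,
    "total_spans": 46,
    "agent_spans": 16,
    "llm_calls": 22,
    "tool_calls": 8,
    "total_tokens": 6942,
}

# Assumption: every span in this dataset is exactly one of agent, LLM, or tool.
categorized = metrics["agent_spans"] + metrics["llm_calls"] + metrics["tool_calls"]
assert categorized == metrics["total_spans"], "span categories do not add up"

# Derived ratios (illustrative metric names).
metrics["avg_spans_per_trace"] = round(metrics["total_spans"] / metrics["traces"], 1)
metrics["avg_tokens_per_llm_call"] = round(metrics["total_tokens"] / metrics["llm_calls"], 1)

print(json.dumps(metrics, indent=2))
```

If the category check fails on your own data, some spans are probably missing `agent_name`, `model`, or `tool_name`, which is the kind of instrumentation gap the conventions are meant to expose.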
🐛 Bug-Fix Exercise
The file lab-050/broken_conventions.py has 3 bugs in how it interprets GenAI semantic conventions:
| Test | What it checks | Hint |
|---|---|---|
| Test 1 | Agent names come from `agent_name`, not `span_name` | Which column has the agent identity? |
| Test 2 | Agent spans must be `INTERNAL` kind AND have an `agent_name` | Don't count LLM/tool spans |
| Test 3 | Total tokens = input + output | Don't forget `output_tokens` |
🧠 Knowledge Check
Q1 (Multiple Choice): In GenAI semantic conventions, which span kind should be used for an agent's internal routing/planning logic?
- A) CLIENT, because the agent is a client of the LLM
- B) SERVER, because the agent serves user requests
- C) INTERNAL, because routing happens locally, not over the network
- D) PRODUCER, because the agent produces responses
Reveal Answer
Correct: C) INTERNAL
Agent decision-making (routing, planning, memory retrieval) happens within the process; it doesn't cross a network boundary. CLIENT is used for outgoing calls to LLMs and tools. The convention is: agent logic = INTERNAL, external calls = CLIENT.
Q2 (Multiple Choice): Why does trace A005 have 14 spans while A001 has only 8?
- A) A005 uses a larger model
- B) A005 had a failed quality review and required a retry loop
- C) A005 has more user input tokens
- D) A005 uses a different routing algorithm
Reveal Answer
Correct: B) A005 had a failed quality review and required a retry loop
The Reviewer scored A005's first response at 0.45 (ERROR). The system re-invoked the Refund Specialist to revise the response, then the Reviewer checked again (score: 0.85, OK). The retry added 5 spans: a second specialist span with its 2 LLM calls, plus a second reviewer span with its 1 LLM call.
Q3 (Run the Lab): How many total agent spans (kind=INTERNAL with an agent_name) are there across all 5 traces?
Filter the spans DataFrame for kind == "INTERNAL" and agent_name not null.
Reveal Answer
16 agent spans
Across 5 traces: A001(3) + A002(3) + A003(3) + A004(2) + A005(5) = 16. A004 has fewer because it skips the Reviewer. A005 has more because of the retry (RefundSpec×2 + Reviewer×2).
Q4 (Run the Lab): Which trace has the most tool calls, and how many?
Group tool spans by trace_id and find the maximum.
Reveal Answer
Trace A002: 3 tool calls
A002 (complex order query) called: get_order_status, get_shipping_info, and calculate_eta. This is the most tool-intensive trace. A005 has 2 tool calls, and the rest have 1 each.
Q5 (Run the Lab): What is the average quality score across all reviewer assessments?
Filter for spans with a non-null quality_score and calculate the mean.
Reveal Answer
0.790
Quality scores from reviewer spans: A001 (0.95), A002 (0.92), A003 (0.78), A005-first (0.45), A005-retry (0.85). A004 (FAQ) has no reviewer. The data has 5 quality_score entries. Average = (0.95 + 0.92 + 0.78 + 0.45 + 0.85) / 5 = 0.790. Two traces (A003 and A005) fell below the 0.8 quality threshold.
Summary
| Topic | What You Learned |
|---|---|
| GenAI Conventions | Standard attributes: `gen_ai.agent.name`, `gen_ai.request.model`, `gen_ai.usage.*_tokens` |
| Span Kinds | INTERNAL (agent logic) vs CLIENT (LLM/tool calls) |
| Trace Hierarchy | Parent-child spans showing agent handoffs |
| Retry Patterns | Failed reviews trigger retry loops (visible in traces) |
| Dashboard Metrics | Agent counts, token usage, tool calls, quality scores |
| Cross-Framework | Same conventions work across Foundry, SK, LangChain, AutoGen |