Lab 074: Foundry Agent Service β Production Multi-Agent DeploymentΒΆ
What You'll LearnΒΆ
- What the Foundry Agent Service is and how it orchestrates production multi-agent systems
- How agent types (specialist, orchestrator) work together in a deployment
- Analyze agent fleet health: request volumes, latency, error rates, and status
- Identify degraded agents and configuration risks (e.g., disabled content filters)
- Build a fleet health dashboard for production monitoring
IntroductionΒΆ
The Azure AI Foundry Agent Service provides a managed platform for deploying, orchestrating, and monitoring multi-agent systems at enterprise scale. Instead of building custom orchestration, you define agents with specific tools, memory, and models β and the service handles routing, state management, and scaling.
Agent TypesΒΆ
| Type | Role | Example |
|---|---|---|
| Orchestrator | Routes requests to specialists, manages conversation flow | SupportRouter, Coordinator |
| Specialist | Handles a specific domain with dedicated tools and memory | ProductAdvisor, OrderProcessor |
The ScenarioΒΆ
You are a Platform SRE managing a multi-agent deployment for an e-commerce company. The fleet has 8 agents β 2 orchestrators and 6 specialists β running on Azure Container Apps. You've been alerted that one agent is degraded and need to investigate.
Your dataset (foundry_agents.csv) contains the current fleet status. Your job: analyze health metrics, identify issues, and produce a fleet status report.
Mock Data
This lab uses a mock agent fleet CSV that mirrors the metrics you'd see in Azure AI Foundry's monitoring dashboard. The patterns (latency spikes, error rates, degraded status) represent common production scenarios.
PrerequisitesΒΆ
| Requirement | Why |
|---|---|
| Python 3.10+ | Run the analysis scripts |
pandas library |
Data manipulation |
π¦ Supporting FilesΒΆ
Download these files before starting the lab
Save all files to a lab-074/ folder in your working directory.
| File | Description | Download |
|---|---|---|
broken_foundry.py |
Bug-fix exercise (3 bugs + self-tests) | π₯ Download |
foundry_agents.csv |
Dataset | π₯ Download |
Step 1: Understand the Fleet ArchitectureΒΆ
Before analyzing data, understand how the agents fit together:
βββββββββββββββββββ
β Coordinator β (orchestrator)
β FA05 β
ββββββββββ¬βββββββββ
β routes to
ββββββββββββββββΌβββββββββββββββ
βΌ βΌ βΌ
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
βSupportRouter β βProductAdvisorβ βOrderProcessorβ
β FA03 β β FA01 β β FA02 β
ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ
β
ββββββββββΌβββββββββ¬βββββββββββββββ
βΌ βΌ βΌ βΌ
ββββββββββββββββββββββββββββββββββββββββββ
βInventoryββQuality ββAnalyticsββLegacyBridgeβ
β FA04 ββ FA06 ββ FA07 ββ FA08 β
ββββββββββββββββββββββββββββββββββββββββββ
Key Configuration FieldsΒΆ
| Field | Description |
|---|---|
| memory_type | How the agent persists state: cosmos_db (durable), ai_search (vector), session_only (ephemeral), none |
| deployment | Infrastructure: container_apps (managed) or vm (self-hosted) |
| content_filter | Whether Azure AI content safety is enabled or disabled |
| status | Agent health: active or degraded |
Step 2: Load and Explore the Fleet DataΒΆ
import pandas as pd
df = pd.read_csv("lab-074/foundry_agents.csv")
print(f"Total agents: {len(df)}")
print(f"Agent types: {df['agent_type'].value_counts().to_dict()}")
print(f"Statuses: {df['status'].value_counts().to_dict()}")
print(f"\nFull fleet:")
print(df[["agent_id", "agent_name", "agent_type", "model", "status"]].to_string(index=False))
Expected output:
Total agents: 8
Agent types: {'specialist': 6, 'orchestrator': 2}
Statuses: {'active': 7, 'degraded': 1}
Step 3: Analyze Request Volume and Load DistributionΒΆ
How is traffic distributed across the fleet?
total_requests = df["requests_24h"].sum()
print(f"Total 24h requests across fleet: {total_requests:,}")
print("\nRequest distribution:")
for _, row in df.sort_values("requests_24h", ascending=False).iterrows():
pct = row["requests_24h"] / total_requests * 100
bar = "β" * int(pct / 2)
print(f" {row['agent_name']:>20s}: {row['requests_24h']:>5,} ({pct:>5.1f}%) {bar}")
Expected output:
| Agent | Requests | Share |
|---|---|---|
| Coordinator | 3,200 | 34.1% |
| SupportRouter | 2,100 | 22.4% |
| ProductAdvisor | 1,250 | 13.3% |
| OrderProcessor | 890 | 9.5% |
| QualityReviewer | 780 | 8.3% |
| InventoryMonitor | 560 | 6.0% |
| AnalyticsAgent | 420 | 4.5% |
| LegacyBridge | 180 | 1.9% |
Insight
The Coordinator orchestrator handles 34% of all traffic β it's the entry point for most requests. If it goes down, the entire system is affected. The SupportRouter is the second-busiest, routing customer support queries to specialists.
Step 4: Identify Degraded and At-Risk AgentsΒΆ
4a β Degraded AgentsΒΆ
degraded = df[df["status"] == "degraded"]
print(f"Degraded agents: {len(degraded)}")
for _, agent in degraded.iterrows():
print(f"\n Agent: {agent['agent_name']} ({agent['agent_id']})")
print(f" Error rate: {agent['error_rate_pct']}%")
print(f" Avg latency: {agent['avg_latency_ms']}ms")
print(f" Requests: {agent['requests_24h']}")
Expected output:
4b β High Error Rate AgentsΒΆ
high_error = df[df["error_rate_pct"] > 5.0]
print(f"\nAgents with error rate > 5%: {len(high_error)}")
for _, agent in high_error.iterrows():
print(f" {agent['agent_name']}: {agent['error_rate_pct']}% errors")
4c β Content Filter StatusΒΆ
disabled_filter = df[df["content_filter"] == "disabled"]
print(f"\nAgents with disabled content filter: {len(disabled_filter)}")
for _, agent in disabled_filter.iterrows():
print(f" {agent['agent_name']} ({agent['agent_id']}) β deployment: {agent['deployment']}")
Security Risk
LegacyBridge (FA08) has its content filter disabled and runs on a self-hosted VM. This is a compliance risk β all production agents should have content safety enabled, especially those handling customer data.
Step 5: Analyze Memory and Infrastructure PatternsΒΆ
print("Memory type distribution:")
print(df.groupby("memory_type")["agent_name"].apply(list).to_string())
print("\nDeployment distribution:")
print(df.groupby("deployment")["agent_name"].apply(list).to_string())
# Agents without durable memory
no_durable = df[df["memory_type"].isin(["session_only", "none"])]
print(f"\nAgents without durable memory: {len(no_durable)}")
for _, agent in no_durable.iterrows():
print(f" {agent['agent_name']}: memory={agent['memory_type']}")
# Latency by model
print("\nAvg latency by model:")
for model, group in df.groupby("model"):
print(f" {model}: {group['avg_latency_ms'].mean():.0f}ms")
Step 6: Build the Fleet Health ReportΒΆ
avg_latency = df["avg_latency_ms"].mean()
avg_error = df["error_rate_pct"].mean()
report = f"""# π Foundry Agent Service β Fleet Health Report
## Fleet Overview
| Metric | Value |
|--------|-------|
| Total Agents | {len(df)} |
| Orchestrators | {(df['agent_type'] == 'orchestrator').sum()} |
| Specialists | {(df['agent_type'] == 'specialist').sum()} |
| Active | {(df['status'] == 'active').sum()} |
| Degraded | {(df['status'] == 'degraded').sum()} |
| Total 24h Requests | {total_requests:,} |
| Avg Latency | {avg_latency:.0f}ms |
| Avg Error Rate | {avg_error:.1f}% |
## Alerts
| Priority | Issue | Agent | Action |
|----------|-------|-------|--------|
| π΄ High | Degraded status, 8.5% error rate | AnalyticsAgent (FA07) | Investigate AI Search connection |
| π‘ Medium | Content filter disabled | LegacyBridge (FA08) | Enable content safety |
| π‘ Medium | 12% error rate, VM deployment | LegacyBridge (FA08) | Migrate to Container Apps |
| π’ Low | Session-only memory | SupportRouter (FA03) | Consider durable memory for analytics |
## Recommendations
1. **Fix AnalyticsAgent** β likely an AI Search index connectivity issue causing 8.5% errors
2. **Enable content filter on LegacyBridge** β compliance requirement for production
3. **Migrate LegacyBridge to Container Apps** β self-hosted VMs lack auto-scaling and monitoring
4. **Add monitoring dashboards** β track per-agent latency and error rate trends
"""
print(report)
with open("lab-074/fleet_report.md", "w") as f:
f.write(report)
print("πΎ Saved to lab-074/fleet_report.md")
π Bug-Fix ExerciseΒΆ
The file lab-074/broken_foundry.py contains 3 bugs that produce incorrect fleet metrics. Can you find and fix them all?
Run the self-tests to see which ones fail:
You should see 3 failed tests. Each test corresponds to one bug:
| Test | What it checks | Hint |
|---|---|---|
| Test 1 | Total 24h requests | Should sum requests, not average them |
| Test 2 | Degraded agent count | Should count degraded status, not active |
| Test 3 | Agents without durable memory | Should count none/session_only, not cosmos_db |
Fix all 3 bugs, then re-run. When you see All passed!, you're done!
π§ Knowledge CheckΒΆ
Q1 (Multiple Choice): What is the role of an orchestrator agent in a Foundry multi-agent deployment?
- A) It performs a specific domain task like order processing
- B) It routes requests to specialist agents and manages conversation flow
- C) It stores agent memory in Cosmos DB
- D) It monitors agent health and restarts failed agents
β Reveal Answer
Correct: B) It routes requests to specialist agents and manages conversation flow
Orchestrator agents act as the "traffic controller" in a multi-agent system. They receive incoming requests, determine which specialist(s) should handle them, route the conversation accordingly, and manage the overall flow. Specialists handle the domain-specific work.
Q2 (Multiple Choice): Why is a disabled content filter a security risk for production agents?
- A) It makes the agent slower
- B) It allows the agent to generate harmful, biased, or policy-violating content
- C) It prevents the agent from accessing external APIs
- D) It increases token costs
β Reveal Answer
Correct: B) It allows the agent to generate harmful, biased, or policy-violating content
Azure AI Content Safety filters detect and block harmful content (hate speech, violence, self-harm, sexual content). Disabling the filter means the agent can produce or respond to such content without guardrails β a compliance and reputational risk in any production deployment.
Q3 (Run the Lab): What is the total number of requests across the entire fleet in the last 24 hours?
Run the Step 3 analysis on π₯ foundry_agents.csv and check the results.
β Reveal Answer
9,380 requests
Sum of all agent requests_24h values: 1,250 + 890 + 2,100 + 560 + 3,200 + 780 + 420 + 180 = 9,380.
Q4 (Run the Lab): How many agents are in a degraded state?
Run the Step 4a analysis to find out.
β Reveal Answer
1 agent
Only AnalyticsAgent (FA07) is in a degraded state, with an 8.5% error rate and 850ms average latency β significantly worse than the other agents. This likely indicates a backend connectivity issue with its AI Search memory store.
Q5 (Run the Lab): How many agents have their content filter disabled?
Run the Step 4c analysis to check content filter status.
β Reveal Answer
1 agent
Only LegacyBridge (FA08) has content_filter=disabled. It's also the only agent deployed on a self-hosted VM rather than Container Apps, and has the highest error rate (12.0%) in the fleet. This agent needs immediate attention.
SummaryΒΆ
| Topic | What You Learned |
|---|---|
| Foundry Agent Service | Managed platform for multi-agent orchestration and deployment |
| Agent Types | Orchestrators route; specialists execute domain tasks |
| Fleet Monitoring | Track requests, latency, error rates, and status per agent |
| Degraded Detection | Identify agents with elevated error rates or latency |
| Content Safety | All production agents should have content filters enabled |
| Memory Patterns | Cosmos DB for durable, AI Search for vector, session_only for ephemeral |
Next StepsΒΆ
- Lab 034 β Multi-Agent with Semantic Kernel (building the agents themselves)
- Lab 033 β Agent Observability with Application Insights (deeper monitoring)
- Lab 030 β Foundry Agent + MCP (connecting agents to external tools)
- Lab 075 β Power BI Copilot (visualizing fleet data with AI-assisted dashboards)