Skip to content

Lab 064: Securing MCP at Scale with Azure API ManagementΒΆ

Level: L400 Path: βš™οΈ Pro Code Time: ~90 min πŸ’° Cost: Free β€” Mock server data (no Azure subscription required)

What You'll LearnΒΆ

  • Use Azure API Management (APIM) as a centralized gateway for MCP servers
  • Enforce OAuth 2.0 authentication across all MCP endpoints
  • Apply rate limiting, DLP policies, and logging to MCP traffic
  • Audit MCP server compliance across teams and identify security gaps
  • Analyze error rates, latency, and call volumes across an MCP fleet

Prerequisite

Complete Lab 012: What Is MCP? and Lab 020: MCP Server (Python) first. This lab assumes familiarity with MCP architecture and tool-serving patterns.

IntroductionΒΆ

As organizations scale their AI agent deployments, the number of MCP servers grows rapidly β€” each team builds its own, with different authentication schemes, rate limits, and data-loss-prevention (DLP) controls. Without centralized governance, you end up with a patchwork of inconsistent security policies.

Azure API Management solves this by sitting in front of all MCP servers as a unified gateway:

Concern Without APIM With APIM
Authentication Each server rolls its own (API key, basic, OAuth…) Centralized OAuth 2.0 with Azure AD
Rate Limiting No limits or inconsistent per-server limits Uniform policy across all endpoints
DLP No scanning of tool inputs/outputs Content inspection and PII redaction
Monitoring Scattered logs, no unified view Centralized metrics, alerts, and dashboards

The ScenarioΒΆ

You are a Platform Security Engineer at a company running 10 MCP servers across 6 teams. Management wants a compliance report: which servers meet the security baseline (OAuth + DLP + logging), which don't, and what the risk exposure looks like.

You have a fleet inventory dataset with authentication types, rate limits, DLP status, logging status, call volumes, latency, and error rates.


PrerequisitesΒΆ

Requirement Why
Python 3.10+ Run analysis scripts
pandas Analyze server fleet data
pip install pandas

Quick Start with GitHub Codespaces

Open in GitHub Codespaces

All dependencies are pre-installed in the devcontainer.

πŸ“¦ Supporting FilesΒΆ

Download these files before starting the lab

Save all files to a lab-064/ folder in your working directory.

File Description Download
broken_apim.py Bug-fix exercise (3 bugs + self-tests) πŸ“₯ Download
mcp_servers.csv Dataset πŸ“₯ Download

Step 1: Understanding the APIM Security ModelΒΆ

When APIM sits in front of MCP servers, every tool call flows through a policy pipeline:

Agent β†’ APIM Gateway β†’ [Auth Policy] β†’ [Rate Limit] β†’ [DLP Scan] β†’ MCP Server
                                                                        ↓
Agent ← APIM Gateway ← [Response DLP] ← [Logging] ←────────────── Response

Key policies for MCP:

Policy Purpose Example
validate-jwt Verify OAuth 2.0 tokens Reject calls without valid Azure AD token
rate-limit-by-key Throttle per client/team 100 RPM per agent
set-body DLP content inspection Redact SSN, credit card numbers from tool outputs
log-to-eventhub Centralized audit logging Every tool call β†’ Event Hub β†’ Log Analytics

Why OAuth Over API Keys?

API keys have no user identity, no token expiry, and no scope control. If a key leaks, anyone can call the MCP server until you manually rotate it. OAuth 2.0 tokens expire automatically, carry user/app identity, and can be scoped to specific tools.


Step 2: Load and Explore the MCP Server FleetΒΆ

The dataset contains 10 MCP servers across 6 teams:

import pandas as pd

servers = pd.read_csv("lab-064/mcp_servers.csv")
print(f"Total MCP servers: {len(servers)}")
print(f"Teams: {sorted(servers['team'].unique())}")
print(f"\nServers per team:")
print(servers.groupby("team")["server_name"].count().sort_values(ascending=False))

Expected:

Total MCP servers: 10
Teams: ['Analytics', 'Commerce', 'Finance', 'HR', 'Logistics', 'Marketing', 'Operations', 'Support']

Commerce      2
Operations    2
Analytics     1
Finance       1
HR            1
Logistics     1
Marketing     1
Support       1

Step 3: Compliance AuditΒΆ

A server is compliant if it has all three: OAuth 2.0 authentication, DLP enabled, and logging enabled. Check the fleet:

compliant = servers[servers["compliant"] == True]
non_compliant = servers[servers["compliant"] == False]

print(f"Compliant servers:     {len(compliant)}")
print(f"Non-compliant servers: {len(non_compliant)}")
print(f"\nNon-compliant details:")
print(non_compliant[["server_name", "team", "auth_type", "has_dlp", "has_logging"]].to_string(index=False))

Expected:

Compliant servers:     6
Non-compliant servers: 4

Non-compliant details:
     server_name       team auth_type has_dlp has_logging
 customer-support   Support   api_key   false       true
 analytics-export Analytics   api_key   false      false
       legacy-erp Operations    basic   false      false
   maps-geocoding  Logistics   api_key   false       true

Risk Alert

4 of 10 servers are non-compliant β€” that's 40% of the fleet. The legacy-erp server is the worst offender: basic auth, no DLP, no logging, and the highest error rate.


Step 4: Authentication Gap AnalysisΒΆ

Identify servers that are not using OAuth 2.0:

non_oauth = servers[servers["auth_type"] != "oauth2"]
print(f"Servers without OAuth 2.0: {len(non_oauth)}")
print(non_oauth[["server_name", "auth_type", "monthly_calls"]].to_string(index=False))

total_non_oauth_calls = non_oauth["monthly_calls"].sum()
total_calls = servers["monthly_calls"].sum()
pct = total_non_oauth_calls / total_calls * 100
print(f"\nNon-OAuth call volume: {total_non_oauth_calls:,} / {total_calls:,} ({pct:.1f}%)")

Expected:

Servers without OAuth 2.0: 4

     server_name auth_type  monthly_calls
 customer-support   api_key         28000
 analytics-export   api_key         12000
       legacy-erp     basic          8000
   maps-geocoding   api_key         22000

Non-OAuth call volume: 70,000 / 194,500 (36.0%)

36% of All MCP Calls Use Weak Authentication

Over a third of monthly API calls go through servers with API keys or basic auth. A single leaked key could expose customer support data, analytics exports, ERP records, or geocoding services.


Step 5: DLP Coverage AnalysisΒΆ

Check which servers lack data-loss-prevention scanning:

no_dlp = servers[servers["has_dlp"].astype(str).str.lower() == "false"]
print(f"Servers without DLP: {len(no_dlp)}")
print(no_dlp[["server_name", "team", "monthly_calls"]].to_string(index=False))

Expected:

Servers without DLP: 4

     server_name       team  monthly_calls
 customer-support   Support         28000
 analytics-export Analytics         12000
       legacy-erp Operations          8000
   maps-geocoding  Logistics         22000

The 4 servers without DLP handle 70,000 monthly calls β€” any of these could leak PII or sensitive data through tool outputs without detection.


Step 6: Error Rate and Latency AnalysisΒΆ

Identify servers with the highest error rates and latency:

print("Error rates (sorted):")
error_sorted = servers.sort_values("error_rate_pct", ascending=False)
print(error_sorted[["server_name", "error_rate_pct", "avg_latency_ms"]].to_string(index=False))

highest_error = error_sorted.iloc[0]
print(f"\nHighest error rate: {highest_error['server_name']} at {highest_error['error_rate_pct']}%")
print(f"Its average latency: {highest_error['avg_latency_ms']}ms")

Expected:

Highest error rate: legacy-erp at 5.8%
Its average latency: 450ms

Insight

The legacy-erp server stands out as the highest-risk server: basic auth, no DLP, no logging, highest error rate (5.8%), and highest latency (450ms). This should be the top priority for APIM onboarding.


Step 7: Total Call VolumeΒΆ

Calculate the total monthly calls across all MCP servers:

total = servers["monthly_calls"].sum()
print(f"Total monthly calls across fleet: {total:,}")

Expected:

Total monthly calls across fleet: 194,500

Step 8: APIM Migration PriorityΒΆ

Create a prioritized migration plan based on risk:

servers["risk_score"] = (
    (servers["auth_type"] != "oauth2").astype(int) * 3 +
    (servers["has_dlp"].astype(str).str.lower() == "false").astype(int) * 2 +
    (servers["has_logging"].astype(str).str.lower() == "false").astype(int) * 1 +
    servers["error_rate_pct"] / servers["error_rate_pct"].max()
)

priority = servers.sort_values("risk_score", ascending=False)
print("Migration Priority:")
print(priority[["server_name", "auth_type", "has_dlp", "has_logging", "risk_score"]]
      .head(5).to_string(index=False))

This produces a risk-ranked list to guide the APIM onboarding sequence.


πŸ› Bug-Fix ExerciseΒΆ

The file lab-064/broken_apim.py has 3 bugs in how it analyzes the MCP server fleet:

python lab-064/broken_apim.py
Test What it checks Hint
Test 1 Count non-compliant servers Should count compliant == False, not True
Test 2 Total monthly calls Should be the sum, not the average
Test 3 Servers without OAuth Should filter auth_type != "oauth2", not == "oauth2"

🧠 Knowledge Check¢

Q1 (Multiple Choice): Why is APIM the recommended approach for securing MCP servers at scale?
  • A) It replaces MCP with a different protocol
  • B) It provides centralized authentication, throttling, and monitoring across all MCP endpoints
  • C) It eliminates the need for OAuth 2.0
  • D) It only works with Azure-hosted MCP servers
βœ… Reveal Answer

Correct: B) It provides centralized authentication, throttling, and monitoring across all MCP endpoints

APIM acts as a unified gateway in front of all MCP servers, enforcing consistent OAuth 2.0 validation, rate limiting, DLP content inspection, and audit logging β€” regardless of how each individual MCP server was originally built. Without APIM, each team implements (or skips) these controls independently.

Q2 (Multiple Choice): Why is API key authentication insufficient for production MCP servers?
  • A) API keys are too long to store securely
  • B) API keys provide no user identity, no token expiry, and no scope control
  • C) API keys only work with REST APIs, not MCP
  • D) API keys require Azure AD to function
βœ… Reveal Answer

Correct: B) API keys provide no user identity, no token expiry, and no scope control

API keys are static secrets: if one leaks, anyone can use it indefinitely until manually rotated. They carry no information about who is calling or what they're allowed to do. OAuth 2.0 tokens expire automatically, embed user/app identity claims, and can be scoped to specific permissions (e.g., read-only access to a specific tool).

Q3 (Run the Lab): How many MCP servers in the fleet are non-compliant?

Filter the servers DataFrame for compliant == False and count the rows.

βœ… Reveal Answer

4 non-compliant servers

The non-compliant servers are: customer-support (api_key, no DLP), analytics-export (api_key, no DLP, no logging), legacy-erp (basic auth, no DLP, no logging), and maps-geocoding (api_key, no DLP). All 4 lack OAuth and DLP; 2 also lack logging.

Q4 (Run the Lab): What is the total monthly call volume across all 10 MCP servers?

Sum the monthly_calls column across all servers.

βœ… Reveal Answer

194,500 total monthly calls

45,000 + 32,000 + 28,000 + 18,000 + 15,000 + 12,000 + 5,000 + 8,000 + 22,000 + 9,500 = 194,500. Of these, 70,000 (36%) go through servers without OAuth 2.0 β€” a significant security exposure.

Q5 (Run the Lab): Which MCP server has the highest error rate, and what is it?

Sort servers by error_rate_pct descending and inspect the top row.

βœ… Reveal Answer

legacy-erp at 5.8%

The legacy-erp server (Operations team) has the highest error rate at 5.8%, nearly 3Γ— the next highest (payment-gateway at 2.1%). Combined with basic auth, no DLP, no logging, and 450ms average latency, it is the highest-risk server in the fleet and should be the top priority for APIM onboarding.


SummaryΒΆ

Topic What You Learned
APIM as Gateway Centralized security, rate limiting, and monitoring for MCP
OAuth 2.0 Token-based auth with identity, expiry, and scope control
DLP Policies Content inspection to prevent PII/sensitive data leakage
Compliance Audit Systematic assessment of fleet security posture
Risk Prioritization Data-driven migration planning based on auth, DLP, and error rates

Next StepsΒΆ

  • Lab 012 β€” What Is MCP? (foundational MCP concepts)
  • Lab 028 β€” Deploy MCP to Azure (deploy the servers that APIM protects)
  • Lab 036 β€” Prompt Injection Security (complementary security layer)