Lab 062: On-Device Agents with Phi Silica β Windows AI APIsΒΆ
What You'll LearnΒΆ
- How Windows AI APIs enable on-device inference using the Neural Processing Unit (NPU)
- What Phi Silica is β a model optimized for Windows NPU hardware
- Compare NPU vs cloud latency for agent skills (summarize, classify, rewrite, text_to_table)
- Handle NPU unavailability gracefully with cloud fallback strategies
- Measure quality match rates between on-device and cloud inference
- Build agents that work offline-first with intelligent degradation
IntroductionΒΆ
Cloud-based AI is powerful, but it requires internet connectivity, introduces latency, and sends data off-device. Windows AI APIs with Phi Silica bring inference directly to the NPU (Neural Processing Unit) β a dedicated AI accelerator built into modern Windows devices.
On-device inference means: zero network latency, full data privacy, offline capability, and no per-token cost. The trade-off is that not every task can run on the NPU, and quality may differ from cloud models. This lab measures exactly where on-device inference shines and where you need cloud fallback.
The BenchmarkΒΆ
You'll analyze 15 tasks across 4 categories, comparing NPU (Phi Silica) vs cloud inference:
| Category | Count | Example |
|---|---|---|
| Summarize | 4 | Meeting transcript, article, email thread, policy doc |
| Classify | 4 | Sentiment, intent, priority, language detection |
| Rewrite | 4 | Tone adjustment, simplification, formalization, translation |
| Text-to-table | 3 | Extract structured data from unstructured text |
PrerequisitesΒΆ
This lab analyzes pre-computed benchmark results β no NPU hardware, Windows AI SDK, or C# toolchain required. To run live on-device inference, you would need a Copilot+ PC with NPU and the Windows AI APIs.
π¦ Supporting FilesΒΆ
Download these files before starting the lab
Save all files to a lab-062/ folder in your working directory.
| File | Description | Download |
|---|---|---|
broken_ondevice.py |
Bug-fix exercise (3 bugs + self-tests) | π₯ Download |
ondevice_tasks.csv |
Dataset | π₯ Download |
Part 1: Understanding On-Device InferenceΒΆ
Step 1: NPU architectureΒΆ
The Neural Processing Unit (NPU) is a dedicated AI accelerator designed for efficient matrix operations:
Cloud Inference:
App β [Network] β [Cloud GPU] β [Network] β Response
Latency: ~800-1200ms
NPU Inference (Phi Silica):
App β [Local NPU] β Response
Latency: ~50-120ms
Key concepts:
| Concept | Description |
|---|---|
| NPU | Neural Processing Unit β dedicated AI hardware in modern CPUs |
| Phi Silica | Microsoft's model optimized for Windows NPU execution |
| Windows AI APIs | System-level APIs for on-device AI inference |
| Readiness check | API to verify NPU availability before attempting inference |
| Graceful fallback | Strategy to fall back to cloud when NPU is unavailable |
Phi Silica vs Phi-4 Mini
Phi Silica is specifically optimized for Windows NPU hardware β it's not just a smaller model, but one designed for the NPU's architecture. Phi-4 Mini (Lab 061) runs via ONNX Runtime on CPU/GPU. Both offer on-device inference but target different hardware paths.
Part 2: Load Benchmark DataΒΆ
Step 2: Load π₯ ondevice_tasks.csvΒΆ
The benchmark dataset contains results from running 15 tasks through NPU and cloud inference:
# ondevice_analysis.py
import pandas as pd
bench = pd.read_csv("lab-062/ondevice_tasks.csv")
print(f"Tasks: {len(bench)}")
print(f"Categories: {bench['category'].unique().tolist()}")
print(bench[["task_id", "category", "description", "npu_available"]].to_string(index=False))
Expected output:
Tasks: 15
Categories: ['summarize', 'classify', 'rewrite', 'text_to_table']
task_id category description npu_available
T01 summarize Meeting transcript summary True
T02 summarize Article digest True
T03 summarize Email thread summary True
T04 summarize Policy doc summary True
T05 classify Sentiment analysis True
T06 classify Intent detection True
T07 classify Priority assignment True
T08 classify Language detection True
T09 rewrite Tone adjustment True
T10 rewrite Simplification True
T11 rewrite Formalization True
T12 rewrite Translation (ENβES snippet) False
T13 text_to_table Invoice data extraction True
T14 text_to_table Resume parsing to table True
T15 text_to_table Schedule extraction to table True
Part 3: NPU AvailabilityΒΆ
Step 3: Check NPU readiness across tasksΒΆ
# NPU availability
available = bench["npu_available"].sum()
unavailable = len(bench) - available
print(f"NPU available: {available}/{len(bench)}")
print(f"NPU unavailable: {unavailable}")
# Which tasks have no NPU support?
no_npu = bench[bench["npu_available"] == False]
print("\nTasks without NPU support:")
print(no_npu[["task_id", "category", "description"]].to_string(index=False))
Expected output:
NPU available: 14/15
NPU unavailable: 1
Tasks without NPU support:
task_id category description
T12 rewrite Translation (ENβES snippet)
NPU Limitation
Translation (T12) is not available on the NPU β Phi Silica is optimized for English-language tasks and does not support cross-language translation on-device. Your agent must detect this and fall back to cloud inference.
Part 4: Quality Match AnalysisΒΆ
Step 4: Compare NPU vs cloud qualityΒΆ
# Quality match for NPU-available tasks only
npu_tasks = bench[bench["npu_available"] == True]
quality_match = npu_tasks["quality_match"].sum()
total_available = len(npu_tasks)
match_rate = quality_match / total_available * 100
print(f"Quality match (NPU-available tasks): {quality_match}/{total_available} = {match_rate:.0f}%")
# Which NPU-available tasks have quality mismatch?
mismatches = npu_tasks[npu_tasks["quality_match"] == False]
print("\nQuality mismatches (NPU available but lower quality):")
print(mismatches[["task_id", "category", "description"]].to_string(index=False))
Expected output:
Quality match (NPU-available tasks): 13/14 = 93%
Quality mismatches (NPU available but lower quality):
task_id category description
T04 summarize Policy doc summary
Quality Insight
93% of NPU-available tasks match cloud quality. The only mismatch is T04 (policy document summary) β a complex document that pushes the on-device model's context limits. For 13 of 14 available tasks, NPU quality is indistinguishable from cloud.
# Quality by category (NPU-available tasks only)
print("\nQuality match by category:")
for cat in npu_tasks["category"].unique():
cat_data = npu_tasks[npu_tasks["category"] == cat]
matches = cat_data["quality_match"].sum()
total = len(cat_data)
print(f" {cat:>13}: {matches}/{total}")
Expected output:
Part 5: Latency ComparisonΒΆ
Step 5: NPU vs cloud latencyΒΆ
# Average NPU latency (available tasks only)
npu_tasks = bench[bench["npu_available"] == True]
npu_avg = npu_tasks["npu_latency_ms"].mean()
cloud_avg = npu_tasks["cloud_latency_ms"].mean()
speedup = cloud_avg / npu_avg
print(f"NPU avg latency: {npu_avg:.1f}ms")
print(f"Cloud avg latency: {cloud_avg:.1f}ms")
print(f"Speedup: {speedup:.0f}Γ")
Expected output:
# Per-task latency comparison
print("\nPer-task latency (NPU-available only):")
for _, row in npu_tasks.iterrows():
print(f" {row['task_id']} ({row['category']:>13}): "
f"NPU={row['npu_latency_ms']:.0f}ms "
f"Cloud={row['cloud_latency_ms']:.0f}ms")
Latency Advantage
NPU inference averages 83.1ms β over 10Γ faster than cloud at 874.3ms. This is even faster than CPU-based ONNX Runtime (Lab 061's 82.3ms) because the NPU is purpose-built for AI workloads. For real-time agent experiences, this sub-100ms latency enables truly responsive interactions.
Part 6: Graceful Fallback StrategyΒΆ
Step 6: Implement fallback logicΒΆ
The correct pattern for on-device agents is: check readiness β attempt NPU β fall back to cloud:
// C# β Windows AI API pattern
async Task<string> RunAgentSkill(string input, SkillType skill)
{
// 1. Check NPU readiness for this skill
var readiness = await PhiSilicaModel.CheckReadinessAsync(skill);
if (readiness == AIReadiness.Available)
{
// 2. Run on NPU
return await PhiSilicaModel.InferAsync(input, skill);
}
else
{
// 3. Fall back to cloud
Console.WriteLine($"NPU unavailable for {skill}, falling back to cloud");
return await CloudModel.InferAsync(input, skill);
}
}
Anti-pattern: No Readiness Check
Never assume the NPU is available. Always call CheckReadinessAsync() first. Some tasks (like translation) are not supported on-device, and NPU availability can change based on hardware and driver state.
# Simulate fallback strategy
print("Fallback strategy simulation:")
for _, row in bench.iterrows():
if row["npu_available"]:
engine = "NPU"
latency = row["npu_latency_ms"]
else:
engine = "CLOUD (fallback)"
latency = row["cloud_latency_ms"]
print(f" {row['task_id']}: {engine:>20} β {latency:.0f}ms")
Part 7: Decision FrameworkΒΆ
Step 7: When to use on-device inferenceΒΆ
| Scenario | Recommended | Why |
|---|---|---|
| Offline operation | NPU | No internet required |
| Privacy-sensitive data | NPU | Data never leaves device |
| Real-time agent UX | NPU | Sub-100ms latency |
| Translation | Cloud | NPU doesn't support cross-language |
| Complex documents | Cloud (or NPU with fallback) | NPU may have quality gaps on complex inputs |
| Batch processing | NPU | Zero per-token cost at scale |
# Summary dashboard
print("""
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β On-Device Benchmark β Phi Silica (NPU) vs Cloud β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ£
β Metric NPU Cloud β
β βββββββββββββββββ βββ βββββ β
β Tasks supported 14/15 15/15 β
β Quality match (avail.) 93% baseline β
β Avg latency 83.1ms 874.3ms β
β Speedup 10Γ+ baseline β
β Privacy Full Data sent β
β Offline capable Yes No β
β βββββββββββββββββββββββββββββββββββββββββββββββββββββββ£
β Strategy: NPU-first with cloud fallback β
β Check readiness β attempt NPU β fall back if neededβ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
""")
π Bug-Fix ExerciseΒΆ
The file lab-062/broken_ondevice.py has 3 bugs in the on-device analysis functions. Run the self-tests:
You should see 3 failed tests:
| Test | What it checks | Hint |
|---|---|---|
| Test 1 | NPU availability count | Which column represents availability β npu_available or quality_match? |
| Test 2 | Speedup calculation | Is the ratio npu / cloud or cloud / npu? |
| Test 3 | Quality match filter | Are you filtering for npu_available == True before checking quality? |
Fix all 3 bugs and re-run until you see π All 3 tests passed.
π§ Knowledge CheckΒΆ
Q1 (Multiple Choice): What is the primary advantage of NPU-based inference with Phi Silica?
- A) Higher accuracy than all cloud models
- B) Fast inference without internet connectivity
- C) Support for all languages and modalities
- D) Unlimited context window size
β Reveal Answer
Correct: B) Fast inference without internet connectivity
The NPU enables on-device inference at ~83ms average β no network round-trip, no internet dependency, and full data privacy. It doesn't claim higher accuracy than cloud models (quality match is 93%), and it has limitations (e.g., no translation support). The key advantage is the combination of speed, privacy, and offline capability.
Q2 (Multiple Choice): What is the correct pattern for handling NPU unavailability in a production agent?
- A) Crash with an error message telling the user to upgrade hardware
- B) Always use cloud inference to avoid NPU issues entirely
- C) Check NPU readiness first, then fall back to cloud if unavailable
- D) Retry NPU inference 10 times before giving up
β Reveal Answer
Correct: C) Check NPU readiness first, then fall back to cloud if unavailable
The correct pattern is: check readiness β attempt NPU β fall back to cloud. This ensures the agent works on all hardware configurations and for all task types. Some tasks (like translation) are never available on NPU, and hardware availability can vary. A graceful fallback provides the best user experience β fast on-device when possible, reliable cloud when needed.
Q3 (Run the Lab): How many tasks have NPU unavailable?
Calculate (bench["npu_available"] == False).sum().
β Reveal Answer
1 task (T12 β Translation)
Only T12 (Translation ENβES snippet) lacks NPU support. All other 14 tasks β summarize, classify, rewrite, and text_to_table β can run on the NPU via Phi Silica. This means 93% of the benchmark tasks can run entirely on-device.
Q4 (Run the Lab): What is the quality match rate for NPU-available tasks?
Filter for npu_available == True, then calculate quality_match.sum() / len(filtered) * 100.
β Reveal Answer
93% (13/14)
Of the 14 tasks where NPU is available, 13 produce quality that matches cloud inference β a 93% match rate. The only mismatch is T04 (policy document summary), where the complex document exceeds the on-device model's effective context capacity. For the vast majority of tasks, on-device quality is indistinguishable from cloud.
Q5 (Run the Lab): What is the average NPU latency for available tasks?
Filter for npu_available == True, then calculate npu_latency_ms.mean().
β Reveal Answer
83.1ms
The average NPU latency across 14 available tasks is 83.1ms. Compared to the cloud average of 874.3ms, this represents a 10Γ+ speedup. Sub-100ms latency enables real-time agent interactions β the user perceives the response as instant. This latency advantage is the strongest argument for on-device inference in interactive agent experiences.
SummaryΒΆ
| Topic | What You Learned |
|---|---|
| Windows AI APIs | System-level APIs for on-device NPU inference |
| Phi Silica | Model optimized for Windows NPU hardware |
| NPU Availability | 14/15 tasks supported; translation requires cloud fallback |
| Quality Match | 93% of NPU-available tasks match cloud quality |
| Latency | NPU avg 83.1ms vs cloud 874.3ms β 10Γ+ faster |
| Fallback Pattern | Check readiness β NPU β cloud fallback |