Skip to content

Lab 061: SLMs β€” Phi-4 Mini for Low-Cost Agent SkillsΒΆ

Level: L200 Path: All paths Time: ~60 min πŸ’° Cost: Free β€” Uses mock benchmark data (no API key required)

What You'll LearnΒΆ

  • How Small Language Models (SLMs) like Phi-4 Mini compare to frontier models like GPT-4o
  • When SLMs offer a better trade-off: low latency, privacy, and zero cloud cost
  • Run ONNX Runtime inference locally for agent skills (classify, extract, summarize, route, draft)
  • Analyze a 15-task benchmark comparing Phi-4 Mini vs GPT-4o on accuracy, latency, and cost
  • Identify which task types SLMs handle well β€” and where they fall short
  • Apply a privacy-first inference strategy for sensitive workloads

IntroductionΒΆ

Frontier models like GPT-4o deliver exceptional quality, but they come with latency, cost, and privacy trade-offs. Small Language Models (SLMs) like Phi-4 Mini run locally via ONNX Runtime, offering dramatically lower latency, zero cloud cost, and full data privacy β€” your data never leaves the device.

The question isn't "which model is better" β€” it's "which tasks can an SLM handle just as well?" This lab uses a 15-task benchmark to find the answer.

The BenchmarkΒΆ

You'll compare Phi-4 Mini (local) vs GPT-4o (cloud) across 15 tasks in 5 categories:

Category Count Example
Classify 3 Sentiment analysis, intent detection, topic tagging
Extract 3 Entity extraction, key-value parsing, date normalization
Summarize 3 Meeting notes, article digest, support ticket summary
Route 3 Ticket routing, escalation decision, queue assignment
Draft 3 Email reply, report paragraph, product description

PrerequisitesΒΆ

pip install pandas

This lab analyzes pre-computed benchmark results β€” no API key, GPU, or ONNX Runtime installation required. To run live inference, you would need ONNX Runtime and the Phi-4 Mini ONNX model.


Quick Start with GitHub Codespaces

Open in GitHub Codespaces

All dependencies are pre-installed in the devcontainer.

πŸ“¦ Supporting FilesΒΆ

Download these files before starting the lab

Save all files to a lab-061/ folder in your working directory.

File Description Download
broken_slm.py Bug-fix exercise (3 bugs + self-tests) πŸ“₯ Download
slm_benchmark.csv Benchmark dataset πŸ“₯ Download

Part 1: Understanding SLMsΒΆ

Step 1: SLMs vs frontier modelsΒΆ

SLMs are compact models (typically 1–4B parameters) optimized for specific task patterns. They trade breadth for efficiency:

Frontier Model (GPT-4o):
  Cloud API β†’ [Large model] β†’ High accuracy, high latency, per-token cost

Small Language Model (Phi-4 Mini):
  Local ONNX β†’ [Compact model] β†’ Good accuracy, very low latency, zero cost

Key concepts:

Concept Description
SLM Small Language Model β€” compact model optimized for specific tasks
ONNX Runtime Cross-platform inference engine for running models locally
Privacy-first inference Data never leaves the device β€” critical for PII, health, finance
Task routing Directing simple tasks to SLMs and complex tasks to frontier models

When to consider SLMs

SLMs excel at well-defined, constrained tasks like classification, extraction, and routing. They struggle with open-ended creative tasks that require broad world knowledge. The ideal architecture routes each task to the right-sized model.


Part 2: Load Benchmark DataΒΆ

Step 2: Load πŸ“₯ slm_benchmark.csvΒΆ

The benchmark dataset contains results from running all 15 tasks through both models:

# slm_analysis.py
import pandas as pd

bench = pd.read_csv("lab-061/slm_benchmark.csv")

print(f"Tasks: {len(bench)}")
print(f"Categories: {bench['category'].unique().tolist()}")
print(bench[["task_id", "category", "description"]].to_string(index=False))

Expected output:

Tasks: 15
Categories: ['classify', 'extract', 'summarize', 'route', 'draft']

task_id  category                          description
    T01  classify                   Sentiment analysis
    T02  classify                     Intent detection
    T03  classify                        Topic tagging
    T04   extract                  Entity extraction
    T05   extract                  Key-value parsing
    T06   extract                Date normalization
    T07 summarize                     Meeting notes
    T08 summarize                    Article digest
    T09 summarize            Support ticket summary
    T10     draft                       Email reply
    T11     draft                  Report paragraph
    T12     draft              Product description
    T13     route                   Ticket routing
    T14 summarize       Compliance document summary
    T15     route              Escalation decision

Part 3: Accuracy ComparisonΒΆ

Step 3: Calculate accuracy for each modelΒΆ

# Overall accuracy
for model in ["phi4_mini", "gpt4o"]:
    correct = bench[f"{model}_correct"].sum()
    total = len(bench)
    print(f"{model:>10}: {correct}/{total} = {correct/total*100:.0f}%")

Expected output:

 phi4_mini: 12/15 = 80%
     gpt4o: 15/15 = 100%

Key Finding

Phi-4 Mini achieves 80% accuracy β€” solid for most agent tasks. GPT-4o gets everything right, but at much higher latency and cost. The 3 tasks Phi-4 Mini fails on reveal where SLMs hit their limits.

# Which tasks does Phi-4 Mini get wrong?
phi4_fails = bench[bench["phi4_mini_correct"] == False]
print("Phi-4 Mini failures:")
print(phi4_fails[["task_id", "category", "description"]].to_string(index=False))

Expected output:

Phi-4 Mini failures:
task_id  category                    description
    T10     draft                     Email reply
    T11     draft                Report paragraph
    T14 summarize  Compliance document summary

Phi-4 Mini fails on 2 draft tasks (T10, T11) and 1 summarize task (T14). Draft tasks require creative, nuanced writing β€” exactly where SLMs struggle. T14 is a complex compliance document that exceeds the model's context capacity.


Part 4: Latency ComparisonΒΆ

Step 4: Compare inference latencyΒΆ

# Average latency per model
for model in ["phi4_mini", "gpt4o"]:
    avg_ms = bench[f"{model}_latency_ms"].mean()
    print(f"{model:>10}: {avg_ms:.1f}ms average")

# Speedup
phi4_avg = bench["phi4_mini_latency_ms"].mean()
gpt4o_avg = bench["gpt4o_latency_ms"].mean()
speedup = gpt4o_avg / phi4_avg
print(f"\nPhi-4 Mini is {speedup:.0f}Γ— faster than GPT-4o")

Expected output:

 phi4_mini: 82.3ms average
     gpt4o: 996.7ms average

Phi-4 Mini is 12Γ— faster than GPT-4o

Latency Advantage

Phi-4 Mini runs locally via ONNX Runtime at 82.3ms average β€” 12Γ— faster than GPT-4o's cloud round-trip of ~1 second. For agent skills that execute repeatedly (classification, routing), this latency difference compounds dramatically.

# Per-task latency comparison
print("\nPer-task latency:")
for _, row in bench.iterrows():
    print(f"  {row['task_id']} ({row['category']:>9}): "
          f"Phi-4={row['phi4_mini_latency_ms']:.0f}ms  "
          f"GPT-4o={row['gpt4o_latency_ms']:.0f}ms")

Part 5: Cost AnalysisΒΆ

Step 5: Calculate cloud cost avoidedΒΆ

# Total cloud cost for GPT-4o
total_cost = bench["gpt4o_cost_usd"].sum()
print(f"Total GPT-4o cloud cost: ${total_cost:.4f}")
print(f"Phi-4 Mini local cost:   $0.0000")
print(f"Cost avoided by using SLM: ${total_cost:.4f}")

# Cost per category
print("\nCost by category:")
for cat in bench["category"].unique():
    cat_cost = bench[bench["category"] == cat]["gpt4o_cost_usd"].sum()
    print(f"  {cat:>9}: ${cat_cost:.4f}")

Expected output:

Total GPT-4o cloud cost: $0.0121
Phi-4 Mini local cost:   $0.0000
Cost avoided by using SLM: $0.0121

Cost by category:
  classify: $0.0018
   extract: $0.0021
 summarize: $0.0035
     route: $0.0015
     draft: $0.0032

While $0.0121 seems small for 15 tasks, at scale (thousands of agent invocations per day), the savings compound rapidly β€” and the privacy benefit is priceless for sensitive data.


Part 6: Task Routing StrategyΒΆ

Step 6: Build a routing decisionΒΆ

Based on the benchmark, the optimal strategy routes tasks by category:

Category Recommended Model Why
Classify Phi-4 Mini 100% accuracy, 12Γ— faster, zero cost
Extract Phi-4 Mini 100% accuracy, 12Γ— faster, zero cost
Route Phi-4 Mini 100% accuracy, 12Γ— faster, zero cost
Summarize Phi-4 Mini (with fallback) 2/3 correct; fall back to GPT-4o for complex docs
Draft GPT-4o SLM fails on creative writing β€” use frontier model
# Summary dashboard
print("""
╔══════════════════════════════════════════════════════╗
β•‘     SLM Benchmark β€” Phi-4 Mini vs GPT-4o            β•‘
╠══════════════════════════════════════════════════════╣
β•‘  Metric              Phi-4 Mini     GPT-4o          β•‘
β•‘  ─────────────       ──────────     ──────          β•‘
β•‘  Accuracy              80%          100%            β•‘
β•‘  Avg Latency           82.3ms       996.7ms         β•‘
β•‘  Speedup               12Γ—          baseline        β•‘
β•‘  Cloud Cost             $0           $0.0121        β•‘
β•‘  Privacy                Full         Data leaves    β•‘
╠══════════════════════════════════════════════════════╣
β•‘  Route: classify/extract/route β†’ SLM                β•‘
β•‘  Route: draft β†’ frontier model                      β•‘
β•‘  Route: summarize β†’ SLM with fallback               β•‘
β•šβ•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•β•
""")

πŸ› Bug-Fix ExerciseΒΆ

The file lab-061/broken_slm.py has 3 bugs in the SLM analysis functions. Run the self-tests:

python lab-061/broken_slm.py

You should see 3 failed tests:

Test What it checks Hint
Test 1 Accuracy calculation Which column represents correctness β€” _correct or _latency_ms?
Test 2 Cost calculation Are you summing _tokens or _cost_usd?
Test 3 Filtering failed tasks Are you filtering for category == "draft" or missing the filter entirely?

Fix all 3 bugs and re-run until you see πŸŽ‰ All 3 tests passed.


🧠 Knowledge Check¢

Q1 (Multiple Choice): What are the primary advantages of using an SLM like Phi-4 Mini over a frontier model like GPT-4o?
  • A) Higher accuracy on all task types
  • B) Low latency, data privacy, and zero cloud cost
  • C) Better creative writing and summarization
  • D) Larger context window and more parameters
βœ… Reveal Answer

Correct: B) Low latency, data privacy, and zero cloud cost

SLMs run locally via ONNX Runtime, delivering 12Γ— lower latency (82.3ms vs 996.7ms), keeping all data on-device for full privacy, and eliminating per-token cloud costs. They don't beat frontier models on accuracy (80% vs 100%), but for well-defined tasks like classification, extraction, and routing, the accuracy is sufficient and the operational benefits are significant.

Q2 (Multiple Choice): When should you NOT use an SLM like Phi-4 Mini?
  • A) For sentiment classification
  • B) For entity extraction
  • C) For complex creative writing tasks
  • D) For ticket routing
βœ… Reveal Answer

Correct: C) For complex creative writing tasks

The benchmark shows Phi-4 Mini fails on both draft tasks (T10: email reply, T11: report paragraph). Creative writing requires nuanced language generation, broad world knowledge, and stylistic flexibility β€” areas where SLMs lack the capacity of frontier models. Classify, extract, and route tasks are well-suited to SLMs.

Q3 (Run the Lab): What is Phi-4 Mini's accuracy on the 15-task benchmark?

Calculate bench["phi4_mini_correct"].sum() / len(bench) * 100.

βœ… Reveal Answer

80% (12/15)

Phi-4 Mini correctly handles 12 of 15 tasks. It achieves 100% accuracy on classify (3/3), extract (3/3), and route (3/3) tasks, but fails on 2 draft tasks (T10, T11) and 1 complex summarize task (T14). This 80% accuracy is sufficient for a task-routing architecture where only appropriate tasks are sent to the SLM.

Q4 (Run the Lab): How much faster is Phi-4 Mini compared to GPT-4o?

Calculate bench["gpt4o_latency_ms"].mean() / bench["phi4_mini_latency_ms"].mean().

βœ… Reveal Answer

~12Γ— faster

Phi-4 Mini averages 82.3ms per task via local ONNX Runtime inference, while GPT-4o averages 996.7ms including cloud round-trip. The ratio is 996.7 / 82.3 β‰ˆ 12Γ—. For agent pipelines that execute many skills sequentially, this latency reduction compounds β€” a 10-step agent pipeline drops from ~10 seconds to under 1 second.

Q5 (Run the Lab): How much total cloud cost is avoided by using Phi-4 Mini for all 15 tasks?

Calculate bench["gpt4o_cost_usd"].sum().

βœ… Reveal Answer

$0.0121

The total GPT-4o cloud cost across all 15 tasks is $0.0121. While this seems small, it scales linearly β€” 10,000 invocations per day would cost ~$8/day or ~$240/month. With Phi-4 Mini running locally, the cloud cost is exactly $0. The real value is often privacy rather than cost: for healthcare, finance, and legal workloads, keeping data on-device may be a compliance requirement.


SummaryΒΆ

Topic What You Learned
SLMs Compact models optimized for specific tasks β€” fast, private, free
Phi-4 Mini 80% accuracy on 15-task benchmark, 12Γ— faster than GPT-4o
ONNX Runtime Local inference engine β€” no cloud dependency
Task Routing Route classify/extract/route to SLM; draft to frontier model
Privacy SLM inference keeps all data on-device β€” critical for sensitive workloads
Cost $0.0121 cloud cost avoided per 15 tasks; compounds at scale

Next StepsΒΆ

  • Lab 062 β€” On-Device Agents with Phi Silica (NPU-accelerated on-device inference)
  • Lab 060 β€” Reasoning Models (when you need maximum accuracy over speed)
  • Lab 044 β€” Phi-4 with Ollama in Production (alternative local deployment)