Lab 078: Foundry Local β Run AI Models OfflineΒΆ
What You'll LearnΒΆ
- What Foundry Local is and how it enables offline AI model inference
- How to install and run models with
wingetand thefoundryCLI - How the OpenAI-compatible API makes Foundry Local a drop-in replacement
- Analyze a model catalog of 8 models β comparing sizes, hardware requirements, and quality
- Identify the smallest model and which models support CPU-only inference
IntroductionΒΆ
Foundry Local is Microsoft's local inference runtime that lets you run AI models entirely on your own hardware β no cloud, no API keys, no internet required. It's a free alternative to Ollama, optimized for Windows with DirectML GPU acceleration.
InstallationΒΆ
Running a ModelΒΆ
This downloads the model (if needed) and starts a local server with an OpenAI-compatible API at http://localhost:5273:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:5273/v1", api_key="not-needed")
response = client.chat.completions.create(
model="phi-4-mini",
messages=[{"role": "user", "content": "Explain quantum computing in 2 sentences."}]
)
print(response.choices[0].message.content)
The ScenarioΒΆ
You are a DevOps Engineer evaluating Foundry Local for air-gapped (offline) deployments. You have a catalog of 8 models (foundry_models.csv) with size, hardware requirements, and quality benchmarks. Your job: analyze the catalog, find the best model for different hardware profiles, and build a deployment recommendation.
Mock Data
This lab uses a mock model catalog CSV. The model names and sizes are representative of the models available in Foundry Local's catalog as of early 2026.
PrerequisitesΒΆ
| Requirement | Why |
|---|---|
| Python 3.10+ | Run the analysis scripts |
pandas library |
Data manipulation |
π¦ Supporting FilesΒΆ
Download these files before starting the lab
Save all files to a lab-078/ folder in your working directory.
| File | Description | Download |
|---|---|---|
broken_foundry_local.py |
Bug-fix exercise (3 bugs + self-tests) | π₯ Download |
foundry_models.csv |
8-model catalog with sizes, hardware, and quality scores | π₯ Download |
Step 1: Understand the Model CatalogΒΆ
Each model in the catalog has these attributes:
| Column | Description |
|---|---|
| model_name | Model identifier (e.g., phi-4-mini, qwen2.5-0.5b) |
| size_gb | Download size in gigabytes |
| parameters | Number of model parameters (e.g., 3.8B, 0.5B) |
| hardware | Required hardware: cpu_only, gpu_recommended, or gpu_required |
| quality_score | Benchmark quality score (0.0β1.0) |
| use_case | Primary use case: chat, coding, embedding, or general |
| quantization | Quantization level: q4, q8, or fp16 |
Step 2: Load and Explore the CatalogΒΆ
import pandas as pd
df = pd.read_csv("lab-078/foundry_models.csv")
print(f"Total models: {len(df)}")
print(f"Hardware requirements: {df['hardware'].value_counts().to_dict()}")
print(f"Use cases: {df['use_case'].value_counts().to_dict()}")
print(f"\nFull catalog:")
print(df[["model_name", "size_gb", "parameters", "hardware", "quality_score"]].to_string(index=False))
Expected output:
Total models: 8
Hardware requirements: {'gpu_recommended': 4, 'cpu_only': 2, 'gpu_required': 2}
Use cases: {'chat': 3, 'coding': 2, 'general': 2, 'embedding': 1}
Step 3: Find the Smallest ModelΒΆ
smallest = df.loc[df["size_gb"].idxmin()]
largest = df.loc[df["size_gb"].idxmax()]
print(f"Smallest model: {smallest['model_name']} ({smallest['size_gb']} GB)")
print(f" Parameters: {smallest['parameters']}")
print(f" Hardware: {smallest['hardware']}")
print(f" Quality: {smallest['quality_score']}")
print(f"\nLargest model: {largest['model_name']} ({largest['size_gb']} GB)")
print(f" Parameters: {largest['parameters']}")
print(f" Hardware: {largest['hardware']}")
print(f" Quality: {largest['quality_score']}")
print(f"\nSize range: {smallest['size_gb']} GB β {largest['size_gb']} GB")
Expected output:
Edge Deployment
qwen2.5-0.5b at just 0.4 GB is ideal for edge devices, IoT gateways, or machines with minimal storage. Despite its small size, it handles basic chat and summarization tasks reasonably well.
Step 4: Identify CPU-Only ModelsΒΆ
For air-gapped machines without GPUs:
cpu_models = df[df["hardware"] == "cpu_only"]
print(f"CPU-only models: {len(cpu_models)}\n")
for _, row in cpu_models.iterrows():
print(f" {row['model_name']:>20s} size={row['size_gb']}GB quality={row['quality_score']} use_case={row['use_case']}")
# Compare CPU-only vs GPU models
gpu_models = df[df["hardware"] != "cpu_only"]
print(f"\nCPU-only avg quality: {cpu_models['quality_score'].mean():.2f}")
print(f"GPU models avg quality: {gpu_models['quality_score'].mean():.2f}")
print(f"Quality gap: {(gpu_models['quality_score'].mean() - cpu_models['quality_score'].mean()) * 100:.1f}pp")
Quality Trade-off
CPU-only models are smaller and run anywhere, but their quality scores are typically lower than GPU models. For production use cases requiring high accuracy, prefer GPU-recommended models with at least 4 GB VRAM.
Step 5: Analyze by Use CaseΒΆ
print("Models by use case:\n")
for use_case, group in df.groupby("use_case"):
print(f" {use_case.upper()} ({len(group)} models):")
for _, row in group.iterrows():
print(f" {row['model_name']:>20s} {row['size_gb']}GB quality={row['quality_score']}")
print()
# Best model per use case
print("Best model per use case (by quality):")
for use_case, group in df.groupby("use_case"):
best = group.loc[group["quality_score"].idxmax()]
print(f" {use_case:>10s}: {best['model_name']} (quality={best['quality_score']}, size={best['size_gb']}GB)")
Step 6: Build the Deployment RecommendationΒΆ
report = f"""# π Foundry Local Deployment Recommendation
## Catalog Summary
| Metric | Value |
|--------|-------|
| Total Models | {len(df)} |
| CPU-Only | {len(cpu_models)} |
| GPU Recommended | {len(df[df['hardware'] == 'gpu_recommended'])} |
| GPU Required | {len(df[df['hardware'] == 'gpu_required'])} |
| Smallest | {smallest['model_name']} ({smallest['size_gb']} GB) |
| Largest | {largest['model_name']} ({largest['size_gb']} GB) |
## Hardware Profiles
### Profile A: Edge Device (CPU only, 2 GB storage)
"""
for _, row in cpu_models.iterrows():
report += f"- **{row['model_name']}** β {row['size_gb']} GB, quality {row['quality_score']}\n"
report += f"""
### Profile B: Developer Laptop (GPU, 16 GB storage)
"""
for _, row in df[df["hardware"] == "gpu_recommended"].iterrows():
report += f"- **{row['model_name']}** β {row['size_gb']} GB, quality {row['quality_score']}\n"
report += f"""
### Profile C: Workstation (High-end GPU, 64 GB storage)
"""
for _, row in df[df["hardware"] == "gpu_required"].iterrows():
report += f"- **{row['model_name']}** β {row['size_gb']} GB, quality {row['quality_score']}\n"
print(report)
with open("lab-078/deployment_recommendation.md", "w") as f:
f.write(report)
print("πΎ Saved to lab-078/deployment_recommendation.md")
π Bug-Fix ExerciseΒΆ
The file lab-078/broken_foundry_local.py contains 3 bugs that produce incorrect model analysis. Can you find and fix them all?
Run the self-tests to see which ones fail:
You should see 3 failed tests. Each test corresponds to one bug:
| Test | What it checks | Hint |
|---|---|---|
| Test 1 | Smallest model name | Should find min size_gb, not max |
| Test 2 | CPU-only model count | Should filter hardware == "cpu_only", not "gpu_required" |
| Test 3 | Total model count | Should use len(df), not a hardcoded value |
Fix all 3 bugs, then re-run. When you see All passed!, you're done!
π§ Knowledge CheckΒΆ
Q1 (Multiple Choice): What makes Foundry Local different from cloud-based AI services?
- A) It only supports Microsoft models
- B) It runs AI models entirely on local hardware with no internet required
- C) It requires an Azure subscription
- D) It only works on Linux
β Reveal Answer
Correct: B) It runs AI models entirely on local hardware with no internet required
Foundry Local is a local inference runtime β models are downloaded once and run entirely offline. It uses an OpenAI-compatible API, making it a drop-in replacement for cloud endpoints. No API keys, no internet, no per-token costs.
Q2 (Multiple Choice): Why does Foundry Local use an OpenAI-compatible API?
- A) It's built by OpenAI
- B) It enables drop-in replacement β existing code that calls OpenAI APIs works without changes
- C) OpenAI requires all inference engines to use their API format
- D) It only runs OpenAI models
β Reveal Answer
Correct: B) It enables drop-in replacement β existing code that calls OpenAI APIs works without changes
By exposing the same /v1/chat/completions endpoint format, Foundry Local lets developers switch from cloud to local inference by changing only the base_url. All existing SDKs, tools, and frameworks that speak the OpenAI API format work immediately.
Q3 (Run the Lab): What is the smallest model in the catalog, and how large is it?
Run the Step 3 analysis on π₯ foundry_models.csv to find the smallest model.
β Reveal Answer
qwen2.5-0.5b at 0.4 GB
The smallest model in the catalog is qwen2.5-0.5b with only 0.4 GB download size and 0.5B parameters. It runs on CPU only and achieves a quality score of 0.52 β suitable for basic chat and edge deployments.
Q4 (Run the Lab): How many models support CPU-only inference?
Run the Step 4 analysis to filter models with hardware == "cpu_only".
β Reveal Answer
2 models
Only 2 models support CPU-only inference. These are the smallest models in the catalog, optimized with aggressive quantization (q4) to run without GPU acceleration. They're ideal for edge devices and air-gapped environments.
Q5 (Run the Lab): How many total models are available in the Foundry Local catalog?
Load the CSV and check the total row count.
β Reveal Answer
8 models
The Foundry Local catalog includes 8 models across 4 use cases: chat (3), coding (2), general (2), and embedding (1). Hardware requirements range from CPU-only to GPU-required.
SummaryΒΆ
| Topic | What You Learned |
|---|---|
| Foundry Local | Microsoft's local inference runtime β free, offline, no API keys |
| Installation | winget install Microsoft.FoundryLocal + foundry model run |
| OpenAI Compatibility | Drop-in replacement via http://localhost:5273/v1 |
| Model Catalog | 8 models from 0.4 GB to multi-GB, CPU to GPU-required |
| Smallest Model | qwen2.5-0.5b at 0.4 GB β runs on CPU, ideal for edge |
| Hardware Profiles | CPU-only (2 models), GPU-recommended (4), GPU-required (2) |
Next StepsΒΆ
- Lab 074 β Foundry Agent Service (deploy agents using Foundry models)
- Lab 071 β Context Caching (optimize local inference with prompt caching)
- Lab 038 β AI Cost Optimization (compare local vs. cloud inference costs)
- Lab 076 β Microsoft Agent Framework (use Foundry Local as the inference backend for MAF agents)