Lab 078: Foundry Local — Run AI Models Offline

Level: L100 · Path: All paths · Time: ~45 min · 💰 Cost: Free — runs entirely on local hardware

What You'll Learn

  • What Foundry Local is and how it enables offline AI model inference
  • How to install and run models with winget and the foundry CLI
  • How the OpenAI-compatible API makes Foundry Local a drop-in replacement
  • How to analyze a catalog of 8 models — comparing sizes, hardware requirements, and quality
  • How to identify the smallest model and which models support CPU-only inference

Introduction

Foundry Local is Microsoft's local inference runtime that lets you run AI models entirely on your own hardware — no cloud, no API keys, no internet required. It's a free alternative to Ollama, optimized for Windows with DirectML GPU acceleration.

Installation

winget install Microsoft.FoundryLocal

Running a Model

foundry model run phi-4-mini

This downloads the model (if needed) and starts a local server with an OpenAI-compatible API at http://localhost:5273:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5273/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="phi-4-mini",
    messages=[{"role": "user", "content": "Explain quantum computing in 2 sentences."}]
)
print(response.choices[0].message.content)

The Scenario

You are a DevOps Engineer evaluating Foundry Local for air-gapped (offline) deployments. You have a catalog of 8 models (foundry_models.csv) with size, hardware requirements, and quality benchmarks. Your job: analyze the catalog, find the best model for different hardware profiles, and build a deployment recommendation.

Mock Data

This lab uses a mock model catalog CSV. The model names and sizes are representative of the models available in Foundry Local's catalog as of early 2026.

Prerequisites

| Requirement | Why |
|-------------|-----|
| Python 3.10+ | Run the analysis scripts |
| pandas library | Data manipulation |
pip install pandas

Quick Start with GitHub Codespaces

Open in GitHub Codespaces

All dependencies are pre-installed in the devcontainer.

📦 Supporting Files

Download these files before starting the lab

Save all files to a lab-078/ folder in your working directory.

| File | Description |
|------|-------------|
| broken_foundry_local.py | Bug-fix exercise (3 bugs + self-tests) |
| foundry_models.csv | 8-model catalog with sizes, hardware, and quality scores |

Step 1: Understand the Model Catalog

Each model in the catalog has these attributes:

| Column | Description |
|--------|-------------|
| model_name | Model identifier (e.g., phi-4-mini, qwen2.5-0.5b) |
| size_gb | Download size in gigabytes |
| parameters | Number of model parameters (e.g., 3.8B, 0.5B) |
| hardware | Required hardware: cpu_only, gpu_recommended, or gpu_required |
| quality_score | Benchmark quality score (0.0–1.0) |
| use_case | Primary use case: chat, coding, embedding, or general |
| quantization | Quantization level: q4, q8, or fp16 |
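To make the schema concrete before loading the real CSV, here is a minimal sketch that builds two rows with the same seven columns. The qwen2.5-0.5b values match the lab's Step 3 output; the phi-4-mini row is illustrative and not taken from foundry_models.csv:

```python
import pandas as pd

# Illustrative rows only: qwen2.5-0.5b's numbers come from the lab's
# Step 3 output; the phi-4-mini row is made up for demonstration.
mock = pd.DataFrame([
    {"model_name": "qwen2.5-0.5b", "size_gb": 0.4, "parameters": "0.5B",
     "hardware": "cpu_only", "quality_score": 0.52,
     "use_case": "chat", "quantization": "q4"},
    {"model_name": "phi-4-mini", "size_gb": 2.2, "parameters": "3.8B",
     "hardware": "gpu_recommended", "quality_score": 0.78,
     "use_case": "general", "quantization": "q4"},
])

# size_gb and quality_score parse as floats; the other columns are strings
print(mock.dtypes)
print(mock["hardware"].unique())
```

Note that parameters stays a string ("0.5B", "3.8B"), so numeric comparisons in the lab use size_gb and quality_score instead.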

Step 2: Load and Explore the Catalog

import pandas as pd

df = pd.read_csv("lab-078/foundry_models.csv")

print(f"Total models: {len(df)}")
print(f"Hardware requirements: {df['hardware'].value_counts().to_dict()}")
print(f"Use cases: {df['use_case'].value_counts().to_dict()}")
print("\nFull catalog:")
print(df[["model_name", "size_gb", "parameters", "hardware", "quality_score"]].to_string(index=False))

Expected output:

Total models: 8
Hardware requirements: {'gpu_recommended': 4, 'cpu_only': 2, 'gpu_required': 2}
Use cases: {'chat': 3, 'coding': 2, 'general': 2, 'embedding': 1}

Step 3: Find the Smallest Model

smallest = df.loc[df["size_gb"].idxmin()]
largest = df.loc[df["size_gb"].idxmax()]

print(f"Smallest model: {smallest['model_name']} ({smallest['size_gb']} GB)")
print(f"  Parameters: {smallest['parameters']}")
print(f"  Hardware: {smallest['hardware']}")
print(f"  Quality: {smallest['quality_score']}")

print(f"\nLargest model: {largest['model_name']} ({largest['size_gb']} GB)")
print(f"  Parameters: {largest['parameters']}")
print(f"  Hardware: {largest['hardware']}")
print(f"  Quality: {largest['quality_score']}")

print(f"\nSize range: {smallest['size_gb']} GB – {largest['size_gb']} GB")

Expected output:

Smallest model: qwen2.5-0.5b (0.4 GB)
  Parameters: 0.5B
  Hardware: cpu_only
  Quality: 0.52

Edge Deployment

qwen2.5-0.5b at just 0.4 GB is ideal for edge devices, IoT gateways, or machines with minimal storage. Despite its small size, it handles basic chat and summarization tasks reasonably well.
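A quick way to shortlist candidates for such constrained devices is to combine the hardware filter with a storage budget. This is a minimal sketch: apart from qwen2.5-0.5b, the model names and numbers below are invented for illustration, and in the lab you would run the same filter against the DataFrame loaded from foundry_models.csv:

```python
import pandas as pd

# Illustrative catalog slice; only qwen2.5-0.5b's numbers come from the lab
df = pd.DataFrame([
    {"model_name": "qwen2.5-0.5b", "size_gb": 0.4, "hardware": "cpu_only",     "quality_score": 0.52},
    {"model_name": "tiny-chat-q4", "size_gb": 1.1, "hardware": "cpu_only",     "quality_score": 0.58},
    {"model_name": "big-coder-q8", "size_gb": 9.0, "hardware": "gpu_required", "quality_score": 0.85},
])

STORAGE_BUDGET_GB = 2.0  # Profile A: edge device with 2 GB of free storage

# Edge candidates must run without a GPU and fit within the storage budget
edge_ok = df[(df["hardware"] == "cpu_only") & (df["size_gb"] <= STORAGE_BUDGET_GB)]
print(edge_ok["model_name"].tolist())  # ['qwen2.5-0.5b', 'tiny-chat-q4']
```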


Step 4: Identify CPU-Only Models

For air-gapped machines without GPUs:

cpu_models = df[df["hardware"] == "cpu_only"]
print(f"CPU-only models: {len(cpu_models)}\n")
for _, row in cpu_models.iterrows():
    print(f"  {row['model_name']:>20s}  size={row['size_gb']}GB  quality={row['quality_score']}  use_case={row['use_case']}")
# Compare CPU-only vs GPU models
gpu_models = df[df["hardware"] != "cpu_only"]
print(f"\nCPU-only avg quality: {cpu_models['quality_score'].mean():.2f}")
print(f"GPU models avg quality: {gpu_models['quality_score'].mean():.2f}")
print(f"Quality gap: {(gpu_models['quality_score'].mean() - cpu_models['quality_score'].mean()) * 100:.1f}pp")

Quality Trade-off

CPU-only models are smaller and run anywhere, but their quality scores are typically lower than GPU models. For production use cases requiring high accuracy, prefer GPU-recommended models with at least 4 GB VRAM.
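One way to quantify that trade-off is quality per gigabyte of download size, which rewards small models even when their absolute quality is lower. The numbers below are illustrative rather than taken from foundry_models.csv; in the lab you would add the same column to the loaded df:

```python
import pandas as pd

# Illustrative rows; real scores come from foundry_models.csv
df = pd.DataFrame([
    {"model_name": "small-cpu", "size_gb": 0.4,  "quality_score": 0.52},
    {"model_name": "mid-gpu",   "size_gb": 4.0,  "quality_score": 0.78},
    {"model_name": "large-gpu", "size_gb": 12.0, "quality_score": 0.88},
])

# Quality per GB: the small CPU model wins this metric despite
# having the lowest absolute quality score
df["quality_per_gb"] = df["quality_score"] / df["size_gb"]
ranked = df.sort_values("quality_per_gb", ascending=False)
print(ranked[["model_name", "quality_per_gb"]].to_string(index=False))
```

Which metric matters depends on the deployment: storage-constrained edge devices care about quality per GB, while workstations can simply maximize quality_score.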


Step 5: Analyze by Use Case

print("Models by use case:\n")
for use_case, group in df.groupby("use_case"):
    print(f"  {use_case.upper()} ({len(group)} models):")
    for _, row in group.iterrows():
        print(f"    {row['model_name']:>20s}  {row['size_gb']}GB  quality={row['quality_score']}")
    print()

# Best model per use case
print("Best model per use case (by quality):")
for use_case, group in df.groupby("use_case"):
    best = group.loc[group["quality_score"].idxmax()]
    print(f"  {use_case:>10s}: {best['model_name']} (quality={best['quality_score']}, size={best['size_gb']}GB)")

Step 6: Build the Deployment Recommendation

report = f"""# 📋 Foundry Local Deployment Recommendation

## Catalog Summary
| Metric | Value |
|--------|-------|
| Total Models | {len(df)} |
| CPU-Only | {len(cpu_models)} |
| GPU Recommended | {len(df[df['hardware'] == 'gpu_recommended'])} |
| GPU Required | {len(df[df['hardware'] == 'gpu_required'])} |
| Smallest | {smallest['model_name']} ({smallest['size_gb']} GB) |
| Largest | {largest['model_name']} ({largest['size_gb']} GB) |

## Hardware Profiles

### Profile A: Edge Device (CPU only, 2 GB storage)
"""

for _, row in cpu_models.iterrows():
    report += f"- **{row['model_name']}** — {row['size_gb']} GB, quality {row['quality_score']}\n"

report += f"""
### Profile B: Developer Laptop (GPU, 16 GB storage)
"""

for _, row in df[df["hardware"] == "gpu_recommended"].iterrows():
    report += f"- **{row['model_name']}** — {row['size_gb']} GB, quality {row['quality_score']}\n"

report += f"""
### Profile C: Workstation (High-end GPU, 64 GB storage)
"""

for _, row in df[df["hardware"] == "gpu_required"].iterrows():
    report += f"- **{row['model_name']}** — {row['size_gb']} GB, quality {row['quality_score']}\n"

print(report)

with open("lab-078/deployment_recommendation.md", "w") as f:
    f.write(report)
print("💾 Saved to lab-078/deployment_recommendation.md")

πŸ› Bug-Fix ExerciseΒΆ

The file lab-078/broken_foundry_local.py contains 3 bugs that produce incorrect model analysis. Can you find and fix them all?

Run the self-tests to see which ones fail:

python lab-078/broken_foundry_local.py

You should see 3 failed tests. Each test corresponds to one bug:

| Test | What it checks | Hint |
|------|----------------|------|
| Test 1 | Smallest model name | Should find min size_gb, not max |
| Test 2 | CPU-only model count | Should filter hardware == "cpu_only", not "gpu_required" |
| Test 3 | Total model count | Should use len(df), not a hardcoded value |

Fix all 3 bugs, then re-run. When you see All passed!, you're done!
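The real bugs live in the downloaded file, but the pitfall behind Test 1 is easy to reproduce in isolation: idxmax on size_gb silently returns the largest model instead of the smallest. A minimal sketch with two made-up rows:

```python
import pandas as pd

# Two made-up rows; only the size_gb column matters for this pitfall
df = pd.DataFrame({
    "model_name": ["a-large", "b-small"],
    "size_gb": [9.0, 0.4],
})

# Bug: idxmax returns the row label of the LARGEST size
wrong = df.loc[df["size_gb"].idxmax(), "model_name"]  # "a-large"
# Fix: idxmin returns the row label of the smallest size
right = df.loc[df["size_gb"].idxmin(), "model_name"]  # "b-small"

print(f"buggy answer: {wrong}, fixed answer: {right}")
```

The code runs either way and prints a plausible-looking model name, which is exactly why the self-tests matter: only an assertion on the expected value catches the swap.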


🧠 Knowledge Check

Q1 (Multiple Choice): What makes Foundry Local different from cloud-based AI services?
  • A) It only supports Microsoft models
  • B) It runs AI models entirely on local hardware with no internet required
  • C) It requires an Azure subscription
  • D) It only works on Linux
✅ Reveal Answer

Correct: B) It runs AI models entirely on local hardware with no internet required

Foundry Local is a local inference runtime — models are downloaded once and run entirely offline. It uses an OpenAI-compatible API, making it a drop-in replacement for cloud endpoints. No API keys, no internet, no per-token costs.

Q2 (Multiple Choice): Why does Foundry Local use an OpenAI-compatible API?
  • A) It's built by OpenAI
  • B) It enables drop-in replacement — existing code that calls OpenAI APIs works without changes
  • C) OpenAI requires all inference engines to use their API format
  • D) It only runs OpenAI models
✅ Reveal Answer

Correct: B) It enables drop-in replacement — existing code that calls OpenAI APIs works without changes

By exposing the same /v1/chat/completions endpoint format, Foundry Local lets developers switch from cloud to local inference by changing only the base_url. All existing SDKs, tools, and frameworks that speak the OpenAI API format work immediately.

Q3 (Run the Lab): What is the smallest model in the catalog, and how large is it?

Run the Step 3 analysis on foundry_models.csv to find the smallest model.

✅ Reveal Answer

qwen2.5-0.5b at 0.4 GB

The smallest model in the catalog is qwen2.5-0.5b with only 0.4 GB download size and 0.5B parameters. It runs on CPU only and achieves a quality score of 0.52 — suitable for basic chat and edge deployments.

Q4 (Run the Lab): How many models support CPU-only inference?

Run the Step 4 analysis to filter models with hardware == "cpu_only".

✅ Reveal Answer

2 models

Only 2 models support CPU-only inference. These are the smallest models in the catalog, optimized with aggressive quantization (q4) to run without GPU acceleration. They're ideal for edge devices and air-gapped environments.

Q5 (Run the Lab): How many total models are available in the Foundry Local catalog?

Load the CSV and check the total row count.

✅ Reveal Answer

8 models

The Foundry Local catalog includes 8 models across 4 use cases: chat (3), coding (2), general (2), and embedding (1). Hardware requirements range from CPU-only to GPU-required.


Summary

| Topic | What You Learned |
|-------|------------------|
| Foundry Local | Microsoft's local inference runtime — free, offline, no API keys |
| Installation | winget install Microsoft.FoundryLocal + foundry model run |
| OpenAI Compatibility | Drop-in replacement via http://localhost:5273/v1 |
| Model Catalog | 8 models from 0.4 GB to multi-GB, CPU-only to GPU-required |
| Smallest Model | qwen2.5-0.5b at 0.4 GB — runs on CPU, ideal for edge |
| Hardware Profiles | CPU-only (2 models), GPU-recommended (4), GPU-required (2) |

Next Steps

  • Lab 074 — Foundry Agent Service (deploy agents using Foundry models)
  • Lab 071 — Context Caching (optimize local inference with prompt caching)
  • Lab 038 — AI Cost Optimization (compare local vs. cloud inference costs)
  • Lab 076 — Microsoft Agent Framework (use Foundry Local as the inference backend for MAF agents)