# Lab 015: Ollama – Run LLMs Locally for Free
> **Also try Foundry Local**
> Microsoft Foundry Local is an alternative to Ollama with an OpenAI-compatible API. See Lab 078: Foundry Local for a hands-on guide.
## What You'll Learn
- Install and run Ollama to serve LLMs locally
- Run Phi-4 (Microsoft's powerful small model) and Llama 3.2 on your own machine
- Generate text embeddings locally with `nomic-embed-text`
- Call Ollama from Python and C# using the OpenAI-compatible API
- Use Ollama as the LLM backend for Semantic Kernel (no API key needed)
## Introduction
Ollama is an open-source tool that makes running LLMs on your laptop as easy as `ollama run phi4`. No API key, no cloud account, no usage costs – just your own hardware.
This is valuable for:

- **Privacy**: sensitive data never leaves your machine
- **Offline development**: works without internet
- **Cost control**: zero API costs during development
- **Learning**: experiment freely without worrying about bills
**Hardware requirements**

- Ollama works on Mac (Apple Silicon or Intel), Windows, and Linux.
- For best performance: 16GB+ RAM. It works with 8GB, just more slowly.
- A GPU is optional – models also run on CPU, only slower.
## Prerequisites Setup
### Install Ollama
- Go to ollama.com and download the installer for your OS
- Install and verify:
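A quick terminal check confirms the install. These are the standard Ollama CLI commands:

```
ollama --version   # prints the installed CLI version
ollama list        # lists downloaded models (empty on a fresh install)
```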
Ollama runs as a background service on http://localhost:11434.
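As a programmatic sanity check, you can hit the native `/api/tags` endpoint, which returns the locally installed models. A minimal stdlib-only sketch (the helper name `installed_models` is mine), assuming the default port:

```python
import json
import urllib.request

OLLAMA_BASE = "http://localhost:11434"

def installed_models(base_url: str = OLLAMA_BASE) -> list[str]:
    """Return names of locally installed models, or [] if Ollama isn't reachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
            payload = json.load(resp)
    except OSError:  # connection refused, timeout, DNS failure, ...
        return []
    return [m["name"] for m in payload.get("models", [])]

if __name__ == "__main__":
    names = installed_models()
    if names:
        print("Ollama is up. Installed models:", ", ".join(names))
    else:
        print("No models found - is Ollama running?")
```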
## 📦 Supporting Files
Download these files before starting the lab
Save all files to a `lab-015/` folder in your working directory.
| File | Description | Download |
|---|---|---|
| `Modelfile` | Ollama model configuration | 📥 Download |
| `chat_starter.py` | Starter script with TODOs | 📥 Download |
## Lab Exercise
### Step 1: Run your first model
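The launch command is the same one-liner quoted in the introduction:

```
ollama run phi4
```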
This downloads Phi-4 (~9GB) on first run, then starts an interactive chat.
```
>>> What are AI agents?
AI agents are autonomous systems that use LLMs as their reasoning engine...
>>> /bye
```
Other models to try:
```
ollama run llama3.2       # Meta Llama 3.2 3B – fast, small
ollama run llama3.2:1b    # Even smaller, very fast
ollama run mistral        # Mistral 7B – good balance
ollama run deepseek-r1    # Reasoning model (like o1)
ollama run phi4-mini      # Phi-4 Mini – faster, less RAM
```
Check what you have installed:
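`ollama list` is the built-in listing command:

```
ollama list
```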
### Step 2: Pull an embedding model
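Pulling follows the same pattern as the chat models, using the model name this lab relies on:

```
ollama pull nomic-embed-text
```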
This gives you a free local embedding model – perfect for RAG without any API costs.
### Step 3: Call Ollama from Python
Ollama exposes an OpenAI-compatible API, so the same code that calls GitHub Models or Azure OpenAI works here:
```python
from openai import OpenAI

# Point to local Ollama instead of OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the client, but the value doesn't matter
)

response = client.chat.completions.create(
    model="phi4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the difference between RAG and fine-tuning in 3 sentences."},
    ],
    temperature=0.3,
)

print(response.choices[0].message.content)
```
### Step 4: Generate embeddings locally
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
)

response = client.embeddings.create(
    model="nomic-embed-text",
    input="waterproof hiking boots for mountain trails",
)

vector = response.data[0].embedding
print(f"Dimensions: {len(vector)}")  # 768
print(f"First 5: {vector[:5]}")
```
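Local embeddings pair naturally with a similarity measure. A minimal cosine-similarity helper (plain stdlib; the function name is mine) shows how two embedded texts would be compared in a RAG pipeline:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Compare two embeddings returned by the step above, e.g.:
#   sim = cosine_similarity(boots_vector, tent_vector)
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical vectors -> 1.0
```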
### Step 5: Use Ollama with Semantic Kernel
Because Ollama is OpenAI-compatible, plugging it into Semantic Kernel is trivial:
```python
import asyncio

from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import OpenAIChatCompletion
from semantic_kernel.contents import ChatHistory


async def main():
    kernel = Kernel()

    # Use Ollama instead of GitHub Models – just change base_url and model
    kernel.add_service(
        OpenAIChatCompletion(
            ai_model_id="phi4",
            api_key="ollama",
            base_url="http://localhost:11434/v1",
        )
    )

    # The rest of your agent code is identical!
    history = ChatHistory()
    history.add_system_message("You are a helpful AI assistant.")
    history.add_user_message("What is Semantic Kernel?")

    chat = kernel.get_service(type=OpenAIChatCompletion)
    result = await chat.get_chat_message_content(
        chat_history=history,
        settings=kernel.get_prompt_execution_settings_from_service_id("default"),
    )
    print(result)

asyncio.run(main())
```
The same setup in C#:

```csharp
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.ChatCompletion;

var builder = Kernel.CreateBuilder();
builder.AddOpenAIChatCompletion(
    modelId: "phi4",
    apiKey: "ollama",
    endpoint: new Uri("http://localhost:11434/v1")
);
var kernel = builder.Build();

var chat = kernel.GetRequiredService<IChatCompletionService>();
var history = new ChatHistory("You are a helpful AI assistant.");
history.AddUserMessage("What is Semantic Kernel?");

var response = await chat.GetChatMessageContentAsync(history);
Console.WriteLine(response.Content);
```
### Step 6: Use Ollama as an MCP server backend
Since Ollama is OpenAI-compatible, any MCP server that calls an LLM can use it locally. Just swap the client configuration:
```python
# In your MCP server's config.py
LLM_BASE_URL = "http://localhost:11434/v1"
LLM_MODEL = "phi4"
EMBED_MODEL = "nomic-embed-text"
LLM_API_KEY = "ollama"
```
No other code changes needed.
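One common way to make that swap painless (a sketch with a hypothetical helper and env-var names, not the lab's actual `config.py`) is to read the settings from the environment, falling back to the local Ollama values shown above:

```python
import os

# Hypothetical config helper: environment variables override the local
# Ollama defaults, so the same server runs against any OpenAI-compatible API.
def llm_config() -> dict:
    return {
        "base_url": os.environ.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        "model": os.environ.get("LLM_MODEL", "phi4"),
        "embed_model": os.environ.get("EMBED_MODEL", "nomic-embed-text"),
        "api_key": os.environ.get("LLM_API_KEY", "ollama"),
    }

print(llm_config())
```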
### Step 7: Ollama via REST API directly
You can also call Ollama's native REST API, which is separate from the OpenAI-compatible endpoint:
```
curl http://localhost:11434/api/chat -d '{
  "model": "phi4",
  "messages": [
    {"role": "user", "content": "Why is the sky blue?"}
  ],
  "stream": false
}'
```
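The same call can be made from Python's standard library. A sketch (the helper names are mine) that builds the identical JSON body and posts it to the native endpoint:

```python
import json
import urllib.request

def chat_payload(prompt: str, model: str = "phi4") -> dict:
    """Build the JSON body for Ollama's native /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ollama_chat(prompt: str, model: str = "phi4",
                base_url: str = "http://localhost:11434") -> str:
    """POST to /api/chat and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(chat_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

# ollama_chat("Why is the sky blue?")  # requires Ollama running locally
```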
## 📁 Starter Files
Two files are provided to help you follow along:
```
# Chat with any local model
python chat_starter.py

# Create the OutdoorGear custom model first:
ollama create outdoorgear -f Modelfile
ollama run outdoorgear
```
The `Modelfile` creates a custom OutdoorGear Advisor persona on top of Phi-4. The `chat_starter.py` script has 5 exercises covering basic completion, custom models, comparison, and streaming.
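For orientation, an Ollama Modelfile generally has this shape. This is an illustrative sketch using standard Modelfile directives, not the lab's actual downloaded file:

```
FROM phi4
PARAMETER temperature 0.7
SYSTEM """You are the OutdoorGear Advisor, a friendly expert on outdoor equipment."""
```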
## Model Comparison (on a typical laptop)
| Model | Size | RAM needed | Speed | Quality |
|---|---|---|---|---|
| `phi4-mini` | 2.5GB | 4GB | ⚡⚡⚡ Fast | Good |
| `llama3.2:1b` | 1.3GB | 4GB | ⚡⚡⚡ Very fast | Basic |
| `llama3.2` | 2.0GB | 6GB | ⚡⚡ Fast | Good |
| `phi4` | 9.1GB | 12GB | ⚡ Moderate | Excellent |
| `mistral` | 4.1GB | 8GB | ⚡⚡ Fast | Very good |
| `deepseek-r1` | 4.7GB | 8GB | ⚡ Moderate | Best reasoning |
## Summary
You now have a fully local LLM stack:
- ✅ Ollama serving models on `localhost:11434`
- ✅ Phi-4 (or Llama) for chat/reasoning – free, private, offline
- ✅ `nomic-embed-text` for embeddings – free, local
- ✅ The same code works for Ollama, GitHub Models, and Azure OpenAI – just change the base URL
## Next Steps
- Build a RAG app with local embeddings: → Lab 022 – RAG with GitHub Models + pgvector
- Use with Semantic Kernel plugins: → Lab 023 – SK Plugins, Memory & Planners
- Production local AI: → Lab 044 – Phi-4 + Ollama in Production