Lab 038: AI Cost OptimizationΒΆ
What You'll LearnΒΆ
- Understand AI API cost drivers (tokens, model choice, request frequency)
- Apply 5 key strategies to reduce costs: caching, batching, model routing, prompt compression, and streaming
- Implement semantic caching to avoid duplicate LLM calls
- Build a model router that uses cheap models for simple queries and expensive models for complex ones
- Monitor and alert on cost with Azure Monitor
IntroductionΒΆ
A production AI agent can easily cost $50β500/month even with modest usage if not optimized. The good news: most cost reduction strategies also improve latency and user experience.
The key cost levers:
| Factor | Impact | Optimization |
|---|---|---|
| Model choice | 10β50Γ cost difference | Route to cheaper model when possible |
| Token count | Linear cost | Compress prompts, trim history |
| Cache hits | Eliminate cost entirely | Semantic cache for repeated queries |
| Request volume | Linear cost | Batch, debounce, deduplicate |
| Output length | Linear cost | Use max_tokens, structured output |
Part 1: Measure Your BaselineΒΆ
Before optimizing, measure what you're spending:
# cost_tracker.py
from openai import OpenAI
import os
# Model pricing (USD per 1M tokens, as of 2025)
# Update these from: https://azure.microsoft.com/pricing/details/cognitive-services/openai-service/
MODEL_PRICING = {
"gpt-4o": {"input": 2.50, "output": 10.00},
"gpt-4o-mini": {"input": 0.15, "output": 0.60},
"gpt-3.5-turbo":{"input": 0.50, "output": 1.50},
}
class CostTracker:
def __init__(self):
self.total_input_tokens = 0
self.total_output_tokens = 0
self.total_requests = 0
self.total_cost_usd = 0.0
self.calls = []
def track(self, model: str, usage):
pricing = MODEL_PRICING.get(model, {"input": 0.01, "output": 0.03})
input_cost = (usage.prompt_tokens / 1_000_000) * pricing["input"]
output_cost = (usage.completion_tokens / 1_000_000) * pricing["output"]
call_cost = input_cost + output_cost
self.total_input_tokens += usage.prompt_tokens
self.total_output_tokens += usage.completion_tokens
self.total_requests += 1
self.total_cost_usd += call_cost
self.calls.append({"model": model, "cost_usd": call_cost})
return call_cost
def summary(self) -> str:
return (
f"Requests: {self.total_requests}\n"
f"Input tokens: {self.total_input_tokens:,}\n"
f"Output tokens: {self.total_output_tokens:,}\n"
f"Estimated cost: ${self.total_cost_usd:.6f}"
)
tracker = CostTracker()
client = OpenAI(
base_url="https://models.inference.ai.azure.com",
api_key=os.environ["GITHUB_TOKEN"],
)
def chat(messages, model="gpt-4o-mini") -> str:
response = client.chat.completions.create(model=model, messages=messages, max_tokens=200)
tracker.track(model, response.usage)
return response.choices[0].message.content
# Baseline: 10 product queries
queries = [
"What tents do you have?",
"Show me tents", # near-duplicate
"List your tent products", # near-duplicate
"What sleeping bags are available?",
"Tell me about sleeping bags", # near-duplicate
"Do you have winter sleeping bags?",
"What backpacks do you sell?",
"Tell me about backpacks", # near-duplicate
"What's the cheapest backpack?",
"Which backpack is most affordable?",# near-duplicate
]
print("=== BASELINE (no optimization) ===")
for query in queries:
answer = chat([
{"role": "system", "content": "You are an OutdoorGear product advisor."},
{"role": "user", "content": query},
])
print(f"Q: {query[:50]:<50} β {answer[:40]}...")
print("\n" + tracker.summary())
Part 2: Strategy 1 β Semantic CacheΒΆ
A semantic cache stores LLM responses and reuses them for semantically similar queries, even when the wording differs:
# semantic_cache.py
import json
import math
from openai import OpenAI
import os
client = OpenAI(
base_url="https://models.inference.ai.azure.com",
api_key=os.environ["GITHUB_TOKEN"],
)
def get_embedding(text: str) -> list[float]:
response = client.embeddings.create(model="text-embedding-3-small", input=text)
return response.data[0].embedding
def cosine_similarity(a: list[float], b: list[float]) -> float:
dot = sum(x*y for x, y in zip(a, b))
mag_a = math.sqrt(sum(x*x for x in a))
mag_b = math.sqrt(sum(x*x for x in b))
return dot / (mag_a * mag_b) if mag_a and mag_b else 0.0
class SemanticCache:
def __init__(self, similarity_threshold: float = 0.92):
self.threshold = similarity_threshold
self.cache: list[dict] = [] # [{query, embedding, response}]
self.hits = 0
self.misses = 0
def get(self, query: str) -> str | None:
query_emb = get_embedding(query)
for entry in self.cache:
score = cosine_similarity(query_emb, entry["embedding"])
if score >= self.threshold:
self.hits += 1
return entry["response"]
self.misses += 1
return None
def set(self, query: str, response: str) -> None:
embedding = get_embedding(query)
self.cache.append({"query": query, "embedding": embedding, "response": response})
@property
def hit_rate(self) -> float:
total = self.hits + self.misses
return self.hits / total if total > 0 else 0.0
cache = SemanticCache(similarity_threshold=0.92)
def chat_with_cache(query: str, system: str = "You are an OutdoorGear product advisor.") -> tuple[str, bool]:
cached = cache.get(query)
if cached:
return cached, True # (response, from_cache)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "system", "content": system}, {"role": "user", "content": query}],
max_tokens=200,
)
answer = response.choices[0].message.content
cache.set(query, answer)
return answer, False # (response, from_cache)
# Test with near-duplicates
print("=== WITH SEMANTIC CACHE ===")
for query in queries: # reuse queries from Part 1
answer, from_cache = chat_with_cache(query)
status = "πΎ CACHE" if from_cache else "π API"
print(f"{status} | {query[:45]:<45}")
print(f"\nCache hit rate: {cache.hit_rate:.0%}")
print(f"API calls saved: {cache.hits}")
Part 3: Strategy 2 β Model RouterΒΆ
Use cheap models for simple queries, expensive models for complex ones:
# model_router.py
import os
from openai import OpenAI
client = OpenAI(
base_url="https://models.inference.ai.azure.com",
api_key=os.environ["GITHUB_TOKEN"],
)
CHEAP_MODEL = "gpt-4o-mini" # $0.15/M input β for simple Q&A
PREMIUM_MODEL = "gpt-4o" # $2.50/M input β for complex reasoning
# Signals that indicate a complex query needing the premium model
COMPLEX_SIGNALS = [
"compare", "recommend", "which is better", "analyze", "explain why",
"trade-off", "pros and cons", "for my specific", "help me decide",
]
def route_query(query: str) -> str:
"""Return the appropriate model name based on query complexity."""
query_lower = query.lower()
if any(signal in query_lower for signal in COMPLEX_SIGNALS):
return PREMIUM_MODEL
return CHEAP_MODEL
def smart_chat(query: str, system: str = "You are an OutdoorGear product advisor.") -> dict:
model = route_query(query)
response = client.chat.completions.create(
model=model,
messages=[{"role": "system", "content": system}, {"role": "user", "content": query}],
max_tokens=300,
)
return {
"answer": response.choices[0].message.content,
"model": model,
"tokens": response.usage.total_tokens,
}
# Test routing
test_queries = [
("What tents do you have?", "β simple listing"),
("What's the price of the DayHiker 22L?", "β simple lookup"),
("Compare the TrailBlazer Tent 2P vs Summit Dome 4P", "β comparison"),
("I'm going on a 2-week alpine climb in January, which sleeping bag and tent should I buy and why?",
"β complex recommendation"),
]
print("=== MODEL ROUTER ===")
for query, description in test_queries:
result = smart_chat(query)
model_label = "π GPT-4o" if result["model"] == PREMIUM_MODEL else "β‘ GPT-4o-mini"
print(f"\n{model_label} {description}")
print(f" Q: {query[:70]}")
print(f" A: {result['answer'][:80]}...")
Part 4: Strategy 3 β Prompt CompressionΒΆ
Trim conversation history to keep only the most recent + summary:
# prompt_compression.py
MAX_HISTORY_TURNS = 3 # Keep last N turns before summarizing older ones
def compress_history(messages: list[dict], max_turns: int = MAX_HISTORY_TURNS) -> list[dict]:
"""
Keep the system message + last `max_turns` conversation pairs.
Summarize older turns into a single context message.
"""
system = [m for m in messages if m["role"] == "system"]
conversation = [m for m in messages if m["role"] != "system"]
if len(conversation) <= max_turns * 2:
return messages # No compression needed
older = conversation[:-max_turns * 2]
recent = conversation[-max_turns * 2:]
# Summarize older turns (in production: call the LLM to summarize)
summary_lines = []
for i in range(0, len(older), 2):
if i + 1 < len(older):
summary_lines.append(f"User asked: {older[i]['content'][:60]}...")
summary_msg = {
"role": "system",
"content": f"[Earlier conversation summary: {'; '.join(summary_lines)}]"
}
return system + [summary_msg] + recent
# Measure token savings
long_history = (
[{"role": "system", "content": "You are an OutdoorGear advisor."}]
+ [msg
for i in range(10)
for msg in [
{"role": "user", "content": f"Turn {i}: What about product P00{i%7+1}?"},
{"role": "assistant", "content": f"Product P00{i%7+1} is a great choice for outdoor activities."},
]]
)
compressed = compress_history(long_history, max_turns=3)
original_tokens = sum(len(m["content"].split()) for m in long_history)
compressed_tokens = sum(len(m["content"].split()) for m in compressed)
savings = 1 - compressed_tokens / original_tokens
print(f"=== PROMPT COMPRESSION ===")
print(f"Original: {len(long_history)} messages, ~{original_tokens} tokens")
print(f"Compressed: {len(compressed)} messages, ~{compressed_tokens} tokens")
print(f"Savings: {savings:.0%}")
Part 5: Cost Monitoring with Azure MonitorΒΆ
In production, track costs with Azure Monitor:
# azure_cost_monitor.py (requires azure-monitor-opentelemetry)
# pip install azure-monitor-opentelemetry
from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import metrics
configure_azure_monitor(
connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]
)
meter = metrics.get_meter("outdoorgear.agent")
token_counter = meter.create_counter(
"ai.tokens.used",
description="Total LLM tokens consumed",
unit="tokens",
)
cost_gauge = meter.create_up_down_counter(
"ai.cost.usd",
description="Estimated AI API cost",
unit="USD",
)
def track_llm_call(model: str, input_tokens: int, output_tokens: int, cost: float):
attrs = {"model": model, "service": "outdoorgear-agent"}
token_counter.add(input_tokens, {**attrs, "type": "input"})
token_counter.add(output_tokens, {**attrs, "type": "output"})
cost_gauge.add(cost, attrs)
In the Azure Portal, create an alert:
Alert rule: AI cost > $10/day
Signal: Custom metric ai.cost.usd
Threshold: > 10 (sum over 24h)
Action: Email engineering@outdoorgear.example.com
π§ Knowledge CheckΒΆ
1. What is semantic caching, and how does it differ from exact-match caching?
Exact-match caching returns cached results only when the input is byte-for-byte identical. Semantic caching uses embeddings to find cached responses for semantically similar inputs β "Show me tents" and "What tents do you have?" get the same cached response even though the wording differs. Semantic caching is far more effective for conversational AI.
2. When should you NOT use a cheap model in a model router?
Use a premium model for: multi-step reasoning (chain-of-thought), comparisons between complex options, personalized recommendations that require weighing many factors, tasks where errors are costly (medical, financial, legal). Use cheap models for: lookups, listing, formatting, simple Q&A where accuracy tolerance is higher.
3. What are the risks of aggressively compressing prompt history?
The agent may lose context it needs: a user mentioned an allergy 5 turns ago and the compressed summary missed it. A poorly summarized history can cause contradictory responses in long conversations. Always keep the most recent N turns verbatim; only summarize older turns. Test with scenarios that require long-range context.
SummaryΒΆ
| Strategy | Potential Savings | Complexity |
|---|---|---|
| Semantic caching | 40β80% of API calls | Medium |
| Model routing | 10β40Γ cost reduction | Low |
| Prompt compression | 20β60% token reduction | Low |
| Batching requests | 10β30% (latency trade-off) | Medium |
Output limiting (max_tokens) |
5β20% | Very Low |
Next StepsΒΆ
- Observe what you optimized: β Lab 033 β Agent Observability
- Evaluate quality after optimization: β Lab 035 β Agent Evaluation
- Full production deployment: β Lab 028 β Deploy MCP to Azure