
Lab 060: Reasoning Models — Chain-of-Thought with o3 and DeepSeek R1

Level: L200 · Path: All paths · Time: ~75 min · 💰 Cost: Free — Uses benchmark dataset (Azure OpenAI optional)

What You'll Learn

  • How reasoning models (o3, DeepSeek R1) differ from standard models (GPT-4o) — extended thinking, chain-of-thought
  • What a thinking budget is and how it controls the depth of model reasoning
  • Compare accuracy, speed, and token cost across 3 models on 12 benchmark problems
  • Identify which problem categories and difficulty levels benefit most from reasoning
  • Apply a decision framework: when to use reasoning models vs standard models
  • Understand cost-performance trade-offs for production deployments

Introduction

Standard LLMs like GPT-4o generate answers in a single forward pass — fast, but they can stumble on problems that require multi-step logical reasoning. Reasoning models like o3 and DeepSeek R1 take a different approach: they use extended thinking (chain-of-thought) to break complex problems into steps, verify intermediate results, and backtrack when they detect errors.

The trade-off is clear: reasoning models are slower and use more tokens, but they achieve dramatically higher accuracy on hard problems.

The Benchmark

You'll compare 3 models on 12 problems across 4 categories:

| Category | Easy | Medium | Hard |
|----------|------|--------|------|
| Math | Compound interest | System of equations | Prove √2 is irrational |
| Code | Reverse a string | Binary search | Thread-safe LRU cache |
| Logic | Syllogism | Three boxes puzzle | Wolf-goat-cabbage |
| Planning | Hiking itinerary | Delivery route | Microservices migration |

Prerequisites

pip install pandas

This lab analyzes pre-computed benchmark results — no API key or Azure subscription required. To run live benchmarks, you would need access to GPT-4o, o3, and DeepSeek R1 via Azure OpenAI or the respective APIs.


Quick Start with GitHub Codespaces

Open in GitHub Codespaces

All dependencies are pre-installed in the devcontainer.

📦 Supporting Files

Download these files before starting the lab

Save all files to a lab-060/ folder in your working directory.

| File | Description | Download |
|------|-------------|----------|
| broken_reasoning.py | Bug-fix exercise (3 bugs + self-tests) | 📥 Download |
| reasoning_benchmark.csv | Benchmark dataset | 📥 Download |

Part 1: Understanding Reasoning Models

Step 1: How reasoning models work

Standard models generate tokens left-to-right without pausing to "think." Reasoning models add an internal deliberation phase:

Standard (GPT-4o):
  Input → [Generate tokens] → Output

Reasoning (o3 / DeepSeek R1):
  Input → [Think: break into steps] → [Verify each step] → [Backtrack if needed] → Output

Key concepts:

| Concept | Description |
|---------|-------------|
| Chain-of-thought | The model explicitly reasons through intermediate steps before answering |
| Thinking budget | Controls how much reasoning the model does (more budget = more thorough = slower) |
| Extended thinking | The model's internal deliberation — visible in some APIs as "thinking tokens" |
| Self-verification | The model checks its own intermediate results and corrects mistakes |

Thinking Budget

The thinking budget controls how much reasoning the model does before producing a final answer. A higher budget lets the model explore more solution paths and verify more thoroughly — but costs more tokens and takes more time. For simple questions, a low budget suffices; for complex proofs, you want the full budget.
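In API terms, the budget is usually a single request parameter. The sketch below builds a chat-completions-style payload with a `reasoning_effort` hint, mirroring the knob OpenAI exposes for its o-series models; other providers name the control differently (some use an explicit token budget), so treat the field name and the `"o3"` deployment name as assumptions for illustration.

```python
# Sketch: how a thinking budget typically surfaces in a request payload.
# "reasoning_effort" mirrors OpenAI's o-series knob; other providers
# name it differently, so the field name here is an assumption.
def build_request(question: str, effort: str = "medium") -> dict:
    """Build a chat-completions-style payload with a reasoning hint."""
    if effort not in ("low", "medium", "high"):
        raise ValueError("effort must be 'low', 'medium', or 'high'")
    return {
        "model": "o3",                # hypothetical deployment name
        "reasoning_effort": effort,   # the thinking-budget knob
        "messages": [{"role": "user", "content": question}],
    }

# Full budget for a proof, minimal budget for a trivial lookup.
hard_req = build_request("Prove that sqrt(2) is irrational", effort="high")
easy_req = build_request("What is 2 + 2?", effort="low")
print(hard_req["reasoning_effort"], easy_req["reasoning_effort"])
```

The point is that the budget is a per-request decision: the same application can send low-effort requests for lookups and high-effort requests for proofs.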


Part 2: Load Benchmark Data

Step 2: Load 📥 reasoning_benchmark.csv

The benchmark dataset contains results from running all 12 problems through each model:

# reasoning_analysis.py
import pandas as pd

bench = pd.read_csv("lab-060/reasoning_benchmark.csv")

# Convert boolean columns
for model in ["gpt4o", "o3", "deepseek_r1"]:
    bench[f"{model}_correct"] = bench[f"{model}_correct"].astype(str).str.lower() == "true"

print(f"Problems: {len(bench)}")
print(f"Categories: {bench['category'].unique().tolist()}")
print(f"Difficulties: {bench['difficulty'].unique().tolist()}")
print(bench[["problem_id", "category", "difficulty"]].to_string(index=False))

Expected output:

Problems: 12
Categories: ['math', 'code', 'logic', 'planning']
Difficulties: ['easy', 'medium', 'hard']

problem_id category difficulty
       P01     math       easy
       P02     math     medium
       P03     math       hard
       P04     code       easy
       P05     code     medium
       P06     code       hard
       P07    logic       easy
       P08    logic     medium
       P09    logic       hard
       P10 planning       easy
       P11 planning     medium
       P12 planning       hard

Part 3: Overall Accuracy Comparison

Step 3: Calculate accuracy for each model

# Overall accuracy
for model in ["gpt4o", "o3", "deepseek_r1"]:
    correct = bench[f"{model}_correct"].sum()
    total = len(bench)
    print(f"{model:>12}: {correct}/{total} = {correct/total*100:.1f}%")

Expected output:

      gpt4o: 6/12 = 50.0%
          o3: 12/12 = 100.0%
 deepseek_r1: 11/12 = 91.7%

Key Finding

GPT-4o gets only half the problems right, while o3 achieves a perfect score. DeepSeek R1 misses just one problem (P12 — the hardest planning problem). The gap is dramatic on hard problems.

# Which problems does GPT-4o get wrong?
gpt4o_fails = bench[~bench["gpt4o_correct"]]
print("GPT-4o failures:")
print(gpt4o_fails[["problem_id", "category", "difficulty", "description"]].to_string(index=False))

Expected output:

GPT-4o failures:
problem_id category difficulty                                       description
       P03     math       hard                    Prove that sqrt(2) is irrational
       P06     code       hard          Design a thread-safe LRU cache in Python
       P08    logic     medium  Three boxes puzzle: one has gold - find the optimal strategy
       P09    logic       hard  River crossing puzzle with wolf-goat-cabbage constraints
       P11 planning     medium  Optimize a delivery route for 5 stops minimizing distance
       P12 planning       hard  Design a microservices migration plan for a monolith app

GPT-4o fails on all hard problems plus two medium problems (P08, P11) that require multi-step reasoning.

# What does DeepSeek R1 get wrong?
r1_fails = bench[~bench["deepseek_r1_correct"]]
print("DeepSeek R1 failures:")
print(r1_fails[["problem_id", "category", "difficulty", "description"]].to_string(index=False))

Expected output:

DeepSeek R1 failures:
problem_id  category difficulty                                          description
       P12  planning       hard  Design a microservices migration plan for a monolith app

DeepSeek R1 fails only on P12 β€” the most complex planning problem requiring both technical knowledge and multi-step project planning.


Part 4: Accuracy by Category and Difficulty

Step 4: Break down accuracy by category

# Accuracy by category
for category in bench["category"].unique():
    cat_data = bench[bench["category"] == category]
    print(f"\n{category.upper()}:")
    for model in ["gpt4o", "o3", "deepseek_r1"]:
        correct = cat_data[f"{model}_correct"].sum()
        total = len(cat_data)
        print(f"  {model:>12}: {correct}/{total}")

Expected output:

MATH:
        gpt4o: 2/3
            o3: 3/3
   deepseek_r1: 3/3

CODE:
        gpt4o: 2/3
            o3: 3/3
   deepseek_r1: 3/3

LOGIC:
        gpt4o: 1/3
            o3: 3/3
   deepseek_r1: 3/3

PLANNING:
        gpt4o: 1/3
            o3: 3/3
   deepseek_r1: 2/3

# Accuracy by difficulty
for diff in ["easy", "medium", "hard"]:
    diff_data = bench[bench["difficulty"] == diff]
    print(f"\n{diff.upper()}:")
    for model in ["gpt4o", "o3", "deepseek_r1"]:
        correct = diff_data[f"{model}_correct"].sum()
        total = len(diff_data)
        print(f"  {model:>12}: {correct}/{total} = {correct/total*100:.0f}%")

Expected output:

EASY:
        gpt4o: 4/4 = 100%
            o3: 4/4 = 100%
   deepseek_r1: 4/4 = 100%

MEDIUM:
        gpt4o: 2/4 = 50%
            o3: 4/4 = 100%
   deepseek_r1: 4/4 = 100%

HARD:
        gpt4o: 0/4 = 0%
            o3: 4/4 = 100%
   deepseek_r1: 3/4 = 75%

Difficulty Insight

All three models ace easy problems. The gap appears at medium difficulty (GPT-4o drops to 50%) and becomes dramatic on hard problems (GPT-4o: 0%, DeepSeek R1: 75%, o3: 100%). Reasoning models earn their keep on hard problems.
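If you prefer one table over nested loops, pandas can compute the same accuracy-by-difficulty breakdown in a single `groupby`. The sketch below uses a four-row stand-in frame (column names assumed to match `reasoning_benchmark.csv`, values illustrative only) so it runs even without the dataset; swap in `bench` to reproduce the numbers above.

```python
import pandas as pd

# Four-row stand-in for reasoning_benchmark.csv; column names follow
# the real file, but the values here are illustrative only.
mini = pd.DataFrame({
    "difficulty":    ["easy", "easy", "hard", "hard"],
    "gpt4o_correct": [True, True, False, False],
    "o3_correct":    [True, True, True, True],
})

# mean() of a boolean column is its accuracy; one groupby replaces
# the nested difficulty/model loops from the step above.
acc = mini.groupby("difficulty")[["gpt4o_correct", "o3_correct"]].mean() * 100
print(acc.round(1))
```

The same pattern extends to the category breakdown: group by `"category"` instead of `"difficulty"`.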


Part 5: Speed vs Accuracy Trade-offs

Step 5: Analyze response time by model

# Average time per model
for model in ["gpt4o", "o3", "deepseek_r1"]:
    avg_time = bench[f"{model}_time_sec"].mean()
    print(f"{model:>12}: {avg_time:.1f}s average")

# Time vs accuracy scatter
print("\nProblem-level detail:")
for _, row in bench.iterrows():
    print(f"  {row['problem_id']} ({row['difficulty']:>6}): "
          f"GPT-4o={row['gpt4o_time_sec']:.1f}s "
          f"o3={row['o3_time_sec']:.1f}s "
          f"R1={row['deepseek_r1_time_sec']:.1f}s")

Expected output:

      gpt4o: 2.1s average
          o3: 7.1s average
 deepseek_r1: 5.4s average

Problem-level detail:
  P01 (  easy): GPT-4o=1.2s o3=3.5s R1=2.8s
  P02 (medium): GPT-4o=1.8s o3=4.2s R1=3.5s
  P03 (  hard): GPT-4o=2.5s o3=8.1s R1=6.5s
  ...
  P12 (  hard): GPT-4o=4.0s o3=15.0s R1=11.0s

Speed Trade-off

o3 is 3.4× slower than GPT-4o on average (7.1s vs 2.1s). On the hardest problem (P12), o3 takes 15 seconds — acceptable for complex tasks, but too slow for real-time chat. Match the model to each task's complexity rather than deploying one model everywhere.


Part 6: Token Cost Analysis

Step 6: Compare token usage

# Average tokens per model
for model in ["gpt4o", "o3", "deepseek_r1"]:
    avg_tokens = bench[f"{model}_tokens"].mean()
    total_tokens = bench[f"{model}_tokens"].sum()
    print(f"{model:>12}: {avg_tokens:.0f} avg tokens, {total_tokens:,} total")

# Cost ratio (relative to GPT-4o)
gpt4o_total = bench["gpt4o_tokens"].sum()
for model in ["o3", "deepseek_r1"]:
    model_total = bench[f"{model}_tokens"].sum()
    ratio = model_total / gpt4o_total
    print(f"\n{model} uses {ratio:.1f}× more tokens than GPT-4o")

Expected output:

      gpt4o: 287 avg tokens, 3,440 total
          o3: 878 avg tokens, 10,530 total
 deepseek_r1: 725 avg tokens, 8,700 total

o3 uses 3.1× more tokens than GPT-4o
deepseek_r1 uses 2.5× more tokens than GPT-4o

The extra tokens come from chain-of-thought reasoning — the model is "thinking out loud" internally. This is the cost of higher accuracy.
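To turn token counts into dollars, multiply by your provider's per-token price. The prices below are placeholders, not real rates (which change frequently); the token totals come from the Step 6 output.

```python
# Placeholder per-1K-token prices in USD -- NOT real rates; substitute
# your provider's current pricing before drawing conclusions.
PRICE_PER_1K = {"gpt4o": 0.010, "o3": 0.040, "deepseek_r1": 0.002}

# Total tokens across all 12 problems (from the Step 6 output).
TOTAL_TOKENS = {"gpt4o": 3_440, "o3": 10_530, "deepseek_r1": 8_700}

def benchmark_cost(model: str) -> float:
    """Estimated cost of running the full benchmark on one model."""
    return TOTAL_TOKENS[model] / 1_000 * PRICE_PER_1K[model]

for m in PRICE_PER_1K:
    print(f"{m:>12}: ${benchmark_cost(m):.4f}")
```

Note that token ratios alone can understate the gap: if the reasoning model also charges more per token, the dollar ratio is larger than the 3.1× token ratio.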


Part 7: When to Use Each Model

Step 7: Decision framework

Based on the benchmark results, here's when to use each model:

| Scenario | Recommended Model | Why |
|----------|-------------------|-----|
| Simple Q&A, FAQ | GPT-4o | 100% accuracy on easy problems, 3× faster, 3× cheaper |
| Multi-step reasoning | o3 or DeepSeek R1 | GPT-4o drops to 0% on hard problems |
| Cost-sensitive production | DeepSeek R1 | 91.7% accuracy at 2.5× tokens (vs o3's 3.1×) |
| Maximum accuracy required | o3 | 100% accuracy, but 3.4× slower and 3.1× more expensive |
| Real-time conversation | GPT-4o | 2.1s avg — reasoning models are too slow for chat |
| Code generation (complex) | o3 | Thread-safe, concurrent code needs careful reasoning |
| Mathematical proofs | o3 or DeepSeek R1 | Both handle formal proofs; GPT-4o cannot |

# Summary dashboard
print("""
╔══════════════════════════════════════════════════════╗
║      Reasoning Model Benchmark — Summary             ║
╠══════════════════════════════════════════════════════╣
║  Model        Accuracy   Avg Time   Avg Tokens       ║
║  ─────────    ────────   ────────   ──────────       ║
║  GPT-4o        50.0%      2.1s        287            ║
║  o3           100.0%      7.1s        878            ║
║  DeepSeek R1   91.7%      5.4s        725            ║
╠══════════════════════════════════════════════════════╣
║  Key Insight: Use GPT-4o for simple tasks,           ║
║  reasoning models for complex multi-step problems.   ║
╚══════════════════════════════════════════════════════╝
""")
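The decision table can also be encoded as a small routing helper. `choose_model` below is a hypothetical function whose rules come straight from this benchmark's findings; in production you would tune the thresholds against your own traffic and latency budgets.

```python
def choose_model(difficulty: str, realtime: bool = False,
                 cost_sensitive: bool = False) -> str:
    """Route a request per the decision framework (hypothetical helper).

    realtime or easy       -> gpt4o       (fast, cheap, 100% on easy)
    cost-sensitive + hard  -> deepseek_r1 (91.7% at 2.5x tokens)
    otherwise medium/hard  -> o3          (100% accuracy)
    """
    if realtime or difficulty == "easy":
        return "gpt4o"
    if cost_sensitive:
        return "deepseek_r1"
    return "o3"

print(choose_model("easy"))                        # -> gpt4o
print(choose_model("hard"))                        # -> o3
print(choose_model("hard", cost_sensitive=True))   # -> deepseek_r1
```

Putting real-time ahead of accuracy in the rule order reflects the Part 5 finding: a 15-second response is unusable in chat no matter how correct it is.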

πŸ› Bug-Fix ExerciseΒΆ

The file lab-060/broken_reasoning.py has 3 bugs in the benchmark analysis functions. Run the self-tests:

python lab-060/broken_reasoning.py

You should see 3 failed tests:

| Test | What it checks | Hint |
|------|----------------|------|
| Test 1 | Model accuracy calculation | Which column represents correctness — _correct or _time_sec? |
| Test 2 | Finding the fastest model | Should you use min or max to find the fastest? |
| Test 3 | Hard-problem accuracy | Which difficulty level are you filtering for? |

Fix all 3 bugs and re-run until you see 🎉 All 3 tests passed.


🧠 Knowledge Check¢

Q1 (Multiple Choice): When should you use a reasoning model instead of a standard model like GPT-4o?
  • A) For all tasks — reasoning models are always better
  • B) For complex multi-step problems requiring logical reasoning, proofs, or planning
  • C) For real-time chat applications where speed is critical
  • D) For simple FAQ and classification tasks
✅ Reveal Answer

Correct: B) For complex multi-step problems requiring logical reasoning, proofs, or planning

Reasoning models excel when problems require breaking down into steps, verifying intermediate results, or exploring multiple solution paths. GPT-4o achieves 100% on easy problems — reasoning models add no value there but cost 3× more. Reserve reasoning models for hard problems where GPT-4o's single-pass approach fails.

Q2 (Multiple Choice): What does the 'thinking budget' control in a reasoning model?
  • A) The maximum number of API calls per minute
  • B) The total cost in dollars for a single request
  • C) How much reasoning the model does before producing a final answer
  • D) The maximum length of the output response
✅ Reveal Answer

Correct: C) How much reasoning the model does before producing a final answer

The thinking budget controls the depth of the model's internal deliberation. A higher budget allows the model to explore more solution paths, verify intermediate steps more thoroughly, and backtrack when it detects errors. This produces more accurate results but consumes more tokens and takes more time.

Q3 (Run the Lab): What is o3's accuracy on the 12-problem benchmark?

Calculate bench["o3_correct"].sum() / len(bench) * 100.

✅ Reveal Answer

100% (12/12)

o3 correctly solves all 12 problems across every category and difficulty level — including P12 (microservices migration plan), which is the only problem DeepSeek R1 fails. This perfect score comes at a cost: o3 averages 7.1 seconds and 878 tokens per problem.

Q4 (Run the Lab): What is GPT-4o's accuracy on the benchmark?

Calculate bench["gpt4o_correct"].sum() / len(bench) * 100.

✅ Reveal Answer

50% (6/12)

GPT-4o correctly solves 6 of 12 problems. It gets all 4 easy problems right but fails on all 4 hard problems (P03, P06, P09, P12) and 2 medium problems (P08, P11). The failures span all categories — math, code, logic, and planning — confirming that the issue is reasoning depth, not domain knowledge.

Q5 (Run the Lab): Which model fails only on problem P12?

Check which model has _correct == False for exactly one problem, and that problem is P12.

✅ Reveal Answer

DeepSeek R1

DeepSeek R1 achieves 91.7% accuracy (11/12), failing only on P12 — "Design a microservices migration plan for a monolith app." This is the hardest planning problem, requiring both deep technical knowledge and complex multi-step project planning. o3 solves it; GPT-4o fails on it plus 5 other problems.


Summary

| Topic | What You Learned |
|-------|------------------|
| Reasoning Models | Extended thinking via chain-of-thought for complex problems |
| Thinking Budget | Controls reasoning depth — more budget = more accurate but slower |
| Accuracy | GPT-4o: 50%, DeepSeek R1: 91.7%, o3: 100% on 12-problem benchmark |
| Speed Trade-off | GPT-4o: 2.1s avg, DeepSeek R1: 5.4s, o3: 7.1s — reasoning costs time |
| Token Cost | Reasoning models use 2.5–3.1× more tokens than GPT-4o |
| Decision Framework | Use GPT-4o for simple tasks; reasoning models for hard multi-step problems |

Next Steps

  • Lab 059 — Voice Agents with GPT Realtime API (real-time interaction, different modality)
  • Lab 043 — Multimodal Agents with GPT-4o Vision (another GPT-4o capability)
  • Lab 038 — Cost Optimization (applying the cost-performance trade-offs from this lab)