
Lab 059: Voice Agents with GPT Realtime API

Level: L200 | Path: All paths | Time: ~75 min | 💰 Cost: Free — Uses session dataset (Azure OpenAI optional)

What You'll Learn

  • How the GPT-4o Realtime API enables full-duplex voice conversations with ~100 ms latency
  • Connect clients via WebRTC for low-latency, browser-native audio streaming
  • Handle interruptions (barge-in) — letting users cut in while the agent is still speaking
  • Integrate RAG with real-time audio so the agent retrieves product data mid-conversation
  • Analyze voice session metrics: latency percentiles, sentiment, and language distribution
  • Evaluate multi-language support (en, es, fr) in a single voice agent deployment

Introduction

Voice agents are shifting from traditional turn-taking — where the user speaks, waits, then the agent responds — to real-time conversation. The GPT-4o Realtime API processes speech input and generates speech output simultaneously, enabling natural back-and-forth dialogue with ~100 ms latency.

OutdoorGear wants a voice assistant for product inquiries. Customers call in, ask about gear, and the agent responds with product details — all in real time. The system must handle interruptions gracefully (a customer can say "wait, actually…" mid-response), support multiple languages, and pull product information from a RAG pipeline on the fly.

Architecture Overview

┌───────────┐   WebRTC    ┌────────────────────┐   REST/WS   ┌───────────┐
│  Browser  │◄───────────►│  Realtime API      │◄───────────►│  RAG      │
│  (mic +   │    audio    │  (GPT-4o-realtime) │  tool calls │  (product │
│  speaker) │   stream    │  • VAD             │             │   search) │
└───────────┘             │  • Barge-in        │             └───────────┘
                          │  • Turn detection  │
                          └────────────────────┘

Key concepts:

Concept Description
Realtime API Full-duplex speech-to-speech endpoint — no separate STT/TTS pipeline
WebRTC Browser-native protocol for low-latency audio/video streaming
VAD (Voice Activity Detection) Detects when the user starts/stops speaking
Barge-in User can interrupt the agent mid-response; the agent stops and listens
Server-side turn detection The API decides when a user turn is complete
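The barge-in behavior described above can be sketched as a tiny client-side state machine. This is a toy illustration only; in the real Realtime API, VAD and turn detection run server-side:

```python
# Toy sketch of barge-in handling. Illustrative only: the Realtime API
# performs VAD and turn detection on the server, not in client code.

class AgentState:
    LISTENING = "listening"
    SPEAKING = "speaking"

class VoiceAgent:
    def __init__(self):
        self.state = AgentState.LISTENING
        self.events = []

    def on_user_speech_started(self):
        # VAD fired while the agent is talking -> barge-in:
        # cancel playback immediately and go back to listening.
        if self.state == AgentState.SPEAKING:
            self.events.append("playback_cancelled")
        self.state = AgentState.LISTENING

    def on_response_started(self):
        self.state = AgentState.SPEAKING

agent = VoiceAgent()
agent.on_response_started()      # agent begins speaking
agent.on_user_speech_started()   # user interrupts mid-response
print(agent.state)               # -> listening
print(agent.events)              # -> ['playback_cancelled']
```

The key point is that interruption is handled by cancelling audio playback, not by waiting for the current response to finish.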

Prerequisites

pip install pandas

This lab analyzes pre-recorded session data — no API key or Azure subscription required. To build a live voice agent, you would need an Azure OpenAI resource with the gpt-4o-realtime-preview model deployed.
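For reference, a live session is typically configured by sending a session.update event over the WebSocket after connecting. The sketch below only builds the payload; the field names (voice, turn_detection, input_audio_format) follow the public Realtime API reference at the time of writing, so verify them against the current docs before use:

```python
import json

# Hedged sketch: a Realtime API session-configuration event.
# Field names are taken from the public Realtime API reference and
# may change between preview versions; check the current docs.
session_update = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "instructions": "You are OutdoorGear's product assistant.",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "turn_detection": {"type": "server_vad"},  # server-side turn detection
    },
}

print(json.dumps(session_update, indent=2))
```

In a real client this dict would be serialized and sent as the first message on the WebSocket connection.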


Quick Start with GitHub Codespaces

Open in GitHub Codespaces

All dependencies are pre-installed in the devcontainer.

📦 Supporting Files

Download these files before starting the lab

Save all files to a lab-059/ folder in your working directory.

File Description Download
broken_voice.py Bug-fix exercise (3 bugs + self-tests) 📥 Download
voice_sessions.csv Dataset 📥 Download

Part 1: Understanding Realtime API Architecture

Step 1: How the Realtime API differs from Chat Completions

The standard Chat Completions API follows a request-response pattern: send text, receive text. The Realtime API is fundamentally different:

Feature Chat Completions Realtime API
Input Text (JSON) Audio stream (PCM/WebRTC)
Output Text (JSON) Audio stream + text transcript
Latency 500–2000 ms ~100 ms (P50)
Duplex Half-duplex (request → response) Full-duplex (simultaneous)
Interruption Not supported Barge-in supported
Protocol HTTP REST WebSocket / WebRTC

The ~100 ms target latency makes voice conversations feel natural — comparable to human-to-human phone calls.


Part 2: Load and Explore Voice Session Data

Step 2: Load 📥 voice_sessions.csv

OutdoorGear recorded 15 voice sessions during a pilot test of their Realtime API integration. Each session captures a customer interaction:

# voice_analysis.py
import pandas as pd

sessions = pd.read_csv("lab-059/voice_sessions.csv")
print(f"Total sessions: {len(sessions)}")
print(f"Columns: {list(sessions.columns)}")
print(sessions.head())

Expected output:

Total sessions: 15
Columns: ['session_id', 'scenario', 'duration_sec', 'latency_p50_ms',
           'latency_p95_ms', 'interruptions', 'turns', 'sentiment',
           'model', 'rag_used', 'language']

The dataset includes:

Column Description
session_id Unique session identifier (S01–S15)
scenario Type of interaction: product_inquiry, order_status, complaint, return_request, faq
duration_sec Total session duration in seconds
latency_p50_ms Median response latency in milliseconds
latency_p95_ms 95th percentile response latency
interruptions Number of times the user interrupted the agent
turns Total conversational turns
sentiment Overall session sentiment: positive, neutral, negative
model Model used (gpt-4o-realtime)
rag_used Whether RAG was invoked during the session
language Session language: en, es, fr
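If you want to experiment before downloading the CSV, you can mock a few rows with the same schema. The values below are made up for illustration and do not match the real dataset:

```python
import pandas as pd

# Illustrative rows only, matching the voice_sessions.csv schema;
# these are NOT the real lab values.
mock = pd.DataFrame(
    {
        "session_id": ["S01", "S02", "S03"],
        "scenario": ["product_inquiry", "faq", "complaint"],
        "duration_sec": [45, 18, 120],
        "latency_p50_ms": [88, 76, 104],
        "latency_p95_ms": [170, 152, 240],
        "interruptions": [0, 0, 3],
        "turns": [6, 2, 10],
        "sentiment": ["positive", "positive", "negative"],
        "model": ["gpt-4o-realtime"] * 3,
        "rag_used": [True, False, True],
        "language": ["en", "en", "en"],
    }
)
print(mock.dtypes)  # confirm numeric vs object vs bool columns
```

Every analysis step in Parts 3 to 7 runs unchanged against a frame with this schema.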

Part 3: Latency Analysis

Step 3: Measure response latency across sessions

Latency is the most critical metric for voice agents — anything above 200 ms feels laggy.

# Latency statistics
avg_p50 = sessions["latency_p50_ms"].mean()
avg_p95 = sessions["latency_p95_ms"].mean()

print(f"Average P50 latency: {avg_p50:.1f} ms")
print(f"Average P95 latency: {avg_p95:.1f} ms")
print(f"Min P50: {sessions['latency_p50_ms'].min()} ms")
print(f"Max P50: {sessions['latency_p50_ms'].max()} ms")

# Sessions exceeding 200ms at P95
slow = sessions[sessions["latency_p95_ms"] > 200]
print(f"\nSessions with P95 > 200ms: {len(slow)}")
print(slow[["session_id", "scenario", "latency_p95_ms"]])

Expected output:

Average P50 latency: 89.3 ms
Average P95 latency: 187.5 ms
Min P50: 75 ms
Max P50: 110 ms

Sessions with P95 > 200ms: 4
  session_id       scenario  latency_p95_ms
       S06  return_request             210
       S09       complaint             240
       S12  return_request             215
       S14       complaint             255

Latency Insight

The average P50 of 89.3 ms is well below the 100 ms target. However, complaint and return sessions consistently have higher latency — likely because they trigger longer RAG lookups and more complex reasoning.
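You can check that claim with a per-scenario groupby. The snippet below uses a few made-up rows so it runs standalone; on the real dataset, apply the same groupby to sessions:

```python
import pandas as pd

# Made-up rows for illustration; on the real data, group `sessions`
# the same way to compare scenarios.
df = pd.DataFrame(
    {
        "scenario": ["faq", "faq", "complaint", "complaint", "return_request"],
        "latency_p95_ms": [150, 155, 240, 255, 210],
    }
)

# Mean P95 latency per scenario, lowest first.
by_scenario = df.groupby("scenario")["latency_p95_ms"].mean().sort_values()
print(by_scenario)
```

With these illustrative numbers, FAQ sessions sit at the low end and complaints at the high end, mirroring the pattern in the lab data.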


Part 4: Sentiment Analysis

Step 4: Analyze session sentiment distribution

# Sentiment breakdown
sentiment_counts = sessions["sentiment"].value_counts()
print("Sentiment Distribution:")
print(sentiment_counts)
print(f"\nPositive: {sentiment_counts.get('positive', 0)} sessions")
print(f"Neutral:  {sentiment_counts.get('neutral', 0)} sessions")
print(f"Negative: {sentiment_counts.get('negative', 0)} sessions")

# Which sessions are negative?
negative = sessions[sessions["sentiment"] == "negative"]
print(f"\nNegative sessions:")
print(negative[["session_id", "scenario", "duration_sec", "interruptions"]])

Expected output:

Sentiment Distribution:
positive    8
negative    4
neutral     3

Positive: 8 sessions
Neutral:  3 sessions
Negative: 4 sessions

Negative sessions:
  session_id       scenario  duration_sec  interruptions
       S03       complaint           120              3
       S06  return_request            65              1
       S09       complaint            90              4
       S14       complaint           105              5

Pattern

All 4 negative sessions are either complaints or return requests. Three of the four have 3+ interruptions — frustrated customers interrupt more frequently.


Part 5: RAG Usage Patterns

Step 5: Analyze which sessions use RAG

# RAG usage
rag_used = sessions[sessions["rag_used"] == True]
rag_not_used = sessions[sessions["rag_used"] == False]

print(f"RAG used: {len(rag_used)}/{len(sessions)} sessions ({len(rag_used)/len(sessions)*100:.0f}%)")
print(f"RAG not used: {len(rag_not_used)} sessions")
print(f"\nSessions without RAG:")
print(rag_not_used[["session_id", "scenario", "sentiment"]])

# Compare latency: RAG vs no-RAG
print(f"\nAvg P50 with RAG:    {rag_used['latency_p50_ms'].mean():.1f} ms")
print(f"Avg P50 without RAG: {rag_not_used['latency_p50_ms'].mean():.1f} ms")

Expected output:

RAG used: 12/15 sessions (80%)
RAG not used: 3 sessions

Sessions without RAG:
  session_id scenario sentiment
       S05      faq  positive
       S10      faq  positive
       S15      faq  positive

Avg P50 with RAG:    92.6 ms
Avg P50 without RAG: 76.3 ms

RAG Insight

The 3 sessions without RAG are all FAQ scenarios — simple questions that don't require product database lookups. FAQ sessions are also the shortest (15–20 seconds) and have the lowest latency.
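In a live agent, RAG is wired in through tool calls: the model emits a function-call event mid-conversation, the client executes the lookup, and the result is fed back for the spoken answer. A minimal dispatch sketch, where the search_products tool and its catalog are hypothetical:

```python
# Hypothetical tool-call dispatch. `search_products` and CATALOG are
# made up for illustration; a real agent would query a vector store
# or product database.
CATALOG = {
    "tent": "UltraLight 2P tent, 1.2 kg, $299",
    "stove": "TrailBlaze camp stove, 180 g, $59",
}

def search_products(query: str) -> str:
    """Look up a product by keyword; fall back to a not-found message."""
    return CATALOG.get(query.lower(), "No matching product found.")

def handle_tool_call(name: str, arguments: dict) -> str:
    """Route a model-emitted function call to the matching local tool."""
    tools = {"search_products": search_products}
    if name not in tools:
        return f"Unknown tool: {name}"
    return tools[name](**arguments)

# Simulated function-call event from the model:
print(handle_tool_call("search_products", {"query": "tent"}))
```

The returned string would be sent back to the model as tool output, which it then voices to the caller.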


Part 6: Interruption Patterns

Step 6: Analyze barge-in behavior

Barge-in is when a user interrupts the agent mid-response. It's a key capability of the Realtime API — without it, voice agents feel robotic.

# Interruption analysis
print("Interruptions per session:")
print(sessions[["session_id", "scenario", "interruptions", "sentiment"]].to_string(index=False))

# Correlation between interruptions and sentiment
avg_interruptions = sessions.groupby("sentiment")["interruptions"].mean()
print(f"\nAvg interruptions by sentiment:")
print(avg_interruptions)

# Sessions with most interruptions
high_interrupt = sessions[sessions["interruptions"] >= 3]
print(f"\nHigh-interruption sessions (≥3):")
print(high_interrupt[["session_id", "scenario", "interruptions", "sentiment"]])

Expected output:

Avg interruptions by sentiment:
negative    3.25
neutral     0.67
positive    0.63

High-interruption sessions (≥3):
  session_id   scenario  interruptions sentiment
       S03  complaint              3  negative
       S09  complaint              4  negative
       S14  complaint              5  negative

Barge-in Insight

Negative sessions average 3.25 interruptions vs 0.63 for positive sessions. High interruption counts are a strong signal of customer frustration — an agent could detect this in real time and escalate to a human agent.
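That escalation idea can be expressed as a simple heuristic. The thresholds below are illustrative choices, not values from the lab:

```python
def should_escalate(interruptions: int, scenario: str) -> bool:
    """Illustrative escalation heuristic: frequent barge-ins are a
    frustration signal, with a lower bar during complaints.
    Thresholds are made-up examples, not tuned values."""
    if interruptions >= 3:
        return True
    return scenario == "complaint" and interruptions >= 2

print(should_escalate(4, "complaint"))        # True
print(should_escalate(0, "product_inquiry"))  # False
```

A production agent would combine such counters with live sentiment signals before handing off to a human.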


Part 7: Multi-Language Support

Step 7: Analyze language distribution

# Language breakdown
lang_counts = sessions["language"].value_counts()
print("Language Distribution:")
print(lang_counts)

# Performance by language
for lang in sessions["language"].unique():
    lang_sessions = sessions[sessions["language"] == lang]
    print(f"\n{lang.upper()}: {len(lang_sessions)} sessions, "
          f"avg P50={lang_sessions['latency_p50_ms'].mean():.1f}ms, "
          f"most common sentiment: {lang_sessions['sentiment'].mode().iloc[0]}")

Expected output:

Language Distribution:
en    13
es     1
fr     1

EN: 13 sessions, avg P50=90.2ms, most common sentiment: positive
ES: 1 sessions, avg P50=82.0ms, most common sentiment: positive
FR: 1 sessions, avg P50=87.0ms, most common sentiment: positive

The Realtime API supports multiple languages natively — the same model handles English, Spanish, and French without separate deployments.


πŸ› Bug-Fix ExerciseΒΆ

The file lab-059/broken_voice.py has 3 bugs in the voice session analysis functions. Run the self-tests:

python lab-059/broken_voice.py

You should see 3 failed tests:

Test What it checks Hint
Test 1 Average P95 latency calculation Which latency column should you use?
Test 2 Count of negative sentiment sessions Are you filtering for the right sentiment value?
Test 3 RAG usage rate as a percentage What should the denominator be?

Fix all 3 bugs and re-run until you see 🎉 All 3 tests passed.


🧠 Knowledge Check

Q1 (Multiple Choice): What is the target response latency for the GPT-4o Realtime API?
  • A) ~500 ms — fast enough for most voice applications
  • B) ~100 ms — comparable to human-to-human conversation latency
  • C) ~10 ms — near-instantaneous for real-time gaming
  • D) ~1000 ms — acceptable for batch voice processing
✅ Reveal Answer

Correct: B) ~100 ms

The Realtime API targets ~100 ms P50 latency, which is comparable to the natural pauses in human conversation. At this speed, voice interactions feel fluid and natural. The session data confirms this — the average P50 across 15 sessions is 89.3 ms.

Q2 (Multiple Choice): What does 'barge-in' mean in the context of voice agents?
  • A) The agent interrupts the user to provide urgent information
  • B) The user can interrupt the agent mid-response and the agent stops to listen
  • C) Multiple users can join the same voice session simultaneously
  • D) The agent switches between languages mid-conversation
✅ Reveal Answer

Correct: B) The user can interrupt the agent mid-response and the agent stops to listen

Barge-in is a critical feature of natural voice conversation. When a user says "wait, actually…" while the agent is still speaking, the agent immediately stops its current response and processes the new input. Without barge-in, users must wait for the agent to finish — creating a frustrating, robotic experience.

Q3 (Run the Lab): What is the average P95 latency across all 15 voice sessions?

Calculate sessions["latency_p95_ms"].mean().

✅ Reveal Answer

187.5 ms

The P95 values range from 150 ms (S10, an FAQ session) to 255 ms (S14, a complaint). The mean across all 15 sessions is (170+185+195+180+155+210+165+175+240+150+178+215+188+255+152) / 15 = 187.5 ms. Four sessions exceed the 200 ms threshold — all are complaints or return requests.

Q4 (Run the Lab): How many sessions have negative sentiment?

Filter sessions[sessions["sentiment"] == "negative"] and count.

✅ Reveal Answer

4 sessions

Sessions S03, S06, S09, and S14 have negative sentiment. All four are either complaints (S03, S09, S14) or return requests (S06). These sessions also have the highest latency and interruption counts, suggesting a correlation between customer frustration and system performance under complex scenarios.

Q5 (Run the Lab): What percentage of sessions use RAG?

Calculate (sessions with rag_used == True) / total sessions * 100.

✅ Reveal Answer

80% (12 out of 15)

12 of 15 sessions use RAG. The 3 sessions without RAG (S05, S10, S15) are all FAQ scenarios — simple questions that the model answers from its training data without needing product database lookups. FAQ sessions also have the lowest latency, confirming that RAG adds a measurable (but small) latency overhead.


Summary

Topic What You Learned
Realtime API Full-duplex speech-to-speech with ~100 ms latency
WebRTC Browser-native protocol for low-latency audio streaming
Barge-in Users can interrupt mid-response for natural conversation flow
RAG + Voice 80% of sessions use RAG; FAQ sessions skip it for lower latency
Sentiment Negative sessions correlate with complaints, high latency, and interruptions
Multi-language Same model handles en, es, fr without separate deployments

Next Steps

  • Lab 043 — Multimodal Agents with GPT-4o Vision (complementary modality)
  • Lab 060 — Reasoning Models: Chain-of-Thought with o3 and DeepSeek R1
  • Lab 019 — Streaming Responses (foundational streaming concepts)