Lab 072: Structured Outputs β Guaranteed JSON for AgentsΒΆ
What You'll LearnΒΆ
- What Structured Outputs are and why agents need guaranteed JSON
- How JSON Schema enforcement differs from free-form "please return JSON" prompts
- Analyze extraction test results comparing schema-enforced vs. no-schema outputs
- Measure schema validity rates and field accuracy across input types
- Build a reliability report proving structured outputs eliminate parsing failures
IntroductionΒΆ
When an agent extracts information from unstructured text β emails, invoices, resumes, support tickets β it needs to return structured data that downstream systems can parse reliably. Without schema enforcement, even the best models occasionally return malformed JSON, missing fields, or unexpected types.
Structured Outputs solve this by constraining the model's output to a JSON Schema at decoding time. The model literally cannot produce invalid JSON.
| Approach | Validity Guarantee | Field Accuracy | Parsing Failures |
|---|---|---|---|
| Free-form prompt ("return JSON") | β No guarantee | Variable | Common |
| JSON mode | β Valid JSON | Variable | Rare |
| Structured Outputs (JSON Schema) | β Valid + schema-compliant | High | Zero |
The ScenarioΒΆ
You are a Data Engineer building an extraction pipeline for an insurance company. The pipeline processes 5 document types: emails, invoices, resumes, support tickets, and product reviews. You've run 15 extraction tests β 10 with schema enforcement and 5 without β and need to prove that structured outputs are production-ready.
Your dataset (structured_outputs.csv) contains the results. Your job: analyze validity rates, field accuracy, and build the case for schema enforcement.
Mock Data
This lab uses a mock test results CSV. The patterns mirror real-world behavior: schema-enforced outputs achieve near-perfect accuracy, while free-form outputs are inconsistent.
PrerequisitesΒΆ
| Requirement | Why |
|---|---|
| Python 3.10+ | Run the analysis scripts |
pandas library |
Data manipulation |
π¦ Supporting FilesΒΆ
Download these files before starting the lab
Save all files to a lab-072/ folder in your working directory.
| File | Description | Download |
|---|---|---|
broken_structured.py |
Bug-fix exercise (3 bugs + self-tests) | π₯ Download |
structured_outputs.csv |
Dataset | π₯ Download |
Step 1: Understand Structured OutputsΒΆ
Structured Outputs work by providing a JSON Schema alongside your prompt. The model's decoder is constrained to only produce tokens that result in valid JSON matching the schema.
Example Schema (Pydantic)ΒΆ
from pydantic import BaseModel
from typing import List
class EmailExtraction(BaseModel):
name: str
email: str
subject: str
urgency: str # "low", "medium", "high"
class InvoiceExtraction(BaseModel):
vendor: str
amount: float
date: str
line_items: List[str]
How It WorksΒΆ
- You define a JSON Schema (or Pydantic model)
- You pass it to the API alongside your prompt
- The model's token sampling is constrained to match the schema
- The output is guaranteed to be valid JSON matching your schema β every field present, every type correct
Pydantic Integration
OpenAI's Python SDK can accept a Pydantic model directly via response_format=EmailExtraction. The SDK handles the schema conversion automatically.
Step 2: Load and Explore the Test ResultsΒΆ
The dataset has 15 extraction tests β 10 with schema enforcement (gpt-4o) and 5 without (gpt-4o-no-schema):
import pandas as pd
df = pd.read_csv("lab-072/structured_outputs.csv")
# Convert string booleans
for col in ["structured_output_valid", "json_parse_success"]:
df[col] = df[col].astype(str).str.strip().str.lower() == "true"
print(f"Total tests: {len(df)}")
print(f"Models: {df['model'].unique().tolist()}")
print(f"Input types: {df['input_type'].unique().tolist()}")
print(f"\nFirst 5 rows:\n{df.head()}")
Expected output:
Total tests: 15
Models: ['gpt-4o', 'gpt-4o-no-schema']
Input types: ['email', 'invoice', 'resume', 'support_ticket', 'product_review']
Step 3: Compare Schema Validity RatesΒΆ
The structured_output_valid column indicates whether the output matched the expected schema (all fields present, correct types):
schema_rows = df[df["model"] == "gpt-4o"]
no_schema_rows = df[df["model"] == "gpt-4o-no-schema"]
schema_valid_rate = schema_rows["structured_output_valid"].mean() * 100
no_schema_valid_rate = no_schema_rows["structured_output_valid"].mean() * 100
print(f"Schema-enforced validity rate: {schema_valid_rate:.0f}%")
print(f"No-schema validity rate: {no_schema_valid_rate:.0f}%")
Expected output:
Insight
100% vs. 0% β this is the entire argument for structured outputs. With schema enforcement, every single extraction passes validation. Without it, none do (some may parse as JSON, but fields are missing or types are wrong).
Step 4: Analyze Field AccuracyΒΆ
Even when JSON is valid, the extracted values may not be accurate. The field_accuracy_pct column measures how many fields had the correct value:
schema_accuracy = schema_rows["field_accuracy_pct"].mean()
no_schema_accuracy = no_schema_rows["field_accuracy_pct"].mean()
print(f"Avg field accuracy (with schema): {schema_accuracy:.0f}%")
print(f"Avg field accuracy (without schema): {no_schema_accuracy:.0f}%")
Expected output:
Break it down by input type:
accuracy_by_type = df.groupby(["input_type", "model"])["field_accuracy_pct"].mean().unstack()
print(accuracy_by_type.round(1))
# Which input types show the biggest accuracy gap?
for input_type in df["input_type"].unique():
schema_acc = df[(df["input_type"] == input_type) & (df["model"] == "gpt-4o")]["field_accuracy_pct"].mean()
no_schema_acc = df[(df["input_type"] == input_type) & (df["model"] == "gpt-4o-no-schema")]["field_accuracy_pct"].mean()
gap = schema_acc - no_schema_acc if not pd.isna(no_schema_acc) else 0
print(f" {input_type:>20s}: schema={schema_acc:.0f}% no-schema={no_schema_acc:.0f}% gap={gap:.0f}pp")
Step 5: Measure Latency and Token UsageΒΆ
Schema enforcement has a small overhead β the model must conform to constraints during decoding:
for model in df["model"].unique():
subset = df[df["model"] == model]
avg_time = subset["time_ms"].mean()
avg_tokens = subset["tokens"].mean()
print(f"{model:>20s}: avg_time={avg_time:.0f}ms avg_tokens={avg_tokens:.0f}")
Expected output:
Latency Trade-off
Schema-enforced outputs are ~38% slower on average. This is expected β constrained decoding requires additional processing. For most agent workflows, the reliability guarantee far outweighs the latency cost.
Step 6: Build the Reliability ReportΒΆ
report = f"""# π Structured Outputs Reliability Report
## Test Summary
| Metric | With Schema | Without Schema |
|--------|-------------|----------------|
| Tests Run | {len(schema_rows)} | {len(no_schema_rows)} |
| Schema Valid | {schema_valid_rate:.0f}% | {no_schema_valid_rate:.0f}% |
| Avg Field Accuracy | {schema_accuracy:.0f}% | {no_schema_accuracy:.0f}% |
| Avg Latency | {schema_rows['time_ms'].mean():.0f}ms | {no_schema_rows['time_ms'].mean():.0f}ms |
| Avg Tokens | {schema_rows['tokens'].mean():.0f} | {no_schema_rows['tokens'].mean():.0f} |
## Conclusion
Structured Outputs deliver **{schema_valid_rate:.0f}% schema validity** vs. {no_schema_valid_rate:.0f}%
without enforcement. Field accuracy improves from {no_schema_accuracy:.0f}% to {schema_accuracy:.0f}%.
**Recommendation:** Enable Structured Outputs for all extraction pipelines.
The ~38% latency overhead is justified by zero parsing failures in production.
"""
print(report)
with open("lab-072/reliability_report.md", "w") as f:
f.write(report)
print("πΎ Saved to lab-072/reliability_report.md")
π Bug-Fix ExerciseΒΆ
The file lab-072/broken_structured.py contains 3 bugs that produce incorrect metrics. Can you find and fix them all?
Run the self-tests to see which ones fail:
You should see 3 failed tests. Each test corresponds to one bug:
| Test | What it checks | Hint |
|---|---|---|
| Test 1 | Schema success rate metric | Should check structured_output_valid, not json_parse_success |
| Test 2 | No-schema accuracy | Should filter by the no-schema model, not the schema model |
| Test 3 | Average tokens per model | Should filter by model before averaging |
Fix all 3 bugs, then re-run. When you see All passed!, you're done!
π§ Knowledge CheckΒΆ
Q1 (Multiple Choice): What distinguishes Structured Outputs from regular JSON mode?
- A) Structured Outputs are faster than JSON mode
- B) Structured Outputs guarantee the output matches a specific JSON Schema, not just valid JSON
- C) Structured Outputs work without an API key
- D) Structured Outputs use a different model architecture
β Reveal Answer
Correct: B) Structured Outputs guarantee the output matches a specific JSON Schema, not just valid JSON
JSON mode ensures valid JSON syntax (proper brackets, quotes, etc.), but the structure β which fields exist, their types, nesting β is not enforced. Structured Outputs constrain the decoder to match a specific schema, guaranteeing every field is present with the correct type.
Q2 (Multiple Choice): Which Python library integrates most seamlessly with OpenAI's Structured Outputs for schema definition?
- A) dataclasses
- B) marshmallow
- C) Pydantic
- D) attrs
β Reveal Answer
Correct: C) Pydantic
OpenAI's Python SDK directly accepts Pydantic BaseModel subclasses via the response_format parameter. The SDK converts the Pydantic model to a JSON Schema automatically, making schema definition as simple as writing a Python class.
Q3 (Run the Lab): What is the schema validity rate for schema-enforced extraction tests?
Run the Step 3 analysis on π₯ structured_outputs.csv and check the results.
β Reveal Answer
100%
All 10 schema-enforced tests (model=gpt-4o) have structured_output_valid=true. The constrained decoder guarantees that every output matches the defined JSON Schema β zero parsing or validation failures.
Q4 (Run the Lab): What is the schema validity rate for tests without schema enforcement?
Run the Step 3 analysis to compare.
β Reveal Answer
0%
All 5 no-schema tests (model=gpt-4o-no-schema) have structured_output_valid=false. Even though some produce parseable JSON (json_parse_success=true), they fail schema validation because fields are missing, have wrong types, or use unexpected key names.
Q5 (Run the Lab): What is the average field accuracy for schema-enforced tests (gpt-4o rows)?
Run the Step 4 analysis to calculate it.
β Reveal Answer
98%
The 10 schema-enforced tests have field accuracies of 100, 100, 100, 95, 100, 90, 100, 100, 100, and 95. The mean is (100+100+100+95+100+90+100+100+100+95) Γ· 10 = 98%.
SummaryΒΆ
| Topic | What You Learned |
|---|---|
| Structured Outputs | JSON Schema-constrained decoding that guarantees valid output |
| Schema Validity | 100% with enforcement vs. 0% without β eliminates parsing failures |
| Field Accuracy | 98% with schema vs. 68% without β structure improves content accuracy |
| Pydantic Integration | Define schemas as Python classes for seamless API integration |
| Latency Trade-off | ~38% overhead is justified by production reliability |
| Production Readiness | Zero parsing failures makes structured outputs essential for pipelines |