# Lab 058: Browser Automation Agents with OpenAI CUA

## What You'll Learn

- What OpenAI CUA (Computer-Using Agent) is: GPT-4o vision driving a real cloud browser via screenshots
- The architectural difference between CUA (screenshot-based) and Playwright (code-based selectors)
- When to use CUA vs Playwright: dynamic sites without stable selectors vs structured, well-known pages
- How to design safety boundaries: URL allowlists, session time limits, and action confirmation
- How to analyze web automation benchmarks comparing CUA and Playwright across difficulty levels
## Introduction

OpenAI CUA operates a real browser through screenshots. The agent sees the rendered page as an image, reasons about what to do next, and sends structured actions (click coordinates, typed text, scrolls). This is fundamentally different from Playwright, which interacts with the page through code: CSS selectors, XPath queries, and programmatic API calls.
| Approach | How It "Sees" the Page | Interaction Method | Brittleness |
|---|---|---|---|
| CUA | Screenshots (pixels) | Click coordinates, keyboard input | Resilient to DOM changes; struggles with dynamic SPAs |
| Playwright | DOM / HTML structure | CSS selectors, XPath, API calls | Breaks when selectors change; fast and precise |
## The Scenario

You are a Web Automation Engineer at OutdoorGear Inc. The team needs to automate tasks across multiple web properties: the e-commerce storefront, travel booking partners, the support portal, and internal analytics dashboards. Some sites have stable, well-structured HTML; others are dynamic single-page applications with constantly changing selectors.

Your job is to evaluate CUA vs Playwright using a benchmark dataset of 10 tasks attempted by both methods, and to recommend which approach to use for each scenario.
**No Live Agent Required**
This lab analyzes a pre-recorded benchmark dataset comparing CUA and Playwright results. You don't need an OpenAI API key or Playwright installation β all analysis is done locally with pandas. If you have API access, you can optionally extend the lab to run live CUA tasks.
## Prerequisites

| Requirement | Why |
|---|---|
| Python 3.10+ | Run analysis scripts |
| `pandas` library | DataFrame operations |
| (Optional) OpenAI API key | For live CUA experiments |
| (Optional) Playwright | For live browser automation comparison |
## 📦 Supporting Files

Download these files before starting the lab. Save all files to a `lab-058/` folder in your working directory.

| File | Description | Download |
|---|---|---|
| `broken_cua.py` | Bug-fix exercise (3 bugs + self-tests) | 📥 Download |
| `browser_tasks.csv` | Dataset | 📥 Download |
## Step 1: Understanding CUA vs Playwright

### CUA Architecture

```
┌─────────────┐      ┌──────────────┐      ┌──────────────┐
│   Browser   │─────▶│    GPT-4o    │─────▶│   Browser    │
│ Screenshot  │      │    Vision    │      │    Action    │
│  (pixels)   │      │   (reason)   │      │ (click/type) │
└─────────────┘      └──────────────┘      └──────────────┘
       ▲                                          │
       └──────────────────────────────────────────┘
                   repeat until done
```
CUA sends screenshots to GPT-4o, which returns structured actions. The browser executes the action, takes a new screenshot, and the loop continues until the task is complete.
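The loop can be sketched in a few lines of Python. Everything here is illustrative: `capture_screenshot`, `ask_model`, and `execute` are hypothetical stubs standing in for the real browser and vision-model calls, not actual OpenAI API functions.

```python
# Illustrative sketch of the CUA screenshot -> reason -> act loop.
# capture_screenshot / ask_model / execute are hypothetical stubs,
# NOT real OpenAI or browser APIs.

def capture_screenshot(state):
    # Real code would grab the rendered browser pixels.
    return f"screenshot of page in state {state}"

def ask_model(screenshot):
    # Real code would send the image to a vision model and parse its reply.
    # This stub "clicks" twice, then declares the task done.
    ask_model.calls += 1
    if ask_model.calls >= 3:
        return {"type": "done"}
    return {"type": "click", "x": 100, "y": 200}

def execute(action, state):
    # Real code would drive the browser; here we just advance a counter.
    return state + 1

def run_cua_task(max_steps=10):
    ask_model.calls = 0
    state, actions = 0, []
    for _ in range(max_steps):          # hard step cap = a safety boundary
        shot = capture_screenshot(state)
        action = ask_model(shot)
        actions.append(action["type"])
        if action["type"] == "done":    # model says the task is complete
            break
        state = execute(action, state)
    return actions

print(run_cua_task())  # -> ['click', 'click', 'done']
```

The `max_steps` cap is itself a safety measure: a confused agent cannot loop forever.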
### Playwright Architecture

```
┌─────────────┐      ┌──────────────┐      ┌──────────────┐
│ Test Script │─────▶│   Browser    │─────▶│  DOM / HTML  │
│   (code)    │      │    Engine    │      │ (selectors)  │
└─────────────┘      └──────────────┘      └──────────────┘
```

Playwright executes pre-written code that targets specific HTML elements using CSS selectors, XPath, or ARIA roles. It's fast, precise, and deterministic, but it breaks when the page structure changes.
### When to Use Each
| Scenario | Best Approach | Why |
|---|---|---|
| Stable, well-structured site | Playwright | Selectors are reliable; faster and cheaper |
| Dynamic SPA with changing selectors | CUA | Vision-based; doesn't depend on DOM structure |
| CAPTCHA-protected pages | CUA | Can "see" and reason about CAPTCHAs |
| High-volume, repetitive tasks | Playwright | Faster execution; no API cost per action |
| Unknown/new site exploration | CUA | No pre-written selectors needed |
**Key Difference**

CUA uses vision and screenshots to understand the page, like a human looking at a screen. Playwright uses code and selectors, like a developer inspecting the HTML source. CUA is more flexible; Playwright is more reliable on known pages.
## Step 2: Load the Benchmark Dataset
The dataset contains 10 tasks, each attempted by both CUA and Playwright:
```python
import pandas as pd

tasks = pd.read_csv("lab-058/browser_tasks.csv")

print(f"Total rows: {len(tasks)}")
print(f"Unique tasks: {tasks['task_id'].nunique()}")
print(f"Website types: {sorted(tasks['website_type'].unique())}")
print(f"Difficulty levels: {sorted(tasks['difficulty'].unique())}")
print(f"\nDataset preview:")
print(tasks[["task_id", "task_description", "difficulty",
             "cua_completed", "playwright_completed"]].to_string(index=False))
```
Expected output:

```text
Total rows: 10
Unique tasks: 10
Website types: ['auth', 'data', 'e-commerce', 'support', 'travel', 'webapp']
Difficulty levels: ['easy', 'hard', 'medium']
```

| task_id | task_description | difficulty | cua | playwright |
|---|---|---|---|---|
| T01 | Search for hiking boots and filter by price | easy | ✓ | ✓ |
| T02 | Add a product to cart and view cart total | easy | ✓ | ✓ |
| T03 | Fill out a shipping address form | medium | ✓ | ✓ |
| ... | ... | ... | ... | ... |
| T10 | Navigate a dynamic SPA with client-side routing | hard | ✗ | ✓ |
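If you don't have `browser_tasks.csv` handy, the columns used in this lab can be rebuilt in-memory from the results reported in Steps 3 and 5 (task descriptions omitted; column names match the CSV):

```python
import pandas as pd

# Rebuilt from this lab's results: CUA fails T06, T08, T10;
# Playwright fails T07, T08. Screenshot counts come from Step 5.
tasks = pd.DataFrame({
    "task_id": [f"T{i:02d}" for i in range(1, 11)],
    "difficulty": ["easy"] * 2 + ["medium"] * 3 + ["hard"] * 5,
    "cua_completed": [True, True, True, True, True,
                      False, True, False, True, False],
    "playwright_completed": [True, True, True, True, True,
                             True, False, False, True, True],
    "cua_screenshots": [3, 5, 8, 6, 10, 18, 14, 16, 22, 20],
})

print(tasks["cua_completed"].sum(), "of", len(tasks))         # 7 of 10
print(tasks["playwright_completed"].sum(), "of", len(tasks))  # 8 of 10
```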
## Step 3: Compare CUA vs Playwright Success Rates
Calculate and compare completion rates for both methods:

```python
cua_completed = tasks["cua_completed"].sum()
pw_completed = tasks["playwright_completed"].sum()
total = len(tasks)

cua_rate = (cua_completed / total) * 100
pw_rate = (pw_completed / total) * 100

print(f"CUA: {cua_completed}/{total} = {cua_rate:.0f}%")
print(f"Playwright: {pw_completed}/{total} = {pw_rate:.0f}%")
print(f"Difference: {pw_rate - cua_rate:.0f} percentage points in Playwright's favor")
```
Expected output:

```text
CUA: 7/10 = 70%
Playwright: 8/10 = 80%
Difference: 10 percentage points in Playwright's favor
```
### Where Each Method Excels
```python
# Tasks where CUA succeeded but Playwright failed
cua_only = tasks[tasks["cua_completed"] & ~tasks["playwright_completed"]]
print(f"CUA succeeded, Playwright failed ({len(cua_only)}):")
print(cua_only[["task_id", "task_description"]].to_string(index=False))

# Tasks where Playwright succeeded but CUA failed
pw_only = tasks[tasks["playwright_completed"] & ~tasks["cua_completed"]]
print(f"\nPlaywright succeeded, CUA failed ({len(pw_only)}):")
print(pw_only[["task_id", "task_description"]].to_string(index=False))
```
Expected:

- CUA only: T07 (Submit a support ticket with screenshot attachment) – a dynamic form with a file upload that is hard to script with selectors
- Playwright only: T06 (Compare hotel prices across 3 tabs), T10 (Navigate a dynamic SPA) – structured tasks where code-based navigation is more reliable
**Insight**

Playwright has a higher overall success rate (80% vs 70%), but CUA wins on tasks that involve dynamic content or visual reasoning (like attaching screenshots to support tickets). Playwright excels at structured, multi-tab workflows where precise selector-based navigation is needed.
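The same comparison can be summarized as a 2×2 agreement table with `pd.crosstab`. To keep the snippet self-contained, the completion flags are rebuilt inline from this lab's results rather than loaded from the CSV:

```python
import pandas as pd

# Completion flags rebuilt from this lab's results:
# CUA fails T06, T08, T10; Playwright fails T07, T08.
df = pd.DataFrame({
    "cua_completed":        [True, True, True, True, True,
                             False, True, False, True, False],
    "playwright_completed": [True, True, True, True, True,
                             True, False, False, True, True],
})

agreement = pd.crosstab(df["cua_completed"], df["playwright_completed"])
print(agreement.loc[True, True])    # both succeed: 6
print(agreement.loc[True, False])   # CUA only (T07): 1
print(agreement.loc[False, True])   # Playwright only (T06, T10): 2
print(agreement.loc[False, False])  # both fail (T08): 1
```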
## Step 4: Analyze by Difficulty
Break down success rates by difficulty level:

```python
print("Success rates by difficulty:\n")
for diff in ["easy", "medium", "hard"]:
    subset = tasks[tasks["difficulty"] == diff]
    cua_r = (subset["cua_completed"].sum() / len(subset)) * 100
    pw_r = (subset["playwright_completed"].sum() / len(subset)) * 100
    print(f"  {diff.upper()} ({len(subset)} tasks):")
    print(f"    CUA:        {subset['cua_completed'].sum()}/{len(subset)} = {cua_r:.0f}%")
    print(f"    Playwright: {subset['playwright_completed'].sum()}/{len(subset)} = {pw_r:.0f}%")
    print()
```
Expected output:

```text
Success rates by difficulty:

  EASY (2 tasks):
    CUA:        2/2 = 100%
    Playwright: 2/2 = 100%

  MEDIUM (3 tasks):
    CUA:        3/3 = 100%
    Playwright: 3/3 = 100%

  HARD (5 tasks):
    CUA:        2/5 = 40%
    Playwright: 3/5 = 60%
```
**Insight**

Both methods handle easy and medium tasks perfectly (100%). The gap appears in hard tasks, where Playwright's selector-based approach has a slight edge (60% vs 40%). However, the tasks where CUA wins (T07) are precisely the ones where Playwright's selectors can't handle dynamic, visual content.
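The per-difficulty loop above can also be collapsed into a single `groupby`. The frame below is rebuilt inline from this lab's results so the snippet stands alone:

```python
import pandas as pd

# Rebuilt from this lab's results (2 easy, 3 medium, 5 hard tasks).
df = pd.DataFrame({
    "difficulty": ["easy"] * 2 + ["medium"] * 3 + ["hard"] * 5,
    "cua_completed": [True, True, True, True, True,
                      False, True, False, True, False],
    "playwright_completed": [True, True, True, True, True,
                             True, False, False, True, True],
})

# mean() of a boolean column is the success rate (True counts as 1)
rates = df.groupby("difficulty")[["cua_completed", "playwright_completed"]].mean() * 100
print(rates.round(0))
```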
## Step 5: Screenshot Analysis
CUA takes a screenshot at every step, so more screenshots generally means a harder or longer task:

```python
total_screenshots = tasks["cua_screenshots"].sum()
print(f"Total CUA screenshots across all tasks: {total_screenshots}")

print(f"\nScreenshots per task:")
print(tasks[["task_id", "task_description", "difficulty",
             "cua_screenshots", "cua_completed"]].to_string(index=False))

avg_by_diff = tasks.groupby("difficulty")["cua_screenshots"].mean()
print(f"\nAverage screenshots by difficulty:")
print(avg_by_diff.to_string())
```
Expected output:
| task_id | difficulty | screenshots | completed |
|---|---|---|---|
| T01 | easy | 3 | True |
| T02 | easy | 5 | True |
| T03 | medium | 8 | True |
| T04 | medium | 6 | True |
| T05 | medium | 10 | True |
| T06 | hard | 18 | False |
| T07 | hard | 14 | True |
| T08 | hard | 16 | False |
| T09 | hard | 22 | True |
| T10 | hard | 20 | False |
**Screenshot Cost**

Each screenshot is sent to GPT-4o as image tokens: at ~765 tokens per screenshot (typical web page), 122 screenshots ≈ 93,000 tokens. At GPT-4o pricing, this is roughly $0.47 in input tokens for the entire benchmark run. CUA is cost-effective for moderate workloads but can add up for high-volume tasks.
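The arithmetic behind that estimate is a one-liner. Both the ~765 tokens per screenshot and the $5 per million input tokens are assumptions taken from this note; check current OpenAI pricing before relying on them:

```python
# Assumed figures from the cost note above -- verify against current pricing.
screenshots = 122
tokens_per_screenshot = 765       # typical full-page screenshot (assumption)
price_per_million_input = 5.00    # USD per 1M input tokens (assumption)

total_tokens = screenshots * tokens_per_screenshot
cost = total_tokens / 1_000_000 * price_per_million_input

print(total_tokens)    # 93330 (~93,000)
print(round(cost, 2))  # 0.47
```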
## Step 6: Safety Considerations

### URL Allowlist
Restrict CUA to approved domains:

```python
# Analyze domain patterns in the dataset
print("URL patterns in tasks:")
print(tasks["url_pattern"].value_counts().to_string())

internal = tasks[tasks["url_pattern"] != "external"]
external = tasks[tasks["url_pattern"] == "external"]
print(f"\nInternal domains: {len(internal)} tasks")
print(f"External domains: {len(external)} tasks")

high_risk = tasks[tasks["safety_risk"] == "high"]
print(f"\nHigh-risk tasks: {len(high_risk)}")
print(high_risk[["task_id", "task_description", "safety_risk", "url_pattern"]].to_string(index=False))
```
### Recommended Safety Boundaries
| Boundary | Purpose | Implementation |
|---|---|---|
| URL allowlist | Restrict which sites CUA can visit | `allowed_domains = ["*.outdoorgear.com"]` |
| Session time limit | Prevent runaway agents | Kill session after 5 minutes of inactivity |
| Action confirmation | Human approval for risky actions | Prompt before form submissions on payment pages |
| Screenshot retention | Audit trail | Save all screenshots with timestamps for review |
| Credential handling | Never expose passwords in screenshots | Use browser-level autofill; keep passwords out of visible fields |
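A minimal allowlist check might look like the sketch below. The helper name, patterns, and apex-domain handling are illustrative choices, not part of any CUA SDK:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Example patterns: the wildcard covers subdomains, the second entry the apex.
ALLOWED_DOMAINS = ["*.outdoorgear.com", "outdoorgear.com"]

def url_allowed(url: str) -> bool:
    """Return True only if the URL's host matches an allowlisted pattern."""
    host = urlparse(url).hostname or ""
    return any(fnmatch(host, pattern) for pattern in ALLOWED_DOMAINS)

print(url_allowed("https://shop.outdoorgear.com/cart"))  # True
print(url_allowed("https://evil-phish.example/login"))   # False
```

A check like this would gate every navigation action the agent emits, before the browser executes it.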
**External Sites**

Task T10 targets an external domain (`url_pattern` = `external`). In production, CUA should never be pointed at external sites without explicit allowlisting. An unconstrained agent could navigate to phishing sites, download malware, or leak sensitive data through form submissions on untrusted domains.
## 🐛 Bug-Fix Exercise
The file `lab-058/broken_cua.py` has 3 bugs in the CUA analysis functions. Can you find and fix them all?

Run the self-tests to see which ones fail (e.g. `python lab-058/broken_cua.py`). You should see 3 failed tests. Each test corresponds to one bug:
| Test | What it checks | Hint |
|---|---|---|
| Test 1 | CUA success rate | Should use the `cua_completed` column, not `playwright_completed` |
| Test 2 | Total CUA screenshots | Should use `sum()`, not `max()` |
| Test 3 | CUA success rate by difficulty | Must filter by the `difficulty` parameter before computing the rate |
Fix all 3 bugs, then re-run. When you see "🎉 All 3 tests passed", you're done!
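For reference, corrected versions of the three helpers might look like this. The function names and signatures are guesses based on the test descriptions above; the actual names in `broken_cua.py` may differ:

```python
import pandas as pd

def cua_success_rate(tasks: pd.DataFrame) -> float:
    # Bug 1 fix: read cua_completed, not playwright_completed
    return tasks["cua_completed"].mean() * 100

def total_screenshots(tasks: pd.DataFrame) -> int:
    # Bug 2 fix: sum() across all tasks, not max() of a single task
    return int(tasks["cua_screenshots"].sum())

def success_rate_by_difficulty(tasks: pd.DataFrame, difficulty: str) -> float:
    # Bug 3 fix: filter to the requested difficulty before computing the rate
    subset = tasks[tasks["difficulty"] == difficulty]
    return subset["cua_completed"].mean() * 100
```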
## 🧠 Knowledge Check
Q1 (Multiple Choice): What is the key difference between CUA and Playwright for browser automation?
- A) CUA is faster because it skips page rendering
- B) CUA uses vision/screenshots to understand pages, while Playwright uses code-based CSS selectors
- C) Playwright can handle CAPTCHAs but CUA cannot
- D) CUA requires access to the page's HTML source code
▶ Reveal Answer
Correct: B) CUA uses vision/screenshots to understand pages, while Playwright uses code-based CSS selectors
CUA sends screenshots to a vision-language model (GPT-4o) and receives click/type actions based on what it "sees", just like a human looking at a screen. Playwright interacts with the DOM directly using CSS selectors, XPath, or ARIA roles. This fundamental difference means CUA is more flexible (it works on any visual interface) while Playwright is more precise (direct DOM access).
Q2 (Multiple Choice): When is CUA a better choice than Playwright?
- A) For high-volume, repetitive tasks on stable pages
- B) For dynamic sites without stable CSS selectors
- C) When you need deterministic, reproducible test results
- D) When the page has a well-documented API
▶ Reveal Answer
Correct: B) For dynamic sites without stable CSS selectors
CUA excels on sites where the DOM structure changes frequently: dynamic SPAs, sites with A/B testing, or pages with randomized element IDs. Because CUA "sees" the page visually, it doesn't depend on CSS selectors that might break with every deployment. Playwright is better for stable, well-structured sites where selectors are reliable.
Q3 (Run the Lab): What is the CUA success rate?
Count tasks where `cua_completed == True` and divide by the total number of tasks.

▶ Reveal Answer
70%
7 out of 10 tasks were completed successfully by CUA. The 3 failures (T06, T08, T10) were all hard difficulty tasks involving multi-tab comparison, CAPTCHA handling, and dynamic SPA navigation.
Q4 (Run the Lab): What is the Playwright success rate?
Count tasks where `playwright_completed == True` and divide by the total number of tasks.

▶ Reveal Answer
80%
8 out of 10 tasks were completed successfully by Playwright. The 2 failures (T07, T08) involved a screenshot-attachment upload (which requires visual reasoning beyond selectors) and a CAPTCHA-protected form (which neither method could handle).
Q5 (Run the Lab): What is the total number of CUA screenshots across all tasks?
Compute `tasks["cua_screenshots"].sum()`.

▶ Reveal Answer
122
Sum of all screenshots: 3 + 5 + 8 + 6 + 10 + 18 + 14 + 16 + 22 + 20 = 122 screenshots. Hard tasks required significantly more screenshots (avg 18) compared to easy tasks (avg 4), reflecting the additional reasoning steps needed for complex workflows.
## Summary
| Topic | What You Learned |
|---|---|
| CUA Architecture | GPT-4o vision drives a cloud browser via a screenshot→action loop |
| Playwright Architecture | Code-based selectors interact directly with the DOM |
| CUA vs Playwright | CUA: 70% success, flexible; Playwright: 80% success, precise |
| Difficulty Impact | Both methods ace easy/medium; hard tasks reveal their differences |
| Screenshot Overhead | 122 total screenshots; hard tasks require ~4× more than easy |
| Safety Design | URL allowlists, session limits, credential isolation, audit trails |
## Next Steps

- Lab 057 – Computer-Using Agents for Desktop Automation
- Explore OpenAI's CUA documentation for live agent setup
- Try Playwright for code-based browser automation