Lab 058: Browser Automation Agents with OpenAI CUA

Level: L300 | Path: ⚙️ Pro Code | Time: ~90 min | 💰 Cost: Free (uses benchmark dataset; OpenAI API optional)

What You'll Learn

  • What OpenAI CUA (Computer-Using Agent) is: GPT-4o vision driving a real cloud browser via screenshots
  • The architectural difference between CUA (screenshot-based) and Playwright (code-based selectors)
  • When to use CUA vs Playwright: dynamic sites without stable selectors vs structured, well-known pages
  • How to design safety boundaries: URL allowlists, session time limits, and action confirmation
  • How to analyze web automation benchmarks comparing CUA and Playwright across difficulty levels

Introduction

OpenAI CUA operates a real browser through screenshots. The agent sees the rendered page as an image, reasons about what to do next, and sends structured actions (click coordinates, type text, scroll). This is fundamentally different from Playwright, which interacts with the page through code: CSS selectors, XPath queries, and programmatic API calls.

| Approach | How It "Sees" the Page | Interaction Method | Trade-offs |
| --- | --- | --- | --- |
| CUA | Screenshots (pixels) | Click coordinates, keyboard input | Resilient to DOM changes; struggles with dynamic SPAs |
| Playwright | DOM / HTML structure | CSS selectors, XPath, API calls | Fast and precise; breaks when selectors change |

The Scenario

You are a Web Automation Engineer at OutdoorGear Inc. The team needs to automate tasks across multiple web properties: the e-commerce storefront, travel booking partners, the support portal, and internal analytics dashboards. Some sites have stable, well-structured HTML; others are dynamic single-page applications with constantly changing selectors.

Your job is to evaluate CUA vs Playwright using a benchmark dataset of 10 tasks attempted by both methods, and recommend which approach to use for each scenario.

No Live Agent Required

This lab analyzes a pre-recorded benchmark dataset comparing CUA and Playwright results. You don't need an OpenAI API key or a Playwright installation; all analysis is done locally with pandas. If you have API access, you can optionally extend the lab to run live CUA tasks.

Prerequisites

| Requirement | Why |
| --- | --- |
| Python 3.10+ | Run analysis scripts |
| pandas library | DataFrame operations |
| (Optional) OpenAI API key | For live CUA experiments |
| (Optional) Playwright | For live browser automation comparison |

pip install pandas

Quick Start with GitHub Codespaces

All dependencies are pre-installed in the devcontainer.

📦 Supporting Files

Download these files before starting the lab

Save all files to a lab-058/ folder in your working directory.

| File | Description |
| --- | --- |
| broken_cua.py | Bug-fix exercise (3 bugs + self-tests) |
| browser_tasks.csv | Benchmark dataset |

Step 1: Understanding CUA vs Playwright

CUA Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Browser     │────▶│  GPT-4o      │────▶│  Browser     │
│  Screenshot  │     │  Vision      │     │  Action      │
│  (pixels)    │     │  (reason)    │     │  (click/type)│
└──────────────┘     └──────────────┘     └──────────────┘
        ▲                                         │
        └─────────────────────────────────────────┘
                    repeat until done

CUA sends screenshots to GPT-4o, which returns structured actions. The browser executes the action, takes a new screenshot, and the loop continues until the task is complete.
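The loop described above can be sketched in a few lines. This is a local simulation, not the OpenAI API: `FakeBrowser`, `fake_model`, `run_agent`, and the shape of the action dictionaries are all hypothetical stand-ins, chosen only so the screenshot, reason, act control flow can run without an API key.

```python
from dataclasses import dataclass, field


@dataclass
class FakeBrowser:
    """Hypothetical stand-in for a cloud browser session."""
    typed: list = field(default_factory=list)
    clicks: list = field(default_factory=list)

    def screenshot(self) -> bytes:
        # A real browser would return PNG bytes; here the state is encoded as text.
        return f"typed={len(self.typed)};clicks={len(self.clicks)}".encode()

    def execute(self, action: dict) -> None:
        if action["type"] == "type":
            self.typed.append(action["text"])
        elif action["type"] == "click":
            self.clicks.append((action["x"], action["y"]))


def fake_model(screenshot: bytes, goal: str) -> dict:
    """Hypothetical stand-in for the vision model: screenshot in, one action out."""
    state = screenshot.decode()
    if state == "typed=0;clicks=0":
        return {"type": "type", "text": goal}
    if state == "typed=1;clicks=0":
        return {"type": "click", "x": 640, "y": 120}
    return {"type": "done"}


def run_agent(browser, model, goal: str, max_steps: int = 10) -> int:
    """Repeat screenshot -> reason -> act until the model reports it is done."""
    for step in range(1, max_steps + 1):
        action = model(browser.screenshot(), goal)
        if action["type"] == "done":
            return step
        browser.execute(action)
    raise RuntimeError("step budget exhausted before the task finished")


browser = FakeBrowser()
steps = run_agent(browser, fake_model, "hiking boots")
print(steps, browser.typed, browser.clicks)
```

The `max_steps` cap is a design choice worth keeping in any real loop: every iteration costs one screenshot, so an agent that never reaches "done" should fail loudly rather than run indefinitely.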

Playwright Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  Test Script │────▶│  Browser     │────▶│  DOM / HTML  │
│  (code)      │     │  Engine      │     │  (selectors) │
└──────────────┘     └──────────────┘     └──────────────┘

Playwright executes pre-written code that targets specific HTML elements using CSS selectors, XPath, or ARIA roles. It's fast, precise, and deterministic, but it breaks when the page structure changes.
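To make the brittleness concrete without installing Playwright, the sketch below uses only Python's standard library. It is not the Playwright API: `PAGE_V1`, `PAGE_V2`, `find_by_id`, and `find_by_text` are hypothetical names, and the two markup strings imagine the same button before and after a redeploy that regenerates its `id`.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the same button before and after a redeploy.
PAGE_V1 = '<div><button id="add-to-cart">Add to cart</button></div>'
PAGE_V2 = '<div><button id="btn-7f3a">Add to cart</button></div>'


def find_by_id(markup: str, element_id: str):
    """Selector-style lookup: depends on an attribute the user never sees."""
    root = ET.fromstring(markup)
    return root.find(f".//*[@id='{element_id}']")


def find_by_text(markup: str, text: str):
    """Lookup by visible label: closer to what a screenshot-based agent relies on."""
    root = ET.fromstring(markup)
    return next((el for el in root.iter() if (el.text or "").strip() == text), None)


print(find_by_id(PAGE_V1, "add-to-cart") is not None)    # selector works on v1
print(find_by_id(PAGE_V2, "add-to-cart") is not None)    # selector breaks on v2
print(find_by_text(PAGE_V2, "Add to cart") is not None)  # visible label still matches
```

The id-based lookup fails as soon as the attribute changes, while the label-based lookup survives because the rendered text did not change.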

When to Use Each

| Scenario | Best Approach | Why |
| --- | --- | --- |
| Stable, well-structured site | Playwright | Selectors are reliable; faster and cheaper |
| Dynamic SPA with changing selectors | CUA | Vision-based; doesn't depend on DOM structure |
| CAPTCHA-protected pages | CUA (in principle) | Can "see" and reason about CAPTCHAs, though neither method completed the CAPTCHA task (T08) in this benchmark |
| High-volume, repetitive tasks | Playwright | Faster execution; no API cost per action |
| Unknown/new site exploration | CUA | No pre-written selectors needed |

Key Difference

CUA uses vision and screenshots to understand the page, like a human looking at a screen. Playwright uses code and selectors, like a developer inspecting the HTML source. CUA is more flexible; Playwright is more reliable on known pages.


Step 2: Load the Benchmark Dataset

The dataset contains 10 tasks, each attempted by both CUA and Playwright:

import pandas as pd

tasks = pd.read_csv("lab-058/browser_tasks.csv")
print(f"Total rows: {len(tasks)}")
print(f"Unique tasks: {tasks['task_id'].nunique()}")
print(f"Website types: {sorted(tasks['website_type'].unique())}")
print(f"Difficulty levels: {sorted(tasks['difficulty'].unique())}")
print(f"\nDataset preview:")
print(tasks[["task_id", "task_description", "difficulty",
             "cua_completed", "playwright_completed"]].to_string(index=False))

Expected output (abridged; ✓/✗ stand for True/False in the cua_completed and playwright_completed columns):

Total rows: 10
Unique tasks: 10
Website types: ['auth', 'data', 'e-commerce', 'support', 'travel', 'webapp']
Difficulty levels: ['easy', 'hard', 'medium']

Dataset preview:
task_id  task_description                                 difficulty  cua  playwright
T01      Search for hiking boots and filter by price      easy        ✓    ✓
T02      Add a product to cart and view cart total        easy        ✓    ✓
T03      Fill out a shipping address form                 medium      ✓    ✓
...      ...                                              ...         ...  ...
T10      Navigate a dynamic SPA with client-side routing  hard        ✗    ✓
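Before analyzing, it can help to confirm that the file actually has the columns the later steps rely on. The sketch below assumes the column names used by the code in this lab; `validate_tasks` is a hypothetical helper, and the small in-memory frame is a deliberately broken stand-in for the CSV so the checks have something to report.

```python
import pandas as pd

# Columns referenced by the analysis steps in this lab.
REQUIRED_COLUMNS = {
    "task_id", "task_description", "website_type", "difficulty",
    "cua_completed", "playwright_completed", "cua_screenshots",
    "url_pattern", "safety_risk",
}


def validate_tasks(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means the frame looks usable."""
    problems = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if "task_id" in df.columns and df["task_id"].duplicated().any():
        problems.append("duplicate task_id values")
    return problems


# Deliberately incomplete sample: most columns missing, T02 repeated.
sample = pd.DataFrame({
    "task_id": ["T01", "T02", "T02"],
    "difficulty": ["easy", "easy", "medium"],
})
problems = validate_tasks(sample)
print(problems)
```

With the real file, `validate_tasks(tasks)` returning an empty list would indicate that every column used in Steps 2 through 6 is present and each task appears once.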

Step 3: Compare CUA vs Playwright Success Rates

Calculate and compare completion rates for both methods:

cua_completed = tasks["cua_completed"].sum()
pw_completed = tasks["playwright_completed"].sum()
total = len(tasks)

cua_rate = (cua_completed / total) * 100
pw_rate = (pw_completed / total) * 100

print(f"CUA:        {cua_completed}/{total} = {cua_rate:.0f}%")
print(f"Playwright: {pw_completed}/{total} = {pw_rate:.0f}%")
print(f"Difference: {pw_rate - cua_rate:.0f} percentage points in Playwright's favor")

Expected output:

CUA:        7/10 = 70%
Playwright: 8/10 = 80%
Difference: 10 percentage points in Playwright's favor

Where Each Method Excels

# Tasks where CUA succeeded but Playwright failed
cua_only = tasks[(tasks["cua_completed"] == True) & (tasks["playwright_completed"] == False)]
print(f"CUA succeeded, Playwright failed ({len(cua_only)}):")
print(cua_only[["task_id", "task_description"]].to_string(index=False))

# Tasks where Playwright succeeded but CUA failed
pw_only = tasks[(tasks["playwright_completed"] == True) & (tasks["cua_completed"] == False)]
print(f"\nPlaywright succeeded, CUA failed ({len(pw_only)}):")
print(pw_only[["task_id", "task_description"]].to_string(index=False))

Expected:

  • CUA only: T07 (Submit a support ticket with screenshot attachment), a dynamic form with a file upload that is hard to script with selectors
  • Playwright only: T06 (Compare hotel prices across 3 tabs) and T10 (Navigate a dynamic SPA), structured tasks where code-based navigation is more reliable

Insight

Playwright has a higher overall success rate (80% vs 70%), but CUA wins the one task that depends on visual reasoning (T07, attaching a screenshot to a support ticket). Playwright wins the structured multi-tab workflow (T06) and, in this benchmark, the dynamic SPA task (T10) as well, where precise selector-based navigation paid off.
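The two filters above cover only two of the four possible outcomes. As a sketch, the same comparison can be expressed as a single `outcome` column. The frame below is rebuilt in memory from the completion results reported in this lab so the block runs without the CSV; the `outcome` column and its labels are illustrative names, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Completion flags reconstructed from the results reported in this lab.
tasks = pd.DataFrame({
    "task_id": [f"T{i:02d}" for i in range(1, 11)],
    "cua_completed":        [True, True, True, True, True, False, True, False, True, False],
    "playwright_completed": [True, True, True, True, True, True, False, False, True, True],
})

cua = tasks["cua_completed"].astype(bool)
pw = tasks["playwright_completed"].astype(bool)

# Label each task with one of four mutually exclusive outcomes.
tasks["outcome"] = np.select(
    [cua & pw, cua & ~pw, ~cua & pw],
    ["both", "cua_only", "playwright_only"],
    default="neither",
)

outcome_counts = tasks["outcome"].value_counts().to_dict()
print(outcome_counts)
print(tasks.loc[tasks["outcome"] == "neither", "task_id"].tolist())
```

The "neither" bucket is easy to miss with pairwise filters: T08 is the one task that both methods failed.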


Step 4: Analyze by Difficulty

Break down success rates by difficulty level:

print("Success rates by difficulty:\n")
for diff in ["easy", "medium", "hard"]:
    subset = tasks[tasks["difficulty"] == diff]
    cua_r = (subset["cua_completed"].sum() / len(subset)) * 100
    pw_r = (subset["playwright_completed"].sum() / len(subset)) * 100
    print(f"  {diff.upper()} ({len(subset)} tasks):")
    print(f"    CUA:        {subset['cua_completed'].sum()}/{len(subset)} = {cua_r:.0f}%")
    print(f"    Playwright: {subset['playwright_completed'].sum()}/{len(subset)} = {pw_r:.0f}%")
    print()

Expected output:

Success rates by difficulty:

  EASY (2 tasks):
    CUA:        2/2 = 100%
    Playwright: 2/2 = 100%

  MEDIUM (3 tasks):
    CUA:        3/3 = 100%
    Playwright: 3/3 = 100%

  HARD (5 tasks):
    CUA:        2/5 = 40%
    Playwright: 3/5 = 60%

Insight

Both methods complete every easy and medium task (100%). The gap appears on hard tasks, where Playwright's selector-based approach has a slight edge (60% vs 40%). However, the one task CUA wins outright (T07) is precisely the kind where Playwright's selectors can't handle dynamic, visual content. With only five hard tasks, a single task moves the rate by 20 points, so treat the gap as indicative rather than conclusive.
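The loop in this step can also be written as one `groupby` expression. The sketch below rebuilds the relevant columns in memory from the results reported in this lab so it runs on its own; with the real file loaded, the same expression applies to `tasks` directly.

```python
import pandas as pd

# Difficulty and completion flags for T01..T10, as reported in this lab.
tasks = pd.DataFrame({
    "difficulty": ["easy"] * 2 + ["medium"] * 3 + ["hard"] * 5,
    "cua_completed":        [True, True, True, True, True, False, True, False, True, False],
    "playwright_completed": [True, True, True, True, True, True, False, False, True, True],
})

rates = (
    tasks.groupby("difficulty")[["cua_completed", "playwright_completed"]]
    .mean()                                # mean of booleans = fraction completed
    .mul(100)                              # convert to percent
    .reindex(["easy", "medium", "hard"])   # groupby sorts keys alphabetically
)
print(rates.round(0))
```

The `reindex` call is only there for presentation: without it, pandas would list the groups as easy, hard, medium.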


Step 5: Screenshot Analysis

CUA takes a screenshot at every step, so a higher screenshot count generally indicates a harder or longer task:

total_screenshots = tasks["cua_screenshots"].sum()
print(f"Total CUA screenshots across all tasks: {total_screenshots}")

print(f"\nScreenshots per task:")
print(tasks[["task_id", "task_description", "difficulty",
             "cua_screenshots", "cua_completed"]].to_string(index=False))

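# groupby sorts keys alphabetically; reindex restores easy -> medium -> hard order
order = ["easy", "medium", "hard"]
avg_by_diff = tasks.groupby("difficulty")["cua_screenshots"].mean().reindex(order)
print("\nAverage screenshots by difficulty:")
print(avg_by_diff.to_string())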

Expected output (task descriptions omitted for brevity):

Total CUA screenshots across all tasks: 122

Screenshots per task:
task_id  difficulty  screenshots  completed
T01      easy        3            True
T02      easy        5            True
T03      medium      8            True
T04      medium      6            True
T05      medium      10           True
T06      hard        18           False
T07      hard        14           True
T08      hard        16           False
T09      hard        22           True
T10      hard        20           False

Average screenshots by difficulty:
easy       4.0
medium     8.0
hard      18.0

Screenshot Cost

Each screenshot is sent to GPT-4o as image tokens. At ~765 tokens per screenshot (typical web page), 122 screenshots ≈ 93,000 tokens. At roughly $5 per million input tokens (an assumed GPT-4o rate; check current pricing), this is about $0.47 in input tokens for the entire benchmark run. CUA is cost-effective for moderate workloads, but the cost adds up for high-volume tasks.
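The arithmetic behind that estimate is short enough to check directly. The per-screenshot token count and the screenshot total come from this lab; the price per million tokens is an assumption (it is the rate that reproduces the $0.47 figure), so substitute the current rate before relying on the result.

```python
TOKENS_PER_SCREENSHOT = 765          # estimate quoted above for a typical web page
TOTAL_SCREENSHOTS = 122              # sum of cua_screenshots in the benchmark
USD_PER_MILLION_INPUT_TOKENS = 5.00  # ASSUMPTION: rate implied by the $0.47 estimate

total_tokens = TOKENS_PER_SCREENSHOT * TOTAL_SCREENSHOTS
cost_usd = total_tokens / 1_000_000 * USD_PER_MILLION_INPUT_TOKENS

print(f"{total_tokens:,} tokens -> ${cost_usd:.2f}")
```

This counts each screenshot once. It does not account for output tokens or for any earlier screenshots being re-sent as conversation context, so it is best read as a lower bound.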


Step 6: Safety Considerations

URL Allowlist

Restrict CUA to approved domains:

# Analyze domain patterns in the dataset
print("URL patterns in tasks:")
print(tasks["url_pattern"].value_counts().to_string())

internal = tasks[tasks["url_pattern"] != "external"]
external = tasks[tasks["url_pattern"] == "external"]
print(f"\nInternal domains: {len(internal)} tasks")
print(f"External domains: {len(external)} tasks")

high_risk = tasks[tasks["safety_risk"] == "high"]
print(f"\nHigh-risk tasks: {len(high_risk)}")
print(high_risk[["task_id", "task_description", "safety_risk", "url_pattern"]].to_string(index=False))

| Boundary | Purpose | Implementation |
| --- | --- | --- |
| URL allowlist | Restrict which sites CUA can visit | allowed_domains = ["*.outdoorgear.com"] |
| Session time limit | Prevent runaway agents | Kill session after 5 minutes of inactivity |
| Action confirmation | Human approval for risky actions | Prompt before form submissions on payment pages |
| Screenshot retention | Audit trail | Save all screenshots with timestamps for review |
| Credential handling | Never expose passwords in screenshots | Use browser-level autofill; keep passwords out of visible fields |
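The session time limit and action confirmation boundaries could be enforced by a small guard that every action passes through before it reaches the browser. The sketch below is one possible shape under stated assumptions: `SessionGuard`, the action dictionary keys (`type`, `page`), and the `confirm` callback are hypothetical, and the clock is injectable so the idle timeout can be exercised without waiting five minutes.

```python
import time


class SessionGuard:
    """Hypothetical guard: idle timeout plus confirmation for risky actions."""

    def __init__(self, idle_limit_s: float = 300.0, clock=time.monotonic):
        self.idle_limit_s = idle_limit_s
        self.clock = clock
        self.last_activity = clock()

    def check(self, action: dict, confirm=lambda action: False) -> bool:
        """Return True if the action may run; raise if the session idled out."""
        now = self.clock()
        if now - self.last_activity > self.idle_limit_s:
            raise TimeoutError("session exceeded the idle limit")
        self.last_activity = now
        # Assumed risk rule: form submissions on payment pages need approval.
        risky = action.get("type") == "submit" and action.get("page") == "payment"
        return confirm(action) if risky else True


# Usage with a fake clock so the example runs instantly.
fake_now = [0.0]
guard = SessionGuard(idle_limit_s=300, clock=lambda: fake_now[0])

allowed_click = guard.check({"type": "click", "page": "catalog"})
blocked_submit = guard.check({"type": "submit", "page": "payment"})
approved_submit = guard.check({"type": "submit", "page": "payment"},
                              confirm=lambda action: True)

fake_now[0] = 301.0  # simulate just over five minutes of inactivity
try:
    guard.check({"type": "click", "page": "catalog"})
    timed_out = False
except TimeoutError:
    timed_out = True

print(allowed_click, blocked_submit, approved_submit, timed_out)
```

The default `confirm` callback denies, so a risky action is blocked unless a human-approval hook is wired in explicitly.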

External Sites

Task T10 targets an external domain (its url_pattern value is external). In production, CUA should never be pointed at external sites without explicit allowlisting. An unconstrained agent could navigate to phishing sites, download malware, or leak sensitive data through form submissions on untrusted domains.
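A minimal allowlist check, assuming the `*.outdoorgear.com` pattern from the table in this step, might look like the following. `is_allowed` and `ALLOWED_DOMAINS` are hypothetical names, and the example hostnames are made up. The check matches on the parsed hostname rather than the raw URL string, so an allowlisted name that merely appears in a path or query string does not pass.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

ALLOWED_DOMAINS = ["*.outdoorgear.com"]  # pattern from the safety table


def is_allowed(url: str, allowed=ALLOWED_DOMAINS) -> bool:
    """Allow only http(s) URLs whose hostname matches an allowlisted pattern."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    host = parts.hostname.lower()
    return any(fnmatchcase(host, pattern) for pattern in allowed)


checks = {
    "https://shop.outdoorgear.com/cart": is_allowed("https://shop.outdoorgear.com/cart"),
    "https://outdoorgear.com.evil.example/login": is_allowed("https://outdoorgear.com.evil.example/login"),
    "https://evil.example/?next=shop.outdoorgear.com": is_allowed("https://evil.example/?next=shop.outdoorgear.com"),
    "https://outdoorgear.com/": is_allowed("https://outdoorgear.com/"),
}
for url, ok in checks.items():
    print(f"{'ALLOW' if ok else 'BLOCK'}  {url}")
```

Note that `*.outdoorgear.com` requires a subdomain, so the bare `outdoorgear.com` is blocked; add it as a separate entry if the apex domain should be reachable.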


🐛 Bug-Fix Exercise

The file lab-058/broken_cua.py has 3 bugs in the CUA analysis functions. Can you find and fix them all?

Run the self-tests to see which ones fail:

python lab-058/broken_cua.py

You should see 3 failed tests. Each test corresponds to one bug:

| Test | What it checks | Hint |
| --- | --- | --- |
| Test 1 | CUA success rate | Should use cua_completed column, not playwright_completed |
| Test 2 | Total CUA screenshots | Should use sum(), not max() |
| Test 3 | CUA success rate by difficulty | Must filter by the difficulty parameter before computing rate |

Fix all 3 bugs, then re-run. When you see 🎉 All 3 tests passed, you're done!


🧠 Knowledge Check

Q1 (Multiple Choice): What is the key difference between CUA and Playwright for browser automation?
  • A) CUA is faster because it skips page rendering
  • B) CUA uses vision/screenshots to understand pages, while Playwright uses code-based CSS selectors
  • C) Playwright can handle CAPTCHAs but CUA cannot
  • D) CUA requires access to the page's HTML source code
✅ Reveal Answer

Correct: B) CUA uses vision/screenshots to understand pages, while Playwright uses code-based CSS selectors

CUA sends screenshots to a vision-language model (GPT-4o) and receives click/type actions based on what it "sees", much like a human looking at a screen. Playwright interacts with the DOM directly using CSS selectors, XPath, or ARIA roles. This fundamental difference means CUA is more flexible (works on any visual interface) while Playwright is more precise (direct DOM access).

Q2 (Multiple Choice): When is CUA a better choice than Playwright?
  • A) For high-volume, repetitive tasks on stable pages
  • B) For dynamic sites without stable CSS selectors
  • C) When you need deterministic, reproducible test results
  • D) When the page has a well-documented API
✅ Reveal Answer

Correct: B) For dynamic sites without stable CSS selectors

CUA excels on sites where the DOM structure changes frequently: dynamic SPAs, sites with A/B testing, or pages with randomized element IDs. Because CUA "sees" the page visually, it doesn't depend on CSS selectors that might break with every deployment. Playwright is better for stable, well-structured sites where selectors are reliable.

Q3 (Run the Lab): What is the CUA success rate?

Count tasks where cua_completed == True and divide by total tasks.

✅ Reveal Answer

70%

7 out of 10 tasks were completed successfully by CUA. The 3 failures (T06, T08, T10) were all hard difficulty tasks involving multi-tab comparison, CAPTCHA handling, and dynamic SPA navigation.

Q4 (Run the Lab): What is the Playwright success rate?

Count tasks where playwright_completed == True and divide by total tasks.

✅ Reveal Answer

80%

8 out of 10 tasks were completed successfully by Playwright. The 2 failures (T07, T08) involved a screenshot-attachment upload (which requires visual reasoning beyond selectors) and a CAPTCHA-protected form (which neither method could handle).

Q5 (Run the Lab): What is the total number of CUA screenshots across all tasks?

Compute tasks["cua_screenshots"].sum().

✅ Reveal Answer

122

Sum of all screenshots: 3 + 5 + 8 + 6 + 10 + 18 + 14 + 16 + 22 + 20 = 122 screenshots. Hard tasks required significantly more screenshots (avg 18) compared to easy tasks (avg 4), reflecting the additional reasoning steps needed for complex workflows.


Summary

| Topic | What You Learned |
| --- | --- |
| CUA Architecture | GPT-4o vision drives a cloud browser via screenshot→action loop |
| Playwright Architecture | Code-based selectors interact directly with the DOM |
| CUA vs Playwright | CUA: 70% success, flexible; Playwright: 80% success, precise |
| Difficulty Impact | Both methods ace easy/medium; hard tasks reveal their differences |
| Screenshot Overhead | 122 total screenshots; hard tasks average 4.5× as many as easy tasks (18 vs 4) |
| Safety Design | URL allowlists, session limits, credential isolation, audit trails |

Next Steps