What is a Browser Agent?
A Browser Agent combines:
- Large Language Models (LLMs) for reasoning and decision-making
- Browser Sessions for executing actions
- Vision capabilities to understand web pages
- Autonomous planning to complete multi-step tasks
Quick Start
Create and run an agent in a few lines (see agent_quickstart.py).
How Agents Work
1. Observation
The agent observes the current page state:
- Visible elements and their properties
- Interactive components (buttons, forms, links)
- Text content and structure
- Current URL and page metadata
2. Reasoning
Using the LLM, the agent:
- Understands the current page
- Plans the next action to complete the task
- Decides which element to interact with
- Determines when the task is complete
3. Action
The agent executes browser actions:
- Navigate to URLs
- Click buttons and links
- Fill forms
- Extract data
- Scroll and interact with dynamic content
4. Iteration
This cycle repeats until:
- The task is successfully completed
- Maximum steps are reached
- An error occurs that can’t be resolved
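The four phases above can be sketched as a single loop. This is a minimal illustration rather than the product's actual API: `Observation`, `Action`, `RunResult`, and `run_agent_loop` are hypothetical names, and the `observe`, `decide`, and `act` callables stand in for the browser session and the LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Observation:
    """Snapshot of the page state the agent reasons over (hypothetical shape)."""
    url: str
    text: str
    interactive_elements: list[str] = field(default_factory=list)


@dataclass
class Action:
    """One browser action chosen by the reasoning step (hypothetical shape)."""
    kind: str                      # e.g. "navigate", "click", "fill", "done"
    target: Optional[str] = None


@dataclass
class RunResult:
    success: bool
    steps: int
    reason: str


def run_agent_loop(
    observe: Callable[[], Observation],
    decide: Callable[[str, Observation], Action],
    act: Callable[[Action], None],
    task: str,
    max_steps: int = 10,
) -> RunResult:
    """Repeat observe -> reason -> act until one of the stop conditions holds."""
    for step in range(1, max_steps + 1):
        observation = observe()               # 1. Observation
        action = decide(task, observation)    # 2. Reasoning (an LLM call in practice)
        if action.kind == "done":
            return RunResult(True, step, "task completed")
        try:
            act(action)                       # 3. Action
        except Exception as exc:              # an error that cannot be resolved
            return RunResult(False, step, f"error: {exc}")
    return RunResult(False, max_steps, "maximum steps reached")


# Toy stand-ins so the sketch runs without a browser or an LLM.
pages = {"start": ["go to results"], "results": []}
state = {"page": "start"}


def fake_observe() -> Observation:
    page = state["page"]
    return Observation(
        url=f"https://example.com/{page}",
        text=page,
        interactive_elements=pages[page],
    )


def fake_decide(task: str, obs: Observation) -> Action:
    # Click the first available element; report completion when none remain.
    if obs.interactive_elements:
        return Action("click", obs.interactive_elements[0])
    return Action("done")


def fake_act(action: Action) -> None:
    state["page"] = "results"


result = run_agent_loop(fake_observe, fake_decide, fake_act, task="open the results page")
print(result)
```

Each of the three exits from the loop corresponds to one of the stop conditions listed above: a `done` action, an exception from `act`, or running out of steps.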
Agents vs Scripted Automation
Both agents and scripted automation run on browser sessions, the cloud browser infrastructure. The difference is how you control what happens in that session.

| Aspect | Scripted Automation | Agent |
|---|---|---|
| Control | You write the code | AI decides each step |
| Flexibility | Fixed workflow | Adapts to changes |
| Speed | Fast (direct execution) | Slower (LLM reasoning per step) |
| Cost | Browser minutes only | Browser minutes + LLM calls |
| Reliability | Deterministic | Can vary based on page state |
| Use Case | Known, stable workflows | Unknown or dynamic workflows |
Use scripted automation when:
- You know the exact steps to take
- Speed and cost are critical
- The target pages rarely change

Use an agent when:
- You don’t know the exact steps
- Pages change frequently
- You need intelligent decision-making
You can combine both approaches: use an agent to figure out a workflow, then convert it to a function for faster, cheaper repeated execution.
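One way to picture combining the two approaches is a script-first wrapper that hands the task to an agent only when the script fails. The sketch below is an assumption about how such a pattern could look, not the platform's implementation: `run_with_agent_fallback`, `scripted_checkout`, and `agent_checkout` are hypothetical names, and both callables are stubs.

```python
from typing import Callable


def run_with_agent_fallback(
    scripted: Callable[[], str],
    agent: Callable[[str], str],
    task: str,
) -> tuple[str, str]:
    """Try the fast scripted path first; hand the task to an agent if it fails.

    Returns (path_used, result). Both callables are hypothetical stand-ins.
    """
    try:
        return "script", scripted()
    except Exception as exc:
        # The script broke (for example, a selector no longer matches),
        # so describe the goal in plain language and let the agent adapt.
        return "agent", agent(f"{task} (script failed with: {exc})")


def scripted_checkout() -> str:
    # Simulates a page change that breaks a hard-coded selector.
    raise LookupError("button #buy-now not found")


def agent_checkout(task: str) -> str:
    # Stub for an agent run; a real agent would drive a browser session here.
    return f"completed: {task}"


path, result = run_with_agent_fallback(
    scripted_checkout, agent_checkout, "Buy the item in the cart"
)
print(path, result)
```

The scripted path stays cheap and deterministic for the common case, and the LLM cost is paid only on the runs where the page has changed.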
Agent Capabilities
Agents come with powerful built-in capabilities:
- Structured Output: Get type-safe responses using Pydantic models
- Vaults & Personas: Use credentials and identities in automations
- Visual Understanding: Analyze images and visual page elements
- Replay & Debugging: Debug with MP4 replays of agent execution
- Agent Fallback: Automatic recovery from script failures
- Batch Execution: Run multiple agents in parallel
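As a rough illustration of the batch execution idea, the sketch below runs several agent tasks concurrently with Python's `asyncio`. It assumes nothing about the real SDK: `run_agent` is a hypothetical stub that sleeps instead of driving a browser, and `run_batch` with its `max_concurrent` limit is one possible design, not the built-in feature.

```python
import asyncio


async def run_agent(task: str) -> str:
    """Hypothetical stand-in for starting one agent and awaiting its result."""
    await asyncio.sleep(0.01)  # simulates browser and LLM latency
    return f"done: {task}"


async def run_batch(tasks: list[str], max_concurrent: int = 3) -> list[str]:
    """Run one agent per task, with at most max_concurrent running at once."""
    limit = asyncio.Semaphore(max_concurrent)

    async def guarded(task: str) -> str:
        async with limit:
            return await run_agent(task)

    # gather returns results in the same order as the input tasks.
    return await asyncio.gather(*(guarded(t) for t in tasks))


results = asyncio.run(
    run_batch(["check price on site A", "check price on site B"])
)
print(results)
```

Capping concurrency with a semaphore keeps a large batch from opening more browser sessions at once than a plan or rate limit might allow.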

