How to Use Browser Use to Automate Browser Tasks With AI

April 18, 2026 · Editorial Team

Most browser automation tools ask you to write selectors, record clicks, or maintain a script that breaks every time the target site redesigns its nav. Browser Use takes a different approach: you describe what you want in plain English, an LLM interprets the current page state, and an agent executes the steps by controlling a real Chromium browser. No selectors, no click recordings.

Browser Use is an open-source Python library, which means you run it on your own machine or server, it costs you LLM API calls rather than a SaaS subscription, and you can read the source code when something behaves unexpectedly. It's genuinely useful for tasks that don't fit into structured API flows, but it's also not magic. Here's how to set it up and where it works well.

Prerequisites and Installation

You need Python 3.11 or higher, and either an OpenAI API key or an Anthropic API key depending on which model you want to use.

Create a virtual environment first (skipping this step leads to dependency conflicts):

python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate

Install Browser Use, the langchain-openai package that the examples below import, and Playwright (the browser controller):

pip install browser-use langchain-openai
playwright install chromium

The playwright install step downloads Chromium. It's about 150MB and only needs to run once per environment.

Set your API key as an environment variable:

export OPENAI_API_KEY="sk-..."
# or for Anthropic:
export ANTHROPIC_API_KEY="sk-ant-..."
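
Before launching an agent, it can help to fail fast when the environment isn't ready. The following is a minimal sketch, not part of Browser Use: preflight is a hypothetical helper that checks the Python version and looks for either of the two key variables named above.

```python
import os
import sys

# Hypothetical helper, not part of Browser Use: verifies the prerequisites
# described above before any agent code runs.
KEY_NAMES = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")

def preflight(env=None, version=None):
    """Return the name of the first API key variable that is set."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else version
    if version < (3, 11):
        raise RuntimeError(
            f"Python 3.11+ required, found {version[0]}.{version[1]}"
        )
    for name in KEY_NAMES:
        if env.get(name):
            return name
    raise RuntimeError("Set OPENAI_API_KEY or ANTHROPIC_API_KEY first")

# Example with an explicit environment instead of the real one:
print(preflight(env={"ANTHROPIC_API_KEY": "sk-ant-..."}, version=(3, 12)))
```

Calling preflight() with no arguments checks the real interpreter and environment; the env and version parameters exist only so the check is easy to try out.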

Your First Task: Five Lines of Code

Here's the minimal working example:

import asyncio
from browser_use import Agent
from langchain_openai import ChatOpenAI

async def main():
    agent = Agent(
        task="Go to news.ycombinator.com and return the titles of the top 5 posts",
        llm=ChatOpenAI(model="gpt-4o"),
    )
    # run() returns the agent's history; final_result() extracts the answer text
    result = await agent.run()
    print(result.final_result())

asyncio.run(main())

Save this as main.py and run it with python main.py. A Chromium window opens and navigates to Hacker News, while the terminal shows a trace of what the LLM decides at each step. After 30-60 seconds, the titles are printed to the console.

That's it for a basic task. The agent handles all the "find the element, read the text, return it" logic itself.

Writing Effective Task Descriptions

The quality of your task description directly affects how reliably the agent completes it. I've tested a lot of these and there's a clear pattern: specificity about the outcome matters more than specifying the steps.

Vague (breaks often):

Fill out the contact form on example.com

Better:

Go to example.com/contact. Fill in the Name field with "Jane Smith", 
the Email field with "[email protected]", and the Message field with 
"I'd like to schedule a demo for my team of 10 people." 
Click the Submit button. Confirm the form submitted successfully by 
checking for a success message on the page.

The extra detail gives the LLM clear success criteria. When it's vague, the agent sometimes decides it's done when it isn't, or tries steps that aren't necessary.
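
If you fill forms often, you can generate descriptions in this style instead of typing them by hand. This is only a sketch: form_task is a hypothetical helper, and the wording it produces is modeled on the "Better" example above, not on anything Browser Use requires.

```python
def form_task(url, fields, submit_label="Submit"):
    """Build a task string that names each field value and a success check."""
    lines = [f"Go to {url}."]
    for label, value in fields.items():
        lines.append(f'Fill in the {label} field with "{value}".')
    lines.append(f"Click the {submit_label} button.")
    # End with an explicit success criterion so the agent knows when it is done
    lines.append(
        "Confirm the form submitted successfully by checking for a "
        "success message on the page."
    )
    return " ".join(lines)

task = form_task(
    "example.com/contact",
    {"Name": "Jane Smith", "Message": "I'd like to schedule a demo."},
)
print(task)
```

The resulting string can be passed as the task argument to Agent in the same way as a hand-written one.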

For multi-step tasks, break them into numbered steps in the task string:

task = """
1. Go to linkedin.com and log in (credentials are stored in the browser)
2. Navigate to the job search page and search for "product manager" in "San Francisco"
3. Filter results to show only jobs posted in the last 7 days
4. Return the titles and company names of the first 10 results as a list
"""

Practical Task Categories: Where It Works Well

Browser Use handles some task types much better than others.

Good fits:

Scraping pages that block API access and require real browser rendering (review sites, some job boards)
Filling forms on sites whose anti-bot detection blocks Selenium
Navigation tasks on internal tools where you control the environment
One-off tasks where writing a full Selenium script isn't worth the time

Unreliable fits:

Tasks that require CAPTCHA solving (the agent will pause and wait, or fail)
Multi-factor authentication flows (same issue)
Very long multi-step tasks (15+ steps) with many conditional branches, where small errors compound
Sites with heavy JavaScript animations that change the DOM while the agent is reading it

Here's a simple comparison of when to reach for Browser Use vs alternatives:

Scenario	Better tool
Stable site with public API	Direct HTTP requests
Stable site, no API, same task daily	Playwright with written selectors
Unpredictable UI, one-off task	Browser Use
Need to handle login + complex nav	Browser Use with caution
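
The table can be read as a small decision function. The sketch below is one interpretation, not an official rule: pick_tool and its parameters are hypothetical, and the order in which the conditions are checked is an assumption the table itself doesn't specify.

```python
def pick_tool(has_public_api, stable_ui, recurring, login_or_complex_nav=False):
    """Map the scenarios in the comparison table to a suggested tool."""
    if has_public_api and stable_ui:
        return "Direct HTTP requests"
    if login_or_complex_nav:
        return "Browser Use with caution"
    if stable_ui and recurring:
        # Same task on a stable site every day: selectors pay for themselves
        return "Playwright with written selectors"
    # Unpredictable UI or a one-off task
    return "Browser Use"

print(pick_tool(has_public_api=False, stable_ui=True, recurring=True))
print(pick_tool(has_public_api=False, stable_ui=False, recurring=False))
```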

Adding Memory and Custom Actions

For more sophisticated workflows, Browser Use supports custom action functions that the agent can call. This lets the agent do things like read from a file or write to a database mid-task.

from browser_use import Agent, Controller
from langchain_openai import ChatOpenAI

controller = Controller()

@controller.action("Save the result to results.txt")
def save_result(result: str):
    with open("results.txt", "a") as f:
        f.write(result + "\n")
    return "Saved successfully"

agent = Agent(
    task="Go to example.com/blog and save the title of each post to results.txt",
    llm=ChatOpenAI(model="gpt-4o"),
    controller=controller,
)

The agent can now call your save_result function when it decides it has data worth saving. This is the pattern for building agents that accumulate results across multiple pages without relying on the final return value alone.
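
One thing to watch with an append-only action: if the agent calls it twice for the same item, the file ends up with duplicates. The sketch below shows one way to guard against that in the function body itself. It is plain Python with no Browser Use imports; append_result, the JSON Lines format, and the file name are all illustrative choices, and you would still register a wrapper for it with the decorator shown above.

```python
import json
from pathlib import Path

def append_result(path, title, url=""):
    """Append one record as a JSON line unless the same title is already saved."""
    file = Path(path)
    seen = set()
    if file.exists():
        # Re-read the titles saved so far to detect repeats
        for line in file.read_text(encoding="utf-8").splitlines():
            if line.strip():
                seen.add(json.loads(line)["title"])
    if title in seen:
        return "Already saved"
    with file.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"title": title, "url": url}) + "\n")
    return "Saved successfully"
```

Returning a short status string, as the original save_result does, gives the agent something to read back after each call.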

Controlling Cost and Speed

LLM calls are the main cost here. Each page interaction sends the current page state (DOM, screenshot, or both depending on config) to the model for a decision. A simple task might cost 5-10 API calls. A complex 15-step task can cost 30-50.

On GPT-4o, each call runs about $0.01-0.03 depending on page complexity, so short tasks typically cost under $0.20. On GPT-4o-mini, costs drop significantly, but so does accuracy on complex pages. I've found GPT-4o is worth the extra cost for anything involving forms or multi-step navigation, while GPT-4o-mini handles simple scraping tasks just fine.
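
The arithmetic behind these estimates is easy to sketch. The helper below is hypothetical (estimate_cost is not part of Browser Use) and simply multiplies the call counts and per-call GPT-4o prices quoted above; real bills depend on how many tokens each page state consumes.

```python
def estimate_cost(min_calls, max_calls, low_per_call=0.01, high_per_call=0.03):
    """Return (lowest, highest) estimated USD cost for a range of call counts."""
    lowest = round(min_calls * low_per_call, 2)
    highest = round(max_calls * high_per_call, 2)
    return (lowest, highest)

print(estimate_cost(5, 10))   # simple task          -> (0.05, 0.3)
print(estimate_cost(30, 50))  # complex 15-step task -> (0.3, 1.5)
```

By these figures, a 5-10 call task lands between about $0.05 and $0.30, while a 30-50 call task lands between about $0.30 and $1.50.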

To reduce unnecessary calls, keep tasks focused. A task that opens five pages and collects data from each costs more than five separate small tasks, but it's also easier to coordinate in a single agent run. There's a tradeoff.

When It Fails (and What to Do)

The most common failure mode is the agent getting "stuck" on a page it doesn't understand. It will try the same action two or three times, decide it can't proceed, and return an error.

When this happens:

Check the terminal trace to see which step it failed on
Simplify the task description for that specific step
Add a hint like "if you see a cookie banner, dismiss it first"
Consider breaking the task into two separate agent runs with the intermediate result passed between them
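
The last suggestion, splitting a task into two runs, can be sketched as follows. The names here are hypothetical: run_task stands for any async function that takes a task string and returns a result (in practice, one that builds an Agent and awaits agent.run()), and fake_runner is a stand-in so the example runs without a browser.

```python
import asyncio

async def run_in_stages(run_task, first_task, second_template):
    """Run two tasks in sequence, feeding the first result into the second."""
    intermediate = await run_task(first_task)
    if not intermediate:
        raise RuntimeError("First stage returned nothing; not starting stage two")
    # {previous} in the template is replaced with the first stage's result
    return await run_task(second_template.format(previous=intermediate))

# Stand-in runner for demonstration only; it just echoes the task it was given.
async def fake_runner(task):
    return f"done: {task}"

result = asyncio.run(run_in_stages(
    fake_runner,
    "Find the URL of the newest post on example.com/blog",
    "Open this page and return its title: {previous}",
))
print(result)
```

Keeping each stage short means a failure points at one specific stage, and only that stage needs to be rerun.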

For production use, wrap the agent in a retry loop with a maximum of 2-3 attempts. Browser Use tasks that fail on the first try often succeed on the second with the exact same task string, because the LLM samples differently on each run.

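from browser_use import Agent

async def run_with_retries(task, llm, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            # Build a fresh agent per attempt so each retry starts clean
            agent = Agent(task=task, llm=llm)
            return await agent.run()
        except Exception as e:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            print(f"Attempt {attempt + 1} failed: {e}, retrying...")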

Browser Use isn't production-hardened in the same way a commercial tool is, but for the task categories it handles well, it's faster to deploy than writing and maintaining a full Playwright test suite.