# Browser Automation with Playwright
## What you'll build
By the end of this lesson you will be able to launch a headless browser from Python, navigate to a JavaScript-rendered page, wait for the content to appear, extract data, interact with elements (clicks, form fills), and take screenshots for debugging. You will also know when a real browser is genuinely necessary and when it is overkill.
## Concepts

### When you actually need a real browser
A real browser is slower, heavier, and harder to scale than plain HTTP. Only reach for it when you genuinely need it:
- The page renders all its content with JavaScript (SPA / React / Vue / Angular); `curl` shows an empty shell.
- Login requires JavaScript-based challenges (hCaptcha, fingerprint checks).
- You need to interact with the page: click buttons, fill forms, trigger infinite scroll.
- The site checks for browser-only signals (canvas fingerprint, WebGL, Web Audio API).
If `curl` or `requests` returns the data you need, use that. Browser automation is a last resort, not a first choice.
### Installing Playwright
```bash
pip install playwright
playwright install chromium  # installs the browser binaries
```
`playwright install` downloads browser binaries to a local directory. Run it once per machine. If you only need Chromium, `playwright install chromium` is faster than installing all three browsers.
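To confirm that both the package and the browser binary are in place, a quick sanity check helps (a minimal sketch; `browser.version` is a standard Playwright property):

```python
# Sanity check: launch Chromium once and print its version
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    print("Chromium launched OK, version:", browser.version)
    browser.close()
```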
### Sync vs async API
Playwright for Python has two APIs:
- `sync_playwright`: blocking; works in regular scripts. (It will not run inside Jupyter notebooks, which already run an asyncio event loop; use the async API there.)
- `async_playwright`: non-blocking; works with `asyncio`.
For scraping scripts, the sync API is simpler:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    print(page.title())
    browser.close()
```
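For completeness, here is the same script against the async API, a minimal sketch for when your code already runs inside `asyncio` (or a Jupyter notebook):

```python
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://quotes.toscrape.com/js/")
        print(await page.title())
        await browser.close()

asyncio.run(main())  # in Jupyter, use `await main()` instead
```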
### Locators and waits
A `Locator` is a lazy reference to an element: it is evaluated at the moment you interact with it. This is fundamentally different from Selenium's `find_element`, which looks up the element immediately.
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/", wait_until="networkidle")

    # Locator: waits automatically when you act on it
    quotes = page.locator("div.quote")
    count = quotes.count()
    print(f"Found {count} quotes")

    first_text = quotes.first.locator("span.text").inner_text()
    print(first_text)

    browser.close()
```
wait_until="networkidle" waits until there have been no network requests for 500 ms, good for SPAs that fetch data on load. Alternatives:
"load", fires when theloadevent fires (DOM + most subresources)"domcontentloaded", fires as soon as the DOM is parsed"commit", fires as soon as the response starts arriving
For JavaScript-heavy pages, "networkidle" is the safest. For faster pages that you know finish quickly, "load" is fine.
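A pattern worth knowing, sketched below: navigate with a cheap wait, then wait explicitly for the element you care about. On chatty pages this is often more robust than `"networkidle"`, and it leads directly into the next section:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Cheap navigation wait: returns as soon as the DOM is parsed...
    page.goto("https://quotes.toscrape.com/js/", wait_until="domcontentloaded")

    # ...then block until the JavaScript has actually rendered the content
    page.wait_for_selector("div.quote")
    print(page.locator("div.quote").count(), "quotes rendered")

    browser.close()
```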
### Explicit waits
Never use `time.sleep()` in Playwright scripts. Use explicit waits that resolve the moment the condition is met:
```python
# Wait for a specific element to appear
page.wait_for_selector("div.quote", timeout=10_000)  # 10 seconds max

# Wait for a network response matching a URL pattern
with page.expect_response("**/api/quotes*") as resp_info:
    page.goto("https://example.com/quotes")
response = resp_info.value
print(response.json())

# Wait for navigation to complete after a click
with page.expect_navigation():
    page.click("a.next-page")
```
### Clicks, form fills, and key presses
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate to login page
    page.goto("https://quotes.toscrape.com/login")

    # Fill form fields
    page.fill("input#username", "testuser")
    page.fill("input#password", "testpass")

    # Click the submit button
    page.click("input[type='submit']")

    # Wait for navigation after submit
    page.wait_for_url("https://quotes.toscrape.com/", timeout=5000)
    print("Logged in, current URL:", page.url)

    browser.close()
```
For keyboard interaction:
```python
page.press("input#search", "Enter")                       # press a key
page.keyboard.press("Control+A")                          # keyboard shortcut
page.type("input#search", "machine learning", delay=50)   # type with a delay
```
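One interaction listed at the start of this lesson, infinite scroll, deserves a sketch of its own. The site and the `div.item` selector below are hypothetical; the pattern is to scroll, pause briefly, and stop once the item count stops growing:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/feed")  # hypothetical infinite-scroll page

    previous = -1
    while True:
        count = page.locator("div.item").count()  # hypothetical item selector
        if count == previous:
            break  # no new items appeared; assume we've reached the end
        previous = count
        page.mouse.wheel(0, 2000)    # scroll down to trigger loading
        page.wait_for_timeout(1000)  # fixed pause; acceptable in a demo loop

    print(f"Loaded {previous} items")
    browser.close()
```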
### Screenshots and debugging
```python
# Full-page screenshot
page.screenshot(path="debug.png", full_page=True)

# Screenshot of a specific element
page.locator("div.quote").first.screenshot(path="first_quote.png")

# Get the full HTML after JavaScript has run
html = page.content()

# Evaluate JavaScript directly
title = page.evaluate("() => document.title")
```
Always take a screenshot when your scraper cannot find an element; you will often discover that the page showed a cookie consent banner or a redirect that blocked the content.
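A small helper along these lines (the name `dump_debug` is ours, not a Playwright API) saves both artifacts in one call:

```python
from playwright.sync_api import Page

def dump_debug(page: Page, label: str = "debug") -> None:
    """Save a screenshot and the post-JavaScript HTML for inspection."""
    page.screenshot(path=f"{label}.png", full_page=True)
    with open(f"{label}.html", "w", encoding="utf-8") as f:
        f.write(page.content())
    print(f"Saved {label}.png and {label}.html")
```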
### Brief Selenium comparison
Selenium is older and works with all major browsers. Playwright is newer and generally faster, more reliable, and has a better Python API.
| Feature | Playwright | Selenium |
|---|---|---|
| Install | `pip install playwright && playwright install` | `pip install selenium` + driver binary |
| Waits | Auto-wait on every action | Manual `WebDriverWait` required |
| Speed | Faster | Slower |
| Async | Native async API | Requires threading workarounds |
| Intercept requests | Yes, built-in | Not natively |
| Community | Growing fast | Established, more Stack Overflow answers |
For new projects, start with Playwright. Use Selenium if you are maintaining existing code or need to test a site in Internet Explorer (yes, it still happens).
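The "intercept requests" row is worth a concrete example. A minimal sketch using `page.route` to block images, a common way to speed up scraping (the brace glob follows the URL pattern syntax in Playwright's docs):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Abort requests for common image formats before they leave the browser
    page.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())

    page.goto("https://quotes.toscrape.com/js/")
    print(page.title())  # the page still renders, just without images
    browser.close()
```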
## Hands-on
`quotes.toscrape.com/js/` is a version of the quotes site that loads content via JavaScript. Plain `requests` gets an empty page. Let us scrape it with Playwright.
```python
import json
from playwright.sync_api import sync_playwright

def scrape_js_quotes(max_pages: int = 3) -> list[dict]:
    results = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            )
        )
        page = context.new_page()

        for page_num in range(1, max_pages + 1):
            url = f"https://quotes.toscrape.com/js/page/{page_num}/"
            page.goto(url, wait_until="networkidle", timeout=15_000)

            # Wait until at least one quote is visible
            page.wait_for_selector("div.quote", timeout=10_000)

            quote_divs = page.locator("div.quote")
            count = quote_divs.count()
            for i in range(count):
                q = quote_divs.nth(i)
                text = q.locator("span.text").inner_text().strip()
                author = q.locator("small.author").inner_text().strip()
                tags = [t.inner_text() for t in q.locator("a.tag").all()]
                results.append({"text": text, "author": author, "tags": tags, "page": page_num})

            print(f"Page {page_num}: scraped {count} quotes")

            # Check for next page
            next_btn = page.locator("li.next a")
            if next_btn.count() == 0:
                print("No more pages")
                break

        browser.close()
    return results

if __name__ == "__main__":
    quotes = scrape_js_quotes(max_pages=3)
    print(f"\nTotal: {len(quotes)} quotes")
    # ensure_ascii=False keeps the curly quotes readable in the output
    print(json.dumps(quotes[0], indent=2, ensure_ascii=False))
```
Expected output:
```text
Page 1: scraped 10 quotes
Page 2: scraped 10 quotes
Page 3: scraped 10 quotes

Total: 30 quotes
{
  "text": "“The world as we have created it is a process of our thinking.”",
  "author": "Albert Einstein",
  "tags": [
    "change",
    "deep-thoughts",
    "thinking",
    "world"
  ],
  "page": 1
}
```
Now verify that plain `requests` cannot see this content:
```python
import requests

r = requests.get("https://quotes.toscrape.com/js/page/1/", timeout=10)

# Count how many quote divs appear in the raw HTML
count = r.text.count('class="quote"')
print(f"Quotes found in raw HTML: {count}")  # 0
```
This confirms that the JavaScript-rendered path genuinely needs a browser.
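If you want to keep the results, a short follow-up works (assuming `scrape_js_quotes` from the script above is in scope; the filename is arbitrary):

```python
import json

quotes = scrape_js_quotes(max_pages=3)
with open("quotes.json", "w", encoding="utf-8") as f:
    json.dump(quotes, f, ensure_ascii=False, indent=2)  # keep curly quotes readable
print(f"Wrote {len(quotes)} quotes to quotes.json")
```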
## Common pitfalls
- **Using `time.sleep()` instead of Playwright's waits.** A fixed sleep is always either too short (race condition) or too long (slow). `wait_for_selector()` resolves the moment the element appears.
- **Not closing the browser in error paths.** If an exception occurs before `browser.close()`, the browser process keeps running. Always use the `with sync_playwright() as p:` context manager; it closes the browser automatically.
- **Headless mode rendering differences.** Some sites detect headless Chromium and show different content. Try `headless=False` to debug, then look at what differs. The fix is usually a missing `user_agent` or viewport size.
- **`wait_until="networkidle"` timing out.** Some SPAs keep polling for analytics or websocket connections, so the network never goes fully idle. Try `"load"` or wait explicitly for the content element instead: `page.wait_for_selector("div.data-loaded")`.
- **Not setting viewport size.** The default viewport may trigger mobile layouts. Set `context = browser.new_context(viewport={"width": 1280, "height": 800})` for a desktop-sized browser.
- **Treating Playwright as a universal solution.** Browser automation is 5-20x slower than plain HTTP and harder to scale. If the data is available via a hidden JSON API (check the XHR tab in DevTools), use that instead; see the sketch after this list.
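Here is that last check as code, a hedged sketch: the endpoint below is hypothetical, and you would find the real one in the DevTools Network tab:

```python
import requests

candidate = "https://example.com/api/quotes?page=1"  # hypothetical endpoint
r = requests.get(candidate, timeout=10)

if r.ok and "application/json" in r.headers.get("Content-Type", ""):
    print("Hidden JSON API found; skip the browser entirely:")
    print(r.json())
else:
    print("No JSON API here; a browser may be necessary after all.")
```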
## What to try next
- Add `page.screenshot(path=f"page_{page_num}.png")` inside the loop and inspect the screenshots. Find one that shows unexpected content (banner, redirect, etc.).
- Use `page.expect_response()` to intercept any XHR calls made by the JS-rendered quotes page. Do the API calls reveal any interesting undocumented endpoints?
- Try launching with `headless=False` (a real visible browser window) and watch the scraper run. Then switch back to `headless=True` and add `slow_mo=500` (500 ms delay between actions) to simulate watching without a visible window.