# Browser Automation with Playwright
## What you'll build
By the end of this lesson you will be able to launch a headless browser from Python, navigate to a JavaScript-rendered page, wait for the content to appear, extract data, interact with elements (clicks, form fills), and take screenshots for debugging. You will also know when a real browser is genuinely necessary and when it is overkill.
## Concepts

### When you actually need a real browser
A real browser is slower, heavier, and harder to scale than plain HTTP. Only reach for it when you genuinely need it:
- The page renders all its content with JavaScript (SPA / React / Vue / Angular); `curl` shows an empty shell.
- Login requires JavaScript-based challenges (hCaptcha, fingerprint checks).
- You need to interact with the page: click buttons, fill forms, trigger infinite scroll.
- The site checks for browser-only signals (canvas fingerprint, WebGL, Web Audio API).
If `curl` or `requests` returns the data you need, use that. Browser automation is a last resort, not a first choice.
### Installing Playwright
```bash
pip install playwright
playwright install chromium  # installs the browser binaries
```
`playwright install` downloads browser binaries to a local directory. Run it once per machine. If you only need Chromium, `playwright install chromium` is faster than installing all three browsers.
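To confirm that both the package and the browser binary are in place, a quick sanity check helps (a minimal sketch; `browser.version` is a standard Playwright property):

```python
# Sanity check: launch Chromium once and print its version
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    print("Chromium launched OK, version:", browser.version)
    browser.close()
```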
### Sync vs async API
Playwright for Python has two APIs:
- `sync_playwright`: blocking; works in regular scripts. (It will not run inside Jupyter notebooks, which already run an asyncio event loop; use the async API there.)
- `async_playwright`: non-blocking; works with `asyncio`.
For scraping scripts, the sync API is simpler:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/")
    print(page.title())
    browser.close()
```
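For completeness, here is the same script against the async API, a minimal sketch for when your code already runs inside `asyncio` (or a Jupyter notebook):

```python
import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://quotes.toscrape.com/js/")
        print(await page.title())
        await browser.close()

asyncio.run(main())  # in Jupyter, use `await main()` instead
```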
### Locators and waits
A `Locator` is a lazy reference to an element: it is evaluated at the moment you interact with it. This is fundamentally different from Selenium's `find_element`, which looks up the element immediately.
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com/js/", wait_until="networkidle")

    # Locator: waits automatically when you act on it
    quotes = page.locator("div.quote")
    count = quotes.count()
    print(f"Found {count} quotes")

    first_text = quotes.first.locator("span.text").inner_text()
    print(first_text)

    browser.close()
```
wait_until="networkidle" waits until there have been no network requests for 500 ms, good for SPAs that fetch data on load. Alternatives:
"load", fires when theloadevent fires (DOM + most subresources)"domcontentloaded", fires as soon as the DOM is parsed"commit", fires as soon as the response starts arriving
For JavaScript-heavy pages, "networkidle" is the safest. For faster pages that you know finish quickly, "load" is fine.
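A pattern worth knowing, sketched below: navigate with a cheap wait, then wait explicitly for the element you care about. On chatty pages this is often more robust than `"networkidle"`, and it leads directly into the next section:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Cheap navigation wait: returns as soon as the DOM is parsed...
    page.goto("https://quotes.toscrape.com/js/", wait_until="domcontentloaded")

    # ...then block until the JavaScript has actually rendered the content
    page.wait_for_selector("div.quote")
    print(page.locator("div.quote").count(), "quotes rendered")

    browser.close()
```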
### Explicit waits
Never use `time.sleep()` in Playwright scripts. Use explicit waits that resolve the moment the condition is met:
```python
# Wait for a specific element to appear
page.wait_for_selector("div.quote", timeout=10_000)  # 10 seconds max

# Wait for a network response matching a URL pattern
with page.expect_response("**/api/quotes*") as resp_info:
    page.goto("https://example.com/quotes")
response = resp_info.value
print(response.json())

# Wait for navigation to complete after a click
with page.expect_navigation():
    page.click("a.next-page")
```
### Clicks, form fills, and key presses
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate to login page
    page.goto("https://quotes.toscrape.com/login")

    # Fill form fields
    page.fill("input#username", "testuser")
    page.fill("input#password", "testpass")

    # Click the submit button
    page.click("input[type='submit']")

    # Wait for navigation after submit
    page.wait_for_url("https://quotes.toscrape.com/", timeout=5000)
    print("Logged in, current URL:", page.url)

    browser.close()
```
For keyboard interaction:
```python
page.press("input#search", "Enter")                       # press a key
page.keyboard.press("Control+A")                          # keyboard shortcut
page.type("input#search", "machine learning", delay=50)   # type with a delay
```
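One interaction listed at the start of this lesson, infinite scroll, deserves a sketch of its own. The site and the `div.item` selector below are hypothetical; the pattern is to scroll, pause briefly, and stop once the item count stops growing:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/feed")  # hypothetical infinite-scroll page

    previous = -1
    while True:
        count = page.locator("div.item").count()  # hypothetical item selector
        if count == previous:
            break  # no new items appeared; assume we've reached the end
        previous = count
        page.mouse.wheel(0, 2000)    # scroll down to trigger loading
        page.wait_for_timeout(1000)  # fixed pause; acceptable in a demo loop

    print(f"Loaded {previous} items")
    browser.close()
```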
### Screenshots and debugging
```python
# Full-page screenshot
page.screenshot(path="debug.png", full_page=True)

# Screenshot of a specific element
page.locator("div.quote").first.screenshot(path="first_quote.png")

# Get the full HTML after JavaScript has run
html = page.content()

# Evaluate JavaScript directly
title = page.evaluate("() => document.title")
```
Always take a screenshot when your scraper cannot find an element; you will often discover that the page showed a cookie consent banner or a redirect that blocked the content.
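A small helper along these lines (the name `dump_debug` is ours, not a Playwright API) saves both artifacts in one call:

```python
from playwright.sync_api import Page

def dump_debug(page: Page, label: str = "debug") -> None:
    """Save a screenshot and the post-JavaScript HTML for inspection."""
    page.screenshot(path=f"{label}.png", full_page=True)
    with open(f"{label}.html", "w", encoding="utf-8") as f:
        f.write(page.content())
    print(f"Saved {label}.png and {label}.html")
```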
### Brief Selenium comparison
Selenium is older and works with all major browsers. Playwright is newer and generally faster, more reliable, and has a better Python API.
| Feature | Playwright | Selenium |
|---|---|---|
| Install | `pip install playwright && playwright install` | `pip install selenium` + driver binary |
| Waits | Auto-wait on every action | Manual `WebDriverWait` required |
| Speed | Faster | Slower |
| Async | Native async API | Requires threading workarounds |
| Intercept requests | Yes, built-in | Not natively |
| Community | Growing fast | Established, more Stack Overflow answers |
For new projects, start with Playwright. Use Selenium if you are maintaining existing code or need to test a site in Internet Explorer (yes, it still happens).
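The "intercept requests" row is worth a concrete example. A minimal sketch using `page.route` to block images, a common way to speed up scraping (the brace glob follows the URL pattern syntax in Playwright's docs):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Abort requests for common image formats before they leave the browser
    page.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())

    page.goto("https://quotes.toscrape.com/js/")
    print(page.title())  # the page still renders, just without images
    browser.close()
```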
## Hands-on
`quotes.toscrape.com/js/` is a version of the quotes site that loads content via JavaScript. Plain `requests` gets an empty page. Let us scrape it with Playwright.
```python
import json
from playwright.sync_api import sync_playwright

def scrape_js_quotes(max_pages: int = 3) -> list[dict]:
    results = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            )
        )
        page = context.new_page()

        for page_num in range(1, max_pages + 1):
            url = f"https://quotes.toscrape.com/js/page/{page_num}/"
            page.goto(url, wait_until="networkidle", timeout=15_000)

            # Wait until at least one quote is visible
            page.wait_for_selector("div.quote", timeout=10_000)

            quote_divs = page.locator("div.quote")
            count = quote_divs.count()
            for i in range(count):
                q = quote_divs.nth(i)
                text = q.locator("span.text").inner_text().strip()
                author = q.locator("small.author").inner_text().strip()
                tags = [t.inner_text() for t in q.locator("a.tag").all()]
                results.append({"text": text, "author": author, "tags": tags, "page": page_num})

            print(f"Page {page_num}: scraped {count} quotes")

            # Check for next page
            next_btn = page.locator("li.next a")
            if next_btn.count() == 0:
                print("No more pages")
                break

        browser.close()
    return results

if __name__ == "__main__":
    quotes = scrape_js_quotes(max_pages=3)
    print(f"\nTotal: {len(quotes)} quotes")
    # ensure_ascii=False keeps the curly quotes readable in the output
    print(json.dumps(quotes[0], indent=2, ensure_ascii=False))
```
Expected output:
```text
Page 1: scraped 10 quotes
Page 2: scraped 10 quotes
Page 3: scraped 10 quotes

Total: 30 quotes
{
  "text": "“The world as we have created it is a process of our thinking.”",
  "author": "Albert Einstein",
  "tags": [
    "change",
    "deep-thoughts",
    "thinking",
    "world"
  ],
  "page": 1
}
```
Now verify that plain `requests` cannot see this content:
```python
import requests

r = requests.get("https://quotes.toscrape.com/js/page/1/", timeout=10)

# Count how many quote divs appear in the raw HTML
count = r.text.count('class="quote"')
print(f"Quotes found in raw HTML: {count}")  # 0
```
This confirms that the JavaScript-rendered path genuinely needs a browser.
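If you want to keep the results, a short follow-up works (assuming `scrape_js_quotes` from the script above is in scope; the filename is arbitrary):

```python
import json

quotes = scrape_js_quotes(max_pages=3)
with open("quotes.json", "w", encoding="utf-8") as f:
    json.dump(quotes, f, ensure_ascii=False, indent=2)  # keep curly quotes readable
print(f"Wrote {len(quotes)} quotes to quotes.json")
```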
## Common pitfalls
- **Using `time.sleep()` instead of Playwright's waits.** A fixed sleep is always either too short (race condition) or too long (slow). `wait_for_selector()` resolves the moment the element appears.
- **Not closing the browser in error paths.** If an exception occurs before `browser.close()`, the browser process keeps running. Always use the `with sync_playwright() as p:` context manager; it closes the browser automatically.
- **Headless mode rendering differences.** Some sites detect headless Chromium and show different content. Try `headless=False` to debug, then look at what differs. The fix is usually a missing `user_agent` or viewport size.
- **`wait_until="networkidle"` timing out.** Some SPAs keep polling for analytics or websocket connections, so the network never goes fully idle. Try `"load"` or wait explicitly for the content element instead: `page.wait_for_selector("div.data-loaded")`.
- **Not setting viewport size.** The default viewport may trigger mobile layouts. Set `context = browser.new_context(viewport={"width": 1280, "height": 800})` for a desktop-sized browser.
- **Treating Playwright as a universal solution.** Browser automation is 5-20x slower than plain HTTP and harder to scale. If the data is available via a hidden JSON API (check the XHR tab in DevTools), use that instead; see the sketch after this list.
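Here is that last check as code, a hedged sketch: the endpoint below is hypothetical, and you would find the real one in the DevTools Network tab:

```python
import requests

candidate = "https://example.com/api/quotes?page=1"  # hypothetical endpoint
r = requests.get(candidate, timeout=10)

if r.ok and "application/json" in r.headers.get("Content-Type", ""):
    print("Hidden JSON API found; skip the browser entirely:")
    print(r.json())
else:
    print("No JSON API here; a browser may be necessary after all.")
```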
## What to try next
- Add `page.screenshot(path=f"page_{page_num}.png")` inside the loop and inspect the screenshots. Find one that shows unexpected content (banner, redirect, etc.).
- Use `page.expect_response()` to intercept any XHR calls made by the JS-rendered quotes page. Do the API calls reveal any interesting undocumented endpoints?
- Try launching with `headless=False` (a real visible browser window) and watch the scraper run. Then switch back to `headless=True` and add `slow_mo=500` (500 ms delay between actions) to simulate watching without a visible window.