Web Scraping Path

From requests to production scrapers

A practical web-scraping path - HTTP, BeautifulSoup, Scrapy, Playwright, anti-detection, storage, and a scheduled scraper deployed to the cloud.

HTTP basics, HTML vs APIs vs JS-rendered pages, and how to spot which one you're scraping.

GET, POST, headers, cookies, sessions, and rate-limiting using the requests library (and a peek at HTTPX).
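
As an illustration of the rate-limiting idea, here is a minimal sketch of a client-side throttle that enforces a minimum gap between requests. It uses only the standard library; the `Throttle` class, its `min_interval` parameter, and the injectable `clock` / `sleep` arguments are hypothetical names chosen for this example, not part of requests or HTTPX.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed unit)
        self._clock = clock               # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Sleep just long enough to respect min_interval; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        # Record when the upcoming request is considered to start.
        self._last = now + delay
        return delay
```

In a scraper you would typically call `throttle.wait()` immediately before each HTTP request, for example before every `session.get(...)` on a shared `requests.Session`.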

Selectors, navigating the parse tree, robust extraction patterns, and dealing with messy markup.

Why use Scrapy: spiders, items, item loaders, pipelines, and built-in concurrency.

When you need a real browser - Playwright (and Selenium), waits, interactions, screenshots.

Reverse-engineering AJAX endpoints, handling tokens and auth, and why hitting the API beats scraping HTML.

User agents, headers, proxies, rotation, fingerprinting, CAPTCHAs - and where the legal red lines are.

CSV, JSON, SQLite / Postgres - schemas, dedup, idempotency, and scaling beyond a flat file.
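
To make dedup and idempotency concrete, here is a minimal sketch using the standard-library `sqlite3` module. The table name `items`, its columns, and the helper names `open_store` and `save_item` are assumptions made for this example. It treats the item URL as the natural key and relies on SQLite's `ON CONFLICT ... DO UPDATE` upsert, which needs SQLite 3.24 or newer.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the items table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS items (
            url   TEXT PRIMARY KEY,  -- natural key used for deduplication
            title TEXT NOT NULL,
            price REAL
        )
        """
    )
    return conn


def save_item(conn, item):
    """Insert the item, or update the existing row if the URL was seen before."""
    conn.execute(
        """
        INSERT INTO items (url, title, price)
        VALUES (:url, :title, :price)
        ON CONFLICT(url) DO UPDATE SET
            title = excluded.title,
            price = excluded.price
        """,
        item,
    )
    conn.commit()
```

Because the write is keyed on `url`, re-running the scraper over the same pages updates rows in place instead of piling up duplicates, which is what makes a scheduled run safe to repeat.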

cron, GitHub Actions, Docker, and deployment to a small VPS, for a scraper that runs on its own.

robots.txt, ToS, copyright, GDPR, public vs private data - what's legitimate and what's not.
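
As one concrete piece of this, a scraper can check robots.txt rules before fetching a URL. The sketch below uses the standard-library `urllib.robotparser`; the sample rules, the `my-scraper` user agent, and the `is_allowed` helper are made up for illustration. It parses rules from a string so it runs without network access; against a live site you would normally use `set_url(...)` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_text, user_agent, url):
    """Return True if robots_text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/public/page")
private_ok = is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/private/data")
```

Passing this check only means the site's robots.txt does not disallow the path; terms of service, copyright, and data-protection rules still need to be considered separately.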