Web Scraping Path
From requests to production scrapers
A practical web-scraping path - HTTP, BeautifulSoup, Scrapy, Playwright, anti-detection, storage, and a scheduled scraper deployed to the cloud.
How the Web Serves Data
HTTP basics, HTML vs APIs vs JS-rendered pages, and how to spot which one you're scraping.
HTTP with Python
GET, POST, headers, cookies, sessions, and rate-limiting using the requests library (and a peek at HTTPX).
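As a rough illustration of the rate-limiting idea in this module, the sketch below enforces a minimum delay between requests using only the standard library. The `Throttle` class and its `min_interval` parameter are hypothetical names chosen for this example, not part of requests or HTTPX.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait().

    Hypothetical helper for illustration; not part of any HTTP library.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a scraper, one would typically call `throttle.wait()` immediately before each `session.get(...)` on a `requests.Session`, so that no two requests to the same site are sent closer together than the chosen interval.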
HTML Parsing with BeautifulSoup
Selectors, navigating the parse tree, robust extraction patterns, and dealing with messy markup.
Scrapy Fundamentals
Why Scrapy, spiders, items, item loaders, pipelines, and built-in concurrency.
Browser Automation
When you need a real browser - Playwright (and Selenium), waits, interactions, screenshots.
APIs & AJAX
Reverse-engineering AJAX endpoints, handling tokens / auth, and why hitting the API beats scraping HTML.
Anti-Detection
User agents, headers, proxies, rotation, fingerprinting, CAPTCHAs - and where the legal red lines are.
Storing Scraped Data
CSV, JSON, SQLite / Postgres - schemas, deduplication, idempotency, and scaling beyond a flat file.
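To make the idempotency point concrete, here is a minimal sketch of an upsert into SQLite keyed on the item URL, so that re-running a scraper updates existing rows instead of duplicating them. The `items` table, its columns, and the `open_store` / `save_item` helpers are assumptions made up for this example; the `ON CONFLICT ... DO UPDATE` syntax assumes SQLite 3.24 or newer.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the items table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS items (
            url   TEXT PRIMARY KEY,  -- natural key used for deduplication
            title TEXT,
            price REAL
        )
        """
    )
    return conn


def save_item(conn, url, title, price):
    """Insert the item, or update it in place if the URL was already stored."""
    conn.execute(
        """
        INSERT INTO items (url, title, price) VALUES (?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            title = excluded.title,
            price = excluded.price
        """,
        (url, title, price),
    )
    conn.commit()
```

Saving the same URL twice leaves a single row holding the most recent values, which is what lets a scheduled scraper be re-run safely.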
Scheduling & Deployment
cron, GitHub Actions, Docker, and deployment to a small VPS for a scraper that runs on its own.
Legal & Ethical Scraping
robots.txt, ToS, copyright, GDPR, public vs private data - what's legitimate and what's not.
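As one small, hedged example of the robots.txt check mentioned here, the sketch below uses the standard library's `urllib.robotparser` on an inline robots.txt body. The rules, the `my-scraper` user agent, and the example.com URLs are invented for illustration; a real scraper would load the target site's actual file, and passing this check says nothing about ToS, copyright, or GDPR obligations.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt content, used here so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Paths outside /private/ are allowed for any user agent under these rules.
allowed = parser.can_fetch("my-scraper", "https://example.com/products/1")

# Paths under /private/ are disallowed.
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")

# Crawl-delay, when present, suggests how many seconds to wait between requests.
delay = parser.crawl_delay("my-scraper")
```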