Lesson 2 of 10 · 7 min read

HTTP with Python


What you'll build

By the end of this lesson you will have a reusable polite_get() function that handles timeouts, retries with exponential backoff, custom headers, and session reuse. This function will be the foundation for every synchronous scraper you write in this path. You will also know how to translate any request you see in DevTools into Python code in under a minute.

Concepts

GET requests with query parameters

Query parameters are the ?key=value pairs in a URL. You can pass them as a dictionary and requests will encode them correctly, no manual URL construction needed.

import requests

# These two are identical
r1 = requests.get("https://httpbin.org/get?page=2&per_page=50")
r2 = requests.get("https://httpbin.org/get", params={"page": 2, "per_page": 50})

print(r2.url)  # https://httpbin.org/get?page=2&per_page=50

Always use the params dict. It handles special characters (spaces, ampersands, Unicode) by URL-encoding them automatically.
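
If you want to see that encoding in action, here is a small check you can run yourself; the escape sequences in the comment are what requests typically produces, and httpbin is only used because it echoes the request back:

import requests

# Spaces, ampersands, and non-ASCII characters are escaped for you
r = requests.get(
    "https://httpbin.org/get",
    params={"q": "café au lait", "tags": "a&b"},
)
print(r.url)
# e.g. https://httpbin.org/get?q=caf%C3%A9+au+lait&tags=a%26b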

POST requests with form data and JSON

POST is used when you submit a form or send data to an API. There are two common formats:

import requests

# Form-encoded (like submitting an HTML form)
r = requests.post(
    "https://httpbin.org/post",
    data={"username": "alice", "password": "secret"},
)
print(r.json()["form"])  # {'username': 'alice', 'password': 'secret'}

# JSON body (for REST APIs)
r = requests.post(
    "https://httpbin.org/post",
    json={"query": "machine learning", "limit": 10},
)
print(r.json()["json"])  # {'query': 'machine learning', 'limit': 10}

Use data= for HTML forms (sets Content-Type: application/x-www-form-urlencoded). Use json= for APIs (sets Content-Type: application/json and serialises the dict).
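
If you ever forget which is which, httpbin echoes the request headers back, so you can confirm the Content-Type each call sets. A quick sanity check, not something you need in production code:

import requests

form = requests.post("https://httpbin.org/post", data={"a": "1"})
api = requests.post("https://httpbin.org/post", json={"a": "1"})

print(form.json()["headers"]["Content-Type"])  # application/x-www-form-urlencoded
print(api.json()["headers"]["Content-Type"])   # application/json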

Custom headers

The most important header for scrapers is User-Agent. Many sites also require Accept, Accept-Language, or Referer headers to return the right response.

import requests

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

r = requests.get("https://httpbin.org/headers", headers=headers)
print(r.json()["headers"]["User-Agent"])

To find the exact headers a real browser sends, open DevTools, go to Network, click any request, and look at the Request Headers panel. Copy them all into your script.
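
For contrast, it is worth seeing what requests sends when you set no headers at all. The default User-Agent openly names the library, which is exactly the kind of value many sites filter on (the version number in the comment is illustrative):

import requests

# With no custom headers, requests identifies itself
r = requests.get("https://httpbin.org/headers")
print(r.json()["headers"]["User-Agent"])  # e.g. python-requests/2.32.3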

Sessions

A Session object persists cookies, headers, and connection pools across multiple requests. This is important for two reasons:

  1. Sites that require login set a session cookie on the first request. You need to carry that cookie on subsequent requests.
  2. Reusing the underlying TCP connection (HTTP keep-alive) is faster than creating a new connection per request.

import requests

session = requests.Session()

# Set default headers for all requests through this session
session.headers.update({
    "User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)",
})

# First request, server may set cookies
r1 = session.get("https://quotes.toscrape.com/login")
print("Cookies after first request:", dict(session.cookies))

# The session carries those cookies automatically
r2 = session.get("https://quotes.toscrape.com")
print("Status:", r2.status_code)

Use session.get() and session.post() instead of requests.get() whenever you make more than one request to the same site.
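
The keep-alive benefit is easy to measure. The sketch below fetches the same page five times with and without a Session; exact timings depend on your network, so treat it as an illustration rather than a benchmark:

import time
import requests

url = "https://quotes.toscrape.com/"

start = time.perf_counter()
for _ in range(5):
    requests.get(url, timeout=10)      # new TCP + TLS handshake every time
print(f"requests.get x5: {time.perf_counter() - start:.2f}s")

session = requests.Session()
start = time.perf_counter()
for _ in range(5):
    session.get(url, timeout=10)       # connection reused after the first request
print(f"session.get  x5: {time.perf_counter() - start:.2f}s")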

Timeouts and retries with backoff

A scraper that hangs on a slow server is a broken scraper. Always set timeouts. A scraper that crashes on the first transient network error is fragile. Add retries with exponential backoff.

import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(
    retries: int = 3,
    backoff_factor: float = 1.0,
    status_forcelist: tuple = (429, 500, 502, 503, 504),
) -> requests.Session:
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,   # waits 0 s, 2 s, 4 s before retries (see below)
        status_forcelist=status_forcelist,
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

session = make_session()
r = session.get("https://httpbin.org/status/500", timeout=(5, 30))
print(r.status_code)  # After 3 retries, still 500

With backoff_factor=1.0, urllib3 sleeps for backoff_factor * 2 ** (number of previous retries) seconds before each retry, except that the first retry fires immediately. So the delays here are 0 s, 2 s, then 4 s. Backing off like this is the minimum polite behaviour when a server returns 429 or 5xx.
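
If you want the schedule in front of you before tuning backoff_factor, the helper below reproduces that documented formula; backoff_schedule is just an illustration, not part of urllib3:

def backoff_schedule(retries: int, backoff_factor: float) -> list[float]:
    # No sleep before the first retry, then backoff_factor * 2 ** n seconds
    # before retry n + 1 (mirrors urllib3's documented behaviour)
    return [0.0 if n == 0 else backoff_factor * (2 ** n) for n in range(retries)]

print(backoff_schedule(3, 1.0))  # [0.0, 2.0, 4.0]
print(backoff_schedule(3, 0.5))  # [0.0, 1.0, 2.0]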

Translating a DevTools request to Python

This is a skill that saves you an hour every time. Open DevTools, find the request you want to replicate, right-click it, and choose "Copy as cURL". You will get something like:

curl 'https://api.example.com/search?q=python' \
  -H 'accept: application/json' \
  -H 'authorization: Bearer eyJhbGci...' \
  -H 'user-agent: Mozilla/5.0 ...'

Translate to Python:

import requests

r = requests.get(
    "https://api.example.com/search",
    params={"q": "python"},
    headers={
        "accept": "application/json",
        "authorization": "Bearer eyJhbGci...",
        "user-agent": "Mozilla/5.0 ...",
    },
    timeout=10,
)
print(r.json())

There are also online tools like curlconverter.com that do this translation automatically.

Hands-on

Let us build a complete polite GET helper and use it to fetch pages from quotes.toscrape.com.

import time
import random
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(user_agent: str | None = None) -> requests.Session:
    """Create a requests Session with retry logic and sensible defaults."""
    session = requests.Session()

    session.headers.update({
        "User-Agent": user_agent or (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    })

    retry = Retry(
        total=3,
        backoff_factor=1.5,
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=frozenset(["GET", "POST"]),  # drop POST if your POSTs are not safe to repeat
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def polite_get(
    session: requests.Session,
    url: str,
    min_delay: float = 1.0,
    max_delay: float = 3.0,
    **kwargs,
) -> requests.Response:
    """
    GET a URL with a random delay before each request and raise on 4xx/5xx.
    """
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)

    response = session.get(url, timeout=(5, 30), **kwargs)
    response.raise_for_status()
    return response


# --- Main script ---
session = make_session()
base_url = "https://quotes.toscrape.com"

for page in range(1, 4):  # pages 1, 2, 3
    url = f"{base_url}/page/{page}/"
    try:
        r = polite_get(session, url)
        print(f"Page {page}: {r.status_code}, {len(r.text)} bytes")
    except requests.HTTPError as e:
        print(f"Page {page}: HTTP error, {e}")
    except requests.RequestException as e:
        print(f"Page {page}: Network error, {e}")

Expected output:

Page 1: 200, 11373 bytes
Page 2: 200, 11269 bytes
Page 3: 200, 11283 bytes

Now let us try a POST, logging in to quotes.toscrape.com (it has a dummy login form):

session = make_session()

import re

# First, fetch the login page; its form includes a hidden csrf_token field
# that must be posted back along with the credentials
login_url = "https://quotes.toscrape.com/login"
r = session.get(login_url, timeout=10)
print("Login page status:", r.status_code)

# Crude token extraction; an HTML parser would be more robust
csrf_token = re.search(r'name="csrf_token" value="([^"]+)"', r.text).group(1)

# Post credentials (this site accepts any username/password)
r = session.post(
    login_url,
    data={"csrf_token": csrf_token, "username": "testuser", "password": "testpass"},
    timeout=10,
)
print("After login, redirected to:", r.url)
print("Cookies:", dict(session.cookies))

The session now carries the login cookie automatically for all subsequent requests.
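
A quick way to confirm it worked: once logged in, quotes.toscrape.com swaps its Login link for a Logout link, so a simple substring check on the next page is enough. This leans on the site's current markup, so treat it as a convenience check rather than a guarantee:

r = session.get("https://quotes.toscrape.com/", timeout=10)
print("Logged in:", "Logout" in r.text)  # True if the session cookie was accepted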

Brief note on httpx for async

If you need to scrape many URLs concurrently, httpx with asyncio is faster than threading with requests. The API is nearly identical:

import asyncio
import httpx

async def fetch(client: httpx.AsyncClient, url: str) -> str:
    r = await client.get(url, timeout=10)
    r.raise_for_status()
    return r.text

async def main():
    urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 6)]
    async with httpx.AsyncClient() as client:
        pages = await asyncio.gather(*[fetch(client, u) for u in urls])
    for i, html in enumerate(pages, 1):
        print(f"Page {i}: {len(html)} bytes")

asyncio.run(main())

For this path we will use synchronous requests for clarity. Use httpx + asyncio when you need to scrape hundreds of URLs quickly and the site does not throttle you.
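
If you do go async, it is worth capping concurrency so the speed-up does not turn into hammering the server. One common pattern is an asyncio.Semaphore; fetch_limited below is a hypothetical variant of the fetch() above with that cap added:

import asyncio
import httpx

async def fetch_limited(client: httpx.AsyncClient, sem: asyncio.Semaphore, url: str) -> str:
    async with sem:                       # at most N requests in flight at once
        r = await client.get(url, timeout=10)
        r.raise_for_status()
        return r.text

async def main():
    urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 6)]
    sem = asyncio.Semaphore(2)            # never more than 2 concurrent requests
    async with httpx.AsyncClient() as client:
        pages = await asyncio.gather(*[fetch_limited(client, sem, u) for u in urls])
    for i, html in enumerate(pages, 1):
        print(f"Page {i}: {len(html)} bytes")

asyncio.run(main())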

Common pitfalls

  • Not using a Session for multi-page scraping. Every requests.get() call outside a Session opens a new TCP connection. On a 1000-page crawl this is a measurable slowdown, and you lose cookies between pages.

  • Retry on 404. Adding 404 to status_forcelist will hammer a missing page three times and waste time. Only retry on transient server errors (5xx) and rate limits (429).

  • Treating timeout as a single number. timeout=30 applies 30 seconds to both connecting and reading. Prefer timeout=(5, 30): 5 seconds to connect, 30 seconds to receive the full response. A 30-second connect timeout means your script can hang for 30 s on a dead host before giving up.

  • Forgetting raise_for_status() after retries. The retry adapter will retry 5xx responses but still return a Response object with a non-200 status if all retries fail. You still need to call raise_for_status() or check response.status_code after the call.

  • Posting JSON as form data. Using data={"key": "val"} when the API expects Content-Type: application/json sends URL-encoded form data, which the server usually rejects with a 400 or 422. Use json={"key": "val"} for JSON APIs.

  • Hardcoding credentials in your script. Put secrets in environment variables or a .env file, never in source code that ends up in git.
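
A minimal sketch of the environment-variable approach; SCRAPER_USERNAME and SCRAPER_PASSWORD are made-up variable names, pick whatever fits your project:

import os

# Read credentials from the environment; a missing variable fails loudly
# instead of a secret ending up in git history
credentials = {
    "username": os.environ["SCRAPER_USERNAME"],
    "password": os.environ["SCRAPER_PASSWORD"],
}
# ...then pass `credentials` (plus any CSRF token) as the data= payload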

What to try next

  1. Use httpbin.org/delay/2 (which takes 2 seconds to respond) with timeout=(5, 1). Observe the ReadTimeout exception. Then adjust to timeout=(5, 5) and observe success.

  2. Extend polite_get() to accept a params argument and log the full URL (including query string) before each request.

  3. Visit a page on quotes.toscrape.com, use DevTools to copy the request as cURL, and translate it manually to Python. Verify you get the same response.
