HTTP with Python
What you'll build
By the end of this lesson you will have a reusable polite_get() function that handles timeouts, retries with exponential backoff, custom headers, and session reuse. This function will be the foundation for every synchronous scraper you write in this path. You will also know how to translate any request you see in DevTools into Python code in under a minute.
Concepts
GET requests with query parameters
Query parameters are the ?key=value pairs in a URL. You can pass them as a dictionary and requests will encode them correctly, no manual URL construction needed.
import requests
# These two are identical
r1 = requests.get("https://httpbin.org/get?page=2&per_page=50")
r2 = requests.get("https://httpbin.org/get", params={"page": 2, "per_page": 50})
print(r2.url) # https://httpbin.org/get?page=2&per_page=50
Always use the params dict. It handles special characters (spaces, ampersands, Unicode) by URL-encoding them automatically.
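For example, a query value containing a space and an ampersand survives intact. A small sketch against httpbin.org (the exact encoding of the space, + versus %20, can vary by requests version):

import requests

# The space and "&" in the value would break a hand-built URL;
# requests percent-encodes them automatically.
r = requests.get(
    "https://httpbin.org/get",
    params={"q": "fish & chips", "lang": "en"},
)
print(r.url)  # e.g. https://httpbin.org/get?q=fish+%26+chips&lang=en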
POST requests with form data and JSON
POST is used when you submit a form or send data to an API. There are two common formats:
import requests
# Form-encoded (like submitting an HTML form)
r = requests.post(
    "https://httpbin.org/post",
    data={"username": "alice", "password": "secret"},
)
print(r.json()["form"]) # {'username': 'alice', 'password': 'secret'}
# JSON body (for REST APIs)
r = requests.post(
    "https://httpbin.org/post",
    json={"query": "machine learning", "limit": 10},
)
print(r.json()["json"]) # {'query': 'machine learning', 'limit': 10}
Use data= for HTML forms (sets Content-Type: application/x-www-form-urlencoded). Use json= for APIs (sets Content-Type: application/json and serialises the dict).
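If you are ever unsure which of the two you actually sent, inspect the prepared request that requests attaches to the response. A quick check against httpbin.org:

import requests

form = requests.post("https://httpbin.org/post", data={"a": 1})
api = requests.post("https://httpbin.org/post", json={"a": 1})

# response.request is the PreparedRequest that was actually sent.
print(form.request.headers["Content-Type"])  # application/x-www-form-urlencoded
print(api.request.headers["Content-Type"])   # application/json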
Custom headers
The most important header for scrapers is User-Agent. Many sites also require Accept, Accept-Language, or Referer headers to return the right response.
import requests
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}
r = requests.get("https://httpbin.org/headers", headers=headers)
print(r.json()["headers"]["User-Agent"])
To find the exact headers a real browser sends, open DevTools, go to Network, click any request, and look at the Request Headers panel. Copy them all into your script.
Sessions
A Session object persists cookies, headers, and connection pools across multiple requests. This is important for two reasons:
- Sites that require login set a session cookie on the first request. You need to carry that cookie on subsequent requests.
- Reusing the underlying TCP connection (HTTP keep-alive) is faster than creating a new connection per request.
import requests
session = requests.Session()
# Set default headers for all requests through this session
session.headers.update({
    "User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)",
})
# First request, server may set cookies
r1 = session.get("https://quotes.toscrape.com/login")
print("Cookies after first request:", dict(session.cookies))
# The session carries those cookies automatically
r2 = session.get("https://quotes.toscrape.com")
print("Status:", r2.status_code)
Use session.get() and session.post() instead of requests.get() whenever you make more than one request to the same site.
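If you want to see the keep-alive difference yourself, here is a rough, unscientific timing sketch (the numbers depend entirely on your network and the server):

import time
import requests

url = "https://quotes.toscrape.com"

start = time.perf_counter()
for _ in range(5):
    requests.get(url, timeout=10)      # new TCP + TLS handshake every time
without_session = time.perf_counter() - start

session = requests.Session()
start = time.perf_counter()
for _ in range(5):
    session.get(url, timeout=10)       # connection reused after the first request
with_session = time.perf_counter() - start

print(f"without session: {without_session:.2f}s, with session: {with_session:.2f}s")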
Timeouts and retries with backoff
A scraper that hangs on a slow server is a broken scraper. Always set timeouts. A scraper that crashes on the first transient network error is fragile. Add retries with exponential backoff.
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def make_session(
    retries: int = 3,
    backoff_factor: float = 1.0,
    status_forcelist: tuple = (429, 500, 502, 503, 504),
) -> requests.Session:
    session = requests.Session()
    retry = Retry(
        total=retries,
        backoff_factor=backoff_factor,  # with factor 1.0: retry immediately, then sleep 2s, then 4s
        status_forcelist=status_forcelist,
        raise_on_status=False,
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
session = make_session()
r = session.get("https://httpbin.org/status/500", timeout=(5, 30))
print(r.status_code) # After 3 retries, still 500
backoff_factor=1.0 means: sleep backoff_factor * 2^(n - 1) seconds before retry n, except that urllib3 performs the first retry immediately. So 0 s, then 2 s, then 4 s. This is the minimum polite behaviour when a server returns 429 or 5xx.
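To see the schedule for a given factor without firing real requests, you can reproduce urllib3's documented formula in a few lines (backoff_schedule is a hypothetical helper, not part of the library):

def backoff_schedule(backoff_factor: float, retries: int) -> list[float]:
    # urllib3 sleeps backoff_factor * 2 ** (n - 1) seconds before retry n,
    # except the first retry, which happens immediately.
    return [0.0 if n == 1 else backoff_factor * 2 ** (n - 1) for n in range(1, retries + 1)]

print(backoff_schedule(1.0, 3))  # [0.0, 2.0, 4.0]
print(backoff_schedule(1.5, 3))  # [0.0, 3.0, 6.0]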
Translating a DevTools request to Python
This is a skill that saves you an hour every time. Open DevTools, find the request you want to replicate, right-click it, and choose "Copy as cURL". You will get something like:
curl 'https://api.example.com/search?q=python' \
-H 'accept: application/json' \
-H 'authorization: Bearer eyJhbGci...' \
-H 'user-agent: Mozilla/5.0 ...'
Translate to Python:
import requests
r = requests.get(
    "https://api.example.com/search",
    params={"q": "python"},
    headers={
        "accept": "application/json",
        "authorization": "Bearer eyJhbGci...",
        "user-agent": "Mozilla/5.0 ...",
    },
    timeout=10,
)
print(r.json())
There are also online tools like curlconverter.com that do this translation automatically.
Hands-on
Let us build a complete polite GET helper and use it to fetch pages from quotes.toscrape.com.
import time
import random
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
def make_session(user_agent: str | None = None) -> requests.Session:
    """Create a requests Session with retry logic and sensible defaults."""
    session = requests.Session()
    session.headers.update({
        "User-Agent": user_agent or (
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36"
        ),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    })
    retry = Retry(
        total=3,
        backoff_factor=1.5,
        status_forcelist=(429, 500, 502, 503, 504),
        allowed_methods=frozenset(["GET", "POST"]),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

def polite_get(
    session: requests.Session,
    url: str,
    min_delay: float = 1.0,
    max_delay: float = 3.0,
    **kwargs,
) -> requests.Response:
    """GET a URL with a random delay before each request and raise on 4xx/5xx."""
    delay = random.uniform(min_delay, max_delay)
    time.sleep(delay)
    response = session.get(url, timeout=(5, 30), **kwargs)
    response.raise_for_status()
    return response

# --- Main script ---
session = make_session()
base_url = "https://quotes.toscrape.com"

for page in range(1, 4):  # pages 1, 2, 3
    url = f"{base_url}/page/{page}/"
    try:
        r = polite_get(session, url)
        print(f"Page {page}: {r.status_code}, {len(r.text)} bytes")
    except requests.HTTPError as e:
        print(f"Page {page}: HTTP error, {e}")
    except requests.RequestException as e:
        print(f"Page {page}: Network error, {e}")
Expected output (byte counts may vary slightly):
Page 1: 200, 11373 bytes
Page 2: 200, 11269 bytes
Page 3: 200, 11283 bytes
Now let us try a POST, logging in to quotes.toscrape.com (it has a dummy login form):
session = make_session()
# First, fetch the login page to get any CSRF token
login_url = "https://quotes.toscrape.com/login"
r = session.get(login_url, timeout=10)
print("Login page status:", r.status_code)
# Post credentials (this site accepts any username/password)
r = session.post(
    login_url,
    data={"username": "testuser", "password": "testpass"},
    timeout=10,
)
print("After login, redirected to:", r.url)
print("Cookies:", dict(session.cookies))
The session now carries the login cookie automatically for all subsequent requests.
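A quick way to confirm the login stuck, assuming (as is the case on quotes.toscrape.com) that the logged-in version of the page shows a Logout link:

r = session.get("https://quotes.toscrape.com", timeout=10)
print("Logged in:", "Logout" in r.text)  # True when the session cookie was accepted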
Brief note on httpx for async
If you need to scrape many URLs concurrently, httpx with asyncio is faster than threading with requests. The API is nearly identical:
import asyncio
import httpx
async def fetch(client: httpx.AsyncClient, url: str) -> str:
    r = await client.get(url, timeout=10)
    r.raise_for_status()
    return r.text

async def main():
    urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 6)]
    async with httpx.AsyncClient() as client:
        pages = await asyncio.gather(*[fetch(client, u) for u in urls])
    for i, html in enumerate(pages, 1):
        print(f"Page {i}: {len(html)} bytes")

asyncio.run(main())
For this path we will use synchronous requests for clarity. Use httpx + asyncio when you need to scrape hundreds of URLs quickly and the site does not throttle you.
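If you do go async, keep the politeness. A sketch of the same httpx pattern with an asyncio.Semaphore added (the limit of 5 concurrent requests is an arbitrary assumption; tune it per site):

import asyncio
import httpx

async def fetch_limited(client: httpx.AsyncClient, sem: asyncio.Semaphore, url: str) -> str:
    async with sem:  # at most 5 requests in flight at once
        r = await client.get(url, timeout=10)
        r.raise_for_status()
        return r.text

async def main():
    urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 11)]
    sem = asyncio.Semaphore(5)
    async with httpx.AsyncClient() as client:
        pages = await asyncio.gather(*[fetch_limited(client, sem, u) for u in urls])
    print(f"Fetched {len(pages)} pages")

asyncio.run(main())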
Common pitfalls
- Not using a Session for multi-page scraping. Every requests.get() call outside a Session opens a new TCP connection. On a 1000-page crawl this is a measurable slowdown, and you lose cookies between pages.
- Retrying on 404. Adding 404 to status_forcelist will hammer a missing page three times and waste time. Only retry on transient server errors (5xx) and rate limits (429).
- Timeout is two numbers. timeout=30 means 30 seconds for both connecting and reading. Prefer timeout=(5, 30): 5 seconds to connect, 30 seconds to receive the full response. A 30-second connection timeout means your script can hang for 30 s on a dead host.
- Forgetting raise_for_status() after retries. The retry adapter will retry 5xx responses but still return a Response object with a non-200 status if all retries fail. You still need to call raise_for_status() or check response.status_code after the call.
- Posting JSON as form data. Using data={"key": "val"} when the API expects Content-Type: application/json sends URL-encoded form data, which the server usually rejects with a 400 or 422. Use json={"key": "val"} for JSON APIs.
- Hardcoding credentials in your script. Put secrets in environment variables or a .env file, never in source code that ends up in git (see the sketch after this list).
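A minimal sketch of the environment-variable approach; SCRAPER_USER and SCRAPER_PASS are hypothetical names you would export in your shell before running the script:

import os
import requests

# Read secrets from the environment instead of the source file.
username = os.environ["SCRAPER_USER"]   # raises KeyError if the variable is not set
password = os.environ["SCRAPER_PASS"]

session = requests.Session()
r = session.post(
    "https://quotes.toscrape.com/login",
    data={"username": username, "password": password},
    timeout=10,
)
print(r.status_code)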
What to try next
- Use httpbin.org/delay/2 (which takes 2 seconds to respond) with timeout=(5, 1). Observe the ReadTimeout exception. Then adjust to timeout=(5, 5) and observe success.
- Extend polite_get() to accept a params argument and log the full URL (including query string) before each request.
- Visit a page on quotes.toscrape.com, use DevTools to copy the request as cURL, and translate it manually to Python. Verify you get the same response.