APIs and AJAX
What you'll build
By the end of this lesson you will be able to open DevTools, identify the hidden JSON API a website's JavaScript is calling, replicate that API call in Python, handle bearer token and cookie authentication, and paginate through all results. Hitting the JSON API directly, instead of parsing HTML, is almost always faster, cleaner, and less brittle.
Concepts
Why API > HTML parsing, almost always
When a site loads data via AJAX, there is a clean JSON API hiding behind the HTML. Hitting that API directly gives you:
- Structured data: no HTML parsing, no brittle selectors
- Speed: API responses are smaller and faster than full HTML pages
- Stability: API schemas change less often than visual layouts
- Pagination: usually a clean ?page=N&per_page=50 pattern
The only reason to prefer HTML parsing over the API is if you cannot figure out the API, or if it requires auth you cannot replicate.
Finding hidden endpoints in DevTools
- Open Chrome DevTools (F12), go to the Network tab.
- Clear any existing entries, then reload the page or trigger the action that loads data.
- Filter by Fetch/XHR; these are the AJAX calls.
- Click each request. Look at the Preview tab. JSON data = you found your API.
- Right-click → Copy → Copy as cURL to get the complete request with all headers.
This workflow takes 30 seconds and works on almost any site.
Translating cURL to Python
A typical copied cURL looks like:
curl 'https://api.github.com/search/repositories?q=scrapy&sort=stars&per_page=10' \
  -H 'accept: application/vnd.github+json' \
  -H 'authorization: Bearer ghp_yourtoken' \
  -H 'x-github-api-version: 2022-11-28'
Python equivalent:
import requests

r = requests.get(
    "https://api.github.com/search/repositories",
    params={
        "q": "scrapy",
        "sort": "stars",
        "per_page": 10,
    },
    headers={
        "accept": "application/vnd.github+json",
        "authorization": "Bearer ghp_yourtoken",
        "x-github-api-version": "2022-11-28",
    },
    timeout=10,
)
r.raise_for_status()
data = r.json()

print(f"Total results: {data['total_count']}")
for repo in data["items"]:
    print(repo["full_name"], ",", repo["stargazers_count"], "stars")
Authentication patterns
Bearer token (most common in modern APIs)
headers = {"Authorization": "Bearer YOUR_TOKEN_HERE"}
r = requests.get(url, headers=headers, timeout=10)
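In practice, keep the token out of your source code. A minimal sketch that reads it from an environment variable (GITHUB_TOKEN is just an example name) and calls GitHub's /user endpoint:

import os
import requests

token = os.environ["GITHUB_TOKEN"]  # export this before running; the variable name is an example
r = requests.get(
    "https://api.github.com/user",  # returns the authenticated user's profile
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
)
r.raise_for_status()
print("Authenticated as:", r.json()["login"])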
Cookie auth (log in first, carry the cookie)
import requests

session = requests.Session()

# Log in to get a session cookie
session.post(
    "https://example.com/api/login",
    json={"email": "user@example.com", "password": "secret"},
    timeout=10,
)

# Now make authenticated requests; the cookie is carried automatically
r = session.get("https://example.com/api/profile", timeout=10)
print(r.json())
CSRF token (required for POST on many sites)
Many sites require a CSRF token in POST requests. You usually find it in:
- The HTML page source: <input name="csrf_token" value="abc123">
- A cookie named csrftoken or _csrf
- A previous API response header such as X-CSRF-Token
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Fetch the login page to get the CSRF token
r = session.get("https://quotes.toscrape.com/login", timeout=10)
soup = BeautifulSoup(r.text, "lxml")
csrf_token = soup.find("input", {"name": "csrf_token"})["value"]

# Submit login with CSRF token
session.post(
    "https://quotes.toscrape.com/login",
    data={
        "csrf_token": csrf_token,
        "username": "testuser",
        "password": "testpass",
    },
    timeout=10,
)
print("Cookies:", dict(session.cookies))
Handling pagination
APIs paginate results in a few common ways:
Page-based:
params = {"page": 1, "per_page": 100}
# increment params["page"] until results are empty
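A minimal sketch of that loop, using a placeholder endpoint and assuming the results come back under an "items" key:

import requests

url = "https://api.example.com/items"  # placeholder endpoint
params = {"page": 1, "per_page": 100}
all_items = []

while True:
    r = requests.get(url, params=params, timeout=10)
    r.raise_for_status()
    items = r.json().get("items", [])  # assumed key; check the real response shape
    if not items:
        break  # an empty page means we are past the last one
    all_items.extend(items)
    params["page"] += 1

print(f"Fetched {len(all_items)} items")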
Cursor/token-based:
# Response includes: {"data": [...], "next_cursor": "eyJpZCI6MTAwfQ=="}
# Pass cursor in next request: params={"cursor": next_cursor}
# Stop when next_cursor is None or absent
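And a sketch of the cursor-based version, again with a placeholder endpoint and the field names from the comment above:

import requests

url = "https://api.example.com/items"  # placeholder endpoint
params = {"limit": 100}
records = []

while True:
    r = requests.get(url, params=params, timeout=10)
    r.raise_for_status()
    payload = r.json()
    records.extend(payload.get("data", []))
    next_cursor = payload.get("next_cursor")
    if not next_cursor:
        break  # no cursor means this was the last page
    params["cursor"] = next_cursor  # send the cursor back on the next request

print(f"Fetched {len(records)} records")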
Link header (RFC 5988):
# Response header: Link: <https://api.example.com/items?page=2>; rel="next"
import requests

r = requests.get("https://api.github.com/repos/scrapy/scrapy/issues", timeout=10)
link = r.links.get("next")
if link:
    print("Next page URL:", link["url"])
requests parses Link headers automatically into response.links.
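To walk every page, keep following the next link until it disappears. A minimal sketch with a placeholder endpoint (the hands-on example below does the same thing against the real GitHub API):

import requests

session = requests.Session()
url = "https://api.example.com/items"  # placeholder; any endpoint that sends a Link header
pages = []

while url:
    r = session.get(url, timeout=10)
    r.raise_for_status()
    pages.append(r.json())
    # r.links is empty when there is no Link header, so .get() ends the loop cleanly
    url = r.links.get("next", {}).get("url")

print(f"Fetched {len(pages)} pages")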
Hands-on
Let us use the GitHub Search API, a real, public, documented JSON API, to find the most-starred Python scraping libraries. No scraping needed, just clean API calls.
import time
import requests
from typing import Generator

BASE_URL = "https://api.github.com"


def github_search_repos(
    query: str,
    sort: str = "stars",
    per_page: int = 30,
    max_pages: int = 3,
) -> Generator[dict, None, None]:
    """
    Generator that yields repository dicts from the GitHub Search API.
    Handles pagination automatically via Link headers.
    """
    session = requests.Session()
    session.headers.update({
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
        # Optionally: "Authorization": "Bearer YOUR_GITHUB_TOKEN"
        # Without auth, rate limit is 10 requests/min for search
    })

    url = f"{BASE_URL}/search/repositories"
    params = {"q": query, "sort": sort, "per_page": per_page, "page": 1}

    for page_num in range(1, max_pages + 1):
        params["page"] = page_num
        r = session.get(url, params=params, timeout=15)

        # GitHub returns 403 with a rate limit message, check it
        if r.status_code == 403:
            print("Rate limited:", r.json().get("message"))
            break
        r.raise_for_status()

        data = r.json()
        items = data.get("items", [])
        if not items:
            print(f"No results on page {page_num}, stopping.")
            break

        print(f"Page {page_num}: {len(items)} repos (total found: {data['total_count']})")
        yield from items

        # Check for next page via Link header
        if "next" not in r.links:
            print("Last page reached.")
            break

        time.sleep(1)  # be polite to the API


def main():
    repos = list(github_search_repos(
        query="topic:scraping language:python",
        per_page=10,
        max_pages=2,
    ))

    print(f"\nTop {len(repos)} Python scraping repos:")
    for repo in repos:
        print(
            f" {repo['full_name']:<40} "
            f"stars: {repo['stargazers_count']:>6} "
            f"forks: {repo['forks_count']:>5}"
        )


if __name__ == "__main__":
    main()
Expected output (actual repos will vary):
Page 1: 10 repos (total found: 847)
Page 2: 10 repos (total found: 847)

Top 20 Python scraping repos:
 scrapy/scrapy                            stars:  52000 forks: 10000
 scrapy/splash                            stars:   4000 forks:   600
 ...
Now let us also demonstrate replicating an intercepted AJAX call. Use httpbin.org/anything as a stand-in for "a hidden API endpoint you discovered in DevTools":
import requests

# Replicate a POST that sends JSON (as if you found this in DevTools → XHR)
r = requests.post(
    "https://httpbin.org/anything",
    json={"search": "python scraping", "filters": {"language": "en"}},
    headers={
        "x-requested-with": "XMLHttpRequest",  # common AJAX header
        "origin": "https://httpbin.org",
        "referer": "https://httpbin.org/",
    },
    timeout=10,
)

response_data = r.json()
print("Method seen by server:", response_data["method"])
print("JSON received:", response_data["json"])
print("Headers received:", {
    k: v for k, v in response_data["headers"].items()
    if k.lower() in ("x-requested-with", "content-type")
})
Common pitfalls
- Not handling rate limits. GitHub's unauthenticated search API allows only 10 requests per minute. Check the X-RateLimit-Remaining and X-RateLimit-Reset headers, and add a time.sleep() or honor Retry-After on 429 responses (see the sketch after this list).
- Hardcoding session tokens. Tokens expire. Build your scraper to re-authenticate when it gets a 401, and store credentials in environment variables, not source code.
- Assuming the API schema is stable. Internal (undocumented) APIs change without warning. Add robust handling for missing keys: data.get("items", []) instead of data["items"].
- Ignoring pagination limits. Many APIs cap per_page at 100 and cap the total number of results you can page through (GitHub Search returns at most 1,000 results per query). Design your pagination loop to handle these caps gracefully.
- Sending the wrong Content-Type. When you use data= instead of json=, requests sends application/x-www-form-urlencoded. If the API expects JSON, it will return 400 or 422.
- Missing the x-requested-with: XMLHttpRequest header. Some servers use this header to distinguish AJAX from regular page requests and return different responses. If your replicated API call returns HTML instead of JSON, check whether the real browser sends this header.
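To make the first pitfall concrete, here is a minimal retry sketch that respects Retry-After and falls back to X-RateLimit-Reset; the three-attempt limit and the exponential fallback are arbitrary choices, not something the APIs require:

import time
import requests


def get_with_backoff(url, session=None, max_attempts=3, **kwargs):
    """Retry a GET when the server signals rate limiting (429, or GitHub-style 403)."""
    session = session or requests.Session()
    r = None
    for attempt in range(max_attempts):
        r = session.get(url, timeout=15, **kwargs)
        if r.status_code not in (403, 429):
            return r
        # Retry-After is assumed to be a number of seconds here
        retry_after = r.headers.get("Retry-After")
        reset = r.headers.get("X-RateLimit-Reset")  # Unix timestamp on GitHub
        if retry_after:
            wait = int(retry_after)
        elif reset:
            wait = max(int(reset) - int(time.time()), 1)
        else:
            wait = 2 ** attempt  # simple exponential fallback
        print(f"Rate limited, sleeping {wait}s (attempt {attempt + 1}/{max_attempts})")
        time.sleep(wait)
    return r


r = get_with_backoff(
    "https://api.github.com/search/repositories",
    params={"q": "scrapy", "per_page": 5},
)
print(r.status_code)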
What to try next
- Use the GitHub API to fetch all open issues for the Scrapy repository (/repos/scrapy/scrapy/issues) and paginate through all pages using the Link header. Count the total number of open issues.
- Find a site you use regularly, open DevTools → Network → XHR/Fetch, and identify one AJAX call. Replicate it in Python. If it requires auth, try it with your own login credentials.
- Extend github_search_repos() to write results to a CSV file with csv.DictWriter. Include: full_name, stars, forks, language, description.