APIs and AJAX
What you'll build
By the end of this lesson you will be able to open DevTools, identify the hidden JSON API a website's JavaScript is calling, replicate that API call in Python, handle bearer token and cookie authentication, and paginate through all results. Hitting the JSON API directly, instead of parsing HTML, is almost always faster, cleaner, and less brittle.
Concepts
Why API > HTML parsing, almost always
When a site loads data via AJAX, there is a clean JSON API hiding behind the HTML. Hitting that API directly gives you:
- Structured data: no HTML parsing, no brittle selectors
- Speed: API responses are smaller and faster than full HTML pages
- Stability: API schemas change less often than visual layouts
- Pagination: usually a clean ?page=N&per_page=50 pattern
The only reason to prefer HTML parsing over the API is if you cannot figure out the API, or if it requires auth you cannot replicate.
Finding hidden endpoints in DevTools
- Open Chrome DevTools (F12), go to the Network tab.
- Clear any existing entries, then reload the page or trigger the action that loads data.
- Filter by Fetch/XHR; these are the AJAX calls.
- Click each request. Look at the Preview tab. JSON data = you found your API.
- Right-click → Copy → Copy as cURL to get the complete request with all headers.
This workflow takes 30 seconds and works on almost any site.
Translating cURL to Python
A typical copied cURL looks like:
curl 'https://api.github.com/search/repositories?q=scrapy&sort=stars&per_page=10' \
  -H 'accept: application/vnd.github+json' \
  -H 'authorization: Bearer ghp_yourtoken' \
  -H 'x-github-api-version: 2022-11-28'
Python equivalent:
import requests

r = requests.get(
    "https://api.github.com/search/repositories",
    params={
        "q": "scrapy",
        "sort": "stars",
        "per_page": 10,
    },
    headers={
        "accept": "application/vnd.github+json",
        "authorization": "Bearer ghp_yourtoken",
        "x-github-api-version": "2022-11-28",
    },
    timeout=10,
)
r.raise_for_status()
data = r.json()

print(f"Total results: {data['total_count']}")
for repo in data["items"]:
    print(repo["full_name"], ",", repo["stargazers_count"], "stars")
Authentication patterns
Bearer token (most common in modern APIs)
headers = {"Authorization": "Bearer YOUR_TOKEN_HERE"}
r = requests.get(url, headers=headers, timeout=10)
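In practice, keep the token out of your source code. A minimal sketch that reads it from an environment variable (GITHUB_TOKEN is just an example name) and calls GitHub's /user endpoint:

import os
import requests

token = os.environ["GITHUB_TOKEN"]  # export this before running; the variable name is an example
r = requests.get(
    "https://api.github.com/user",  # returns the authenticated user's profile
    headers={"Authorization": f"Bearer {token}"},
    timeout=10,
)
r.raise_for_status()
print("Authenticated as:", r.json()["login"])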
Cookie auth (log in first, carry the cookie)
import requests

session = requests.Session()

# Log in to get a session cookie
session.post(
    "https://example.com/api/login",
    json={"email": "user@example.com", "password": "secret"},
    timeout=10,
)

# Now make authenticated requests; the cookie is carried automatically
r = session.get("https://example.com/api/profile", timeout=10)
print(r.json())
CSRF token (required for POST on many sites)
Many sites require a CSRF token in POST requests. You usually find it in:
- The HTML page source: <input name="csrf_token" value="abc123">
- A cookie named csrftoken or _csrf
- A previous API response header such as X-CSRF-Token
import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Fetch the login page to get the CSRF token
r = session.get("https://quotes.toscrape.com/login", timeout=10)
soup = BeautifulSoup(r.text, "lxml")
csrf_token = soup.find("input", {"name": "csrf_token"})["value"]

# Submit login with CSRF token
session.post(
    "https://quotes.toscrape.com/login",
    data={
        "csrf_token": csrf_token,
        "username": "testuser",
        "password": "testpass",
    },
    timeout=10,
)
print("Cookies:", dict(session.cookies))
Handling pagination
APIs paginate results in a few common ways:
Page-based:
params = {"page": 1, "per_page": 100}
# increment params["page"] until results are empty
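A minimal sketch of that loop, using a placeholder endpoint and assuming the results come back under an "items" key:

import requests

url = "https://api.example.com/items"  # placeholder endpoint
params = {"page": 1, "per_page": 100}
all_items = []

while True:
    r = requests.get(url, params=params, timeout=10)
    r.raise_for_status()
    items = r.json().get("items", [])  # assumed key; check the real response shape
    if not items:
        break  # an empty page means we are past the last one
    all_items.extend(items)
    params["page"] += 1

print(f"Fetched {len(all_items)} items")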
Cursor/token-based:
# Response includes: {"data": [...], "next_cursor": "eyJpZCI6MTAwfQ=="}
# Pass cursor in next request: params={"cursor": next_cursor}
# Stop when next_cursor is None or absent
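And a sketch of the cursor-based version, again with a placeholder endpoint and the field names from the comment above:

import requests

url = "https://api.example.com/items"  # placeholder endpoint
params = {"limit": 100}
records = []

while True:
    r = requests.get(url, params=params, timeout=10)
    r.raise_for_status()
    payload = r.json()
    records.extend(payload.get("data", []))
    next_cursor = payload.get("next_cursor")
    if not next_cursor:
        break  # no cursor means this was the last page
    params["cursor"] = next_cursor  # send the cursor back on the next request

print(f"Fetched {len(records)} records")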
Link header (RFC 5988):
# Response header: Link: <https://api.example.com/items?page=2>; rel="next"
import requests

r = requests.get("https://api.github.com/repos/scrapy/scrapy/issues", timeout=10)
link = r.links.get("next")
if link:
    print("Next page URL:", link["url"])
requests parses Link headers automatically into response.links.
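To walk every page, keep following the next link until it disappears. A minimal sketch with a placeholder endpoint (the hands-on example below does the same thing against the real GitHub API):

import requests

session = requests.Session()
url = "https://api.example.com/items"  # placeholder; any endpoint that sends a Link header
pages = []

while url:
    r = session.get(url, timeout=10)
    r.raise_for_status()
    pages.append(r.json())
    # r.links is empty when there is no Link header, so .get() ends the loop cleanly
    url = r.links.get("next", {}).get("url")

print(f"Fetched {len(pages)} pages")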
Hands-on
Let us use the GitHub Search API, a real, public, documented JSON API, to find the most-starred Python scraping libraries. No scraping needed, just clean API calls.
import time
import requests
from typing import Generator

BASE_URL = "https://api.github.com"


def github_search_repos(
    query: str,
    sort: str = "stars",
    per_page: int = 30,
    max_pages: int = 3,
) -> Generator[dict, None, None]:
    """
    Generator that yields repository dicts from the GitHub Search API.
    Handles pagination automatically via Link headers.
    """
    session = requests.Session()
    session.headers.update({
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
        # Optionally: "Authorization": "Bearer YOUR_GITHUB_TOKEN"
        # Without auth, rate limit is 10 requests/min for search
    })

    url = f"{BASE_URL}/search/repositories"
    params = {"q": query, "sort": sort, "per_page": per_page, "page": 1}

    for page_num in range(1, max_pages + 1):
        params["page"] = page_num
        r = session.get(url, params=params, timeout=15)

        # GitHub returns 403 with a rate limit message, check it
        if r.status_code == 403:
            print("Rate limited:", r.json().get("message"))
            break
        r.raise_for_status()

        data = r.json()
        items = data.get("items", [])
        if not items:
            print(f"No results on page {page_num}, stopping.")
            break

        print(f"Page {page_num}: {len(items)} repos (total found: {data['total_count']})")
        yield from items

        # Check for next page via Link header
        if "next" not in r.links:
            print("Last page reached.")
            break

        time.sleep(1)  # be polite to the API


def main():
    repos = list(github_search_repos(
        query="topic:scraping language:python",
        per_page=10,
        max_pages=2,
    ))

    print(f"\nTop {len(repos)} Python scraping repos:")
    for repo in repos:
        print(
            f" {repo['full_name']:<40} "
            f"stars: {repo['stargazers_count']:>6} "
            f"forks: {repo['forks_count']:>5}"
        )


if __name__ == "__main__":
    main()
Expected output (actual repos will vary):
Page 1: 10 repos (total found: 847)
Page 2: 10 repos (total found: 847)

Top 20 Python scraping repos:
 scrapy/scrapy                            stars:  52000 forks: 10000
 scrapy/splash                            stars:   4000 forks:   600
 ...
Now let us also demonstrate replicating an intercepted AJAX call. Use httpbin.org/anything as a stand-in for "a hidden API endpoint you discovered in DevTools":
import requests

# Replicate a POST that sends JSON (as if you found this in DevTools → XHR)
r = requests.post(
    "https://httpbin.org/anything",
    json={"search": "python scraping", "filters": {"language": "en"}},
    headers={
        "x-requested-with": "XMLHttpRequest",  # common AJAX header
        "origin": "https://httpbin.org",
        "referer": "https://httpbin.org/",
    },
    timeout=10,
)

response_data = r.json()
print("Method seen by server:", response_data["method"])
print("JSON received:", response_data["json"])
print("Headers received:", {
    k: v for k, v in response_data["headers"].items()
    if k.lower() in ("x-requested-with", "content-type")
})
Common pitfalls
- Not handling rate limits. GitHub's unauthenticated search API allows only 10 requests per minute. Check the X-RateLimit-Remaining and X-RateLimit-Reset headers, and add a time.sleep() or honor Retry-After on 429 responses (see the sketch after this list).
- Hardcoding session tokens. Tokens expire. Build your scraper to re-authenticate when it gets a 401, and store credentials in environment variables, not source code.
- Assuming the API schema is stable. Internal (undocumented) APIs change without warning. Add robust handling for missing keys: data.get("items", []) instead of data["items"].
- Ignoring pagination limits. Many APIs cap per_page at 100 and cap the total number of results you can page through (GitHub Search returns at most 1,000 results per query). Design your pagination loop to handle these caps gracefully.
- Sending the wrong Content-Type. When you use data= instead of json=, requests sends application/x-www-form-urlencoded. If the API expects JSON, it will return 400 or 422.
- Missing the x-requested-with: XMLHttpRequest header. Some servers use this header to distinguish AJAX from regular page requests and return different responses. If your replicated API call returns HTML instead of JSON, check whether the real browser sends this header.
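To make the first pitfall concrete, here is a minimal retry sketch that respects Retry-After and falls back to X-RateLimit-Reset; the three-attempt limit and the exponential fallback are arbitrary choices, not something the APIs require:

import time
import requests


def get_with_backoff(url, session=None, max_attempts=3, **kwargs):
    """Retry a GET when the server signals rate limiting (429, or GitHub-style 403)."""
    session = session or requests.Session()
    r = None
    for attempt in range(max_attempts):
        r = session.get(url, timeout=15, **kwargs)
        if r.status_code not in (403, 429):
            return r
        # Retry-After is assumed to be a number of seconds here
        retry_after = r.headers.get("Retry-After")
        reset = r.headers.get("X-RateLimit-Reset")  # Unix timestamp on GitHub
        if retry_after:
            wait = int(retry_after)
        elif reset:
            wait = max(int(reset) - int(time.time()), 1)
        else:
            wait = 2 ** attempt  # simple exponential fallback
        print(f"Rate limited, sleeping {wait}s (attempt {attempt + 1}/{max_attempts})")
        time.sleep(wait)
    return r


r = get_with_backoff(
    "https://api.github.com/search/repositories",
    params={"q": "scrapy", "per_page": 5},
)
print(r.status_code)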
What to try next
- Use the GitHub API to fetch all open issues for the Scrapy repository (/repos/scrapy/scrapy/issues) and paginate through all pages using the Link header. Count the total number of open issues.
- Find a site you use regularly, open DevTools → Network → XHR/Fetch, and identify one AJAX call. Replicate it in Python. If it requires auth, try it with your own login credentials.
- Extend github_search_repos() to write results to a CSV file with csv.DictWriter. Include: full_name, stars, forks, language, description.