Lesson 1 of 10 · 7 min read

How the Web Serves Data


What you'll build

By the end of this lesson you will be able to open any website in Chrome DevTools, identify whether it serves plain HTML, a JSON API, or a JavaScript-rendered SPA, and decide which scraping approach fits. You will also make your first automated HTTP request in Python and verify it with curl. This mental model is the foundation for everything else in this path: get it right and the rest falls into place.

Concepts

The request-response cycle

Every interaction on the web is a client asking for something and a server answering. When you type a URL and press Enter, here is what happens:

  1. Your browser resolves the domain to an IP address via DNS.
  2. It opens a TCP connection (with TLS handshake for HTTPS).
  3. It sends an HTTP request, a structured text message with a method, path, headers, and optional body.
  4. The server reads that request and sends back an HTTP response, a status code, headers, and a body.

That body is what your scraper cares about. It could be HTML, JSON, XML, a binary file, or anything else the server feels like sending.

# Inspect the raw response headers of any URL
curl -I https://quotes.toscrape.com

You will see something like:

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Server: gunicorn/19.7.1

The -I flag sends a HEAD request (headers only, no body). Useful for checking if a page exists, what content type it returns, and whether it redirects.
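Because HTTP/1.1 is plain text, you can even speak it by hand. Here is a minimal sketch using Python's socket module, assuming the site still answers plain HTTP on port 80 (no TLS); real scrapers should use an HTTP library instead:

import socket

# Build the request by hand: method and path, then headers, then a blank line.
host = "quotes.toscrape.com"
request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {host}\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.create_connection((host, 80), timeout=10) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):
        response += chunk

# Status line and headers come first, then a blank line, then the body.
print(response.decode("utf-8", errors="replace")[:300])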

HTTP status codes you must know

Status codes are three-digit numbers the server sends to tell you what happened.

Range  Meaning         Examples you will hit
2xx    Success         200 OK, 201 Created
3xx    Redirect        301 Moved Permanently, 302 Found (temporary)
4xx    Your fault      400 Bad Request, 403 Forbidden, 404 Not Found, 429 Too Many Requests
5xx    Server's fault  500 Internal Server Error, 503 Service Unavailable

A scraper that blindly assumes every response is 200 will produce garbage data silently. Always check the status code.

import requests

response = requests.get("https://quotes.toscrape.com")
print(response.status_code)   # 200
print(response.headers["Content-Type"])  # text/html; charset=utf-8

The three categories of web pages

This is the most important thing to understand before you write a single line of scraping code. Pick the wrong tool and you will waste hours.

Category 1, Server-rendered HTML

The server builds the complete HTML page and sends it in one response. When you curl the URL, you get readable HTML with the actual content inside it. This is the easiest to scrape. Tools: requests + BeautifulSoup or Scrapy.

Classic signs: Content-Type: text/html, actual text content visible in curl output, minimal JavaScript in the source.
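As a taste of the Category 1 workflow (parsing is covered later in this path), here is a minimal sketch using requests plus BeautifulSoup (pip install beautifulsoup4); it assumes quotes.toscrape.com still marks quote text with the CSS class "text", as it does at the time of writing:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://quotes.toscrape.com", timeout=10)
soup = BeautifulSoup(r.text, "html.parser")

# The content is right there in the server-rendered HTML.
for quote in soup.select(".text")[:3]:
    print(quote.get_text())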

Category 2, JSON APIs

The server exposes a clean JSON endpoint. The "website" is often just a thin JavaScript frontend that calls this API. If you find the API URL, you can hit it directly and get structured data, no HTML parsing needed.

Classic signs: Content-Type: application/json, URL paths like /api/v2/..., data visible in the Network tab under XHR/Fetch requests.
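For example, GitHub's public API returns structured JSON you can consume directly:

import requests

# No HTML parsing: the server hands you structured data.
r = requests.get("https://api.github.com/repos/psf/requests", timeout=10)
data = r.json()
print(data["full_name"], "-", data["stargazers_count"], "stars")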

Category 3, JavaScript-rendered SPAs

The server sends a mostly empty HTML shell, and JavaScript in the browser fetches data and builds the DOM. If you curl the URL, you get a skeleton with no useful content. You need a real browser (or a headless one) to render the page. Tools: Playwright or Selenium.

Classic signs: curl output shows <div id="root"></div> or similar, all the content appears in the browser but not in curl.
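A minimal sketch with Playwright (pip install playwright, then playwright install chromium), pointed at the JavaScript-rendered variant of the quotes site:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    # /js/ is the JavaScript-rendered version of quotes.toscrape.com
    page.goto("https://quotes.toscrape.com/js/")
    html = page.content()  # the DOM after JavaScript has run
    browser.close()

print(len(html), "characters of rendered HTML")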

How to identify the category in DevTools

Open Chrome (or Firefox), press F12, go to the Network tab. Reload the page. Now look:

  • Click on the first request (the HTML document). Check the Response tab. If you see your content, it is Category 1.
  • Filter by XHR/Fetch. If there are requests to /api/... or similar that return JSON with your data, it is Category 2.
  • If the Response tab for the main document is nearly empty but the page looks full, it is Category 3.

This three-second check will save you hours.

Making your first request with Python

requests is the de facto standard HTTP library in Python. It is not part of the standard library, so install it with pip:

pip install requests

import requests

# A simple GET request
response = requests.get("https://quotes.toscrape.com")

# Check that it succeeded
assert response.status_code == 200, f"Got {response.status_code}"

# The body as text (decoded using the charset from headers)
html = response.text
print(html[:500])  # First 500 characters

# The body as raw bytes (useful for non-text content)
raw = response.content
print(len(raw), "bytes")

Hands-on

Let us identify the category for three real URLs and make programmatic requests to each.

import requests

targets = [
    ("Server-rendered HTML", "https://quotes.toscrape.com"),
    ("JSON API",             "https://api.github.com/repos/psf/requests"),
    ("Static JSON",         "https://httpbin.org/json"),
]

for label, url in targets:
    r = requests.get(url, timeout=10)
    ct = r.headers.get("Content-Type", "unknown")
    print(f"{label}")
    print(f"  URL:          {url}")
    print(f"  Status:       {r.status_code}")
    print(f"  Content-Type: {ct}")
    print(f"  Body snippet: {r.text[:80].strip()}")
    print()

Expected output (trimmed):

Server-rendered HTML
  URL:          https://quotes.toscrape.com
  Status:       200
  Content-Type: text/html; charset=utf-8
  Body snippet: <!DOCTYPE html>

JSON API
  URL:          https://api.github.com/repos/psf/requests
  Status:       200
  Content-Type: application/json; charset=utf-8
  Body snippet: {"id":1362490,"node_id":"MDEwOlJlcG9zaXRvcnkxMzYyNDkw...

Static JSON
  URL:          https://httpbin.org/json
  Status:       200
  Content-Type: application/json
  Body snippet: {"slideshow": {"author": "Yours Truly", "date": "date of pub...

Now let us check redirect behaviour:

import requests

# requests follows redirects by default
r = requests.get("https://httpbin.org/redirect/2", timeout=10)
print("Final URL:", r.url)
print("Redirect history:", [resp.status_code for resp in r.history])

# Disable redirect following to see the raw 302
r2 = requests.get("https://httpbin.org/redirect/1", allow_redirects=False, timeout=10)
print("Status without following:", r2.status_code)
print("Location header:", r2.headers.get("Location"))

Finally, inspect request headers using httpbin.org/headers; it echoes back exactly what your scraper is sending. This is your best tool for debugging why a server is rejecting you.

import requests

r = requests.get("https://httpbin.org/headers", timeout=10)
print(r.json())
# {'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate',
#              'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.31.0'}}

Notice the User-Agent. Many sites block python-requests/... immediately. We will deal with that in Lesson 7, but this is where you first see the problem.
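As a quick preview of the fix (Lesson 7 goes deeper), you can pass your own headers dict to override the default User-Agent; the string below is just an illustrative placeholder:

import requests

# Any headers you pass are merged into the outgoing request.
headers = {"User-Agent": "Mozilla/5.0 (compatible; learning-scraper/0.1)"}
r = requests.get("https://httpbin.org/headers", headers=headers, timeout=10)
print(r.json()["headers"]["User-Agent"])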

Common pitfalls

  • Assuming all responses are 200. A server can return 200 with an error page ("Please log in to continue") or a 403 with a helpful message. Always check both the status code and that the body actually contains what you expect.

  • Treating HTML and JSON endpoints interchangeably. If Content-Type is text/html but you call response.json(), you get a JSONDecodeError. Check the content type before deciding how to parse.

  • Missing charset. response.text uses the encoding declared in the Content-Type header, or tries to detect it. When that is wrong (common with older Indian government sites), you get mojibake. Use response.encoding = 'utf-8' to override before reading .text, or use response.content.decode('utf-8') directly.

  • Ignoring redirects. A site might redirect http:// to https://, or a page might have moved. By default requests follows up to 30 redirects, which is fine. But if you disable redirects, check response.headers['Location'] manually.

  • Confusing curl's -I (HEAD) with curl's -i (include headers in GET). HEAD requests are sometimes handled differently on the server and may return a different status than a real GET. When debugging, use curl -i https://example.com to see headers and body together.

  • Forgetting timeouts. requests.get(url) with no timeout will hang forever if the server is slow or unresponsive. Always pass timeout=(connect_timeout, read_timeout), e.g. timeout=(5, 30). The sketch after this list folds several of these checks together.
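Putting the pitfalls above together, here is a hedged sketch of a defensive fetch helper (fetch is a hypothetical name for this lesson, not a library function):

import requests

def fetch(url):
    # Connect timeout 5 s, read timeout 30 s: never hang forever.
    try:
        r = requests.get(url, timeout=(5, 30))
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    # r.ok is True for status codes below 400.
    if not r.ok:
        print(f"Got {r.status_code} from {url}")
        return None
    # Parse according to the declared content type, not blind hope.
    if "application/json" in r.headers.get("Content-Type", ""):
        return r.json()
    # If the declared charset is wrong (mojibake), force it here:
    # r.encoding = "utf-8"
    return r.text

print(type(fetch("https://httpbin.org/json")))         # <class 'dict'>
print(type(fetch("https://quotes.toscrape.com")))      # <class 'str'>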

What to try next

  1. Open https://books.toscrape.com in DevTools, go to the Network tab, and identify every XHR/Fetch request the page makes. Is it Category 1, 2, or 3? Try curl -I https://books.toscrape.com to confirm.

  2. Write a small script that takes a list of 5 URLs and prints the status code, Content-Type, and category (your own categorisation logic) for each. Handle timeouts gracefully with a try/except.

  3. Visit https://httpbin.org/status/404 and https://httpbin.org/status/429 in Python. Print the status codes. Practice writing code that raises an exception for anything that is not 2xx using response.raise_for_status().

