# How the Web Serves Data
## What you'll build
By the end of this lesson you will be able to open any website in Chrome DevTools, identify whether it serves plain HTML, a JSON API, or a JavaScript-rendered SPA, and decide which scraping approach fits. You will also make your first automated HTTP request in Python and verify it with curl. This mental model is the foundation for everything else in this path: get it right and the rest falls into place.
## Concepts
### The request-response cycle
Every interaction on the web is a client asking for something and a server answering. When you type a URL and press Enter, here is what happens:
- Your browser resolves the domain to an IP address via DNS.
- It opens a TCP connection (with TLS handshake for HTTPS).
- It sends an HTTP request, a structured text message with a method, path, headers, and optional body.
- The server reads that request and sends back an HTTP response, a status code, headers, and a body.
That body is what your scraper cares about. It could be HTML, JSON, XML, a binary file, or anything else the server feels like sending.
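Before reaching for a library, it helps to see that an HTTP request really is just structured text. A minimal sketch that builds the message by hand (the host and path are illustrative, chosen to match the examples in this lesson):

```python
# Build the raw text of a minimal HTTP/1.1 GET request by hand.
# Real clients send many more headers; this is the bare minimum.
host = "quotes.toscrape.com"
path = "/"

request_text = (
    f"GET {path} HTTP/1.1\r\n"   # request line: method, path, protocol
    f"Host: {host}\r\n"          # Host is mandatory in HTTP/1.1
    f"Connection: close\r\n"     # ask the server to close after responding
    f"\r\n"                      # a blank line ends the headers
)

print(request_text)
```

Everything a library like `requests` does starts by producing a message shaped like this; the response that comes back is the same idea in reverse, with a status line instead of a request line.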
```shell
# Inspect the raw response headers of any URL
curl -I https://quotes.toscrape.com
```
You will see something like:
```text
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Server: gunicorn/19.7.1
```
The `-I` flag sends a HEAD request (headers only, no body). It is useful for checking whether a page exists, what content type it returns, and whether it redirects.
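The status line in that output has a fixed shape: protocol, a three-digit code, and a reason phrase. A small sketch (the helper name is my own) that splits it apart:

```python
def parse_status_line(line: str) -> tuple[str, int, str]:
    """Split an HTTP status line into protocol, code, and reason phrase."""
    # maxsplit=2 keeps multi-word reason phrases ("Not Found") intact
    protocol, code, reason = line.split(" ", 2)
    return protocol, int(code), reason

print(parse_status_line("HTTP/1.1 200 OK"))         # ('HTTP/1.1', 200, 'OK')
print(parse_status_line("HTTP/1.1 404 Not Found"))  # ('HTTP/1.1', 404, 'Not Found')
```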
### HTTP status codes you must know
Status codes are three-digit numbers the server sends to tell you what happened.
| Range | Meaning | Examples you will hit |
|---|---|---|
| 2xx | Success | 200 OK, 201 Created |
| 3xx | Redirect | 301 Moved Permanently, 302 Found (temporary) |
| 4xx | Your fault | 400 Bad Request, 403 Forbidden, 404 Not Found, 429 Too Many Requests |
| 5xx | Server's fault | 500 Internal Error, 503 Service Unavailable |
A scraper that blindly assumes every response is 200 will produce garbage data silently. Always check the status code.
```python
import requests

response = requests.get("https://quotes.toscrape.com")
print(response.status_code)              # 200
print(response.headers["Content-Type"])  # text/html; charset=utf-8
```
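The ranges in the table map directly onto integer division by 100. A sketch of a helper that turns any code into its coarse category (the function and category names are my own):

```python
def status_category(code: int) -> str:
    """Map an HTTP status code to the coarse categories from the table above."""
    return {
        2: "success",
        3: "redirect",
        4: "client error",   # your fault
        5: "server error",   # server's fault
    }.get(code // 100, "unknown")

print(status_category(200))  # success
print(status_category(301))  # redirect
print(status_category(429))  # client error
print(status_category(503))  # server error
```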
### The three categories of web pages
This is the most important thing to understand before you write a single line of scraping code. Pick the wrong tool and you will waste hours.
#### Category 1: Server-rendered HTML
The server builds the complete HTML page and sends it in one response. When you curl the URL, you get readable HTML with the actual content inside it. This is the easiest to scrape. Tools: requests + BeautifulSoup or Scrapy.
Classic signs: `Content-Type: text/html`, actual text content visible in curl output, minimal JavaScript in the source.
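To see why Category 1 is the easy case, here is a sketch that pulls the text out of a server-rendered page using only the standard library (the HTML snippet is made up; in later lessons you would use BeautifulSoup for this):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

# A server-rendered page: the content is right there in the HTML.
html = ('<html><body><span class="text">"A witty quote."</span>'
        '<small class="author">Some Author</small></body></html>')

parser = TextExtractor()
parser.feed(html)
print(parser.chunks)  # ['"A witty quote."', 'Some Author']
```

The data arrives already rendered, so extraction is just parsing; no browser, no JavaScript engine.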
#### Category 2: JSON APIs
The server exposes a clean JSON endpoint. The "website" is often just a thin JavaScript frontend that calls this API. If you find the API URL, you can hit it directly and get structured data, no HTML parsing needed.
Classic signs: `Content-Type: application/json`, URL paths like `/api/v2/...`, data visible in the Network tab under XHR/Fetch requests.
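Once you have found such an endpoint, the payload is already structured. A sketch with a made-up body shaped loosely like the GitHub API response used later in this lesson (the fields here are illustrative, not the real schema):

```python
import json

# Imagine this is response.text from a GET to an /api/... endpoint.
body = '{"id": 1362490, "name": "requests", "stargazers_count": 50000}'

data = json.loads(body)          # a plain dict; no HTML parsing needed
print(data["name"])              # requests
print(data["stargazers_count"])  # 50000
```

With a live response you would call `response.json()` instead, which does the same decoding for you.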
#### Category 3: JavaScript-rendered SPAs
The server sends a mostly empty HTML shell, and JavaScript in the browser fetches data and builds the DOM. If you curl the URL, you get a skeleton with no useful content. You need a real browser (or a headless one) to render the page.
Classic signs: curl output shows `<div id="root"></div>` or similar; all the content appears in the browser but not in curl.
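A rough way to spot Category 3 programmatically is to check how much visible text survives once the tags are stripped. This sketch uses a crude regex tag-stripper, which is acceptable for a heuristic (never for real parsing), and the threshold is an arbitrary choice of mine:

```python
import re

def looks_like_empty_shell(html: str, threshold: int = 50) -> bool:
    """True if the HTML contains almost no visible text (an SPA shell)."""
    # Drop script/style bodies first, then every remaining tag.
    no_scripts = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html,
                        flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", no_scripts)
    return len(text.strip()) < threshold

spa = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
server_rendered = "<html><body><p>" + "Real content. " * 20 + "</p></body></html>"

print(looks_like_empty_shell(spa))              # True
print(looks_like_empty_shell(server_rendered))  # False
```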
### How to identify the category in DevTools
Open Chrome (or Firefox), press F12, go to the Network tab. Reload the page. Now look:
- Click on the first request (the HTML document) and check the Response tab. If you see your content, it is Category 1.
- Filter by XHR/Fetch. If there are requests to `/api/...` or similar that return JSON with your data, it is Category 2.
- If the Response tab for the main document is nearly empty but the page looks full, it is Category 3.
This three-second check will save you hours.
### Making your first request with Python
`requests` is the de facto standard HTTP library in Python. It is not part of the standard library, so install it with pip:
```shell
pip install requests
```
```python
import requests

# A simple GET request
response = requests.get("https://quotes.toscrape.com")

# Check that it succeeded
assert response.status_code == 200, f"Got {response.status_code}"

# The body as text (decoded using the charset from headers)
html = response.text
print(html[:500])  # First 500 characters

# The body as raw bytes (useful for non-text content)
raw = response.content
print(len(raw), "bytes")
```
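The difference between `.text` and `.content` is decoding. The whole issue fits in a few lines, offline, using a byte string such as `response.content` might hold:

```python
# The UTF-8 bytes for "café" — pretend this is response.content.
raw = "café".encode("utf-8")
print(raw)                    # b'caf\xc3\xa9'

# Decode with the right charset and you recover the original text...
print(raw.decode("utf-8"))    # café

# ...but decode with the wrong one and you get mojibake.
print(raw.decode("latin-1"))  # cafÃ©
```

`.text` performs this decode step for you using the charset from the headers, which is exactly why a wrong or missing charset produces garbled output.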
## Hands-on
Let us identify the category for three real URLs and make programmatic requests to each.
```python
import requests

targets = [
    ("Server-rendered HTML", "https://quotes.toscrape.com"),
    ("JSON API", "https://api.github.com/repos/psf/requests"),
    ("Static JSON", "https://httpbin.org/json"),
]

for label, url in targets:
    r = requests.get(url, timeout=10)
    ct = r.headers.get("Content-Type", "unknown")
    print(f"{label}")
    print(f"  URL: {url}")
    print(f"  Status: {r.status_code}")
    print(f"  Content-Type: {ct}")
    print(f"  Body snippet: {r.text[:80].strip()}")
    print()
```
Expected output (trimmed):
```text
Server-rendered HTML
  URL: https://quotes.toscrape.com
  Status: 200
  Content-Type: text/html; charset=utf-8
  Body snippet: <!DOCTYPE html>

JSON API
  URL: https://api.github.com/repos/psf/requests
  Status: 200
  Content-Type: application/json; charset=utf-8
  Body snippet: {"id":1362490,"node_id":"MDEwOlJlcG9zaXRvcnkxMzYyNDkw...

Static JSON
  URL: https://httpbin.org/json
  Status: 200
  Content-Type: application/json
  Body snippet: {"slideshow": {"author": "Yours Truly", "date": "date of pub...
```
Now let us check redirect behaviour:
```python
import requests

# requests follows redirects by default
r = requests.get("https://httpbin.org/redirect/2", timeout=10)
print("Final URL:", r.url)
print("Redirect history:", [resp.status_code for resp in r.history])

# Disable redirect following to see the raw 302
r2 = requests.get("https://httpbin.org/redirect/1", allow_redirects=False, timeout=10)
print("Status without following:", r2.status_code)
print("Location header:", r2.headers.get("Location"))
```
Finally, inspect your own request headers with httpbin.org/headers, which echoes back exactly what your scraper is sending. This is your best tool for debugging why a server is rejecting you.
```python
import requests

r = requests.get("https://httpbin.org/headers", timeout=10)
print(r.json())
# {'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate',
#              'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.31.0'}}
```
Notice the `User-Agent`. Many sites block `python-requests/...` immediately. We will deal with that in Lesson 7, but this is where you first see the problem.
## Common pitfalls
- **Assuming all responses are 200.** A server can return 200 with an error page ("Please log in to continue") or a 403 with a helpful message. Always check both the status code and that the body actually contains what you expect.
- **Treating HTML and JSON endpoints interchangeably.** If `Content-Type` is `text/html` but you call `response.json()`, you get a `JSONDecodeError`. Check the content type before deciding how to parse.
- **Missing charset.** `response.text` uses the encoding declared in the `Content-Type` header, or tries to detect it. When that is wrong (common with older Indian government sites), you get mojibake. Use `response.encoding = 'utf-8'` to override before reading `.text`, or use `response.content.decode('utf-8')` directly.
- **Ignoring redirects.** A site might redirect `http://` to `https://`, or a page might have moved. By default `requests` follows up to 30 redirects, which is fine. But if you disable redirects, check `response.headers['Location']` manually.
- **Confusing curl's `-I` (HEAD) with curl's `-i` (include headers in GET).** HEAD requests are sometimes handled differently on the server and may return a different status than a real GET. When debugging, use `curl -i https://example.com` to see headers and body together.
- **Forgetting timeouts.** `requests.get(url)` with no timeout will hang forever if the server is slow or unresponsive. Always pass `timeout=(connect_timeout, read_timeout)`, e.g. `timeout=(5, 30)`.
## What to try next
- Open https://books.toscrape.com in DevTools, go to the Network tab, and identify every XHR/Fetch request the page makes. Is it Category 1, 2, or 3? Try `curl -I https://books.toscrape.com` to confirm.
- Write a small script that takes a list of 5 URLs and prints the status code, Content-Type, and category (your own categorisation logic) for each. Handle timeouts gracefully with a try/except.
- Visit `https://httpbin.org/status/404` and `https://httpbin.org/status/429` in Python. Print the status codes. Practice writing code that raises an exception for anything that is not 2xx using `response.raise_for_status()`.