Lesson 3 of 10 · 7 min read

HTML Parsing with BeautifulSoup


What you'll build

By the end of this lesson you will have a working scraper that visits every page of books.toscrape.com, extracts each book's title, price, star rating, and availability, and prints them as structured records. You will also know how to handle the most common parsing failures gracefully (None returns, encoding quirks, and brittle selectors) so your scraper does not silently produce wrong data.

Concepts

Installing and loading a parser

BeautifulSoup needs a parser backend. html.parser ships with Python and works well for most pages. lxml is faster and more lenient on malformed HTML.

pip install beautifulsoup4 lxml

from bs4 import BeautifulSoup

html = """
<html>
  <body>
    <h1 class="title">Hello, World</h1>
    <p id="intro">This is a <strong>test</strong> page.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "lxml")
print(soup.h1.text)          # Hello, World
print(soup.find("p").text)   # This is a test page.

Always pass "lxml" as the parser. html.parser is fine for clean HTML, but lxml handles broken HTML (mismatched tags, missing quotes) far better, and real-world websites are full of broken HTML.
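To see this leniency in action, here is a small sketch (using deliberately broken, made-up markup) where lxml quietly repairs unclosed tags and an unquoted attribute:

```python
from bs4 import BeautifulSoup

# Deliberately broken HTML: unclosed <li> tags and an unquoted attribute value
broken = "<ul class=nav><li>Home<li>About</ul>"

soup = BeautifulSoup(broken, "lxml")
print([li.text for li in soup.find_all("li")])  # ['Home', 'About']
```

lxml inserts the implied closing tags, so both list items come out as clean siblings rather than nested or mangled nodes.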

find vs find_all

find() returns the first match or None. find_all() returns a list (possibly empty). This distinction matters a lot.

from bs4 import BeautifulSoup

html = "<ul><li>A</li><li>B</li><li>C</li></ul>"
soup = BeautifulSoup(html, "lxml")

first_li = soup.find("li")
print(first_li.text)   # A

all_li = soup.find_all("li")
print([el.text for el in all_li])  # ['A', 'B', 'C']

# find returns None if not found, not an exception
missing = soup.find("table")
print(missing)          # None
print(type(missing))    # <class 'NoneType'>

# find_all returns empty list if not found, never None
missing_all = soup.find_all("table")
print(missing_all)      # []

Never do soup.find("div").find("span").text without checking for None. If the outer find returns None, you get AttributeError: 'NoneType' object has no attribute 'find'. Always guard:

container = soup.find("div", class_="product")
if container:
    price = container.find("span", class_="price")
    print(price.text if price else "N/A")

CSS selectors with select

soup.select("css selector") always returns a list. soup.select_one("css selector") returns the first match or None. CSS selectors are usually more concise than nested find chains.

from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2 class="name">Widget A</h2>
  <span class="price">Rs. 299</span>
</div>
<div class="product">
  <h2 class="name">Widget B</h2>
  <span class="price">Rs. 499</span>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# All prices
prices = soup.select("div.product span.price")
print([p.text for p in prices])  # ['Rs. 299', 'Rs. 499']

# First product's name only
name = soup.select_one("div.product h2.name")
print(name.text)  # Widget A

CSS selector cheat sheet for scraping:

  • div.classname, element with class
  • #some-id, element with id
  • div > p, direct child
  • div p, descendant (any depth)
  • a[href], has attribute
  • a[href^="/"], href starts with /
  • li:nth-child(2), second li
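As a quick sanity check, the following sketch exercises a few of these selectors against a made-up snippet:

```python
from bs4 import BeautifulSoup

html = """
<div id="menu">
  <ul>
    <li><a href="/home">Home</a></li>
    <li><a href="https://example.com">External</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "lxml")

print(soup.select_one("#menu").name)                  # div
print([a.text for a in soup.select("a[href]")])       # ['Home', 'External']
print([a.text for a in soup.select('a[href^="/"]')])  # ['Home']
print(soup.select_one("li:nth-child(2) a").text)      # External
```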

Extracting text and attributes

.text and .get_text() give you the text content. .get("attr") gives you an attribute value safely.

from bs4 import BeautifulSoup

html = '<a href="/products/1" data-id="42">Buy now</a>'
soup = BeautifulSoup(html, "lxml")

a = soup.find("a")
print(a.text)           # Buy now
print(a.get("href"))    # /products/1
print(a.get("data-id")) # 42
print(a.get("class"))   # None (not AttributeError)

# .get_text() lets you specify a separator and strip whitespace
html2 = "<div><p>  Hello  </p><p>World</p></div>"
soup2 = BeautifulSoup(html2, "lxml")
print(soup2.div.get_text(separator=" | ", strip=True))  # Hello | World

Prefer .get("attr") over el["attr"]. The dict-style access raises KeyError if the attribute is missing; .get() returns None.
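A minimal demonstration of the difference:

```python
from bs4 import BeautifulSoup

img = BeautifulSoup('<img src="/logo.png">', "lxml").find("img")

print(img.get("alt"))   # None -- safe default
try:
    img["alt"]          # dict-style access raises on a missing attribute
except KeyError:
    print("KeyError")
```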

Navigating the tree

Sometimes the data you want has no class or id; in that case you have to navigate relative to a sibling or parent.

from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Widget</td><td>Rs. 99</td></tr>
</table>
"""
soup = BeautifulSoup(html, "lxml")

# Skip the header row, then read each data row's cells by position
rows = soup.find_all("tr")[1:]
for row in rows:
    cells = row.find_all("td")
    name  = cells[0].text.strip()
    price = cells[1].text.strip()
    print(name, price)

Useful navigation properties:

  • .parent, one level up
  • .children, direct children (generator)
  • .next_sibling, .previous_sibling, adjacent nodes (may be whitespace NavigableString)
  • .next_element, very next node in the parse tree
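The whitespace caveat on .next_sibling trips people up often. This small sketch shows it, along with find_next_sibling(), which skips over text nodes:

```python
from bs4 import BeautifulSoup

html = "<div>\n  <span>A</span>\n  <span>B</span>\n</div>"
soup = BeautifulSoup(html, "lxml")

first = soup.find("span")
print(repr(first.next_sibling))              # a whitespace NavigableString, not <span>B</span>
print(first.find_next_sibling("span").text)  # B
print(first.parent.name)                     # div
```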

Hands-on

Let us build a full scraper for books.toscrape.com. It paginates automatically, extracts the key fields, and stores them in a list of dicts.

import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass, asdict
import json
import time
import random

BASE_URL = "https://books.toscrape.com"

RATING_MAP = {
    "One": 1, "Two": 2, "Three": 3,
    "Four": 4, "Five": 5,
}

@dataclass
class Book:
    title: str
    price: float
    rating: int
    available: bool
    url: str


def parse_books(soup: BeautifulSoup, page_url: str) -> list[Book]:
    """Extract all books from a catalogue page."""
    books = []
    for article in soup.select("article.product_pod"):
        # Title is in the 'title' attribute of the <a> tag (full title)
        a_tag = article.select_one("h3 > a")
        title = a_tag.get("title", "").strip() if a_tag else "Unknown"

        # Price: strip the pound sign and convert
        price_el = article.select_one("p.price_color")
        try:
            price = float(price_el.text.strip().replace("£", "").replace("Â", ""))
        except (AttributeError, ValueError):
            price = 0.0

        # Rating: class="star-rating Three"; the second class name is the rating word
        rating_el = article.select_one("p.star-rating")
        classes = rating_el.get("class", []) if rating_el else []
        rating_word = classes[1] if len(classes) > 1 else ""
        rating = RATING_MAP.get(rating_word, 0)

        # Availability
        avail_el = article.select_one("p.availability")
        available = "In stock" in (avail_el.text if avail_el else "")

        # Relative URL → absolute
        href = a_tag.get("href", "") if a_tag else ""
        full_url = BASE_URL + "/catalogue/" + href.replace("../", "")

        books.append(Book(title, price, rating, available, full_url))
    return books


def scrape_all_books(max_pages: int = 5) -> list[Book]:
    """Crawl up to max_pages pages of the catalogue."""
    session = requests.Session()
    session.headers["User-Agent"] = (
        "Mozilla/5.0 (compatible; BookScraper/1.0; +https://example.com)"
    )

    all_books = []
    url = f"{BASE_URL}/catalogue/page-1.html"

    for page_num in range(1, max_pages + 1):
        time.sleep(random.uniform(1, 2))
        r = session.get(url, timeout=15)
        r.raise_for_status()

        soup = BeautifulSoup(r.content, "lxml")
        books = parse_books(soup, url)
        all_books.extend(books)
        print(f"Page {page_num}: scraped {len(books)} books (total {len(all_books)})")

        # Find the "next" button
        next_btn = soup.select_one("li.next > a")
        if not next_btn:
            print("No more pages.")
            break
        next_href = next_btn.get("href", "")
        url = f"{BASE_URL}/catalogue/{next_href}"

    return all_books


if __name__ == "__main__":
    books = scrape_all_books(max_pages=5)
    print(f"\nTotal books scraped: {len(books)}")
    print("Sample:")
    for b in books[:3]:
        print(" ", asdict(b))

Expected output (titles and prices will match the site):

Page 1: scraped 20 books (total 20)
Page 2: scraped 20 books (total 40)
Page 3: scraped 20 books (total 60)
Page 4: scraped 20 books (total 80)
Page 5: scraped 20 books (total 100)

Total books scraped: 100
Sample:
  {'title': 'A Light in the Attic', 'price': 51.77, 'rating': 3, 'available': True, ...}

Note the use of r.content (bytes) instead of r.text (string) when creating the soup. Passing bytes lets BeautifulSoup detect encoding from the HTML <meta charset> tag, which is more reliable than trusting the HTTP header.
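Here is a sketch of that idea using an inline byte string in place of a live response body, so it runs offline:

```python
from bs4 import BeautifulSoup

# UTF-8 bytes with a meta charset tag, standing in for r.content
raw = ('<html><head><meta charset="utf-8"></head>'
       '<body><p class="price_color">£51.77</p></body></html>').encode("utf-8")

soup = BeautifulSoup(raw, "lxml")
print(soup.p.text)             # £51.77
print(soup.original_encoding)  # the encoding BeautifulSoup detected from the bytes
```

Because BeautifulSoup received bytes, it detects the encoding itself (exposed as soup.original_encoding) instead of inheriting whatever requests guessed from the HTTP headers.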

Common pitfalls

  • Calling .text on None. soup.find("div", class_="missing") returns None. Chaining .text on it gives AttributeError. Always check for None before accessing attributes, or use the or idiom: (el.text if el else "").

  • Brittle positional selectors. soup.find_all("td")[5] breaks the moment the site adds a column. Prefer class-based or attribute-based selectors tied to semantic meaning.

  • Price character encoding. books.toscrape.com often includes a stray Â character before the £ sign when the response encoding is misdetected. Always strip and clean numeric fields before conversion.

  • Short text vs full title. The visible text of <a> tags is often truncated. The title attribute usually has the full string. Check both.

  • Whitespace in .text. Text nodes include surrounding whitespace and newlines. Always .strip() before storing or comparing.

  • Assuming the next-page link is always present. On the last page, li.next does not exist. Your loop must check for None before following the link, or you get an AttributeError trying to call .get("href") on None.

What to try next

  1. Extend the scraper to also extract the genre (from the breadcrumb navigation) for each book. Hint: soup.select("ul.breadcrumb li").

  2. Scrape a specific genre page (e.g. https://books.toscrape.com/catalogue/category/books/mystery_3/index.html) and compare average prices across three different genres.

  3. Add error handling so that if any individual book's price fails to parse, the script logs a warning but continues rather than crashing.
