HTML Parsing with BeautifulSoup
What you'll build
By the end of this lesson you will have a working scraper that visits every page of books.toscrape.com, extracts each book's title, price, star rating, and availability, and prints them as structured records. You will also know how to handle the most common parsing failures gracefully: None returns, encoding quirks, and brittle selectors, so your scraper does not silently produce wrong data.
Concepts
Installing and loading a parser
BeautifulSoup needs a parser backend. html.parser ships with Python and works well for most pages. lxml is faster and more lenient on malformed HTML.
pip install beautifulsoup4 lxml
from bs4 import BeautifulSoup
html = """
<html>
<body>
<h1 class="title">Hello, World</h1>
<p id="intro">This is a <strong>test</strong> page.</p>
</body>
</html>
"""
soup = BeautifulSoup(html, "lxml")
print(soup.h1.text) # Hello, World
print(soup.find("p").text) # This is a test page.
Always pass "lxml" as the parser. html.parser is fine for clean HTML, but lxml handles broken HTML (mismatched tags, missing quotes) far better, and real-world websites are full of broken HTML.
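To see how the two parsers cope, feed both a deliberately broken fragment. This is a small illustrative comparison; exactly how each parser repairs the tree can vary by version:

```python
from bs4 import BeautifulSoup

# Deliberately broken HTML: unclosed <li> tags and a stray </div>
broken = "<ul><li>One<li>Two</div><li>Three</ul>"

for parser in ("html.parser", "lxml"):
    soup = BeautifulSoup(broken, parser)
    items = [li.get_text() for li in soup.find_all("li")]
    print(parser, "->", items)
```

With html.parser the unclosed `<li>` tags may end up nested inside each other, while lxml closes them as siblings. On real pages that kind of difference changes what `.text` returns, which is why lxml is the safer default.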
find vs find_all
find() returns the first match or None. find_all() returns a list (possibly empty). This distinction matters a lot.
from bs4 import BeautifulSoup
html = "<ul><li>A</li><li>B</li><li>C</li></ul>"
soup = BeautifulSoup(html, "lxml")
first_li = soup.find("li")
print(first_li.text) # A
all_li = soup.find_all("li")
print([el.text for el in all_li]) # ['A', 'B', 'C']
# find returns None if not found, not an exception
missing = soup.find("table")
print(missing) # None
print(type(missing)) # <class 'NoneType'>
# find_all returns empty list if not found, never None
missing_all = soup.find_all("table")
print(missing_all) # []
Never do soup.find("div").find("span").text without checking for None. If the outer find returns None, you get AttributeError: 'NoneType' object has no attribute 'find'. Always guard:
container = soup.find("div", class_="product")
if container:
    price = container.find("span", class_="price")
    print(price.text if price else "N/A")
CSS selectors with select
soup.select("css selector") always returns a list. soup.select_one("css selector") returns the first match or None. CSS selectors are usually more concise than nested find chains.
from bs4 import BeautifulSoup
html = """
<div class="product">
<h2 class="name">Widget A</h2>
<span class="price">Rs. 299</span>
</div>
<div class="product">
<h2 class="name">Widget B</h2>
<span class="price">Rs. 499</span>
</div>
"""
soup = BeautifulSoup(html, "lxml")
# All prices
prices = soup.select("div.product span.price")
print([p.text for p in prices]) # ['Rs. 299', 'Rs. 499']
# First product's name only
name = soup.select_one("div.product h2.name")
print(name.text) # Widget A
CSS selector cheat sheet for scraping:
- div.classname, element with class
- #some-id, element with id
- div > p, direct child
- div p, descendant (any depth)
- a[href], has attribute
- a[href^="/"], href starts with /
- li:nth-child(2), second li
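A quick demonstration of a few of these selector forms on a small made-up fragment:

```python
from bs4 import BeautifulSoup

html = """
<div id="main">
  <a href="/home">Home</a>
  <a href="https://external.example">Elsewhere</a>
  <ul><li>first</li><li>second</li></ul>
</div>
"""
soup = BeautifulSoup(html, "lxml")

print(soup.select_one("#main > a").text)                  # direct child: Home
print([a["href"] for a in soup.select('a[href^="/"]')])   # hrefs starting with /
print(soup.select_one("li:nth-child(2)").text)            # second li
```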
Extracting text and attributes
.text and .get_text() give you the text content. .get("attr") gives you an attribute value safely.
from bs4 import BeautifulSoup
html = '<a href="/products/1" data-id="42">Buy now</a>'
soup = BeautifulSoup(html, "lxml")
a = soup.find("a")
print(a.text) # Buy now
print(a.get("href")) # /products/1
print(a.get("data-id")) # 42
print(a.get("class")) # None (not AttributeError)
# .get_text() lets you specify a separator and strip whitespace
html2 = "<div><p> Hello </p><p>World</p></div>"
soup2 = BeautifulSoup(html2, "lxml")
print(soup2.div.get_text(separator=" | ", strip=True)) # Hello | World
Prefer .get("attr") over el["attr"]. The dict-style access raises KeyError if the attribute is missing; .get() returns None.
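The difference in failure mode is easy to demonstrate on a one-line fragment:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/products/1">Buy</a>', "lxml")
a = soup.find("a")

print(a.get("data-id"))  # None -- safe, no exception
try:
    a["data-id"]         # dict-style access on a missing attribute raises
except KeyError as exc:
    print("KeyError:", exc)
```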
Navigating the tree
Sometimes the data you want has no class or id; you have to navigate relative to a sibling or parent.
from bs4 import BeautifulSoup
html = """
<table>
<tr><th>Name</th><th>Price</th></tr>
<tr><td>Widget</td><td>Rs. 99</td></tr>
</table>
"""
soup = BeautifulSoup(html, "lxml")
# Find the "Price" header, then get the sibling td in the data row
rows = soup.find_all("tr")[1:]  # skip header row
for row in rows:
    cells = row.find_all("td")
    name = cells[0].text.strip()
    price = cells[1].text.strip()
    print(name, price)
Useful navigation properties:
- .parent, one level up
- .children, direct children (generator)
- .next_sibling / .previous_sibling, adjacent nodes (may be a whitespace NavigableString)
- .next_element, very next node in the parse tree
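A small example of why .next_sibling can surprise you: between two tags there is often a whitespace NavigableString, which find_next_sibling skips over.

```python
from bs4 import BeautifulSoup

html = """<ul>
  <li>A</li>
  <li>B</li>
</ul>"""
soup = BeautifulSoup(html, "lxml")

first = soup.find("li")
print(repr(first.next_sibling))             # a whitespace NavigableString, e.g. '\n  '
print(first.find_next_sibling("li").text)   # B -- skips the whitespace node
print(first.parent.name)                    # ul
```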
Hands-on
Let us build a full scraper for books.toscrape.com. It paginates automatically, extracts the key fields, and stores them in a list of dicts.
import requests
from bs4 import BeautifulSoup
from dataclasses import dataclass, asdict
import time
import random
BASE_URL = "https://books.toscrape.com"
RATING_MAP = {
    "One": 1, "Two": 2, "Three": 3,
    "Four": 4, "Five": 5,
}
@dataclass
class Book:
    title: str
    price: float
    rating: int
    available: bool
    url: str
def parse_books(soup: BeautifulSoup, page_url: str) -> list[Book]:
    """Extract all books from a catalogue page."""
    books = []
    for article in soup.select("article.product_pod"):
        # Title is in the 'title' attribute of the <a> tag (full title)
        a_tag = article.select_one("h3 > a")
        title = a_tag.get("title", "").strip() if a_tag else "Unknown"
        # Price: strip the pound sign (and any mojibake) before converting
        price_el = article.select_one("p.price_color")
        try:
            price = float(price_el.text.strip().replace("£", "").replace("Â", ""))
        except (AttributeError, ValueError):
            price = 0.0
        # Rating: class="star-rating Three" -- the second class is the rating word
        rating_el = article.select_one("p.star-rating")
        classes = rating_el.get("class", []) if rating_el else []
        rating_word = classes[1] if len(classes) > 1 else "Zero"
        rating = RATING_MAP.get(rating_word, 0)
        # Availability
        avail_el = article.select_one("p.availability")
        available = "In stock" in (avail_el.text if avail_el else "")
        # Relative URL -> absolute
        href = a_tag.get("href", "") if a_tag else ""
        full_url = BASE_URL + "/catalogue/" + href.replace("../", "")
        books.append(Book(title, price, rating, available, full_url))
    return books
def scrape_all_books(max_pages: int = 5) -> list[Book]:
    """Crawl up to max_pages pages of the catalogue."""
    session = requests.Session()
    session.headers["User-Agent"] = (
        "Mozilla/5.0 (compatible; BookScraper/1.0; +https://example.com)"
    )
    all_books = []
    url = f"{BASE_URL}/catalogue/page-1.html"
    for page_num in range(1, max_pages + 1):
        time.sleep(random.uniform(1, 2))
        r = session.get(url, timeout=15)
        r.raise_for_status()
        soup = BeautifulSoup(r.content, "lxml")
        books = parse_books(soup, url)
        all_books.extend(books)
        print(f"Page {page_num}: scraped {len(books)} books (total {len(all_books)})")
        # Find the "next" button; on the last page it is absent
        next_btn = soup.select_one("li.next > a")
        if not next_btn:
            print("No more pages.")
            break
        next_href = next_btn.get("href", "")
        url = f"{BASE_URL}/catalogue/{next_href}"
    return all_books
if __name__ == "__main__":
    books = scrape_all_books(max_pages=5)
    print(f"\nTotal books scraped: {len(books)}")
    print("Sample:")
    for b in books[:3]:
        print(" ", asdict(b))
Expected output (titles and prices will match the site):
Page 1: scraped 20 books (total 20)
Page 2: scraped 20 books (total 40)
Page 3: scraped 20 books (total 60)
Page 4: scraped 20 books (total 80)
Page 5: scraped 20 books (total 100)
Total books scraped: 100
Sample:
{'title': 'A Light in the Attic', 'price': 51.77, 'rating': 3, 'available': True, ...}
Note the use of r.content (bytes) instead of r.text (string) when creating the soup. Passing bytes lets BeautifulSoup detect encoding from the HTML <meta charset> tag, which is more reliable than trusting the HTTP header.
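A minimal illustration with a made-up fragment: the bytes declare ISO-8859-1 in their <meta> tag, and passing the raw bytes lets BeautifulSoup use that declaration to decode the £ sign correctly.

```python
from bs4 import BeautifulSoup

# '£' encoded as ISO-8859-1 (single byte 0xa3), with the charset declared in <meta>
raw = (b'<html><head><meta charset="iso-8859-1"></head>'
       b'<body><p class="price_color">\xa351.77</p></body></html>')

soup = BeautifulSoup(raw, "lxml")  # bytes in: bs4 honours the declared charset
print(soup.select_one("p.price_color").text)  # £51.77
```

Had we done `raw.decode("utf-8")` ourselves first, this would raise UnicodeDecodeError, because the lone 0xa3 byte is not valid UTF-8.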
Common pitfalls
- Calling .text on None. soup.find("div", class_="missing") returns None. Chaining .text on it gives AttributeError. Always check for None before accessing attributes, or use the or idiom: (el.text if el else "").
- Brittle positional selectors. soup.find_all("td")[5] breaks the moment the site adds a column. Prefer class-based or attribute-based selectors tied to semantic meaning.
- Price character encoding. books.toscrape.com often includes a stray Â character before the £ sign when encoding goes wrong. Always strip and clean numeric fields before conversion.
- Short text vs full title. The visible text of <a> tags is often truncated. The title attribute usually has the full string. Check both.
- Whitespace in .text. Text nodes include surrounding whitespace and newlines. Always .strip() before storing or comparing.
- Assuming the next-page link is always present. On the last page, li.next does not exist. Your loop must check for None before following the link, or you get an AttributeError trying to call .get("href") on None.
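Several of these pitfalls can be handled once in small helper functions. safe_text and parse_price below are hypothetical names, a sketch of the guard, strip, and clean pattern rather than anything BeautifulSoup provides:

```python
from bs4 import BeautifulSoup

def safe_text(parent, selector: str, default: str = "") -> str:
    """Select one element and return its stripped text, or a default."""
    el = parent.select_one(selector) if parent else None
    return el.get_text(strip=True) if el else default

def parse_price(raw: str) -> float:
    """Strip currency symbols and mojibake before converting; 0.0 on failure."""
    cleaned = raw.replace("£", "").replace("Â", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return 0.0

soup = BeautifulSoup('<p class="price_color"> Â£51.77 </p>', "lxml")
print(parse_price(safe_text(soup, "p.price_color")))  # 51.77
print(safe_text(soup, "p.missing", default="N/A"))    # N/A
```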
What to try next
- Extend the scraper to also extract the genre (from the breadcrumb navigation) for each book. Hint: soup.select("ul.breadcrumb li").
- Scrape a specific genre page (e.g. https://books.toscrape.com/catalogue/category/books/mystery_3/index.html) and compare average prices across three different genres.
- Add error handling so that if any individual book's price fails to parse, the script logs a warning but continues rather than crashing.