Lesson 8 of 10
9 min read

Storing Scraped Data


What you'll build

By the end of this lesson you will have a storage layer that writes scraped data to SQLite with idempotent upserts (safe to re-run without duplicates), tracks each crawl run in a metadata table, and knows when to graduate to Postgres or Parquet. You will also build a simple deduplication system using URL hashing.

Concepts

CSV, the right first format

For quick experiments and data sharing, CSV is hard to beat. Python's csv module handles it without dependencies.

import csv

books = [
    {"title": "Fluent Python", "price": 45.99, "rating": 5},
    {"title": "Clean Code",    "price": 32.50, "rating": 4},
]

# Writing
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "rating"])
    writer.writeheader()
    writer.writerows(books)

# Reading back
with open("books.csv", "r", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)

Always pass newline="" when opening files for csv. Without it, the CSV writer produces an extra blank line after every row on Windows. Always pass encoding="utf-8"; never rely on the system default.

CSV weaknesses: no types (everything is a string), no schema enforcement, appending without duplicating headers is awkward, and it breaks on commas/newlines in values unless properly quoted (the csv module handles this, but naive string concatenation does not).
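The quoting behaviour is worth seeing once: a value containing commas, quotes and a newline survives a round trip because DictWriter quotes it automatically.

```python
import csv
import io

# A value containing a comma, quotes and a newline: the csv module
# quotes and escapes it, so the row survives a round trip intact.
row = {"title": 'Hello, "World"\nSecond line', "price": "9.99"}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "price"])
writer.writeheader()
writer.writerow(row)

buf.seek(0)
read_back = next(csv.DictReader(buf))
assert read_back == row  # commas, quotes and newlines preserved
```

Note that read_back["price"] is the string "9.99", not a float: this is the "no types" weakness in action.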

JSON Lines, streaming structured data

JSON Lines (.jsonl) is one JSON object per line. It is streamable, appendable, and typed.

import json

data = [
    {"url": "https://example.com/1", "title": "Page One", "scraped_at": "2026-01-15"},
    {"url": "https://example.com/2", "title": "Page Two", "scraped_at": "2026-01-15"},
]

# Write
with open("output.jsonl", "w", encoding="utf-8") as f:
    for record in data:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read, one line at a time (memory efficient for large files)
with open("output.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line.strip())
        print(obj["title"])

Use ensure_ascii=False so non-ASCII characters (Hindi, Tamil, Chinese, etc.) are stored as-is rather than as \uXXXX escape sequences.
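A defensive reader is a small extension of the loop above. The iter_jsonl helper below is a sketch that skips blank lines and the half-written record a crashed run can leave at the end of the file:

```python
import json

def iter_jsonl(path):
    """Yield records from a JSON Lines file, skipping blank or corrupt lines."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # e.g. a half-written last line from a crashed run

# Demo: one valid record, one blank line, one truncated record
with open("demo.jsonl", "w", encoding="utf-8") as f:
    f.write('{"title": "Page One"}\n')
    f.write('\n')
    f.write('{"title": "Page T')  # truncated mid-write

records = list(iter_jsonl("demo.jsonl"))
print(len(records))  # 1
```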

SQLite, the right choice for local persistence

SQLite is a file-based SQL database built into Python. No server to install. Perfect for scrapers running on a single machine.

import sqlite3
import hashlib
from datetime import datetime

DB_PATH = "scraped.db"

def get_url_hash(url: str) -> str:
    """Stable identifier for a URL, used as a deduplication key."""
    return hashlib.sha256(url.encode()).hexdigest()[:16]

def init_db(conn: sqlite3.Connection):
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS books (
            id          TEXT PRIMARY KEY,   -- URL hash
            url         TEXT UNIQUE NOT NULL,
            title       TEXT NOT NULL,
            price       REAL,
            rating      INTEGER,
            available   INTEGER,           -- 0 or 1 (SQLite has no boolean)
            scraped_at  TEXT NOT NULL
        );

        CREATE TABLE IF NOT EXISTS crawl_runs (
            run_id      INTEGER PRIMARY KEY AUTOINCREMENT,
            spider      TEXT NOT NULL,
            started_at  TEXT NOT NULL,
            finished_at TEXT,
            pages       INTEGER DEFAULT 0,
            items       INTEGER DEFAULT 0
        );
    """)
    conn.commit()

def upsert_book(conn: sqlite3.Connection, book: dict):
    """Insert or replace a book record, safe to call multiple times."""
    url_hash = get_url_hash(book["url"])
    conn.execute("""
        INSERT INTO books (id, url, title, price, rating, available, scraped_at)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            title      = excluded.title,
            price      = excluded.price,
            rating     = excluded.rating,
            available  = excluded.available,
            scraped_at = excluded.scraped_at
    """, (
        url_hash,
        book["url"],
        book["title"],
        book.get("price"),
        book.get("rating"),
        int(book.get("available", True)),
        datetime.utcnow().isoformat(),
    ))

The ON CONFLICT ... DO UPDATE clause is an "upsert". If a book with the same id already exists, it updates the fields instead of failing. This makes your scraper safe to re-run: you will not end up with duplicate rows.
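As a quick sanity check, the upsert can be exercised against an in-memory database (schema trimmed to three columns for brevity):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE books (
        id    TEXT PRIMARY KEY,
        title TEXT,
        price REAL
    )
""")

def upsert(conn, book):
    conn.execute("""
        INSERT INTO books (id, title, price)
        VALUES (:id, :title, :price)
        ON CONFLICT(id) DO UPDATE SET
            title = excluded.title,
            price = excluded.price
    """, book)

upsert(conn, {"id": "abc", "title": "Fluent Python", "price": 45.99})
upsert(conn, {"id": "abc", "title": "Fluent Python", "price": 39.99})  # price drop

count, price = conn.execute("SELECT COUNT(*), MAX(price) FROM books").fetchone()
print(count, price)  # 1 39.99 -- still one row, price updated in place
```

(ON CONFLICT ... DO UPDATE requires SQLite 3.24+, which ships with Python 3.7 and later.)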

The crawl_runs table

Always log metadata about each scrape run. You will thank yourself when your scraper breaks at 3 AM.

def start_run(conn: sqlite3.Connection, spider_name: str) -> int:
    cursor = conn.execute(
        "INSERT INTO crawl_runs (spider, started_at) VALUES (?, ?)",
        (spider_name, datetime.utcnow().isoformat()),
    )
    conn.commit()
    return cursor.lastrowid

def finish_run(conn: sqlite3.Connection, run_id: int, pages: int, items: int):
    conn.execute("""
        UPDATE crawl_runs
        SET finished_at = ?, pages = ?, items = ?
        WHERE run_id = ?
    """, (datetime.utcnow().isoformat(), pages, items, run_id))
    conn.commit()

Deduplication with URL hashing

Before inserting an item, check if you have already scraped it, or use the upsert pattern above and let the database handle it.

def already_scraped(conn: sqlite3.Connection, url: str) -> bool:
    url_hash = get_url_hash(url)
    row = conn.execute(
        "SELECT 1 FROM books WHERE id = ?", (url_hash,)
    ).fetchone()
    return row is not None

For very large crawls (millions of URLs), a database lookup per URL is slow. Use a set in memory or a Bloom filter (the pybloom-live package) for fast membership testing.
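A minimal sketch of the in-memory approach: load the existing hashes once at startup, then check membership in a plain set. Table and helper names here are illustrative.

```python
import hashlib
import sqlite3

def get_url_hash(url: str) -> str:
    return hashlib.sha256(url.encode()).hexdigest()[:16]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id TEXT PRIMARY KEY, url TEXT)")
conn.execute(
    "INSERT INTO books VALUES (?, ?)",
    (get_url_hash("https://example.com/1"), "https://example.com/1"),
)

# One query at startup, then O(1) membership checks in memory.
seen = {row[0] for row in conn.execute("SELECT id FROM books")}

def already_scraped(url: str) -> bool:
    return get_url_hash(url) in seen

print(already_scraped("https://example.com/1"))  # True
print(already_scraped("https://example.com/2"))  # False
```

Remember to add each new hash to the set as you insert, so the check stays current within a run.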

When to use Postgres

SQLite is fine for:

  • Single-machine scrapers
  • Up to ~100 GB of data
  • One writer at a time

Upgrade to Postgres when:

  • Multiple workers write simultaneously (SQLite's WAL mode helps, but Postgres is better)
  • You need full SQL power (window functions, CTEs, JSON operators)
  • Data exceeds SQLite's practical limits
  • You want hosted, managed, automatically-backed-up storage

The upsert syntax is nearly identical in Postgres; just change the connection library from sqlite3 to psycopg2 or asyncpg.

When to use Parquet

Parquet is a columnar file format optimised for analytical queries. Use it when:

  • You need to run aggregations on large datasets (analytics, not just lookup)
  • You want to load data into a data warehouse (BigQuery, Snowflake, DuckDB)
  • Your data is naturally partitioned by date

pip install pyarrow pandas

import pandas as pd

books = [
    {"title": "Fluent Python", "price": 45.99, "rating": 5, "date": "2026-01-15"},
    {"title": "Clean Code",    "price": 32.50, "rating": 4, "date": "2026-01-15"},
]

df = pd.DataFrame(books)
# Partition by date, useful when you scrape daily
df.to_parquet("books/date=2026-01-15/data.parquet", index=False)

Hands-on

Let us put it all together: a books scraper that writes to SQLite with upserts, logs crawl metadata, and prints a summary.

import sqlite3
import hashlib
import requests
import time
import random
from bs4 import BeautifulSoup
from datetime import datetime

DB_PATH = "books.db"
BASE_URL = "https://books.toscrape.com"


def get_id(url: str) -> str:
    return hashlib.sha256(url.encode()).hexdigest()[:16]


def init_db(conn: sqlite3.Connection):
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS books (
            id         TEXT PRIMARY KEY,
            url        TEXT UNIQUE NOT NULL,
            title      TEXT,
            price      REAL,
            rating     INTEGER,
            available  INTEGER,
            scraped_at TEXT
        );
        CREATE TABLE IF NOT EXISTS crawl_runs (
            run_id      INTEGER PRIMARY KEY AUTOINCREMENT,
            started_at  TEXT,
            finished_at TEXT,
            pages       INTEGER DEFAULT 0,
            items       INTEGER DEFAULT 0
        );
    """)
    conn.commit()


def upsert_book(conn: sqlite3.Connection, book: dict):
    conn.execute("""
        INSERT INTO books (id, url, title, price, rating, available, scraped_at)
        VALUES (:id, :url, :title, :price, :rating, :available, :scraped_at)
        ON CONFLICT(id) DO UPDATE SET
            title      = excluded.title,
            price      = excluded.price,
            rating     = excluded.rating,
            available  = excluded.available,
            scraped_at = excluded.scraped_at
    """, book)


RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_books_page(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    books = []
    for article in soup.select("article.product_pod"):
        a = article.select_one("h3 > a")
        title = a.get("title", "") if a else ""
        href  = a.get("href", "") if a else ""
        url   = base_url + "/catalogue/" + href.replace("../", "")

        price_el = article.select_one("p.price_color")
        try:
            # Strip the pound sign, plus the stray "Â" that appears when
            # the page's UTF-8 bytes are mis-decoded as Latin-1
            price = float(price_el.text.strip().lstrip("£Â"))
        except (AttributeError, ValueError):
            price = None

        rating_el = article.select_one("p.star-rating")
        word = rating_el.get("class", ["", ""])[1] if rating_el else ""
        rating = RATING_MAP.get(word, 0)

        avail_el = article.select_one("p.availability")
        available = 1 if avail_el and "In stock" in avail_el.text else 0

        books.append({
            "id":        get_id(url),
            "url":       url,
            "title":     title,
            "price":     price,
            "rating":    rating,
            "available": available,
            "scraped_at": datetime.utcnow().isoformat(),
        })
    return books


def crawl(max_pages: int = 5):
    session = requests.Session()
    session.headers["User-Agent"] = "BookScraper/1.0"

    conn = sqlite3.connect(DB_PATH)
    init_db(conn)

    cursor = conn.execute(
        "INSERT INTO crawl_runs (started_at) VALUES (?)",
        (datetime.utcnow().isoformat(),)
    )
    run_id = cursor.lastrowid
    conn.commit()

    total_pages = 0
    total_items = 0
    url = f"{BASE_URL}/catalogue/page-1.html"

    for page_num in range(1, max_pages + 1):
        time.sleep(random.uniform(1, 2))
        r = session.get(url, timeout=15)
        r.raise_for_status()

        books = parse_books_page(r.text, BASE_URL)
        for book in books:
            upsert_book(conn, book)
        conn.commit()

        total_pages += 1
        total_items += len(books)
        print(f"Page {page_num}: {len(books)} books stored (total {total_items})")

        soup = BeautifulSoup(r.text, "lxml")
        next_btn = soup.select_one("li.next > a")
        if not next_btn:
            break
        url = f"{BASE_URL}/catalogue/{next_btn.get('href', '')}"

    conn.execute("""
        UPDATE crawl_runs
        SET finished_at = ?, pages = ?, items = ?
        WHERE run_id = ?
    """, (datetime.utcnow().isoformat(), total_pages, total_items, run_id))
    conn.commit()

    # Summary query
    avg_price = conn.execute("SELECT AVG(price) FROM books").fetchone()[0]
    total_in_db = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
    print(f"\nDatabase summary: {total_in_db} books total, avg price £{avg_price:.2f}")

    conn.close()


if __name__ == "__main__":
    crawl(max_pages=5)

Run it twice: you will see the same books updated, not duplicated.

Common pitfalls

  • Forgetting conn.commit(). SQLite in Python defaults to manual transaction management. If you do not commit, data is not written to disk and is lost when the connection closes. Alternatively, use the connection as a context manager (with sqlite3.connect(DB_PATH) as conn:), which commits on success and rolls back on error; note that it does not close the connection.

  • Appending to a CSV without checking for an existing header. If you open in "a" mode and unconditionally call writeheader(), the header row gets written every time you restart the script. Write the header only when the file is missing or empty.
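One way to handle this, sketched with a hypothetical append_rows helper: write the header only when the file does not yet exist or is empty.

```python
import csv
import os
import tempfile

def append_rows(path, fieldnames, rows):
    """Append rows to a CSV file, writing the header only for a new/empty file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)

path = os.path.join(tempfile.mkdtemp(), "books.csv")
append_rows(path, ["title"], [{"title": "Run one"}])
append_rows(path, ["title"], [{"title": "Run two"}])  # no second header

with open(path, newline="", encoding="utf-8") as f:
    lines = f.read().splitlines()
print(lines)  # ['title', 'Run one', 'Run two']
```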

  • Using raw URLs as primary keys without normalisation. URLs can have trailing slashes, http vs https, query strings in different orders. Hash the canonical URL to get a stable, compact primary key.
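One possible canonicaliser, using only urllib.parse. The exact rules here (lowercase host, sorted query parameters, stripped trailing slash, http upgraded to https) are illustrative; pick rules that match how your spider generates URLs.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def canonicalize(url: str) -> str:
    """Normalise a URL so equivalent variants hash to the same key."""
    parts = urlsplit(url)
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    query = urlencode(sorted(parse_qsl(parts.query)))  # stable param order
    return urlunsplit((scheme, host, path, query, ""))

print(canonicalize("HTTP://Example.com/books/?b=2&a=1"))
print(canonicalize("https://example.com/books?a=1&b=2"))
# both normalise to https://example.com/books?a=1&b=2
```

Hash the output of canonicalize() rather than the raw URL, and both variants dedupe to the same row.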

  • Not handling None values before SQL insertion. Python None maps to SQL NULL, which is correct. But if you do arithmetic on None (e.g. price * 0.9) before inserting, you get a Python TypeError. Sanitise data before the storage layer.

  • Storing everything in one giant table. A books table and a separate authors table with a foreign key is better than a wide books table with repeated author data. Normalise early; denormalising for analytics is easy, normalising a mess later is not.

  • Not partitioning by date in Parquet. Writing all data to one Parquet file means re-writing the entire file on every run. Partition by date (/date=YYYY-MM-DD/) so each day's data is in its own file and old data is never touched.

What to try next

  1. Add a SELECT query that returns the top 5 highest-rated books with price under £20. Print them formatted as a table using Python's tabulate library.

  2. Export the SQLite data to Parquet using pandas and pyarrow. Query the Parquet file with DuckDB (pip install duckdb) using SQL: SELECT AVG(price), rating FROM 'books/*.parquet' GROUP BY rating ORDER BY rating.

  3. Add a last_modified column to the books table. Update it on every upsert. Write a query that finds books whose price changed between the first and most recent scrape.
