Storing Scraped Data
What you'll build
By the end of this lesson you will have a storage layer that writes scraped data to SQLite with idempotent upserts (safe to re-run without duplicates), tracks each crawl run in a metadata table, and knows when to graduate to Postgres or Parquet. You will also build a simple deduplication system using URL hashing.
Concepts
CSV, the right first format
For quick experiments and data sharing, CSV is hard to beat. Python's csv module handles it without dependencies.
import csv

books = [
    {"title": "Fluent Python", "price": 45.99, "rating": 5},
    {"title": "Clean Code", "price": 32.50, "rating": 4},
]

# Writing
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "price", "rating"])
    writer.writeheader()
    writer.writerows(books)

# Reading back
with open("books.csv", "r", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)
Always pass newline="" when opening files for csv. Without it, the csv writer produces blank lines between rows on Windows. Always pass encoding="utf-8"; never rely on the system default.
CSV weaknesses: no types (everything is a string), no schema enforcement, appending without duplicating headers is awkward, and it breaks on commas/newlines in values unless properly quoted (the csv module handles this, but naive string concatenation does not).
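Appending is workable if you guard the header yourself. A minimal sketch, reusing the books list from above; the append_rows helper is illustrative, not part of the lesson's storage layer:

import csv
import os

def append_rows(path: str, rows: list[dict], fieldnames: list[str]) -> None:
    # Write the header only if the file does not exist yet or is empty
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)

append_rows("books.csv", books, ["title", "price", "rating"])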
JSON Lines, streaming structured data
JSON Lines (.jsonl) is one JSON object per line. It is streamable, appendable, and typed.
import json

data = [
    {"url": "https://example.com/1", "title": "Page One", "scraped_at": "2026-01-15"},
    {"url": "https://example.com/2", "title": "Page Two", "scraped_at": "2026-01-15"},
]

# Write
with open("output.jsonl", "w", encoding="utf-8") as f:
    for record in data:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read, one line at a time (memory efficient for large files)
with open("output.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line.strip())
        print(obj["title"])
Use ensure_ascii=False so non-ASCII characters (Hindi, Tamil, Chinese, etc.) are stored as-is rather than as \uXXXX escape sequences.
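A quick illustration of the difference, using the Hindi word for "book" as sample data:

import json

record = {"title": "पुस्तक"}
print(json.dumps(record))                      # {"title": "\u092a\u0941\u0938\u094d\u0924\u0915"}
print(json.dumps(record, ensure_ascii=False))  # {"title": "पुस्तक"}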
SQLite, the right choice for local persistence
SQLite is a file-based SQL database built into Python. No server to install. Perfect for scrapers running on a single machine.
import sqlite3
import hashlib
from datetime import datetime

DB_PATH = "scraped.db"

def get_url_hash(url: str) -> str:
    """Stable identifier for a URL, used as a deduplication key."""
    return hashlib.sha256(url.encode()).hexdigest()[:16]

def init_db(conn: sqlite3.Connection):
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS books (
            id TEXT PRIMARY KEY,           -- URL hash
            url TEXT UNIQUE NOT NULL,
            title TEXT NOT NULL,
            price REAL,
            rating INTEGER,
            available INTEGER,             -- 0 or 1 (SQLite has no boolean)
            scraped_at TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS crawl_runs (
            run_id INTEGER PRIMARY KEY AUTOINCREMENT,
            spider TEXT NOT NULL,
            started_at TEXT NOT NULL,
            finished_at TEXT,
            pages INTEGER DEFAULT 0,
            items INTEGER DEFAULT 0
        );
    """)
    conn.commit()

def upsert_book(conn: sqlite3.Connection, book: dict):
    """Insert or replace a book record, safe to call multiple times."""
    url_hash = get_url_hash(book["url"])
    conn.execute("""
        INSERT INTO books (id, url, title, price, rating, available, scraped_at)
        VALUES (?, ?, ?, ?, ?, ?, ?)
        ON CONFLICT(id) DO UPDATE SET
            title = excluded.title,
            price = excluded.price,
            rating = excluded.rating,
            available = excluded.available,
            scraped_at = excluded.scraped_at
    """, (
        url_hash,
        book["url"],
        book["title"],
        book.get("price"),
        book.get("rating"),
        int(book.get("available", True)),
        datetime.utcnow().isoformat(),
    ))
The ON CONFLICT ... DO UPDATE clause is an "upsert". If a book with the same id already exists, it updates the fields instead of failing. This makes your scraper safe to re-run: you will not end up with duplicate rows.
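A quick way to see the upsert in action, using an in-memory database and the functions defined above (the sample book is made up):

conn = sqlite3.connect(":memory:")
init_db(conn)

book = {"url": "https://example.com/book-1", "title": "First Title", "price": 10.0}
upsert_book(conn, book)
upsert_book(conn, {**book, "title": "Updated Title"})  # same URL, so same id
conn.commit()

print(conn.execute("SELECT COUNT(*), MAX(title) FROM books").fetchone())
# (1, 'Updated Title') -- one row, updated in place rather than duplicated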
The crawl_runs table
Always log metadata about each scrape run. You will thank yourself when your scraper breaks at 3 AM.
def start_run(conn: sqlite3.Connection, spider_name: str) -> int:
    cursor = conn.execute(
        "INSERT INTO crawl_runs (spider, started_at) VALUES (?, ?)",
        (spider_name, datetime.utcnow().isoformat()),
    )
    conn.commit()
    return cursor.lastrowid

def finish_run(conn: sqlite3.Connection, run_id: int, pages: int, items: int):
    conn.execute("""
        UPDATE crawl_runs
        SET finished_at = ?, pages = ?, items = ?
        WHERE run_id = ?
    """, (datetime.utcnow().isoformat(), pages, items, run_id))
    conn.commit()
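Wired into a crawl, the pattern looks roughly like this; scrape_all_pages is a hypothetical generator standing in for whatever fetch-and-parse loop your spider already has:

conn = sqlite3.connect(DB_PATH)
init_db(conn)

run_id = start_run(conn, "books_spider")
pages = items = 0
for page_items in scrape_all_pages():   # hypothetical: yields a list of item dicts per page
    for item in page_items:
        upsert_book(conn, item)
    conn.commit()
    pages += 1
    items += len(page_items)
finish_run(conn, run_id, pages, items)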
Deduplication with URL hashing
Before inserting an item, check if you have already scraped it, or use the upsert pattern above and let the database handle it.
def already_scraped(conn: sqlite3.Connection, url: str) -> bool:
    url_hash = get_url_hash(url)
    row = conn.execute(
        "SELECT 1 FROM books WHERE id = ?", (url_hash,)
    ).fetchone()
    return row is not None
For very large crawls (millions of URLs), a database lookup per URL is slow. Use a set in memory or a Bloom filter (the pybloom-live package) for fast membership testing.
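A minimal sketch of the in-memory approach, reusing conn and get_url_hash from above: load the existing IDs once at startup, then check membership in Python (swap the set for a Bloom filter if it no longer fits in memory):

def load_seen_ids(conn: sqlite3.Connection) -> set[str]:
    # One query at startup instead of one lookup per URL
    return {row[0] for row in conn.execute("SELECT id FROM books")}

seen = load_seen_ids(conn)

def should_fetch(url: str) -> bool:
    url_hash = get_url_hash(url)
    if url_hash in seen:
        return False
    seen.add(url_hash)
    return True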
When to use Postgres
SQLite is fine for:
- Single-machine scrapers
- Up to ~100 GB of data
- One writer at a time
Upgrade to Postgres when:
- Multiple workers write simultaneously (SQLite's WAL mode helps, but Postgres is better)
- You need full SQL power (window functions, CTEs, JSON operators)
- Data exceeds SQLite's practical limits
- You want hosted, managed, automatically-backed-up storage
The upsert syntax is nearly identical in Postgres; just swap the connection library from sqlite3 to psycopg2 or asyncpg (and, with psycopg2, the ? placeholders for %s).
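For reference, the same upsert against Postgres with psycopg2 might look like this sketch (the DSN is a placeholder, and get_url_hash is the helper defined earlier):

import psycopg2
from datetime import datetime

def upsert_book_pg(conn, book: dict) -> None:
    """Same idea as upsert_book above, but with %s placeholders and EXCLUDED."""
    with conn.cursor() as cur:
        cur.execute("""
            INSERT INTO books (id, url, title, price, rating, available, scraped_at)
            VALUES (%s, %s, %s, %s, %s, %s, %s)
            ON CONFLICT (id) DO UPDATE SET
                title = EXCLUDED.title,
                price = EXCLUDED.price,
                rating = EXCLUDED.rating,
                available = EXCLUDED.available,
                scraped_at = EXCLUDED.scraped_at
        """, (
            get_url_hash(book["url"]),
            book["url"],
            book["title"],
            book.get("price"),
            book.get("rating"),
            int(book.get("available", True)),
            datetime.utcnow().isoformat(),
        ))
    conn.commit()

# Usage (connection settings are placeholders):
# conn = psycopg2.connect("dbname=scraping user=scraper")
# upsert_book_pg(conn, {"url": "https://example.com/1", "title": "Page One"})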
When to use Parquet
Parquet is a columnar file format optimised for analytical queries. Use it when:
- You need to run aggregations on large datasets (analytics, not just lookup)
- You want to load data into a data warehouse (BigQuery, Snowflake, DuckDB)
- Your data is naturally partitioned by date
pip install pyarrow pandas
from pathlib import Path

import pandas as pd

books = [
    {"title": "Fluent Python", "price": 45.99, "rating": 5, "date": "2026-01-15"},
    {"title": "Clean Code", "price": 32.50, "rating": 4, "date": "2026-01-15"},
]
df = pd.DataFrame(books)

# Partition by date, useful when you scrape daily.
# to_parquet will not create the directory for you, so do that first.
Path("books/date=2026-01-15").mkdir(parents=True, exist_ok=True)
df.to_parquet("books/date=2026-01-15/data.parquet", index=False)
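Reading it back: pointing pandas at the top-level books/ directory should load every date=... partition and surface date as a column (pyarrow handles the partition discovery):

df_all = pd.read_parquet("books")
print(df_all[["title", "price", "date"]])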
Hands-on
Let us put it all together: a books scraper that writes to SQLite with upserts, logs crawl metadata, and prints a summary.
import sqlite3
import hashlib
import requests
import time
import random
from bs4 import BeautifulSoup
from datetime import datetime

DB_PATH = "books.db"
BASE_URL = "https://books.toscrape.com"

def get_id(url: str) -> str:
    return hashlib.sha256(url.encode()).hexdigest()[:16]

def init_db(conn: sqlite3.Connection):
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS books (
            id TEXT PRIMARY KEY,
            url TEXT UNIQUE NOT NULL,
            title TEXT,
            price REAL,
            rating INTEGER,
            available INTEGER,
            scraped_at TEXT
        );
        CREATE TABLE IF NOT EXISTS crawl_runs (
            run_id INTEGER PRIMARY KEY AUTOINCREMENT,
            started_at TEXT,
            finished_at TEXT,
            pages INTEGER DEFAULT 0,
            items INTEGER DEFAULT 0
        );
    """)
    conn.commit()

def upsert_book(conn: sqlite3.Connection, book: dict):
    conn.execute("""
        INSERT INTO books (id, url, title, price, rating, available, scraped_at)
        VALUES (:id, :url, :title, :price, :rating, :available, :scraped_at)
        ON CONFLICT(id) DO UPDATE SET
            title = excluded.title,
            price = excluded.price,
            rating = excluded.rating,
            available = excluded.available,
            scraped_at = excluded.scraped_at
    """, book)
RATING_MAP = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_books_page(html: str, base_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "lxml")
    books = []
    for article in soup.select("article.product_pod"):
        a = article.select_one("h3 > a")
        title = a.get("title", "") if a else ""
        href = a.get("href", "") if a else ""
        url = base_url + "/catalogue/" + href.replace("../", "")

        price_el = article.select_one("p.price_color")
        try:
            # Strip the pound sign (and the stray "Â" that appears if the page
            # is decoded with the wrong charset)
            price = float(price_el.text.strip().lstrip("£Â"))
        except (AttributeError, ValueError):
            price = None

        rating_el = article.select_one("p.star-rating")
        word = rating_el.get("class", ["", ""])[1] if rating_el else ""
        rating = RATING_MAP.get(word, 0)

        avail_el = article.select_one("p.availability")
        available = 1 if avail_el and "In stock" in avail_el.text else 0

        books.append({
            "id": get_id(url),
            "url": url,
            "title": title,
            "price": price,
            "rating": rating,
            "available": available,
            "scraped_at": datetime.utcnow().isoformat(),
        })
    return books
def crawl(max_pages: int = 5):
    session = requests.Session()
    session.headers["User-Agent"] = "BookScraper/1.0"

    conn = sqlite3.connect(DB_PATH)
    init_db(conn)

    cursor = conn.execute(
        "INSERT INTO crawl_runs (started_at) VALUES (?)",
        (datetime.utcnow().isoformat(),)
    )
    run_id = cursor.lastrowid
    conn.commit()

    total_pages = 0
    total_items = 0
    url = f"{BASE_URL}/catalogue/page-1.html"

    for page_num in range(1, max_pages + 1):
        time.sleep(random.uniform(1, 2))
        r = session.get(url, timeout=15)
        r.raise_for_status()

        books = parse_books_page(r.text, BASE_URL)
        for book in books:
            upsert_book(conn, book)
        conn.commit()

        total_pages += 1
        total_items += len(books)
        print(f"Page {page_num}: {len(books)} books stored (total {total_items})")

        soup = BeautifulSoup(r.text, "lxml")
        next_btn = soup.select_one("li.next > a")
        if not next_btn:
            break
        url = f"{BASE_URL}/catalogue/{next_btn.get('href', '')}"

    conn.execute("""
        UPDATE crawl_runs
        SET finished_at = ?, pages = ?, items = ?
        WHERE run_id = ?
    """, (datetime.utcnow().isoformat(), total_pages, total_items, run_id))
    conn.commit()

    # Summary query
    avg_price = conn.execute("SELECT AVG(price) FROM books").fetchone()[0]
    total_in_db = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
    print(f"\nDatabase summary: {total_in_db} books total, avg price £{avg_price:.2f}")

    conn.close()

if __name__ == "__main__":
    crawl(max_pages=5)
Run it twice: you will see the same books updated, not duplicated.
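A quick way to confirm it from a separate Python session (or the sqlite3 CLI): the crawl_runs table gains one row per run while the books count stays put.

import sqlite3

conn = sqlite3.connect("books.db")
print(conn.execute("SELECT COUNT(*) FROM crawl_runs").fetchone()[0])  # one row per run
print(conn.execute("SELECT COUNT(*) FROM books").fetchone()[0])       # same total after each run
conn.close()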
Common pitfalls
- Forgetting conn.commit(). SQLite in Python defaults to manual transaction management. If you do not commit, data is not written to disk and is lost when the connection closes. Alternatively, use the connection as a context manager (with sqlite3.connect(DB_PATH) as conn:), which commits for you when the block succeeds.
- Opening the CSV file in append mode without checking for an existing header. If you open in "a" mode and write the header unconditionally, a fresh header row lands in the file every time you restart the script. Check whether the file is empty before writing the header (see the append sketch in the CSV section above).
- Using the raw URL as a primary key without normalisation. URLs can have trailing slashes, http vs https, query strings in different orders. Hash the canonical URL to get a stable, compact primary key (see the sketch after this list).
- Not handling None values before SQL insertion. Python None maps to SQL NULL, which is correct. But if you do arithmetic on None (e.g. price * 0.9) before inserting, you get a Python TypeError. Sanitise data before the storage layer.
- Storing everything in one giant table. A books table and a separate authors table with a foreign key is better than a wide books table with repeated author data. Normalise early; denormalising for analytics is easy, normalising a mess later is not.
- Not partitioning by date in Parquet. Writing all data to one Parquet file means re-writing the entire file on every run. Partition by date (/date=YYYY-MM-DD/) so each day's data is in its own file and old data is never touched.
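For the URL-normalisation pitfall, here is a sketch of canonicalising before hashing; the exact rules are up to you, and forcing https here is an assumption about the target site:

import hashlib
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def canonicalise(url: str) -> str:
    parts = urlsplit(url.strip())
    scheme = "https"                                   # assume the site is https-only
    netloc = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"               # drop trailing slash
    query = urlencode(sorted(parse_qsl(parts.query)))  # sort query parameters
    return urlunsplit((scheme, netloc, path, query, ""))

def get_url_hash(url: str) -> str:
    # Drop-in replacement for the earlier helper: hash the canonical form
    return hashlib.sha256(canonicalise(url).encode()).hexdigest()[:16]

print(get_url_hash("http://example.com/page/?b=2&a=1"))
print(get_url_hash("https://example.com/page?a=1&b=2"))  # same hash as above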
What to try next
- Add a SELECT query that returns the top 5 highest-rated books with a price under £20. Print them formatted as a table using Python's tabulate library.
- Export the SQLite data to Parquet using pandas and pyarrow. Query the Parquet file with DuckDB (pip install duckdb) using SQL: SELECT AVG(price), rating FROM 'books/*.parquet' GROUP BY rating ORDER BY rating.
- Add a last_modified column to the books table. Update it on every upsert. Write a query that finds books whose price changed between the first and most recent scrape.