
Scrapy Fundamentals


What you'll build

By the end of this lesson you will have a complete Scrapy project that crawls every page of quotes.toscrape.com, extracts quotes, authors, and tags, stores them via a pipeline, and respects the server with configurable throttling. You will also understand when Scrapy is the right tool, and when it is not.

Concepts

When Scrapy beats requests + BeautifulSoup

requests + BeautifulSoup is fine for one-off scripts and small crawls. Scrapy earns its place when you need any of these:

  • Concurrent crawling: Scrapy uses Twisted (an async event loop) internally and fires multiple requests in parallel without threading complexity.
  • Automatic throttling: AutoThrottle adjusts download delays and concurrency based on server response times.
  • Built-in deduplication: Scrapy tracks visited URLs and will not re-request them.
  • Middleware / pipeline architecture: clean separation between downloading, parsing, and storing data.
  • Command-line tooling: scrapy crawl, scrapy shell, scrapy check, scrapy bench.

Use requests + BeautifulSoup for: quick scripts, pages requiring complex session management, or when you are already deep in Python async code with httpx. Use Scrapy for: production crawlers, anything over 50 pages, or anything needing pipelines and scheduling.

Project layout

pip install scrapy
scrapy startproject quotesbot

This creates:

quotesbot/
  scrapy.cfg
  quotesbot/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
      __init__.py
      quotes_spider.py   ← you write this

The spider class lives in spiders/. items.py defines data schemas. pipelines.py processes scraped items. settings.py controls all behaviour.
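
You can also let Scrapy generate a spider skeleton instead of writing the file by hand; the generated file is named after the spider (quotes.py), so rename it if you want to match the layout above:

scrapy genspider quotes quotes.toscrape.com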

Spider anatomy

A spider is a Python class with a parse method and a start_urls list. Scrapy calls parse with a Response object for each starting URL.

# quotesbot/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"                           # unique spider name
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract data from the current page
        for quote in response.css("div.quote"):
            yield {
                "text":   quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags":   quote.css("a.tag::text").getall(),
            }

        # Follow the "Next" link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Key points:

  • response.css("selector::text"): Scrapy's built-in CSS selectors; ::text selects text nodes
  • .get(): returns the first match, or None if nothing matched
  • .getall(): returns all matches as a list
  • yield {dict}: yields a dict as a scraped item
  • response.follow(url, callback): queues the next request, resolving relative URLs against the current page
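
To get a feel for .get() versus .getall(), try them in the Scrapy shell (the values shown are what the site's first page returned at the time of writing):

scrapy shell "https://quotes.toscrape.com/"
# In the shell:
# response.css("small.author::text").get()         -> 'Albert Einstein'
# response.css("small.author::text").getall()[:2]  -> ['Albert Einstein', 'J.K. Rowling']
# response.css("small.missing::text").get()        -> None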

Run it:

scrapy crawl quotes -o quotes.json

Scrapy writes each yielded item to quotes.json automatically.
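
Note that -o appends to an existing file, which will corrupt a plain .json feed across repeated runs; a few useful variants (-O requires Scrapy 2.1 or newer):

scrapy crawl quotes -O quotes.json    # overwrite instead of append
scrapy crawl quotes -o quotes.jsonl   # JSON Lines: one item per line, safe to append
scrapy crawl quotes -o quotes.csv     # CSV via the built-in feed exporters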

Items and ItemLoader

Dict yields are fine for prototyping. For production, define an Item class so you have a schema.

# quotesbot/items.py
import scrapy

class QuoteItem(scrapy.Item):
    text   = scrapy.Field()
    author = scrapy.Field()
    tags   = scrapy.Field()
    url    = scrapy.Field()

In the spider, use the Item instead of a dict:

from quotesbot.items import QuoteItem

def parse(self, response):
    for quote in response.css("div.quote"):
        item = QuoteItem()
        item["text"]   = quote.css("span.text::text").get("").strip()
        item["author"] = quote.css("small.author::text").get("").strip()
        item["tags"]   = quote.css("a.tag::text").getall()
        item["url"]    = response.url
        yield item

ItemLoader is an optional layer that applies input/output processors (stripping whitespace, joining lists, etc.) declaratively. For large projects it keeps parsing logic clean.
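
Here is a minimal sketch of the idea, assuming the QuoteItem above; the processor choices are illustrative, and in recent Scrapy versions the processors live in the bundled itemloaders package:

# quotesbot/loaders.py (optional sketch, not part of the project files)
from itemloaders.processors import Identity, MapCompose, TakeFirst
from scrapy.loader import ItemLoader
from quotesbot.items import QuoteItem

class QuoteLoader(ItemLoader):
    default_item_class = QuoteItem
    default_output_processor = TakeFirst()   # scalar fields keep the first match
    text_in = MapCompose(str.strip)          # strip whitespace on the way in
    author_in = MapCompose(str.strip)
    tags_out = Identity()                    # keep tags as a list

# In the spider's parse():
#     loader = QuoteLoader(selector=quote)
#     loader.add_css("text", "span.text::text")
#     loader.add_css("author", "small.author::text")
#     loader.add_css("tags", "a.tag::text")
#     yield loader.load_item()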

Pipelines

Pipelines process each item after it leaves the spider. A pipeline is a class with a process_item method.

# quotesbot/pipelines.py
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open("quotes_output.jsonl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item   # always return item or raise DropItem

Enable it in settings.py:

ITEM_PIPELINES = {
    "quotesbot.pipelines.JsonWriterPipeline": 300,  # priority 1-1000
}

Lower number = runs earlier. You can chain multiple pipelines, for example a validation or deduplication pipeline at priority 100 ahead of a storage pipeline at 300, as sketched below.
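
As a concrete sketch of such chaining, here is a hypothetical validation pipeline that drops incomplete items before they reach the writer (the class and its name are illustrative, not part of the project files):

# quotesbot/pipelines.py (hypothetical addition)
from scrapy.exceptions import DropItem

class RequiredFieldsPipeline:
    def process_item(self, item, spider):
        # Drop items missing required fields; later pipelines never see them
        if not item.get("text") or not item.get("author"):
            raise DropItem(f"Missing text or author: {item!r}")
        return item

Both pipelines are then listed together in settings.py:

ITEM_PIPELINES = {
    "quotesbot.pipelines.RequiredFieldsPipeline": 100,  # runs first
    "quotesbot.pipelines.JsonWriterPipeline": 300,      # runs second
}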

Key settings

# quotesbot/settings.py

BOT_NAME = "quotesbot"
USER_AGENT = "quotesbot/1.0 (+https://your-site.com)"

ROBOTSTXT_OBEY = True          # respect robots.txt, leave this True

CONCURRENT_REQUESTS = 4        # simultaneous requests (default 16 is too aggressive)
DOWNLOAD_DELAY = 1             # minimum seconds between requests per domain

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests to send to the server

HTTPCACHE_ENABLED = True       # cache responses during development
HTTPCACHE_EXPIRATION_SECS = 3600

AUTOTHROTTLE is Scrapy's killer feature for being a polite crawler. It measures server latency and automatically slows down when the server is struggling.
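
To watch those decisions as they happen, enable AutoThrottle's debug output for a single run:

scrapy crawl quotes -s AUTOTHROTTLE_DEBUG=True
# Logs throttling stats (latency and chosen delay) for every response received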

Hands-on

Here is the complete project. Create the directory structure, copy each file (reusing pipelines.py from the Pipelines section above), and run the spider.

scrapy startproject quotesbot
cd quotesbot

quotesbot/items.py:

import scrapy

class QuoteItem(scrapy.Item):
    text   = scrapy.Field()
    author = scrapy.Field()
    tags   = scrapy.Field()
    page   = scrapy.Field()

quotesbot/spiders/quotes_spider.py:

import scrapy
from quotesbot.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        page = response.url.rstrip("/").split("/")[-1]
        page_num = int(page) if page.isdigit() else 1

        for quote_div in response.css("div.quote"):
            item = QuoteItem()
            item["text"]   = quote_div.css("span.text::text").get("").strip()
            item["author"] = quote_div.css("small.author::text").get("").strip()
            item["tags"]   = quote_div.css("a.tag::text").getall()
            item["page"]   = page_num
            yield item

        self.logger.info(
            f"Scraped page {page_num}, {len(response.css('div.quote'))} quotes"
        )

        next_href = response.css("li.next a::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

quotesbot/settings.py (add/change these values):

BOT_NAME = "quotesbot"
SPIDER_MODULES = ["quotesbot.spiders"]
NEWSPIDER_MODULE = "quotesbot.spiders"

USER_AGENT = "quotesbot/1.0"
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 2
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
HTTPCACHE_ENABLED = True
ITEM_PIPELINES = {
    "quotesbot.pipelines.JsonWriterPipeline": 300,   # defined in the Pipelines section
}

Run and save to JSON Lines:

scrapy crawl quotes -o quotes.jsonl

Check the output:

head -2 quotes.jsonl
{"text": "“The world as we have created it is a process of our thinking.”", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"], "page": 1}
{"text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "author": "J.K. Rowling", "tags": ["abilities", "choices"], "page": 1}

Use scrapy shell for interactive exploration; it is indispensable for figuring out selectors:

scrapy shell "https://quotes.toscrape.com/"
# In the shell:
# response.css("div.quote span.text::text").getall()[:3]

Common pitfalls

  • Setting CONCURRENT_REQUESTS = 16 (the default) in production. The default favours throughput, not politeness. For real sites start at 2-4 and increase only if the server handles it.

  • Forgetting return item in a pipeline. If process_item does not return the item (or raise DropItem), the item silently disappears. Subsequent pipelines never see it.

  • Using relative URLs directly. response.follow() handles relative URLs correctly. If you build absolute URLs yourself by string concatenation, you will mangle URLs that start with // or ./ (see the sketch after this list).

  • Enabling HTTPCACHE in production. Caching is great during development but should be disabled in production crawls where you want fresh data. Set HTTPCACHE_ENABLED = False or pass -s HTTPCACHE_ENABLED=0 on the command line.

  • Turning off ROBOTSTXT_OBEY. Projects generated by scrapy startproject enable it by default, but some tutorials tell you to switch it off. Do not turn it off unless you have a specific, justified reason.

  • Spider name conflicts. Every spider in a project must have a unique name attribute. Running scrapy crawl with the wrong name gives a KeyError that looks confusing the first time.
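
To see the relative-URL pitfall concretely, here is a small sketch (paths chosen for illustration) comparing proper URL joining, roughly what response.follow() and response.urljoin() do for you, with naive string concatenation:

from urllib.parse import urljoin

base = "https://quotes.toscrape.com/page/1/"

# Correct resolution:
urljoin(base, "/page/2/")             # 'https://quotes.toscrape.com/page/2/'
urljoin(base, "//cdn.example.com/x")  # 'https://cdn.example.com/x' (scheme-relative)

# Naive concatenation mangles both:
base + "/page/2/"                     # 'https://quotes.toscrape.com/page/1//page/2/'
base + "//cdn.example.com/x"          # 'https://quotes.toscrape.com/page/1///cdn.example.com/x'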

What to try next

  1. Add an AuthorItem and a second parse method (parse_author) that follows each author's profile link (/author/<name>) and extracts their bio and birth date. Use yield response.follow(author_link, callback=self.parse_author).

  2. Write a DeduplicationPipeline that drops any QuoteItem whose text has already been seen (using a set). Chain it at priority 100, before the output pipeline at 300.

  3. Export the crawl data to SQLite using a pipeline that inserts rows into a quotes table. Use sqlite3 from the standard library.
