
Scrapy Fundamentals


What you'll build

By the end of this lesson you will have a complete Scrapy project that crawls every page of quotes.toscrape.com, extracts quotes, authors, and tags, stores them via a pipeline, and respects the server with configurable throttling. You will also understand when Scrapy is the right tool, and when it is not.

Concepts

When Scrapy beats requests + BeautifulSoup

requests + BeautifulSoup is fine for one-off scripts and small crawls. Scrapy earns its place when you need any of these:

  • Concurrent crawling: Scrapy uses Twisted (an async event loop) internally and fires multiple requests in parallel without threading complexity.
  • Automatic throttling: AutoThrottle adjusts download delays and concurrency based on server response times.
  • Built-in deduplication: Scrapy tracks visited URLs and will not re-request them.
  • Middleware / pipeline architecture: clean separation between downloading, parsing, and storing data.
  • Command-line tooling: scrapy crawl, scrapy shell, scrapy check, scrapy bench.

Use requests + BeautifulSoup for: quick scripts, pages requiring complex session management, or when you are already deep in Python async code with httpx. Use Scrapy for: production crawlers, anything over 50 pages, or anything needing pipelines and scheduling.

Project layout

pip install scrapy
scrapy startproject quotesbot

This creates:

quotesbot/
  scrapy.cfg
  quotesbot/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
      __init__.py
      quotes_spider.py   ← you write this

The spider class lives in spiders/. items.py defines data schemas. pipelines.py processes scraped items. settings.py controls all behaviour.
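
You can also let Scrapy generate a spider skeleton instead of writing the file by hand; the generated file is named after the spider (quotes.py), so rename it if you want to match the layout above:

scrapy genspider quotes quotes.toscrape.com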

Spider anatomy

A spider is a Python class with a parse method and a start_urls list. Scrapy calls parse with a Response object for each starting URL.

# quotesbot/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"                           # unique spider name
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract data from the current page
        for quote in response.css("div.quote"):
            yield {
                "text":   quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags":   quote.css("a.tag::text").getall(),
            }

        # Follow the "Next" link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Key points:

  • response.css("selector::text"): Scrapy's built-in CSS selectors; ::text selects text nodes
  • .get(): returns the first match, or None if nothing matched
  • .getall(): returns all matches as a list
  • yield {dict}: yields a dict as a scraped item
  • response.follow(url, callback): queues the next request, resolving relative URLs against the current page
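
To get a feel for .get() versus .getall(), try them in the Scrapy shell (the values shown are what the site's first page returned at the time of writing):

scrapy shell "https://quotes.toscrape.com/"
# In the shell:
# response.css("small.author::text").get()         -> 'Albert Einstein'
# response.css("small.author::text").getall()[:2]  -> ['Albert Einstein', 'J.K. Rowling']
# response.css("small.missing::text").get()        -> None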

Run it:

scrapy crawl quotes -o quotes.json

Scrapy writes each yielded item to quotes.json automatically.
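
Note that -o appends to an existing file, which will corrupt a plain .json feed across repeated runs; a few useful variants (-O requires Scrapy 2.1 or newer):

scrapy crawl quotes -O quotes.json    # overwrite instead of append
scrapy crawl quotes -o quotes.jsonl   # JSON Lines: one item per line, safe to append
scrapy crawl quotes -o quotes.csv     # CSV via the built-in feed exporters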

Items and ItemLoader

Dict yields are fine for prototyping. For production, define an Item class so you have a schema.

# quotesbot/items.py
import scrapy

class QuoteItem(scrapy.Item):
    text   = scrapy.Field()
    author = scrapy.Field()
    tags   = scrapy.Field()
    url    = scrapy.Field()

In the spider, use the Item instead of a dict:

from quotesbot.items import QuoteItem

def parse(self, response):
    for quote in response.css("div.quote"):
        item = QuoteItem()
        item["text"]   = quote.css("span.text::text").get("").strip()
        item["author"] = quote.css("small.author::text").get("").strip()
        item["tags"]   = quote.css("a.tag::text").getall()
        item["url"]    = response.url
        yield item

ItemLoader is an optional layer that applies input/output processors (stripping whitespace, joining lists, etc.) declaratively. For large projects it keeps parsing logic clean.
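
Here is a minimal sketch of the idea, assuming the QuoteItem above; the processor choices are illustrative, and in recent Scrapy versions the processors live in the bundled itemloaders package:

# quotesbot/loaders.py (optional sketch, not part of the project files)
from itemloaders.processors import Identity, MapCompose, TakeFirst
from scrapy.loader import ItemLoader
from quotesbot.items import QuoteItem

class QuoteLoader(ItemLoader):
    default_item_class = QuoteItem
    default_output_processor = TakeFirst()   # scalar fields keep the first match
    text_in = MapCompose(str.strip)          # strip whitespace on the way in
    author_in = MapCompose(str.strip)
    tags_out = Identity()                    # keep tags as a list

# In the spider's parse():
#     loader = QuoteLoader(selector=quote)
#     loader.add_css("text", "span.text::text")
#     loader.add_css("author", "small.author::text")
#     loader.add_css("tags", "a.tag::text")
#     yield loader.load_item()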

Pipelines

Pipelines process each item after it leaves the spider. A pipeline is a class with a process_item method.

# quotesbot/pipelines.py
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open("quotes_output.jsonl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item   # always return item or raise DropItem

Enable it in settings.py:

ITEM_PIPELINES = {
    "quotesbot.pipelines.JsonWriterPipeline": 300,  # priority 1-1000
}

Lower number = runs earlier. You can chain multiple pipelines, for example a validation or deduplication pipeline at priority 100 ahead of a storage pipeline at 300, as sketched below.
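
As a concrete sketch of such chaining, here is a hypothetical validation pipeline that drops incomplete items before they reach the writer (the class and its name are illustrative, not part of the project files):

# quotesbot/pipelines.py (hypothetical addition)
from scrapy.exceptions import DropItem

class RequiredFieldsPipeline:
    def process_item(self, item, spider):
        # Drop items missing required fields; later pipelines never see them
        if not item.get("text") or not item.get("author"):
            raise DropItem(f"Missing text or author: {item!r}")
        return item

Both pipelines are then listed together in settings.py:

ITEM_PIPELINES = {
    "quotesbot.pipelines.RequiredFieldsPipeline": 100,  # runs first
    "quotesbot.pipelines.JsonWriterPipeline": 300,      # runs second
}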

Key settings

# quotesbot/settings.py

BOT_NAME = "quotesbot"
USER_AGENT = "quotesbot/1.0 (+https://your-site.com)"

ROBOTSTXT_OBEY = True          # respect robots.txt, leave this True

CONCURRENT_REQUESTS = 4        # simultaneous requests (default 16 is too aggressive)
DOWNLOAD_DELAY = 1             # minimum seconds between requests per domain

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests to send to the server

HTTPCACHE_ENABLED = True       # cache responses during development
HTTPCACHE_EXPIRATION_SECS = 3600

AUTOTHROTTLE is Scrapy's killer feature for being a polite crawler. It measures server latency and automatically slows down when the server is struggling.
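
To watch those decisions as they happen, enable AutoThrottle's debug output for a single run:

scrapy crawl quotes -s AUTOTHROTTLE_DEBUG=True
# Logs throttling stats (latency and chosen delay) for every response received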

Hands-on

Here is the complete project. Create the directory structure, copy each file (reusing pipelines.py from the Pipelines section above), and run the spider.

scrapy startproject quotesbot
cd quotesbot

quotesbot/items.py:

import scrapy

class QuoteItem(scrapy.Item):
    text   = scrapy.Field()
    author = scrapy.Field()
    tags   = scrapy.Field()
    page   = scrapy.Field()

quotesbot/spiders/quotes_spider.py:

import scrapy
from quotesbot.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        page = response.url.rstrip("/").split("/")[-1]
        page_num = int(page) if page.isdigit() else 1

        for quote_div in response.css("div.quote"):
            item = QuoteItem()
            item["text"]   = quote_div.css("span.text::text").get("").strip()
            item["author"] = quote_div.css("small.author::text").get("").strip()
            item["tags"]   = quote_div.css("a.tag::text").getall()
            item["page"]   = page_num
            yield item

        self.logger.info(
            f"Scraped page {page_num}, {len(response.css('div.quote'))} quotes"
        )

        next_href = response.css("li.next a::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)

quotesbot/settings.py (add/change these values):

BOT_NAME = "quotesbot"
SPIDER_MODULES = ["quotesbot.spiders"]
NEWSPIDER_MODULE = "quotesbot.spiders"

USER_AGENT = "quotesbot/1.0"
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 2
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
HTTPCACHE_ENABLED = True
ITEM_PIPELINES = {
    "quotesbot.pipelines.JsonWriterPipeline": 300,   # defined in the Pipelines section
}

Run and save to JSON Lines:

scrapy crawl quotes -o quotes.jsonl

Check the output:

head -2 quotes.jsonl
{"text": "“The world as we have created it is a process of our thinking.”", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"], "page": 1}
{"text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "author": "J.K. Rowling", "tags": ["abilities", "choices"], "page": 1}

Use scrapy shell for interactive exploration; it is indispensable for figuring out selectors:

scrapy shell "https://quotes.toscrape.com/"
# In the shell:
# response.css("div.quote span.text::text").getall()[:3]

Common pitfalls

  • Setting CONCURRENT_REQUESTS = 16 (the default) in production. The default favours throughput, not politeness. For real sites start at 2-4 and increase only if the server handles it.

  • Forgetting return item in a pipeline. If process_item does not return the item (or raise DropItem), the item silently disappears. Subsequent pipelines never see it.

  • Using relative URLs directly. response.follow() handles relative URLs correctly. If you build absolute URLs yourself by string concatenation, you will mangle URLs that start with // or ./ (see the sketch after this list).

  • Enabling HTTPCACHE in production. Caching is great during development but should be disabled in production crawls where you want fresh data. Set HTTPCACHE_ENABLED = False or pass -s HTTPCACHE_ENABLED=0 on the command line.

  • Turning off ROBOTSTXT_OBEY. Projects generated by scrapy startproject enable it by default, but some tutorials tell you to switch it off. Do not turn it off unless you have a specific, justified reason.

  • Spider name conflicts. Every spider in a project must have a unique name attribute. Running scrapy crawl with the wrong name gives a KeyError that looks confusing the first time.
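
To see the relative-URL pitfall concretely, here is a small sketch (paths chosen for illustration) comparing proper URL joining, roughly what response.follow() and response.urljoin() do for you, with naive string concatenation:

from urllib.parse import urljoin

base = "https://quotes.toscrape.com/page/1/"

# Correct resolution:
urljoin(base, "/page/2/")             # 'https://quotes.toscrape.com/page/2/'
urljoin(base, "//cdn.example.com/x")  # 'https://cdn.example.com/x' (scheme-relative)

# Naive concatenation mangles both:
base + "/page/2/"                     # 'https://quotes.toscrape.com/page/1//page/2/'
base + "//cdn.example.com/x"          # 'https://quotes.toscrape.com/page/1///cdn.example.com/x'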

What to try next

  1. Add an AuthorItem and a second parse method (parse_author) that follows each author's profile link (/author/<name>) and extracts their bio and birth date. Use yield response.follow(author_link, callback=self.parse_author).

  2. Write a DeduplicationPipeline that drops any QuoteItem whose text has already been seen (using a set). Chain it at priority 100, before the output pipeline at 300.

  3. Export the crawl data to SQLite using a pipeline that inserts rows into a quotes table. Use sqlite3 from the standard library.
