Scrapy Fundamentals
What you'll build
By the end of this lesson you will have a complete Scrapy project that crawls every page of quotes.toscrape.com, extracts quotes, authors, and tags, stores them via a pipeline, and respects the server with configurable throttling. You will also understand when Scrapy is the right tool, and when it is not.
Concepts
When Scrapy beats requests + BeautifulSoup
requests + BeautifulSoup is fine for one-off scripts and small crawls. Scrapy earns its place when you need any of these:
- Concurrent crawling. Scrapy uses Twisted (an async event loop) internally, so it fires multiple requests in parallel without threading complexity.
- Automatic throttling. AUTOTHROTTLE adjusts concurrency based on server response times.
- Built-in deduplication. Scrapy tracks visited URLs and will not re-request them.
- Middleware / pipeline architecture. Clean separation between downloading, parsing, and storing data.
- Command-line tooling. scrapy crawl, scrapy shell, scrapy check, scrapy bench.
Use requests + BeautifulSoup for: quick scripts, pages requiring complex session management, or when you are already deep in Python async code with httpx. Use Scrapy for: production crawlers, anything over 50 pages, or anything needing pipelines and scheduling.
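For a sense of the gap, here is roughly what a single-page scrape of the same site looks like with requests + BeautifulSoup (a sketch, assuming both packages are installed):

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://quotes.toscrape.com/")
soup = BeautifulSoup(resp.text, "html.parser")
for quote in soup.select("div.quote"):
    text = quote.select_one("span.text").get_text()
    author = quote.select_one("small.author").get_text()
    print(text, "-", author)

It works, but pagination, concurrency, retries, throttling, and storage are all left to you; that is exactly the gap Scrapy fills.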
Project layout
pip install scrapy
scrapy startproject quotesbot
This creates:
quotesbot/
    scrapy.cfg
    quotesbot/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            quotes_spider.py   ← you write this
The spider class lives in spiders/. items.py defines data schemas. pipelines.py processes scraped items. settings.py controls all behaviour.
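You do not have to create the spider file by hand; Scrapy can generate a skeleton for you (note it names the file after the spider, e.g. spiders/quotes.py rather than quotes_spider.py):

scrapy genspider quotes quotes.toscrape.com

Either way, the result is a module in spiders/ containing a class like the one below.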
Spider anatomy
A spider is a Python class with a parse method and a start_urls list. Scrapy calls parse with a Response object for each starting URL.
# quotesbot/spiders/quotes_spider.py
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"  # unique spider name
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Extract data from the current page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("a.tag::text").getall(),
            }
        # Follow the "Next" link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
Key points:
- response.css("selector::text"): Scrapy's built-in CSS selectors, with the ::text pseudo-element for text content
- .get(): returns the first match or None
- .getall(): returns all matches as a list
- yield {dict}: yields a dict as a scraped item
- response.follow(url, callback): queues the next request
Run it:
scrapy crawl quotes -o quotes.json
Scrapy writes each yielded item to quotes.json automatically.
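If you prefer configuration over command-line flags, newer Scrapy versions can declare the same export in settings.py through the FEEDS setting; a minimal sketch (the overwrite option needs Scrapy 2.4 or later):

# quotesbot/settings.py
FEEDS = {
    "quotes.json": {"format": "json", "overwrite": True},
}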
Items and ItemLoader
Dict yields are fine for prototyping. For production, define an Item class so you have a schema.
# quotesbot/items.py
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    url = scrapy.Field()
# In the spider, use the Item instead of a dict
from quotesbot.items import QuoteItem

def parse(self, response):
    for quote in response.css("div.quote"):
        item = QuoteItem()
        item["text"] = quote.css("span.text::text").get("").strip()
        item["author"] = quote.css("small.author::text").get("").strip()
        item["tags"] = quote.css("a.tag::text").getall()
        item["url"] = response.url
        yield item
ItemLoader is an optional layer that applies input/output processors (stripping whitespace, joining lists, etc.) declaratively. For large projects it keeps parsing logic clean.
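As a rough sketch of what that looks like for QuoteItem (the QuoteLoader class and the loaders.py module name are our own, not part of Scrapy):

# quotesbot/loaders.py
from scrapy.loader import ItemLoader
from itemloaders.processors import Identity, MapCompose, TakeFirst

from quotesbot.items import QuoteItem

class QuoteLoader(ItemLoader):
    default_item_class = QuoteItem
    default_output_processor = TakeFirst()  # single values, not one-element lists
    text_in = MapCompose(str.strip)         # strip whitespace on the way in
    author_in = MapCompose(str.strip)
    tags_out = Identity()                   # tags stay a list

# In the spider:
# loader = QuoteLoader(selector=quote)
# loader.add_css("text", "span.text::text")
# loader.add_css("author", "small.author::text")
# loader.add_css("tags", "a.tag::text")
# loader.add_value("url", response.url)
# yield loader.load_item()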
Pipelines
Pipelines process each item after it leaves the spider. A pipeline is a class with a process_item method.
# quotesbot/pipelines.py
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        self.file = open("quotes_output.jsonl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(line)
        return item  # always return the item or raise DropItem
Enable it in settings.py:
ITEM_PIPELINES = {
    "quotesbot.pipelines.JsonWriterPipeline": 300,  # priority, conventionally 0-1000
}
Lower number = runs earlier. You can chain multiple pipelines, for example, a deduplication pipeline at priority 100 and a storage pipeline at 300.
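For example, a tiny validation pipeline could sit in front of the writer and drop incomplete items (a sketch; RequireTextPipeline is our own illustrative name):

# quotesbot/pipelines.py (additional class)
from scrapy.exceptions import DropItem

class RequireTextPipeline:
    def process_item(self, item, spider):
        if not item.get("text"):
            raise DropItem("quote has no text")  # later pipelines never see it
        return item

# settings.py
# ITEM_PIPELINES = {
#     "quotesbot.pipelines.RequireTextPipeline": 100,  # runs first
#     "quotesbot.pipelines.JsonWriterPipeline": 300,   # runs second
# }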
Key settings
# quotesbot/settings.py
BOT_NAME = "quotesbot"
USER_AGENT = "quotesbot/1.0 (+https://your-site.com)"
ROBOTSTXT_OBEY = True # respect robots.txt, leave this True
CONCURRENT_REQUESTS = 4 # simultaneous requests (default 16 is too aggressive)
DOWNLOAD_DELAY = 1 # minimum seconds between requests per domain
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0 # avg requests per server response
HTTPCACHE_ENABLED = True # cache responses during development
HTTPCACHE_EXPIRATION_SECS = 3600
AUTOTHROTTLE is Scrapy's killer feature for polite crawling: it measures server latency and automatically slows down when the server is struggling.
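To watch it make those decisions while you tune the numbers, you can temporarily enable its debug output:

# quotesbot/settings.py, during tuning only
AUTOTHROTTLE_DEBUG = True  # log throttling stats for every response received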
Hands-on
Here is the complete project. Create the directory structure, copy each file, and run the spider.
scrapy startproject quotesbot
cd quotesbot
quotesbot/items.py:
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    page = scrapy.Field()
quotesbot/spiders/quotes_spider.py:
import scrapy

from quotesbot.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        page = response.url.rstrip("/").split("/")[-1]
        page_num = int(page) if page.isdigit() else 1

        for quote_div in response.css("div.quote"):
            item = QuoteItem()
            item["text"] = quote_div.css("span.text::text").get("").strip()
            item["author"] = quote_div.css("small.author::text").get("").strip()
            item["tags"] = quote_div.css("a.tag::text").getall()
            item["page"] = page_num
            yield item

        self.logger.info(
            f"Scraped page {page_num}, {len(response.css('div.quote'))} quotes"
        )

        next_href = response.css("li.next a::attr(href)").get()
        if next_href:
            yield response.follow(next_href, callback=self.parse)
quotesbot/settings.py (add/change these values):
BOT_NAME = "quotesbot"
SPIDER_MODULES = ["quotesbot.spiders"]
NEWSPIDER_MODULE = "quotesbot.spiders"
USER_AGENT = "quotesbot/1.0"
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 2
DOWNLOAD_DELAY = 1
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
HTTPCACHE_ENABLED = True
# Optional: enable the JsonWriterPipeline from the Pipelines section above
# ITEM_PIPELINES = {
#     "quotesbot.pipelines.JsonWriterPipeline": 300,
# }
Run and save to JSON Lines:
scrapy crawl quotes -o quotes.jsonl
Check the output:
head -3 quotes.jsonl
{"text": "“The world as we have created it is a process of our thinking.”", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"], "page": 1}
{"text": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”", "author": "J.K. Rowling", "tags": ["abilities", "choices"], "page": 1}
Use scrapy shell for interactive exploration; it is indispensable for figuring out selectors:
scrapy shell "https://quotes.toscrape.com/"
# In the shell:
# response.css("div.quote span.text::text").getall()[:3]
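A few more things worth trying once you are in the shell (fetch and view are built-in shell helpers):

# response.css("li.next a::attr(href)").get()      # check the pagination selector
# response.css("small.author::text").getall()[:3]  # first three authors
# fetch("https://quotes.toscrape.com/page/2/")     # load another page in place
# view(response)                                   # open the current response in a browser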
Common pitfalls
- Setting CONCURRENT_REQUESTS = 16 (the default) in production. The default is tuned for throughput, not politeness. For real sites, start at 2-4 and increase only if the server handles it.
- Forgetting return item in a pipeline. If process_item does not return the item (or raise DropItem), the item silently disappears and subsequent pipelines never see it.
- Using relative URLs directly. response.follow() handles relative URLs correctly. If you build absolute URLs yourself by string concatenation, you will mangle URLs that start with // or ./ (see the sketch after this list).
- Enabling HTTPCACHE in production. Caching is great during development but should be disabled in production crawls where you want fresh data. Set HTTPCACHE_ENABLED = False or pass -s HTTPCACHE_ENABLED=0 on the command line.
- Turning off ROBOTSTXT_OBEY. Projects generated with scrapy startproject set ROBOTSTXT_OBEY = True, but some tutorials turn it off. Do not turn it off unless you have a specific, justified reason.
- Spider name conflicts. Every spider in a project must have a unique name attribute. Running scrapy crawl with the wrong name gives a KeyError that looks confusing the first time.
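To make the relative-URL pitfall concrete, here is the difference inside a parse callback (a sketch; response.urljoin is the safe manual alternative to response.follow):

# Inside parse(self, response), illustrative only
next_href = response.css("li.next a::attr(href)").get()  # e.g. "/page/2/"

# Fragile: plain concatenation breaks on hrefs like "//cdn.example.com/x" or "./page/2/"
bad_url = "https://quotes.toscrape.com" + next_href

# Safe: resolve against the current page, or let response.follow do it for you
good_url = response.urljoin(next_href)
yield scrapy.Request(good_url, callback=self.parse)
# ...or simply: yield response.follow(next_href, callback=self.parse)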
What to try next
- Add an AuthorItem and a second parse method (parse_author) that follows each author's profile link (/author/<name>) and extracts their bio and birth date. Use yield response.follow(author_link, callback=self.parse_author).
- Write a DeduplicationPipeline that drops any QuoteItem whose text has already been seen (using a set). Chain it at priority 100, before the output pipeline at 300.
- Export the crawl data to SQLite using a pipeline that inserts rows into a quotes table. Use sqlite3 from the standard library.