# Legal and Ethical Scraping
## What you'll build
This lesson does not produce code; it produces judgment. By the end you will be able to read a robots.txt file and understand what it means legally and practically, identify when scraped data creates copyright or privacy liability, apply India's DPDP Act to a real scraping scenario, and run a practical checklist before any scraping project. You will leave with a clear, honest position on what is fine, what is grey, and what is not acceptable.
## Concepts
### robots.txt: what it actually does (and does not do)
Every major website serves a plain-text file at `https://example.com/robots.txt` that tells crawlers which paths they are allowed to visit:

```text
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Allow: /
Crawl-delay: 10
```
Key facts:
- `robots.txt` is not legally binding. No law in India, the US, or the EU says you must follow it.
- It is, however, strong evidence of the site's intent. Ignoring a `Disallow` rule is a clear signal to any court or regulator that you were not acting in good faith.
- Violating `robots.txt` has been used as supporting evidence in several US court cases to establish that access was "unauthorised" under the Computer Fraud and Abuse Act (CFAA).
- `Crawl-delay` is a polite signal. Respect it.
- Scrapy and most scraping frameworks respect `robots.txt` by default (`ROBOTSTXT_OBEY = True`). Do not turn this off unless you have a clear reason.
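In a Scrapy project, the relevant behaviour lives in `settings.py`. A minimal sketch; the project name, contact address, and delay value are placeholder assumptions, not recommendations for any particular site:

```python
# settings.py -- polite-crawling settings for a Scrapy project
# (the project name and contact address below are placeholders)

ROBOTSTXT_OBEY = True    # honour robots.txt; this is Scrapy's default
DOWNLOAD_DELAY = 2.0     # seconds to wait between requests to the same site
USER_AGENT = "MyProject/1.0 (contact@yoursite.com)"  # identify yourself
```

Leaving `ROBOTSTXT_OBEY` at its default means Scrapy silently skips disallowed URLs rather than fetching them.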
Reading `robots.txt` in Python with the standard library:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://quotes.toscrape.com/robots.txt")
rp.read()

can_fetch = rp.can_fetch("*", "https://quotes.toscrape.com/")
print("Can fetch /:", can_fetch)  # True or False
```
### Terms of Service
Most websites have Terms of Service (ToS) that prohibit automated data collection. The legal weight of ToS varies:
- In the US: hiQ Labs v. LinkedIn (9th Circuit, 2022) held that scraping publicly available data does not violate the CFAA, even if it violates ToS. This was a landmark ruling, but it applies narrowly: it covers only data anyone can see without logging in.
- Behind a login: Accessing data behind a login wall in violation of ToS is much riskier. Courts have been more sympathetic to CFAA claims when a user actively agreed to ToS by creating an account.
- In India: India does not have a direct equivalent of the CFAA. However, the IT Act 2000 (Section 43) prohibits unauthorised access to computer systems without permission. Scraping in violation of explicit ToS, especially behind login, could be argued as unauthorised access.
Practical rule: ToS matters most when you are logged in. Publicly accessible data is much safer territory. But even for public data, violating ToS can lead to civil suits for breach of contract.
### Copyright in scraped data
Scraping data is not the same as republishing it. The distinction matters enormously.
- Facts are not copyrightable: prices, addresses, names, and other factual data cannot be copyrighted (at least in most jurisdictions). You can scrape and use raw facts.
- Creative expression is copyrightable: product descriptions, reviews, articles, and original writing are protected. Republishing them verbatim is copyright infringement even if you scraped them.
- Database rights: the EU (and by extension, some arguments in Indian law) recognises "database rights" that protect the substantial investment in assembling a database. Scraping a competitor's entire product catalogue and republishing it could infringe database rights even if individual facts are not protected.
Practical rule: Use scraped data to train models, power your own analytics, or build derivative products. Do not reproduce scraped text verbatim and republish it.
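One way to enforce this rule in code is to whitelist the factual fields a scraper stores and drop creative text entirely. A minimal sketch; the record and field names are hypothetical:

```python
# Hypothetical product record as it might come off a scraper
record = {
    "name": "Blue Widget",
    "price": 499.0,
    "in_stock": True,
    "description": "A lyrical 200-word marketing blurb...",  # creative text
}

# Only factual, non-creative fields survive storage
FACTUAL_FIELDS = {"name", "price", "in_stock"}

def keep_facts(item: dict) -> dict:
    """Keep only factual fields; drop creative expression before storing."""
    return {k: v for k, v in item.items() if k in FACTUAL_FIELDS}

print(keep_facts(record))
```

Filtering at ingestion time, rather than at publication time, means protected text never ends up in your database in the first place.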
### Personal data: GDPR and India's DPDP Act
GDPR (EU, 2018):
- Applies if you scrape personal data about EU residents.
- "Personal data" is any information that can identify a person: names, email addresses, phone numbers, IP addresses, social media profiles.
- You need a legal basis (legitimate interest, consent, contract) to process personal data. "I found it on the web" is not a legal basis.
- If you build a database of scraped personal data, you may need to comply with data subject rights (access, erasure, portability).
India's Digital Personal Data Protection Act (DPDP Act, 2023):
- India's first comprehensive data privacy law. Rules are still being notified (as of 2026), but the core framework is in force.
- Applies to processing of digital personal data in India, and to processing of personal data of Indian residents even if done abroad.
- A "data fiduciary" (the entity processing data) must have consent or fall under a "legitimate use" ground.
- Scraping personal data of Indian residents without consent and building a commercial database is likely to be non-compliant.
- Penalties: up to Rs. 250 crore per violation.
Practical rule: Do not scrape personal data (names, emails, phone numbers, addresses, social media profiles, medical data) and build databases of it without a clear legal basis. If in doubt, anonymise or do not collect.
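If you must process scraped text at all, one defensive step is to redact obvious identifiers before anything is stored. A minimal sketch with regexes for email addresses and phone-like numbers; the patterns are illustrative, not exhaustive, and regex redaction alone does not amount to full anonymisation:

```python
import re

# Illustrative patterns only -- real PII detection needs far more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{8,}\d")

def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(redact("Contact priya@example.com or +91 98765 43210"))
# -> Contact [EMAIL] or [PHONE]
```

The safer default remains not collecting the fields at all; redaction is a fallback for free-text content you cannot avoid ingesting.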
## Landmark cases: a quick survey
### hiQ Labs v. LinkedIn (US, 9th Circuit, 2022)
- hiQ scraped LinkedIn's public profile pages to build a workforce analytics product.
- LinkedIn sent a cease-and-desist; hiQ sued for access.
- The 9th Circuit held that scraping publicly available data is not a CFAA violation, but noted that LinkedIn could pursue other claims (breach of contract, tortious interference).
- Lesson: public-data scraping is on stronger legal footing, but is not risk-free.
### Ryanair v. PR Aviation (EU Court of Justice, 2015)
- PR Aviation scraped Ryanair's flight prices; Ryanair claimed database rights.
- The court held that database rights apply only if the database meets the "substantial investment" threshold. A ToS prohibition alone does not create database rights.
- Lesson: database rights claims require specific conditions.
### Meta v. Bright Data (US, 2024)
- Meta sued Bright Data (a proxy/scraping provider) for accessing Facebook data.
- The court sided with Bright Data, reinforcing that scraping public data does not, on its own, violate the CFAA.
- Lesson: US courts have trended towards protecting the right to scrape public data, but that does not make ToS irrelevant for other claims.
Important: Case law changes. Cases are jurisdiction-specific. This is not legal advice. Consult a lawyer for anything with real commercial stakes.
## The good-citizen checklist
Before starting any scraping project, run through this:
- Check robots.txt: does it disallow your target paths? If yes, have a clear justification before ignoring it.
- Read the ToS: is automated scraping explicitly prohibited? Is the data behind a login?
- Prefer the API: does the site offer a public API? If so, use it instead.
- Rate-limit yourself: add delays and do not hammer the server. A reasonable rate is whatever a fast human would do.
- Identify yourself: set a User-Agent that includes your project name and contact info, e.g. `MyProject/1.0 (contact@yoursite.com)`, so the site operator has a way to reach you.
- Do not scrape personal data unless you have a legal basis and a clear need.
- Do not redistribute scraped content verbatim if it is creative work (articles, reviews, descriptions).
- Do not scrape behind a login without explicit permission or a very clear legal basis.
- Do not overload: if the site slows down, you are going too fast. Back off.
- Respect `Crawl-delay`: if `robots.txt` specifies a crawl delay, honour it.
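The items about robots.txt, identification, rate limits, and `Crawl-delay` can be combined into one small helper. A sketch under stated assumptions: the project name, contact address, default delay, and timeout are all placeholder choices:

```python
import time
import urllib.robotparser

# Hypothetical identifying User-Agent -- substitute your own details.
USER_AGENT = "MyProject/1.0 (contact@yoursite.com)"

def polite_fetch(rp: urllib.robotparser.RobotFileParser, url: str,
                 default_delay: float = 2.0):
    """Fetch url only if robots.txt allows it, honouring Crawl-delay."""
    if not rp.can_fetch(USER_AGENT, url):
        return None  # disallowed: skip rather than override
    # Honour the site's Crawl-delay if given, else fall back to our own.
    time.sleep(rp.crawl_delay("*") or default_delay)
    import requests  # imported here so the robots check needs no network
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
```

For a disallowed path the helper returns `None` without touching the network, which makes the robots check easy to test in isolation.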
## The honest take
Scraping public data ethically is fine. The web was designed to be crawled; search engines have done it since 1994. If you are collecting facts from publicly accessible pages at a reasonable rate, identifying yourself, and using the data for legitimate purposes, you are well within the spirit of how the web works.
The line is crossed when you:
- Access data behind a login you are not supposed to have
- Reproduce copyrighted creative content verbatim for commercial gain
- Build personal data databases without a legal basis
- Cause measurable harm to the service through excessive load
- Use automation to defraud, manipulate, or harm
Everything between the clearly fine and the clearly wrong is a grey area. Legal opinions differ, and court decisions vary by jurisdiction. When the stakes are high (commercial use, personal data, competitor data), get a lawyer, not just a tutorial.
## Hands-on
Let us check robots.txt for a few sites programmatically:
```python
import urllib.robotparser

def check_robots(base_url: str, paths_to_check: list[str]):
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{base_url}/robots.txt")
    try:
        rp.read()
    except Exception as e:
        print(f"Could not fetch robots.txt: {e}")
        return

    print(f"\nrobots.txt for {base_url}:")
    for path in paths_to_check:
        url = base_url + path
        allowed = rp.can_fetch("*", url)
        status = "ALLOWED" if allowed else "DISALLOWED"
        print(f"  {status}: {path}")

    # Also print the crawl delay if specified
    delay = rp.crawl_delay("*")
    if delay:
        print(f"  Crawl-delay: {delay} seconds")

# Check a few scraping-friendly practice sites
check_robots("https://quotes.toscrape.com", ["/", "/page/2/", "/login"])
check_robots("https://books.toscrape.com", ["/", "/catalogue/", "/admin/"])
```
Now read the actual robots.txt for a site and apply the checklist:
```python
import requests

def print_robots_txt(base_url: str):
    try:
        r = requests.get(f"{base_url}/robots.txt", timeout=10)
        if r.status_code == 200:
            print(f"\n--- robots.txt for {base_url} ---")
            print(r.text[:2000])  # First 2000 chars
        else:
            print(f"No robots.txt (status {r.status_code})")
    except requests.RequestException as e:
        print(f"Error: {e}")

print_robots_txt("https://quotes.toscrape.com")
print_robots_txt("https://books.toscrape.com")
```
Walk through the checklist for a hypothetical project: "Scrape all product listings from an e-commerce site to build a price-comparison tool."
| Checklist item | Assessment |
|---|---|
| robots.txt allows the catalogue paths? | Check it; assume yes for this exercise |
| ToS permits automated access? | Often no, read carefully |
| Public API available? | If yes, use it |
| Rate limiting in place? | Add 2-3 second delays |
| Scraping personal data? | No, only product data |
| Redistributing copyrighted text? | Risk area if reproducing descriptions |
| Behind a login? | No, assume public prices |
Based on this: the scraping is likely legally viable if the site does not have a public API and you are using facts (prices, names) not creative descriptions. But check the ToS carefully before going commercial.
## Common pitfalls
- Assuming robots.txt is a legal permission slip. Being allowed by robots.txt does not mean you are legally allowed to scrape. ToS, copyright, and privacy law are separate issues.
- Assuming ToS violations are criminal. Violating ToS is usually a civil matter (breach of contract), not a criminal one. The CFAA's criminal provisions have been applied narrowly in recent US cases (especially after Van Buren v. US, 2021).
- Thinking "it's public, so it's mine." Public data can still be copyrighted, covered by database rights, or contain personal data requiring compliance with privacy law.
- Scraping personal data "because it's on the web." This reasoning did not hold under GDPR, and India's DPDP Act has similar principles. Scraped personal data is still personal data.
- Not having a Data Protection Officer (DPO) when required. Large-scale processing of personal data under GDPR requires a DPO. If you are building a commercial product that processes personal data, understand your obligations.
- Using scraped data to train AI models without checking licences. Many recent lawsuits target AI training on scraped data. The legal landscape is actively evolving; do not assume training use is automatically permitted.
## What to try next
- Read the full `robots.txt` of three sites you use regularly. Identify which paths are disallowed and what crawl-delay (if any) is specified. Does the disallowed path list tell you anything about the site's structure?
- Find the Terms of Service of a site you want to scrape. Locate the section about automated access or data mining. Summarise in one paragraph what it permits and prohibits.
- Design a hypothetical scraping project: choose a use case, check robots.txt, read the ToS, identify whether personal data is involved, and write a one-page ethical and legal assessment. This is the exercise that builds the judgment you actually need.