
Anti-Detection Techniques


What you'll build

By the end of this lesson you will understand the full spectrum of anti-detection techniques, from simple header rotation to TLS fingerprinting and browser stealth, and you will be able to make deliberate, informed decisions about which to use. This lesson explicitly addresses where reasonable scraper disguise ends and unethical access-control circumvention begins. That distinction matters.

Concepts

Why detection exists

Websites implement bot detection for legitimate reasons: protecting server resources, preventing data scraping that violates their ToS, stopping automated account creation, and blocking credential-stuffing attacks. Some of this is reasonable; some of it goes too far.

Scraper "anti-detection" is a spectrum:

  1. Looking like a normal browser: setting a realistic User-Agent and Accept headers, and using reasonable delays. This is the baseline of polite scraping and is generally acceptable.
  2. Avoiding IP bans from high-volume scraping: using proxies to distribute load. This is a grey area; it is often used to circumvent rate limiting that the site applies to protect server load.
  3. Bypassing explicit anti-bot systems: defeating Cloudflare, hCaptcha, and browser fingerprinting systems that are specifically designed to block automation. This is where it gets ethically murky and sometimes legally risky.

Be honest with yourself about which category your use case falls into.

User-Agent and headers

The default python-requests/2.x.x User-Agent is blocked by most production sites immediately. A realistic browser User-Agent is the minimum viable disguise.

import random
import requests

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

HEADERS = {
    "User-Agent": random.choice(USER_AGENTS),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
}

r = requests.get("https://httpbin.org/headers", headers=HEADERS, timeout=10)
print(r.json())

The Sec-Fetch-* headers are sent by modern browsers and their presence (or absence) is a signal many detection systems use.
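
If you rotate the User-Agent, rotate everything that depends on it. Here is a minimal sketch of picking a whole, internally consistent header bundle per request rather than swapping the User-Agent string alone; the bundle values are illustrative assumptions, not an exhaustive list of what real browsers send.

import random
import requests

# Each bundle is internally consistent: the platform named in the User-Agent
# matches the platform hint sent alongside it. Values are illustrative only.
HEADER_BUNDLES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Sec-Ch-Ua-Platform": '"Windows"',
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        "Sec-Ch-Ua-Platform": '"macOS"',
    },
]

COMMON_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
}

def pick_headers() -> dict:
    """Pick one bundle and merge it with the shared headers."""
    return {**COMMON_HEADERS, **random.choice(HEADER_BUNDLES)}

r = requests.get("https://httpbin.org/headers", headers=pick_headers(), timeout=10)
print(r.json()["headers"]["User-Agent"])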

Delays and randomisation

Fixed delays are a bot signature. Real users have natural variance.

import time
import random

def human_delay(min_s: float = 1.0, max_s: float = 4.0):
    """Sleep for a random duration drawn from a roughly human-like distribution."""
    # Gaussian with mean 2.5s, capped to [min_s, max_s]
    delay = max(min_s, min(max_s, random.gauss(2.5, 1.0)))
    time.sleep(delay)

Use random.gauss() instead of random.uniform() for more natural-looking timing.

Proxy types

When your IP gets blocked (or you need to distribute load across IPs), you use proxies.

Type        | What it is                                     | Detection risk                             | Cost
Datacenter  | IP from a cloud provider (AWS, GCP, etc.)      | High: easily identified as non-residential | Cheap (~$1-5/GB)
Residential | IP from a real ISP, usually via a peer network | Low: looks like a home user                | Expensive (~$10-30/GB)
Mobile      | IP from a mobile carrier network               | Very low: treated as premium traffic       | Very expensive (~$30-100/GB)

For most scraping tasks, datacenter proxies are fine. For sites that aggressively block non-residential IP ranges, you need residential (or mobile) proxies.

import requests

proxies = {
    "http":  "http://user:pass@proxy-host:8080",
    "https": "http://user:pass@proxy-host:8080",
}

r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=15)
print(r.json())  # Shows the proxy's IP, not yours

Proxy rotation

Rotating through a pool of proxies prevents any single IP from hitting rate limits or getting banned.

import random
import requests

PROXY_LIST = [
    "http://user:pass@proxy1:8080",
    "http://user:pass@proxy2:8080",
    "http://user:pass@proxy3:8080",
]

def get_random_proxy() -> dict:
    proxy = random.choice(PROXY_LIST)
    return {"http": proxy, "https": proxy}

def fetch(url: str) -> requests.Response:
    # Try up to three times, picking a fresh proxy for each attempt
    for attempt in range(3):
        try:
            r = requests.get(url, proxies=get_random_proxy(), timeout=20)
            if r.status_code == 200:
                return r
        except requests.RequestException:
            pass  # proxy failed or timed out; retry with another
    raise RuntimeError(f"Failed to fetch {url} after 3 attempts")

In production, use a managed proxy provider (Bright Data, Oxylabs, Smartproxy) that handles rotation, session stickiness, and geo-targeting for you.
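
As a rough sketch of what that looks like from your side: most providers expose a single gateway endpoint and let you pin a "sticky" session by encoding a session id into the proxy username. The gateway host and username format below are hypothetical; check your provider's documentation for the real syntax.

import random
import string
import requests

# Hypothetical gateway; real providers document their own host and username syntax.
GATEWAY = "gate.example-proxy.com:8000"

def sticky_proxy(session_id: str) -> dict:
    """Build a proxy dict that keeps one exit IP for the lifetime of session_id."""
    url = f"http://user-session-{session_id}:password@{GATEWAY}"
    return {"http": url, "https": url}

session_id = "".join(random.choices(string.ascii_lowercase + string.digits, k=8))
proxies = sticky_proxy(session_id)

# Reusing the same session_id should keep the same exit IP between requests.
for _ in range(2):
    r = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=15)
    print(r.json()["origin"])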

TLS / JA3 fingerprinting

HTTPS requires a TLS handshake. The specific combination of cipher suites, extensions, and elliptic curves your client offers forms a "JA3 fingerprint", almost as unique as a browser fingerprint.

The problem: the requests library (built on urllib3) sends a TLS fingerprint that looks nothing like Chrome. Detection systems that check JA3 will identify it as a bot even if all headers look correct.

The fix: use curl_cffi, a Python library that uses libcurl with browser-impersonation patches.

pip install curl-cffi
from curl_cffi import requests as cf_requests

# Impersonate Chrome 124's exact TLS fingerprint
r = cf_requests.get(
    "https://httpbin.org/headers",
    impersonate="chrome124",
    timeout=10,
)
print(r.json())

curl_cffi is the current best option for TLS fingerprint evasion without a full browser. Use it when requests fails and you do not want Playwright's overhead.

Browser fingerprinting and Playwright stealth

A real browser running automation has tells:

  • navigator.webdriver is true
  • Missing browser plugins
  • Non-standard screen dimensions
  • Specific canvas rendering differences

playwright-stealth patches these:

pip install playwright-stealth
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    stealth_sync(page)   # patch common automation tells

    page.goto("https://bot.sannysoft.com/")   # a bot-detection test page
    page.screenshot(path="stealth_test.png", full_page=True)
    browser.close()

Review the screenshot to see which tests pass.
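
For a quicker check than a full screenshot, you can read one specific tell directly. This sketch compares navigator.webdriver in a plain page and a stealth-patched page, assuming the same playwright and playwright-stealth installs as above.

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

def webdriver_flag(use_stealth: bool) -> object:
    """Return the value of navigator.webdriver in a fresh headless page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        if use_stealth:
            stealth_sync(page)
        value = page.evaluate("() => navigator.webdriver")
        browser.close()
        return value

print("plain:  ", webdriver_flag(use_stealth=False))  # typically True
print("stealth:", webdriver_flag(use_stealth=True))   # typically None or False after patching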

The ethical line

Here is the honest version of where the line sits:

Generally acceptable:

  • Setting a realistic User-Agent and headers (you are presenting yourself as a client, just not identifying your tool)
  • Adding polite delays
  • Using proxies to avoid overloading a single IP when you are doing high-volume but legitimate scraping of public data

Ethically murky:

  • Using proxies specifically to circumvent IP-based rate limits set to protect server load
  • Rotating through large proxy pools to bypass per-IP access limits
  • Using stealth techniques to bypass bot detection on a site that has not otherwise blocked your access

Not acceptable:

  • Using any technique to bypass a login wall or access-controlled area you are not authorised to access
  • Defeating CAPTCHA systems that exist to prevent automated access
  • Using automation to create fake accounts or submit fraudulent forms
  • Scraping at rates that cause measurable harm to the service

The test: would the site operator, if they saw exactly what your scraper was doing, consider it acceptable use? If not, reconsider.

Hands-on

Let us build a minimal session that combines realistic headers, random delays, and proxy support (falling back to no proxy if none are configured):

import os
import random
import time
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]

PROXY_URL = os.environ.get("PROXY_URL")  # e.g. "http://user:pass@host:port"


def make_stealthy_session() -> requests.Session:
    session = requests.Session()
    session.headers.update({
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
    })

    if PROXY_URL:
        session.proxies.update({"http": PROXY_URL, "https": PROXY_URL})
        print(f"Using proxy: {PROXY_URL.split('@')[-1]}")

    retry = Retry(total=3, backoff_factor=2, status_forcelist=(429, 500, 502, 503))
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session


def polite_fetch(session: requests.Session, url: str) -> requests.Response:
    # Random human-like delay
    delay = max(1.0, min(5.0, random.gauss(2.5, 1.0)))
    time.sleep(delay)

    r = session.get(url, timeout=(5, 30))
    r.raise_for_status()
    return r


# --- Demo ---
session = make_stealthy_session()

# Verify our apparent IP
r = polite_fetch(session, "https://httpbin.org/ip")
print("Our IP:", r.json()["origin"])

# Verify headers we send
r = polite_fetch(session, "https://httpbin.org/headers")
headers_sent = r.json()["headers"]
print("User-Agent:", headers_sent.get("User-Agent"))
print("Sec-Fetch-Mode:", headers_sent.get("Sec-Fetch-Mode"))

Common pitfalls

  • Rotating User-Agent without rotating other headers. Sending a Chrome User-Agent with Accept headers that do not match Chrome is an obvious inconsistency. Rotate the entire header set together, or use a consistent realistic header bundle.

  • Using datacenter proxies against sites that check for residential IPs. Many e-commerce and financial sites block datacenter IP ranges entirely. If you see 403 from a proxy, check whether the proxy IP range is blacklisted.

  • Treating stealth as a solution to bad ethics. Stealth techniques do not make ethically wrong scraping acceptable. They just make it harder to detect. If the scraping itself is wrong, stealth does not fix that.

  • Not testing proxies before use. A dead proxy causes a timeout on every request. Validate your proxy list against httpbin.org/ip before starting a large crawl.

  • Overusing proxies. Residential proxies are expensive. Use them only where genuinely needed. For public, unauthenticated, low-volume scraping, a realistic header set and polite delays are usually enough.

  • Ignoring the Retry-After header on 429. When a site sends Retry-After: 60, sleep for 60 seconds before retrying, as sketched below. Retrying immediately just burns through your proxy pool and makes things worse.
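
As a minimal sketch of that last pitfall, here is one way to honour Retry-After, assuming the header carries a number of seconds (the common case; it can also be an HTTP date):

import time
import requests

def fetch_respecting_retry_after(session: requests.Session, url: str, max_attempts: int = 3) -> requests.Response:
    """Retry on 429, sleeping for however long the server asks."""
    for _ in range(max_attempts):
        r = session.get(url, timeout=(5, 30))
        if r.status_code != 429:
            r.raise_for_status()
            return r
        # Fall back to 60s if the header is missing or not a plain number of seconds.
        try:
            wait = int(r.headers.get("Retry-After", "60"))
        except ValueError:
            wait = 60
        time.sleep(wait)
    raise RuntimeError(f"Still rate-limited after {max_attempts} attempts: {url}")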

What to try next

  1. Visit https://bot.sannysoft.com/ with plain Playwright (no stealth) and take a screenshot. Then add playwright-stealth and compare the results. Count how many tests change from red to green.

  2. Build a proxy-validator script: given a list of proxy URLs, fetch https://httpbin.org/ip through each one with a 10-second timeout. Print a table of working vs failed proxies.

  3. Read the response headers from https://quotes.toscrape.com and identify any bot-detection signals (e.g. Cloudflare's cf-ray, server type, custom headers). What does this tell you about their detection setup?
