Scheduling and Deployment
What you'll build
By the end of this lesson you will have a Dockerfile for your scraper, a docker-compose.yml for local testing, a GitHub Actions workflow that runs the scraper on a schedule, and a logging setup that tells you exactly what happened when the scraper breaks at 3 AM, and it will break at 3 AM.
Concepts
Structuring a scraper for production
A scraper script that prints to stdout works on your laptop. In production, you need:
- Structured logging: not print(), but logging with timestamps and levels.
- Exit codes: sys.exit(1) on failure so cron and Docker know the job failed.
- Configuration via environment variables, not hardcoded URLs, credentials, or limits.
- Idempotent storage, re-running the scraper must not corrupt data (covered in Lesson 8).
# scraper/main.py
import logging
import os
import sys

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s, %(message)s",
    handlers=[
        logging.StreamHandler(sys.stdout),                     # console
        logging.FileHandler("scraper.log", encoding="utf-8"),  # file
    ],
)
logger = logging.getLogger("books_scraper")

def main():
    max_pages = int(os.environ.get("MAX_PAGES", "10"))
    db_path = os.environ.get("DB_PATH", "books.db")
    logger.info(f"Starting scrape: max_pages={max_pages}, db={db_path}")
    try:
        # ... your scraping logic here ...
        logger.info("Scrape completed successfully")
    except Exception as e:
        logger.exception(f"Scrape failed: {e}")
        sys.exit(1)

if __name__ == "__main__":
    main()
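To confirm the exit-code contract, you can run the script from another process and inspect the return code; a quick sketch, assuming the scraper/main.py path from above:

import subprocess
import sys

# Runs the scraper and reports how it exited
result = subprocess.run([sys.executable, "scraper/main.py"])
print("exit code:", result.returncode)  # 0 on success, 1 after sys.exit(1)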
cron on a Linux VPS
cron is the classic Unix job scheduler. Every line in a crontab is a schedule + a command.
# Edit crontab
crontab -e
# Run the scraper every day at 2:30 AM
30 2 * * * /home/ubuntu/venv/bin/python /home/ubuntu/scraper/main.py >> /home/ubuntu/logs/scraper.log 2>&1
# Run every 6 hours
0 */6 * * * /home/ubuntu/venv/bin/python /home/ubuntu/scraper/main.py >> /home/ubuntu/logs/scraper.log 2>&1
crontab syntax: minute hour day-of-month month day-of-week. Use crontab.guru to verify your expression.
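If you would rather check an expression programmatically, the third-party croniter package (an extra dependency, not in this lesson's requirements.txt) can preview upcoming run times:

from datetime import datetime

from croniter import croniter  # pip install croniter

# Preview when "30 2 * * *" will actually fire
schedule = croniter("30 2 * * *", datetime.now())
print(schedule.get_next(datetime))  # next 2:30 AM
print(schedule.get_next(datetime))  # the run after that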
Limitations of cron:
- No retry on failure.
- No alert when the job crashes (unless you configure MAILTO).
- Logs go to a file; you have to SSH in to read them.
- No easy way to pass environment variables.
For a quick VPS setup, cron is fine. For anything more serious, use systemd timers or a proper scheduler like Prefect or Airflow.
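cron's "no retry" limitation does have a lightweight workaround: retry inside the script itself. A minimal sketch using only the standard library (run_with_retries is a hypothetical helper, not part of the lesson's code):

import logging
import time

logger = logging.getLogger("books_scraper")

def run_with_retries(job, attempts=3, delay=60.0):
    # Call job(); on failure, sleep and retry with a doubling delay
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception:
            logger.exception("Attempt %d/%d failed", attempt, attempts)
            if attempt == attempts:
                raise  # out of attempts: let main() sys.exit(1)
            time.sleep(delay)
            delay *= 2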
systemd timers
systemd timers are more reliable than cron: they support restart policies, boot-time execution, and integration with journalctl.
# /etc/systemd/system/books-scraper.service
[Unit]
Description=Books scraper
After=network.target
[Service]
Type=oneshot
User=ubuntu
WorkingDirectory=/home/ubuntu/scraper
ExecStart=/home/ubuntu/venv/bin/python main.py
EnvironmentFile=/home/ubuntu/scraper/.env
StandardOutput=journal
StandardError=journal
# /etc/systemd/system/books-scraper.timer
[Unit]
Description=Run books scraper daily
[Timer]
OnCalendar=daily
# Persistent=true runs the job at the next opportunity if the machine was off at the scheduled time.
# Note: systemd does not allow inline comments after a directive, so this must be its own line.
Persistent=true
[Install]
WantedBy=timers.target
sudo systemctl enable books-scraper.timer
sudo systemctl start books-scraper.timer
journalctl -u books-scraper.service -f # follow logs in real time
GitHub Actions for free scheduled scraping
For scrapers that run daily and store results in the repo (or push to a cloud store), GitHub Actions is free and requires no server.
# .github/workflows/scrape.yml
name: Daily scrape
on:
  schedule:
    - cron: "30 2 * * *"  # 2:30 AM UTC every day
  workflow_dispatch:       # allow manual trigger

jobs:
  scrape:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    permissions:
      contents: write      # the default token is read-only on newer repos; needed for git push
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run scraper
        env:
          MAX_PAGES: "5"
          DB_PATH: "data/books.db"
        run: python scraper/main.py
      - name: Commit and push updated data
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/
          git diff --staged --quiet || git commit -m "chore: update scraped data $(date -u '+%Y-%m-%d')"
          git push
Caveats for GitHub Actions scrapers:
- Scheduled jobs can be delayed, sometimes by 15 minutes or more, during periods of high load.
- GitHub disables scheduled workflows on repos with no activity for 60 days.
- Actions minutes are unlimited for public repos; private repos get 2,000 free minutes/month. Either way, that is effectively unlimited for a daily 5-minute scraper.
Dockerfile for a Python scraper
FROM python:3.12-slim
WORKDIR /app
# Install system deps (lxml needs libxml2)
RUN apt-get update && apt-get install -y --no-install-recommends \
libxml2-dev libxslt-dev \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY scraper/ .
CMD ["python", "main.py"]
# docker-compose.yml
services:
  scraper:
    build: .
    environment:
      MAX_PAGES: "10"
      DB_PATH: /data/books.db
    volumes:
      - ./data:/data
    restart: "no"  # do not restart after completion
Build and run:
docker compose build
docker compose run --rm scraper
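After a run, confirm the volume mount worked by querying the database from the host; a quick check, assuming the books table from the storage lesson:

import sqlite3

# The bind mount puts /data/books.db (container) at ./data/books.db (host)
conn = sqlite3.connect("data/books.db")
count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
print(f"{count} books in data/books.db")
conn.close()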
To run on a schedule with Docker, combine it with a cron job on the host, e.g. 0 3 * * * cd /home/ubuntu/scraper && docker compose run --rm scraper >> /home/ubuntu/logs/scraper.log 2>&1.
Deploying to a cheap VPS
Good options for a scraping VPS (in 2026):
- Hetzner Cloud: cheapest per-CPU-minute, excellent network, EU-based.
- DigitalOcean: great developer experience, easy droplets.
- Hostinger VPS: good if you already have Hostinger for hosting (shared billing).
Recommended minimal setup:
# On a fresh Ubuntu 24.04 VPS
# 1. Install Python and venv
sudo apt update && sudo apt install -y python3.12 python3.12-venv git
# 2. Clone your scraper repo
git clone https://github.com/youruser/your-scraper.git /home/ubuntu/scraper
# 3. Create virtualenv and install dependencies
cd /home/ubuntu/scraper
python3.12 -m venv venv
./venv/bin/pip install -r requirements.txt
# 4. Create the data directory and a .env with secrets
mkdir -p /home/ubuntu/data
echo "DB_PATH=/home/ubuntu/data/books.db" > .env
echo "MAX_PAGES=50" >> .env
# 5. Test it manually
./venv/bin/python main.py
# 6. Set up cron or systemd timer
Logging, monitoring, and alerting
Logging with the logging module is step one. Step two is getting notified when the scraper fails.
Simple email alert via cron MAILTO (this requires a working mail agent such as postfix on the host):
# At the top of your crontab
MAILTO=suparn@yoursite.com
For more control, send a Slack/Discord message on failure:
import os

import requests

def notify_failure(error: str):
    webhook = os.environ.get("SLACK_WEBHOOK_URL")
    if webhook:
        requests.post(webhook, json={"text": f"Scraper FAILED: {error}"}, timeout=5)
Call notify_failure(str(e)) in your except block.
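Wired into the entry point from earlier, the except block looks roughly like this (a sketch; notify_failure is the function above, and the placeholder comment stands in for your scraping logic):

def main():
    try:
        # ... your scraping logic here ...
        logger.info("Scrape completed successfully")
    except Exception as e:
        logger.exception("Scrape failed")
        notify_failure(str(e))  # alert the channel before exiting non-zero
        sys.exit(1)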
Hands-on
Here is a complete production-ready scraper layout with all the pieces:
your-scraper/
  scraper/
    __init__.py
    main.py              ← entry point
    scrape.py            ← scraping logic
    store.py             ← SQLite storage
  data/
    .gitkeep
  Dockerfile
  docker-compose.yml
  requirements.txt
  .env.example           ← template, committed; .env is in .gitignore
  .github/
    workflows/
      scrape.yml
requirements.txt:
requests==2.32.3
beautifulsoup4==4.12.3
lxml==5.2.2
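main.py imports crawl_books from scrape.py. For reference, a minimal sketch of that interface, assuming the books.toscrape.com practice site from earlier lessons (your Lesson 5 version has the full parsing):

# scraper/scrape.py (sketch)
import logging

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger("scrape")
PAGE_URL = "https://books.toscrape.com/catalogue/page-{}.html"

def crawl_books(max_pages: int = 10) -> tuple[list[dict], int]:
    books: list[dict] = []
    pages = 0
    for page in range(1, max_pages + 1):
        resp = requests.get(PAGE_URL.format(page), timeout=10)
        if resp.status_code == 404:  # walked past the last page
            break
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "lxml")
        for article in soup.select("article.product_pod"):
            books.append({
                "title": article.h3.a["title"],
                "price": article.select_one("p.price_color").get_text(),
            })
        pages += 1
        logger.info("Page %d done, %d books so far", page, len(books))
    return books, pages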
scraper/main.py, the complete entry point:
import logging
import os
import sqlite3
import sys

from scrape import crawl_books
from store import init_db, upsert_books, start_run, finish_run

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s, %(message)s",
)
logger = logging.getLogger("main")

def main():
    db_path = os.environ.get("DB_PATH", "data/books.db")
    max_pages = int(os.environ.get("MAX_PAGES", "10"))
    logger.info(f"DB={db_path}, MAX_PAGES={max_pages}")
    conn = sqlite3.connect(db_path)
    init_db(conn)
    run_id = start_run(conn, "books")
    try:
        books, pages = crawl_books(max_pages=max_pages)
        upsert_books(conn, books)
        finish_run(conn, run_id, pages=pages, items=len(books))
        logger.info(f"Done. {len(books)} books, {pages} pages.")
    except Exception:
        logger.exception("Scrape failed")
        sys.exit(1)
    finally:
        conn.close()

if __name__ == "__main__":
    main()
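start_run, finish_run, and the upsert come from Lesson 8. For reference, a minimal sketch of the store.py interface main.py expects; the schema here is an assumption, your Lesson 8 version is authoritative:

# scraper/store.py (sketch; schema is illustrative)
import sqlite3
from datetime import datetime, timezone

def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS books (
            title TEXT PRIMARY KEY,
            price TEXT
        );
        CREATE TABLE IF NOT EXISTS runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            source TEXT,
            started_at TEXT,
            finished_at TEXT,
            pages INTEGER,
            items INTEGER
        );
    """)

def start_run(conn: sqlite3.Connection, source: str) -> int:
    cur = conn.execute(
        "INSERT INTO runs (source, started_at) VALUES (?, ?)",
        (source, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.lastrowid

def upsert_books(conn: sqlite3.Connection, books: list[dict]) -> None:
    # Idempotent write: re-running updates prices instead of duplicating rows
    conn.executemany(
        "INSERT INTO books (title, price) VALUES (:title, :price) "
        "ON CONFLICT(title) DO UPDATE SET price = excluded.price",
        books,
    )
    conn.commit()

def finish_run(conn: sqlite3.Connection, run_id: int, pages: int, items: int) -> None:
    conn.execute(
        "UPDATE runs SET finished_at = ?, pages = ?, items = ? WHERE id = ?",
        (datetime.now(timezone.utc).isoformat(), pages, items, run_id),
    )
    conn.commit()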
Common pitfalls
- Running cron without absolute paths. cron does not inherit your shell's PATH. Always use full paths: /home/ubuntu/venv/bin/python, not just python.
- Not redirecting stderr in cron. >> scraper.log 2>&1 captures both stdout and stderr. Without 2>&1, exceptions and tracebacks go to /dev/null (or an email to root you will never read).
- GitHub Actions scheduled workflows silently disabled. GitHub disables scheduled workflows in repos with no recent activity. If your scraper stops running, push a commit to re-enable it.
- Writing to a relative path inside Docker. If your script writes to ./books.db inside a container and you have not mounted a volume, the file disappears when the container is removed. Always mount a volume for data files.
- Not testing locally before deploying. Run docker compose run --rm scraper locally before pushing to a VPS. A 2-minute test saves an hour of SSH debugging.
- Forgetting sys.exit(1) on failure. Without it, a script whose except block swallows the error still exits with code 0 (success), so cron, GitHub Actions, and systemd will all think the job succeeded and never alert you.
What to try next
- Deploy the scraper to a free GitHub Actions runner using the scrape.yml workflow above. Trigger it manually with workflow_dispatch and check the Actions log.
- Add a Slack webhook notification: on failure, post the error message to a #alerts channel. Test it by intentionally raising an exception at the start of main().
- Add a --dry-run command-line flag using argparse (a starter sketch follows below). When set, the scraper fetches and parses but does not write to the database. Use this for CI checks that verify the scraper still works without polluting production data.
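For the last exercise, a starting point for the flag parsing; wiring args.dry_run into main() is left to you, as the trailing comment hints:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Books scraper")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="fetch and parse, but skip all database writes",
    )
    return parser.parse_args()

# Inside main(), guard the write path:
#     if not args.dry_run:
#         upsert_books(conn, books)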