Scheduling and Deployment
What you'll build
By the end of this lesson you will have a Dockerfile for your scraper, a docker-compose.yml for local testing, a GitHub Actions workflow that runs the scraper on a schedule, and a logging setup that tells you exactly what happened when the scraper breaks at 3 AM, and it will break at 3 AM.
Concepts
Structuring a scraper for production
A scraper script that prints to stdout works on your laptop. In production, you need:
- Structured logging: not print(), but logging with timestamps and levels.
- Exit codes: sys.exit(1) on failure so cron and Docker know the job failed.
- Configuration via environment variables, not hardcoded URLs, credentials, or limits.
- Idempotent storage, re-running the scraper must not corrupt data (covered in Lesson 8).
# scraper/main.py
import logging
import os
import sys

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s, %(message)s",
    handlers=[
        logging.StreamHandler(sys.stdout),                     # console
        logging.FileHandler("scraper.log", encoding="utf-8"),  # file
    ],
)
logger = logging.getLogger("books_scraper")

def main():
    max_pages = int(os.environ.get("MAX_PAGES", "10"))
    db_path = os.environ.get("DB_PATH", "books.db")
    logger.info(f"Starting scrape: max_pages={max_pages}, db={db_path}")
    try:
        # ... your scraping logic here ...
        logger.info("Scrape completed successfully")
    except Exception as e:
        logger.exception(f"Scrape failed: {e}")
        sys.exit(1)

if __name__ == "__main__":
    main()
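To confirm the exit-code contract, you can run the script from another process and inspect the return code; a quick sketch, assuming the scraper/main.py path from above:

import subprocess
import sys

# Runs the scraper and reports how it exited
result = subprocess.run([sys.executable, "scraper/main.py"])
print("exit code:", result.returncode)  # 0 on success, 1 after sys.exit(1)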
cron on a Linux VPS
cron is the classic Unix job scheduler. Every line in a crontab is a schedule + a command.
# Edit crontab
crontab -e
# Run the scraper every day at 2:30 AM
30 2 * * * /home/ubuntu/venv/bin/python /home/ubuntu/scraper/main.py >> /home/ubuntu/logs/scraper.log 2>&1
# Run every 6 hours
0 */6 * * * /home/ubuntu/venv/bin/python /home/ubuntu/scraper/main.py >> /home/ubuntu/logs/scraper.log 2>&1
crontab syntax: minute hour day-of-month month day-of-week. Use crontab.guru to verify your expression.
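If you would rather check an expression programmatically, the third-party croniter package (an extra dependency, not in this lesson's requirements.txt) can preview upcoming run times:

from datetime import datetime

from croniter import croniter  # pip install croniter

# Preview when "30 2 * * *" will actually fire
schedule = croniter("30 2 * * *", datetime.now())
print(schedule.get_next(datetime))  # next 2:30 AM
print(schedule.get_next(datetime))  # the run after that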
Limitations of cron:
- No retry on failure.
- No alert when the job crashes (unless you configure MAILTO).
- Logs go to a file; you have to SSH in to read them.
- No easy way to pass environment variables.
For a quick VPS setup, cron is fine. For anything more serious, use systemd timers or a proper scheduler like Prefect or Airflow.
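cron's "no retry" limitation does have a lightweight workaround: retry inside the script itself. A minimal sketch using only the standard library (run_with_retries is a hypothetical helper, not part of the lesson's code):

import logging
import time

logger = logging.getLogger("books_scraper")

def run_with_retries(job, attempts=3, delay=60.0):
    # Call job(); on failure, sleep and retry with a doubling delay
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception:
            logger.exception("Attempt %d/%d failed", attempt, attempts)
            if attempt == attempts:
                raise  # out of attempts: let main() sys.exit(1)
            time.sleep(delay)
            delay *= 2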
systemd timers
systemd timers are more reliable than cron: they support restart policies, boot-time execution, and integration with journalctl.
# /etc/systemd/system/books-scraper.service
[Unit]
Description=Books scraper
After=network.target
[Service]
Type=oneshot
User=ubuntu
WorkingDirectory=/home/ubuntu/scraper
ExecStart=/home/ubuntu/venv/bin/python main.py
EnvironmentFile=/home/ubuntu/scraper/.env
StandardOutput=journal
StandardError=journal
# /etc/systemd/system/books-scraper.timer
[Unit]
Description=Run books scraper daily
[Timer]
OnCalendar=daily
# Persistent=true runs the job at the next opportunity if the machine was off at the scheduled time.
# Note: systemd does not allow inline comments after a directive, so this must be its own line.
Persistent=true
[Install]
WantedBy=timers.target
sudo systemctl enable books-scraper.timer
sudo systemctl start books-scraper.timer
journalctl -u books-scraper.service -f # follow logs in real time
GitHub Actions for free scheduled scraping
For scrapers that run daily and store results in the repo (or push to a cloud store), GitHub Actions is free and requires no server.
# .github/workflows/scrape.yml
name: Daily scrape
on:
  schedule:
    - cron: "30 2 * * *"  # 2:30 AM UTC every day
  workflow_dispatch:       # allow manual trigger

jobs:
  scrape:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    permissions:
      contents: write      # the default token is read-only on newer repos; needed for git push
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run scraper
        env:
          MAX_PAGES: "5"
          DB_PATH: "data/books.db"
        run: python scraper/main.py
      - name: Commit and push updated data
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add data/
          git diff --staged --quiet || git commit -m "chore: update scraped data $(date -u '+%Y-%m-%d')"
          git push
Caveats for GitHub Actions scrapers:
- Scheduled jobs can be delayed, sometimes by 15 minutes or more, during periods of high load.
- GitHub disables scheduled workflows on repos with no activity for 60 days.
- Actions minutes are unlimited for public repos; private repos get 2,000 free minutes/month. Either way, that is effectively unlimited for a daily 5-minute scraper.
Dockerfile for a Python scraper
FROM python:3.12-slim
WORKDIR /app
# Install system deps (lxml needs libxml2)
RUN apt-get update && apt-get install -y --no-install-recommends \
libxml2-dev libxslt-dev \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY scraper/ .
CMD ["python", "main.py"]
# docker-compose.yml
services:
  scraper:
    build: .
    environment:
      MAX_PAGES: "10"
      DB_PATH: /data/books.db
    volumes:
      - ./data:/data
    restart: "no"  # do not restart after completion
Build and run:
docker compose build
docker compose run --rm scraper
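After a run, confirm the volume mount worked by querying the database from the host; a quick check, assuming the books table from the storage lesson:

import sqlite3

# The bind mount puts /data/books.db (container) at ./data/books.db (host)
conn = sqlite3.connect("data/books.db")
count = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
print(f"{count} books in data/books.db")
conn.close()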
To run on a schedule with Docker, combine it with a cron job on the host, e.g. 0 3 * * * cd /home/ubuntu/scraper && docker compose run --rm scraper >> /home/ubuntu/logs/scraper.log 2>&1.
Deploying to a cheap VPS
Good options for a scraping VPS (in 2026):
- Hetzner Cloud: cheapest per-CPU-minute, excellent network, EU-based.
- DigitalOcean: great developer experience, easy droplets.
- Hostinger VPS: good if you already have Hostinger for hosting (shared billing).
Recommended minimal setup:
# On a fresh Ubuntu 24.04 VPS
# 1. Install Python and venv
sudo apt update && sudo apt install -y python3.12 python3.12-venv git
# 2. Clone your scraper repo
git clone https://github.com/youruser/your-scraper.git /home/ubuntu/scraper
# 3. Create virtualenv and install dependencies
cd /home/ubuntu/scraper
python3.12 -m venv venv
./venv/bin/pip install -r requirements.txt
# 4. Create the data directory and a .env with secrets
mkdir -p /home/ubuntu/data
echo "DB_PATH=/home/ubuntu/data/books.db" > .env
echo "MAX_PAGES=50" >> .env
# 5. Test it manually
./venv/bin/python main.py
# 6. Set up cron or systemd timer
Logging, monitoring, and alerting
Logging with the logging module is step one. Step two is getting notified when the scraper fails.
Simple email alert via cron MAILTO (this requires a working mail agent such as postfix on the host):
# At the top of your crontab
MAILTO=suparn@yoursite.com
For more control, send a Slack/Discord message on failure:
import os

import requests

def notify_failure(error: str):
    webhook = os.environ.get("SLACK_WEBHOOK_URL")
    if webhook:
        requests.post(webhook, json={"text": f"Scraper FAILED: {error}"}, timeout=5)
Call notify_failure(str(e)) in your except block.
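Wired into the entry point from earlier, the except block looks roughly like this (a sketch; notify_failure is the function above, and the placeholder comment stands in for your scraping logic):

def main():
    try:
        # ... your scraping logic here ...
        logger.info("Scrape completed successfully")
    except Exception as e:
        logger.exception("Scrape failed")
        notify_failure(str(e))  # alert the channel before exiting non-zero
        sys.exit(1)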
Hands-on
Here is a complete production-ready scraper layout with all the pieces:
your-scraper/
  scraper/
    __init__.py
    main.py              ← entry point
    scrape.py            ← scraping logic
    store.py             ← SQLite storage
  data/
    .gitkeep
  Dockerfile
  docker-compose.yml
  requirements.txt
  .env.example           ← template, committed; .env is in .gitignore
  .github/
    workflows/
      scrape.yml
requirements.txt:
requests==2.32.3
beautifulsoup4==4.12.3
lxml==5.2.2
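main.py imports crawl_books from scrape.py. For reference, a minimal sketch of that interface, assuming the books.toscrape.com practice site from earlier lessons (your Lesson 5 version has the full parsing):

# scraper/scrape.py (sketch)
import logging

import requests
from bs4 import BeautifulSoup

logger = logging.getLogger("scrape")
PAGE_URL = "https://books.toscrape.com/catalogue/page-{}.html"

def crawl_books(max_pages: int = 10) -> tuple[list[dict], int]:
    books: list[dict] = []
    pages = 0
    for page in range(1, max_pages + 1):
        resp = requests.get(PAGE_URL.format(page), timeout=10)
        if resp.status_code == 404:  # walked past the last page
            break
        resp.raise_for_status()
        soup = BeautifulSoup(resp.text, "lxml")
        for article in soup.select("article.product_pod"):
            books.append({
                "title": article.h3.a["title"],
                "price": article.select_one("p.price_color").get_text(),
            })
        pages += 1
        logger.info("Page %d done, %d books so far", page, len(books))
    return books, pages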
scraper/main.py, the complete entry point:
import logging
import os
import sqlite3
import sys

from scrape import crawl_books
from store import init_db, upsert_books, start_run, finish_run

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s, %(message)s",
)
logger = logging.getLogger("main")

def main():
    db_path = os.environ.get("DB_PATH", "data/books.db")
    max_pages = int(os.environ.get("MAX_PAGES", "10"))
    logger.info(f"DB={db_path}, MAX_PAGES={max_pages}")
    conn = sqlite3.connect(db_path)
    init_db(conn)
    run_id = start_run(conn, "books")
    try:
        books, pages = crawl_books(max_pages=max_pages)
        upsert_books(conn, books)
        finish_run(conn, run_id, pages=pages, items=len(books))
        logger.info(f"Done. {len(books)} books, {pages} pages.")
    except Exception:
        logger.exception("Scrape failed")
        sys.exit(1)
    finally:
        conn.close()

if __name__ == "__main__":
    main()
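start_run, finish_run, and the upsert come from Lesson 8. For reference, a minimal sketch of the store.py interface main.py expects; the schema here is an assumption, your Lesson 8 version is authoritative:

# scraper/store.py (sketch; schema is illustrative)
import sqlite3
from datetime import datetime, timezone

def init_db(conn: sqlite3.Connection) -> None:
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS books (
            title TEXT PRIMARY KEY,
            price TEXT
        );
        CREATE TABLE IF NOT EXISTS runs (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            source TEXT,
            started_at TEXT,
            finished_at TEXT,
            pages INTEGER,
            items INTEGER
        );
    """)

def start_run(conn: sqlite3.Connection, source: str) -> int:
    cur = conn.execute(
        "INSERT INTO runs (source, started_at) VALUES (?, ?)",
        (source, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.lastrowid

def upsert_books(conn: sqlite3.Connection, books: list[dict]) -> None:
    # Idempotent write: re-running updates prices instead of duplicating rows
    conn.executemany(
        "INSERT INTO books (title, price) VALUES (:title, :price) "
        "ON CONFLICT(title) DO UPDATE SET price = excluded.price",
        books,
    )
    conn.commit()

def finish_run(conn: sqlite3.Connection, run_id: int, pages: int, items: int) -> None:
    conn.execute(
        "UPDATE runs SET finished_at = ?, pages = ?, items = ? WHERE id = ?",
        (datetime.now(timezone.utc).isoformat(), pages, items, run_id),
    )
    conn.commit()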
Common pitfalls
- Running cron without absolute paths. cron does not inherit your shell's PATH. Always use full paths: /home/ubuntu/venv/bin/python, not just python.
- Not redirecting stderr in cron. >> scraper.log 2>&1 captures both stdout and stderr. Without 2>&1, exceptions and tracebacks go to /dev/null (or an email to root you will never read).
- GitHub Actions scheduled workflows silently disabled. GitHub disables scheduled workflows in repos with no recent activity. If your scraper stops running, push a commit to re-enable it.
- Writing to a relative path inside Docker. If your script writes to ./books.db inside a container and you have not mounted a volume, the file disappears when the container is removed. Always mount a volume for data files.
- Not testing locally before deploying. Run docker compose run --rm scraper locally before pushing to a VPS. A 2-minute test saves an hour of SSH debugging.
- Forgetting sys.exit(1) on failure. Without it, a script whose except block swallows the error still exits with code 0 (success), so cron, GitHub Actions, and systemd will all think the job succeeded and never alert you.
What to try next
- Deploy the scraper to a free GitHub Actions runner using the scrape.yml workflow above. Trigger it manually with workflow_dispatch and check the Actions log.
- Add a Slack webhook notification: on failure, post the error message to a #alerts channel. Test it by intentionally raising an exception at the start of main().
- Add a --dry-run command-line flag using argparse (a starter sketch follows below). When set, the scraper fetches and parses but does not write to the database. Use this for CI checks that verify the scraper still works without polluting production data.
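For the last exercise, a starting point for the flag parsing; wiring args.dry_run into main() is left to you, as the trailing comment hints:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="Books scraper")
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="fetch and parse, but skip all database writes",
    )
    return parser.parse_args()

# Inside main(), guard the write path:
#     if not args.dry_run:
#         upsert_books(conn, books)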