Lesson 8 of 10 · 6 min read

Deep Learning Frameworks


What you'll build

You will build a small Convolutional Neural Network (CNN) in PyTorch that recognises handwritten digits from the MNIST dataset. By the end you will have a saved model file and working single-image prediction code.


Concepts

PyTorch tensors

Tensors are PyTorch's equivalent of NumPy arrays. They can live on a GPU and support automatic differentiation.

import torch

# Creating tensors
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.zeros(3, 4)          # 3x4 zeros
c = torch.randn(2, 3)          # random normal

print(a.shape)                  # torch.Size([3])
print(b.dtype)                  # torch.float32

# Tensor ops, same as NumPy
x = torch.arange(6, dtype=torch.float32).reshape(2, 3)
y = x @ x.T                    # matrix multiply
print(y)

# Convert between NumPy and PyTorch
import numpy as np
arr = np.array([1.0, 2.0, 3.0])
t   = torch.from_numpy(arr)     # shares memory!
arr2 = t.numpy()

# Device
device = "cuda" if torch.cuda.is_available() else "cpu"
t_gpu = t.to(device)
print(f"Using device: {device}")

Autograd, automatic differentiation

Autograd tracks operations on tensors and computes gradients automatically. You mark tensors that need gradients with requires_grad=True.

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2 + 2 * x + 1   # y = (x+1)^2

y.backward()               # compute dy/dx

print(x.grad)              # 2x + 2 evaluated at x=3 -> 8.0

For a network, PyTorch builds a computation graph during the forward pass. Calling loss.backward() propagates gradients through this graph. optimizer.step() then updates the weights.
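
A minimal sketch of the same three steps on a single scalar weight, using plain SGD so the update is easy to verify by hand:

import torch

w   = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

loss = (w - 3.0) ** 2     # forward pass builds the graph
loss.backward()           # dL/dw = 2(w - 3) = -4 at w = 1
print(w.grad)             # tensor(-4.)

opt.step()                # w <- w - lr * grad = 1 - 0.1 * (-4) = 1.4
print(w)                  # tensor(1.4000, requires_grad=True)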

nn.Module, defining a model

All PyTorch models inherit from nn.Module. You define layers in __init__ and the forward pass in forward.

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)

model = MLP(input_dim=784, hidden_dim=128, output_dim=10)
print(model)

# Count parameters
total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")

Optimizers and the training loop

The training loop: forward pass, compute loss, backward pass, update weights. Repeat.

import torch.optim as optim

model     = MLP(784, 128, 10)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Dummy data, replace with your DataLoader
X_dummy = torch.randn(64, 784)
y_dummy = torch.randint(0, 10, (64,))

for epoch in range(3):
    optimizer.zero_grad()           # 1. clear old gradients
    logits = model(X_dummy)         # 2. forward pass
    loss   = criterion(logits, y_dummy)  # 3. compute loss
    loss.backward()                 # 4. backward pass
    optimizer.step()                # 5. update weights

    print(f"Epoch {epoch+1}  Loss: {loss.item():.4f}")

optimizer.zero_grad() must be called before each backward pass, otherwise gradients accumulate across batches.
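
You can watch the accumulation happen on a single scalar (a minimal sketch):

import torch

x = torch.tensor(2.0, requires_grad=True)
for step in range(3):
    y = x ** 2
    y.backward()
    print(x.grad)   # tensor(4.), tensor(8.), tensor(12.) -- each pass adds 2x = 4
    # optimizer.zero_grad() (or x.grad = None) would reset this between steps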

Saving and loading models

# Save
torch.save(model.state_dict(), "model.pth")

# Load
loaded = MLP(784, 128, 10)
loaded.load_state_dict(torch.load("model.pth"))
loaded.eval()   # switch to evaluation mode (affects dropout, batch norm)
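
One detail worth knowing: if the checkpoint was saved on a GPU machine and you load it on a CPU-only one, pass map_location (a small sketch, continuing from the model.pth file above):

# Load GPU-trained weights on a CPU-only machine
state = torch.load("model.pth", map_location="cpu")
loaded.load_state_dict(state)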

Hands-on

Build and train a CNN on MNIST. A CNN uses convolutional layers to learn spatial features, which makes it far more parameter-efficient than a flat MLP for image data.

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
import matplotlib.pyplot as plt

# --- Hyperparameters ---
BATCH_SIZE = 128
EPOCHS     = 5
LR         = 1e-3
DEVICE     = "cuda" if torch.cuda.is_available() else "cpu"

# --- Data ---
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))   # MNIST mean and std
])

train_data = datasets.MNIST("./data", train=True,  download=True, transform=transform)
test_data  = datasets.MNIST("./data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
test_loader  = DataLoader(test_data,  batch_size=BATCH_SIZE, shuffle=False)

# --- CNN model ---
class SmallCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # 28x28 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1), # 14x14 -> 14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                # 32 * 7 * 7 = 1568
            nn.Linear(32 * 7 * 7, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model     = SmallCNN().to(DEVICE)
optimizer = optim.Adam(model.parameters(), lr=LR)
criterion = nn.CrossEntropyLoss()

print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")

# --- Training ---
train_losses, test_accs = [], []

for epoch in range(EPOCHS):
    model.train()
    total_loss = 0
    for X_batch, y_batch in train_loader:
        X_batch, y_batch = X_batch.to(DEVICE), y_batch.to(DEVICE)

        optimizer.zero_grad()
        logits = model(X_batch)
        loss   = criterion(logits, y_batch)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    train_losses.append(avg_loss)

    # --- Evaluation ---
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            X_batch, y_batch = X_batch.to(DEVICE), y_batch.to(DEVICE)
            preds   = model(X_batch).argmax(dim=1)
            correct += (preds == y_batch).sum().item()
            total   += len(y_batch)

    acc = correct / total
    test_accs.append(acc)
    print(f"Epoch {epoch+1}/{EPOCHS}  Loss: {avg_loss:.4f}  Test Acc: {acc:.4f}")

# --- Save ---
torch.save(model.state_dict(), "mnist_cnn.pth")
print("Saved mnist_cnn.pth")

# --- Plots ---
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].plot(train_losses, "o-")
axes[0].set_title("Training Loss")
axes[0].set_xlabel("Epoch")

axes[1].plot(test_accs, "s-", color="green")
axes[1].set_title("Test Accuracy")
axes[1].set_xlabel("Epoch")
axes[1].set_ylim(0.9, 1.0)

plt.tight_layout()
plt.savefig("cnn_training.png", dpi=150)

# --- Predict on a single image ---
model.eval()
sample_img, sample_label = test_data[0]
with torch.no_grad():
    pred = model(sample_img.unsqueeze(0).to(DEVICE)).argmax(dim=1).item()
print(f"\nSample: true={sample_label}  predicted={pred}")

A few things to note:

  • Conv2d(1, 16, ...) means 1 input channel (grayscale) and 16 output filters.
  • MaxPool2d(2) halves the spatial dimensions, reducing computation; the shape check after this list confirms the final 7x7 feature maps.
  • Dropout(0.3) zeroes each activation with probability 0.3 during training, a regularisation technique that reduces overfitting.
  • model.eval() switches layers like dropout and batch norm to inference behaviour, while torch.no_grad() disables gradient tracking, saving memory and making inference faster.
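
To confirm the shape arithmetic in the comments above, push a dummy image through the feature extractor (a quick sanity check, assuming the SmallCNN class from the hands-on code):

dummy = torch.randn(1, 1, 28, 28)   # one grayscale 28x28 image
feats = SmallCNN().features(dummy)
print(feats.shape)                  # torch.Size([1, 32, 7, 7]) -> 32*7*7 = 1568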

Common pitfalls

Forgetting optimizer.zero_grad(). Gradients accumulate by default in PyTorch. Without zeroing them, each backward pass adds to the previous one.

Not switching between model.train() and model.eval(). Dropout and BatchNorm behave differently in train vs eval mode. Forgetting model.eval() before evaluation gives noisy results.
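
You can see the difference directly with a standalone dropout layer (a minimal sketch):

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # roughly half the entries zeroed, survivors scaled to 2.0

drop.eval()
print(drop(x))   # identity: all ones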

Moving tensors to different devices. You cannot operate on a CPU tensor and a GPU tensor together. Always move both the model and the data to the same device.
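
A short demonstration of the failure mode (only reproducible on a machine with CUDA; the exact error text may vary between PyTorch versions):

import torch

if torch.cuda.is_available():
    a = torch.ones(3)                 # lives on the CPU
    b = torch.ones(3, device="cuda")  # lives on the GPU
    try:
        a + b                         # mixing devices raises a RuntimeError
    except RuntimeError as e:
        print(e)
    print(a.to("cuda") + b)           # fine once both are on the same device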

Using loss.item() vs loss. loss.item() extracts a plain Python float and detaches from the computation graph. Use it when logging. If you accidentally keep a reference to loss (a tensor) in a list, it keeps the entire computation graph in memory.
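
The safe logging pattern looks like this (a sketch on a standalone scalar loss):

import torch

history = []
w = torch.tensor(1.0, requires_grad=True)
loss = (w - 3.0) ** 2
history.append(loss.item())   # plain float; the graph can be freed
# history.append(loss)        # don't: the tensor keeps the whole graph alive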

Normalising with test-set statistics. Compute the mean and std on the training set only. For MNIST, the canonical values (0.1307, 0.3081) come from the training set.
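
You can reproduce those canonical values yourself (a sketch; it loads the full 60k-image training set into memory, which is fine for MNIST):

import torch
from torchvision import datasets, transforms

raw  = datasets.MNIST("./data", train=True, download=True,
                      transform=transforms.ToTensor())
imgs = torch.stack([img for img, _ in raw])    # shape (60000, 1, 28, 28)
print(imgs.mean().item(), imgs.std().item())   # ~0.1307, ~0.3081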


What to try next

  • Add BatchNorm layers (nn.BatchNorm2d) between conv and activation and observe the effect on training stability.
  • Try a learning rate scheduler (optim.lr_scheduler.StepLR) to reduce the LR every few epochs.
  • Swap MNIST for FashionMNIST (same format, harder task) by changing one word in the datasets.MNIST call.
  • Move on to Lesson 9 (NLP and LLMs), where you will work with text instead of images.
