Lesson 7 of 10 · 7 min read

Neural Networks


What you'll build

You will implement a two-layer neural network from scratch using only NumPy: the forward pass, loss computation, and backpropagation. You will then train it to classify two overlapping spirals. This makes the maths concrete before you switch to a framework in Lesson 8.


Concepts

The perceptron

The perceptron is the oldest neural unit. It computes a weighted sum of inputs, adds a bias, applies a step function, and outputs 0 or 1.

import numpy as np

def perceptron(x, w, b):
    return 1 if (np.dot(w, x) + b) >= 0 else 0

# AND gate
w = np.array([1.0, 1.0])
b = -1.5

print(perceptron([0, 0], w, b))  # 0
print(perceptron([0, 1], w, b))  # 0
print(perceptron([1, 0], w, b))  # 0
print(perceptron([1, 1], w, b))  # 1

The perceptron can only learn linearly separable problems. XOR, for example, cannot be solved by a single perceptron. Solving it requires multiple layers, hence the multilayer perceptron (MLP).
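
To make that concrete: XOR can be computed by stacking gates a perceptron can learn, since XOR(x1, x2) = AND(OR(x1, x2), NAND(x1, x2)). A minimal sketch reusing the perceptron function above, with hand-picked (not learned) weights:

# XOR as two perceptron "layers": AND applied to OR and NAND
def xor(x):
    or_out   = perceptron(x, np.array([1.0, 1.0]), -0.5)    # OR gate
    nand_out = perceptron(x, np.array([-1.0, -1.0]), 1.5)   # NAND gate
    return perceptron([or_out, nand_out], np.array([1.0, 1.0]), -1.5)  # AND gate

for x in [[0, 0], [0, 1], [1, 0], [1, 1]]:
    print(x, xor(x))  # 0, 1, 1, 0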

Multilayer perceptron (MLP)

An MLP stacks layers of neurons. Each layer transforms its input through a linear operation followed by a non-linear activation. The non-linearity is what gives neural networks their power: without it, stacking linear layers is equivalent to a single linear layer (verified numerically in the sketch below).

Input -> [Linear + Activation] -> [Linear + Activation] -> ... -> [Linear] -> Output

In matrix form for one layer: h = activation(W @ x + b)

where W has shape (neurons_out, neurons_in), x has shape (neurons_in,), b has shape (neurons_out,).
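
You can check the collapse claim numerically; a minimal sketch with no activation between two random linear layers:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)
x = rng.normal(size=2)

two_layers = W2 @ (W1 @ x + b1) + b2           # no activation in between
one_layer  = (W2 @ W1) @ x + (W2 @ b1 + b2)    # single equivalent linear layer
print(np.allclose(two_layers, one_layer))       # True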

Activation functions

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 200)

def sigmoid(x):  return 1 / (1 + np.exp(-x))
def tanh(x):     return np.tanh(x)
def relu(x):     return np.maximum(0, x)
def leaky_relu(x, alpha=0.1): return np.where(x >= 0, x, alpha * x)

fig, axes = plt.subplots(1, 4, figsize=(14, 4))
for ax, (name, fn) in zip(axes, [
        ("Sigmoid", sigmoid), ("Tanh", tanh),
        ("ReLU", relu), ("Leaky ReLU", leaky_relu)]):
    ax.plot(x, fn(x))
    ax.set_title(name)
    ax.axhline(0, color="gray", linewidth=0.5)
    ax.axvline(0, color="gray", linewidth=0.5)

plt.tight_layout()
plt.savefig("activations.png", dpi=150)

  • Sigmoid: outputs (0, 1), used in binary output layers. Suffers from vanishing gradients in deep networks.
  • Tanh: outputs (-1, 1), zero-centred so gradients are better than sigmoid. Still vanishes in deep networks.
  • ReLU: outputs max(0, x). Fast, and its gradient does not vanish for positive inputs. Neurons can "die" if they always output 0 (the dead ReLU problem).
  • Leaky ReLU: fixes dying ReLU by allowing a small gradient for negative inputs.

Modern default: use ReLU (or its variants) for hidden layers, sigmoid for binary output, softmax for multi-class output.
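
Softmax is only named above, so here is a minimal sketch for completeness; subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the result:

def softmax(z):
    z = z - np.max(z)        # numerical stability: avoids overflow in exp
    e = np.exp(z)
    return e / e.sum()       # outputs are positive and sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx [0.66, 0.24, 0.10]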

The forward pass

A forward pass is the process of computing the output given an input. For a two-layer network:

def relu(x):     return np.maximum(0, x)
def sigmoid(x):  return 1 / (1 + np.exp(-x))

# Random initialisation
np.random.seed(42)
W1 = np.random.randn(4, 2) * 0.1   # hidden layer: 4 neurons, 2 inputs
b1 = np.zeros(4)
W2 = np.random.randn(1, 4) * 0.1   # output layer: 1 neuron, 4 inputs
b2 = np.zeros(1)

def forward(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1        # (4,)
    a1 = relu(z1)            # (4,)
    z2 = W2 @ a1 + b2       # (1,)
    a2 = sigmoid(z2)         # (1,), probability
    return a2, a1, z1        # return intermediates for backprop

x = np.array([0.5, -0.3])
output, hidden, pre_hidden = forward(x, W1, b1, W2, b2)
print(f"Output probability: {output[0]:.4f}")

Loss functions

The loss function measures how wrong the predictions are. The network minimises this during training.

Binary cross-entropy (for binary classification):

Loss = -(y * log(y_hat) + (1 - y) * log(1 - y_hat))

Categorical cross-entropy (for multi-class classification):

Loss = -sum(y_k * log(y_hat_k)) for each class k

Mean Squared Error (for regression):

Loss = mean((y - y_hat)^2)

def binary_cross_entropy(y, y_hat, eps=1e-9):
    y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Example
y_true = 1
y_pred_good = 0.95
y_pred_bad  = 0.1
print(f"Good prediction loss: {binary_cross_entropy(y_true, y_pred_good):.4f}")
print(f"Bad prediction loss:  {binary_cross_entropy(y_true, y_pred_bad):.4f}")

Backpropagation intuition

Backpropagation computes how much each weight contributed to the loss by applying the chain rule from the output back to the input. For each weight w, you compute the gradient dL/dw: how much the loss changes if you nudge w slightly.

The update rule (gradient descent): w = w - learning_rate * dL/dw

Forward:  Input → Layer 1 → Layer 2 → Loss
Backward: Loss → dL/dW2 → dL/dW1  (chain rule)
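
One way to trust a hand-derived gradient is to compare it against a finite-difference estimate. For a sigmoid output with binary cross-entropy, the combined gradient simplifies to dL/dz = a - y, the shortcut the training loop below relies on; a minimal check:

def bce_from_logit(y, z):
    p = 1 / (1 + np.exp(-z))     # sigmoid
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y, h = 0.7, 1.0, 1e-6
numeric  = (bce_from_logit(y, z + h) - bce_from_logit(y, z - h)) / (2 * h)
analytic = 1 / (1 + np.exp(-z)) - y        # a - y
print(f"{numeric:.8f} vs {analytic:.8f}")  # the two agree to ~1e-9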

Hands-on

Full two-layer neural network from scratch on the spiral dataset.

import numpy as np
import matplotlib.pyplot as plt

# --- Spiral dataset ---
def make_spiral(n_per_class=100, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X, y = [], []
    for cls in range(2):
        t = np.linspace(0, 1, n_per_class)
        angle = t * 3 * np.pi + cls * np.pi
        r = t
        X.append(np.stack([r * np.cos(angle), r * np.sin(angle)], axis=1))
        y.extend([cls] * n_per_class)
    X = np.vstack(X) + rng.normal(0, noise, (2 * n_per_class, 2))
    y = np.array(y)
    return X, y

X, y = make_spiral(n_per_class=150, noise=0.15, seed=42)

# --- Activation functions and their derivatives ---
def relu(x):         return np.maximum(0, x)
def relu_deriv(x):   return (x > 0).astype(float)
def sigmoid(x):      return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

# --- Parameter initialisation (He init suits ReLU layers) ---
rng = np.random.default_rng(99)
H = 16   # hidden neurons
W1 = rng.normal(0, np.sqrt(2 / 2),  (H, 2))  # He init
b1 = np.zeros(H)
W2 = rng.normal(0, np.sqrt(2 / H),  (1, H))
b2 = np.zeros(1)

lr = 0.05
losses = []

# --- Training loop ---
for epoch in range(2000):
    # --- Forward ---
    z1  = (W1 @ X.T).T + b1   # (N, H)
    a1  = relu(z1)             # (N, H)
    z2  = (W2 @ a1.T).T + b2  # (N, 1)
    a2  = sigmoid(z2).ravel()  # (N,)

    # --- Loss ---
    eps  = 1e-9
    a2c  = np.clip(a2, eps, 1 - eps)
    loss = -np.mean(y * np.log(a2c) + (1 - y) * np.log(1 - a2c))
    losses.append(loss)

    # --- Backward ---
    N = len(y)
    dL_dz2 = (a2 - y) / N                # (N,)  gradient of mean BCE w.r.t. z2
                                         #       (sigmoid derivative absorbed)

    dL_dW2 = dL_dz2[None, :] @ a1        # (1, H); the 1/N is already in dL_dz2
    dL_db2 = dL_dz2.sum()

    dL_da1 = dL_dz2[:, None] @ W2        # (N, H)
    dL_dz1 = dL_da1 * relu_deriv(z1)     # (N, H)
    dL_dW1 = dL_dz1.T @ X                # (H, 2)
    dL_db1 = dL_dz1.sum(axis=0)          # (H,)

    # --- Update ---
    W2 -= lr * dL_dW2
    b2 -= lr * dL_db2
    W1 -= lr * dL_dW1
    b1 -= lr * dL_db1

    if (epoch + 1) % 500 == 0:
        pred = (a2 >= 0.5).astype(int)
        acc  = (pred == y).mean()
        print(f"Epoch {epoch+1:4d}  Loss: {loss:.4f}  Acc: {acc:.3f}")

# --- Loss curve ---
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(losses)
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("Loss")
axes[0].set_title("Training Loss")

# --- Decision boundary ---
xx, yy = np.meshgrid(np.linspace(-1.5, 1.5, 200), np.linspace(-1.5, 1.5, 200))
grid = np.c_[xx.ravel(), yy.ravel()]
z1g  = (W1 @ grid.T).T + b1
a1g  = relu(z1g)
z2g  = (W2 @ a1g.T).T + b2
proba = sigmoid(z2g).ravel()

axes[1].contourf(xx, yy, proba.reshape(xx.shape), levels=20, cmap="RdBu_r", alpha=0.6)
axes[1].scatter(X[:, 0], X[:, 1], c=y, cmap="RdBu_r", edgecolors="k", s=20)
axes[1].set_title("Decision Boundary")

plt.tight_layout()
plt.savefig("nn_scratch.png", dpi=150)
print("Saved nn_scratch.png")

Common pitfalls

Not normalising inputs. Neural networks are sensitive to input scale. If features have very different magnitudes, gradients can explode or vanish. Always standardise inputs, as in the sketch below.
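
A minimal standardisation sketch (X_train and X_test are placeholder arrays; fit the statistics on the training set only):

mu, sd  = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mu) / sd
X_test  = (X_test - mu) / sd   # reuse training statistics; never refit on test data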

Zero-initialising all weights. If all weights start at zero, all neurons in a layer produce identical outputs and receive identical gradients, so the network never learns different features. Use small random initialisation.

Learning rate too large. Loss oscillates wildly or diverges. Start with 0.01 and reduce if needed.

Forgetting to clip probabilities before log. log(0) is negative infinity. Always use np.clip(p, 1e-9, 1 - 1e-9) in loss functions.

Reporting training accuracy only. Training accuracy can approach 100% even for a model that has merely memorised the data. Always monitor loss on held-out validation data as well.


What to try next

  • Add a third hidden layer and compare how it affects convergence speed and final accuracy.
  • Implement a momentum or Adam update rule on top of this from-scratch network.
  • Move on to Lesson 8 (Deep Learning Frameworks), where PyTorch handles all the gradient bookkeeping for you.
