Lesson 2 of 10
6 min read

Maths Foundations for ML


What you'll build

You will implement matrix multiplication from scratch using NumPy, compute gradients manually, and visualise common probability distributions, all while connecting each concept to where it appears in real ML models.


Concepts

Vectors and dot products

A vector is an ordered list of numbers. In ML, a single data point is almost always a vector: the feature vector. If a house has 3 features (size, bedrooms, age), it is represented as a vector with 3 components.
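
For example, that house might be encoded as a short NumPy array (the numbers here are made up purely for illustration):

import numpy as np

# size (sqft), bedrooms, age (years) -- illustrative values only
house = np.array([1500.0, 3.0, 20.0])
print(house.shape)  # (3,)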

The dot product of two vectors multiplies corresponding elements and sums them up:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

dot = np.dot(a, b)          # 1*4 + 2*5 + 3*6 = 32.0
also_dot = (a * b).sum()    # same result

print(dot)  # 32.0

In linear regression, the prediction is exactly a dot product: y_hat = w . x + bias. Every forward pass in a neural network is a long chain of dot products.
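
As a minimal sketch, with made-up weights and bias for the house example above:

import numpy as np

w    = np.array([0.3, 5.0, -0.2])     # hypothetical learned weights
x    = np.array([1500.0, 3.0, 20.0])  # features: size, bedrooms, age
bias = 10.0

y_hat = np.dot(w, x) + bias           # prediction = dot product + bias
print(y_hat)  # 471.0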

Matrices and matrix multiplication

A matrix is a 2D array. Matrix multiplication chains transformations. If A has shape (m, k) and B has shape (k, n), then A @ B has shape (m, n).

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])  # shape (3, 2)

B = np.array([[7, 8, 9],
              [10, 11, 12]])  # shape (2, 3)

C = A @ B   # shape (3, 3)
print(C)
# [[ 27  30  33]
#  [ 61  68  75]
#  [ 95 106 117]]

Each element C[i][j] is the dot product of row i of A and column j of B. This is the core operation in neural network layers: output = activation(W @ x + b).
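
As an illustration (the layer size, random weights, and ReLU activation are arbitrary choices, not something from this lesson's models):

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))       # layer mapping 3 inputs to 4 outputs
b = np.zeros(4)                   # bias vector
x = np.array([1.0, 2.0, 3.0])     # input feature vector

output = np.maximum(0, W @ x + b)   # ReLU(W @ x + b)
print(output.shape)  # (4,)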

Eigenvalues and SVD intuition

Eigenvalues describe how a matrix stretches or compresses space along particular directions called eigenvectors. If A v = lambda v, then v is an eigenvector and lambda is its eigenvalue.

M = np.array([[3.0, 1.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(M)
print("Eigenvalues:", eigenvalues)    # [4. 2.]
print("Eigenvectors:\n", eigenvectors)

PCA (Principal Component Analysis) is literally finding the eigenvectors of the covariance matrix. The eigenvector with the largest eigenvalue is the direction of greatest variance.
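
Here is a small sketch of that idea on synthetic correlated data (the data itself is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 2 * x1 + rng.normal(scale=0.5, size=500)   # second feature tracks the first
data = np.column_stack([x1, x2])                # shape (500, 2)

cov = np.cov(data, rowvar=False)                # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eig(cov)

order = np.argsort(eigenvalues)[::-1]           # largest eigenvalue first
print("Direction of greatest variance:", eigenvectors[:, order[0]])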

Singular Value Decomposition (SVD) generalises this to non-square matrices: M = U S V^T. The diagonal of S holds the singular values, sorted from largest to smallest; the top few capture most of the information. This low-rank idea underpins image compression and many recommendation systems.
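
A quick look with NumPy, keeping only the largest singular value as a toy rank-1 "compression" (the matrix is arbitrary):

import numpy as np

M = np.array([[3.0, 1.0, 1.0],
              [1.0, 3.0, 1.0]])       # non-square, shape (2, 3)

U, S, Vt = np.linalg.svd(M, full_matrices=False)
print("Singular values:", S)          # sorted largest to smallest

M_rank1 = S[0] * np.outer(U[:, 0], Vt[0, :])   # rank-1 approximation of M
print(M_rank1)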

Derivatives and gradient intuition

A derivative tells you the slope of a function at a point: if you nudge the input by a tiny amount, how much does the output change?

For f(x) = x^2, the derivative is f'(x) = 2x. At x = 3, the slope is 6. If you increase x by a small step h, f increases by roughly 6h.
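
You can sanity-check that numerically with a finite difference (a small addition, not part of the plot below):

x0 = 3.0
h  = 1e-5
numerical_slope = ((x0 + h) ** 2 - x0 ** 2) / h
print(numerical_slope)  # approximately 6.0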

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 3, 300)
y = x**2
dy = 2 * x   # analytical derivative

fig, ax = plt.subplots()
ax.plot(x, y, label="f(x) = x^2")
ax.plot(x, dy, label="f'(x) = 2x", linestyle="--")
ax.axhline(0, color="black", linewidth=0.5)
ax.legend()
plt.title("Function and its derivative")
plt.tight_layout()
plt.savefig("derivative.png", dpi=150)

In ML, gradient descent uses derivatives (or gradients, the multi-dimensional version) to walk downhill on the loss surface. The gradient points in the direction of steepest increase; you subtract it to decrease the loss.
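
A minimal sketch of that idea on f(x) = x^2 (the starting point and learning rate are arbitrary):

x  = 5.0     # arbitrary starting point
lr = 0.1     # learning rate

for step in range(50):
    grad = 2 * x        # derivative of x**2 at the current point
    x = x - lr * grad   # step against the gradient

print(x)  # close to 0, the minimum of x**2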

Probability distributions you'll meet again

Normal (Gaussian). Bell curve, characterised by mean mu and standard deviation sigma. Many ML models assume errors are normally distributed.

Bernoulli / Binomial. Binary outcomes. Logistic regression models the probability of class 1 as a Bernoulli probability.

Uniform. Flat probability over an interval; used in some random weight initialisation schemes.

from scipy import stats
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

x_norm = np.linspace(-4, 4, 200)
axes[0].plot(x_norm, stats.norm.pdf(x_norm, loc=0, scale=1))
axes[0].set_title("Normal(0, 1)")

n, p = 20, 0.4
k = np.arange(0, n + 1)
axes[1].bar(k, stats.binom.pmf(k, n, p), color="steelblue")
axes[1].set_title(f"Binomial(n={n}, p={p})")

x_unif = np.linspace(-0.5, 1.5, 200)
axes[2].plot(x_unif, stats.uniform.pdf(x_unif, loc=0, scale=1))
axes[2].set_title("Uniform(0, 1)")

plt.tight_layout()
plt.savefig("distributions.png", dpi=150)

Hands-on

Let us implement linear regression with pure matrix maths (no scikit-learn) to make the linear algebra concrete.

The closed-form solution is called the Normal Equation: w = (X^T X)^{-1} X^T y

import numpy as np
import matplotlib.pyplot as plt

# --- Synthetic data ---
rng = np.random.default_rng(42)
n = 100
X_raw = rng.uniform(0, 10, size=n)       # feature: house size
noise  = rng.normal(0, 2, size=n)
y      = 3.5 * X_raw + 10 + noise        # true: slope=3.5, intercept=10

# --- Add bias column (column of ones) ---
X = np.column_stack([np.ones(n), X_raw]) # shape (100, 2)

# --- Normal Equation ---
# w = (X^T X)^{-1} X^T y
# Use linalg.solve instead of inv for numerical stability
XtX = X.T @ X          # (2, 2)
Xty = X.T @ y          # (2,)
w   = np.linalg.solve(XtX, Xty)

print(f"Intercept: {w[0]:.3f}  (true: 10)")
print(f"Slope:     {w[1]:.3f}  (true: 3.5)")

# --- Predictions and R-squared ---
y_hat    = X @ w
residuals = y - y_hat
sse      = (residuals ** 2).sum()
sst      = ((y - y.mean()) ** 2).sum()
r_sq     = 1 - sse / sst
print(f"R-squared: {r_sq:.4f}")

# --- Plot ---
fig, ax = plt.subplots()
ax.scatter(X_raw, y, alpha=0.5, label="data")
ax.plot(X_raw, y_hat, color="red",
        label=f"fit: y = {w[1]:.2f}x + {w[0]:.2f}")
ax.set_xlabel("Size (100 sqft)")
ax.set_ylabel("Price (lakhs)")
ax.legend()
plt.tight_layout()
plt.savefig("normal_equation_fit.png", dpi=150)
print("Saved normal_equation_fit.png")

What is happening here:

  • np.column_stack([np.ones(n), X_raw]) adds the bias column so the intercept is learned the same way as any other weight, a standard trick.
  • np.linalg.solve(A, b) is more numerically stable than np.linalg.inv(A) @ b. Prefer solve whenever you need A^{-1} b.
  • R-squared measures how much variance in y your model explains. 1.0 is perfect; 0.0 means you are no better than predicting the mean.

Common pitfalls

Inverting matrices when you should solve. np.linalg.inv(A) @ b amplifies numerical errors, especially for nearly-singular matrices. Always use np.linalg.solve(A, b).
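
For example (the nearly-singular matrix here is contrived just to show the usage):

import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0000001]])     # nearly singular, illustrative only
b = np.array([2.0, 2.0000001])

x_solve = np.linalg.solve(A, b)      # preferred: solves A x = b directly
x_inv   = np.linalg.inv(A) @ b       # forms the inverse explicitly, less stable
print(x_solve, x_inv)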

Confusing row vectors and column vectors. In NumPy, np.array([1, 2, 3]) has shape (3,); it is neither a row nor a column vector. If you need a column vector, use .reshape(-1, 1). Shape mismatches in matrix products are a very common bug.
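
For instance:

import numpy as np

v = np.array([1, 2, 3])
print(v.shape)                   # (3,)  -- neither row nor column
print(v.reshape(-1, 1).shape)    # (3, 1) -- explicit column vector
print(v.reshape(1, -1).shape)    # (1, 3) -- explicit row vector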

Ignoring numerical scale. If one feature is in the thousands and another is in fractions, gradient descent converges poorly. Always scale features before applying gradient-based algorithms.
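
A common fix is standardisation, sketched here with plain NumPy (the numbers are made up):

import numpy as np

X = np.array([[2000.0, 0.3],
              [1500.0, 0.7],
              [3000.0, 0.1]])    # one feature in the thousands, one in fractions

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)   # zero mean, unit variance per column
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))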

Forgetting log probabilities. If you multiply 1000 small probabilities you get numerical zero. Work in log space (log(p1) + log(p2) + ...) when implementing loss functions.
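
A quick illustration of the underflow and the log-space fix:

import numpy as np

p = np.full(1000, 0.01)    # 1000 small probabilities

print(np.prod(p))          # 0.0 -- underflows to exactly zero
print(np.log(p).sum())     # about -4605.2, still a perfectly usable number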

Conflating correlation with causation. Two features being correlated does not mean one causes the other. Maths gives you correlation; domain knowledge gives you causation.


What to try next

  • Implement gradient descent for the same linear regression problem instead of using the Normal Equation and compare results.
  • Read about L1 and L2 norms; they appear in Lasso and Ridge regularisation.
  • Look up the covariance matrix and trace how PCA is derived from its eigenvectors.
  • Move on to Lesson 3 (Data Handling), where scaling, encoding, and splitting become practical.

