Machine Learning
Supervised and unsupervised learning, regression, SVM, decision trees, neural nets, clustering, PCA.
Why this unit matters
Machine learning is the largest and most application-focused unit in GATE DA. It carries the highest share of marks and is the reason most students choose this paper over GATE CS. Questions range from "what does this algorithm do" to numerical computations (computing a regression coefficient, finding cluster centroids, tracing a decision tree split). The unit rewards students who understand the maths behind the algorithms, not just their names. If you have covered probability, linear algebra, and calculus first, this unit will click much faster.
Syllabus map
| Sub-topic | Key concepts |
|---|---|
| Supervised learning | Regression, classification |
| Linear regression | OLS, ridge, coefficients, R-squared |
| Logistic regression | Sigmoid, log-loss, decision boundary |
| kNN | Distance metrics, choice of k |
| Naive Bayes | Conditional independence, Gaussian NB |
| LDA | Linear discriminant, within/between scatter |
| SVM | Margin, support vectors, kernel trick |
| Decision trees | Information gain, Gini, pruning |
| Bias-variance | Trade-off, underfitting, overfitting |
| Cross-validation | k-fold, LOO, stratification |
| MLP | Forward pass, activation functions |
| Unsupervised | Clustering, dimensionality reduction |
| k-means / k-medoid | Centroid vs. medoid, convergence |
| Hierarchical clustering | Linkage methods, dendrograms |
| PCA | Eigenvectors, explained variance |
Linear regression
Simple linear regression: y = beta_0 + beta_1 * x + epsilon.
OLS (Ordinary Least Squares) minimises the sum of squared residuals. Closed-form solution:
beta_1 = Cov(X, Y) / Var(X), beta_0 = mean(Y) - beta_1 * mean(X)
Multiple regression: y = X * beta, where X is the design matrix (with a column of ones). OLS solution: beta = (X^T X)^(-1) X^T y.
Ridge regression adds an L2 penalty to prevent overfitting: minimise ||y - X*beta||^2 + lambda * ||beta||^2. Solution: beta = (X^T X + lambda*I)^(-1) X^T y. Ridge always produces a unique solution even when X^T X is singular.
from sklearn.linear_model import LinearRegression, Ridge
import numpy as np
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
lr = LinearRegression()
lr.fit(X, y)
print(f"Slope: {lr.coef_[0]:.2f}, Intercept: {lr.intercept_:.2f}")
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print(f"Ridge slope: {ridge.coef_[0]:.2f}")
Trap. Ridge regression shrinks coefficients towards zero but does not make them exactly zero. Lasso (L1 penalty) produces sparse solutions. GATE may test which regularisation gives sparsity.
Logistic regression
Models the probability that y = 1:
P(y=1 | x) = 1 / (1 + e^(-w^T x)) (sigmoid function)
Training minimises binary cross-entropy (log-loss):
Loss = -[y * log(p) + (1-y) * log(1-p)]
The decision boundary is the hyperplane w^T x = 0 (probability = 0.5).
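A minimal numpy sketch of the sigmoid and the log-loss; the weights and sample point below are illustrative, not taken from any dataset:
import numpy as np
def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))
w = np.array([2.0, -1.0])   # illustrative weights
x = np.array([0.5, 1.0])    # one sample; w^T x = 0, so it sits on the decision boundary
p = sigmoid(w @ x)          # P(y = 1 | x) = 0.5
y = 1
loss = -(y * np.log(p) + (1 - y) * np.log(1 - p))   # binary cross-entropy
print(f"p = {p:.2f}, log-loss = {loss:.3f}")        # p = 0.50, log-loss = 0.693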
Trap. Despite the name, logistic regression is a classification algorithm, not regression. GATE questions sometimes use this to catch students off guard.
k-Nearest Neighbours
kNN classifies a new point by majority vote of its k nearest neighbours in the training set.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
- k = 1: very flexible, high variance, low bias.
- Large k: smoother boundary, lower variance, higher bias.
- Distance metric matters: Euclidean is the default; Manhattan and Minkowski are alternatives.
- kNN is a lazy learner: it stores all training data and computes distances at prediction time.
Trap. kNN does not work well with high-dimensional data (curse of dimensionality). Normalising features before computing distances is essential.
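Because of that, kNN is usually wrapped with a scaler; a hedged sketch using a sklearn pipeline (X_train and y_train are assumed to exist, as in the snippet above):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
# Standardise features so no single feature dominates the distance computation
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_train, y_train)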
Naive Bayes
Uses Bayes' theorem assuming conditional independence of features given the class:
P(y | x1, ..., xn) proportional to P(y) * product of P(xi | y)
Gaussian Naive Bayes assumes each P(xi | y) follows a Gaussian distribution with parameters estimated from training data.
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
The independence assumption is "naive" and often violated in practice, yet Naive Bayes performs surprisingly well in text classification and spam filtering.
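A tiny sketch of the proportionality rule with made-up priors and likelihoods (every number here is illustrative):
import numpy as np
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {"spam": [0.8, 0.3], "ham": [0.1, 0.7]}   # P(xi | y) for two features
scores = {c: priors[c] * np.prod(likelihoods[c]) for c in priors}   # P(y) * product of P(xi | y)
posteriors = {c: s / sum(scores.values()) for c, s in scores.items()}   # normalise
print(posteriors)   # spam ≈ 0.70, ham ≈ 0.30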
LDA (Linear Discriminant Analysis)
LDA finds the linear projection that maximises the ratio of between-class scatter to within-class scatter. The Fisher criterion:
J(w) = (w^T S_B w) / (w^T S_W w)
where S_B is the between-class scatter matrix and S_W is the within-class scatter matrix.
The optimal w is the eigenvector of S_W^(-1) S_B corresponding to the largest eigenvalue.
Unlike logistic regression, LDA makes a parametric assumption: features are normally distributed with equal covariance matrices across classes.
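A sketch of extracting the Fisher direction with numpy, assuming the scatter matrices have already been computed (the 2x2 matrices below are placeholders, not derived from real data):
import numpy as np
S_W = np.array([[2.0, 0.0], [0.0, 1.0]])   # within-class scatter (placeholder values)
S_B = np.array([[4.0, 2.0], [2.0, 1.0]])   # between-class scatter (placeholder values)
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
w = eigvecs[:, np.argmax(eigvals.real)].real   # eigenvector with the largest eigenvalue
print(w)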
SVM (Support Vector Machine)
SVM finds the hyperplane that maximises the margin, the distance between the hyperplane and the nearest training points (support vectors).
The margin is 2 / ||w||. Maximising margin = minimising ||w||^2.
Soft-margin SVM (for non-linearly separable data): introduces slack variables and a penalty C. Large C = hard margin (more sensitive to outliers). Small C = wide margin (more misclassification tolerated).
Kernel trick: replaces the dot product x^T x' with a kernel function K(x, x'), mapping data to a higher-dimensional space without computing that mapping explicitly.
- Linear kernel: K(x, x') = x^T x'
- Polynomial: K(x, x') = (x^T x' + c)^d
- RBF (Gaussian): K(x, x') = exp(-gamma * ||x - x'||^2)
from sklearn.svm import SVC
svm = SVC(kernel='rbf', C=1.0, gamma='scale')
svm.fit(X_train, y_train)
Trap. Support vectors are the training points that lie on the margin boundary. Points far from the decision boundary do not influence it; removing them does not change the model.
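The fitted SVC exposes these points directly; a brief check, reusing the svm object and training data assumed above:
# Support vectors are the only training points the decision function depends on
print(svm.support_vectors_)   # coordinates of the support vectors
print(svm.n_support_)         # number of support vectors per class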
Decision trees
A decision tree splits data recursively by the feature that gives the best "purity" gain.
Information gain (ID3): IG = Entropy(parent) - weighted sum of Entropy(children). Entropy = -sum p_i * log2(p_i).
Gini impurity (CART): Gini = 1 - sum p_i^2. Lower Gini = more pure.
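Both impurity measures are easy to compute directly; a small numpy sketch for a node holding 3 positive and 2 negative samples:
import numpy as np
counts = np.array([3, 2])            # class counts in the node
p = counts / counts.sum()
entropy = -np.sum(p * np.log2(p))    # ≈ 0.971 bits
gini = 1 - np.sum(p ** 2)            # ≈ 0.48
print(entropy, gini)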
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(criterion='gini', max_depth=3)
dt.fit(X_train, y_train)
Overfitting is the main risk with deep trees. Mitigations: max_depth, min_samples_split, min_samples_leaf, or post-pruning.
Bias-variance trade-off
Total expected error = Bias^2 + Variance + Irreducible noise.
- High bias (underfitting): model too simple. Training and test error are both high.
- High variance (overfitting): model too complex. Training error low, test error high.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Degree 1 = high bias, degree 10 = high variance
# X_train, y_train are assumed to be defined already
for degree in [1, 3, 10]:
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(X_train)
    model = LinearRegression().fit(X_poly, y_train)
    print(degree, model.score(X_poly, y_train))  # training R^2 rises with model complexity
Trap. Regularisation (ridge, lasso) increases bias but reduces variance. The goal is the right trade-off, not zero bias.
Cross-validation
k-fold: split data into k equal folds. Train on k-1, test on 1. Repeat k times. Average the scores.
LOO (Leave-One-Out): k = n. Each single data point is the test set once. Expensive but low bias.
Stratified k-fold: ensures class proportions are preserved in each fold. Preferred for imbalanced datasets.
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring='accuracy')
print(f"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
MLP (Multi-Layer Perceptron)
An MLP has an input layer, one or more hidden layers, and an output layer. Each unit applies an activation function to the weighted sum of its inputs.
Common activations: ReLU (max(0,x)), sigmoid (1/(1+e^-x)), tanh.
Forward pass: compute activations layer by layer. Loss is computed at the output. Backpropagation uses the chain rule to compute gradients for each weight.
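A minimal forward-pass sketch in numpy with one hidden layer; the weights are random placeholders, not trained values:
import numpy as np
rng = np.random.default_rng(0)
def relu(z):
    return np.maximum(0, z)
x = np.array([1.0, 2.0])                          # input vector (2 features)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)     # hidden layer: 2 -> 3 units
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)     # output layer: 3 -> 1 unit
h = relu(W1 @ x + b1)                             # hidden activations
y_hat = 1 / (1 + np.exp(-(W2 @ h + b2)))          # sigmoid output for binary classification
print(y_hat)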
k-means clustering
- Initialise k centroids (randomly or k-means++).
- Assign each point to the nearest centroid.
- Update centroids to the mean of their assigned points.
- Repeat until convergence.
from sklearn.cluster import KMeans
km = KMeans(n_clusters=3, random_state=42, n_init=10)
km.fit(X)
labels = km.labels_
centroids = km.cluster_centers_
k-medoid: instead of the mean, the cluster centre is one of the actual data points (the medoid). More robust to outliers. Used when the mean is not meaningful (e.g., categorical data).
Trap. k-means is sensitive to initialisation and can converge to local optima; k-means++ initialisation gives better results. The number of clusters k must be chosen beforehand, using the elbow method or silhouette score.
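A hedged sketch of choosing k with the silhouette score, reusing X from the snippet above:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))   # higher silhouette = better-separated clusters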
Hierarchical clustering
Builds a dendrogram (tree of clusters). Two approaches:
- Agglomerative (bottom-up): start with each point as its own cluster; merge the closest pair repeatedly.
- Divisive (top-down): start with one cluster; split recursively.
Linkage methods:
- Single linkage: distance = min distance between any two points in the clusters. Tends to produce elongated clusters.
- Complete linkage: distance = max distance. Tends to produce compact clusters.
- Average linkage: distance = average of all pairwise distances.
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
agg = AgglomerativeClustering(n_clusters=3, linkage='complete')
labels = agg.fit_predict(X)
Z = linkage(X, method='complete')   # linkage matrix; pass to dendrogram(Z) to plot the tree
Trap. Single linkage suffers from "chaining": one outlier can link two distant clusters.
PCA (Principal Component Analysis)
PCA finds directions of maximum variance in the data (principal components) and projects data onto them.
Steps:
- Centre the data (subtract mean).
- Compute the covariance matrix.
- Find eigenvectors and eigenvalues of the covariance matrix.
- Sort by eigenvalue (descending); these are the principal components.
- Project data onto the top k components.
from sklearn.decomposition import PCA
import numpy as np
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
The explained variance ratio tells you what fraction of total variance is captured by each component.
Trap. PCA is a linear technique. It may not capture non-linear structure. Also, PCA components are not necessarily interpretable: each component is a linear combination of original features.
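The step-by-step recipe above can be reproduced directly in numpy; a sketch assuming X is a 2-D samples-by-features array:
import numpy as np
X_centred = X - X.mean(axis=0)               # 1. centre the data
cov = np.cov(X_centred, rowvar=False)        # 2. covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # 3. eigendecomposition (symmetric matrix)
order = np.argsort(eigvals)[::-1]            # 4. sort by eigenvalue, descending
components = eigvecs[:, order[:2]]           # top 2 principal components
X_reduced_manual = X_centred @ components    # 5. project onto the top components
print(eigvals[order] / eigvals.sum())        # explained variance ratio per component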
Worked examples
Example 1. Find the OLS slope for X = [1,2,3,4,5] and Y = [2,4,5,4,5].
mean_X = 3, mean_Y = 4. Cov(X,Y) = [(1-3)(2-4)+(2-3)(4-4)+(3-3)(5-4)+(4-3)(4-4)+(5-3)(5-4)] / 5 = [4+0+0+0+2]/5 = 1.2. Var(X) = [(4+1+0+1+4)]/5 = 2. beta_1 = 1.2 / 2 = 0.6.
Example 2. A kNN classifier with k=3. Distances from new point to training points: 0.5(class A), 1.2(class B), 0.8(class A), 2.0(class B), 0.9(class A). What is the prediction?
Three nearest: 0.5(A), 0.8(A), 0.9(A). All three are class A. Prediction: class A.
Example 3. Data: [(1,+),(1,-),(2,+),(3,+),(3,-)]. Compute entropy of the root node.
3 positive, 2 negative. P(+) = 3/5, P(-) = 2/5. Entropy = -(3/5)log2(3/5) - (2/5)log2(2/5) ≈ -0.6*(-0.737) - 0.4*(-1.322) ≈ 0.442 + 0.529 = 0.971.
Example 4. k-means with k=2. Points: [1,2,3,8,9,10]. Centroids initialised at 1 and 8. After one iteration, what are the new centroids?
Assign: 1,2,3 to centroid 1 (distances 0,1,2 vs. 7,6,5). 8,9,10 to centroid 8 (distances 7,8,9 vs. 0,1,2). New centroids: mean(1,2,3) = 2 and mean(8,9,10) = 9.
Example 5. PCA on a 2D dataset where Var(X1) = 4, Var(X2) = 1, Cov(X1,X2) = 0. What are the principal components?
The covariance matrix is diagonal: [[4,0],[0,1]]. Eigenvectors are [1,0] (eigenvalue 4) and [0,1] (eigenvalue 1). First PC is X1 (explains 80% of variance), second PC is X2.
Quick-revision summary
- Ridge: L2 penalty, shrinks coefficients. Lasso: L1 penalty, sets some coefficients to zero.
- Logistic regression: classification algorithm. Decision boundary at sigmoid = 0.5.
- kNN: lazy learner, normalise features, small k = high variance.
- SVM: maximise margin. Support vectors define the decision boundary.
- Decision tree: split on information gain (entropy) or Gini impurity.
- Bias-variance: regularisation trades off bias for variance.
- k-means: centroid = mean. k-medoid: centroid = actual data point.
- PCA: eigenvectors of covariance matrix, sorted by eigenvalue.
- Single linkage: min distance (chaining effect). Complete linkage: max distance (compact clusters).
How to study this unit
- Understand OLS linear regression fully before moving on: derive the normal equations, compute a small numerical example by hand.
- Implement logistic regression, kNN, and k-means in raw Python/numpy (no sklearn) once each. This forces you to understand the algorithm.
- For SVM, focus on the geometric interpretation: margin, support vectors, and what the kernel trick achieves. You do not need to solve QP problems.
- Practice decision tree splitting by hand: compute entropy or Gini for a 2-3 feature dataset and choose the best split.
- Do 5 PCA exercises: given a covariance matrix, find eigenvalues and eigenvectors, sort them, and explain how much variance the first component captures.
- Bias-variance is a conceptual topic but appears in multiple GATE questions. Know the effect of regularisation, model complexity, and training set size on each.
- Use sklearn to verify your manual computations, not to skip them.