Unsupervised Learning
What you'll build
You will cluster a dataset of customer spending patterns using k-means, find the right number of clusters with the elbow method, reduce dimensions with PCA, and visualise high-dimensional data in 2D using t-SNE.
Concepts
k-means clustering
k-means divides data into k clusters by alternating between two steps: (1) assign each point to the nearest centroid, (2) recompute centroids as the mean of assigned points. It repeats until assignments stop changing.
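To make the two alternating steps concrete, here is a minimal from-scratch sketch (illustrative only: it skips k-means++ initialisation and does not handle empty clusters; in practice use scikit-learn's KMeans, as below):
import numpy as np

def kmeans_naive(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialise centroids as k randomly chosen data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # Stop once the centroids (and hence assignments) no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids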
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
# Synthetic data: four well-separated Gaussian blobs
X, y_true = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=42)
kmeans = KMeans(n_clusters=4, random_state=42, n_init="auto")
kmeans.fit(X)
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
inertia = kmeans.inertia_ # sum of squared distances to nearest centroid
print(f"Inertia: {inertia:.2f}")
print(f"Cluster sizes: {np.bincount(labels)}")
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="tab10", alpha=0.7)
ax.scatter(centroids[:, 0], centroids[:, 1], c="black", marker="x", s=100, label="centroids")
ax.legend()
plt.title("k-means Clustering")
plt.tight_layout()
plt.savefig("kmeans.png", dpi=150)
k-means assumes clusters are roughly spherical and equal in size. It is sensitive to outliers because the mean is pulled by extreme values.
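You can see the outlier sensitivity directly by appending one extreme point and refitting (a small illustrative experiment, separate from the main example):
# One extreme outlier appended to the blob data
X_out = np.vstack([X, [[20.0, 20.0]]])
km_out = KMeans(n_clusters=4, random_state=42, n_init="auto").fit(X_out)
# The mean-based centroid nearest the outlier is dragged towards it;
# depending on the data, the outlier can even capture its own centroid
print("Centroids without outlier:\n", np.round(centroids, 2))
print("Centroids with outlier:\n", np.round(km_out.cluster_centers_, 2))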
Choosing k: the elbow method
You rarely know k in advance. The elbow method plots inertia (within-cluster sum of squares) against k and looks for the point where the improvement starts to slow down.
inertias = []
k_range = range(1, 11)
for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init="auto")
    km.fit(X)
    inertias.append(km.inertia_)
fig, ax = plt.subplots()
ax.plot(k_range, inertias, "o-")
ax.set_xlabel("Number of clusters k")
ax.set_ylabel("Inertia")
plt.title("Elbow Method")
plt.tight_layout()
plt.savefig("elbow.png", dpi=150)
Look for the "elbow", the k where adding one more cluster gives a much smaller reduction in inertia. It is often not a sharp elbow; use domain knowledge alongside it.
k-medoids
k-medoids is similar to k-means but uses actual data points as cluster centres (medoids) instead of means. This makes it robust to outliers and works with non-Euclidean distance metrics.
# pip install scikit-learn-extra
from sklearn_extra.cluster import KMedoids
kmed = KMedoids(n_clusters=4, random_state=42)
kmed.fit(X)
print("k-medoids cluster sizes:", np.bincount(kmed.labels_))
Use k-medoids when your data has outliers or when you need cluster representatives to be real data points (e.g., picking representative customers from a segment).
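Because medoids are actual rows of the data, you can look the representatives up directly; scikit-learn-extra exposes their row indices via the medoid_indices_ attribute:
# Each medoid is a real data point; cluster_centers_ equals X[kmed.medoid_indices_]
print("Medoid row indices:", kmed.medoid_indices_)
print("Medoid coordinates:\n", kmed.cluster_centers_)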
Hierarchical clustering
Hierarchical clustering builds a tree of clusters (a dendrogram) by merging the two closest clusters at each step. You do not need to choose k upfront; you cut the tree at the desired level.
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt
Z = linkage(X, method="ward") # ward linkage minimises within-cluster variance
fig, ax = plt.subplots(figsize=(10, 4))
dendrogram(Z, ax=ax, truncate_mode="level", p=5)
ax.set_title("Dendrogram (Ward linkage)")
plt.tight_layout()
plt.savefig("dendrogram.png", dpi=150)
# Cut the tree to get 4 clusters
labels_hier = fcluster(Z, t=4, criterion="maxclust")
print("Hierarchical cluster sizes:", np.bincount(labels_hier - 1))
Ward linkage is usually a good default. Single linkage tends to form "chained" clusters; complete linkage tends to form compact ones.
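To see the contrast on this data, rebuild the tree with single linkage and compare cluster sizes at the same cut (illustrative; exact sizes depend on the dataset):
# Single linkage often yields one large "chained" cluster plus tiny ones,
# in contrast to Ward's more balanced clusters
Z_single = linkage(X, method="single")
labels_single = fcluster(Z_single, t=4, criterion="maxclust")
print("Single-linkage cluster sizes:", np.bincount(labels_single - 1))
print("Ward cluster sizes:", np.bincount(labels_hier - 1))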
PCA: dimensionality reduction
PCA (Principal Component Analysis) finds new axes (principal components) that capture the most variance. It is used to:
- Reduce the number of features before training a model
- Visualise high-dimensional data in 2D or 3D
- Remove noise (low-variance dimensions)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
digits = load_digits()
X_digits = digits.data # shape (1797, 64)
y_digits = digits.target
# Scale before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_digits)
# PCA to 2 components for visualisation
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=y_digits, cmap="tab10", alpha=0.5, s=10)
plt.colorbar(scatter, ax=ax)
plt.title("Digits projected to 2D via PCA")
plt.tight_layout()
plt.savefig("pca_digits.png", dpi=150)
t-SNE intuition
t-SNE (t-distributed Stochastic Neighbour Embedding) is a non-linear technique that is often much better than PCA at revealing cluster structure in visualisations. It preserves local neighbourhoods: nearby points in high dimensions stay nearby in 2D.
from sklearn.manifold import TSNE
# t-SNE is slow; run on a subset if needed
X_sub = X_scaled[:500]
y_sub = y_digits[:500]
tsne = TSNE(n_components=2, perplexity=30, random_state=42, max_iter=1000)  # max_iter was named n_iter before scikit-learn 1.5
X_tsne = tsne.fit_transform(X_sub)
fig, ax = plt.subplots(figsize=(8, 6))
scatter = ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sub, cmap="tab10", alpha=0.7, s=15)
plt.colorbar(scatter, ax=ax)
plt.title("Digits in 2D via t-SNE")
plt.tight_layout()
plt.savefig("tsne_digits.png", dpi=150)
t-SNE is for visualisation only: the axes have no interpretable meaning, and distances between clusters are not meaningful. Do not use t-SNE as a preprocessing step for a downstream model.
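You can verify the instability yourself: two runs that differ only in the random seed produce layouts whose coordinates are not directly comparable (a quick illustrative check on the subset from above):
# Same data, two seeds: the embeddings differ, so the coordinates
# cannot be reused as features for a downstream model
emb_a = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_sub)
emb_b = TSNE(n_components=2, perplexity=30, random_state=1).fit_transform(X_sub)
print("Mean |coordinate difference|:", float(np.abs(emb_a - emb_b).mean()))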
Hands-on
Let us cluster customer segments from a synthetic shopping dataset using k-means, choose k with the elbow method, and then interpret the clusters from their feature profiles.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
# --- Synthetic customer data ---
rng = np.random.default_rng(42)
# Three "types": budget, mid-range, luxury
budgets = rng.multivariate_normal([20, 500], [[5,50],[50,2000]], 100)
midrangers = rng.multivariate_normal([45, 2000], [[8,100],[100,5000]], 100)
luxury = rng.multivariate_normal([70, 8000], [[5,150],[150,8000]], 100)
X_raw = np.vstack([budgets, midrangers, luxury])
X_raw = np.clip(X_raw, 0, None) # no negative values
df = pd.DataFrame(X_raw, columns=["visit_freq", "avg_spend"])
# --- Scale ---
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df)
# --- Elbow method ---
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, random_state=42, n_init="auto")
    inertias.append(km.fit(X_scaled).inertia_)
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
axes[0].plot(range(1, 9), inertias, "o-")
axes[0].set_xlabel("k")
axes[0].set_ylabel("Inertia")
axes[0].set_title("Elbow Curve")
# --- Fit k=3 ---
km3 = KMeans(n_clusters=3, random_state=42, n_init="auto")
labels = km3.fit_predict(X_scaled)
df["cluster"] = labels
# Cluster profiles
print("\nCluster profiles (original scale):")
print(df.groupby("cluster").mean().round(1))
# --- Visualise in original feature space ---
colors = ["steelblue", "coral", "seagreen"]
for c in range(3):
    mask = labels == c
    axes[1].scatter(df.loc[mask, "visit_freq"],
                    df.loc[mask, "avg_spend"],
                    c=colors[c], label=f"Cluster {c}", alpha=0.7)
axes[1].set_xlabel("Visit Frequency (per month)")
axes[1].set_ylabel("Avg Spend (INR)")
axes[1].legend()
axes[1].set_title("Customer Segments")
plt.tight_layout()
plt.savefig("customer_segments.png", dpi=150)
print("Saved customer_segments.png")
Looking at the cluster profiles (mean visit frequency and mean spend), you can give each cluster a business-meaningful name: "Casual", "Regular", "High-value". That interpretation step is where the real value of clustering comes from.
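One way to attach those names in code (the cluster indices from k-means are arbitrary, so rank the clusters by mean spend first; the names themselves are illustrative):
# Rank clusters by mean spend, then map ranks to human-readable names
order = df.groupby("cluster")["avg_spend"].mean().sort_values().index
names = {order[0]: "Casual", order[1]: "Regular", order[2]: "High-value"}
df["segment"] = df["cluster"].map(names)
print(df["segment"].value_counts())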
Common pitfalls
Not scaling before k-means. If one feature is in thousands and another is in single digits, k-means will essentially ignore the small-scale feature. Always scale first.
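You can see the mismatch directly with the customer data from the hands-on (a quick illustrative check reusing df and X_scaled; adjusted_rand_score measures agreement between two clusterings, 1.0 meaning identical):
from sklearn.metrics import adjusted_rand_score
# avg_spend varies over thousands while visit_freq varies over tens,
# so unscaled Euclidean distances are dominated by spend
print(df[["visit_freq", "avg_spend"]].std().round(1))
km_raw = KMeans(n_clusters=3, random_state=42, n_init="auto").fit(df[["visit_freq", "avg_spend"]])
km_std = KMeans(n_clusters=3, random_state=42, n_init="auto").fit(X_scaled)
# Agreement is often poor when feature scales differ wildly
print("ARI raw vs scaled:", round(adjusted_rand_score(km_raw.labels_, km_std.labels_), 2))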
Treating the elbow as a definitive answer. The elbow method gives a hint. Always validate with domain knowledge and silhouette scores.
Using t-SNE output as features for modelling. t-SNE is a visualisation tool, not a feature extractor. The axes are not stable across runs and the distances are not meaningful globally.
Running k-means once with a single random initialisation. k-means can converge to local optima, so run multiple restarts and keep the best. Note that n_init="auto" (the default in recent scikit-learn) runs only one initialisation with the default k-means++ seeding; pass n_init=10 or higher explicitly if you want restarts.
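A minimal sketch of an explicit-restart fit, reusing X_scaled from the hands-on:
# Request 10 restarts explicitly; scikit-learn keeps the run with the lowest inertia
km_multi = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_scaled)
print(f"Best inertia over 10 restarts: {km_multi.inertia_:.2f}")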
Thinking clusters are ground truth. Clustering is exploratory. Clusters reflect the structure in your features; change the features and the clusters change too.
What to try next
- Compute the silhouette score (sklearn.metrics.silhouette_score) as an alternative to the elbow method.
- Try DBSCAN; it can find clusters of arbitrary shape and automatically labels outliers as noise.
- Read about UMAP; it is faster than t-SNE and better preserves global structure.
- Move on to Lesson 7 (Neural Networks), where you will build models that learn their own features.