Lesson 4 of 10 · 6 min read

Supervised Learning


What you'll build

You will train four different classifiers on the same dataset, compare their accuracy, and wrap the best one in a scikit-learn pipeline that handles preprocessing and prediction in a single call.


Concepts

Linear regression

Linear regression fits a line (or hyperplane) through the data by minimising the sum of squared residuals.

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np

X, y = make_regression(n_samples=300, n_features=3, noise=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {rmse:.2f}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_:.2f}")

The coefficients tell you how much y changes per unit change in each feature, assuming all other features are held constant. This interpretability is one reason linear models are still widely used.
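
You can sanity-check this interpretation by reproducing the model's predictions directly from the coefficients and intercept. This is a small sketch continuing from the code above:

# LinearRegression.predict is just a dot product plus the intercept
manual_pred = X_test @ model.coef_ + model.intercept_
print(np.allclose(manual_pred, model.predict(X_test)))  # True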

Logistic regression

Despite the name, logistic regression is a classification algorithm. It models the probability that a sample belongs to class 1 using the sigmoid function.

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print(f"Accuracy: {clf.score(X_test, y_test):.3f}")
proba = clf.predict_proba(X_test)[:5]  # probability for each class
print("First 5 probability pairs:\n", proba)

predict_proba returns [P(class=0), P(class=1)] for each sample. The default decision threshold is 0.5; you can change it by thresholding the second column yourself, which matters when classes are imbalanced.
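
As a quick sketch (the 0.7 cut-off here is arbitrary), this is how you might raise the threshold so the model only predicts class 1 when it is more confident:

# Custom decision threshold instead of the default 0.5
threshold = 0.7
p_class1 = clf.predict_proba(X_test)[:, 1]
y_pred_strict = (p_class1 >= threshold).astype(int)
print("Positives at 0.5:", int((clf.predict(X_test) == 1).sum()))
print("Positives at 0.7:", int(y_pred_strict.sum()))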

Decision trees and random forests

A decision tree learns a sequence of if-else rules. It is easy to interpret but prone to overfitting. A random forest builds many trees on random subsets of data and features, then averages their predictions.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Decision tree, can overfit badly with no depth limit
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print("Tree accuracy:", tree.score(X_test, y_test))

# Random forest, much more robust
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("Forest accuracy:", forest.score(X_test, y_test))

# Feature importance
import pandas as pd
importance = pd.Series(forest.feature_importances_,
                       index=[f"f{i}" for i in range(4)])
print(importance.sort_values(ascending=False))

n_estimators=100 means the forest contains 100 trees. More trees reduce variance but slow down training; 100-500 is usually enough, and returns diminish after that.
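
To see why limiting depth matters, compare a fully grown tree's training and test accuracy on the same split. A quick sketch; exact numbers will vary with the data:

# No depth limit: the tree keeps splitting until every leaf is pure
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=42)
deep_tree.fit(X_train, y_train)
print("Train accuracy:", deep_tree.score(X_train, y_train))  # typically 1.0
print("Test accuracy:",  deep_tree.score(X_test, y_test))    # noticeably lower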

k-Nearest Neighbours (kNN)

kNN predicts by finding the k training samples closest to the query point and taking a vote (classification) or average (regression).

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# kNN needs scaling: distance is meaningless when features have different units
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc  = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train_sc, y_train)
print("kNN accuracy:", knn.score(X_test_sc, y_test))

kNN has no real training phase; it simply memorises the training data. Prediction is slow on large datasets because every training point must be searched. Do not use kNN on datasets with more than a few hundred thousand samples without approximate nearest-neighbour search (for example, faiss).
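
To see the mechanism, here is a rough NumPy sketch that reproduces the vote for a single test point; it should agree with knn.predict for that point:

# Manual kNN vote for the first test point
query = X_test_sc[0]
dists = np.linalg.norm(X_train_sc - query, axis=1)      # Euclidean distance to every training point
nearest = np.argsort(dists)[:5]                         # indices of the 5 closest points
manual_label = np.bincount(y_train[nearest]).argmax()   # majority vote
print("Manual vote:", manual_label, "| knn.predict:", knn.predict(query.reshape(1, -1))[0])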

scikit-learn pipelines

A pipeline chains preprocessing and modelling steps into a single object. You can fit, predict, cross-validate, and tune it as one unit.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf",   LogisticRegression(max_iter=1000)),
])

pipe.fit(X_train, y_train)
print("Pipeline accuracy:", pipe.score(X_test, y_test))

# predict works exactly the same
y_pred = pipe.predict(X_test)

Pipelines prevent data leakage when used with cross-validation because the scaler is refitted on each fold's training data.
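
For example, you can pass the whole pipeline straight to cross_val_score; the scaler is then fit only on each fold's training portion:

from sklearn.model_selection import cross_val_score

# Scaling happens inside each fold, so the held-out fold never leaks into the scaler
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")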


Hands-on

Let us train all four classifiers on a real-ish dataset (breast cancer from scikit-learn), compare them, and build a final pipeline.

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

# --- Load data ---
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names

print(f"Samples: {X.shape[0]}  Features: {X.shape[1]}")
print(f"Classes: {data.target_names}  Distribution: {np.bincount(y)}")

# --- Split ---
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# --- Define classifiers ---
classifiers = {
    "Logistic Regression": Pipeline([
        ("scale", StandardScaler()),
        ("clf",   LogisticRegression(max_iter=2000))
    ]),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "kNN": Pipeline([
        ("scale", StandardScaler()),
        ("clf",   KNeighborsClassifier(n_neighbors=7))
    ]),
}

# --- Cross-validation on training set ---
print("\nCross-validation accuracy (5-fold on training set):")
results = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"  {name}: {scores.mean():.4f} +/- {scores.std():.4f}")

# --- Best model final evaluation on test set ---
best_name = max(results, key=results.get)
print(f"\nBest model: {best_name}")

best_clf = classifiers[best_name]
best_clf.fit(X_train, y_train)
y_pred = best_clf.predict(X_test)

print("\nClassification report on test set:")
print(classification_report(y_test, y_pred, target_names=data.target_names))

# --- Feature importance (only for Random Forest) ---
if best_name == "Random Forest":
    importances = pd.Series(
        best_clf.feature_importances_, index=feature_names
    ).sort_values(ascending=False)
    print("\nTop 5 features:")
    print(importances.head())

The cross-validation step is done before touching the test set. The test set is only used once, for the final report. This is the right workflow.


Common pitfalls

Evaluating on the training set. model.score(X_train, y_train) only tells you how well the model fits data it has already seen, not how it generalises. Always evaluate on held-out data.

Not scaling for kNN and logistic regression. kNN uses distance; logistic regression converges much faster with scaled features. Forgetting to scale is one of the most common mistakes for new practitioners.

Setting max_depth=None on decision trees. This allows trees to grow until every leaf is pure, which memorises the training set. Use max_depth or min_samples_leaf to control overfitting.

Using accuracy on imbalanced data. If 95% of samples are class 0, a model that predicts everything as 0 gets 95% accuracy. Use precision, recall, and F1 instead. Lesson 5 covers this in detail.
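
A tiny sketch makes the point; make_classification's weights argument produces a roughly 95/5 split here:

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

# Roughly 95% of samples are class 0
_, y_imb = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
always_zero = np.zeros_like(y_imb)                      # "model" that always predicts class 0
print("Accuracy:", accuracy_score(y_imb, always_zero))  # around 0.95, yet the model is useless
print("F1:", f1_score(y_imb, always_zero))              # 0.0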

Ignoring random_state. Without a fixed seed, results change every run. Always set random_state for reproducibility.


What to try next

  • Tune n_neighbors in kNN using a validation curve; try values 1, 3, 5, 7, 11, 21.
  • Look into GradientBoostingClassifier and XGBClassifier; they often outperform random forests.
  • Read about ColumnTransformer to handle mixed numeric / categorical features in a single pipeline.
  • Move on to Lesson 5 (Model Evaluation), where you will learn to measure model quality honestly.
