Model Evaluation
What you'll build
You will compute a full suite of evaluation metrics on a classifier, plot the confusion matrix and ROC curve, perform k-fold cross-validation, and identify whether a model is underfitting or overfitting using learning curves.
Concepts
Bias / variance trade-off
Bias is the error that comes from wrong assumptions; an underfit model has high bias. Variance is the error that comes from sensitivity to small fluctuations in the training set; an overfit model has high variance.
A model with high bias fails on both training and validation data. A model with high variance does well on training data but poorly on validation data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Toy dataset: noisy sine curve
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, 60).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.2, 60)
degrees = [1, 2, 4, 8, 15]
train_errors, val_errors = [], []
for d in degrees:
    pipe = Pipeline([
        ("poly", PolynomialFeatures(degree=d)),
        ("lr", LinearRegression()),
    ])
    # Validation error via 5-fold CV (sklearn returns negated MSE)
    cv_scores = cross_val_score(pipe, X, y, cv=5, scoring="neg_mean_squared_error")
    val_errors.append(-cv_scores.mean())
    # Training error on the full dataset
    pipe.fit(X, y)
    train_errors.append(np.mean((y - pipe.predict(X)) ** 2))
fig, ax = plt.subplots()
ax.plot(degrees, train_errors, "o-", label="Train MSE")
ax.plot(degrees, val_errors, "s--", label="Validation MSE")
ax.set_xlabel("Polynomial degree")
ax.set_ylabel("MSE")
ax.legend()
plt.title("Bias-Variance Illustration")
plt.tight_layout()
plt.savefig("bias_variance.png", dpi=150)
Degree 1 underfits. Degree 15 overfits. You want the sweet spot where validation error is lowest.
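To pick the degree programmatically rather than by eye, take the minimum of the validation errors computed above (a one-line continuation of the snippet):
best_degree = degrees[int(np.argmin(val_errors))]  # lowest cross-validated MSE
print(f"Best degree by validation MSE: {best_degree}")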
Cross-validation
Cross-validation estimates how well your model generalises without wasting data on a separate validation set. In k-fold CV, you split the data into k folds, train on k-1 folds, and evaluate on the remaining fold, repeated k times.
from sklearn.model_selection import cross_val_score, KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=50, random_state=42)
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=kf, scoring="accuracy")
print(f"Accuracy per fold: {scores.round(3)}")
print(f"Mean: {scores.mean():.4f} Std: {scores.std():.4f}")
Use StratifiedKFold for classification to preserve class ratios in each fold. KFold with shuffle=True is fine for regression.
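Swapping in StratifiedKFold is a one-line change; a minimal sketch reusing clf, X, and y from above:
from sklearn.model_selection import StratifiedKFold
# Stratified folds keep the class ratio of y in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=skf, scoring="accuracy")
print(f"Stratified mean accuracy: {scores.mean():.4f}")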
Confusion matrix
The confusion matrix shows how many samples of each true class were predicted as each predicted class. For binary classification:
- True Positive (TP): correctly predicted positive
- True Negative (TN): correctly predicted negative
- False Positive (FP): predicted positive but actually negative (Type I error)
- False Negative (FN): predicted negative but actually positive (Type II error)
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap="Blues")
plt.tight_layout()
plt.savefig("confusion_matrix.png", dpi=150)
Precision, recall, and F1
Accuracy is misleading when classes are imbalanced. Use precision, recall, and F1.
- Precision = TP / (TP + FP): "of all predicted positives, how many were actually positive?"
- Recall = TP / (TP + FN): "of all actual positives, how many did we catch?"
- F1 = 2 * (Precision * Recall) / (Precision + Recall): the harmonic mean of both
from sklearn.metrics import classification_report, precision_score, recall_score, f1_score
print(classification_report(y_test, y_pred))
# Individual metrics
p = precision_score(y_test, y_pred)
r = recall_score(y_test, y_pred)
f = f1_score(y_test, y_pred)
print(f"Precision: {p:.3f} Recall: {r:.3f} F1: {f:.3f}")
In medical diagnosis, a false negative (missing a disease) is usually worse than a false positive. So you optimise for recall. In spam detection, a false positive (marking a legitimate email as spam) is worse. So you optimise for precision.
ROC curve and AUC
The ROC (Receiver Operating Characteristic) curve plots True Positive Rate vs False Positive Rate across all classification thresholds. The Area Under the Curve (AUC) summarises the whole curve in a single number: 1.0 is a perfect classifier, 0.5 is random guessing.
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
y_proba = clf.predict_proba(X_test)[:, 1] # probability of positive class
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)
fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"ROC (AUC = {auc:.3f})")
ax.plot([0, 1], [0, 1], "k--", label="Random")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.legend()
plt.title("ROC Curve")
plt.tight_layout()
plt.savefig("roc_curve.png", dpi=150)
print(f"AUC: {auc:.4f}")
Hands-on
Let's run a complete evaluation workflow: train two models, compare them with cross-validation, pick the winner, and produce the full set of evaluation plots on the test set.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import (train_test_split, StratifiedKFold,
                                     cross_val_score, learning_curve)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (classification_report, confusion_matrix,
                             ConfusionMatrixDisplay, roc_auc_score, roc_curve)
# --- Load and split ---
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}
# --- Cross-validate both ---
print("5-fold CV F1 scores:")
best_score, best_name, best_model = 0, None, None
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=skf, scoring="f1")
    mean_f1 = scores.mean()
    print(f"  {name}: {mean_f1:.4f} +/- {scores.std():.4f}")
    if mean_f1 > best_score:
        best_score, best_name, best_model = mean_f1, name, model
print(f"\nWinner: {best_name}")
# --- Fit best model and evaluate on test set ---
best_model.fit(X_train, y_train)
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]
print("\nClassification report:")
print(classification_report(y_test, y_pred))
# --- Confusion matrix ---
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(cm)
disp.plot(ax=axes[0], cmap="Blues", colorbar=False)
axes[0].set_title("Confusion Matrix")
# --- ROC curve ---
fpr, tpr, _ = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)
axes[1].plot(fpr, tpr, label=f"{best_name} (AUC={auc:.3f})")
axes[1].plot([0, 1], [0, 1], "k--", label="Random")
axes[1].set_xlabel("FPR")
axes[1].set_ylabel("TPR")
axes[1].legend()
axes[1].set_title("ROC Curve")
plt.tight_layout()
plt.savefig("evaluation_suite.png", dpi=150)
# --- Learning curve to check bias/variance ---
train_sizes, train_scores, val_scores = learning_curve(
    best_model, X_train, y_train, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 10), scoring="f1"
)
fig, ax = plt.subplots()
ax.plot(train_sizes, train_scores.mean(axis=1), "o-", label="Train F1")
ax.plot(train_sizes, val_scores.mean(axis=1), "s--", label="Val F1")
ax.set_xlabel("Training size")
ax.set_ylabel("F1 score")
ax.legend()
plt.title("Learning Curve")
plt.tight_layout()
plt.savefig("learning_curve.png", dpi=150)
print("Saved evaluation_suite.png and learning_curve.png")
Common pitfalls
Reporting only accuracy on imbalanced data. If 90% of samples are class 0, a dummy classifier that always predicts 0 gets 90% accuracy. Report F1, AUC, or confusion matrix instead.
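You can make that baseline concrete with sklearn's DummyClassifier (a quick sketch using the train/test split from the hands-on):
from sklearn.dummy import DummyClassifier
# Always predicts the majority class: accuracy equals the majority-class
# frequency, but recall on the minority class is zero
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
print(f"Dummy accuracy: {dummy.score(X_test, y_test):.3f}")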
Touching the test set more than once. Every time you look at test results and adjust your model, you are effectively fitting to the test set. Reserve it strictly for the final report.
Not using stratified splits. Plain train_test_split can put all rare-class samples in one partition by chance. Always use stratify=y for classification.
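A quick way to confirm the split preserved the ratio (assuming the 0/1 labels and stratified split from the hands-on):
# With stratify=y the train split mirrors the full positive-class ratio
print(f"Positive ratio - full data: {y.mean():.3f}, train split: {y_train.mean():.3f}")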
Confusing micro, macro, and weighted averages. classification_report shows all three. For imbalanced classes, weighted average is usually most representative. Macro average gives equal weight to each class.
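The same choice is exposed directly on the metric functions via the average= argument:
from sklearn.metrics import f1_score
# average= selects the aggregation strategy across classes
print(f"Macro F1:    {f1_score(y_test, y_pred, average='macro'):.3f}")
print(f"Weighted F1: {f1_score(y_test, y_pred, average='weighted'):.3f}")
print(f"Micro F1:    {f1_score(y_test, y_pred, average='micro'):.3f}")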
Picking a model based on cross-validation and then forgetting to refit on all training data. After cross-validation tells you which model is best, refit that model on the full training set (not just one fold) before predicting on the test set.
What to try next
- Try sklearn.metrics.PrecisionRecallDisplay for the precision-recall curve; it is more informative than ROC when classes are very imbalanced (a starter snippet follows this list).
- Read about the Matthews Correlation Coefficient (MCC); it handles imbalanced binary classification better than F1.
- Experiment with adjusting the classification threshold (y_proba > 0.3 instead of > 0.5) and see how it moves you along the ROC curve.
- Move on to Lesson 6 (Unsupervised Learning), where labels are gone and you have to find structure yourself.
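As a starting point for the first two items, a sketch reusing y_test, y_pred, and y_proba from the hands-on:
from sklearn.metrics import PrecisionRecallDisplay, matthews_corrcoef
import matplotlib.pyplot as plt
# Precision-recall curve from predicted probabilities
PrecisionRecallDisplay.from_predictions(y_test, y_proba)
plt.savefig("precision_recall.png", dpi=150)
# MCC folds all four confusion-matrix cells into one number
print(f"MCC: {matthews_corrcoef(y_test, y_pred):.4f}")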