Python for ML

What you'll build

By the end of this lesson you will have a small data-analysis script that loads a CSV of student exam scores, inspects it, runs some numeric operations with NumPy, and produces histograms. This is the workflow you'll repeat hundreds of times in ML projects.

Concepts

NumPy arrays and why they exist

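Python lists are flexible but slow for numeric work. NumPy arrays store numbers of a single type in contiguous memory and apply operations to the whole block in compiled code. That is why a NumPy operation on a million numbers is typically tens of times faster than the equivalent Python loop; the exact factor depends on the operation and the hardware.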
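
A rough way to check the speed difference on your own machine is to time both approaches. The sketch below uses the standard-library timeit module; the array size, the repeat count, and the two helper functions are arbitrary choices made for illustration, and the measured ratio will vary with hardware and NumPy version.

```python
import timeit

import numpy as np

n = 100_000                      # arbitrary size for the comparison
values = np.arange(n, dtype=np.float64)
values_list = values.tolist()    # same numbers as a plain Python list


def loop_sum_of_squares(xs):
    """Pure-Python loop: one interpreter step per element."""
    total = 0.0
    for x in xs:
        total += x * x
    return total


def numpy_sum_of_squares(arr):
    """Vectorised: the loop runs inside NumPy's compiled code."""
    return float(np.dot(arr, arr))


loop_time = timeit.timeit(lambda: loop_sum_of_squares(values_list), number=5)
numpy_time = timeit.timeit(lambda: numpy_sum_of_squares(values), number=5)

print(f"Python loop: {loop_time:.4f} s")
print(f"NumPy:       {numpy_time:.4f} s")
print(f"Speed-up:    {loop_time / numpy_time:.0f}x")
```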

import numpy as np

scores = np.array([72, 85, 91, 60, 78])

print(scores.mean())   # 77.2
print(scores.std())    # 10.72... (population standard deviation, ddof=0)
print(scores > 75)     # [False  True  True False  True]

The last line, scores > 75, produces a boolean mask: the comparison is applied to every element without a for-loop. This is called vectorisation.
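
A mask becomes useful once you index with it. The following minimal sketch reuses the same scores array to select and count the values above 75; the names mask, high, and n_high are introduced here for illustration.

```python
import numpy as np

scores = np.array([72, 85, 91, 60, 78])

mask = scores > 75            # boolean array, one entry per score
high = scores[mask]           # keep only the entries where mask is True
n_high = mask.sum()           # True counts as 1, so this counts matches

print(high)      # [85 91 78]
print(n_high)    # 3
```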

Broadcasting

When you add a scalar to an array, NumPy "broadcasts" the scalar across all elements. The same mechanism works between arrays of different shapes, provided the shapes are compatible.

a = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)

bias = np.array([10, 20, 30])  # shape (3,)

print(a + bias)
# [[11 22 33]
#  [14 25 36]]

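NumPy stretches bias along axis 0 so that it lines up with every row of a. The stretched version of bias is never materialised in memory; NumPy reuses the same three values for each row (the result of a + bias is a new array as usual). Broadcasting rules: shapes are compared dimension by dimension from right to left, a missing leading dimension counts as 1, and two dimensions are compatible when they are equal or when one of them is 1, in which case the 1 stretches to match the other.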
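
The rules can be checked directly. The sketch below assumes NumPy 1.20 or later, where np.broadcast_shapes is available; the shapes and the arrays x and y are made up for illustration. It shows two compatible pairs, one incompatible pair, and one way to fix the incompatible case.

```python
import numpy as np

# Compatible: trailing dimensions match (3 vs 3); the missing leading
# dimension of the second shape is treated as 1.
print(np.broadcast_shapes((2, 3), (3,)))      # (2, 3)

# Compatible: each dimension of size 1 stretches to match the other shape.
print(np.broadcast_shapes((2, 1), (1, 3)))    # (2, 3)

# Incompatible: trailing dimensions are 3 and 2, and neither is 1.
x = np.ones((2, 3))
y = np.ones(2)
try:
    x + y
except ValueError as exc:
    print("Cannot broadcast:", exc)

# One fix: give y an explicit trailing axis so its shape becomes (2, 1).
result = x + y[:, np.newaxis]
print(result.shape)                           # (2, 3)
```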

pandas DataFrames

A DataFrame is a table where columns can have different types. Think of it as a spreadsheet with a Python API.

import pandas as pd

df = pd.read_csv("scores.csv")

print(df.head())           # first 5 rows
print(df.dtypes)           # column types
print(df.describe())       # count, mean, std, min, quartiles, max

# Filter rows
top = df[df["score"] > 80]

# Add a column
df["grade"] = df["score"].apply(lambda x: "A" if x >= 90 else "B")

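The .apply() method is useful but slow on large data, because it calls a Python function once per value. Whenever possible, use vectorised operations instead, such as np.where (a NumPy function that works on pandas Series) or direct arithmetic and comparisons on the Series.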
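
As a sketch of the vectorised alternative, the A/B rule above can be written with np.where. The four-row DataFrame here is a hypothetical stand-in for scores.csv, and grade_apply is a name introduced only to compare the two results.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lesson's scores.csv
df = pd.DataFrame({"score": [95, 72, 90, 88]})

# Row-by-row version: calls the lambda once per value
grade_apply = df["score"].apply(lambda x: "A" if x >= 90 else "B")

# Vectorised version: one comparison over the whole column
df["grade"] = np.where(df["score"] >= 90, "A", "B")

print(df["grade"].tolist())                  # ['A', 'B', 'A', 'B']
print((df["grade"] == grade_apply).all())    # True
```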

matplotlib basics

matplotlib is the most widely used plotting library in the Python ecosystem. You will mostly call plt.subplots() to create a figure and its axes, then pick a chart type such as ax.hist() or ax.plot().

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist(df["score"], bins=10, edgecolor="black")
ax.set_title("Score Distribution")
ax.set_xlabel("Score")
ax.set_ylabel("Count")
plt.tight_layout()
plt.savefig("scores_hist.png", dpi=150)
plt.show()

Call plt.tight_layout() before saving. It adjusts the spacing around the axes so that titles and labels are less likely to be clipped.

Python patterns you'll keep using

A few Python features come up constantly in ML code:

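# List comprehensions: build lists quickly
squares = [x**2 for x in range(10)]

# Unpacking: assign several names in one statement
data = list(range(1000))   # placeholder dataset so the slices below run
X_train, X_test = data[:800], data[800:]

# enumerate: index and value together (scores is the NumPy array from earlier)
for i, val in enumerate(scores):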
    print(i, val)

# f-strings for clean output
print(f"Mean score: {scores.mean():.2f}")

# with statement for file I/O (auto-closes)
with open("log.txt", "w") as f:
    f.write("done\n")

Hands-on

Let us build the score-analysis script end-to-end. First, generate a small CSV of random scores (in a real project this data comes from a database or API).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# --- Step 1: Create sample data ---
rng = np.random.default_rng(seed=42)
n = 200
data = {
    "student_id": range(1, n + 1),
    "math":       rng.integers(40, 100, size=n),
    "science":    rng.integers(35, 100, size=n),
    "english":    rng.integers(45, 100, size=n),
}
df = pd.DataFrame(data)
df.to_csv("scores.csv", index=False)

# --- Step 2: Load and inspect ---
df = pd.read_csv("scores.csv")
print(df.shape)          # (200, 4)
print(df.isnull().sum()) # no missing values here

# --- Step 3: NumPy operations ---
scores_np = df[["math", "science", "english"]].to_numpy()
# shape: (200, 3)

subject_means = scores_np.mean(axis=0)   # mean per column
student_means = scores_np.mean(axis=1)   # mean per row
print("Subject means:", subject_means)

# Normalise to 0-1 range using broadcasting
col_min = scores_np.min(axis=0)
col_max = scores_np.max(axis=0)
normalised = (scores_np - col_min) / (col_max - col_min)
print("Normalised min/max:", normalised.min(), normalised.max())

# --- Step 4: Add derived columns ---
df["average"] = student_means
df["grade"] = np.where(df["average"] >= 75, "Pass", "Fail")

pass_rate = (df["grade"] == "Pass").mean() * 100
print(f"Pass rate: {pass_rate:.1f}%")

# --- Step 5: Plot ---
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, col in zip(axes, ["math", "science", "english"]):
    ax.hist(df[col], bins=15, edgecolor="black", color="steelblue", alpha=0.8)
    ax.axvline(df[col].mean(), color="red", linestyle="--", label="mean")
    ax.set_title(col.capitalize())
    ax.set_xlabel("Score")
    ax.legend()

plt.suptitle("Subject Score Distributions", fontsize=14)
plt.tight_layout()
plt.savefig("score_distributions.png", dpi=150)
print("Saved score_distributions.png")

What is happening here:

rng.integers belongs to NumPy's modern Generator API. Prefer np.random.default_rng over the older np.random.seed style. The upper bound is exclusive, so rng.integers(40, 100) yields values from 40 to 99.
axis=0 means "collapse rows, keep columns". axis=1 means "collapse columns, keep rows".
np.where(condition, value_if_true, value_if_false) is the vectorised if-else.
zip(axes, [...]) pairs each subplot axis with a column name. It is a clean pattern for drawing multiple plots in one loop.
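
The axis convention is easiest to see on a tiny array. In this sketch the 2 x 3 matrix m and its values are made up; rows stand for students and columns for subjects, as in the script above.

```python
import numpy as np

# 2 students (rows) x 3 subjects (columns); values are made up
m = np.array([[60, 70, 80],
              [90, 80, 70]])

per_subject = m.mean(axis=0)   # collapses the 2 rows, leaves 3 values
per_student = m.mean(axis=1)   # collapses the 3 columns, leaves 2 values

print(per_subject, per_subject.shape)   # [75. 75. 75.] (3,)
print(per_student, per_student.shape)   # [70. 80.] (2,)
```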

Common pitfalls

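Mixing Python lists and NumPy arrays silently. Lists and arrays give different meanings to the same operators: for lists, + concatenates and * repeats, while for arrays both are element-wise arithmetic. If a is a list and b is a NumPy array, a + b is quietly treated as element-wise addition, which may not be what you intended. Always convert early: a = np.array(a).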
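
A short sketch of the difference, using made-up values. The names as_list and as_array are introduced here to keep the two versions apart.

```python
import numpy as np

as_list = [1, 2, 3]                 # plain Python list
b = np.array([10, 20, 30])

print(as_list + as_list)            # [1, 2, 3, 1, 2, 3]  (concatenation)
print(as_list * 2)                  # [1, 2, 3, 1, 2, 3]  (repetition)
print(as_list + b)                  # [11 22 33]          (NumPy takes over: element-wise)

as_array = np.array(as_list)        # convert early
print(as_array + as_array)          # [2 4 6]
print(as_array * 2)                 # [2 4 6]
```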

Using .values vs .to_numpy(). Both extract the underlying array from a DataFrame, but .to_numpy() is the modern, preferred form. .values is legacy and can return unexpected types on extension arrays.
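
One case where the two differ is a column with a nullable extension dtype. The sketch below uses a made-up Series with pandas' "Int64" dtype and assumes a pandas version (1.0 or later) in which Series.to_numpy accepts the dtype and na_value arguments.

```python
import numpy as np
import pandas as pd

# Nullable integer column (an extension dtype); values are made up
s = pd.Series([1, 2, None], dtype="Int64")

# .values hands back a pandas extension array here, not a NumPy ndarray
print(isinstance(s.values, np.ndarray))    # False

# .to_numpy() lets you state the dtype and how missing values are encoded
arr = s.to_numpy(dtype="float64", na_value=np.nan)
print(isinstance(arr, np.ndarray), arr.dtype)   # True float64
print(arr)                                      # [ 1.  2. nan]
```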

Forgetting axis in aggregations. df.mean() averages each column. df.mean(axis=1) averages each row. Getting this wrong silently produces wrong shapes that propagate far.

Overwriting the original DataFrame in loops. Prefer df["new_col"] = ... over reassigning df itself inside a loop; once df is rebound to a new object, you lose your reference to the original.

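plt.show() blocks execution. In scripts (not notebooks), plt.show() is a blocking call: the script pauses until the plot window is closed. Call plt.savefig() before plt.show(), because saving after the window has been closed can produce a blank image, or skip show() entirely in automated pipelines.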
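
For automated pipelines, one common approach is to select a non-interactive backend and never call show(). This is a minimal sketch, assuming the "Agg" backend that ships with matplotlib; the scores are made up, and the temporary output directory is a hypothetical location chosen to keep the example self-contained.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")              # non-interactive backend: no window is opened
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist([55, 61, 67, 72, 78, 85, 91], bins=5, edgecolor="black")  # made-up scores
ax.set_xlabel("Score")
ax.set_ylabel("Count")
fig.tight_layout()

# Hypothetical output location; replace with your own path
out_path = os.path.join(tempfile.mkdtemp(), "scores_hist.png")
fig.savefig(out_path, dpi=150)
plt.close(fig)                     # release the figure's memory in long-running jobs

print("Saved", out_path)
```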

What to try next

Load a real CSV from the UCI Machine Learning Repository and replicate the hands-on script on it.
Try seaborn for prettier statistical plots; it sits on top of matplotlib.
Read about pandas.cut() and pandas.qcut() for bucketing continuous variables into categories.
Move on to Lesson 2 (Maths Foundations), where you will see why vectors and matrices are everywhere in ML.