Lesson 3 of 10 · 6 min read

Data Handling


What you'll build

By the end of this lesson you will have a complete data preparation pipeline: load a CSV, handle missing values, scale numeric features, encode categorical ones, and produce clean train/validation/test splits that are ready for any scikit-learn model.


Concepts

Loading CSV and Parquet files

CSV is the most common tabular format. Parquet is an efficient, compressed columnar format commonly used in production data pipelines.

import pandas as pd

# CSV, human-readable, slow for large files
df_csv = pd.read_csv("data.csv")

# Parquet, compressed, column-oriented, fast
df_parquet = pd.read_parquet("data.parquet")

# Useful options for CSV
df = pd.read_csv(
    "data.csv",
    parse_dates=["signup_date"],   # convert column to datetime
    dtype={"user_id": str},        # force a specific type
    na_values=["N/A", "none", "?"] # mark extra strings as NaN
)

df.info()                # prints column names, non-null counts, dtypes
print(df.describe())     # statistics for numeric columns
print(df.isnull().sum()) # missing value count per column

For very large files that do not fit in memory, read in chunks:

chunks = []
for chunk in pd.read_csv("big.csv", chunksize=100_000):
    chunks.append(chunk[chunk["value"] > 0])  # filter each chunk
df = pd.concat(chunks, ignore_index=True)

Handling missing values

Do not drop missing rows by default. Think about why data is missing first.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, np.nan, 35, 40, np.nan],
    "salary": [50000, 60000, np.nan, 80000, 55000],
    "city":   ["Delhi", "Mumbai", np.nan, "Chennai", "Pune"],
})

# Strategy 1: drop rows with any missing value
df_dropped = df.dropna()

# Strategy 2: fill with a constant
df["city"] = df["city"].fillna("Unknown")

# Strategy 3: fill numeric columns with the median (robust to outliers)
df["age"] = df["age"].fillna(df["age"].median())

# Strategy 4: forward-fill (for time-series)
df["salary"] = df["salary"].ffill()

print(df.isnull().sum())  # should be 0 after filling

Use median (not mean) to fill numeric columns when outliers are present. Mean is pulled towards extreme values; median is not.
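A quick illustration with a made-up age column containing one outlier:

import numpy as np

ages = np.array([25, 30, 35, 40, 400])  # 400 is a data-entry outlier
print(ages.mean())       # 106.0, pulled up by the outlier
print(np.median(ages))   # 35.0, unaffected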

Feature scaling

Most ML algorithms (linear models, SVMs, neural networks, kNN) are sensitive to the scale of features. Tree-based models are the exception: their splits depend only on the ordering of values, so they are scale-invariant.

from sklearn.preprocessing import StandardScaler, MinMaxScaler

X = df[["age", "salary"]].values

# StandardScaler: (x - mean) / std
# Result has mean 0 and std 1
scaler_std = StandardScaler()
X_std = scaler_std.fit_transform(X)
print(X_std.mean(axis=0))  # approx [0. 0.]
print(X_std.std(axis=0))   # approx [1. 1.]

# MinMaxScaler: (x - min) / (max - min)
# Result is in [0, 1]
scaler_mm = MinMaxScaler()
X_mm = scaler_mm.fit_transform(X)
print(X_mm.min(axis=0))    # [0. 0.]
print(X_mm.max(axis=0))    # [1. 1.]

Important: fit the scaler on the training set only. Then transform both train and test. If you fit on the whole dataset, you leak test information into training, a subtle but common bug.
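A minimal sketch of the correct pattern (the Hands-on section below uses the same idiom on real data):

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(200, 3) * 10 + 5            # toy feature matrix
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from train only
X_test_scaled  = scaler.transform(X_test)       # reuse the train statistics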

Encoding categorical variables

ML models need numbers. Converting categories to numbers requires care.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({
    "city":   ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"],
    "grade":  ["A", "B", "A", "C", "B"],
})

# Option 1: Label encoding, assigns an integer to each category.
# Use only for ordinal categories (A < B < C makes sense).
le = LabelEncoder()
df["grade_enc"] = le.fit_transform(df["grade"])

# Option 2: One-hot encoding, creates a binary column per category.
# Use for nominal categories (no inherent order).
df_onehot = pd.get_dummies(df, columns=["city"], drop_first=True)
print(df_onehot.head())

drop_first=True drops one category to avoid perfect multicollinearity (the "dummy variable trap"). If you have k categories and keep all k columns, any one column is a linear combination of the others, which causes problems for linear models.
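The OneHotEncoder imported above achieves the same result as get_dummies, but it remembers the category set learned during fit, which matters when you later transform new data. A minimal sketch (assuming scikit-learn >= 1.2 for the sparse_output argument):

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

cities = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Chennai", "Mumbai"]})

# drop="first" removes one column per feature, like drop_first=True above
ohe = OneHotEncoder(drop="first", sparse_output=False)
city_encoded = ohe.fit_transform(cities[["city"]])
print(ohe.get_feature_names_out())  # ['city_Delhi' 'city_Mumbai'], Chennai is the baseline
print(city_encoded)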

Train / validation / test splits

Never tune or evaluate your model on the training set.

from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.randn(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First split off a test set, never touch it until final evaluation
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)

# Then split the remainder into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp
)
# 0.176 of 0.85 is approximately 0.15 of the original dataset

print(f"Train: {len(X_train)}  Val: {len(X_val)}  Test: {len(X_test)}")

stratify=y ensures class proportions are preserved in each split. This is critical for imbalanced datasets.
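A quick check on a synthetic imbalanced dataset shows what stratification buys you; both splits end up with roughly the same positive rate:

import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 50 + [0] * 950)        # about 5% positives
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(y_tr.mean(), y_va.mean())           # both approximately 0.05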


Hands-on

Let us build a complete pipeline using the Titanic dataset (a classic ML benchmark). We will load it, clean it, scale it, encode it, and produce ready-to-use splits.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# --- Load ---
# Download from: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv
# Or use seaborn's built-in copy
import seaborn as sns
titanic = sns.load_dataset("titanic")

df = titanic.copy()
print(df.shape)           # (891, 15)
print(df.isnull().sum())

# --- Select features and target ---
features = ["pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]
target   = "survived"

df = df[features + [target]].copy()
print(df.isnull().sum())

# --- Handle missing values ---
df["age"]      = df["age"].fillna(df["age"].median())
df["embarked"] = df["embarked"].fillna(df["embarked"].mode()[0])

# --- Encode categoricals ---
df["sex_enc"]      = (df["sex"] == "male").astype(int)
df["embarked_C"]   = (df["embarked"] == "C").astype(int)
df["embarked_Q"]   = (df["embarked"] == "Q").astype(int)
# S is the baseline (dropped to avoid dummy trap)

df.drop(columns=["sex", "embarked"], inplace=True)
print(df.dtypes)

# --- Split ---
X = df.drop(columns=[target]).values
y = df[target].values

X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.176, random_state=42, stratify=y_temp
)

# --- Scale (fit on train only!) ---
scaler  = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val   = scaler.transform(X_val)
X_test  = scaler.transform(X_test)

print(f"Train: {X_train.shape}  Val: {X_val.shape}  Test: {X_test.shape}")
print("Train mean (should be ~0):", X_train.mean(axis=0).round(2))
print("Test mean (will NOT be exactly 0):", X_test.mean(axis=0).round(2))

Notice that the test set mean is not exactly zero; that is correct. The scaler parameters come only from the training data. Fitting on the full dataset would be data leakage.


Common pitfalls

Fitting the scaler on the full dataset. This leaks test statistics into training. Always call scaler.fit_transform(X_train) and then scaler.transform(X_val) and scaler.transform(X_test).

Dropping rows with missing values without thinking. If you have 20% missing values in a column, dropping those rows removes a fifth of your data and may introduce bias (if the missingness is not random).
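Before dropping anything, measure what you would lose. A minimal sketch, assuming df is the DataFrame you are cleaning:

missing_frac = df.isnull().mean()          # fraction of missing values per column
print(missing_frac.sort_values(ascending=False))
print(f"Rows before dropna: {len(df)}, after: {len(df.dropna())}")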

Encoding ordinal variables with one-hot. If a category has natural order (small < medium < large), one-hot encoding discards that information. Use an ordinal mapping instead.
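A minimal sketch of an ordinal mapping with pandas, using a hypothetical size column:

import pandas as pd

sizes = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# an explicit mapping preserves the order small < medium < large
size_order = {"small": 0, "medium": 1, "large": 2}
sizes["size_enc"] = sizes["size"].map(size_order)
print(sizes)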

Not stratifying the split on imbalanced data. If 5% of samples are class 1, a random split could put no class 1 samples in the validation set. stratify=y prevents this.

Treating date columns as strings. Extract features from dates: year, month, day of week, days since a reference date. A raw date string is useless to most models.
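A sketch of extracting numeric features from a hypothetical signup_date column:

import pandas as pd

dates = pd.DataFrame({
    "signup_date": pd.to_datetime(["2023-01-15", "2023-06-01", "2024-02-29"])
})

dates["year"]       = dates["signup_date"].dt.year
dates["month"]      = dates["signup_date"].dt.month
dates["weekday"]    = dates["signup_date"].dt.dayofweek   # Monday = 0
dates["days_since"] = (pd.Timestamp("2025-01-01") - dates["signup_date"]).dt.days
print(dates)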


What to try next

  • Download the Kaggle House Prices dataset and apply this pipeline to it.
  • Explore sklearn.impute.SimpleImputer and KNNImputer for more sophisticated missing-value strategies.
  • Read about sklearn.compose.ColumnTransformer; it lets you apply different preprocessing to different columns in a single pipeline (see the sketch after this list).
  • Move on to Lesson 4 (Supervised Learning), where you will feed this clean data into actual ML models.
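As a taste of the ColumnTransformer item above, here is a minimal sketch. The column names are assumptions taken from the Titanic example; the per-column steps mirror what this lesson did by hand:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols     = ["age", "fare"]        # hypothetical; adapt to your dataset
categorical_cols = ["sex", "embarked"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale",  StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])
# X_processed = preprocess.fit_transform(df[numeric_cols + categorical_cols])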
