What This Module Covers
Part 2

Raw data can rarely be fed directly into a model. This module covers the preprocessing and engineering steps that transform your DataFrame into model-ready features — and the sklearn Pipeline that makes these steps reproducible, leak-free, and composable.
- Feature engineering — creating new features that capture domain knowledge
- Scaling — StandardScaler, MinMaxScaler, RobustScaler — when and why
- Encoding — label encoding, one-hot encoding, ordinal encoding, target encoding
- Train-test split — stratified split, data leakage, the golden rule
- Cross-validation — K-Fold, StratifiedKFold, leave-one-out
- sklearn Pipeline — chaining preprocessing and model into a single reusable object
💡 The Pipeline is the most important sklearn abstraction. It guarantees that your scaler is fit only on training data (not test data), that your encoder handles unseen categories, and that your entire preprocessing stack can be serialised, deployed, and reloaded in production.
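To make that guarantee concrete before diving in, here is a minimal sketch on synthetic data (toy features and an arbitrary Ridge model, not the module's dataset): one object, one fit call, and the scaler never sees the test set.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1, 100, 1000])   # wildly different scales
y = X @ np.array([1.0, 0.01, 0.001]) + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])
pipe.fit(X_train, y_train)            # scaler statistics come from X_train only
print(f"test R²: {pipe.score(X_test, y_test):.3f}")
```

Every step of this pattern is unpacked in the sections below.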
Feature Engineering — Creating Better Inputs
Domain Knowledge

import pandas as pd
import numpy as np
df = pd.read_csv("house_prices.csv")
# ── Numeric transformations ─────────────────────────
# Log transform: compress right-skewed features
df["SalePrice_log"] = np.log1p(df["SalePrice"])
df["GrLivArea_log"] = np.log1p(df["GrLivArea"])
df["LotArea_log"] = np.log1p(df["LotArea"])
# Square root: lighter compression than log
df["LotFrontage_sqrt"] = np.sqrt(df["LotFrontage"].fillna(0))
# Polynomial features: capture non-linear relationships
df["GrLivArea_sq"] = df["GrLivArea"] ** 2 # quadratic term
# ── Combining features ───────────────────────────────
# Domain insight: total bathrooms = full + half*0.5
df["TotalBath"] = (df["FullBath"] + df["BsmtFullBath"].fillna(0)
+ 0.5 * (df["HalfBath"] + df["BsmtHalfBath"].fillna(0)))
# Total square footage
df["TotalSF"] = (df["TotalBsmtSF"].fillna(0) +
df["1stFlrSF"] + df["2ndFlrSF"])
# House age and remodel age
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
df["RemodelAge"] = df["YrSold"] - df["YearRemodAdd"]
# Has garage? (binary from numeric)
df["HasGarage"] = (df["GarageArea"] > 0).astype(int)
df["HasPool"] = (df["PoolArea"] > 0).astype(int)
df["HasBsmt"] = (df["TotalBsmtSF"].fillna(0) > 0).astype(int)
# ── Interaction features ─────────────────────────────
# Quality × Size: premium-size product captures luxury segment
df["QualArea"] = df["OverallQual"] * df["GrLivArea"]
# ── Binning continuous into ordinal ──────────────────
df["AgeBin"] = pd.cut(df["HouseAge"],
bins=[0, 10, 20, 40, 80, 200],
labels=["New", "Recent", "Middle", "Old", "Very Old"])
# ── Missing value handling ───────────────────────────
# Numeric: fill with median (robust to outliers)
for col in ["LotFrontage", "GarageYrBlt", "MasVnrArea"]:
    df[col] = df[col].fillna(df[col].median())   # avoid inplace=True on a column view
# Categorical: fill with most frequent or "None" string
for col in ["BsmtQual", "BsmtCond", "GarageType", "FireplaceQu"]:
    df[col] = df[col].fillna("None")
print(f"Features created. Shape: {df.shape}")

💡 Feature engineering is where domain expertise translates into model performance. TotalSF (total square footage) outperforms individual floor areas because it captures what buyers actually care about — total usable space. The best features come from asking "what would a human expert look at to value this house?"
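A quick synthetic check of the log1p trick above (lognormal toy data standing in for SalePrice, an assumption rather than the real column): the transform pulls a heavily right-skewed distribution close to symmetric.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=12, sigma=0.5, size=1000))  # right-skewed toy prices

print(f"raw skew:   {prices.skew():.2f}")        # strongly positive
print(f"log1p skew: {np.log1p(prices).skew():.2f}")  # near zero
```

Linear models assume roughly symmetric residuals, which is why this single transform often lifts their scores noticeably.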
Feature Scaling — When, Why, and Which
Preprocessing

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np
# ── StandardScaler: z-score normalisation ────────────
# Transforms each feature to mean=0, std=1
# x_new = (x - mean) / std
# USE WHEN: logistic regression, SVM, neural networks, PCA
# NOT NEEDED: tree-based models (Random Forest, XGBoost) — trees don't need scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit AND transform
X_test_scaled = scaler.transform(X_test) # transform ONLY (never fit on test!)
# ── MinMaxScaler: scale to [0, 1] range ──────────────
# x_new = (x - min) / (max - min)
# USE WHEN: neural networks (bounded activations), KNN
# AVOID: when test data may exceed training range (extrapolation issues)
mm_scaler = MinMaxScaler(feature_range=(0, 1))
X_train_mm = mm_scaler.fit_transform(X_train)
# ── RobustScaler: median + IQR (outlier-robust) ───────
# x_new = (x - median) / IQR
# USE WHEN: data has significant outliers (medical, financial)
# Better than StandardScaler when outliers are present
rob = RobustScaler()
X_robust = rob.fit_transform(X_train)
# ── CRITICAL: fit on train, transform on test ─────────
# WRONG — causes data leakage:
# scaler.fit_transform(X_full) # scaler sees test statistics!
# RIGHT:
scaler.fit(X_train) # learn statistics from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test) # apply those statistics to test
# ── Do tree models need scaling? ──────────────────────
# NO. Decision trees split on thresholds — scale doesn't matter.
# Random Forest, XGBoost, LightGBM: no scaling needed.
# Logistic Regression, SVM, KNN, Neural Nets: MUST scale.
print(f"Before scaling: mean={X_train[:, 0].mean():.1f}, std={X_train[:, 0].std():.1f}")
print(f"After scaling:  mean={X_train_scaled[:, 0].mean():.4f}, std={X_train_scaled[:, 0].std():.4f}")

💡 The most common preprocessing mistake is fitting the scaler on the entire dataset before splitting. If you scale using the test set's statistics, the model has implicitly "seen" the test data — this inflates your validation metrics and your real-world performance will be worse. Always fit preprocessing objects only on training data.
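To see why RobustScaler earns its "outlier-robust" label, a small synthetic comparison (injected toy outliers): StandardScaler's mean and std get dragged by the outliers, squashing the typical points toward a single value, while RobustScaler's median/IQR statistics barely move.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(50, 5, size=1000)
x[:10] = 10_000                       # ten extreme outliers
X = x.reshape(-1, 1)

std_scaled = StandardScaler().fit_transform(X)
rob_scaled = RobustScaler().fit_transform(X)

bulk = x < 1000                       # the non-outlier points
print(f"StandardScaler bulk spread: {np.abs(std_scaled[bulk]).mean():.3f}")
print(f"RobustScaler bulk spread:   {np.abs(rob_scaled[bulk]).mean():.3f}")
```

After StandardScaler, the 990 normal points are compressed into a tiny band because the outliers dominate the mean and std; RobustScaler keeps them spread out on a usable scale.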
Categorical Encoding
Encoding Strategies

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder
# ── 1. One-Hot Encoding (OHE) ─────────────────────────
# Creates N binary columns (or N-1 to avoid multicollinearity)
# USE FOR: nominal categories (no natural order): color, city, genre
# AVOID FOR: high cardinality (>20 categories) — too many columns
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first")
# note: with drop="first", an unknown category encodes identically to the dropped category
X_ohe = ohe.fit_transform(df[["MSZoning", "SaleType"]])
print(f"OHE output shape: {X_ohe.shape}")
print(f"Feature names: {ohe.get_feature_names_out()[:5]}")
# With pandas (simpler for exploration)
df_ohe = pd.get_dummies(df[["MSZoning", "SaleType"]], drop_first=True)
# ── 2. Ordinal Encoding ───────────────────────────────
# Maps categories to integers preserving order
# USE FOR: categories with natural ranking: poor<fair<good<excellent
quality_order = ["None", "Po", "Fa", "TA", "Gd", "Ex"]
ord_enc = OrdinalEncoder(categories=[quality_order], handle_unknown="use_encoded_value",
unknown_value=-1)
df["ExterQual_enc"] = ord_enc.fit_transform(df[["ExterQual"]])
# Manual ordinal mapping (most explicit)
qual_map = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
for col in ["BsmtQual", "KitchenQual", "GarageQual"]:
    df[f"{col}_ord"] = df[col].map(qual_map).fillna(0)
# ── 3. Target Encoding (mean encoding) ────────────────
# Replace each category with the mean target value for that category
# USE FOR: high-cardinality categoricals (Neighborhood: 25 values)
# MUST use cross-validation to avoid target leakage
from sklearn.preprocessing import TargetEncoder # sklearn >= 1.3
te = TargetEncoder(cv=5, smooth="auto") # cv=5 prevents leakage
X_te = te.fit_transform(df[["Neighborhood"]], df["SalePrice"])
# Manual with K-fold (scikit-learn < 1.3)
from sklearn.model_selection import KFold
def target_encode_kfold(df, cat_col, target_col, n_splits=5):
    result = pd.Series(index=df.index, dtype=float)   # float series for out-of-fold encodings
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        result.iloc[val_idx] = df.iloc[val_idx][cat_col].map(means).to_numpy()
    return result.fillna(df[target_col].mean())       # unseen categories get the global mean
df["Neighborhood_te"] = target_encode_kfold(df, "Neighborhood", "SalePrice")
# ── Summary: when to use what ─────────────────────────
# OHE: nominal, low cardinality (<20 categories)
# Ordinal: ordered categories (quality ratings)
# Target: high cardinality when tree models are used

Train-Test Split and Data Leakage
Critical Concept

from sklearn.model_selection import train_test_split
import pandas as pd
# ── Basic split ───────────────────────────────────────
X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 80% train, 20% test
random_state=42 # reproducible split
)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")
# ── Stratified split for classification ──────────────
# Ensures class distribution is preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
X, y_class,
test_size=0.2,
stratify=y_class, # preserve class proportions
random_state=42
)
# Check class balance is preserved:
print("Train:", y_train.value_counts(normalize=True))
print("Test: ", y_test.value_counts(normalize=True))
# ── Data leakage — THE most important concept ─────────
# Leakage: test-set information contaminating training
# Sources:
# 1. Scaling on full dataset before split
# 2. Imputing with full-dataset statistics before split
# 3. Feature engineering using future data
# 4. Selecting features based on test-set correlation
# ── How to check for leakage ──────────────────────────
# Suspiciously high train accuracy (>99%) with low test accuracy
# Features with correlation >0.95 to target (might be derived from target)
# Model performance "too good to be true"
# Check for suspicious correlations
corr = df.corr(numeric_only=True)["SalePrice"].sort_values(ascending=False)
suspicious = corr[corr > 0.95].drop("SalePrice")
if len(suspicious):
    print("WARNING: Potentially leaky features:")
    print(suspicious)

⚠️ The Golden Rule: your test set is a time capsule from the future. You are not allowed to look at it until final evaluation. Never fit your scaler, imputer, or encoder on the full dataset — always fit on training data only, then apply to test. Using the test set at any point during preprocessing inflates your estimates of generalisation performance.
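Leakage source 1 from the list above can be demonstrated in a few lines of synthetic data (a deliberately shifted segment, split without shuffling so the shift lands in the test set): the leaky scaler's learned mean is visibly pulled toward the test distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, 800), rng.normal(5, 1, 200)]).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.2, shuffle=False)  # test = the shifted tail

leaky = StandardScaler().fit(X)        # WRONG: fit on the full dataset
clean = StandardScaler().fit(X_train)  # RIGHT: fit on training data only

print(f"leaky mean: {leaky.mean_[0]:.2f}")   # pulled toward the test distribution
print(f"clean mean: {clean.mean_[0]:.2f}")   # reflects training data alone
```

The leaky scaler has smuggled information about the test set into every transformed training row, which is exactly what inflates validation scores.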
Cross-Validation — Reliable Model Evaluation
Evaluation

from sklearn.model_selection import (cross_val_score, KFold,
StratifiedKFold, cross_validate)
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
model = Ridge(alpha=1.0)
# ── K-Fold cross-validation ───────────────────────────
# Train/eval k times on different non-overlapping folds
# Final score = mean ± std over k folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_num, y, cv=kf, scoring="neg_root_mean_squared_error")  # X_num: numeric features
rmse_scores = -scores # negate back to positive
print(f"CV RMSE: {rmse_scores.mean():.0f} ± {rmse_scores.std():.0f}")
# ── Multiple metrics at once ──────────────────────────
results = cross_validate(model, X_num, y, cv=5, scoring={
"r2": "r2",
"rmse": "neg_root_mean_squared_error",
"mae": "neg_mean_absolute_error",
}, return_train_score=True)
print(f"CV R²: {results['test_r2'].mean():.3f}")
print(f"CV RMSE: {-results['test_rmse'].mean():.0f}")
# Compare train vs test: if train >> test, you are overfitting
# ── StratifiedKFold for classification ────────────────
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(classifier, X, y_class, cv=skf, scoring="f1_macro")
print(f"Stratified CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
# ── CV INSIDE a Pipeline (correct) ────────────────────
# Pipeline ensures scaler is fit on training folds ONLY
pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge(alpha=1.0))])
cv_scores = cross_val_score(pipe, X_num, y, cv=5, scoring="r2")
print(f"Pipeline CV R²: {cv_scores.mean():.3f}")

💡 K=5 is the standard choice: each model trains on 80% of the data, so the pessimistic bias is small. K=10 gives slightly better estimates but takes 2× longer. K=3 is faster but noisier. For small datasets (<500 rows), consider leave-one-out (LOOCV). Always look at the standard deviation — a mean of 0.85 ± 0.01 is far more reliable than 0.85 ± 0.12.
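The leave-one-out option mentioned in the tip plugs into the same API; a sketch on a small synthetic regression (arbitrary toy data; note that R² is undefined on single-sample folds, so MAE is used as the metric):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))                 # tiny dataset: LOOCV territory
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=60)

scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")  # one model fit per sample
print(f"LOOCV MAE: {-scores.mean():.3f} over {len(scores)} folds")
```

LOOCV fits n models, so it scales poorly; keep it for datasets small enough that K=5 would leave each fold too thin.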
sklearn Pipeline — The Production Pattern
Best Practice

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
import pandas as pd
import joblib
# ── Define column types ───────────────────────────────
numeric_cols = ["GrLivArea", "TotalBsmtSF", "OverallQual", "HouseAge", "TotalBath"]
low_card_cat = ["MSZoning", "SaleType", "HeatingQC"]
high_card_cat = ["Neighborhood"]
# ── Numeric pipeline: impute then scale ───────────────
numeric_transformer = Pipeline([
("imputer", SimpleImputer(strategy="median")), # fill NaN with column median
("scaler", RobustScaler()), # scale robust to outliers
])
# ── Categorical pipeline: impute then encode ─────────
categorical_transformer = Pipeline([
("imputer", SimpleImputer(strategy="most_frequent")), # fill NaN with mode
("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first")),
])
# ── Combine with ColumnTransformer ────────────────────
preprocessor = ColumnTransformer(transformers=[
("num", numeric_transformer, numeric_cols),
("cat", categorical_transformer, low_card_cat),
], remainder="drop")
# ── Full Pipeline: preprocessing + model ─────────────
pipeline = Pipeline([
("preprocessor", preprocessor),
("model", Ridge(alpha=10.0)),
])
# ── Train the whole thing ─────────────────────────────
pipeline.fit(X_train, y_train) # fits preprocessors on X_train, trains model
# ── Evaluate ──────────────────────────────────────────
from sklearn.metrics import root_mean_squared_error, r2_score  # root_mean_squared_error: sklearn >= 1.4
y_pred = pipeline.predict(X_test)
print(f"RMSE: {root_mean_squared_error(y_test, y_pred):,.0f}")
print(f"R²: {r2_score(y_test, y_pred):.4f}")
# ── Cross-validate the whole Pipeline ─────────────────
from sklearn.model_selection import cross_val_score
cv_r2 = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="r2")
print(f"CV R²: {cv_r2.mean():.3f} ± {cv_r2.std():.3f}")
# ── Save and reload ───────────────────────────────────
joblib.dump(pipeline, "house_price_pipeline.pkl")
loaded = joblib.load("house_price_pipeline.pkl")
print(f"Loaded pipeline prediction: {loaded.predict(X_test[:1])[0]:,.0f}")

💡 A Pipeline is not just convenience — it is correctness. Without a Pipeline, you will accidentally leak preprocessing statistics. With a Pipeline, calling pipeline.fit(X_train, y_train) fits your scaler, imputer, and encoder only on X_train. Calling pipeline.predict(X_test) applies those learned transformations without refitting. This is the only correct way to build a preprocessing + model stack.
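One further payoff of this pattern, sketched here on synthetic data rather than the house dataset: GridSearchCV can tune any parameter inside the Pipeline via "step__parameter" names, and the preprocessing is refit on each training fold automatically, so the search itself stays leak-free.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=300)

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])

# "<step name>__<parameter>" reaches inside the Pipeline
grid = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0, 10.0, 100.0]},
                    cv=5, scoring="r2")
grid.fit(X, y)
print(f"best alpha: {grid.best_params_['model__alpha']}, CV R²: {grid.best_score_:.3f}")
```

The same naming scheme works for nested steps, e.g. "preprocessor__num__imputer__strategy" in the ColumnTransformer built above.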
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Course | Kaggle Intermediate ML Course (Free) — kaggle.com/learn/intermediate-machine-learning | Best coverage of Pipelines, missing values, categorical encoding, and data leakage. |
| Docs | Scikit-learn Preprocessing Guide — scikit-learn.org/stable/modules/preprocessing.html | Complete reference for all sklearn scalers, encoders, and transformers with examples. |
| Docs | Scikit-learn ColumnTransformer — scikit-learn.org/stable/modules/compose.html | Official guide on combining multiple transformers with ColumnTransformer and Pipeline. |
| Course | Kaggle Feature Engineering Course — kaggle.com/learn/feature-engineering | Mutual information, target encoding, and creating features. Practical exercises. |
| Video | StatQuest — Cross-Validation and Pipeline (YouTube) | Visual explanation of K-Fold cross-validation and why it matters. Clear and memorable. |
| Article | Data Leakage in Machine Learning — machinelearningmastery.com | Comprehensive guide to all types of data leakage with examples. Essential reading. |
Build a complete preprocessing + Ridge regression pipeline for the House Prices dataset.
Requirements
- Feature engineering — create TotalSF, HouseAge, TotalBath, HasGarage, QualArea
- Missing values — impute numeric with median, categorical with "None" or mode
- Encoding — OHE for low-cardinality nominals, ordinal encoding for quality columns
- Scaling — RobustScaler on numeric features
- Pipeline — ColumnTransformer + Ridge(alpha=10) in a single Pipeline object
- Evaluation — 5-fold CV reporting mean RMSE ± std
- Target transform — fit on log1p(SalePrice), invert with np.expm1() for final predictions
Expected CV RMSE: ~$25,000–$30,000 on raw price. Compare: what RMSE do you get on log price?
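For the target-transform requirement, one convenient option (a suggestion, not prescribed by the project) is sklearn's TransformedTargetRegressor, which applies log1p before fitting and expm1 after predicting automatically; a sketch on synthetic skewed targets:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.expm1(X @ np.array([0.5, 0.3, 0.2]) + 10 + rng.normal(scale=0.1, size=500))

model = TransformedTargetRegressor(regressor=Ridge(alpha=1.0),
                                   func=np.log1p, inverse_func=np.expm1)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")  # scored on the original scale
print(f"CV R² on the raw target scale: {scores.mean():.3f}")
```

Because the inversion happens inside predict, your CV metrics are computed on the raw price scale with no manual expm1 bookkeeping.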
Scaling Comparison
Encoding Comparison
Pipeline Build
P2-M06 MASTERY CHECKLIST
- Can create new numeric features: log transforms, square root, polynomial, interaction terms
- Can combine columns to create domain-meaningful features (TotalSF, TotalBath, HouseAge)
- Can apply SimpleImputer with median (numeric) and most_frequent (categorical) strategies
- Know when to use StandardScaler vs MinMaxScaler vs RobustScaler
- Know that tree models (Random Forest, XGBoost) do NOT need scaling
- Know the Golden Rule: fit preprocessing ONLY on training data, never on full dataset
- Can apply OHE for nominal and ordinal encoding for ordered categories
- Can apply target encoding with K-fold to avoid leakage on high-cardinality columns
- Can perform stratified train-test split with correct random_state for reproducibility
- Can run K-Fold cross-validation and report mean ± std for multiple metrics
- Can build a Pipeline combining ColumnTransformer preprocessor + model
- Can save a Pipeline with joblib.dump() and reload with joblib.load()
- Can cross-validate the entire Pipeline (guarantees leak-free evaluation)
- Completed project: House Prices preprocessing pipeline with 5-fold CV RMSE reported
✅ When complete: Move to P3-M07 — Regression: linear, ridge, lasso, and polynomial regression — your first predictive models.