What This Module Covers
Part 2

Raw data can rarely be fed directly into a model. This module covers the preprocessing and engineering steps that transform your DataFrame into model-ready features — and the sklearn Pipeline that makes these steps reproducible, leak-free, and composable.
- Feature engineering — creating new features that capture domain knowledge
- Scaling — StandardScaler, MinMaxScaler, RobustScaler — when and why
- Encoding — label encoding, one-hot encoding, ordinal encoding, target encoding
- Train-test split — stratified split, data leakage, the golden rule
- Cross-validation — K-Fold, StratifiedKFold, leave-one-out
- sklearn Pipeline — chaining preprocessing and model into a single reusable object
💡 The Pipeline is the most important sklearn abstraction. It guarantees that your scaler is fit only on training data (not test data), that your encoder handles unseen categories, and that your entire preprocessing stack can be serialised, deployed, and reloaded in production.
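To make that guarantee concrete before diving in, here is a minimal sketch on synthetic data (toy features and an arbitrary Ridge model, not the module's dataset): one object, one fit call, and the scaler never sees the test set.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1, 100, 1000])   # wildly different scales
y = X @ np.array([1.0, 0.01, 0.001]) + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])
pipe.fit(X_train, y_train)            # scaler statistics come from X_train only
print(f"test R²: {pipe.score(X_test, y_test):.3f}")
```

Every step of this pattern is unpacked in the sections below.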
Feature Engineering — Creating Better Inputs
Domain Knowledge

import pandas as pd
import numpy as np
df = pd.read_csv("house_prices.csv")
# ── Numeric transformations ─────────────────────────
# Log transform: compress right-skewed features
df["SalePrice_log"] = np.log1p(df["SalePrice"])
df["GrLivArea_log"] = np.log1p(df["GrLivArea"])
df["LotArea_log"] = np.log1p(df["LotArea"])
# Square root: lighter compression than log
df["LotFrontage_sqrt"] = np.sqrt(df["LotFrontage"].fillna(0))
# Polynomial features: capture non-linear relationships
df["GrLivArea_sq"] = df["GrLivArea"] ** 2 # quadratic term
# ── Combining features ───────────────────────────────
# Domain insight: total bathrooms = full + half*0.5
df["TotalBath"] = (df["FullBath"] + df["BsmtFullBath"].fillna(0)
+ 0.5 * (df["HalfBath"] + df["BsmtHalfBath"].fillna(0)))
# Total square footage
df["TotalSF"] = (df["TotalBsmtSF"].fillna(0) +
df["1stFlrSF"] + df["2ndFlrSF"])
# House age and remodel age
df["HouseAge"] = df["YrSold"] - df["YearBuilt"]
df["RemodelAge"] = df["YrSold"] - df["YearRemodAdd"]
# Has garage? (binary from numeric)
df["HasGarage"] = (df["GarageArea"] > 0).astype(int)
df["HasPool"] = (df["PoolArea"] > 0).astype(int)
df["HasBsmt"] = (df["TotalBsmtSF"].fillna(0) > 0).astype(int)
# ── Interaction features ─────────────────────────────
# Quality × Size: premium-size product captures luxury segment
df["QualArea"] = df["OverallQual"] * df["GrLivArea"]
# ── Binning continuous into ordinal ──────────────────
df["AgeBin"] = pd.cut(df["HouseAge"],
bins=[0, 10, 20, 40, 80, 200],
labels=["New", "Recent", "Middle", "Old", "Very Old"])
# ── Missing value handling ───────────────────────────
# Numeric: fill with median (robust to outliers)
for col in ["LotFrontage", "GarageYrBlt", "MasVnrArea"]:
    df[col] = df[col].fillna(df[col].median())   # avoid inplace=True on a column view
# Categorical: fill with most frequent or "None" string
for col in ["BsmtQual", "BsmtCond", "GarageType", "FireplaceQu"]:
    df[col] = df[col].fillna("None")
print(f"Features created. Shape: {df.shape}")

💡 Feature engineering is where domain expertise translates into model performance. TotalSF (total square footage) outperforms individual floor areas because it captures what buyers actually care about — total usable space. The best features come from asking "what would a human expert look at to value this house?"
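A quick synthetic check of the log1p trick above (lognormal toy data standing in for SalePrice, an assumption rather than the real column): the transform pulls a heavily right-skewed distribution close to symmetric.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
prices = pd.Series(rng.lognormal(mean=12, sigma=0.5, size=1000))  # right-skewed toy prices

print(f"raw skew:   {prices.skew():.2f}")        # strongly positive
print(f"log1p skew: {np.log1p(prices).skew():.2f}")  # near zero
```

Linear models assume roughly symmetric residuals, which is why this single transform often lifts their scores noticeably.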
Feature Scaling — When, Why, and Which
Preprocessing

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np
# ── StandardScaler: z-score normalisation ────────────
# Transforms each feature to mean=0, std=1
# x_new = (x - mean) / std
# USE WHEN: logistic regression, SVM, neural networks, PCA
# NOT NEEDED: tree-based models (Random Forest, XGBoost) — trees don't need scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # fit AND transform
X_test_scaled = scaler.transform(X_test) # transform ONLY (never fit on test!)
# ── MinMaxScaler: scale to [0, 1] range ──────────────
# x_new = (x - min) / (max - min)
# USE WHEN: neural networks (bounded activations), KNN
# AVOID: when test data may exceed training range (extrapolation issues)
mm_scaler = MinMaxScaler(feature_range=(0, 1))
X_train_mm = mm_scaler.fit_transform(X_train)
# ── RobustScaler: median + IQR (outlier-robust) ───────
# x_new = (x - median) / IQR
# USE WHEN: data has significant outliers (medical, financial)
# Better than StandardScaler when outliers are present
rob = RobustScaler()
X_robust = rob.fit_transform(X_train)
# ── CRITICAL: fit on train, transform on test ─────────
# WRONG — causes data leakage:
# scaler.fit_transform(X_full) # scaler sees test statistics!
# RIGHT:
scaler.fit(X_train) # learn statistics from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test) # apply those statistics to test
# ── Do tree models need scaling? ──────────────────────
# NO. Decision trees split on thresholds — scale doesn't matter.
# Random Forest, XGBoost, LightGBM: no scaling needed.
# Logistic Regression, SVM, KNN, Neural Nets: MUST scale.
print(f"Before scaling: mean={X_train[:, 0].mean():.1f}, std={X_train[:, 0].std():.1f}")
print(f"After scaling:  mean={X_train_scaled[:, 0].mean():.4f}, std={X_train_scaled[:, 0].std():.4f}")

💡 The most common preprocessing mistake is fitting the scaler on the entire dataset before splitting. If you scale using the test set's statistics, the model has implicitly "seen" the test data — this inflates your validation metrics and your real-world performance will be worse. Always fit preprocessing objects only on training data.
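To see why RobustScaler earns its "outlier-robust" label, a small synthetic comparison (injected toy outliers): StandardScaler's mean and std get dragged by the outliers, squashing the typical points toward a single value, while RobustScaler's median/IQR statistics barely move.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

rng = np.random.default_rng(0)
x = rng.normal(50, 5, size=1000)
x[:10] = 10_000                       # ten extreme outliers
X = x.reshape(-1, 1)

std_scaled = StandardScaler().fit_transform(X)
rob_scaled = RobustScaler().fit_transform(X)

bulk = x < 1000                       # the non-outlier points
print(f"StandardScaler bulk spread: {np.abs(std_scaled[bulk]).mean():.3f}")
print(f"RobustScaler bulk spread:   {np.abs(rob_scaled[bulk]).mean():.3f}")
```

After StandardScaler, the 990 normal points are compressed into a tiny band because the outliers dominate the mean and std; RobustScaler keeps them spread out on a usable scale.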
Categorical Encoding
Encoding Strategies

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder
# ── 1. One-Hot Encoding (OHE) ─────────────────────────
# Creates N binary columns (or N-1 to avoid multicollinearity)
# USE FOR: nominal categories (no natural order): color, city, genre
# AVOID FOR: high cardinality (>20 categories) — too many columns
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first")
# note: with drop="first", an unknown category encodes identically to the dropped category
X_ohe = ohe.fit_transform(df[["MSZoning", "SaleType"]])
print(f"OHE output shape: {X_ohe.shape}")
print(f"Feature names: {ohe.get_feature_names_out()[:5]}")
# With pandas (simpler for exploration)
df_ohe = pd.get_dummies(df[["MSZoning", "SaleType"]], drop_first=True)
# ── 2. Ordinal Encoding ───────────────────────────────
# Maps categories to integers preserving order
# USE FOR: categories with natural ranking: poor<fair<good<excellent
quality_order = ["None", "Po", "Fa", "TA", "Gd", "Ex"]
ord_enc = OrdinalEncoder(categories=[quality_order], handle_unknown="use_encoded_value",
unknown_value=-1)
df["ExterQual_enc"] = ord_enc.fit_transform(df[["ExterQual"]])
# Manual ordinal mapping (most explicit)
qual_map = {"None": 0, "Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}
for col in ["BsmtQual", "KitchenQual", "GarageQual"]:
    df[f"{col}_ord"] = df[col].map(qual_map).fillna(0)
# ── 3. Target Encoding (mean encoding) ────────────────
# Replace each category with the mean target value for that category
# USE FOR: high-cardinality categoricals (Neighborhood: 25 values)
# MUST use cross-validation to avoid target leakage
from sklearn.preprocessing import TargetEncoder # sklearn >= 1.3
te = TargetEncoder(cv=5, smooth="auto") # cv=5 prevents leakage
X_te = te.fit_transform(df[["Neighborhood"]], df["SalePrice"])
# Manual with K-fold (scikit-learn < 1.3)
from sklearn.model_selection import KFold
def target_encode_kfold(df, cat_col, target_col, n_splits=5):
    result = pd.Series(index=df.index, dtype=float)   # float series for out-of-fold encodings
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        means = df.iloc[train_idx].groupby(cat_col)[target_col].mean()
        result.iloc[val_idx] = df.iloc[val_idx][cat_col].map(means).to_numpy()
    return result.fillna(df[target_col].mean())       # unseen categories get the global mean
df["Neighborhood_te"] = target_encode_kfold(df, "Neighborhood", "SalePrice")
# ── Summary: when to use what ─────────────────────────
# OHE: nominal, low cardinality (<20 categories)
# Ordinal: ordered categories (quality ratings)
# Target: high cardinality when tree models are used

Train-Test Split and Data Leakage
Critical Concept

from sklearn.model_selection import train_test_split
import pandas as pd
# ── Basic split ───────────────────────────────────────
X = df.drop(columns=["SalePrice"])
y = df["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2, # 80% train, 20% test
random_state=42 # reproducible split
)
print(f"Train: {len(X_train)}, Test: {len(X_test)}")
# ── Stratified split for classification ──────────────
# Ensures class distribution is preserved in both sets
X_train, X_test, y_train, y_test = train_test_split(
X, y_class,
test_size=0.2,
stratify=y_class, # preserve class proportions
random_state=42
)
# Check class balance is preserved:
print("Train:", y_train.value_counts(normalize=True))
print("Test: ", y_test.value_counts(normalize=True))
# ── Data leakage — THE most important concept ─────────
# Leakage: test-set information contaminating training
# Sources:
# 1. Scaling on full dataset before split
# 2. Imputing with full-dataset statistics before split
# 3. Feature engineering using future data
# 4. Selecting features based on test-set correlation
# ── How to check for leakage ──────────────────────────
# Suspiciously high train accuracy (>99%) with low test accuracy
# Features with correlation >0.95 to target (might be derived from target)
# Model performance "too good to be true"
# Check for suspicious correlations
corr = df.corr(numeric_only=True)["SalePrice"].sort_values(ascending=False)
suspicious = corr[corr > 0.95].drop("SalePrice")
if len(suspicious):
    print("WARNING: Potentially leaky features:")
    print(suspicious)

⚠️ The Golden Rule: your test set is a time capsule from the future. You are not allowed to look at it until final evaluation. Never fit your scaler, imputer, or encoder on the full dataset — always fit on training data only, then apply to test. Using the test set at any point during preprocessing inflates your estimates of generalisation performance.
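Leakage source 1 from the list above can be demonstrated in a few lines of synthetic data (a deliberately shifted segment, split without shuffling so the shift lands in the test set): the leaky scaler's learned mean is visibly pulled toward the test distribution.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, 800), rng.normal(5, 1, 200)]).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.2, shuffle=False)  # test = the shifted tail

leaky = StandardScaler().fit(X)        # WRONG: fit on the full dataset
clean = StandardScaler().fit(X_train)  # RIGHT: fit on training data only

print(f"leaky mean: {leaky.mean_[0]:.2f}")   # pulled toward the test distribution
print(f"clean mean: {clean.mean_[0]:.2f}")   # reflects training data alone
```

The leaky scaler has smuggled information about the test set into every transformed training row, which is exactly what inflates validation scores.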
Cross-Validation — Reliable Model Evaluation
Evaluation

from sklearn.model_selection import (cross_val_score, KFold,
StratifiedKFold, cross_validate)
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import numpy as np
model = Ridge(alpha=1.0)
# ── K-Fold cross-validation ───────────────────────────
# Train/eval k times on different non-overlapping folds
# Final score = mean ± std over k folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X_num, y, cv=kf, scoring="neg_root_mean_squared_error")  # X_num: numeric features
rmse_scores = -scores # negate back to positive
print(f"CV RMSE: {rmse_scores.mean():.0f} ± {rmse_scores.std():.0f}")
# ── Multiple metrics at once ──────────────────────────
results = cross_validate(model, X_num, y, cv=5, scoring={
"r2": "r2",
"rmse": "neg_root_mean_squared_error",
"mae": "neg_mean_absolute_error",
}, return_train_score=True)
print(f"CV R²: {results['test_r2'].mean():.3f}")
print(f"CV RMSE: {-results['test_rmse'].mean():.0f}")
# Compare train vs test: if train >> test, you are overfitting
# ── StratifiedKFold for classification ────────────────
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(classifier, X, y_class, cv=skf, scoring="f1_macro")
print(f"Stratified CV F1: {scores.mean():.3f} ± {scores.std():.3f}")
# ── CV INSIDE a Pipeline (correct) ────────────────────
# Pipeline ensures scaler is fit on training folds ONLY
pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge(alpha=1.0))])
cv_scores = cross_val_score(pipe, X_num, y, cv=5, scoring="r2")
print(f"Pipeline CV R²: {cv_scores.mean():.3f}")

💡 K=5 is the standard choice: each model trains on 80% of the data, so the pessimistic bias is small. K=10 gives slightly better estimates but takes 2× longer. K=3 is faster but noisier. For small datasets (<500 rows), consider leave-one-out (LOOCV). Always look at the standard deviation — a mean of 0.85 ± 0.01 is far more reliable than 0.85 ± 0.12.
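The leave-one-out option mentioned in the tip plugs into the same API; a sketch on a small synthetic regression (arbitrary toy data; note that R² is undefined on single-sample folds, so MAE is used as the metric):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))                 # tiny dataset: LOOCV territory
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=60)

scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_absolute_error")  # one model fit per sample
print(f"LOOCV MAE: {-scores.mean():.3f} over {len(scores)} folds")
```

LOOCV fits n models, so it scales poorly; keep it for datasets small enough that K=5 would leave each fold too thin.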
sklearn Pipeline — The Production Pattern
Best Practice

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, RobustScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
import pandas as pd
import joblib
# ── Define column types ───────────────────────────────
numeric_cols = ["GrLivArea", "TotalBsmtSF", "OverallQual", "HouseAge", "TotalBath"]
low_card_cat = ["MSZoning", "SaleType", "HeatingQC"]
high_card_cat = ["Neighborhood"]
# ── Numeric pipeline: impute then scale ───────────────
numeric_transformer = Pipeline([
("imputer", SimpleImputer(strategy="median")), # fill NaN with column median
("scaler", RobustScaler()), # scale robust to outliers
])
# ── Categorical pipeline: impute then encode ─────────
categorical_transformer = Pipeline([
("imputer", SimpleImputer(strategy="most_frequent")), # fill NaN with mode
("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first")),
])
# ── Combine with ColumnTransformer ────────────────────
preprocessor = ColumnTransformer(transformers=[
("num", numeric_transformer, numeric_cols),
("cat", categorical_transformer, low_card_cat),
], remainder="drop")
# ── Full Pipeline: preprocessing + model ─────────────
pipeline = Pipeline([
("preprocessor", preprocessor),
("model", Ridge(alpha=10.0)),
])
# ── Train the whole thing ─────────────────────────────
pipeline.fit(X_train, y_train) # fits preprocessors on X_train, trains model
# ── Evaluate ──────────────────────────────────────────
from sklearn.metrics import root_mean_squared_error, r2_score  # root_mean_squared_error: sklearn >= 1.4
y_pred = pipeline.predict(X_test)
print(f"RMSE: {root_mean_squared_error(y_test, y_pred):,.0f}")
print(f"R²: {r2_score(y_test, y_pred):.4f}")
# ── Cross-validate the whole Pipeline ─────────────────
from sklearn.model_selection import cross_val_score
cv_r2 = cross_val_score(pipeline, X_train, y_train, cv=5, scoring="r2")
print(f"CV R²: {cv_r2.mean():.3f} ± {cv_r2.std():.3f}")
# ── Save and reload ───────────────────────────────────
joblib.dump(pipeline, "house_price_pipeline.pkl")
loaded = joblib.load("house_price_pipeline.pkl")
print(f"Loaded pipeline prediction: {loaded.predict(X_test[:1])[0]:,.0f}")

💡 A Pipeline is not just convenience — it is correctness. Without a Pipeline, you will accidentally leak preprocessing statistics. With a Pipeline, calling pipeline.fit(X_train, y_train) fits your scaler, imputer, and encoder only on X_train. Calling pipeline.predict(X_test) applies those learned transformations without refitting. This is the only correct way to build a preprocessing + model stack.
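One further payoff of this pattern, sketched here on synthetic data rather than the house dataset: GridSearchCV can tune any parameter inside the Pipeline via "step__parameter" names, and the preprocessing is refit on each training fold automatically, so the search itself stays leak-free.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.5, size=300)

pipe = Pipeline([("scaler", StandardScaler()), ("model", Ridge())])

# "<step name>__<parameter>" reaches inside the Pipeline
grid = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0, 10.0, 100.0]},
                    cv=5, scoring="r2")
grid.fit(X, y)
print(f"best alpha: {grid.best_params_['model__alpha']}, CV R²: {grid.best_score_:.3f}")
```

The same naming scheme works for nested steps, e.g. "preprocessor__num__imputer__strategy" in the ColumnTransformer built above.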
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Course | Kaggle Intermediate ML Course (Free) — kaggle.com/learn/intermediate-machine-learning | Best coverage of Pipelines, missing values, categorical encoding, and data leakage. |
| Docs | Scikit-learn Preprocessing Guide — scikit-learn.org/stable/modules/preprocessing.html | Complete reference for all sklearn scalers, encoders, and transformers with examples. |
| Docs | Scikit-learn ColumnTransformer — scikit-learn.org/stable/modules/compose.html | Official guide on combining multiple transformers with ColumnTransformer and Pipeline. |
| Course | Kaggle Feature Engineering Course — kaggle.com/learn/feature-engineering | Mutual information, target encoding, and creating features. Practical exercises. |
| Video | StatQuest — Cross-Validation and Pipeline (YouTube) | Visual explanation of K-Fold cross-validation and why it matters. Clear and memorable. |
| Article | Data Leakage in Machine Learning — machinelearningmastery.com | Comprehensive guide to all types of data leakage with examples. Essential reading. |
Build a complete preprocessing + Ridge regression pipeline for the House Prices dataset.
Requirements
- Feature engineering — create TotalSF, HouseAge, TotalBath, HasGarage, QualArea
- Missing values — impute numeric with median, categorical with "None" or mode
- Encoding — OHE for low-cardinality nominals, ordinal encoding for quality columns
- Scaling — RobustScaler on numeric features
- Pipeline — ColumnTransformer + Ridge(alpha=10) in a single Pipeline object
- Evaluation — 5-fold CV reporting mean RMSE ± std
- Target transform — fit on log1p(SalePrice), invert with np.expm1() for final predictions
Expected CV RMSE: ~$25,000–$30,000 on raw price. Compare: what RMSE do you get on log price?
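For the target-transform requirement, one convenient option (a suggestion, not prescribed by the project) is sklearn's TransformedTargetRegressor, which applies log1p before fitting and expm1 after predicting automatically; a sketch on synthetic skewed targets:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.expm1(X @ np.array([0.5, 0.3, 0.2]) + 10 + rng.normal(scale=0.1, size=500))

model = TransformedTargetRegressor(regressor=Ridge(alpha=1.0),
                                   func=np.log1p, inverse_func=np.expm1)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")  # scored on the original scale
print(f"CV R² on the raw target scale: {scores.mean():.3f}")
```

Because the inversion happens inside predict, your CV metrics are computed on the raw price scale with no manual expm1 bookkeeping.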
Scaling Comparison
Encoding Comparison
Pipeline Build
P2-M06 MASTERY CHECKLIST
- Can create new numeric features: log transforms, square root, polynomial, interaction terms
- Can combine columns to create domain-meaningful features (TotalSF, TotalBath, HouseAge)
- Can apply SimpleImputer with median (numeric) and most_frequent (categorical) strategies
- Know when to use StandardScaler vs MinMaxScaler vs RobustScaler
- Know that tree models (Random Forest, XGBoost) do NOT need scaling
- Know the Golden Rule: fit preprocessing ONLY on training data, never on full dataset
- Can apply OHE for nominal and ordinal encoding for ordered categories
- Can apply target encoding with K-fold to avoid leakage on high-cardinality columns
- Can perform stratified train-test split with correct random_state for reproducibility
- Can run K-Fold cross-validation and report mean ± std for multiple metrics
- Can build a Pipeline combining ColumnTransformer preprocessor + model
- Can save a Pipeline with joblib.dump() and reload with joblib.load()
- Can cross-validate the entire Pipeline (guarantees leak-free evaluation)
- Completed project: House Prices preprocessing pipeline with 5-fold CV RMSE reported
✅ When complete: Move to P3-M07 — Regression: linear, ridge, lasso, and polynomial regression — your first predictive models.