Part 3 — Classical ML  ·  Module 9 of 28
Ensembles: XGBoost, LightGBM, SMOTE & Optuna
Gradient boosting, stacking, imbalanced data strategies, and automated hyperparameter optimisation
⏱ 2 Weeks 🟡 Intermediate–Advanced 🔧 xgboost · lightgbm · optuna · imbalanced-learn 📋 Prerequisite: P3-M08
🎯

What This Module Covers


Gradient boosting models (XGBoost, LightGBM, CatBoost) dominate Kaggle structured-data competitions. This module covers these industry-standard tools, automated hyperparameter tuning with Optuna, advanced imbalanced-data strategies, and ensemble stacking.

  • XGBoost — gradient boosting, regularisation, early stopping, sklearn API
  • LightGBM — leaf-wise growth, categorical support, faster than XGBoost
  • Optuna — automated hyperparameter search with Bayesian optimisation
  • SMOTE variants — SMOTE, ADASYN, SMOTETomek, BorderlineSMOTE
  • Stacking and blending — combining model predictions as meta-features
  • Advanced SHAP — interaction values, dependence plots, force plots

💡 XGBoost is the starting model for nearly every structured-data ML problem. It handles missing values natively, is robust to outliers, requires minimal preprocessing (no scaling needed), and is fast. If XGBoost doesn't beat your baseline significantly, your problem may need feature engineering rather than a more complex model.
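A quick way to see the native missing-value handling: a minimal sketch on synthetic data (unrelated to any dataset in this module).

import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X[rng.random(X.shape) < 0.2] = np.nan        # punch 20% holes, no imputation step
XGBRegressor(n_estimators=50).fit(X, y)      # trains fine; each split learns a default branch for NaN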

🚀

XGBoost — Gradient Boosting Deep Dive

Industry Standard
import xgboost as xgb
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import root_mean_squared_error, roc_auc_score
import pandas as pd, numpy as np

# ── How gradient boosting works ───────────────────────
# 1. Fit a shallow tree (weak learner) to the data
# 2. Compute residuals: where did the model err?
# 3. Fit NEXT tree to the residuals (learns from mistakes)
# 4. Add this tree to the ensemble with a learning rate
# 5. Repeat N times (n_estimators)
# Final prediction = sum of all tree outputs
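
# ── The same loop written by hand (illustrative sketch; assumes X_train, y_train exist) ─
from sklearn.tree import DecisionTreeRegressor
pred = np.full(len(y_train), float(y_train.mean()))    # step 0: start from the mean prediction
trees, lr = [], 0.1
for _ in range(100):
    residuals = y_train - pred                          # step 2: where did the model err?
    tree = DecisionTreeRegressor(max_depth=3).fit(X_train, residuals)  # steps 1+3: weak learner on residuals
    pred += lr * tree.predict(X_train)                  # steps 4-5: shrink and add to the ensemble
    trees.append(tree)
# Prediction for new data: y_train.mean() + lr * sum(t.predict(X_new) for t in trees)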

# ── XGBoost sklearn API (easier) ─────────────────────
from xgboost import XGBClassifier, XGBRegressor

# Regression
model = XGBRegressor(
    n_estimators=500,         # number of boosting rounds
    learning_rate=0.05,       # how much each tree contributes (smaller = needs more trees)
    max_depth=5,              # depth of each tree (shallower = more regularisation)
    subsample=0.8,            # fraction of rows per tree (row sampling)
    colsample_bytree=0.8,     # fraction of features per tree (feature sampling)
    reg_alpha=0.1,            # L1 regularisation (Lasso-like)
    reg_lambda=1.0,           # L2 regularisation (Ridge-like)
    min_child_weight=5,       # minimum sum of instance weights in a leaf (controls overfitting)
    early_stopping_rounds=50, # stop when the validation metric hasn't improved for 50 rounds
    random_state=42,
    n_jobs=-1,
    # Missing values: XGBoost handles them natively — no imputation needed!
)

# ── Early stopping: prevent overfitting automatically ─
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train,
                                              test_size=0.15, random_state=42)
model.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    verbose=50  # print every 50 rounds
)
# Training stops when the validation metric hasn't improved for 50 rounds
# model.best_iteration: the optimal number of boosting rounds found

print(f"Best iteration: {model.best_iteration}")
print(f"Test RMSE: {root_mean_squared_error(y_test, model.predict(X_test)):,.0f}")

# ── Cross-validation with early stopping ─────────────
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
params = {
    "max_depth": 5,
    "learning_rate": 0.05,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "reg_lambda": 1.0,
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "seed": 42,
}
cv_results = xgb.cv(params, dtrain, num_boost_round=500,
                    nfold=5, early_stopping_rounds=30, verbose_eval=50)
print(f"Best CV RMSE: {cv_results['test-rmse-mean'].min():,.1f}")

Key XGBoost Parameters to Tune

  • n_estimators + learning_rate — always tune together. Lower lr needs more trees. Start: lr=0.1, trees=300. Then lr=0.01, trees=3000 (see the sketch after this list).
  • max_depth — 3-8. Deeper = more complex interactions. Default=6 is usually good.
  • subsample + colsample_bytree — 0.6-0.9. Stochastic sampling reduces overfitting.
  • min_child_weight — 1-20. Higher = more conservative splits. Tune for imbalanced data.
  • scale_pos_weight — for classification: sum(neg)/sum(pos). Critical for imbalanced classes.
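
A minimal sketch of the learning_rate / n_estimators trade-off, assuming the X_tr / X_val split from the early-stopping example above:

for lr, n in [(0.1, 300), (0.01, 3000)]:
    m = XGBRegressor(n_estimators=n, learning_rate=lr, max_depth=5,
                     early_stopping_rounds=50, random_state=42, n_jobs=-1)
    m.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)
    print(f"lr={lr}: stopped at {m.best_iteration} trees, "
          f"val RMSE = {root_mean_squared_error(y_val, m.predict(X_val)):,.0f}")
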
💡

LightGBM — Faster, Leaf-Wise Boosting

Production Choice
import lightgbm as lgb
from sklearn.model_selection import cross_val_score

# ── LightGBM vs XGBoost ───────────────────────────────
# LightGBM: leaf-wise tree growth (vs XGBoost level-wise)
# → faster training, better accuracy on large datasets
# → more prone to overfitting with small datasets (use num_leaves carefully)
# Native categorical feature support (no OHE needed!)
# Much faster on datasets > 100k rows

model_lgb = lgb.LGBMRegressor(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=-1,           # -1 = no limit (use num_leaves instead)
    num_leaves=31,          # key LightGBM parameter (keep below 2^max_depth when max_depth is set)
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=1.0,
    min_child_samples=20,   # LightGBM's analogue of XGBoost's min_child_weight (counts samples, not weight sum)
    n_jobs=-1,
    random_state=42,
    verbose=-1,
)

# Native categorical support
# Specify categorical columns — LightGBM handles them without OHE
cat_features = ["MSZoning", "Neighborhood", "SaleType"]
df[cat_features] = df[cat_features].astype("category")

model_lgb.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)]
)
print(f"Best iteration: {model_lgb.best_iteration_}")

# ── LightGBM cross-validation ─────────────────────────
lgb_train = lgb.Dataset(X_train, label=y_train)
params = {
    "learning_rate": 0.05,
    "num_leaves": 31,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "objective": "regression",
    "metric": "rmse",
    "verbose": -1,
}
cv_result = lgb.cv(params, lgb_train, num_boost_round=1000,
                   nfold=5, callbacks=[lgb.early_stopping(50)])
best_round = len(cv_result["valid rmse-mean"])
print(f"Best round: {best_round}, CV RMSE: {min(cv_result['valid rmse-mean']):,.1f}")

💡 Use LightGBM when your dataset has > 50,000 rows or > 100 features. It trains 5-20× faster than XGBoost on large datasets. Use XGBoost when you want the most well-documented, stable gradient boosting library with the largest community. Both are excellent — pick LightGBM for speed, XGBoost for documentation.

🎯

Optuna — Automated Hyperparameter Tuning

Bayesian Optimisation
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
import numpy as np

# ── Why Optuna beats GridSearchCV ─────────────────────
# GridSearch: exhaustively tries all combinations (exponential time)
# RandomSearch: random sampling (efficient but dumb)
# Optuna/Bayesian: uses past trials to guess promising regions
# → finds good params in far fewer trials than GridSearch

optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    """Called by Optuna for each trial. Returns the metric to optimise."""
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 9),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-4, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-4, 10.0, log=True),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
        "random_state": 42,
        "n_jobs": -1,
    }
    model = XGBRegressor(**params)
    # CV with 3 folds (faster for tuning — use 5 for final evaluation)
    scores = cross_val_score(model, X_train, y_train, cv=3,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()  # scores are negative RMSE; negate to get RMSE, which the study minimises

# ── Run Optuna study ──────────────────────────────────
study = optuna.create_study(direction="minimize",
                             sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50, show_progress_bar=True)

best_params = study.best_params
print(f"Best RMSE: {study.best_value:,.0f}")
print(f"Best params: {best_params}")

# ── Use best params for final model ───────────────────
best_model = XGBRegressor(**best_params, random_state=42, n_jobs=-1)  # best_params contains only the tuned values
best_model.fit(X_train, y_train)
print(f"Test RMSE: {root_mean_squared_error(y_test, best_model.predict(X_test)):,.0f}")

# ── Visualise Optuna results ──────────────────────────
from optuna.visualization import (plot_optimization_history,
                                   plot_param_importances,
                                   plot_slice)
# Shows how RMSE improved over trials
fig = plot_optimization_history(study)
fig.show()

# Shows which hyperparameters had the most impact
fig = plot_param_importances(study)
fig.show()

💡 For production models, use 100-200 Optuna trials. The first 20-30 trials explore randomly; subsequent trials exploit the most promising regions. Set a timeout if you need time-bounded tuning: study.optimize(objective, timeout=3600) for 1 hour of tuning.

SMOTE Variants for Imbalanced Data

Imbalanced
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.pipeline import Pipeline as ImbPipeline
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report

# ── SMOTE — Synthetic Minority Oversampling ───────────
# Creates synthetic minority class samples by interpolating between
# k nearest neighbours of existing minority samples
# Result: balanced classes (50/50 by default)
smote = SMOTE(random_state=42, k_neighbors=5, sampling_strategy=1.0)
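# Quick check of the effect (assumes numeric, binary-labelled X_train / y_train):
from collections import Counter
X_res, y_res = smote.fit_resample(X_train, y_train)
print(f"Before: {Counter(y_train)}  After: {Counter(y_res)}")   # roughly 50/50 after SMOTE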

# ── ADASYN — Adaptive Synthetic Sampling ─────────────
# Like SMOTE but creates MORE synthetic samples near the decision boundary
# (where the classifier struggles most)
adasyn = ADASYN(random_state=42, n_neighbors=5)

# ── BorderlineSMOTE ───────────────────────────────────
# Only oversamples minority points near the decision boundary
# More targeted than vanilla SMOTE
bl_smote = BorderlineSMOTE(random_state=42, kind="borderline-1")

# ── SMOTETomek: oversample + undersample ──────────────
# Apply SMOTE to create synthetic minority samples
# Then remove Tomek links (ambiguous majority samples near boundary)
# Best of both: less noisy than pure SMOTE
smote_tomek = SMOTETomek(random_state=42)

# ── XGBoost alternative: scale_pos_weight ─────────────
# For binary classification: no resampling needed
# scale_pos_weight = sum(negative) / sum(positive)
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()
scale = neg / pos
print(f"scale_pos_weight: {scale:.2f}")

xgb_imb = XGBClassifier(scale_pos_weight=scale, n_estimators=300,
                          random_state=42, n_jobs=-1)

# ── Compare strategies with CV ────────────────────────
strategies = {
    "XGB no correction":   ImbPipeline([("xgb", XGBClassifier(n_estimators=300, random_state=42))]),
    "XGB scale_pos":       ImbPipeline([("xgb", XGBClassifier(scale_pos_weight=scale, n_estimators=300, random_state=42))]),
    "SMOTE + XGB":         ImbPipeline([("smote", SMOTE(random_state=42)), ("xgb", XGBClassifier(n_estimators=300, random_state=42))]),
    "SMOTETomek + XGB":    ImbPipeline([("smote", SMOTETomek(random_state=42)), ("xgb", XGBClassifier(n_estimators=300, random_state=42))]),
}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, pipe in strategies.items():
    f1 = cross_val_score(pipe, X_train, y_train, cv=skf, scoring="f1").mean()
    print(f"{name:30s}: F1 = {f1:.3f}")
🔗

Stacking and Blending

Advanced Ensembles
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import roc_auc_score

# ── Voting Ensemble ───────────────────────────────────
# Simplest ensemble: combine predictions from multiple models
# hard voting: majority class vote
# soft voting: average of predicted probabilities (better)
voting = VotingClassifier(estimators=[
    ("rf",  RandomForestClassifier(n_estimators=100, random_state=42)),
    ("xgb", XGBClassifier(n_estimators=200, random_state=42)),
    ("lr",  LogisticRegression(max_iter=1000)),
], voting="soft", n_jobs=-1)

cv_voting = cross_val_score(voting, X_train, y_train, cv=5, scoring="roc_auc")
print(f"Voting CV AUC: {cv_voting.mean():.3f}")

# ── Stacking ──────────────────────────────────────────
# Level-0 estimators: base models, trained on K-fold subsets
# Level-1 estimator: meta-learner trained on base model predictions
# Stacking uses cross-validation to generate level-0 predictions
# to avoid the meta-learner overfitting to training data

stacking = StackingClassifier(
    estimators=[
        ("rf",  RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)),
        ("xgb", XGBClassifier(n_estimators=200, random_state=42)),
        ("gb",  GradientBoostingClassifier(n_estimators=100, random_state=42)),
    ],
    final_estimator=LogisticRegression(C=0.1, max_iter=1000),
    cv=5,           # K-folds for generating level-0 predictions
    stack_method="predict_proba",  # use probabilities as meta-features
    n_jobs=-1,
    passthrough=False,  # True: also pass original features to meta-learner
)
cv_stacking = cross_val_score(stacking, X_train, y_train, cv=5, scoring="roc_auc")
print(f"Stacking CV AUC: {cv_stacking.mean():.3f}")

# ── Manual blending (simpler, less rigorous than stacking) ──
# Train models on train split, blend predictions on val split
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2,
                                             stratify=y_train, random_state=42)
proba_rf  = RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]
proba_xgb = XGBClassifier(n_estimators=200).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]
blend = 0.5 * proba_rf + 0.5 * proba_xgb  # simple average
print(f"Blend AUC: {roc_auc_score(y_val, blend):.3f}")
📊

Advanced SHAP — Deep Interpretability

Interpretability
import shap
import matplotlib.pyplot as plt
import numpy as np

# ── SHAP for tree models (exact, fast) ────────────────
explainer = shap.TreeExplainer(xgb_model)   # xgb_model: a fitted XGBoost model (e.g. best_model from the Optuna section)
shap_values = explainer(X_test)

# shap_values.values: shape (n_samples, n_features)
# shap_values.base_values: baseline prediction (average model output)
# shap_values.data: feature values
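# Additivity check, a defining property of SHAP values:
# base_value + sum(SHAP values) ≈ the model's prediction for that sample
reconstructed = shap_values.base_values + shap_values.values.sum(axis=1)
print(np.allclose(xgb_model.predict(X_test), reconstructed, atol=1e-3))  # expect True for regression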

# ── Summary plot (global feature importance + direction) ─
shap.summary_plot(shap_values, X_test)
# Each row = one feature. Points = individual samples.
# Red = high feature value, Blue = low feature value
# x-axis: SHAP value (positive = pushes prediction higher)

# ── Bar plot (average |SHAP| per feature) ────────────
shap.summary_plot(shap_values, X_test, plot_type="bar")

# ── Dependence plot: how one feature interacts with another ──
# Shows: SHAP(GrLivArea) vs GrLivArea, coloured by OverallQual
shap.dependence_plot("GrLivArea", shap_values.values, X_test,
                      interaction_index="OverallQual")

# ── Waterfall plot for a single prediction ────────────
# Why did the model predict $250k for this specific house?
idx = 0
shap.plots.waterfall(shap_values[idx])

# ── Force plot: interactive individual prediction ─────
shap.force_plot(explainer.expected_value, shap_values.values[idx],
                X_test.iloc[idx], matplotlib=True)

# ── SHAP for non-tree models (KernelExplainer - slow) ─
# Use when you need SHAP for non-tree models
# KernelSHAP approximates SHAP values using a weighted linear model
explainer_lr = shap.KernelExplainer(lr_model.predict, shap.kmeans(X_train, 50))  # lr_model: any fitted non-tree model
shap_lr = explainer_lr.shap_values(X_test[:100])  # small batch (slow)

FREE LEARNING RESOURCES

  • Docs — XGBoost Documentation (xgboost.readthedocs.io): Complete XGBoost reference. Parameter explanations, tutorials, Python API. Authoritative.
  • Docs — LightGBM Documentation (lightgbm.readthedocs.io): Parameters, performance tips, categorical feature support, full Python API.
  • Docs — Optuna Documentation (optuna.readthedocs.io): Bayesian hyperparameter optimisation. Tutorials, samplers, pruners, visualisation.
  • Course — Kaggle Intermediate ML, XGBoost section (kaggle.com/learn/intermediate-machine-learning): Hands-on XGBoost with Kaggle exercises. Covers missing values, cross-validation integration.
  • Dataset — Credit Card Fraud (Kaggle): Severe class imbalance (0.17% fraud). Perfect for SMOTE, scale_pos_weight, and F1 vs AUC comparison.
  • Dataset — Porto Seguro Safe Driver (Kaggle): Industry-standard XGBoost/LightGBM benchmark. Kaggle competition with public discussion.
🛠 Credit Card Fraud Detection — Full Pipeline · [Advanced] · 6–7 days

Build a production-ready fraud detection system with all the techniques from this module.

Requirements

  • EDA — class imbalance analysis (only 0.17% fraud), feature distributions by class
  • Baseline — Logistic Regression. Report: accuracy, precision, recall, F1, ROC-AUC. Note that accuracy is misleading.
  • XGBoost with scale_pos_weight — tune with Optuna (50 trials). Report CV F1.
  • SMOTE + XGBoost — use ImbPipeline, compare to scale_pos_weight approach
  • Threshold tuning — find the optimal threshold on the validation set using the PR curve (a minimal sketch follows this requirements block)
  • SHAP analysis — which features drive fraud predictions? Do any surprise you?
  • Final results table — all models, all metrics, final chosen model with justification

Target: F1 > 0.85 on test set. Report both F1 and ROC-AUC (both matter for fraud).
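
For the threshold-tuning requirement, a minimal sketch; clf, X_val and y_val are placeholder names for your fitted model and validation split:

import numpy as np
from sklearn.metrics import precision_recall_curve

proba = clf.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])                     # the last PR point has no associated threshold
print(f"Best threshold: {thresholds[best]:.3f}, F1 at that threshold: {f1[best]:.3f}")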

LAB 1

XGBoost Early Stopping

1
Train XGBoost on House Prices with n_estimators=2000 and learning_rate=0.01. Use early stopping on a 15% validation split. What is the optimal number of rounds? Compare to n_estimators=200 without early stopping.
2
Plot train RMSE and validation RMSE vs boosting round. At what round does overfitting begin? Annotate it on the plot with plt.axvline(). (A plotting sketch follows this lab.)
3
Run XGBoost CV using xgb.cv(). Compare the best CV RMSE to the early stopping result. Which finds the better model?
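
Hint for step 2: a minimal sketch of capturing and plotting both learning curves, assuming a recent XGBoost version and the X_tr / X_val split from step 1.

import matplotlib.pyplot as plt
from xgboost import XGBRegressor

model = XGBRegressor(n_estimators=2000, learning_rate=0.01,
                     early_stopping_rounds=50, random_state=42)
model.fit(X_tr, y_tr, eval_set=[(X_tr, y_tr), (X_val, y_val)], verbose=False)
hist = model.evals_result()
plt.plot(hist["validation_0"]["rmse"], label="train RMSE")
plt.plot(hist["validation_1"]["rmse"], label="validation RMSE")
plt.axvline(model.best_iteration, ls="--", color="grey", label="best iteration")
plt.xlabel("boosting round"); plt.ylabel("RMSE"); plt.legend(); plt.show()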
LAB 2

Optuna Tuning

1
Run Optuna with 50 trials on XGBoost for House Prices. Plot the optimisation history. At what trial does the curve flatten? What does this tell you about when to stop searching?
2
Run plot_param_importances(study). Which hyperparameter has the most impact on RMSE? Is it what you expected?
3
Compare: default XGBoost, manually tuned XGBoost, and Optuna-tuned XGBoost on 5-fold CV RMSE. How much does Optuna improve over manual tuning?

P3-M09 MASTERY CHECKLIST

When complete: Move to P3-M10 — Unsupervised Learning: K-Means, PCA, t-SNE, and customer segmentation.
