What This Module Covers
Part 3

Gradient boosting models (XGBoost, LightGBM, CatBoost) dominate Kaggle structured-data competitions. This module covers these industry-standard tools, automated hyperparameter tuning with Optuna, advanced strategies for imbalanced data, and ensemble stacking.
- XGBoost — gradient boosting, regularisation, early stopping, sklearn API
- LightGBM — leaf-wise growth, categorical support, faster than XGBoost
- Optuna — automated hyperparameter search with Bayesian optimisation
- SMOTE variants — SMOTE, ADASYN, SMOTETomek, BorderlineSMOTE
- Stacking and blending — combining model predictions as meta-features
- Advanced SHAP — interaction values, dependence plots, force plots
💡 XGBoost is the starting model for nearly every structured-data ML problem. It handles missing values natively, is robust to outliers, requires minimal preprocessing (no scaling needed), and is fast. If XGBoost doesn't beat your baseline significantly, your problem may need feature engineering rather than a more complex model.
XGBoost — Gradient Boosting Deep Dive
Industry Standard

import xgboost as xgb
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import root_mean_squared_error, roc_auc_score  # root_mean_squared_error needs sklearn >= 1.4
import pandas as pd, numpy as np
# ── How gradient boosting works ───────────────────────
# 1. Fit a shallow tree (weak learner) to the data
# 2. Compute residuals: where did the model err?
# 3. Fit NEXT tree to the residuals (learns from mistakes)
# 4. Add this tree to the ensemble with a learning rate
# 5. Repeat N times (n_estimators)
# Final prediction = sum of all tree outputs
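The five steps above can be sketched from scratch with shallow sklearn trees. This is a toy illustration of the residual-fitting loop, not XGBoost's actual implementation; the synthetic data is made up:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(500, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, 500)  # synthetic target

learning_rate = 0.1
n_estimators = 100
trees = []

# Start from a constant prediction (the mean)
pred = np.full(len(y), y.mean())

for _ in range(n_estimators):
    residuals = y - pred                     # 2. where did the model err?
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                   # 3. fit next tree to residuals
    pred += learning_rate * tree.predict(X)  # 4. add with learning rate
    trees.append(tree)                       # 5. repeat N times

# Final prediction = baseline + learning_rate * sum of tree outputs
rmse = np.sqrt(np.mean((y - pred) ** 2))
print(f"Train RMSE after boosting: {rmse:.3f}")
```

Each tree corrects the current ensemble's mistakes; shrinking every contribution by the learning rate is what makes many small steps beat one large one.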
# ── XGBoost sklearn API (easier) ─────────────────────
from xgboost import XGBClassifier, XGBRegressor
# Regression
model = XGBRegressor(
n_estimators=500, # number of boosting rounds
learning_rate=0.05, # how much each tree contributes (smaller = needs more trees)
max_depth=5, # depth of each tree (shallower = more regularisation)
subsample=0.8, # fraction of rows per tree (row sampling)
colsample_bytree=0.8, # fraction of features per tree (feature sampling)
reg_alpha=0.1, # L1 regularisation (Lasso-like)
reg_lambda=1.0, # L2 regularisation (Ridge-like)
min_child_weight=5, # minimum sum of weights in a leaf (controls overfitting)
random_state=42,
n_jobs=-1,
early_stopping_rounds=50, # required for early stopping below (XGBoost >= 2.0 sets this here, not in fit)
# Missing values: XGBoost handles natively — no imputation needed!
)
# ── Early stopping: prevent overfitting automatically ─
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train,
test_size=0.15, random_state=42)
model.fit(
X_tr, y_tr,
eval_set=[(X_val, y_val)],
verbose=50 # print every 50 rounds
)
# Training stops when the validation metric hasn't improved for early_stopping_rounds rounds
# model.best_iteration: the optimal number of boosting rounds found
print(f"Best iteration: {model.best_iteration}")
print(f"Test RMSE: {root_mean_squared_error(y_test, model.predict(X_test)):,.0f}")
# ── Cross-validation with early stopping ─────────────
import xgboost as xgb
dtrain = xgb.DMatrix(X_train, label=y_train)
params = {
"max_depth": 5,
"learning_rate": 0.05,
"subsample": 0.8,
"colsample_bytree": 0.8,
"reg_lambda": 1.0,
"objective": "reg:squarederror",
"eval_metric": "rmse",
"seed": 42,
}
cv_results = xgb.cv(params, dtrain, num_boost_round=500,
nfold=5, early_stopping_rounds=30, verbose_eval=50)
print(f"Best CV RMSE: {cv_results['test-rmse-mean'].min():,.1f}")

Key XGBoost Parameters to Tune
- n_estimators + learning_rate — always tune together. Lower lr needs more trees. Start: lr=0.1, trees=300. Then lr=0.01, trees=3000.
- max_depth — 3-8. Deeper = more complex interactions. Default=6 is usually good.
- subsample + colsample_bytree — 0.6-0.9. Stochastic sampling reduces overfitting.
- min_child_weight — 1-20. Higher = more conservative splits. Tune for imbalanced data.
- scale_pos_weight — for classification: sum(neg)/sum(pos). Critical for imbalanced classes.
LightGBM — Faster, Leaf-Wise Boosting
Production Choice

import lightgbm as lgb
from sklearn.model_selection import cross_val_score
# ── LightGBM vs XGBoost ───────────────────────────────
# LightGBM: leaf-wise tree growth (vs XGBoost level-wise)
# → faster training, better accuracy on large datasets
# → more prone to overfitting with small datasets (use num_leaves carefully)
# Native categorical feature support (no OHE needed!)
# Much faster on datasets > 100k rows
model_lgb = lgb.LGBMRegressor(
n_estimators=1000,
learning_rate=0.05,
max_depth=-1, # -1 = no depth limit (control complexity via num_leaves instead)
num_leaves=31, # the key LightGBM parameter; if you also set max_depth, keep num_leaves < 2^max_depth
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.1,
reg_lambda=1.0,
min_child_samples=20, # plays the same role as min_child_weight in XGBoost (min data per leaf)
n_jobs=-1,
random_state=42,
verbose=-1,
)
# Native categorical support
# Specify categorical columns — LightGBM handles them without OHE
cat_features = ["MSZoning", "Neighborhood", "SaleType"]
df[cat_features] = df[cat_features].astype("category")
model_lgb.fit(
X_tr, y_tr,
eval_set=[(X_val, y_val)],
callbacks=[lgb.early_stopping(50), lgb.log_evaluation(100)]
)
print(f"Best iteration: {model_lgb.best_iteration_}")
# ── LightGBM cross-validation ─────────────────────────
lgb_train = lgb.Dataset(X_train, label=y_train)
params = {
"learning_rate": 0.05,
"num_leaves": 31,
"subsample": 0.8,
"colsample_bytree": 0.8,
"objective": "regression",
"metric": "rmse",
"verbose": -1,
}
cv_result = lgb.cv(params, lgb_train, num_boost_round=1000,
nfold=5, callbacks=[lgb.early_stopping(50)])
best_round = len(cv_result["valid rmse-mean"])  # key format for LightGBM >= 4.0
print(f"Best round: {best_round}, CV RMSE: {min(cv_result['valid rmse-mean']):,.1f}")

💡 Use LightGBM when your dataset has > 50,000 rows or > 100 features. It often trains 5-20× faster than XGBoost on large datasets. Use XGBoost when you want the most well-documented, stable gradient boosting library with the largest community. Both are excellent — pick LightGBM for speed, XGBoost for documentation.
Optuna — Automated Hyperparameter Tuning
Bayesian Optimisation

import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
import numpy as np
# ── Why Optuna beats GridSearchCV ─────────────────────
# GridSearch: exhaustively tries all combinations (exponential time)
# RandomSearch: random sampling (efficient but dumb)
# Optuna/Bayesian: uses past trials to guess promising regions
# → finds good params in far fewer trials than GridSearch
optuna.logging.set_verbosity(optuna.logging.WARNING)
def objective(trial):
    """Called by Optuna for each trial. Returns the metric to optimise."""
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 9),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.5, 1.0),
        "reg_alpha": trial.suggest_float("reg_alpha", 1e-4, 10.0, log=True),
        "reg_lambda": trial.suggest_float("reg_lambda", 1e-4, 10.0, log=True),
        "min_child_weight": trial.suggest_int("min_child_weight", 1, 20),
        "random_state": 42,
        "n_jobs": -1,
    }
    model = XGBRegressor(**params)
    # CV with 3 folds (faster for tuning — use 5 for final evaluation)
    scores = cross_val_score(model, X_train, y_train, cv=3,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()  # scores are negative RMSE; negate so the study (direction="minimize") sees RMSE
# ── Run Optuna study ──────────────────────────────────
study = optuna.create_study(direction="minimize",
sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50, show_progress_bar=True)
best_params = study.best_params
print(f"Best RMSE: {study.best_value:,.0f}")
print(f"Best params: {best_params}")
# ── Use best params for final model ───────────────────
best_model = XGBRegressor(**best_params)
best_model.fit(X_train, y_train)
print(f"Test RMSE: {root_mean_squared_error(y_test, best_model.predict(X_test)):,.0f}")
# ── Visualise Optuna results ──────────────────────────
from optuna.visualization import (plot_optimization_history,
plot_param_importances,
plot_slice)
# Shows how RMSE improved over trials
fig = plot_optimization_history(study)
fig.show()
# Shows which hyperparameters had the most impact
fig = plot_param_importances(study)
fig.show()

💡 For production models, use 100-200 Optuna trials. The first 20-30 trials explore randomly; subsequent trials exploit the most promising regions. Set a timeout if you need time-bounded tuning: study.optimize(objective, timeout=3600) for 1 hour of tuning.
SMOTE Variants for Imbalanced Data
Imbalanced

from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.pipeline import Pipeline as ImbPipeline
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import classification_report
# ── SMOTE — Synthetic Minority Oversampling ───────────
# Creates synthetic minority class samples by interpolating between
# k nearest neighbours of existing minority samples
# Result: balanced classes (50/50 by default)
smote = SMOTE(random_state=42, k_neighbors=5, sampling_strategy=1.0)
# ── ADASYN — Adaptive Synthetic Sampling ─────────────
# Like SMOTE but creates MORE synthetic samples near the decision boundary
# (where the classifier struggles most)
adasyn = ADASYN(random_state=42, n_neighbors=5)
# ── BorderlineSMOTE ───────────────────────────────────
# Only oversamples minority points near the decision boundary
# More targeted than vanilla SMOTE
bl_smote = BorderlineSMOTE(random_state=42, kind="borderline-1")
# ── SMOTETomek: oversample + undersample ──────────────
# Apply SMOTE to create synthetic minority samples
# Then remove Tomek links (ambiguous majority samples near boundary)
# Best of both: less noisy than pure SMOTE
smote_tomek = SMOTETomek(random_state=42)
# ── XGBoost alternative: scale_pos_weight ─────────────
# For binary classification: no resampling needed
# scale_pos_weight = sum(negative) / sum(positive)
neg = (y_train == 0).sum()
pos = (y_train == 1).sum()
scale = neg / pos
print(f"scale_pos_weight: {scale:.2f}")
xgb_imb = XGBClassifier(scale_pos_weight=scale, n_estimators=300,
random_state=42, n_jobs=-1)
# ── Compare strategies with CV ────────────────────────
strategies = {
"XGB no correction": ImbPipeline([("xgb", XGBClassifier(n_estimators=300, random_state=42))]),
"XGB scale_pos": ImbPipeline([("xgb", XGBClassifier(scale_pos_weight=scale, n_estimators=300, random_state=42))]),
"SMOTE + XGB": ImbPipeline([("smote", SMOTE(random_state=42)), ("xgb", XGBClassifier(n_estimators=300, random_state=42))]),
"SMOTETomek + XGB": ImbPipeline([("smote", SMOTETomek(random_state=42)), ("xgb", XGBClassifier(n_estimators=300, random_state=42))]),
}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, pipe in strategies.items():
    f1 = cross_val_score(pipe, X_train, y_train, cv=skf, scoring="f1").mean()
    print(f"{name:30s}: F1 = {f1:.3f}")

Stacking and Blending
Advanced Ensembles

from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
# ── Voting Ensemble ───────────────────────────────────
# Simplest ensemble: combine predictions from multiple models
# hard voting: majority class vote
# soft voting: average of predicted probabilities (better)
voting = VotingClassifier(estimators=[
("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
("xgb", XGBClassifier(n_estimators=200, random_state=42)),
("lr", LogisticRegression(max_iter=1000)),
], voting="soft", n_jobs=-1)
cv_voting = cross_val_score(voting, X_train, y_train, cv=5, scoring="roc_auc")
print(f"Voting CV AUC: {cv_voting.mean():.3f}")
# ── Stacking ──────────────────────────────────────────
# Level-0 estimators: base models, trained on K-fold subsets
# Level-1 estimator: meta-learner trained on base model predictions
# Stacking uses cross-validation to generate level-0 predictions
# to avoid the meta-learner overfitting to training data
stacking = StackingClassifier(
estimators=[
("rf", RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)),
("xgb", XGBClassifier(n_estimators=200, random_state=42)),
("gb", GradientBoostingClassifier(n_estimators=100, random_state=42)),
],
final_estimator=LogisticRegression(C=0.1, max_iter=1000),
cv=5, # K-folds for generating level-0 predictions
stack_method="predict_proba", # use probabilities as meta-features
n_jobs=-1,
passthrough=False, # True: also pass original features to meta-learner
)
cv_stacking = cross_val_score(stacking, X_train, y_train, cv=5, scoring="roc_auc")
print(f"Stacking CV AUC: {cv_stacking.mean():.3f}")
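The K-fold mechanism inside StackingClassifier can be made explicit with cross_val_predict. A sketch on synthetic data, using sklearn-only base models: each out-of-fold prediction comes from a model that never saw that row, so the meta-learner trains on non-leaked outputs:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = make_classification(n_samples=1000, n_informative=6, random_state=42)

base_models = {
    "rf": RandomForestClassifier(n_estimators=100, random_state=42),
    "gb": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

# Out-of-fold probabilities become the meta-features (one column per base model)
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models.values()
])

meta = LogisticRegression()
auc = cross_val_score(meta, meta_features, y, cv=5, scoring="roc_auc").mean()
print(f"Meta-learner CV AUC on OOF features: {auc:.3f}")
```

Scoring the meta-learner on the same out-of-fold features is a slight simplification; StackingClassifier additionally refits the base models on the full training set before predicting on new data.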
# ── Manual blending (simpler, less rigorous than stacking) ──
# Train models on train split, blend predictions on val split
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2,
                                            stratify=y_train, random_state=42)
proba_rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]
proba_xgb = XGBClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr).predict_proba(X_val)[:, 1]
blend = 0.5 * proba_rf + 0.5 * proba_xgb # simple average
print(f"Blend AUC: {roc_auc_score(y_val, blend):.3f}")

Advanced SHAP — Deep Interpretability
Interpretability

import shap
import matplotlib.pyplot as plt
import numpy as np
# ── SHAP for tree models (exact, fast) ────────────────
explainer = shap.TreeExplainer(xgb_model)
shap_values = explainer(X_test)
# shap_values.values: shape (n_samples, n_features)
# shap_values.base_values: baseline prediction (average model output)
# shap_values.data: feature values
# ── Summary plot (global feature importance + direction) ─
shap.summary_plot(shap_values, X_test)
# Each row = one feature. Points = individual samples.
# Red = high feature value, Blue = low feature value
# x-axis: SHAP value (positive = pushes prediction higher)
# ── Bar plot (average |SHAP| per feature) ────────────
shap.summary_plot(shap_values, X_test, plot_type="bar")
# ── Dependence plot: how one feature interacts with another ──
# Shows: SHAP(GrLivArea) vs GrLivArea, coloured by OverallQual
shap.dependence_plot("GrLivArea", shap_values.values, X_test,
interaction_index="OverallQual")
# ── Waterfall plot for a single prediction ────────────
# Why did the model predict $250k for this specific house?
idx = 0
shap.plots.waterfall(shap_values[idx])
# ── Force plot: interactive individual prediction ─────
shap.force_plot(explainer.expected_value, shap_values.values[idx],
X_test.iloc[idx], matplotlib=True)
# ── SHAP for non-tree models (KernelExplainer - slow) ─
# Use when you need SHAP for non-tree models
# KernelSHAP approximates SHAP values using a weighted linear model
explainer_lr = shap.KernelExplainer(lr_model.predict, shap.kmeans(X_train, 50))
shap_lr = explainer_lr.shap_values(X_test[:100]) # small batch (slow)

FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Docs | XGBoost Documentation — xgboost.readthedocs.io | Complete XGBoost reference. Parameter explanations, tutorials, Python API. Authoritative. |
| Docs | LightGBM Documentation — lightgbm.readthedocs.io | Parameters, performance tips, categorical feature support, full Python API. |
| Docs | Optuna Documentation — optuna.readthedocs.io | Bayesian hyperparameter optimisation. Tutorials, samplers, pruners, visualisation. |
| Course | Kaggle Intermediate ML — XGBoost section — kaggle.com/learn/intermediate-machine-learning | Hands-on XGBoost with Kaggle exercises. Covers missing values, cross-validation integration. |
| Dataset | Credit Card Fraud — Kaggle | Severe class imbalance (0.17% fraud). Perfect for SMOTE, scale_pos_weight, and F1 vs AUC comparison. |
| Dataset | Porto Seguro Safe Driver — Kaggle | Industry-standard XGBoost/LightGBM benchmark. Kaggle competition with public discussion. |
Build a production-ready fraud detection system with all the techniques from this module.
Requirements
- EDA — class imbalance analysis (only 0.17% fraud), feature distributions by class
- Baseline — Logistic Regression. Report: accuracy, precision, recall, F1, ROC-AUC. Note that accuracy is misleading.
- XGBoost with scale_pos_weight — tune with Optuna (50 trials). Report CV F1.
- SMOTE + XGBoost — use ImbPipeline, compare to scale_pos_weight approach
- Threshold tuning — find optimal threshold on validation set using PR curve
- SHAP analysis — which features drive fraud predictions? Do any surprise you?
- Final results table — all models, all metrics, final chosen model with justification
Target: F1 > 0.85 on test set. Report both F1 and ROC-AUC (both matter for fraud).
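The threshold-tuning requirement can be sketched with precision_recall_curve. Synthetic data stands in for the fraud dataset here; in the project, pick the threshold on a validation split and report final metrics on the untouched test set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.97, 0.03],
                           n_informative=6, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3,
                                            stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_val)[:, 1]

# Scan the PR curve for the threshold that maximises F1 on the validation set
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])  # the last curve point has no threshold
print(f"Best threshold: {thresholds[best]:.3f} (F1={f1[best]:.3f})")

# Compare against the default 0.5 cutoff
f1_default = f1_score(y_val, proba >= 0.5)
f1_tuned = f1_score(y_val, proba >= thresholds[best])
print(f"F1 at 0.5: {f1_default:.3f} vs tuned: {f1_tuned:.3f}")
```

On imbalanced data the F1-optimal threshold usually sits well below 0.5, because the classifier rarely assigns high probabilities to the scarce positive class.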
P3-M09 MASTERY CHECKLIST
- Can explain gradient boosting: sequential trees, residual fitting, learning rate
- Can train XGBoost with early stopping and find the optimal number of boosting rounds
- Know the key XGBoost hyperparameters: n_estimators, learning_rate, max_depth, subsample, colsample_bytree
- Know when to use LightGBM vs XGBoost (large datasets → LightGBM)
- Can set up an Optuna study with suggest_int, suggest_float, and suggest_float(log=True)
- Can interpret Optuna plot_optimization_history and plot_param_importances
- Know at least 3 SMOTE variants and when to use each
- Can use scale_pos_weight for imbalanced XGBoost classification
- Can compare SMOTE vs class weighting vs no correction with StratifiedKFold CV
- Can build a StackingClassifier with sklearn and understand why K-fold is needed
- Can generate SHAP summary plot, dependence plot, and waterfall plot
- Completed project: Credit Card Fraud Detection with Optuna tuning and SHAP analysis
✅ When complete: Move to P3-M10 — Unsupervised Learning: K-Means, PCA, t-SNE, and customer segmentation.