What This Module Covers
Classification predicts which discrete category an input belongs to: spam or not spam, diseased or healthy, which product category. This module covers the most important classical classification algorithms and — critically — how to evaluate them correctly.
- Decision Trees — splitting logic, Gini vs entropy, depth control, overfitting
- Random Forest — bagging, feature subsampling, out-of-bag score, tuning
- SVM (Support Vector Machine) — maximum margin classifier, kernels, C parameter
- Classification metrics — accuracy, precision, recall, F1, ROC-AUC, confusion matrix
- Imbalanced data — class weights, SMOTE, threshold tuning, F1 vs accuracy
- Feature importance — tree-based, permutation importance, SHAP values
💡 Accuracy is almost never the right metric. If 99% of transactions are legitimate and 1% are fraud, a model that predicts "legitimate" for everything gets 99% accuracy — and catches zero fraud. Use precision, recall, F1, and ROC-AUC for imbalanced problems.
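The fraud example above can be reproduced in a few lines (a minimal sketch with synthetic labels — 990 legitimate, 10 fraud — mirroring the 99%/1% split):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 transactions: 990 legitimate (0), 10 fraud (1)
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)   # a "model" that always predicts legitimate

print(accuracy_score(y_true, y_pred))   # 0.99 — looks great
print(recall_score(y_true, y_pred))     # 0.0  — catches zero fraud
```

Accuracy rewards the majority-class guesser; recall exposes it immediately.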
Decision Trees — Interpretable Splitting
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import numpy as np
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
stratify=y, random_state=42)
# ── Basic decision tree ───────────────────────────────
# criterion="gini": Gini impurity (default, slightly faster)
# criterion="entropy": information gain (often similar results)
# max_depth: most important hyperparameter. None = overfit!
# min_samples_split: minimum samples needed to split a node
# min_samples_leaf: minimum samples required in a leaf node
dt = DecisionTreeClassifier(
max_depth=4, # limit depth to prevent overfitting
min_samples_leaf=5, # at least 5 samples per leaf
criterion="gini",
random_state=42
)
dt.fit(X_train, y_train)
print(f"Train accuracy: {dt.score(X_train, y_train):.4f}")
print(f"Test accuracy: {dt.score(X_test, y_test):.4f}")
# ── Visualise the tree ────────────────────────────────
fig, ax = plt.subplots(figsize=(20, 8))
plot_tree(dt, feature_names=X.columns, class_names=["malignant", "benign"],
filled=True, rounded=True, ax=ax, fontsize=8)
plt.tight_layout()
plt.savefig("decision_tree.png", dpi=100)
# Text representation (shareable without plot)
print(export_text(dt, feature_names=list(X.columns)))
# ── Overfitting demonstration ─────────────────────────
train_scores, test_scores = [], []
depths = range(1, 20)
for d in depths:
    dt_d = DecisionTreeClassifier(max_depth=d, random_state=42)
    dt_d.fit(X_train, y_train)
    train_scores.append(dt_d.score(X_train, y_train))
    test_scores.append(dt_d.score(X_test, y_test))
plt.figure(figsize=(8, 4))
plt.plot(depths, train_scores, label="Train", marker="o", markersize=4)
plt.plot(depths, test_scores, label="Test", marker="s", markersize=4)
plt.xlabel("Max Depth")
plt.ylabel("Accuracy")
plt.title("Decision Tree: Depth vs Accuracy (Bias-Variance Tradeoff)")
plt.axvline(depths[np.argmax(test_scores)], color="red", linestyle="--",
            label=f"Best depth={depths[np.argmax(test_scores)]}")
plt.legend()  # called after axvline so the "Best depth" label is included
💡 Decision trees with no depth limit overfit perfectly: they memorise every training example. Tree depth is the primary bias-variance dial: shallow = high bias (underfitting), deep = high variance (overfitting). max_depth=4 is a good starting point. Always compare train vs test accuracy — a gap of more than 5 percentage points suggests overfitting.
Random Forest — Ensemble of Trees
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
import pandas as pd
# ── How Random Forest works ───────────────────────────
# 1. Bootstrap: sample N rows WITH replacement (bagging)
# 2. Feature subsampling: at each split, only consider sqrt(n_features) features
# 3. Train one deep tree per bootstrap sample
# 4. Predict: majority vote of all trees (reduces variance)
# Result: lower variance than single tree, still low bias
rf = RandomForestClassifier(
n_estimators=100, # number of trees (more = better until diminishing returns)
max_depth=None, # trees are grown deep (bagging reduces variance)
max_features="sqrt", # sqrt(n_features) features per split (default for clf)
min_samples_leaf=1, # default for RF (deep trees are fine)
n_jobs=-1, # use all CPU cores
random_state=42,
oob_score=True # out-of-bag evaluation (free validation!)
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.4f}") # no held-out set needed!
print(f"Train Acc: {rf.score(X_train, y_train):.4f}")
print(f"Test Acc: {rf.score(X_test, y_test):.4f}")
# Cross-validation
cv_scores = cross_val_score(rf, X, y, cv=5, scoring="f1")
print(f"CV F1: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# ── Hyperparameter tuning ─────────────────────────────
# Key hyperparameters to tune:
# n_estimators: 100-500 (more is almost always better, just slower)
# max_depth: None or 10-30 (deep is OK for RF due to averaging)
# max_features: "sqrt", "log2", or float (0.3 = 30% of features)
# min_samples_leaf: 1-10 (increasing reduces overfitting)
param_grid = {
"n_estimators": [100, 200],
"max_features": ["sqrt", 0.3],
"min_samples_leaf": [1, 3, 5],
}
gs = GridSearchCV(RandomForestClassifier(random_state=42, n_jobs=-1),
param_grid, cv=3, scoring="f1", n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)
print(f"Best params: {gs.best_params_}")
print(f"Best CV F1: {gs.best_score_:.4f}")
💡 oob_score=True gives you a free validation score. Each tree in the forest is trained on ~63% of rows (its bootstrap sample); the remaining ~37% (out-of-bag samples) are used to evaluate that tree's predictions — without any held-out set. The OOB score is a reliable estimate of generalisation performance and is cheap to compute.
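The ~63% figure comes from the probability that a given row appears at least once in n draws with replacement: 1 − (1 − 1/n)ⁿ → 1 − 1/e ≈ 0.632. A quick numerical check with plain NumPy (synthetic indices, no model involved):

```python
import numpy as np

# Simulate one bootstrap sample: n row indices drawn with replacement
rng = np.random.default_rng(42)
n = 100_000
sample = rng.integers(0, n, size=n)
frac_in_bag = np.unique(sample).size / n   # fraction of distinct rows in-bag
print(f"in-bag fraction: {frac_in_bag:.3f}")   # ≈ 1 - 1/e ≈ 0.632
```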
Support Vector Machine (SVM)
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
from scipy.stats import loguniform
# ── SVM concept ───────────────────────────────────────
# Finds the hyperplane that maximises the margin between classes
# Support vectors: the training points closest to the decision boundary
# C parameter: tradeoff between margin width and misclassification
# Small C: wide margin, allows more misclassification (regularised)
# Large C: narrow margin, fewer misclassifications (overfit risk)
# Kernel: maps data to higher-dimensional space for non-linear boundaries
# ── CRITICAL: SVM requires feature scaling ────────────
svm_pipe = Pipeline([
("scaler", StandardScaler()), # SVM is NOT scale-invariant
("svm", SVC(kernel="rbf", C=1.0, gamma="scale",
probability=True, random_state=42)),
])
svm_pipe.fit(X_train, y_train)
print(f"Test Accuracy: {svm_pipe.score(X_test, y_test):.4f}")
# Get probabilities for ROC-AUC
proba = svm_pipe.predict_proba(X_test)[:, 1]
# ── Kernel choice ─────────────────────────────────────
# "linear": good for high-dimensional sparse data (text)
# "rbf": good default for most tabular data (radial basis function)
# "poly": polynomial kernel (degree parameter)
# ── Hyperparameter search ─────────────────────────────
# C and gamma interact — tune together with log-uniform distribution
param_dist = {
"svm__C": loguniform(0.01, 1000), # 0.01 to 1000
"svm__gamma": loguniform(1e-4, 1.0), # 1e-4 to 1.0
}
rs = RandomizedSearchCV(svm_pipe, param_dist, n_iter=20, cv=3,
scoring="f1", random_state=42)
rs.fit(X_train, y_train)
print(f"Best C: {rs.best_params_['svm__C']:.4f}")
print(f"Best gamma: {rs.best_params_['svm__gamma']:.6f}")
print(f"Best CV F1: {rs.best_score_:.4f}")
💡 SVM with an RBF kernel is often competitive with Random Forest on small-to-medium datasets. It is a strong default when the dataset has fewer than ~10,000 rows, the features are dense (tabular), and you need probabilistic outputs. Its weaknesses: it does not scale well to large datasets (roughly O(n²) memory) and is slow to tune. Beyond ~50,000 rows, prefer tree-based models.
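When the dataset outgrows kernel SVMs, the maximum-margin idea survives in linear form: SGDClassifier with hinge loss optimises a linear SVM objective in roughly linear time per epoch. A sketch on synthetic data (the dataset shape and parameters here are illustrative, not from the module):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A "large" synthetic dataset where an O(n^2) kernel SVM becomes painful
X, y = make_classification(n_samples=50_000, n_features=20,
                           n_informative=10, n_clusters_per_class=1,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

# loss="hinge" -> linear SVM; scaling still matters, exactly as for SVC
linear_svm = make_pipeline(StandardScaler(),
                           SGDClassifier(loss="hinge", alpha=1e-4,
                                         random_state=42))
linear_svm.fit(X_tr, y_tr)
acc = linear_svm.score(X_te, y_te)
print(f"Linear-SVM test accuracy: {acc:.3f}")
```

No kernel trick here, so non-linear boundaries are lost — the usual trade for scalability.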
Classification Metrics — Beyond Accuracy
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix,
classification_report, RocCurveDisplay,
PrecisionRecallDisplay)
import matplotlib.pyplot as plt
import seaborn as sns
# "model" stands for any fitted classifier with predict_proba —
# e.g. the random forest trained in the previous section
model = rf
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
# ── The four classification outcomes ─────────────────
# True Positive (TP): predicted positive, actually positive
# True Negative (TN): predicted negative, actually negative
# False Positive (FP): predicted positive, actually negative (Type I error)
# False Negative (FN): predicted negative, actually positive (Type II error)
# ── Core metrics ──────────────────────────────────────
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred) # TP / (TP + FP)
recall = recall_score(y_test, y_pred) # TP / (TP + FN) = sensitivity
f1 = f1_score(y_test, y_pred) # harmonic mean of precision & recall
roc_auc = roc_auc_score(y_test, y_proba) # area under ROC curve
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f} (when I say positive, how often am I right?)")
print(f"Recall: {recall:.4f} (how many actual positives did I catch?)")
print(f"F1 Score: {f1:.4f} (harmonic mean of precision and recall)")
print(f"ROC-AUC: {roc_auc:.4f} (probability correct positive ranked above negative)")
# ── Confusion matrix ──────────────────────────────────
cm = confusion_matrix(y_test, y_pred)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
xticklabels=["Pred Neg", "Pred Pos"],
yticklabels=["Actual Neg", "Actual Pos"], ax=axes[0])
axes[0].set_title("Confusion Matrix")
# ROC curve
RocCurveDisplay.from_estimator(model, X_test, y_test, ax=axes[1])
axes[1].set_title(f"ROC Curve (AUC={roc_auc:.3f})")
plt.tight_layout()
# Full classification report
print(classification_report(y_test, y_pred, target_names=["Neg", "Pos"]))
# ── Threshold tuning ──────────────────────────────────
# Default threshold = 0.5. Adjust based on business need.
# Lower threshold (e.g. 0.3): catch more positives (higher recall, lower precision)
# Higher threshold (e.g. 0.7): more confident positives (higher precision, lower recall)
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
for t in thresholds:
    y_t = (y_proba >= t).astype(int)
    print(f"t={t:.1f}: precision={precision_score(y_test, y_t):.3f}, "
          f"recall={recall_score(y_test, y_t):.3f}, "
          f"f1={f1_score(y_test, y_t):.3f}")
⚠️ When to use each metric: Accuracy — only for balanced classes. Precision — when false positives are costly (spam filter: don't block real emails). Recall — when false negatives are costly (cancer screening: don't miss actual cancer). F1 — when both FP and FN matter. ROC-AUC — when you need a threshold-independent comparison across models.
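The formulas above are easy to verify by hand against scikit-learn on a toy label vector (the labels below are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# 4 actual positives, 6 actual negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# ravel() of a binary confusion matrix yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)   # 3 / 5 = 0.6
recall = tp / (tp + fn)      # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)

assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
print(f"precision={precision}, recall={recall}, f1={f1:.3f}")
```

Being able to reproduce each metric from the four confusion-matrix counts is the fastest way to internalise what they measure.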
Imbalanced Data — When Classes Are Skewed
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, classification_report
import numpy as np
import pandas as pd  # value_counts below needs pandas
# ── Detect class imbalance ────────────────────────────
print(pd.Series(y_train).value_counts())
print(pd.Series(y_train).value_counts(normalize=True))
# If 95% negative, 5% positive → severe imbalance
# ── Solution 1: class_weight="balanced" (easiest) ────
# Adjusts sample weights so each class contributes equally to loss
# No data modification, no SMOTE complexity
rf_balanced = RandomForestClassifier(
class_weight="balanced", # automatically weights minority class higher
n_estimators=100,
random_state=42, n_jobs=-1
)
rf_balanced.fit(X_train, y_train)
print(f"F1 (balanced weights): {f1_score(y_test, rf_balanced.predict(X_test)):.4f}")
# Compute weights manually
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
weight_dict = dict(zip(classes, weights))
print(f"Class weights: {weight_dict}")
# ── Solution 2: SMOTE — Synthetic Minority Oversampling ──
# Generates synthetic minority-class samples by interpolating between
# existing minority samples in feature space
# USE: when minority class has <10% representation
smote = SMOTE(random_state=42, k_neighbors=5)
X_res, y_res = smote.fit_resample(X_train, y_train)
print(f"Before SMOTE: {pd.Series(y_train).value_counts().to_dict()}")
print(f"After SMOTE: {pd.Series(y_res).value_counts().to_dict()}")
# SMOTE MUST live inside an imblearn pipeline so that, during cross-validation,
# it resamples only the training folds — never the validation or test data
imb_pipe = ImbPipeline([
("smote", SMOTE(random_state=42)),
("rf", RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
])
from sklearn.model_selection import cross_val_score
cv_f1 = cross_val_score(imb_pipe, X_train, y_train, cv=5, scoring="f1")
print(f"SMOTE Pipeline CV F1: {cv_f1.mean():.3f}")
# ── Solution 3: Threshold tuning ─────────────────────
# Default 0.5 threshold biased toward majority class
# For imbalanced data, optimal threshold is often lower
proba = rf_balanced.predict_proba(X_test)[:, 1]
from sklearn.metrics import precision_recall_curve
prec, rec, thresholds = precision_recall_curve(y_test, proba)
# precision_recall_curve returns prec/rec with one more element than
# thresholds (the final point has no threshold), so drop the last entry
f1_scores = 2 * (prec[:-1] * rec[:-1]) / (prec[:-1] + rec[:-1] + 1e-9)
best_threshold = thresholds[np.argmax(f1_scores)]
print(f"Optimal threshold: {best_threshold:.3f}")
Feature Importance and SHAP Values
import shap
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.inspection import permutation_importance
# ── Method 1: Tree-based feature importance ───────────
# Built into sklearn tree models
# Based on total reduction in Gini impurity due to each feature
# WARNING: biased toward high-cardinality features
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).head(15).plot(kind="barh")
plt.title("Random Forest Feature Importance")
# ── Method 2: Permutation Importance ─────────────────
# Shuffle each feature independently, measure performance drop
# More reliable than built-in importance, not biased by cardinality
# SLOW on large datasets
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
random_state=42, scoring="f1")
perm_df = pd.DataFrame({"importance": result.importances_mean,
"std": result.importances_std},
index=X.columns).sort_values("importance", ascending=False)
print(perm_df.head(10))
# ── Method 3: SHAP Values ─────────────────────────────
# SHapley Additive exPlanations — game-theory-based feature contributions
# Explains EACH individual prediction (not just global importance)
# For tree models: exact (fast); for others: approximation
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
# Recent shap versions return a 3D array (n_samples, n_features, n_classes)
# for binary classifiers; older versions return a list of per-class arrays.
# The indexing below assumes the 3D layout and selects the positive class.
# Summary plot: global feature importance + direction
shap.summary_plot(shap_values[:, :, 1], X_test, plot_type="bar")  # bar
shap.summary_plot(shap_values[:, :, 1], X_test)  # beeswarm (richer)
# Individual prediction explanation
idx = 0  # first test sample
shap.waterfall_plot(shap.Explanation(values=shap_values[idx, :, 1],
                                     base_values=explainer.expected_value[1],
                                     data=X_test.iloc[idx]))
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Video | StatQuest — Decision Trees, Random Forest, ROC-AUC (YouTube) | Best visual explanation of how trees split, bagging works, and what ROC-AUC measures. |
| Course | Kaggle ML Explainability Course — kaggle.com/learn/machine-learning-explainability | SHAP values, permutation importance, partial dependence plots. Free, interactive. |
| Docs | Scikit-learn Supervised Learning — scikit-learn.org/stable/supervised_learning.html | Complete parameters for all classifiers. Authoritative reference. |
| Dataset | Heart Disease Dataset — UCI ML Repository | Binary classification, medical features, real clinical relevance. Great for recall/precision analysis. |
| Dataset | Titanic — Kaggle | Classic binary classification benchmark with a well-documented competitive leaderboard. |
Build a medical classification pipeline with full interpretability.
Requirements
- EDA — class balance, feature distributions by class, correlation matrix
- Baseline — Logistic Regression with StandardScaler. Report accuracy, precision, recall, F1, ROC-AUC.
- Decision Tree — tune max_depth with cross-validation. Visualise tree with plot_tree.
- Random Forest — tune n_estimators, max_features. Use oob_score. Report feature importances.
- Class imbalance — apply class_weight="balanced". Compare F1 with and without.
- SHAP analysis — summary plot, waterfall plot for a correctly and incorrectly classified patient
- Results table — compare all models on accuracy, F1, ROC-AUC
Goal: Achieve ROC-AUC > 0.90. What clinical insight do the SHAP values provide?
P3-M08 MASTERY CHECKLIST
- Can explain how a decision tree splits: Gini impurity, information gain
- Can visualise a trained decision tree and trace a single prediction path
- Know max_depth is the primary overfitting control for trees
- Can explain Random Forest: bagging + feature subsampling + majority vote
- Can use oob_score=True as a free validation metric
- Know SVM requires feature scaling and that C controls margin width
- Can compute TP, FP, FN, TN from a confusion matrix
- Can compute accuracy, precision, recall, F1, ROC-AUC and know when to use each
- Know that accuracy is misleading for imbalanced datasets
- Can use class_weight="balanced" and SMOTE for imbalanced data
- Can tune the classification threshold using the precision-recall curve
- Can extract and visualise tree feature importances and permutation importances
- Can generate and interpret SHAP summary and waterfall plots
- Completed project: Heart Disease classifier with SHAP analysis and model comparison table
✅ When complete: Move to P3-M09 — Ensembles: XGBoost, LightGBM, SMOTE, Optuna hyperparameter tuning.