What This Module Covers
Classification predicts which discrete category an input belongs to: spam or not spam, diseased or healthy, which product category. This module covers the most important classical classification algorithms and — critically — how to evaluate them correctly.
- Decision Trees — splitting logic, Gini vs entropy, depth control, overfitting
- Random Forest — bagging, feature subsampling, out-of-bag score, tuning
- SVM (Support Vector Machine) — maximum margin classifier, kernels, C parameter
- Classification metrics — accuracy, precision, recall, F1, ROC-AUC, confusion matrix
- Imbalanced data — class weights, SMOTE, threshold tuning, F1 vs accuracy
- Feature importance — tree-based, permutation importance, SHAP values
💡 Accuracy is almost never the right metric. If 99% of transactions are legitimate and 1% are fraud, a model that predicts "legitimate" for everything gets 99% accuracy — and catches zero fraud. Use precision, recall, F1, and ROC-AUC for imbalanced problems.
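The fraud example above can be reproduced in a few lines (a minimal sketch with synthetic labels — 990 legitimate, 10 fraud — mirroring the 99%/1% split):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# 1,000 transactions: 990 legitimate (0), 10 fraud (1)
y_true = np.array([0] * 990 + [1] * 10)
y_pred = np.zeros(1000, dtype=int)   # a "model" that always predicts legitimate

print(accuracy_score(y_true, y_pred))   # 0.99 — looks great
print(recall_score(y_true, y_pred))     # 0.0  — catches zero fraud
```

Accuracy rewards the majority-class guesser; recall exposes it immediately.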
Decision Trees — Interpretable Splitting
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
import numpy as np
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
stratify=y, random_state=42)
# ── Basic decision tree ───────────────────────────────
# criterion="gini": Gini impurity (default, slightly faster)
# criterion="entropy": information gain (often similar results)
# max_depth: most important hyperparameter. None = overfit!
# min_samples_split: minimum samples needed to split a node
# min_samples_leaf: minimum samples required in a leaf node
dt = DecisionTreeClassifier(
max_depth=4, # limit depth to prevent overfitting
min_samples_leaf=5, # at least 5 samples per leaf
criterion="gini",
random_state=42
)
dt.fit(X_train, y_train)
print(f"Train accuracy: {dt.score(X_train, y_train):.4f}")
print(f"Test accuracy: {dt.score(X_test, y_test):.4f}")
# ── Visualise the tree ────────────────────────────────
fig, ax = plt.subplots(figsize=(20, 8))
plot_tree(dt, feature_names=X.columns, class_names=["malignant", "benign"],
filled=True, rounded=True, ax=ax, fontsize=8)
plt.tight_layout()
plt.savefig("decision_tree.png", dpi=100)
# Text representation (shareable without plot)
print(export_text(dt, feature_names=list(X.columns)))
# ── Overfitting demonstration ─────────────────────────
train_scores, test_scores = [], []
depths = range(1, 20)
for d in depths:
    dt_d = DecisionTreeClassifier(max_depth=d, random_state=42)
    dt_d.fit(X_train, y_train)
    train_scores.append(dt_d.score(X_train, y_train))
    test_scores.append(dt_d.score(X_test, y_test))
plt.figure(figsize=(8, 4))
plt.plot(depths, train_scores, label="Train", marker="o", markersize=4)
plt.plot(depths, test_scores, label="Test", marker="s", markersize=4)
plt.xlabel("Max Depth")
plt.ylabel("Accuracy")
plt.title("Decision Tree: Depth vs Accuracy (Bias-Variance Tradeoff)")
plt.axvline(depths[np.argmax(test_scores)], color="red", linestyle="--",
            label=f"Best depth={depths[np.argmax(test_scores)]}")
plt.legend()  # called after axvline so the "Best depth" label is included
💡 Decision trees with no depth limit overfit perfectly: they memorise every training example. Tree depth is the primary bias-variance dial: shallow = high bias (underfitting), deep = high variance (overfitting). max_depth=4 is a good starting point. Always compare train vs test accuracy — a gap of more than 5 percentage points suggests overfitting.
Random Forest — Ensemble of Trees
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, GridSearchCV
import pandas as pd
# ── How Random Forest works ───────────────────────────
# 1. Bootstrap: sample N rows WITH replacement (bagging)
# 2. Feature subsampling: at each split, only consider sqrt(n_features) features
# 3. Train one deep tree per bootstrap sample
# 4. Predict: majority vote of all trees (reduces variance)
# Result: lower variance than single tree, still low bias
rf = RandomForestClassifier(
n_estimators=100, # number of trees (more = better until diminishing returns)
max_depth=None, # trees are grown deep (bagging reduces variance)
max_features="sqrt", # sqrt(n_features) features per split (default for clf)
min_samples_leaf=1, # default for RF (deep trees are fine)
n_jobs=-1, # use all CPU cores
random_state=42,
oob_score=True # out-of-bag evaluation (free validation!)
)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_:.4f}") # no held-out set needed!
print(f"Train Acc: {rf.score(X_train, y_train):.4f}")
print(f"Test Acc: {rf.score(X_test, y_test):.4f}")
# Cross-validation
cv_scores = cross_val_score(rf, X, y, cv=5, scoring="f1")
print(f"CV F1: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# ── Hyperparameter tuning ─────────────────────────────
# Key hyperparameters to tune:
# n_estimators: 100-500 (more is almost always better, just slower)
# max_depth: None or 10-30 (deep is OK for RF due to averaging)
# max_features: "sqrt", "log2", or float (0.3 = 30% of features)
# min_samples_leaf: 1-10 (increasing reduces overfitting)
param_grid = {
"n_estimators": [100, 200],
"max_features": ["sqrt", 0.3],
"min_samples_leaf": [1, 3, 5],
}
gs = GridSearchCV(RandomForestClassifier(random_state=42, n_jobs=-1),
param_grid, cv=3, scoring="f1", n_jobs=-1, verbose=1)
gs.fit(X_train, y_train)
print(f"Best params: {gs.best_params_}")
print(f"Best CV F1: {gs.best_score_:.4f}")
💡 oob_score=True gives you a free validation score. Each tree in the forest is trained on ~63% of rows (its bootstrap sample); the remaining ~37% (out-of-bag samples) are used to evaluate that tree's predictions — without any held-out set. The OOB score is a reliable estimate of generalisation performance and is cheap to compute.
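The ~63% figure comes from the probability that a given row appears at least once in n draws with replacement: 1 − (1 − 1/n)ⁿ → 1 − 1/e ≈ 0.632. A quick numerical check with plain NumPy (synthetic indices, no model involved):

```python
import numpy as np

# Simulate one bootstrap sample: n row indices drawn with replacement
rng = np.random.default_rng(42)
n = 100_000
sample = rng.integers(0, n, size=n)
frac_in_bag = np.unique(sample).size / n   # fraction of distinct rows in-bag
print(f"in-bag fraction: {frac_in_bag:.3f}")   # ≈ 1 - 1/e ≈ 0.632
```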
Support Vector Machine (SVM)
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, RandomizedSearchCV
from scipy.stats import loguniform
# ── SVM concept ───────────────────────────────────────
# Finds the hyperplane that maximises the margin between classes
# Support vectors: the training points closest to the decision boundary
# C parameter: tradeoff between margin width and misclassification
# Small C: wide margin, allows more misclassification (regularised)
# Large C: narrow margin, fewer misclassifications (overfit risk)
# Kernel: maps data to higher-dimensional space for non-linear boundaries
# ── CRITICAL: SVM requires feature scaling ────────────
svm_pipe = Pipeline([
("scaler", StandardScaler()), # SVM is NOT scale-invariant
("svm", SVC(kernel="rbf", C=1.0, gamma="scale",
probability=True, random_state=42)),
])
svm_pipe.fit(X_train, y_train)
print(f"Test Accuracy: {svm_pipe.score(X_test, y_test):.4f}")
# Get probabilities for ROC-AUC
proba = svm_pipe.predict_proba(X_test)[:, 1]
# ── Kernel choice ─────────────────────────────────────
# "linear": good for high-dimensional sparse data (text)
# "rbf": good default for most tabular data (radial basis function)
# "poly": polynomial kernel (degree parameter)
# ── Hyperparameter search ─────────────────────────────
# C and gamma interact — tune together with log-uniform distribution
param_dist = {
"svm__C": loguniform(0.01, 1000), # 0.01 to 1000
"svm__gamma": loguniform(1e-4, 1.0), # 1e-4 to 1.0
}
rs = RandomizedSearchCV(svm_pipe, param_dist, n_iter=20, cv=3,
scoring="f1", random_state=42)
rs.fit(X_train, y_train)
print(f"Best C: {rs.best_params_['svm__C']:.4f}")
print(f"Best gamma: {rs.best_params_['svm__gamma']:.6f}")
print(f"Best CV F1: {rs.best_score_:.4f}")
💡 SVM with an RBF kernel is often competitive with Random Forest on small-to-medium datasets. It is a strong default when the dataset has fewer than ~10,000 rows, the features are dense (tabular), and you need probabilistic outputs. Its weaknesses: it does not scale well to large datasets (roughly O(n²) memory) and is slow to tune. Beyond ~50,000 rows, prefer tree-based models.
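When the dataset outgrows kernel SVMs, the maximum-margin idea survives in linear form: SGDClassifier with hinge loss optimises a linear SVM objective in roughly linear time per epoch. A sketch on synthetic data (the dataset shape and parameters here are illustrative, not from the module):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A "large" synthetic dataset where an O(n^2) kernel SVM becomes painful
X, y = make_classification(n_samples=50_000, n_features=20,
                           n_informative=10, n_clusters_per_class=1,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

# loss="hinge" -> linear SVM; scaling still matters, exactly as for SVC
linear_svm = make_pipeline(StandardScaler(),
                           SGDClassifier(loss="hinge", alpha=1e-4,
                                         random_state=42))
linear_svm.fit(X_tr, y_tr)
acc = linear_svm.score(X_te, y_te)
print(f"Linear-SVM test accuracy: {acc:.3f}")
```

No kernel trick here, so non-linear boundaries are lost — the usual trade for scalability.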
Classification Metrics — Beyond Accuracy
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix,
classification_report, RocCurveDisplay,
PrecisionRecallDisplay)
import matplotlib.pyplot as plt
import seaborn as sns
# "model" stands for any fitted classifier with predict_proba —
# e.g. the random forest trained in the previous section
model = rf
y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
# ── The four classification outcomes ─────────────────
# True Positive (TP): predicted positive, actually positive
# True Negative (TN): predicted negative, actually negative
# False Positive (FP): predicted positive, actually negative (Type I error)
# False Negative (FN): predicted negative, actually positive (Type II error)
# ── Core metrics ──────────────────────────────────────
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred) # TP / (TP + FP)
recall = recall_score(y_test, y_pred) # TP / (TP + FN) = sensitivity
f1 = f1_score(y_test, y_pred) # harmonic mean of precision & recall
roc_auc = roc_auc_score(y_test, y_proba) # area under ROC curve
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f} (when I say positive, how often am I right?)")
print(f"Recall: {recall:.4f} (how many actual positives did I catch?)")
print(f"F1 Score: {f1:.4f} (harmonic mean of precision and recall)")
print(f"ROC-AUC: {roc_auc:.4f} (probability correct positive ranked above negative)")
# ── Confusion matrix ──────────────────────────────────
cm = confusion_matrix(y_test, y_pred)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
xticklabels=["Pred Neg", "Pred Pos"],
yticklabels=["Actual Neg", "Actual Pos"], ax=axes[0])
axes[0].set_title("Confusion Matrix")
# ROC curve
RocCurveDisplay.from_estimator(model, X_test, y_test, ax=axes[1])
axes[1].set_title(f"ROC Curve (AUC={roc_auc:.3f})")
plt.tight_layout()
# Full classification report
print(classification_report(y_test, y_pred, target_names=["Neg", "Pos"]))
# ── Threshold tuning ──────────────────────────────────
# Default threshold = 0.5. Adjust based on business need.
# Lower threshold (e.g. 0.3): catch more positives (higher recall, lower precision)
# Higher threshold (e.g. 0.7): more confident positives (higher precision, lower recall)
thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
for t in thresholds:
    y_t = (y_proba >= t).astype(int)
    print(f"t={t:.1f}: precision={precision_score(y_test, y_t):.3f}, "
          f"recall={recall_score(y_test, y_t):.3f}, "
          f"f1={f1_score(y_test, y_t):.3f}")
⚠️ When to use each metric: Accuracy — only for balanced classes. Precision — when false positives are costly (spam filter: don't block real emails). Recall — when false negatives are costly (cancer screening: don't miss actual cancer). F1 — when both FP and FN matter. ROC-AUC — when you need a threshold-independent comparison across models.
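The formulas above are easy to verify by hand against scikit-learn on a toy label vector (the labels below are made up purely for illustration):

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# 4 actual positives, 6 actual negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

# ravel() of a binary confusion matrix yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)   # 3 / 5 = 0.6
recall = tp / (tp + fn)      # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)

assert precision == precision_score(y_true, y_pred)
assert recall == recall_score(y_true, y_pred)
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
print(f"precision={precision}, recall={recall}, f1={f1:.3f}")
```

Being able to reproduce each metric from the four confusion-matrix counts is the fastest way to internalise what they measure.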
Imbalanced Data — When Classes Are Skewed
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, classification_report
import numpy as np
import pandas as pd  # value_counts below needs pandas
# ── Detect class imbalance ────────────────────────────
print(pd.Series(y_train).value_counts())
print(pd.Series(y_train).value_counts(normalize=True))
# If 95% negative, 5% positive → severe imbalance
# ── Solution 1: class_weight="balanced" (easiest) ────
# Adjusts sample weights so each class contributes equally to loss
# No data modification, no SMOTE complexity
rf_balanced = RandomForestClassifier(
class_weight="balanced", # automatically weights minority class higher
n_estimators=100,
random_state=42, n_jobs=-1
)
rf_balanced.fit(X_train, y_train)
print(f"F1 (balanced weights): {f1_score(y_test, rf_balanced.predict(X_test)):.4f}")
# Compute weights manually
classes = np.unique(y_train)
weights = compute_class_weight("balanced", classes=classes, y=y_train)
weight_dict = dict(zip(classes, weights))
print(f"Class weights: {weight_dict}")
# ── Solution 2: SMOTE — Synthetic Minority Oversampling ──
# Generates synthetic minority-class samples by interpolating between
# existing minority samples in feature space
# USE: when minority class has <10% representation
smote = SMOTE(random_state=42, k_neighbors=5)
X_res, y_res = smote.fit_resample(X_train, y_train)
print(f"Before SMOTE: {pd.Series(y_train).value_counts().to_dict()}")
print(f"After SMOTE: {pd.Series(y_res).value_counts().to_dict()}")
# SMOTE MUST live inside an imblearn pipeline so that, during cross-validation,
# it resamples only the training folds — never the validation or test data
imb_pipe = ImbPipeline([
("smote", SMOTE(random_state=42)),
("rf", RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1))
])
from sklearn.model_selection import cross_val_score
cv_f1 = cross_val_score(imb_pipe, X_train, y_train, cv=5, scoring="f1")
print(f"SMOTE Pipeline CV F1: {cv_f1.mean():.3f}")
# ── Solution 3: Threshold tuning ─────────────────────
# Default 0.5 threshold biased toward majority class
# For imbalanced data, optimal threshold is often lower
proba = rf_balanced.predict_proba(X_test)[:, 1]
from sklearn.metrics import precision_recall_curve
prec, rec, thresholds = precision_recall_curve(y_test, proba)
# precision_recall_curve returns prec/rec with one more element than
# thresholds (the final point has no threshold), so drop the last entry
f1_scores = 2 * (prec[:-1] * rec[:-1]) / (prec[:-1] + rec[:-1] + 1e-9)
best_threshold = thresholds[np.argmax(f1_scores)]
print(f"Optimal threshold: {best_threshold:.3f}")
Feature Importance and SHAP Values
import shap
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.inspection import permutation_importance
# ── Method 1: Tree-based feature importance ───────────
# Built into sklearn tree models
# Based on total reduction in Gini impurity due to each feature
# WARNING: biased toward high-cardinality features
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values(ascending=False).head(15).plot(kind="barh")
plt.title("Random Forest Feature Importance")
# ── Method 2: Permutation Importance ─────────────────
# Shuffle each feature independently, measure performance drop
# More reliable than built-in importance, not biased by cardinality
# SLOW on large datasets
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
random_state=42, scoring="f1")
perm_df = pd.DataFrame({"importance": result.importances_mean,
"std": result.importances_std},
index=X.columns).sort_values("importance", ascending=False)
print(perm_df.head(10))
# ── Method 3: SHAP Values ─────────────────────────────
# SHapley Additive exPlanations — game-theory-based feature contributions
# Explains EACH individual prediction (not just global importance)
# For tree models: exact (fast); for others: approximation
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_test)
# Recent shap versions return a 3D array (n_samples, n_features, n_classes)
# for binary classifiers; older versions return a list of per-class arrays.
# The indexing below assumes the 3D layout and selects the positive class.
# Summary plot: global feature importance + direction
shap.summary_plot(shap_values[:, :, 1], X_test, plot_type="bar")  # bar
shap.summary_plot(shap_values[:, :, 1], X_test)  # beeswarm (richer)
# Individual prediction explanation
idx = 0  # first test sample
shap.waterfall_plot(shap.Explanation(values=shap_values[idx, :, 1],
                                     base_values=explainer.expected_value[1],
                                     data=X_test.iloc[idx]))
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Video | StatQuest — Decision Trees, Random Forest, ROC-AUC (YouTube) | Best visual explanation of how trees split, bagging works, and what ROC-AUC measures. |
| Course | Kaggle ML Explainability Course — kaggle.com/learn/machine-learning-explainability | SHAP values, permutation importance, partial dependence plots. Free, interactive. |
| Docs | Scikit-learn Supervised Learning — scikit-learn.org/stable/supervised_learning.html | Complete parameters for all classifiers. Authoritative reference. |
| Dataset | Heart Disease Dataset — UCI ML Repository | Binary classification, medical features, real clinical relevance. Great for recall/precision analysis. |
| Dataset | Titanic — Kaggle | Classic binary classification benchmark with a well-documented competitive leaderboard. |
Build a medical classification pipeline with full interpretability.
Requirements
- EDA — class balance, feature distributions by class, correlation matrix
- Baseline — Logistic Regression with StandardScaler. Report accuracy, precision, recall, F1, ROC-AUC.
- Decision Tree — tune max_depth with cross-validation. Visualise tree with plot_tree.
- Random Forest — tune n_estimators, max_features. Use oob_score. Report feature importances.
- Class imbalance — apply class_weight="balanced". Compare F1 with and without.
- SHAP analysis — summary plot, waterfall plot for a correctly and incorrectly classified patient
- Results table — compare all models on accuracy, F1, ROC-AUC
Goal: Achieve ROC-AUC > 0.90. What clinical insight do the SHAP values provide?
P3-M08 MASTERY CHECKLIST
- Can explain how a decision tree splits: Gini impurity, information gain
- Can visualise a trained decision tree and trace a single prediction path
- Know max_depth is the primary overfitting control for trees
- Can explain Random Forest: bagging + feature subsampling + majority vote
- Can use oob_score=True as a free validation metric
- Know SVM requires feature scaling and that C controls margin width
- Can compute TP, FP, FN, TN from a confusion matrix
- Can compute accuracy, precision, recall, F1, ROC-AUC and know when to use each
- Know that accuracy is misleading for imbalanced datasets
- Can use class_weight="balanced" and SMOTE for imbalanced data
- Can tune the classification threshold using the precision-recall curve
- Can extract and visualise tree feature importances and permutation importances
- Can generate and interpret SHAP summary and waterfall plots
- Completed project: Heart Disease classifier with SHAP analysis and model comparison table
✅ When complete: Move to P3-M09 — Ensembles: XGBoost, LightGBM, SMOTE, Optuna hyperparameter tuning.