Part 3 — Classical ML  ·  Module 7 of 28
Regression: Linear, Ridge, Lasso & Polynomial
Predict continuous outcomes — from first principles to regularised production models
⏱ 2 Weeks 🟡 Intermediate 🔧 scikit-learn · numpy · statsmodels 📋 Prerequisite: P2-M06
🎯

What This Module Covers

Part 3 Start

Regression predicts a continuous output (price, temperature, sales). Linear regression is the foundation — it is interpretable, fast, and a powerful baseline. Regularised variants (Ridge, Lasso) prevent overfitting on high-dimensional data and remain competitive with tree models when features are well-engineered.

  • Linear Regression — ordinary least squares, the normal equation, assumptions, residual analysis
  • Ridge (L2 regularisation) — shrinks coefficients, handles multicollinearity, keeps all features
  • Lasso (L1 regularisation) — automatic feature selection, drives irrelevant features to zero
  • Elastic Net — combines L1 and L2, best for correlated feature groups
  • Polynomial Regression — capturing non-linear relationships with linear models
  • Regression metrics — MAE, MSE, RMSE, R², MAPE — when to use each

💡 Always start with linear regression. It is your baseline. If a linear model gets R²=0.85, you need a very good reason to use a complex model. If R²=0.40, explore feature engineering or non-linear models. Linear models are also fully interpretable — you can explain every prediction to a non-technical stakeholder.

📐

Linear Regression — First Principles to sklearn

Foundation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import root_mean_squared_error, r2_score  # root_mean_squared_error requires scikit-learn >= 1.4
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# ── Concept: linear regression minimises SSR ─────────
# ŷ = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
# OLS finds β values that minimise Σ(y - ŷ)²
# Normal equation: β = (XᵀX)⁻¹Xᵀy (closed form for small n)
# Gradient descent: update β iteratively (used for large datasets)

# ── sklearn implementation ────────────────────────────
# (df is the Ames / House Prices training DataFrame loaded earlier in the course)
X = df[["GrLivArea", "OverallQual", "YearBuilt", "TotalBath"]].fillna(0)
y = df["SalePrice"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  LinearRegression()),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

rmse = root_mean_squared_error(y_test, y_pred)
r2   = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:,.0f}")
print(f"Test R²:   {r2:.4f}")

# CV score
cv_r2 = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(f"CV R²: {cv_r2.mean():.3f} ± {cv_r2.std():.3f}")

# ── Coefficients — interpretability ──────────────────
lr = pipe.named_steps["model"]
coef_df = pd.DataFrame({
    "feature": X.columns,
    "coefficient": lr.coef_
}).sort_values("coefficient", ascending=False)
print(coef_df)
# Each coefficient: "holding all other features constant, increasing this
# feature by one standard deviation (one scaled unit) changes the predicted price by X dollars"

# ── Residual analysis ─────────────────────────────────
residuals = y_test - y_pred
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(y_pred, residuals, alpha=0.3, s=10)
axes[0].axhline(0, color="red", linewidth=1)
axes[0].set(xlabel="Predicted", ylabel="Residual", title="Residuals vs Fitted")
axes[1].hist(residuals, bins=40)
axes[1].set(xlabel="Residual", title="Residual Distribution")
# Ideal: residuals randomly scattered around zero, normally distributed
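
A quick way to demystify the normal equation above is to solve it directly with numpy and check it against sklearn. A minimal sketch, assuming the X_train / y_train split from above (an unscaled LinearRegression is used for the comparison, since the pipeline's coefficients are in scaled units):

# ── Sanity check: normal equation vs sklearn (sketch) ──
X_mat = np.column_stack([np.ones(len(X_train)), X_train.to_numpy()])   # prepend intercept column
beta = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ y_train.to_numpy())  # solve (XᵀX)β = Xᵀy

lr_plain = LinearRegression().fit(X_train, y_train)                    # unscaled, like-for-like
print("Normal equation:", beta[0], beta[1:])
print("sklearn:        ", lr_plain.intercept_, lr_plain.coef_)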

Linear Regression Assumptions

  • Linearity — relationship between features and target is linear
  • Independence — observations are independent (violated in time series)
  • Homoscedasticity — residuals have constant variance (fan-shaped residuals = violated)
  • Normality of residuals — for inference/confidence intervals (not needed for just prediction)
  • No perfect multicollinearity — no feature is a perfect linear combination of others (a quick VIF check is sketched below)
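
A quick check of the multicollinearity assumption is the variance inflation factor (VIF). A minimal sketch using statsmodels (one of this module's tools), assuming the X feature frame defined above; VIF values above roughly 5–10 are usually read as a warning sign:

# ── Multicollinearity check with VIF (sketch) ─────────
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

X_vif = sm.add_constant(X)   # VIF should be computed with an intercept column present
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif.drop("const"))     # rule of thumb: VIF > 5–10 suggests problematic collinearity
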
🛡

Ridge Regression (L2) — Tame Multicollinearity

Regularisation
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score
import numpy as np

# ── Ridge adds L2 penalty to OLS ─────────────────────
# Minimises: Σ(y - ŷ)² + α * Σβᵢ²
# α (alpha): regularisation strength
#   α = 0: same as linear regression (no regularisation)
#   α → ∞: all coefficients shrink toward zero
# Effect: coefficients shrink smoothly toward zero (none become exactly zero)
# Best for: multicollinear features, more features than samples

alphas = [0.01, 0.1, 1.0, 10, 100, 1000]

# Manual grid search with cross-validation
results = []
for alpha in alphas:
    pipe = Pipeline([("scaler", StandardScaler()),
                     ("model", Ridge(alpha=alpha))])
    cv_r2 = cross_val_score(pipe, X_train, y_train, cv=5, scoring="r2")
    results.append({"alpha": alpha, "r2_mean": cv_r2.mean(), "r2_std": cv_r2.std()})

import pandas as pd
results_df = pd.DataFrame(results)
print(results_df)
best_alpha = results_df.loc[results_df["r2_mean"].idxmax(), "alpha"]
print(f"Best alpha: {best_alpha}")

# Built-in CV (faster)
alphas_to_try = np.logspace(-3, 4, 50)  # log-spaced from 0.001 to 10,000
scaler = StandardScaler().fit(X_train)  # fit the scaler on training data only, reuse it for test
ridge_cv = RidgeCV(alphas=alphas_to_try, cv=5, scoring="r2")
ridge_cv.fit(scaler.transform(X_train), y_train)
print(f"RidgeCV best alpha: {ridge_cv.alpha_:.4f}")
print(f"RidgeCV R²: {ridge_cv.score(scaler.transform(X_test), y_test):.4f}")

# ── Visualise coefficient shrinkage ──────────────────
fig, ax = plt.subplots(figsize=(10, 5))
X_train_scaled = scaler.transform(X_train)
coefs = []
for alpha in np.logspace(-3, 4, 100):
    r = Ridge(alpha=alpha)
    r.fit(X_train_scaled, y_train)
    coefs.append(r.coef_)
coefs = np.array(coefs)
for i, name in enumerate(X.columns):
    ax.plot(np.logspace(-3, 4, 100), coefs[:, i], label=name)
ax.set_xscale("log")
ax.set(xlabel="Alpha (log scale)", ylabel="Coefficient Value",
       title="Ridge: Coefficient Shrinkage")
ax.axvline(best_alpha, color="red", linestyle="--", label="Best alpha")
ax.legend()
plt.tight_layout()

💡 Ridge is the default choice for linear regression with many features. When your features include correlated columns (which they always do in real datasets), OLS produces unstable, high-variance coefficients. Ridge's L2 penalty distributes the coefficient weight across correlated features, producing stable predictions even when features are collinear.
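
The instability claim is easy to see on synthetic data. A self-contained sketch (not tied to the housing data): two nearly identical features, refit on bootstrap resamples, then compare how much the OLS and Ridge coefficients move:

# ── OLS vs Ridge stability under collinearity (sketch) ─
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)          # almost perfectly correlated with x1
y_sim = 3 * x1 + 3 * x2 + rng.normal(size=n)

ols_coefs, ridge_coefs = [], []
for _ in range(200):                          # bootstrap resamples
    idx = rng.integers(0, n, n)
    Xb, yb = np.column_stack([x1, x2])[idx], y_sim[idx]
    ols_coefs.append(LinearRegression().fit(Xb, yb).coef_)
    ridge_coefs.append(Ridge(alpha=1.0).fit(Xb, yb).coef_)

print("OLS coef std:  ", np.std(ols_coefs, axis=0))     # large — estimates swing between resamples
print("Ridge coef std:", np.std(ridge_coefs, axis=0))   # much smaller — the penalty stabilises them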

Lasso Regression (L1) — Built-In Feature Selection

Sparse Models
from sklearn.linear_model import Lasso, LassoCV
import matplotlib.pyplot as plt
import numpy as np

# ── Lasso adds L1 penalty ────────────────────────────
# Minimises: Σ(y - ŷ)² + α * Σ|βᵢ|
# Key property: L1 penalty drives SOME coefficients to EXACTLY zero
# This is automatic feature selection — irrelevant features are zeroed out
# Drawback: when features are correlated, Lasso picks one arbitrarily

# Find best alpha via cross-validation
alphas_lasso = np.logspace(-4, 1, 50)

from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s  = scaler.transform(X_test)

lasso_cv = LassoCV(alphas=alphas_lasso, cv=5, max_iter=10000, random_state=42)
lasso_cv.fit(X_train_s, y_train)
print(f"Best alpha: {lasso_cv.alpha_:.6f}")
print(f"Test R²: {lasso_cv.score(X_test_s, y_test):.4f}")

# ── Feature selection: which features were zeroed? ────
coefs = pd.Series(lasso_cv.coef_, index=X.columns)
selected = coefs[coefs != 0].sort_values(ascending=False)
zeroed   = coefs[coefs == 0]
print(f"Selected features: {len(selected)} / {len(coefs)}")
print(f"Zeroed features:   {len(zeroed)}")
print("\nTop 10 selected features:")
print(selected.head(10))

# ── Visualise Lasso path ──────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

coef_path = []
for alpha in alphas_lasso:
    l = Lasso(alpha=alpha, max_iter=10000)
    l.fit(X_train_s, y_train)
    coef_path.append(l.coef_)

coef_path = np.array(coef_path)
for i, name in enumerate(X.columns[:10]):        # first 10 features, to keep the legend readable
    axes[0].plot(alphas_lasso, coef_path[:, i], label=name)
axes[0].set_xscale("log")
axes[0].axvline(lasso_cv.alpha_, color="red", linestyle="--")
axes[0].set(xlabel="Alpha", ylabel="Coefficient", title="Lasso Path (first 10 features)")
axes[0].legend(fontsize=7)

# Non-zero coefficients
selected.plot(kind="barh", ax=axes[1])
axes[1].set(title="Lasso Selected Features (non-zero coefficients)")
plt.tight_layout()

Ridge vs Lasso: Decision Guide

Property            | Ridge (L2)                                      | Lasso (L1)
--------------------|-------------------------------------------------|--------------------------------------------------
Penalty             | β² (squared)                                    | |β| (absolute)
Coefficients        | Shrink toward zero, none exactly zero           | Some become exactly zero (sparse)
Feature selection   | No (keeps all features)                         | Yes (automatic)
Correlated features | Shares weight across correlated group           | Picks one, zeros others
Best for            | All features likely relevant, multicollinearity | Many irrelevant features, need interpretability
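
The "correlated features" row is easy to demonstrate by duplicating a column. A minimal sketch, assuming the scaled X_train_s / y_train arrays and X.columns from the Lasso section above (the alpha values are arbitrary illustration choices):

# ── Ridge shares weight, Lasso picks one (sketch) ─────
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso

X_dup = np.column_stack([X_train_s, X_train_s[:, 0]])   # duplicate the first feature
dup_cols = list(X.columns) + [f"{X.columns[0]}_copy"]

ridge_dup = Ridge(alpha=10.0).fit(X_dup, y_train)
lasso_dup = Lasso(alpha=100.0, max_iter=10000).fit(X_dup, y_train)

dup_df = pd.DataFrame({"ridge": ridge_dup.coef_, "lasso": lasso_dup.coef_}, index=dup_cols)
print(dup_df.loc[[X.columns[0], f"{X.columns[0]}_copy"]])
# Typical outcome: Ridge splits the weight roughly evenly across the two copies,
# while Lasso concentrates it on one copy and leaves the other at (or near) zero.
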
🔀

Elastic Net and Logistic Regression

Best of Both
from sklearn.linear_model import ElasticNet, ElasticNetCV, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

# ── Elastic Net: combines L1 + L2 ────────────────────
# Penalty: α * [l1_ratio * Σ|βᵢ| + (1 - l1_ratio) * Σβᵢ²]
# l1_ratio = 1.0: pure Lasso
# l1_ratio = 0.0: pure Ridge
# l1_ratio = 0.5: equal mix (default starting point)
#
# Use Elastic Net when: features are correlated AND you want sparsity
# It produces sparse models but handles correlated groups better than Lasso

en_cv = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
                      cv=5, max_iter=10000, random_state=42)
en_cv.fit(X_train_s, y_train)   # reuse the scaled training data from the Lasso section
print(f"Best alpha: {en_cv.alpha_:.6f}")
print(f"Best l1_ratio: {en_cv.l1_ratio_}")
print(f"Non-zero features: {(en_cv.coef_ != 0).sum()}")

# ── Logistic Regression: binary classification ────────
# Despite the name, this is a classification model
# Uses sigmoid function to output probabilities in [0, 1]
# P(y=1|x) = 1 / (1 + e^(-xβ))
# Decision boundary: predict class 1 if P(y=1|x) > 0.5
# Naturally regularised: C = 1/α (higher C = less regularisation)

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_c, y_c = load_breast_cancer(return_X_y=True)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_c, y_c, test_size=0.2, random_state=42)

lr_pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model",  LogisticRegression(C=1.0, max_iter=1000, random_state=42))
])
lr_pipe.fit(X_train_c, y_train_c)

# Probabilities (not just class labels)
proba = lr_pipe.predict_proba(X_test_c)[:, 1]  # P(class=1)
print(f"Test Accuracy: {lr_pipe.score(X_test_c, y_test_c):.4f}")

# Multi-class: with solver="lbfgs" the model is multinomial by default
# (the explicit multi_class parameter is deprecated in recent scikit-learn versions)
lr_multi = LogisticRegression(solver="lbfgs", max_iter=1000)

# ── Regularisation for Logistic Regression ────────────
# penalty="l2" (default): Ridge regularisation
# penalty="l1" (need solver="liblinear" or "saga"): Lasso = feature selection
# penalty="elasticnet" (need solver="saga"): Elastic Net
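
The sigmoid formula above can be checked by hand against predict_proba. A minimal sketch, assuming the fitted lr_pipe and the X_test_c split from above:

# ── Manual sigmoid check against predict_proba (sketch) ─
import numpy as np

scaler_c = lr_pipe.named_steps["scaler"]
clf      = lr_pipe.named_steps["model"]

z = scaler_c.transform(X_test_c) @ clf.coef_.ravel() + clf.intercept_[0]   # xβ + β₀
manual_proba = 1 / (1 + np.exp(-z))                                        # sigmoid

print(np.allclose(manual_proba, lr_pipe.predict_proba(X_test_c)[:, 1]))    # expect True
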
📈

Polynomial Regression — Non-Linear Relationships

Extensions
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt

# ── Polynomial features expand input space ────────────
# For 1 feature x: degree=2 adds [1, x, x²]
# For 2 features x, y: degree=2 adds [1, x, y, x², xy, y²]
# The model is still LINEAR in parameters (just in a higher-dim space)
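
# Quick sanity check of the expansion on a toy 2-feature row (not the housing data):
demo = PolynomialFeatures(degree=2, include_bias=False)
demo.fit(np.array([[2.0, 3.0]]))
print(demo.get_feature_names_out(["x", "y"]))    # ['x' 'y' 'x^2' 'x y' 'y^2']
print(demo.transform(np.array([[2.0, 3.0]])))    # [[2. 3. 4. 6. 9.]]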

# Example: 1D regression to visualise
np.random.seed(42)
X_1d = np.linspace(0, 10, 100).reshape(-1, 1)
y_curve = np.sin(X_1d).ravel() + 0.2 * np.random.randn(100)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, degree in enumerate([1, 3, 10]):
    pipe = Pipeline([
        ("poly",  PolynomialFeatures(degree=degree, include_bias=False)),
        ("ridge", Ridge(alpha=0.01)),
    ])
    pipe.fit(X_1d, y_curve)
    y_fit = pipe.predict(X_1d)
    axes[i].scatter(X_1d, y_curve, s=10, alpha=0.5)
    axes[i].plot(X_1d, y_fit, color="red", linewidth=2)
    axes[i].set_title(f"Degree {degree}")
plt.suptitle("Polynomial Regression: Underfitting → Overfitting")

# ── For tabular data: selective polynomial features ───
# Adding degree=2 to ALL features: explodes dimensionality
# n features → n + n*(n+1)/2 features with degree=2
# 50 features → 1325 features (often too many)
# Better: add polynomial terms only for key features

from sklearn.preprocessing import StandardScaler   # already imported earlier; repeated so this block runs standalone

# Add polynomial terms only for the two strongest features (both present in the X defined earlier)
# (a selective ColumnTransformer version is sketched right after this block)
poly_features = ["GrLivArea", "OverallQual"]

# Fair comparison on the SAME inputs: plain linear model vs degree-2 expansion of all features
cv_linear = cross_val_score(
    Pipeline([("scale", StandardScaler()), ("model", Ridge(alpha=10.0))]),
    X_train, y_train, cv=5, scoring="r2"
).mean()

cv_poly = cross_val_score(
    Pipeline([("poly", PolynomialFeatures(degree=2, include_bias=False)),
              ("scale", StandardScaler()),
              ("model", Ridge(alpha=10.0))]),
    X_train, y_train, cv=5, scoring="r2"
).mean()

print(f"Linear R²:     {cv_linear:.4f}")
print(f"Polynomial R²: {cv_poly:.4f}")
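
To expand only selected columns while passing the rest through untouched (the selective approach described above), a ColumnTransformer does the splitting. A minimal sketch, assuming the poly_features list and X_train / y_train from this section:

# ── Selective polynomial expansion with ColumnTransformer (sketch) ──
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

selective = Pipeline([
    ("expand", ColumnTransformer(
        [("poly", PolynomialFeatures(degree=2, include_bias=False), poly_features)],
        remainder="passthrough",      # all other columns pass through unchanged
    )),
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=10.0)),
])
cv_selective = cross_val_score(selective, X_train, y_train, cv=5, scoring="r2").mean()
print(f"Selective polynomial R²: {cv_selective:.4f}")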

⚠️ Polynomial features are a double-edged sword. Degree=2 on 50 features creates 1,325 features; degree=3 creates 23,425. This causes extreme overfitting unless you add strong regularisation (Ridge with large alpha). Always validate with cross-validation and compare to the linear baseline before committing to polynomial expansion.

📊

Regression Metrics — Choosing the Right Score

Evaluation
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             root_mean_squared_error, r2_score,
                             mean_absolute_percentage_error)
import numpy as np

y_true = y_test
y_pred = pipe.predict(X_test)   # the linear-regression pipeline fitted at the top of the module

# ── MAE: Mean Absolute Error ──────────────────────────
# Average absolute difference. Same units as target.
# Robust to outliers (vs MSE/RMSE which square errors)
# Easy to explain: "on average, predictions are off by $X"
mae = mean_absolute_error(y_true, y_pred)
print(f"MAE:  ${mae:,.0f}")

# ── MSE: Mean Squared Error ───────────────────────────
# Squares errors → large errors penalised more heavily
# NOT in target units (hard to interpret directly)
# Used as training loss (differentiable)
mse = mean_squared_error(y_true, y_pred)
print(f"MSE:  {mse:,.0f} (squared units)")

# ── RMSE: Root Mean Squared Error ────────────────────
# sqrt(MSE) — back in target units
# Most common metric for regression
# More sensitive to large errors than MAE
rmse = root_mean_squared_error(y_true, y_pred)
print(f"RMSE: ${rmse:,.0f}")

# ── R²: Coefficient of Determination ─────────────────
# Proportion of variance explained: R² = 1 - SS_res/SS_tot
# R² = 1.0: perfect prediction
# R² = 0.0: as good as predicting the mean (useless model)
# R² < 0.0: WORSE than predicting the mean (seriously bad)
r2 = r2_score(y_true, y_pred)
print(f"R²:   {r2:.4f}")

# ── MAPE: Mean Absolute Percentage Error ─────────────
# Percentage error. Intuitive but problematic near zero.
# "Predictions are X% off on average"
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"MAPE: {mape:.2%}")

# ── When to use each ──────────────────────────────────
print("""
MAE:  Use when outliers exist and you want robust metric
      Use to communicate to stakeholders ("off by $X")
RMSE: Use when large errors are unacceptable (safety, medical)
      Standard metric for Kaggle regression competitions
R²:   Use to compare models on same dataset
      Useful relative metric: 0.85 >> 0.70
MAPE: Use for business reporting when % errors matter
      AVOID when target can be near zero
""")

FREE LEARNING RESOURCES

Type    | Resource                                                                                   | Best For
--------|--------------------------------------------------------------------------------------------|----------
Video   | StatQuest — Linear Regression, Ridge, Lasso (YouTube)                                     | Best visual intuition for what regularisation actually does to coefficients. Highly recommended.
Course  | Kaggle Intro to ML Course — kaggle.com/learn/intro-to-machine-learning                    | Practical sklearn regression from scratch. Includes real Kaggle competition exercises.
Docs    | Scikit-learn Supervised Learning Guide — scikit-learn.org/stable/supervised_learning.html | Complete reference for all linear models, decision trees, SVMs with parameters explained.
Book    | Hands-On ML (Free Chapter 4) — github.com/ageron/handson-ml3                              | Chapter 4 covers linear regression, polynomial, regularisation with excellent visualisations.
Dataset | House Prices — Kaggle                                                                      | Standard regression benchmark. Compare your model against the public leaderboard.
🛠 House Price Predictor — From Linear to Ridge to Lasso  ·  🟡 Intermediate  ·  ⏱ 5–6 days

Build a progression of regression models and compare them systematically.

Requirements

  • Baseline — Linear Regression with raw features. 5-fold CV R² and RMSE.
  • Engineered features — add TotalSF, HouseAge, TotalBath, QualArea. Does CV R² improve?
  • Ridge — use RidgeCV to find optimal alpha. Compare to baseline.
  • Lasso — use LassoCV. How many features are zeroed? Are the zeroed features meaningful?
  • Target transform — predict log(SalePrice), compare RMSE after expm1() inversion (a minimal sketch follows this list)
  • Results table — DataFrame comparing all models: CV R², CV RMSE, n_features used
  • Leaderboard — submit to Kaggle and report your score
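
For the target-transform requirement, TransformedTargetRegressor keeps the log / expm1 round trip inside the model, so cross-validated errors come back in dollars. A minimal sketch, assuming a Ridge pipeline and the X / y frames used in this module (the project itself will use the fuller Kaggle feature set):

# ── Log-target Ridge via TransformedTargetRegressor (sketch) ──
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

log_model = TransformedTargetRegressor(
    regressor=Pipeline([("scale", StandardScaler()), ("model", Ridge(alpha=10.0))]),
    func=np.log1p,           # train on log(1 + SalePrice)
    inverse_func=np.expm1,   # predictions are converted back to dollars automatically
)
cv_rmse = -cross_val_score(log_model, X, y, cv=5, scoring="neg_root_mean_squared_error")
print(f"CV RMSE with log target: {cv_rmse.mean():,.0f} ± {cv_rmse.std():,.0f}")
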
LAB 1

Linear Regression Internals

1. Implement linear regression from scratch using the normal equation β = (XᵀX)⁻¹Xᵀy with numpy. Apply it to a 2-feature subset. Compare the coefficients to sklearn's output.
2. Plot residuals vs fitted values. Do the residuals show a fan shape (heteroscedasticity)? What does this mean for the validity of your p-values?
3. Apply a log transform to SalePrice and refit the model. Compare the raw RMSE with the RMSE after converting log predictions back to prices. Which model has better absolute dollar accuracy?
LAB 2

Ridge vs Lasso Comparison

1. Build both RidgeCV and LassoCV models with the same 50-feature set. Compare: CV R², number of non-zero features, and the 5 most important features in each model.
2. Plot coefficient values for both models side-by-side on a bar chart. Which features does Lasso zero out that Ridge keeps? Are they the most correlated (multicollinear) features?
3. Add 20 random noise features (np.random.randn) to the dataset. Refit Ridge and Lasso. Does Lasso zero out the noise features? Does Ridge shrink them or keep them?

P3-M07 MASTERY CHECKLIST

When complete: Move to P3-M08 — Classification: logistic regression, decision trees, random forest, and SVM.

← P2-M06: ML Workflow 🗺️ All Modules Next: P3-M08 — Classification →