What This Module Covers
Part 3 Start
Regression predicts a continuous output (price, temperature, sales). Linear regression is the foundation — it is interpretable, fast, and a powerful baseline. Regularised variants (Ridge, Lasso) reduce overfitting on high-dimensional data and remain competitive with tree models when features are well-engineered.
- Linear Regression — ordinary least squares, the normal equation, assumptions, residual analysis
- Ridge (L2 regularisation) — shrinks coefficients, handles multicollinearity, keeps all features
- Lasso (L1 regularisation) — automatic feature selection, drives irrelevant features to zero
- Elastic Net — combines L1 and L2, best for correlated feature groups
- Polynomial Regression — capturing non-linear relationships with linear models
- Regression metrics — MAE, MSE, RMSE, R², MAPE — when to use each
💡 Always start with linear regression. It is your baseline. If a linear model gets R²=0.85, you need a very good reason to use a complex model. If R²=0.40, explore feature engineering or non-linear models. Linear models are also fully interpretable — you can explain every prediction to a non-technical stakeholder.
Linear Regression — First Principles to sklearn
Foundation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import root_mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# ── Concept: linear regression minimises SSR ─────────
# ŷ = β₀ + β₁x₁ + β₂x₂ + ... + βₙxₙ
# OLS finds β values that minimise Σ(y - ŷ)²
# Normal equation: β = (XᵀX)⁻¹Xᵀy (closed form; practical when the feature count is modest)
# Gradient descent: update β iteratively (used for large datasets)
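# ── Sanity check: the normal equation on tiny synthetic data ──────────
# Illustrative sketch only; X_demo/y_demo are made-up names, separate from
# the house-price workflow below.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)
X_b = np.c_[np.ones(len(X_demo)), X_demo]             # prepend an intercept column
beta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y_demo)   # solve (XᵀX)β = Xᵀy
print(beta)  # ≈ [3, 2, -1]; LinearRegression().fit(X_demo, y_demo) recovers the same values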
# ── sklearn implementation ────────────────────────────
# df: House Prices training data (assumes df was loaded earlier; TotalBath is an engineered column)
X = df[["GrLivArea", "OverallQual", "YearBuilt", "TotalBath"]].fillna(0)
y = df["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipe = Pipeline([
("scaler", StandardScaler()),
("model", LinearRegression()),
])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:,.0f}")
print(f"Test R²: {r2:.4f}")
# CV score
cv_r2 = cross_val_score(pipe, X, y, cv=5, scoring="r2")
print(f"CV R²: {cv_r2.mean():.3f} ± {cv_r2.std():.3f}")
# ── Coefficients — interpretability ──────────────────
lr = pipe.named_steps["model"]
coef_df = pd.DataFrame({
"feature": X.columns,
"coefficient": lr.coef_
}).sort_values("coefficient", ascending=False)
print(coef_df)
# Each coefficient: "holding all other features constant, increasing this
# feature by one standard deviation (one scaled unit) changes the predicted price by β dollars"
# ── Residual analysis ─────────────────────────────────
residuals = y_test - y_pred
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(y_pred, residuals, alpha=0.3, s=10)
axes[0].axhline(0, color="red", linewidth=1)
axes[0].set(xlabel="Predicted", ylabel="Residual", title="Residuals vs Fitted")
axes[1].hist(residuals, bins=40)
axes[1].set(xlabel="Residual", title="Residual Distribution")
# Ideal: residuals randomly scattered around zero, normally distributed
Linear Regression Assumptions
- Linearity — relationship between features and target is linear
- Independence — observations are independent (violated in time series)
- Homoscedasticity — residuals have constant variance (fan-shaped residuals = violated)
- Normality of residuals — for inference/confidence intervals (not needed for just prediction)
- No perfect multicollinearity — no feature is a perfect linear combination of others
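The multicollinearity assumption can be checked numerically before fitting. A minimal sketch, assuming the X DataFrame defined above: compute each feature's variance inflation factor, VIF = 1 / (1 - R²), where R² comes from regressing that feature on all the others. The vif_table helper is a hypothetical name, not a sklearn function.
import pandas as pd
from sklearn.linear_model import LinearRegression
def vif_table(X_df):
    # VIF per column: 1 / (1 - R²) of that column regressed on the remaining columns
    vifs = {}
    for col in X_df.columns:
        others = X_df.drop(columns=col)
        r2 = LinearRegression().fit(others, X_df[col]).score(others, X_df[col])
        vifs[col] = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF").sort_values(ascending=False)
print(vif_table(X))  # VIF above ~10 is a common warning sign of problematic multicollinearity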
Ridge Regression (L2) — Tame Multicollinearity
Regularisation
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.model_selection import cross_val_score
import numpy as np
# ── Ridge adds L2 penalty to OLS ─────────────────────
# Minimises: Σ(y - ŷ)² + α * Σβᵢ²
# α (alpha): regularisation strength
# α = 0: same as linear regression (no regularisation)
# α → ∞: all coefficients shrink toward zero
# Effect: all coefficients shrink toward zero (none become exactly zero)
# Best for: multicollinear features, more features than samples
alphas = [0.01, 0.1, 1.0, 10, 100, 1000]
# Manual grid search with cross-validation
results = []
for alpha in alphas:
pipe = Pipeline([("scaler", StandardScaler()),
("model", Ridge(alpha=alpha))])
cv_r2 = cross_val_score(pipe, X_train, y_train, cv=5, scoring="r2")
results.append({"alpha": alpha, "r2_mean": cv_r2.mean(), "r2_std": cv_r2.std()})
import pandas as pd
results_df = pd.DataFrame(results)
print(results_df)
best_alpha = results_df.loc[results_df["r2_mean"].idxmax(), "alpha"]
print(f"Best alpha: {best_alpha}")
# Built-in CV (faster)
alphas_to_try = np.logspace(-3, 4, 50)  # log-spaced from 0.001 to 10,000
scaler = StandardScaler().fit(X_train)  # fit the scaler on the training data only
ridge_cv = RidgeCV(alphas=alphas_to_try, cv=5, scoring="r2")
ridge_cv.fit(scaler.transform(X_train), y_train)
print(f"RidgeCV best alpha: {ridge_cv.alpha_:.4f}")
print(f"RidgeCV R²: {ridge_cv.score(scaler.transform(X_test), y_test):.4f}")
# ── Visualise coefficient shrinkage ──────────────────
fig, ax = plt.subplots(figsize=(10, 5))
coefs = []
for alpha in np.logspace(-3, 4, 100):
    r = Ridge(alpha=alpha)
    r.fit(scaler.transform(X_train), y_train)  # reuse the scaler fitted above
    coefs.append(r.coef_)
coefs = np.array(coefs)
for i, name in enumerate(X.columns):
ax.plot(np.logspace(-3, 4, 100), coefs[:, i], label=name)
ax.set_xscale("log")
ax.set(xlabel="Alpha (log scale)", ylabel="Coefficient Value",
title="Ridge: Coefficient Shrinkage")
ax.axvline(best_alpha, color="red", linestyle="--", label="Best alpha")
ax.legend()
plt.tight_layout()
💡 Ridge is the default choice for linear regression with many features. When your features include correlated columns (which real-world datasets almost always do), OLS produces unstable, high-variance coefficients. Ridge's L2 penalty distributes the coefficient weight across correlated features, producing stable predictions even when features are collinear.
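A small synthetic demonstration of that instability (a sketch with made-up variable names, separate from the housing data): two nearly identical features are refit on a few random subsamples; OLS assigns them large, unstable coefficients of opposite sign, while Ridge splits the weight evenly and stays stable.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.01, size=300)          # nearly a copy of x1
y_syn = 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=300)
X_syn = np.column_stack([x1, x2])
for trial in range(3):
    idx = rng.choice(300, size=200, replace=False)  # a fresh subsample each time
    ols = LinearRegression().fit(X_syn[idx], y_syn[idx])
    ridge = Ridge(alpha=10.0).fit(X_syn[idx], y_syn[idx])
    print(f"trial {trial}: OLS {ols.coef_.round(1)}  Ridge {ridge.coef_.round(1)}")
# OLS coefficients swing between subsamples (only their sum is stable, ~4);
# Ridge returns roughly [2, 2] every time.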
Lasso Regression (L1) — Built-In Feature Selection
Sparse Models
from sklearn.linear_model import Lasso, LassoCV
import matplotlib.pyplot as plt
import numpy as np
# ── Lasso adds L1 penalty ────────────────────────────
# Minimises: Σ(y - ŷ)² + α * Σ|βᵢ|
# Key property: L1 penalty drives SOME coefficients to EXACTLY zero
# This is automatic feature selection — irrelevant features are zeroed out
# Drawback: when features are correlated, Lasso picks one arbitrarily
# Find best alpha via cross-validation
alphas_lasso = np.logspace(-4, 1, 50)
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
lasso_cv = LassoCV(alphas=alphas_lasso, cv=5, max_iter=10000, random_state=42)
lasso_cv.fit(X_train_s, y_train)
print(f"Best alpha: {lasso_cv.alpha_:.6f}")
print(f"Test R²: {lasso_cv.score(X_test_s, y_test):.4f}")
# ── Feature selection: which features were zeroed? ────
coefs = pd.Series(lasso_cv.coef_, index=X.columns)
selected = coefs[coefs != 0].sort_values(ascending=False)
zeroed = coefs[coefs == 0]
print(f"Selected features: {len(selected)} / {len(coefs)}")
print(f"Zeroed features: {len(zeroed)}")
print("\nTop 10 selected features:")
print(selected.head(10))
# ── Visualise Lasso path ──────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
coef_path = []
for alpha in alphas_lasso:
l = Lasso(alpha=alpha, max_iter=10000)
l.fit(X_train_s, y_train)
coef_path.append(l.coef_)
coef_path = np.array(coef_path)
for i, name in enumerate(X.columns[:10]):
axes[0].plot(alphas_lasso, coef_path[:, i], label=name)
axes[0].set_xscale("log")
axes[0].axvline(lasso_cv.alpha_, color="red", linestyle="--")
axes[0].set(xlabel="Alpha", ylabel="Coefficient", title="Lasso Path (top 10 features)")
axes[0].legend(fontsize=7)
# Non-zero coefficients
selected.plot(kind="barh", ax=axes[1])
axes[1].set(title="Lasso Selected Features (non-zero coefficients)")
plt.tight_layout()
Ridge vs Lasso: Decision Guide
| Property | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty | Σβᵢ² (squared) | Σ\|βᵢ\| (absolute) |
| Coefficients | Shrink toward zero, none exactly zero | Some become exactly zero (sparse) |
| Feature selection | No (keeps all features) | Yes (automatic) |
| Correlated features | Shares weight across correlated group | Picks one, zeros others |
| Best for | All features likely relevant, multicollinearity | Many irrelevant features, need interpretability |
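A quick way to see the sparsity difference is on synthetic data where only a few features matter. This is an illustrative sketch with made-up variable names; the alpha values are arbitrary placeholders, not tuned.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler
# 20 standardised features, but only the first 3 actually drive the target
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(500, 20))
y_syn = 5 * X_syn[:, 0] - 3 * X_syn[:, 1] + 2 * X_syn[:, 2] + rng.normal(size=500)
X_s = StandardScaler().fit_transform(X_syn)
ridge = Ridge(alpha=1.0).fit(X_s, y_syn)
lasso = Lasso(alpha=0.1).fit(X_s, y_syn)
print("Ridge zeroed coefficients:", (ridge.coef_ == 0).sum())  # 0 (every feature is kept)
print("Lasso zeroed coefficients:", (lasso.coef_ == 0).sum())  # most of the 17 noise features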
Elastic Net and Logistic Regression
Best of Both
from sklearn.linear_model import ElasticNet, ElasticNetCV, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
# ── Elastic Net: combines L1 + L2 ────────────────────
# Penalty: α * [l1_ratio * Σ|βᵢ| + (1 - l1_ratio) * Σβᵢ²]
# l1_ratio = 1.0: pure Lasso
# l1_ratio = 0.0: pure Ridge
# l1_ratio = 0.5: equal mix (default starting point)
#
# Use Elastic Net when: features are correlated AND you want sparsity
# It produces sparse models but handles correlated groups better than Lasso
en_cv = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
cv=5, max_iter=10000, random_state=42)
en_cv.fit(X_train_s, y_train)  # X_train_s: scaled training data from the Lasso section
print(f"Best alpha: {en_cv.alpha_:.6f}")
print(f"Best l1_ratio: {en_cv.l1_ratio_}")
print(f"Non-zero features: {(en_cv.coef_ != 0).sum()}")
# ── Logistic Regression: binary classification ────────
# Despite the name, this is a classification model
# Uses sigmoid function to output probabilities in [0, 1]
# P(y=1|x) = 1 / (1 + e^(-xβ))
# Decision boundary: predict class 1 if P(y=1|x) > 0.5
# Regularised by default in sklearn: C = 1/α (higher C = less regularisation)
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
X_c, y_c = load_breast_cancer(return_X_y=True)
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_c, y_c, test_size=0.2, random_state=42, stratify=y_c)
lr_pipe = Pipeline([
("scaler", StandardScaler()),
("model", LogisticRegression(C=1.0, max_iter=1000, random_state=42))
])
lr_pipe.fit(X_train_c, y_train_c)
# Probabilities (not just class labels)
proba = lr_pipe.predict_proba(X_test_c)[:, 1] # P(class=1)
print(f"Test Accuracy: {lr_pipe.score(X_test_c, y_test_c):.4f}")
# Multi-class: multinomial or one-vs-rest
lr_multi = LogisticRegression(multi_class="multinomial",
solver="lbfgs", max_iter=1000)
# ── Regularisation for Logistic Regression ────────────
# penalty="l2" (default): Ridge regularisation
# penalty="l1" (need solver="liblinear" or "saga"): Lasso = feature selection
# penalty="elasticnet" (need solver="saga"): Elastic NetPolynomial Regression — Non-Linear Relationships
Extensions
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
import matplotlib.pyplot as plt
# ── Polynomial features expand input space ────────────
# For 1 feature x: degree=2 adds [1, x, x²]
# For 2 features x, y: degree=2 adds [1, x, y, x², xy, y²]
# The model is still LINEAR in parameters (just in a higher-dim space)
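# Quick check of the expansion itself (illustrative one-row example):
pf = PolynomialFeatures(degree=2, include_bias=True)
print(pf.fit_transform(np.array([[2.0, 3.0]])))   # [[1. 2. 3. 4. 6. 9.]]
print(pf.get_feature_names_out(["x", "y"]))       # ['1' 'x' 'y' 'x^2' 'x y' 'y^2']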
# Example: 1D regression to visualise
np.random.seed(42)
X_1d = np.linspace(0, 10, 100).reshape(-1, 1)
y_curve = np.sin(X_1d).ravel() + 0.2 * np.random.randn(100)
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for i, degree in enumerate([1, 3, 10]):
pipe = Pipeline([
("poly", PolynomialFeatures(degree=degree, include_bias=False)),
("ridge", Ridge(alpha=0.01)),
])
pipe.fit(X_1d, y_curve)
y_fit = pipe.predict(X_1d)
axes[i].scatter(X_1d, y_curve, s=10, alpha=0.5)
axes[i].plot(X_1d, y_fit, color="red", linewidth=2)
axes[i].set_title(f"Degree {degree}")
plt.suptitle("Polynomial Regression: Underfitting → Overfitting")
# ── For tabular data: selective polynomial features ───
# Adding degree=2 to ALL features: explodes dimensionality
# n features → n + n*(n+1)/2 features with degree=2
# 50 features → 1325 features (often too many)
# Better: add polynomial terms only for key features
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import ColumnTransformer
# Add polynomial terms only for selected features (must be columns of X)
poly_features = ["GrLivArea", "OverallQual"]
poly_pipe = Pipeline([
    ("expand", ColumnTransformer(
        [("poly", PolynomialFeatures(degree=2, include_bias=False), poly_features)],
        remainder="passthrough")),
    ("scale", StandardScaler()),
    ("ridge", Ridge(alpha=10.0)),
])
cv_linear = cross_val_score(
    Pipeline([("scale", StandardScaler()), ("model", Ridge(alpha=10.0))]),
    X_train, y_train, cv=5, scoring="r2"
).mean()
cv_poly = cross_val_score(poly_pipe, X_train, y_train, cv=5, scoring="r2").mean()
print(f"Linear R²: {cv_linear:.4f}")
print(f"Polynomial R²: {cv_poly:.4f}")
⚠️ Polynomial features are a double-edged sword. Degree=2 on 50 features creates 1,325 features; degree=3 creates 23,426. This causes extreme overfitting unless you add strong regularisation (Ridge with large alpha). Always validate with cross-validation and compare to the linear baseline before committing to polynomial expansion.
Regression Metrics — Choosing the Right Score
Evaluation
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
root_mean_squared_error, r2_score,
mean_absolute_percentage_error)
import numpy as np
y_true = y_test
y_pred = pipe.predict(X_test)  # the linear-regression pipeline fitted in the Foundation section
# ── MAE: Mean Absolute Error ──────────────────────────
# Average absolute difference. Same units as target.
# Robust to outliers (vs MSE/RMSE which square errors)
# Easy to explain: "on average, predictions are off by $X"
mae = mean_absolute_error(y_true, y_pred)
print(f"MAE: ${mae:,.0f}")
# ── MSE: Mean Squared Error ───────────────────────────
# Squares errors → large errors penalised more heavily
# NOT in target units (hard to interpret directly)
# Used as training loss (differentiable)
mse = mean_squared_error(y_true, y_pred)
print(f"MSE: {mse:,.0f} (in squared dollars, not directly interpretable)")
# ── RMSE: Root Mean Squared Error ────────────────────
# sqrt(MSE) — back in target units
# Most common metric for regression
# More sensitive to large errors than MAE
rmse = root_mean_squared_error(y_true, y_pred)
print(f"RMSE: ${rmse:,.0f}")
# ── R²: Coefficient of Determination ─────────────────
# Proportion of variance explained: R² = 1 - SS_res/SS_tot
# R² = 1.0: perfect prediction
# R² = 0.0: as good as predicting the mean (useless model)
# R² < 0.0: WORSE than predicting the mean (seriously bad)
r2 = r2_score(y_true, y_pred)
print(f"R²: {r2:.4f}")
# ── MAPE: Mean Absolute Percentage Error ─────────────
# Percentage error. Intuitive but problematic near zero.
# "Predictions are X% off on average"
mape = mean_absolute_percentage_error(y_true, y_pred)
print(f"MAPE: {mape:.2%}")
# ── When to use each ──────────────────────────────────
print("""
MAE: Use when outliers exist and you want robust metric
Use to communicate to stakeholders ("off by $X")
RMSE: Use when large errors are unacceptable (safety, medical)
Standard metric for Kaggle regression competitions
R²: Use to compare models on same dataset
Useful relative metric: 0.85 >> 0.70
MAPE: Use for business reporting when % errors matter
AVOID when target can be near zero
""")FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Video | StatQuest — Linear Regression, Ridge, Lasso (YouTube) | Best visual intuition for what regularisation actually does to coefficients. Highly recommended. |
| Course | Kaggle Intro to ML Course — kaggle.com/learn/intro-to-machine-learning | Practical sklearn regression from scratch. Includes real Kaggle competition exercises. |
| Docs | Scikit-learn Supervised Learning Guide — scikit-learn.org/stable/supervised_learning.html | Complete reference for all linear models, decision trees, SVMs with parameters explained. |
| Book | Hands-On ML (Free Chapter 4) — github.com/ageron/handson-ml3 | Chapter 4 covers linear regression, polynomial, regularisation with excellent visualisations. |
| Dataset | House Prices — Kaggle | Standard regression benchmark. Compare your model against the public leaderboard. |
MODULE PROJECT: House Prices Regression
Build a progression of regression models and compare them systematically.
Requirements
- Baseline — Linear Regression with raw features. 5-fold CV R² and RMSE.
- Engineered features — add TotalSF, HouseAge, TotalBath, QualArea. Does CV R² improve?
- Ridge — use RidgeCV to find optimal alpha. Compare to baseline.
- Lasso — use LassoCV. How many features are zeroed? Are the zeroed features meaningful?
- Target transform — predict log(SalePrice), compare RMSE after expm1() inversion (a sketch follows this list)
- Results table — DataFrame comparing all models: CV R², CV RMSE, n_features used
- Leaderboard — submit to Kaggle and report your score
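A minimal sketch for the target-transform requirement, assuming the X_train/X_test/y_train/y_test split from the Foundation section. TransformedTargetRegressor applies log1p to the target during fit and expm1 to the predictions automatically, so the RMSE below is already in dollars; alpha=10.0 is an arbitrary placeholder, not a tuned value.
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import root_mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
log_model = TransformedTargetRegressor(
    regressor=Pipeline([("scale", StandardScaler()), ("model", Ridge(alpha=10.0))]),
    func=np.log1p, inverse_func=np.expm1)
log_model.fit(X_train, y_train)
rmse_log = root_mean_squared_error(y_test, log_model.predict(X_test))
print(f"RMSE with log target: {rmse_log:,.0f}")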
Linear Regression Internals
Ridge vs Lasso Comparison
P3-M07 MASTERY CHECKLIST
- Can explain what OLS minimises (sum of squared residuals) and why
- Can interpret regression coefficients: "holding all else constant, 1-unit increase in X changes y by β"
- Can plot and interpret residuals vs fitted values (checking homoscedasticity)
- Know Ridge adds L2 (β²) penalty: shrinks all coefficients, none to zero
- Know Lasso adds L1 (|β|) penalty: drives some coefficients to exactly zero (feature selection)
- Can use RidgeCV and LassoCV to find optimal alpha via cross-validation
- Can visualise coefficient shrinkage path across alpha values
- Know Elastic Net combines L1+L2 and when to prefer it over Ridge or Lasso
- Can apply PolynomialFeatures and understand the dimensionality explosion risk
- Can compute MAE, RMSE, R², and MAPE and know when to use each
- Know that R² < 0 means the model is worse than predicting the mean
- Can build and evaluate a Pipeline with preprocessing + regularised regression
- Completed project: House Prices with Linear, Ridge, Lasso compared in a results table
✅ When complete: Move to P3-M08 — Classification: logistic regression, decision trees, random forest, and SVM.