What This Module Covers
Before building any model, you must understand your data. EDA (Exploratory Data Analysis) is the process of summarising, visualising, and questioning a dataset to find patterns, anomalies, and relationships. Skipping EDA is one of the most common causes of bad ML models.
- Descriptive statistics — mean, median, mode, variance, std, percentiles, IQR
- Distributions — normal, skewed, bimodal; what shape tells you about transformations
- Correlation analysis — Pearson, Spearman, heatmaps, scatter matrix, multicollinearity
- Visualisation toolkit — choosing the right plot for each question and data type
- Full EDA workflow — from raw CSV to insight report in a systematic process
- Outlier detection — Z-score, IQR method, Isolation Forest; treatment strategies
💡 EDA answers "what is the data?" before you ask "what should the model predict?" The most valuable EDA insight is often unexpected: a corrupted column, a leaky feature, or a distributional shift that invalidates your entire modelling approach. One hour of EDA saves ten hours of debugging a bad model.
Descriptive Statistics — The Complete Toolkit
Foundation

```python
import pandas as pd
import numpy as np

df = pd.read_csv("house_prices.csv")

# ── Phase 1: First Pass ───────────────────────────────
print(df.shape)                  # (1460, 81) → rows, columns
print(df.dtypes.value_counts())  # how many numeric vs object cols
print(df.isnull().sum().sort_values(ascending=False)[:10])
print(df.nunique().sort_values())  # columns with few unique values = likely categorical

# ── Phase 2: Summary Statistics ───────────────────────
df.describe()               # count, mean, std, min, 25%, 50%, 75%, max
df.describe(include="all")  # also shows top/freq for object cols

# ── Phase 3: Individual column statistics ─────────────
col = df["SalePrice"]
mean = col.mean()       # arithmetic mean — sensitive to outliers
median = col.median()   # middle value — robust to outliers
mode = col.mode()[0]    # most frequent value
std = col.std()         # standard deviation
var = col.var()         # variance = std²
q1 = col.quantile(0.25)
q3 = col.quantile(0.75)
iqr = q3 - q1           # interquartile range
cv = std / mean         # coefficient of variation (relative spread)
print(f"Mean: {mean:,.0f}  Median: {median:,.0f}  Diff: {mean - median:,.0f}")
print(f"IQR: {iqr:,.0f}  CV: {cv:.3f}")

# ── Phase 4: Shape statistics ─────────────────────────
skewness = col.skew()      # 0 = symmetric, >0 = right tail, <0 = left tail
kurtosis = col.kurtosis()  # 0 = normal tails, >0 = heavy tails
print(f"Skewness: {skewness:.3f}  Kurtosis: {kurtosis:.3f}")

# ── Phase 5: Value counts for categoricals ────────────
print(df["Neighborhood"].value_counts())
print(df["Neighborhood"].value_counts(normalize=True).round(3))

# ── Automation: stats for ALL numeric columns ─────────
# Named helpers instead of two anonymous lambdas — duplicate
# "<lambda>" labels are rejected by .agg in some pandas versions
def skew(x):
    return x.skew()

def null_pct(x):
    return x.isnull().mean()

numeric_cols = df.select_dtypes("number").columns
stats_df = df[numeric_cols].agg(["mean", "median", "std", skew, null_pct]).T
stats_df.columns = ["mean", "median", "std", "skewness", "null_pct"]
print(stats_df.sort_values("skewness", ascending=False).head(10))
```
💡 Mean vs Median tells you about skew. If mean > median, the distribution has a right tail (outliers pulling the mean up). For house prices: mean=$180k, median=$163k — a few luxury homes inflate the mean. The median is more representative of the "typical" house. Always report both for financial or demographic data.
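To make the skew signal concrete, here is a minimal sketch on synthetic data — the log-normal sample is an illustrative stand-in for price-like values, not the actual dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Log-normal sample: a common stand-in for right-skewed, price-like data
prices = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=5_000))

print(f"mean={prices.mean():,.0f}  median={prices.median():,.0f}  skew={prices.skew():.2f}")
# mean > median and skew > 0 → right tail

# log1p pulls the tail in: mean and median converge, skew drops toward 0
logged = np.log1p(prices)
print(f"after log1p: mean={logged.mean():.3f}  median={logged.median():.3f}  skew={logged.skew():.2f}")
```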
Distributions — Shape Tells You Everything
Core Concept

```python
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import numpy as np

# ── Visualise distribution ────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram + KDE
sns.histplot(df["SalePrice"], bins=40, kde=True, ax=axes[0])
axes[0].set_title(f"SalePrice (skew={df['SalePrice'].skew():.2f})")

# QQ-plot: if points lie on the diagonal line → normal distribution
stats.probplot(df["SalePrice"].dropna(), plot=axes[1])
axes[1].set_title("QQ-Plot: Deviation from Normal")

# Log transform: right-skewed → more symmetric
log_price = np.log1p(df["SalePrice"])
sns.histplot(log_price, bins=40, kde=True, ax=axes[2])
axes[2].set_title(f"log(SalePrice) (skew={log_price.skew():.2f})")
plt.tight_layout()

# ── Formal normality test ─────────────────────────────
# H0: data is normally distributed. p > 0.05 = fail to reject
stat, p = stats.shapiro(df["SalePrice"].dropna().sample(500, random_state=42))
print(f"Shapiro-Wilk: stat={stat:.4f}, p={p:.6f}")
# Very small p → strongly non-normal (expected for house prices)

# ── Box-Cox optimal transformation ────────────────────
# lambda ≈ 0 → log, ≈ 0.5 → sqrt, ≈ 1 → no transform
pos = df["SalePrice"].dropna()
transformed, lam = stats.boxcox(pos)
print(f"Optimal lambda: {lam:.3f} → "
      f"{'log' if abs(lam) < 0.1 else 'sqrt' if abs(lam - 0.5) < 0.1 else f'x^{lam:.2f}'}")

# ── Find all highly skewed columns ────────────────────
skewed = df.select_dtypes("number").skew().sort_values(ascending=False)
high_skew = skewed[abs(skewed) > 1]
print(f"{len(high_skew)} columns with |skewness| > 1")
print(high_skew.head(10))
```
Distribution Types and Actions
| Type | Signature | Typical Examples / Action |
|---|---|---|
| Normal | Symmetric bell; mean = median; 68-95-99.7 rule | Heights, measurement errors |
| Right-Skewed | Long right tail; mean > median | Prices, salaries, counts |
| Left-Skewed | Long left tail; mean < median | Test scores bunched near the maximum |
| Bimodal | Two peaks = two subpopulations | Always investigate before modelling |
| Zero-Inflated | Many zeros plus a positive tail | Spend, transaction amounts, counts |
| Uniform | Flat; all values equally likely | IDs, random numbers; often useless as a feature |
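To connect the table to the statistics, a small sketch that generates each shape synthetically (the parameters are arbitrary, chosen only to produce the shapes):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
shapes = {
    "normal":        rng.normal(0, 1, n),
    "right_skewed":  rng.lognormal(0, 0.8, n),
    "left_skewed":   -rng.lognormal(0, 0.8, n),
    "bimodal":       np.concatenate([rng.normal(-3, 1, n // 2), rng.normal(3, 1, n // 2)]),
    "zero_inflated": np.where(rng.random(n) < 0.6, 0.0, rng.exponential(100, n)),
    "uniform":       rng.uniform(0, 1, n),
}
for name, sample in shapes.items():
    s = pd.Series(sample)
    print(f"{name:>14}: skew={s.skew():+.2f}  mean={s.mean():+.2f}  median={s.median():+.2f}")
# Note: bimodal data can show skew ≈ 0 — only a histogram reveals the two peaks
```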
Correlation Analysis — Finding Feature Relationships
Feature Selection

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# ── Pearson correlation ───────────────────────────────
# Measures LINEAR relationship. Assumes both variables are continuous.
# Range: -1 (perfect negative) to +1 (perfect positive)
# Sensitive to outliers. Assumes roughly normal distributions.
corr = df.corr(numeric_only=True)

# Most correlated with target
target_corr = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print("Top 10 positive correlations:")
print(target_corr.head(10))
print("\nTop 5 negative correlations:")
print(target_corr.tail(5))

# Heatmap of top correlated features
top_feats = target_corr.abs().nlargest(12).index.tolist() + ["SalePrice"]
plt.figure(figsize=(10, 8))
sns.heatmap(df[top_feats].corr(), annot=True, fmt=".2f",
            cmap="RdBu_r", center=0, vmin=-1, vmax=1, square=True)
plt.title("Top Features — Correlation Matrix")
plt.tight_layout()

# ── Pearson statistical significance ──────────────────
r, p = stats.pearsonr(df["GrLivArea"], df["SalePrice"])
print(f"GrLivArea vs SalePrice: r={r:.3f}, p={p:.4f}")
# p < 0.001 = highly significant

# ── Spearman correlation ──────────────────────────────
# Measures MONOTONIC (not just linear) relationship
# More robust to outliers and non-normal distributions
# Better for ordinal data (OverallQual: 1-10 ratings)
spearman = df.corr(method="spearman", numeric_only=True)
print("Spearman top 5:", spearman["SalePrice"].nlargest(6))  # first entry is SalePrice itself

# ── Multicollinearity detection ───────────────────────
# Feature pairs with |r| > 0.8 cause instability in linear models
def find_multicollinear_pairs(corr_matrix, threshold=0.8):
    pairs = []
    cols = corr_matrix.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = abs(corr_matrix.iloc[i, j])
            if r > threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return sorted(pairs, key=lambda x: -x[2])

pairs = find_multicollinear_pairs(corr)
for a, b, r in pairs:
    print(f"  {a} ↔ {b}: r={r}")
```
⚠️ Correlation ≠ causation. Ice cream sales and drowning deaths correlate (both peak in summer) but neither causes the other. Always ask "is there a plausible causal mechanism?" before treating correlation as meaningful. Also: Pearson measures linear relationships only — two variables can be strongly associated but r=0 if the relationship is non-linear (e.g. U-shaped).
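The non-linear caveat is easy to demonstrate — a minimal sketch with a U-shaped relationship, where both Pearson and Spearman report roughly zero even though y is almost perfectly predictable from x:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 2_000)
y = x**2 + rng.normal(0, 0.5, 2_000)  # strong U-shaped dependence on x

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
print(f"Pearson r={r:+.3f}  Spearman rho={rho:+.3f}")  # both near 0
# Neither catches a non-monotonic relationship — a scatterplot
# (or mutual information) reveals it immediately
```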
Visualisation Toolkit — Right Plot for Every Question
Seaborn + Matplotlib

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
sns.set_context("notebook")

# ── 1. Single distribution ────────────────────────────
sns.histplot(df["SalePrice"], bins=40, kde=True, color="steelblue")

# ── 2. Distribution by group ──────────────────────────
# Boxplot: median, IQR, whiskers (1.5×IQR), outlier dots
sns.boxplot(data=df, x="OverallQual", y="SalePrice")
# Violinplot: boxplot + KDE density (shows bimodal shapes)
sns.violinplot(data=df, x="OverallQual", y="SalePrice", inner="box")

# ── 3. Two continuous variables ───────────────────────
sns.scatterplot(data=df, x="GrLivArea", y="SalePrice",
                hue="OverallQual", palette="YlOrRd", alpha=0.6)
# Regression line + confidence interval
sns.regplot(data=df, x="GrLivArea", y="SalePrice",
            scatter_kws={"alpha": 0.3}, line_kws={"color": "red"})

# ── 4. All pairwise relationships ─────────────────────
# Slow on >10 cols. Subset to key features first.
key = ["SalePrice", "GrLivArea", "OverallQual", "YearBuilt", "TotalBsmtSF"]
sns.pairplot(df[key], diag_kind="kde", plot_kws={"alpha": 0.4})

# ── 5. Categorical frequency ──────────────────────────
order = df["MSZoning"].value_counts().index
sns.countplot(data=df, x="MSZoning", order=order)

# ── 6. Categorical vs numeric ─────────────────────────
order = df.groupby("Neighborhood")["SalePrice"].median().sort_values(ascending=False).index
fig, ax = plt.subplots(figsize=(14, 5))
sns.barplot(data=df, x="Neighborhood", y="SalePrice", order=order, ax=ax)
ax.tick_params(axis="x", rotation=45)

# ── 7. Multi-panel summary ────────────────────────────
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
sns.histplot(df["SalePrice"], kde=True, ax=axes[0, 0]).set_title("Price Distribution")
sns.boxplot(data=df, x="OverallQual", y="SalePrice", ax=axes[0, 1]).set_title("Quality vs Price")
sns.scatterplot(data=df, x="GrLivArea", y="SalePrice", alpha=0.3, ax=axes[0, 2]).set_title("Area vs Price")
corr_top = df[key].corr()
sns.heatmap(corr_top, annot=True, fmt=".2f", cmap="RdBu_r", center=0, ax=axes[1, 0])
sns.countplot(data=df, x="MSZoning", ax=axes[1, 1]).set_title("Zoning Distribution")
sns.histplot(df["YearBuilt"], bins=40, ax=axes[1, 2]).set_title("Year Built")
plt.tight_layout()
plt.savefig("eda_summary.png", dpi=150, bbox_inches="tight")
```
Plot Selection Guide
| Plot | Best For |
|---|---|
| histplot / kdeplot | Single continuous distribution with optional smoothing |
| boxplot | Continuous vs categorical; shows median, IQR, outliers |
| violinplot | Boxplot + density; reveals bimodal distributions |
| scatterplot | Two continuous variables; add hue= for a 3rd dimension |
| pairplot | All pairwise relationships; comprehensive but slow |
| heatmap | Correlation matrix; best with annot=True for <15 features |
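The guide collapses into a small rule-of-thumb helper. This is a sketch using this module's own categories — not a library function, and the suggestions are just starting points:

```python
def suggest_plot(x_kind, y_kind=None):
    """Map variable kinds to a starting plot. Kinds: 'continuous' or 'categorical'."""
    if y_kind is None:  # one variable → distribution plot
        return "histplot/kdeplot" if x_kind == "continuous" else "countplot"
    if x_kind == y_kind == "continuous":
        return "scatterplot (regplot for trend, pairplot for many pairs)"
    if "categorical" in (x_kind, y_kind):
        return "boxplot or violinplot (barplot for aggregated medians)"
    return "heatmap (e.g. a correlation matrix)"

print(suggest_plot("continuous"))                 # histplot/kdeplot
print(suggest_plot("continuous", "categorical"))  # boxplot or violinplot ...
```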
Systematic EDA Workflow
Process

```python
import pandas as pd, numpy as np
import matplotlib.pyplot as plt, seaborn as sns
from scipy import stats

# ════════════════════════════════════════════════════
# STEP 1: DATA INVENTORY
# ════════════════════════════════════════════════════
print(f"Shape: {df.shape}")
print("Dtypes:", df.dtypes.value_counts().to_dict())
null_pct = df.isnull().mean().sort_values(ascending=False)
print("Columns with nulls:\n", null_pct[null_pct > 0])

numeric_cols = df.select_dtypes("number").columns.tolist()
categorical_cols = df.select_dtypes("object").columns.tolist()

# Missing value heatmap
plt.figure(figsize=(14, 5))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap="viridis")
plt.title("Missing Values (yellow = missing)")

# ════════════════════════════════════════════════════
# STEP 2: TARGET VARIABLE ANALYSIS
# ════════════════════════════════════════════════════
TARGET = "SalePrice"
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(df[TARGET], kde=True, ax=axes[0]).set_title(f"Raw (skew={df[TARGET].skew():.2f})")
stats.probplot(df[TARGET].dropna(), plot=axes[1])
axes[1].set_title("QQ Plot")
log_t = np.log1p(df[TARGET])
sns.histplot(log_t, kde=True, ax=axes[2]).set_title(f"log() (skew={log_t.skew():.2f})")
plt.tight_layout()

# ════════════════════════════════════════════════════
# STEP 3: NUMERIC FEATURES
# ════════════════════════════════════════════════════
# Distribution overview
df[numeric_cols].hist(figsize=(20, 16), bins=30)
plt.tight_layout()

# Correlation with target
tc = df[numeric_cols].corr()[TARGET].drop(TARGET).sort_values(ascending=False)
plt.figure(figsize=(10, 8))
tc.plot(kind="barh", color=["green" if x > 0 else "red" for x in tc])
plt.title("Feature Correlation with SalePrice")
plt.axvline(0, color="black", linewidth=1)

# Scatter plots for top 6 features
top6 = tc.abs().nlargest(6).index
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for ax, feat in zip(axes.flat, top6):
    mask = df[feat].notna()  # align both series before computing r
    r, _ = stats.pearsonr(df.loc[mask, feat], df.loc[mask, TARGET])
    ax.scatter(df[feat], df[TARGET], alpha=0.3, s=10)
    ax.set(xlabel=feat, ylabel=TARGET, title=f"r={r:.3f}")
plt.tight_layout()

# ════════════════════════════════════════════════════
# STEP 4: CATEGORICAL FEATURES
# ════════════════════════════════════════════════════
for col in categorical_cols[:6]:
    n_cats = df[col].nunique()
    if n_cats > 15:
        continue  # skip high-cardinality
    fig, ax = plt.subplots(figsize=(10, 4))
    order = df.groupby(col)[TARGET].median().sort_values(ascending=False).index
    sns.boxplot(data=df, x=col, y=TARGET, order=order, ax=ax)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
    ax.set_title(f"{col} vs {TARGET} (n_cats={n_cats})")
    plt.tight_layout()
```
Outlier Detection and Treatment
Data Quality

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.ensemble import IsolationForest

# ── Method 1: IQR (most robust, skew-tolerant) ────────
def iqr_outlier_mask(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

price_outliers = iqr_outlier_mask(df["SalePrice"])
area_outliers = iqr_outlier_mask(df["GrLivArea"])
print(f"Price IQR outliers: {price_outliers.sum()}")
print(f"Area IQR outliers: {area_outliers.sum()}")

# Visualise with scatter: outliers in red
flagged = price_outliers | area_outliers  # outlier on either axis
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df.loc[~flagged, "GrLivArea"], df.loc[~flagged, "SalePrice"],
           alpha=0.4, s=10, label="Normal")
ax.scatter(df.loc[flagged, "GrLivArea"], df.loc[flagged, "SalePrice"],
           color="red", s=30, label="Outlier")
ax.set(xlabel="GrLivArea", ylabel="SalePrice", title="IQR Outliers Flagged")
ax.legend()

# ── Method 2: Z-score (assumes normal distribution) ───
z = np.abs(stats.zscore(df[["SalePrice", "GrLivArea"]].dropna()))
z_outliers = (z > 3).any(axis=1)
print(f"Z-score outliers (|z|>3): {z_outliers.sum()}")

# ── Method 3: Isolation Forest (multivariate) ─────────
numeric = df.select_dtypes("number").dropna()
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(numeric)
multi_outliers = labels == -1
print(f"Isolation Forest: {multi_outliers.sum()} outliers")

# ── Treatment options ─────────────────────────────────
# Option 1: Remove — only if clearly erroneous, not just extreme
df_clean = df[~(area_outliers & price_outliers)].copy()
print(f"After removal: {len(df_clean)} rows (was {len(df)})")

# Option 2: Winsorise (cap at percentile) — preserves all rows
lo, hi = df["SalePrice"].quantile([0.01, 0.99])
df["SalePrice_w"] = df["SalePrice"].clip(lo, hi)

# Option 3: Log transform — mathematically compresses tails
df["SalePrice_log"] = np.log1p(df["SalePrice"])
```
💡 Before removing outliers, always inspect them individually. In the House Prices dataset, there are two houses with GrLivArea > 4000 sqft but low SalePrice — these are partial sales, not data errors. Removing them changes model behaviour significantly. Check the Kaggle competition discussion before deleting rows.
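A sketch of that inspection step, reusing the iqr_outlier_mask helper from the block above (it assumes the SaleCondition column, which is present in the Kaggle House Prices data):

```python
# Look at the flagged rows before deciding anything
suspects = df[iqr_outlier_mask(df["GrLivArea"])]
cols = ["GrLivArea", "SalePrice", "OverallQual", "YearBuilt", "SaleCondition"]
print(suspects[cols].sort_values("GrLivArea", ascending=False).head(10))

# The huge-but-cheap houses tend to carry SaleCondition == "Partial" —
# contextual information, not a data-entry error
print(suspects["SaleCondition"].value_counts())
```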
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Course | Kaggle Data Visualisation Course (Free) — kaggle.com/learn/data-visualization | Best hands-on Seaborn & Matplotlib exercises. Interactive notebooks, immediate feedback. |
| Video | StatQuest — Statistics Fundamentals (YouTube) — youtube.com/c/joshstarmer | Best visual explanations of distributions, p-values, correlation, and statistical tests. No maths anxiety. |
| Docs | Seaborn Official Tutorial — seaborn.pydata.org/tutorial.html | Complete Seaborn reference with examples for every plot type. The authoritative source. |
| Docs | Matplotlib Gallery — matplotlib.org/stable/gallery | Hundreds of example plots with full copy-paste source code. Start from a working example. |
| Dataset | House Prices — Kaggle | Best EDA dataset: 79 features, interesting distributions, real-world messy data, Kaggle community. |
| Dataset | Titanic — Kaggle | Classic EDA dataset for categorical analysis and survival patterns. Well-documented community notebooks. |
| Course | Kaggle Feature Engineering Course — kaggle.com/learn/feature-engineering | Extends EDA into feature creation. Covers mutual information, encoding strategies. |
Module project: conduct a full EDA on the House Prices dataset and produce a polished visual report.
10 Required Visualisations
- Target distribution: histogram + QQ plot of SalePrice, log-transformed version, skewness comparison
- Missing values: heatmap with percentage annotations (see the starter sketch after the deliverable below)
- Correlation: heatmap of top 15 features, bar chart of r-values with target
- Scatter with colour: GrLivArea vs SalePrice coloured by OverallQual
- Categorical analysis: boxplots of SalePrice by Neighborhood (sorted by median), OverallQual
- Distributions: histograms of top 5 right-skewed features before and after log transform
- Outlier plot: GrLivArea vs SalePrice with outliers highlighted in red
Deliverable: Jupyter notebook with all plots + 5 written insights that would guide modelling decisions.
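For the missing-values requirement, one possible starter — a sketch that annotates a horizontal bar chart, which is often easier to label than the heatmap itself:

```python
import matplotlib.pyplot as plt

null_pct = df.isnull().mean().sort_values(ascending=False)
null_pct = null_pct[null_pct > 0] * 100  # keep only columns with nulls, as %

fig, ax = plt.subplots(figsize=(10, 6))
null_pct.plot(kind="barh", ax=ax, color="steelblue")
for i, v in enumerate(null_pct):  # annotate each bar with its percentage
    ax.text(v + 0.5, i, f"{v:.1f}%", va="center", fontsize=8)
ax.set(xlabel="% missing", title="Missing Values by Column")
plt.tight_layout()
```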
A second mini-project: EDA on the Titanic dataset, focused on categorical patterns and survival-rate analysis (a starter sketch follows the task list below).
- Survival rates by: Sex, Pclass, Embarked, Age group (10-year bins)
- Age distribution by survival status (overlapping histograms with alpha)
- Fare distribution — detect and annotate outliers
- Correlation heatmap for numeric features
- Conclusion: "Which 3 features would you include in a model and why?"
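A possible starting point for the survival-rate breakdowns (a sketch; it assumes the standard Kaggle Titanic columns Survived, Sex, Pclass, Embarked, and Age, and a hypothetical local file path):

```python
import pandas as pd

titanic = pd.read_csv("titanic.csv")  # path is an assumption — use your local copy

# Survival rate by category = mean of the 0/1 Survived column
for col in ["Sex", "Pclass", "Embarked"]:
    print(titanic.groupby(col)["Survived"].agg(["mean", "count"]).round(3), "\n")

# 10-year age bins
titanic["AgeGroup"] = pd.cut(titanic["Age"], bins=range(0, 91, 10))
print(titanic.groupby("AgeGroup", observed=True)["Survived"].mean().round(3))
```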
Additional practice exercises:
- Descriptive Statistics Deep Dive
- Correlation and Multicollinearity
- Outlier Detection Comparison
P2-M05 MASTERY CHECKLIST
- Can compute mean, median, mode, std, IQR, skewness, kurtosis for any pandas Series
- Know when mean > median signals right skew and what treatment to apply
- Can interpret distribution shape from histogram: normal, right-skewed, left-skewed, bimodal
- Can apply log1p transform and verify it reduces skewness
- Can run Shapiro-Wilk normality test and interpret the p-value
- Can compute Pearson correlation matrix and identify the features most correlated with target
- Can find multicollinear feature pairs (|r| > 0.8) and understand why they are problematic
- Know the difference between Pearson and Spearman and when to use each
- Can produce: histplot, boxplot, violinplot, scatterplot, regplot, pairplot, heatmap, barplot, countplot
- Can choose the right plot for a given question and data type combination
- Can save publication-quality plots with savefig(dpi=150, bbox_inches='tight')
- Can follow the 4-phase EDA workflow: inventory → target → numeric → categorical
- Can detect outliers with IQR method, Z-score, and Isolation Forest
- Can choose between removing, capping (Winsorising), and log-transforming outliers
- Completed project: House Prices EDA report with 10 visualisations and written conclusions
✅ When complete: Move to P2-M06 — ML Workflow: feature engineering, scaling, encoding, train/test split, and sklearn Pipelines.