Part 2 — Statistics, EDA & Visualisation  ·  Module 5 of 28
Statistics, EDA & Visualisation
Understand your data before modelling it — distributions, correlations, and visual storytelling
⏱ 2 Weeks 🟡 Intermediate 🔧 matplotlib · seaborn · scipy · pandas 📋 Prerequisite: P1-M02
🎯

What This Module Covers

Part 2 Start

Before building any model, you must understand your data. EDA (Exploratory Data Analysis) is the process of summarising, visualising, and questioning a dataset to surface patterns, anomalies, and relationships. Skipping EDA is one of the most common causes of bad ML models.

  • Descriptive statistics — mean, median, mode, variance, std, percentiles, IQR
  • Distributions — normal, skewed, bimodal; what shape tells you about transformations
  • Correlation analysis — Pearson, Spearman, heatmaps, scatter matrix, multicollinearity
  • Visualisation toolkit — choosing the right plot for each question and data type
  • Full EDA workflow — from raw CSV to insight report in a systematic process
  • Outlier detection — Z-score, IQR method, Isolation Forest; treatment strategies

💡 EDA answers "what is the data?" before you ask "what should the model predict?" The most valuable EDA insight is often unexpected: a corrupted column, a leaky feature, or a distributional shift that invalidates your entire modelling approach. One hour of EDA saves ten hours of debugging a bad model.

📊

Descriptive Statistics — The Complete Toolkit

Foundation
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv("house_prices.csv")

# ── Phase 1: First Pass ────────────────────────────────
print(df.shape)               # (1460, 81) → rows, columns
print(df.dtypes.value_counts()) # how many numeric vs object cols
print(df.isnull().sum().sort_values(ascending=False)[:10])
print(df.nunique().sort_values())   # columns with few unique = likely categorical

# ── Phase 2: Summary Statistics ───────────────────────
df.describe()         # count, mean, std, min, 25%, 50%, 75%, max
df.describe(include="all")  # also shows top/freq for object cols

# ── Phase 3: Individual column statistics ─────────────
col = df["SalePrice"]

mean   = col.mean()       # arithmetic mean — sensitive to outliers
median = col.median()     # middle value — robust to outliers
mode   = col.mode()[0]   # most frequent value
std    = col.std()        # standard deviation
var    = col.var()        # variance = std²
q1     = col.quantile(0.25)
q3     = col.quantile(0.75)
iqr    = q3 - q1         # interquartile range
cv     = std / mean       # coefficient of variation (relative spread)

print(f"Mean: {mean:,.0f}  Median: {median:,.0f}  Diff: {mean-median:,.0f}")
print(f"IQR: {iqr:,.0f}  CV: {cv:.3f}")

# ── Phase 4: Shape statistics ─────────────────────────
skewness = col.skew()           # 0=symmetric, >0=right tail, <0=left tail
kurtosis = col.kurtosis()       # 0=normal tails, >0=heavy tails
print(f"Skewness: {skewness:.3f}  Kurtosis: {kurtosis:.3f}")

# ── Phase 5: Value counts for categoricals ────────────
print(df["Neighborhood"].value_counts())
print(df["Neighborhood"].value_counts(normalize=True).round(3))

# ── Automation: stats for ALL numeric columns ─────────
numeric_cols = df.select_dtypes("number").columns
def skewness(x): return x.skew()
def null_pct(x): return x.isnull().mean()

# Named functions instead of anonymous lambdas — duplicate "<lambda>"
# labels can raise a SpecificationError in pandas .agg()
stats_df = df[numeric_cols].agg(["mean", "median", "std",
                                 skewness, null_pct]).T
stats_df.columns = ["mean", "median", "std", "skewness", "null_pct"]
print(stats_df.sort_values("skewness", ascending=False).head(10))

💡 Mean vs Median tells you about skew. If mean > median, the distribution has a right tail (outliers pulling the mean up). For house prices: mean=$180k, median=$163k — a few luxury homes inflate the mean. The median is more representative of the "typical" house. Always report both for financial or demographic data.
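A quick synthetic check of this rule — the lognormal parameters here are arbitrary, chosen only to mimic a price-like, right-skewed variable:

```python
import numpy as np

rng = np.random.default_rng(42)
# Lognormal sample: a classic right-skewed, price-like variable
incomes = rng.lognormal(mean=11, sigma=0.6, size=10_000)

mean, median = incomes.mean(), np.median(incomes)
print(f"Mean: {mean:,.0f}  Median: {median:,.0f}")
# The long right tail pulls the mean above the median
```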

📈

Distributions — Shape Tells You Everything

Core Concept
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import numpy as np

# ── Visualise distribution ────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram + KDE
sns.histplot(df["SalePrice"], bins=40, kde=True, ax=axes[0])
axes[0].set_title(f"SalePrice (skew={df['SalePrice'].skew():.2f})")

# QQ-plot: if points lie on diagonal line → normal distribution
stats.probplot(df["SalePrice"].dropna(), plot=axes[1])
axes[1].set_title("QQ-Plot: Deviation from Normal")

# Log transform: right-skewed → more symmetric
log_price = np.log1p(df["SalePrice"])
sns.histplot(log_price, bins=40, kde=True, ax=axes[2])
axes[2].set_title(f"log(SalePrice) (skew={log_price.skew():.2f})")
plt.tight_layout()

# ── Formal normality test ─────────────────────────────
# H0: data is normally distributed. p > 0.05 = fail to reject
stat, p = stats.shapiro(df["SalePrice"].dropna().sample(500, random_state=42))
print(f"Shapiro-Wilk: stat={stat:.4f}, p={p:.6f}")
# Very small p → strongly non-normal (expected for house prices)

# ── Box-Cox optimal transformation ───────────────────
# lambda ~0 = log, ~0.5 = sqrt, ~1 = no transform
pos = df["SalePrice"].dropna()
transformed, lam = stats.boxcox(pos)
print(f"Optimal lambda: {lam:.3f} → "
      f"{'log' if abs(lam) < 0.1 else 'sqrt' if abs(lam-0.5) < 0.1 else f'x^{lam:.2f}'}")

# ── Find all highly skewed columns ────────────────────
skewed = df.select_dtypes("number").skew().sort_values(ascending=False)
high_skew = skewed[abs(skewed) > 1]
print(f"{len(high_skew)} columns with |skewness| > 1")
print(high_skew.head(10))

Distribution Types and Actions

  • Normal — symmetric bell; mean ≈ median; follows the 68-95-99.7 rule. Examples: heights, measurement errors. ✅ No transform needed
  • Right-skewed — long right tail; mean > median. Examples: prices, salaries, counts. → Apply log1p or Box-Cox
  • Left-skewed — long left tail; mean < median. Example: test scores bunched near the maximum. → Square the values, or reflect then log-transform
  • Bimodal — two peaks = two subpopulations. Always investigate before modelling. ⚠ Segment the data
  • Zero-inflated — many zeros plus a positive tail. Examples: spend, transaction amounts, counts. → log1p, or model zeros and non-zeros separately
  • Uniform — flat; all values equally likely. Examples: IDs, random numbers. Often useless as a feature. ⚠ Check if meaningful
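The shape-to-action mapping above can be sketched as a small heuristic. The thresholds here (30% zeros, |skewness| > 1) are illustrative assumptions, not fixed rules:

```python
import numpy as np
import pandas as pd

def suggest_transform(s: pd.Series) -> str:
    """Map a column's shape to a candidate action (thresholds are assumptions)."""
    if (s == 0).mean() > 0.30:
        return "zero-inflated → log1p, or model zeros separately"
    skew = s.skew()
    if skew > 1:
        return "right-skewed → log1p or Box-Cox"
    if skew < -1:
        return "left-skewed → square, or reflect then log"
    return "roughly symmetric → no transform needed"

rng = np.random.default_rng(0)
print(suggest_transform(pd.Series(rng.lognormal(0, 1, 1_000))))  # right-skewed
print(suggest_transform(pd.Series(rng.normal(0, 1, 1_000))))     # roughly symmetric
```

Bimodality needs visual inspection (histogram or KDE); it is not reliably detectable from summary statistics alone, which is why it is excluded from the heuristic.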
🔗

Correlation Analysis — Finding Feature Relationships

Feature Selection
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# ── Pearson correlation ───────────────────────────────
# Measures LINEAR relationship. Assumes both variables are continuous.
# Range: -1 (perfect negative) to +1 (perfect positive)
# Sensitive to outliers. Assumes roughly normal distributions.

corr = df.corr(numeric_only=True)

# Most correlated with target
target_corr = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print("Top 10 positive correlations:")
print(target_corr.head(10))
print("\nTop 5 negative correlations:")
print(target_corr.tail(5))

# Heatmap of top correlated features
top_feats = target_corr.abs().nlargest(12).index.tolist() + ["SalePrice"]
plt.figure(figsize=(10, 8))
sns.heatmap(df[top_feats].corr(), annot=True, fmt=".2f",
            cmap="RdBu_r", center=0, vmin=-1, vmax=1, square=True)
plt.title("Top Features — Correlation Matrix")
plt.tight_layout()

# ── Pearson statistical significance ─────────────────
r, p = stats.pearsonr(df["GrLivArea"], df["SalePrice"])
print(f"GrLivArea vs SalePrice: r={r:.3f}, p={p:.4f}")
# p < 0.001 = highly significant

# ── Spearman correlation ──────────────────────────────
# Measures MONOTONIC (not just linear) relationship
# More robust to outliers and non-normal distributions
# Better for ordinal data (OverallQual: 1-10 ratings)
spearman = df.corr(method="spearman", numeric_only=True)
print("Spearman top 5:", spearman["SalePrice"].nlargest(6).drop("SalePrice"))

# ── Multicollinearity detection ───────────────────────
# Features with r > 0.8 cause instability in linear models
def find_multicollinear_pairs(corr_matrix, threshold=0.8):
    pairs = []
    cols = corr_matrix.columns
    for i in range(len(cols)):
        for j in range(i+1, len(cols)):
            r = abs(corr_matrix.iloc[i, j])
            if r > threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return sorted(pairs, key=lambda x: -x[2])

pairs = find_multicollinear_pairs(corr)
for a, b, r in pairs:
    print(f"  {a} ↔ {b}: r={r}")

⚠️ Correlation ≠ causation. Ice cream sales and drowning deaths correlate (both peak in summer) but neither causes the other. Always ask "is there a plausible causal mechanism?" before treating correlation as meaningful. Also: Pearson measures linear relationships only — two variables can be strongly associated but r=0 if the relationship is non-linear (e.g. U-shaped).
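To see the U-shape caveat concretely, here is a sketch with synthetic data where y depends entirely on x, yet Pearson's r is near zero (Spearman fails too, since the relationship is non-monotonic):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 2_000)
y = x ** 2 + rng.normal(0, 0.5, 2_000)   # U-shaped dependence plus noise

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
# Both near zero: neither metric detects non-monotonic relationships
```

This is one reason scatter plots should accompany any correlation table — the plot reveals structure the coefficient hides.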

🎨

Visualisation Toolkit — Right Plot for Every Question

Seaborn + Matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
sns.set_context("notebook")

# ── 1. Single distribution ────────────────────────────
sns.histplot(df["SalePrice"], bins=40, kde=True, color="steelblue")

# ── 2. Distribution by group ─────────────────────────
# Boxplot: median, IQR, whiskers (1.5*IQR), outlier dots
sns.boxplot(data=df, x="OverallQual", y="SalePrice")

# Violinplot: boxplot + KDE density (shows bimodal shapes)
sns.violinplot(data=df, x="OverallQual", y="SalePrice", inner="box")

# ── 3. Two continuous variables ───────────────────────
sns.scatterplot(data=df, x="GrLivArea", y="SalePrice",
                hue="OverallQual", palette="YlOrRd", alpha=0.6)

# Regression line + confidence interval
sns.regplot(data=df, x="GrLivArea", y="SalePrice",
            scatter_kws={"alpha": 0.3}, line_kws={"color": "red"})

# ── 4. All pairwise relationships ────────────────────
# Slow on >10 cols. Subset to key features first.
key = ["SalePrice", "GrLivArea", "OverallQual", "YearBuilt", "TotalBsmtSF"]
sns.pairplot(df[key], diag_kind="kde", plot_kws={"alpha": 0.4})

# ── 5. Categorical frequency ─────────────────────────
order = df["MSZoning"].value_counts().index
sns.countplot(data=df, x="MSZoning", order=order)

# ── 6. Categorical vs numeric ────────────────────────
order = df.groupby("Neighborhood")["SalePrice"].median().sort_values(ascending=False).index
fig, ax = plt.subplots(figsize=(14, 5))
sns.barplot(data=df, x="Neighborhood", y="SalePrice", order=order, ax=ax)
ax.tick_params(axis="x", rotation=45)

# ── 7. Multi-panel summary ────────────────────────────
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
sns.histplot(df["SalePrice"], kde=True, ax=axes[0, 0]).set_title("Price Distribution")
sns.boxplot(data=df, x="OverallQual", y="SalePrice", ax=axes[0, 1]).set_title("Quality vs Price")
sns.scatterplot(data=df, x="GrLivArea", y="SalePrice", alpha=0.3, ax=axes[0, 2]).set_title("Area vs Price")
corr_top = df[key].corr()
sns.heatmap(corr_top, annot=True, fmt=".2f", cmap="RdBu_r", center=0, ax=axes[1, 0])
sns.countplot(data=df, x="MSZoning", ax=axes[1, 1]).set_title("Zoning Distribution")
sns.histplot(df["YearBuilt"], bins=40, ax=axes[1, 2]).set_title("Year Built")
plt.tight_layout()
plt.savefig("eda_summary.png", dpi=150, bbox_inches="tight")

Plot Selection Guide

  • histplot / kdeplot — single continuous distribution with optional smoothing. Q: "What does this column look like?"
  • boxplot — continuous vs categorical; shows median, IQR, outliers. Q: "How does price vary by quality?"
  • violinplot — boxplot + density; reveals bimodal distributions. Q: "Is the distribution symmetric within each group?"
  • scatterplot — two continuous variables; add hue= for a third dimension. Q: "Is there a relationship between X and Y?"
  • pairplot — all pairwise relationships; comprehensive but slow. Q: "Explore all feature interactions"
  • heatmap — correlation matrix; best with annot=True for <15 features. Q: "Which features are correlated?"
🔍

Systematic EDA Workflow

Process
import pandas as pd, numpy as np
import matplotlib.pyplot as plt, seaborn as sns
from scipy import stats

# ════════════════════════════════════════════════════
# STEP 1: DATA INVENTORY
# ════════════════════════════════════════════════════
print(f"Shape: {df.shape}")
print("Dtypes:", df.dtypes.value_counts().to_dict())
null_pct = df.isnull().mean().sort_values(ascending=False)
print("Columns with nulls:\n", null_pct[null_pct > 0])

numeric_cols     = df.select_dtypes("number").columns.tolist()
categorical_cols = df.select_dtypes("object").columns.tolist()

# Missing value heatmap
plt.figure(figsize=(14, 5))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap="viridis")
plt.title("Missing Values (yellow = missing)")

# ════════════════════════════════════════════════════
# STEP 2: TARGET VARIABLE ANALYSIS
# ════════════════════════════════════════════════════
TARGET = "SalePrice"
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(df[TARGET], kde=True, ax=axes[0]).set_title(f"Raw (skew={df[TARGET].skew():.2f})")
stats.probplot(df[TARGET].dropna(), plot=axes[1]); axes[1].set_title("QQ Plot")
log_t = np.log1p(df[TARGET])
sns.histplot(log_t, kde=True, ax=axes[2]).set_title(f"log() (skew={log_t.skew():.2f})")
plt.tight_layout()

# ════════════════════════════════════════════════════
# STEP 3: NUMERIC FEATURES
# ════════════════════════════════════════════════════
# Distribution overview
df[numeric_cols].hist(figsize=(20, 16), bins=30)
plt.tight_layout()

# Correlation with target
tc = df[numeric_cols].corr()[TARGET].drop(TARGET).sort_values(ascending=False)
plt.figure(figsize=(10, 8))
tc.plot(kind="barh", color=["green" if x > 0 else "red" for x in tc])
plt.title("Feature Correlation with SalePrice")
plt.axvline(0, color="black", linewidth=1)

# Scatter plots for top 6 features
top6 = tc.abs().nlargest(6).index
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for ax, feat in zip(axes.flat, top6):
    r, _ = stats.pearsonr(df[feat].dropna(), df[TARGET][df[feat].notna()])
    ax.scatter(df[feat], df[TARGET], alpha=0.3, s=10)
    ax.set(xlabel=feat, ylabel=TARGET, title=f"r={r:.3f}")
plt.tight_layout()

# ════════════════════════════════════════════════════
# STEP 4: CATEGORICAL FEATURES
# ════════════════════════════════════════════════════
for col in categorical_cols[:6]:
    n_cats = df[col].nunique()
    if n_cats > 15: continue      # skip high-cardinality
    fig, ax = plt.subplots(figsize=(10, 4))
    order = df.groupby(col)[TARGET].median().sort_values(ascending=False).index
    sns.boxplot(data=df, x=col, y=TARGET, order=order, ax=ax)
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right")  # avoids FixedLocator warning
    ax.set_title(f"{col} vs {TARGET} (n_cats={n_cats})")
    plt.tight_layout()
🚨

Outlier Detection and Treatment

Data Quality
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.ensemble import IsolationForest

# ── Method 1: IQR (most robust, skew-tolerant) ────────
def iqr_outlier_mask(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

price_outliers = iqr_outlier_mask(df["SalePrice"])
area_outliers  = iqr_outlier_mask(df["GrLivArea"])
print(f"Price IQR outliers: {price_outliers.sum()}")
print(f"Area  IQR outliers:  {area_outliers.sum()}")

# Visualise with scatter: outliers in red
fig, ax = plt.subplots(figsize=(8, 5))
both_outliers = price_outliers | area_outliers
ax.scatter(df.loc[~both_outliers, "GrLivArea"], df.loc[~both_outliers, "SalePrice"],
           alpha=0.4, s=10, label="Normal")
ax.scatter(df.loc[both_outliers, "GrLivArea"],  df.loc[both_outliers, "SalePrice"],
           color="red", s=30, label="Outlier")
ax.set(xlabel="GrLivArea", ylabel="SalePrice", title="IQR Outliers Flagged")
ax.legend()

# ── Method 2: Z-score (assumes normal distribution) ──
z = np.abs(stats.zscore(df[["SalePrice", "GrLivArea"]].dropna()))
z_outliers = (z > 3).any(axis=1)
print(f"Z-score outliers (|z|>3): {z_outliers.sum()}")

# ── Method 3: Isolation Forest (multivariate) ────────
numeric = df.select_dtypes("number").dropna()
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(numeric)
multi_outliers = labels == -1
print(f"Isolation Forest: {multi_outliers.sum()} outliers")

# ── Treatment options ─────────────────────────────────
# Option 1: Remove — only if clearly erroneous, not just extreme
df_clean = df[~(area_outliers & price_outliers)].copy()
print(f"After removal: {len(df_clean)} rows (was {len(df)})")

# Option 2: Winsorise (cap at percentile) — preserves all rows
lo, hi = df["SalePrice"].quantile([0.01, 0.99])
df["SalePrice_w"] = df["SalePrice"].clip(lo, hi)

# Option 3: Log transform — mathematically compresses tails
df["SalePrice_log"] = np.log1p(df["SalePrice"])

💡 Before removing outliers, always inspect them individually. In the House Prices dataset, there are two houses with GrLivArea > 4000 sqft but low SalePrice — these are partial sales, not data errors. Removing them changes model behaviour significantly. Check the Kaggle competition discussion before deleting rows.
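Inspecting flagged rows is a one-liner. This sketch uses a tiny synthetic stand-in dataframe — the GrLivArea/SalePrice values for the two flagged houses match the Kaggle dataset, the rest are illustrative:

```python
import pandas as pd

# Synthetic stand-in for the real dataframe; the two large-area rows
# mirror the known partial sales in the Kaggle House Prices data
df = pd.DataFrame({
    "GrLivArea":     [1_500, 4_676, 5_642, 2_100],
    "SalePrice":     [180_000, 184_750, 160_000, 250_000],
    "SaleCondition": ["Normal", "Partial", "Partial", "Normal"],
})

# Pull the suspicious rows and examine their sale context before deleting
suspect = df[(df["GrLivArea"] > 4_000) & (df["SalePrice"] < 300_000)]
print(suspect)
# Both flagged rows are Partial sales — extreme but legitimate, not errors
```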

FREE LEARNING RESOURCES

  • Course — Kaggle Data Visualisation Course (free): kaggle.com/learn/data-visualization. Best hands-on Seaborn & Matplotlib exercises; interactive notebooks, immediate feedback.
  • Video — StatQuest, Statistics Fundamentals (YouTube): youtube.com/c/joshstarmer. Best visual explanations of distributions, p-values, correlation, and statistical tests. No maths anxiety.
  • Docs — Seaborn Official Tutorial: seaborn.pydata.org/tutorial.html. Complete Seaborn reference with examples for every plot type; the authoritative source.
  • Docs — Matplotlib Gallery: matplotlib.org/stable/gallery. Hundreds of example plots with full copy-paste source code; start from a working example.
  • Dataset — House Prices (Kaggle). Best EDA dataset: 79 features, interesting distributions, real-world messy data, active Kaggle community.
  • Dataset — Titanic (Kaggle). Classic EDA dataset for categorical analysis and survival patterns; well-documented community notebooks.
  • Course — Kaggle Feature Engineering Course: kaggle.com/learn/feature-engineering. Extends EDA into feature creation; covers mutual information, encoding strategies.
🛠 House Prices EDA Report — [Intermediate] · 5–6 days

Conduct a full EDA on the House Prices dataset and produce a polished visual report.

10 Required Visualisations

  • Target distribution: histogram + QQ plot of SalePrice, log-transformed version, skewness comparison
  • Missing values: heatmap with percentage annotations
  • Correlation: heatmap of top 15 features, bar chart of r-values with target
  • Scatter with colour: GrLivArea vs SalePrice coloured by OverallQual
  • Categorical analysis: boxplots of SalePrice by Neighborhood (sorted by median), OverallQual
  • Distributions: histograms of top 5 right-skewed features before and after log transform
  • Outlier plot: GrLivArea vs SalePrice with outliers highlighted in red

Deliverable: Jupyter notebook with all plots + 5 written insights that would guide modelling decisions.

🛠 Titanic Survival EDA — [Beginner] · 3–4 days

EDA focused on categorical patterns and survival rate analysis.

  • Survival rates by: Sex, Pclass, Embarked, Age group (10-year bins)
  • Age distribution by survival status (overlapping histograms with alpha)
  • Fare distribution — detect and annotate outliers
  • Correlation heatmap for numeric features
  • Conclusion: "Which 3 features would you include in a model and why?"
LAB 1

Descriptive Statistics Deep Dive

1
Load House Prices. For SalePrice compute mean, median, std, skewness, kurtosis. Verify they match df.describe() and col.skew(). Print: "Mean exceeds median by X%, indicating right skew."
2
Find all numeric columns with |skewness| > 1. Apply log1p transform. Re-compute. How many now have |skewness| < 0.5? Print a before/after comparison table.
3
Compute coefficient of variation (std/mean) for all numeric columns. Find the 3 with highest CV. What does high CV mean for model sensitivity?
4
Run Shapiro-Wilk on SalePrice (n=500). Then on log(SalePrice). Compare p-values and conclude: which version is more suitable for linear regression?
LAB 2

Correlation and Multicollinearity

1
Compute Pearson correlation matrix. Find the 10 features most correlated with SalePrice. For each, compute the statistical significance (scipy.stats.pearsonr). Are all 10 significant?
2
Find all pairs with |r| > 0.8. Visualise one such pair with a scatter plot. Explain in one sentence why having both features in a linear model would be problematic.
3
Compare Pearson vs Spearman for OverallQual vs SalePrice. OverallQual is ordinal (integers 1–10). Which correlation is more appropriate and why?
4
Build a publication-quality correlation heatmap: top 12 features + SalePrice, annot=True, fmt=".2f", center=0, square=True, and a title. Save as PNG at 150 DPI.
LAB 3

Outlier Detection Comparison

1
Apply IQR outlier detection to GrLivArea. Plot them as red points on a GrLivArea vs SalePrice scatter. How many outliers are flagged?
2
Inspect the 2 points with GrLivArea > 4000 but SalePrice < $300k. Look at their other features (MSZoning, SaleType, SaleCondition). Are these data errors or partial sales?
3
Apply Isolation Forest (contamination=0.05). Compare its outlier list vs IQR. Calculate: what % of IQR outliers are also flagged by Isolation Forest?
4
Create 3 versions of SalePrice: raw, Winsorised (1st/99th pct), log. Plot all 3 histograms side by side with their skewness annotated. Which version would you use for linear regression?

P2-M05 MASTERY CHECKLIST

When complete: Move to P2-M06 — ML Workflow: feature engineering, scaling, encoding, train/test split, and sklearn Pipelines.

← P1-M04: SQL & FastAPI 🗺️ All Modules Next: P2-M06 — ML Workflow →