What This Module Covers
Before building any model, you must understand your data. EDA (Exploratory Data Analysis) is the process of summarising, visualising, and questioning a dataset to find patterns, anomalies, and relationships. Skipping EDA is one of the most common causes of bad ML models.
- Descriptive statistics — mean, median, mode, variance, std, percentiles, IQR
- Distributions — normal, skewed, bimodal; what shape tells you about transformations
- Correlation analysis — Pearson, Spearman, heatmaps, scatter matrix, multicollinearity
- Visualisation toolkit — choosing the right plot for each question and data type
- Full EDA workflow — from raw CSV to insight report in a systematic process
- Outlier detection — Z-score, IQR method, Isolation Forest; treatment strategies
💡 EDA answers "what is the data?" before you ask "what should the model predict?" The most valuable EDA insight is often unexpected: a corrupted column, a leaky feature, or a distributional shift that invalidates your entire modelling approach. One hour of EDA saves ten hours of debugging a bad model.
Descriptive Statistics — The Complete Toolkit
Foundation

```python
import pandas as pd
import numpy as np

df = pd.read_csv("house_prices.csv")

# ── Phase 1: First Pass ───────────────────────────────
print(df.shape)                  # (1460, 81) → rows, columns
print(df.dtypes.value_counts())  # how many numeric vs object cols
print(df.isnull().sum().sort_values(ascending=False)[:10])
print(df.nunique().sort_values())  # columns with few unique values = likely categorical

# ── Phase 2: Summary Statistics ───────────────────────
df.describe()               # count, mean, std, min, 25%, 50%, 75%, max
df.describe(include="all")  # also shows top/freq for object cols

# ── Phase 3: Individual column statistics ─────────────
col = df["SalePrice"]
mean = col.mean()       # arithmetic mean — sensitive to outliers
median = col.median()   # middle value — robust to outliers
mode = col.mode()[0]    # most frequent value
std = col.std()         # standard deviation
var = col.var()         # variance = std²
q1 = col.quantile(0.25)
q3 = col.quantile(0.75)
iqr = q3 - q1           # interquartile range
cv = std / mean         # coefficient of variation (relative spread)
print(f"Mean: {mean:,.0f}  Median: {median:,.0f}  Diff: {mean - median:,.0f}")
print(f"IQR: {iqr:,.0f}  CV: {cv:.3f}")

# ── Phase 4: Shape statistics ─────────────────────────
skewness = col.skew()      # 0 = symmetric, >0 = right tail, <0 = left tail
kurtosis = col.kurtosis()  # 0 = normal tails, >0 = heavy tails
print(f"Skewness: {skewness:.3f}  Kurtosis: {kurtosis:.3f}")

# ── Phase 5: Value counts for categoricals ────────────
print(df["Neighborhood"].value_counts())
print(df["Neighborhood"].value_counts(normalize=True).round(3))

# ── Automation: stats for ALL numeric columns ─────────
# Named helpers instead of two anonymous lambdas — duplicate
# "<lambda>" labels are rejected by .agg in some pandas versions
def skew(x):
    return x.skew()

def null_pct(x):
    return x.isnull().mean()

numeric_cols = df.select_dtypes("number").columns
stats_df = df[numeric_cols].agg(["mean", "median", "std", skew, null_pct]).T
stats_df.columns = ["mean", "median", "std", "skewness", "null_pct"]
print(stats_df.sort_values("skewness", ascending=False).head(10))
```
💡 Mean vs Median tells you about skew. If mean > median, the distribution has a right tail (outliers pulling the mean up). For house prices: mean=$180k, median=$163k — a few luxury homes inflate the mean. The median is more representative of the "typical" house. Always report both for financial or demographic data.
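To make the skew signal concrete, here is a minimal sketch on synthetic data — the log-normal sample is an illustrative stand-in for price-like values, not the actual dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Log-normal sample: a common stand-in for right-skewed, price-like data
prices = pd.Series(rng.lognormal(mean=12, sigma=0.4, size=5_000))

print(f"mean={prices.mean():,.0f}  median={prices.median():,.0f}  skew={prices.skew():.2f}")
# mean > median and skew > 0 → right tail

# log1p pulls the tail in: mean and median converge, skew drops toward 0
logged = np.log1p(prices)
print(f"after log1p: mean={logged.mean():.3f}  median={logged.median():.3f}  skew={logged.skew():.2f}")
```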
Distributions — Shape Tells You Everything
Core Concept

```python
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import numpy as np

# ── Visualise distribution ────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram + KDE
sns.histplot(df["SalePrice"], bins=40, kde=True, ax=axes[0])
axes[0].set_title(f"SalePrice (skew={df['SalePrice'].skew():.2f})")

# QQ-plot: if points lie on the diagonal line → normal distribution
stats.probplot(df["SalePrice"].dropna(), plot=axes[1])
axes[1].set_title("QQ-Plot: Deviation from Normal")

# Log transform: right-skewed → more symmetric
log_price = np.log1p(df["SalePrice"])
sns.histplot(log_price, bins=40, kde=True, ax=axes[2])
axes[2].set_title(f"log(SalePrice) (skew={log_price.skew():.2f})")
plt.tight_layout()

# ── Formal normality test ─────────────────────────────
# H0: data is normally distributed. p > 0.05 = fail to reject
stat, p = stats.shapiro(df["SalePrice"].dropna().sample(500, random_state=42))
print(f"Shapiro-Wilk: stat={stat:.4f}, p={p:.6f}")
# Very small p → strongly non-normal (expected for house prices)

# ── Box-Cox optimal transformation ────────────────────
# lambda ≈ 0 → log, ≈ 0.5 → sqrt, ≈ 1 → no transform
pos = df["SalePrice"].dropna()
transformed, lam = stats.boxcox(pos)
print(f"Optimal lambda: {lam:.3f} → "
      f"{'log' if abs(lam) < 0.1 else 'sqrt' if abs(lam - 0.5) < 0.1 else f'x^{lam:.2f}'}")

# ── Find all highly skewed columns ────────────────────
skewed = df.select_dtypes("number").skew().sort_values(ascending=False)
high_skew = skewed[abs(skewed) > 1]
print(f"{len(high_skew)} columns with |skewness| > 1")
print(high_skew.head(10))
```
Distribution Types and Actions
| Type | Signature | Typical Examples / Action |
|---|---|---|
| Normal | Symmetric bell; mean = median; 68-95-99.7 rule | Heights, measurement errors |
| Right-Skewed | Long right tail; mean > median | Prices, salaries, counts |
| Left-Skewed | Long left tail; mean < median | Test scores bunched near the maximum |
| Bimodal | Two peaks = two subpopulations | Always investigate before modelling |
| Zero-Inflated | Many zeros plus a positive tail | Spend, transaction amounts, counts |
| Uniform | Flat; all values equally likely | IDs, random numbers; often useless as a feature |
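To connect the table to the statistics, a small sketch that generates each shape synthetically (the parameters are arbitrary, chosen only to produce the shapes):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000
shapes = {
    "normal":        rng.normal(0, 1, n),
    "right_skewed":  rng.lognormal(0, 0.8, n),
    "left_skewed":   -rng.lognormal(0, 0.8, n),
    "bimodal":       np.concatenate([rng.normal(-3, 1, n // 2), rng.normal(3, 1, n // 2)]),
    "zero_inflated": np.where(rng.random(n) < 0.6, 0.0, rng.exponential(100, n)),
    "uniform":       rng.uniform(0, 1, n),
}
for name, sample in shapes.items():
    s = pd.Series(sample)
    print(f"{name:>14}: skew={s.skew():+.2f}  mean={s.mean():+.2f}  median={s.median():+.2f}")
# Note: bimodal data can show skew ≈ 0 — only a histogram reveals the two peaks
```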
Correlation Analysis — Finding Feature Relationships
Feature Selection

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# ── Pearson correlation ───────────────────────────────
# Measures LINEAR relationship. Assumes both variables are continuous.
# Range: -1 (perfect negative) to +1 (perfect positive)
# Sensitive to outliers. Assumes roughly normal distributions.
corr = df.corr(numeric_only=True)

# Most correlated with target
target_corr = corr["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print("Top 10 positive correlations:")
print(target_corr.head(10))
print("\nTop 5 negative correlations:")
print(target_corr.tail(5))

# Heatmap of top correlated features
top_feats = target_corr.abs().nlargest(12).index.tolist() + ["SalePrice"]
plt.figure(figsize=(10, 8))
sns.heatmap(df[top_feats].corr(), annot=True, fmt=".2f",
            cmap="RdBu_r", center=0, vmin=-1, vmax=1, square=True)
plt.title("Top Features — Correlation Matrix")
plt.tight_layout()

# ── Pearson statistical significance ──────────────────
r, p = stats.pearsonr(df["GrLivArea"], df["SalePrice"])
print(f"GrLivArea vs SalePrice: r={r:.3f}, p={p:.4f}")
# p < 0.001 = highly significant

# ── Spearman correlation ──────────────────────────────
# Measures MONOTONIC (not just linear) relationship
# More robust to outliers and non-normal distributions
# Better for ordinal data (OverallQual: 1-10 ratings)
spearman = df.corr(method="spearman", numeric_only=True)
print("Spearman top 5:", spearman["SalePrice"].nlargest(6))  # first entry is SalePrice itself

# ── Multicollinearity detection ───────────────────────
# Feature pairs with |r| > 0.8 cause instability in linear models
def find_multicollinear_pairs(corr_matrix, threshold=0.8):
    pairs = []
    cols = corr_matrix.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = abs(corr_matrix.iloc[i, j])
            if r > threshold:
                pairs.append((cols[i], cols[j], round(r, 3)))
    return sorted(pairs, key=lambda x: -x[2])

pairs = find_multicollinear_pairs(corr)
for a, b, r in pairs:
    print(f"  {a} ↔ {b}: r={r}")
```
⚠️ Correlation ≠ causation. Ice cream sales and drowning deaths correlate (both peak in summer) but neither causes the other. Always ask "is there a plausible causal mechanism?" before treating correlation as meaningful. Also: Pearson measures linear relationships only — two variables can be strongly associated but r=0 if the relationship is non-linear (e.g. U-shaped).
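The non-linear caveat is easy to demonstrate — a minimal sketch with a U-shaped relationship, where both Pearson and Spearman report roughly zero even though y is almost perfectly predictable from x:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, 2_000)
y = x**2 + rng.normal(0, 0.5, 2_000)  # strong U-shaped dependence on x

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
print(f"Pearson r={r:+.3f}  Spearman rho={rho:+.3f}")  # both near 0
# Neither catches a non-monotonic relationship — a scatterplot
# (or mutual information) reveals it immediately
```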
Visualisation Toolkit — Right Plot for Every Question
Seaborn + Matplotlib

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")
sns.set_context("notebook")

# ── 1. Single distribution ────────────────────────────
sns.histplot(df["SalePrice"], bins=40, kde=True, color="steelblue")

# ── 2. Distribution by group ──────────────────────────
# Boxplot: median, IQR, whiskers (1.5×IQR), outlier dots
sns.boxplot(data=df, x="OverallQual", y="SalePrice")
# Violinplot: boxplot + KDE density (shows bimodal shapes)
sns.violinplot(data=df, x="OverallQual", y="SalePrice", inner="box")

# ── 3. Two continuous variables ───────────────────────
sns.scatterplot(data=df, x="GrLivArea", y="SalePrice",
                hue="OverallQual", palette="YlOrRd", alpha=0.6)
# Regression line + confidence interval
sns.regplot(data=df, x="GrLivArea", y="SalePrice",
            scatter_kws={"alpha": 0.3}, line_kws={"color": "red"})

# ── 4. All pairwise relationships ─────────────────────
# Slow on >10 cols. Subset to key features first.
key = ["SalePrice", "GrLivArea", "OverallQual", "YearBuilt", "TotalBsmtSF"]
sns.pairplot(df[key], diag_kind="kde", plot_kws={"alpha": 0.4})

# ── 5. Categorical frequency ──────────────────────────
order = df["MSZoning"].value_counts().index
sns.countplot(data=df, x="MSZoning", order=order)

# ── 6. Categorical vs numeric ─────────────────────────
order = df.groupby("Neighborhood")["SalePrice"].median().sort_values(ascending=False).index
fig, ax = plt.subplots(figsize=(14, 5))
sns.barplot(data=df, x="Neighborhood", y="SalePrice", order=order, ax=ax)
ax.tick_params(axis="x", rotation=45)

# ── 7. Multi-panel summary ────────────────────────────
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
sns.histplot(df["SalePrice"], kde=True, ax=axes[0, 0]).set_title("Price Distribution")
sns.boxplot(data=df, x="OverallQual", y="SalePrice", ax=axes[0, 1]).set_title("Quality vs Price")
sns.scatterplot(data=df, x="GrLivArea", y="SalePrice", alpha=0.3, ax=axes[0, 2]).set_title("Area vs Price")
corr_top = df[key].corr()
sns.heatmap(corr_top, annot=True, fmt=".2f", cmap="RdBu_r", center=0, ax=axes[1, 0])
sns.countplot(data=df, x="MSZoning", ax=axes[1, 1]).set_title("Zoning Distribution")
sns.histplot(df["YearBuilt"], bins=40, ax=axes[1, 2]).set_title("Year Built")
plt.tight_layout()
plt.savefig("eda_summary.png", dpi=150, bbox_inches="tight")
```
Plot Selection Guide
| Plot | Best For |
|---|---|
| histplot / kdeplot | Single continuous distribution with optional smoothing |
| boxplot | Continuous vs categorical; shows median, IQR, outliers |
| violinplot | Boxplot + density; reveals bimodal distributions |
| scatterplot | Two continuous variables; add hue= for a 3rd dimension |
| pairplot | All pairwise relationships; comprehensive but slow |
| heatmap | Correlation matrix; best with annot=True for <15 features |
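The guide collapses into a small rule-of-thumb helper. This is a sketch using this module's own categories — not a library function, and the suggestions are just starting points:

```python
def suggest_plot(x_kind, y_kind=None):
    """Map variable kinds to a starting plot. Kinds: 'continuous' or 'categorical'."""
    if y_kind is None:  # one variable → distribution plot
        return "histplot/kdeplot" if x_kind == "continuous" else "countplot"
    if x_kind == y_kind == "continuous":
        return "scatterplot (regplot for trend, pairplot for many pairs)"
    if "categorical" in (x_kind, y_kind):
        return "boxplot or violinplot (barplot for aggregated medians)"
    return "heatmap (e.g. a correlation matrix)"

print(suggest_plot("continuous"))                 # histplot/kdeplot
print(suggest_plot("continuous", "categorical"))  # boxplot or violinplot ...
```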
Systematic EDA Workflow
Process

```python
import pandas as pd, numpy as np
import matplotlib.pyplot as plt, seaborn as sns
from scipy import stats

# ════════════════════════════════════════════════════
# STEP 1: DATA INVENTORY
# ════════════════════════════════════════════════════
print(f"Shape: {df.shape}")
print("Dtypes:", df.dtypes.value_counts().to_dict())
null_pct = df.isnull().mean().sort_values(ascending=False)
print("Columns with nulls:\n", null_pct[null_pct > 0])

numeric_cols = df.select_dtypes("number").columns.tolist()
categorical_cols = df.select_dtypes("object").columns.tolist()

# Missing value heatmap
plt.figure(figsize=(14, 5))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap="viridis")
plt.title("Missing Values (yellow = missing)")

# ════════════════════════════════════════════════════
# STEP 2: TARGET VARIABLE ANALYSIS
# ════════════════════════════════════════════════════
TARGET = "SalePrice"
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.histplot(df[TARGET], kde=True, ax=axes[0]).set_title(f"Raw (skew={df[TARGET].skew():.2f})")
stats.probplot(df[TARGET].dropna(), plot=axes[1])
axes[1].set_title("QQ Plot")
log_t = np.log1p(df[TARGET])
sns.histplot(log_t, kde=True, ax=axes[2]).set_title(f"log() (skew={log_t.skew():.2f})")
plt.tight_layout()

# ════════════════════════════════════════════════════
# STEP 3: NUMERIC FEATURES
# ════════════════════════════════════════════════════
# Distribution overview
df[numeric_cols].hist(figsize=(20, 16), bins=30)
plt.tight_layout()

# Correlation with target
tc = df[numeric_cols].corr()[TARGET].drop(TARGET).sort_values(ascending=False)
plt.figure(figsize=(10, 8))
tc.plot(kind="barh", color=["green" if x > 0 else "red" for x in tc])
plt.title("Feature Correlation with SalePrice")
plt.axvline(0, color="black", linewidth=1)

# Scatter plots for top 6 features
top6 = tc.abs().nlargest(6).index
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
for ax, feat in zip(axes.flat, top6):
    mask = df[feat].notna()  # align both series before computing r
    r, _ = stats.pearsonr(df.loc[mask, feat], df.loc[mask, TARGET])
    ax.scatter(df[feat], df[TARGET], alpha=0.3, s=10)
    ax.set(xlabel=feat, ylabel=TARGET, title=f"r={r:.3f}")
plt.tight_layout()

# ════════════════════════════════════════════════════
# STEP 4: CATEGORICAL FEATURES
# ════════════════════════════════════════════════════
for col in categorical_cols[:6]:
    n_cats = df[col].nunique()
    if n_cats > 15:
        continue  # skip high-cardinality
    fig, ax = plt.subplots(figsize=(10, 4))
    order = df.groupby(col)[TARGET].median().sort_values(ascending=False).index
    sns.boxplot(data=df, x=col, y=TARGET, order=order, ax=ax)
    ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
    ax.set_title(f"{col} vs {TARGET} (n_cats={n_cats})")
    plt.tight_layout()
```
Outlier Detection and Treatment
Data Quality

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.ensemble import IsolationForest

# ── Method 1: IQR (most robust, skew-tolerant) ────────
def iqr_outlier_mask(series):
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)

price_outliers = iqr_outlier_mask(df["SalePrice"])
area_outliers = iqr_outlier_mask(df["GrLivArea"])
print(f"Price IQR outliers: {price_outliers.sum()}")
print(f"Area IQR outliers: {area_outliers.sum()}")

# Visualise with scatter: outliers in red
flagged = price_outliers | area_outliers  # outlier on either axis
fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(df.loc[~flagged, "GrLivArea"], df.loc[~flagged, "SalePrice"],
           alpha=0.4, s=10, label="Normal")
ax.scatter(df.loc[flagged, "GrLivArea"], df.loc[flagged, "SalePrice"],
           color="red", s=30, label="Outlier")
ax.set(xlabel="GrLivArea", ylabel="SalePrice", title="IQR Outliers Flagged")
ax.legend()

# ── Method 2: Z-score (assumes normal distribution) ───
z = np.abs(stats.zscore(df[["SalePrice", "GrLivArea"]].dropna()))
z_outliers = (z > 3).any(axis=1)
print(f"Z-score outliers (|z|>3): {z_outliers.sum()}")

# ── Method 3: Isolation Forest (multivariate) ─────────
numeric = df.select_dtypes("number").dropna()
iso = IsolationForest(contamination=0.05, random_state=42)
labels = iso.fit_predict(numeric)
multi_outliers = labels == -1
print(f"Isolation Forest: {multi_outliers.sum()} outliers")

# ── Treatment options ─────────────────────────────────
# Option 1: Remove — only if clearly erroneous, not just extreme
df_clean = df[~(area_outliers & price_outliers)].copy()
print(f"After removal: {len(df_clean)} rows (was {len(df)})")

# Option 2: Winsorise (cap at percentile) — preserves all rows
lo, hi = df["SalePrice"].quantile([0.01, 0.99])
df["SalePrice_w"] = df["SalePrice"].clip(lo, hi)

# Option 3: Log transform — mathematically compresses tails
df["SalePrice_log"] = np.log1p(df["SalePrice"])
```
💡 Before removing outliers, always inspect them individually. In the House Prices dataset, there are two houses with GrLivArea > 4000 sqft but low SalePrice — these are partial sales, not data errors. Removing them changes model behaviour significantly. Check the Kaggle competition discussion before deleting rows.
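A sketch of that inspection step, reusing the iqr_outlier_mask helper from the block above (it assumes the SaleCondition column, which is present in the Kaggle House Prices data):

```python
# Look at the flagged rows before deciding anything
suspects = df[iqr_outlier_mask(df["GrLivArea"])]
cols = ["GrLivArea", "SalePrice", "OverallQual", "YearBuilt", "SaleCondition"]
print(suspects[cols].sort_values("GrLivArea", ascending=False).head(10))

# The huge-but-cheap houses tend to carry SaleCondition == "Partial" —
# contextual information, not a data-entry error
print(suspects["SaleCondition"].value_counts())
```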
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Course | Kaggle Data Visualisation Course (Free) — kaggle.com/learn/data-visualization | Best hands-on Seaborn & Matplotlib exercises. Interactive notebooks, immediate feedback. |
| Video | StatQuest — Statistics Fundamentals (YouTube) — youtube.com/c/joshstarmer | Best visual explanations of distributions, p-values, correlation, and statistical tests. No maths anxiety. |
| Docs | Seaborn Official Tutorial — seaborn.pydata.org/tutorial.html | Complete Seaborn reference with examples for every plot type. The authoritative source. |
| Docs | Matplotlib Gallery — matplotlib.org/stable/gallery | Hundreds of example plots with full copy-paste source code. Start from a working example. |
| Dataset | House Prices — Kaggle | Best EDA dataset: 79 features, interesting distributions, real-world messy data, Kaggle community. |
| Dataset | Titanic — Kaggle | Classic EDA dataset for categorical analysis and survival patterns. Well-documented community notebooks. |
| Course | Kaggle Feature Engineering Course — kaggle.com/learn/feature-engineering | Extends EDA into feature creation. Covers mutual information, encoding strategies. |
Module project: conduct a full EDA on the House Prices dataset and produce a polished visual report.
10 Required Visualisations
- Target distribution: histogram + QQ plot of SalePrice, log-transformed version, skewness comparison
- Missing values: heatmap with percentage annotations (see the starter sketch after the deliverable below)
- Correlation: heatmap of top 15 features, bar chart of r-values with target
- Scatter with colour: GrLivArea vs SalePrice coloured by OverallQual
- Categorical analysis: boxplots of SalePrice by Neighborhood (sorted by median), OverallQual
- Distributions: histograms of top 5 right-skewed features before and after log transform
- Outlier plot: GrLivArea vs SalePrice with outliers highlighted in red
Deliverable: Jupyter notebook with all plots + 5 written insights that would guide modelling decisions.
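For the missing-values requirement, one possible starter — a sketch that annotates a horizontal bar chart, which is often easier to label than the heatmap itself:

```python
import matplotlib.pyplot as plt

null_pct = df.isnull().mean().sort_values(ascending=False)
null_pct = null_pct[null_pct > 0] * 100  # keep only columns with nulls, as %

fig, ax = plt.subplots(figsize=(10, 6))
null_pct.plot(kind="barh", ax=ax, color="steelblue")
for i, v in enumerate(null_pct):  # annotate each bar with its percentage
    ax.text(v + 0.5, i, f"{v:.1f}%", va="center", fontsize=8)
ax.set(xlabel="% missing", title="Missing Values by Column")
plt.tight_layout()
```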
A second mini-project: EDA on the Titanic dataset, focused on categorical patterns and survival-rate analysis (a starter sketch follows the task list below).
- Survival rates by: Sex, Pclass, Embarked, Age group (10-year bins)
- Age distribution by survival status (overlapping histograms with alpha)
- Fare distribution — detect and annotate outliers
- Correlation heatmap for numeric features
- Conclusion: "Which 3 features would you include in a model and why?"
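A possible starting point for the survival-rate breakdowns (a sketch; it assumes the standard Kaggle Titanic columns Survived, Sex, Pclass, Embarked, and Age, and a hypothetical local file path):

```python
import pandas as pd

titanic = pd.read_csv("titanic.csv")  # path is an assumption — use your local copy

# Survival rate by category = mean of the 0/1 Survived column
for col in ["Sex", "Pclass", "Embarked"]:
    print(titanic.groupby(col)["Survived"].agg(["mean", "count"]).round(3), "\n")

# 10-year age bins
titanic["AgeGroup"] = pd.cut(titanic["Age"], bins=range(0, 91, 10))
print(titanic.groupby("AgeGroup", observed=True)["Survived"].mean().round(3))
```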
Additional practice exercises:
- Descriptive Statistics Deep Dive
- Correlation and Multicollinearity
- Outlier Detection Comparison
P2-M05 MASTERY CHECKLIST
- Can compute mean, median, mode, std, IQR, skewness, kurtosis for any pandas Series
- Know when mean > median signals right skew and what treatment to apply
- Can interpret distribution shape from histogram: normal, right-skewed, left-skewed, bimodal
- Can apply log1p transform and verify it reduces skewness
- Can run Shapiro-Wilk normality test and interpret the p-value
- Can compute Pearson correlation matrix and identify the features most correlated with target
- Can find multicollinear feature pairs (|r| > 0.8) and understand why they are problematic
- Know the difference between Pearson and Spearman and when to use each
- Can produce: histplot, boxplot, violinplot, scatterplot, regplot, pairplot, heatmap, barplot, countplot
- Can choose the right plot for a given question and data type combination
- Can save publication-quality plots with savefig(dpi=150, bbox_inches='tight')
- Can follow the 4-phase EDA workflow: inventory → target → numeric → categorical
- Can detect outliers with IQR method, Z-score, and Isolation Forest
- Can choose between removing, capping (Winsorising), and log-transforming outliers
- Completed project: House Prices EDA report with 10 visualisations and written conclusions
✅ When complete: Move to P2-M06 — ML Workflow: feature engineering, scaling, encoding, train/test split, and sklearn Pipelines.