Part 3 — Classical ML  ·  Module 10 of 28
Unsupervised Learning: K-Means, PCA & t-SNE
Find hidden structure — clustering, dimensionality reduction, and visual exploration
⏱ 2 Weeks 🟡 Intermediate 🔧 scikit-learn · umap-learn · plotly 📋 Prerequisite: P3-M09
🎯 What This Module Covers

Part 3 Finale

Unsupervised learning finds structure in data without labels. It powers customer segmentation, anomaly detection, data compression, and visualisation of high-dimensional datasets. These techniques are used both standalone and as preprocessing steps for supervised models.

  • K-Means clustering — centroid-based, elbow method, silhouette score, cluster profiling
  • DBSCAN — density-based clustering, handles non-spherical clusters, detects noise
  • PCA (Principal Component Analysis) — dimensionality reduction, explained variance, noise removal
  • t-SNE — non-linear dimensionality reduction for visualisation
  • UMAP — faster than t-SNE, better preserves global structure, good for production
  • Customer segmentation pipeline — full RFM analysis and business interpretation

💡 Unsupervised learning results are only as good as your interpretation. K-Means will always find K clusters — whether or not K clusters truly exist in the data. The hard work is validating that the clusters are meaningful, stable, and actionable for the business.
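
A quick experiment makes the point: K-Means returns exactly K clusters even on completely structureless data. A sketch on synthetic uniform points (all names here, such as X_uniform, are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Structureless data: 500 points drawn uniformly from the unit square
rng = np.random.default_rng(42)
X_uniform = rng.uniform(0, 1, size=(500, 2))

km = KMeans(n_clusters=5, n_init=10, random_state=42)
uniform_labels = km.fit_predict(X_uniform)

n_found = len(set(uniform_labels))   # always 5: K is imposed, not discovered
sil = silhouette_score(X_uniform, uniform_labels)
print(f"Clusters found: {n_found}, silhouette: {sil:.2f}")
```

The mediocre silhouette score is the tell: the algorithm partitioned the square, but nothing in the data supports those partitions.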

🎯 K-Means Clustering — From Basics to Production

Foundation
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import pandas as pd, numpy as np

# ── K-Means algorithm ─────────────────────────────────
# 1. Randomly initialise K centroids
# 2. Assign each point to the nearest centroid
# 3. Recompute centroids as mean of assigned points
# 4. Repeat 2-3 until convergence (centroids don't move)
# CRITICAL: K-Means requires feature scaling!

df_seg = pd.read_csv("mall_customers.csv")
features = ["Annual Income (k$)", "Spending Score (1-100)", "Age"]
X = df_seg[features]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ── Finding optimal K ─────────────────────────────────
# Method 1: Elbow method — plot inertia (WCSS) vs K
inertias = []
sil_scores = []
K_range = range(2, 11)

for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)                    # Within-Cluster Sum of Squares
    sil_scores.append(silhouette_score(X_scaled, km.labels_))  # silhouette

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(K_range, inertias, marker="o")
axes[0].set(xlabel="K", ylabel="Inertia (WCSS)", title="Elbow Method")
axes[0].axvline(5, color="red", linestyle="--", label="Chosen K")
axes[1].plot(K_range, sil_scores, marker="o", color="green")
axes[1].set(xlabel="K", ylabel="Silhouette Score", title="Silhouette (higher=better)")
# Silhouette: -1 to 1; near 1 = well separated; near 0 = overlapping

# ── Final model ───────────────────────────────────────
best_k = 5
km_final = KMeans(n_clusters=best_k, n_init=20, random_state=42)
df_seg["Cluster"] = km_final.fit_predict(X_scaled)

# ── Cluster profiling ─────────────────────────────────
cluster_profile = df_seg.groupby("Cluster")[features].mean().round(1)
cluster_sizes   = df_seg["Cluster"].value_counts().sort_index()
cluster_profile["Size"] = cluster_sizes
print(cluster_profile)

# Visualise clusters in 2D (Income vs Spending)
plt.figure(figsize=(8, 6))
for cluster_id in range(best_k):
    mask = df_seg["Cluster"] == cluster_id
    plt.scatter(df_seg.loc[mask, "Annual Income (k$)"],
                df_seg.loc[mask, "Spending Score (1-100)"],
                label=f"Cluster {cluster_id}", s=60, alpha=0.7)
# Plot centroids
centroids = scaler.inverse_transform(km_final.cluster_centers_)
plt.scatter(centroids[:, 0], centroids[:, 1],
            c="black", s=200, marker="X", label="Centroids", zorder=5)
plt.legend()
plt.title(f"K-Means Clusters (K={best_k})")

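The four numbered steps in the comments at the top of this listing can be sketched from scratch in NumPy. This is an illustrative toy implementation, not a replacement for sklearn's (which adds k-means++ initialisation, multiple restarts, and tolerance-based stopping):

```python
import numpy as np

def kmeans_numpy(X, k, n_iter=100, seed=42):
    """Toy K-Means illustrating the 4-step loop (not a sklearn replacement)."""
    rng = np.random.default_rng(seed)
    # 1. Randomly initialise K centroids (here: K distinct data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # 4. Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
toy_labels, toy_centroids = kmeans_numpy(X_toy, k=2)
```
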
⚠️ K-Means assumptions: clusters are spherical (equal shape), have similar sizes, and similar densities. Real-world clusters are often none of these. If your scatter plot shows elongated or irregular clusters, use DBSCAN or Gaussian Mixture Models instead. Always visualise before trusting K-Means results.

🏔 DBSCAN — Density-Based Clustering

Arbitrary Shapes
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

# ── DBSCAN concept ────────────────────────────────────
# A point is a "core point" if it has at least min_samples
# neighbours within radius eps
# Clusters grow by connecting core points within eps
# Points that can't be reached: noise (label = -1)
# Advantages: finds arbitrary-shaped clusters, detects noise
# Parameters: eps (neighbourhood radius), min_samples (density threshold)

X_scaled = StandardScaler().fit_transform(X)

# ── Find optimal eps: k-distance plot ────────────────
# For each point, find its 4th nearest neighbour distance
# eps = knee of the sorted distance plot
k = 4  # min_samples - 1
# n_neighbors = k + 1 because each point's nearest "neighbour" is itself
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X_scaled)
distances, _ = nbrs.kneighbors(X_scaled)
k_distances = np.sort(distances[:, k])[::-1]

plt.figure(figsize=(8, 4))
plt.plot(k_distances)
plt.ylabel("4th Nearest Neighbour Distance")
plt.xlabel("Points sorted by distance")
plt.title("K-Distance Plot: Knee = optimal eps")
# Knee of the curve ≈ optimal eps value

# ── Fit DBSCAN ────────────────────────────────────────
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise    = list(labels).count(-1)
print(f"Clusters: {n_clusters}")
print(f"Noise points: {n_noise} ({n_noise/len(labels):.1%})")

# Visualise
plt.figure(figsize=(8, 6))
unique_labels = set(labels)
colours = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for label_id, col in zip(unique_labels, colours):
    if label_id == -1: col = "black"  # noise = black
    mask = labels == label_id
    plt.scatter(X_scaled[mask, 0], X_scaled[mask, 1],
                c=[col], s=15, alpha=0.7,
                label="Noise" if label_id == -1 else f"Cluster {label_id}")
plt.legend(fontsize=8)
plt.title(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points")
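
To see the density-based advantage concretely, here is a sketch comparing both algorithms on sklearn's two-moons toy data, scored against the true moon labels with adjusted_rand_score (the eps value is hand-tuned for this dataset):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: non-spherical by construction
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)

# K-Means cuts each moon in half; DBSCAN follows the curved shapes
print(f"K-Means ARI: {adjusted_rand_score(y_moons, km_labels):.2f}")
print(f"DBSCAN  ARI: {adjusted_rand_score(y_moons, db_labels):.2f}")
```
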

📉 PCA — Dimensionality Reduction

Linear DR
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd  # used below for the loadings table

# ── PCA concept ───────────────────────────────────────
# Finds directions (principal components) of maximum variance
# PC1: direction of highest variance in the data
# PC2: direction of second highest variance, orthogonal to PC1
# Each PC is a linear combination of original features
# Use cases:
#   1. Visualisation: reduce to 2D for scatter plots
#   2. Noise removal: drop low-variance components
#   3. Feature compression: fewer features before SVM or kNN
#   4. Multicollinearity removal for linear models

X = df.select_dtypes("number").dropna()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ── Fit PCA ───────────────────────────────────────────
pca = PCA()  # keep all components to see explained variance
pca.fit(X_scaled)

# ── Explained variance plot ──────────────────────────
explained_var = pca.explained_variance_ratio_
cumulative_var = np.cumsum(explained_var)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].bar(range(1, len(explained_var)+1), explained_var)
axes[0].set(xlabel="Principal Component", ylabel="Explained Variance Ratio",
             title="Variance per Component")
axes[1].plot(range(1, len(cumulative_var)+1), cumulative_var, marker="o")
axes[1].axhline(0.95, color="red", linestyle="--", label="95% threshold")
axes[1].set(xlabel="N Components", ylabel="Cumulative Explained Variance",
             title="How many components for 95% variance?")
axes[1].legend()

# How many components to retain 95% variance?
n_components_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"Components for 95% variance: {n_components_95} / {X_scaled.shape[1]}")

# ── Apply PCA reduction ───────────────────────────────
pca_2 = PCA(n_components=2)
X_2d = pca_2.fit_transform(X_scaled)

pca_95 = PCA(n_components=0.95)  # keep 95% variance automatically
X_95 = pca_95.fit_transform(X_scaled)
print(f"Reduced from {X_scaled.shape[1]} to {X_95.shape[1]} features (95% variance)")

# ── Visualise in 2D ───────────────────────────────────
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1],
                      c=df.loc[X.index, "SalePrice"],  # align with rows kept after dropna
                      cmap="viridis", s=10, alpha=0.5)
plt.colorbar(scatter, label="SalePrice")
plt.xlabel(f"PC1 ({explained_var[0]:.1%} variance)")
plt.ylabel(f"PC2 ({explained_var[1]:.1%} variance)")
plt.title("House Prices: 2D PCA Projection")

# ── Feature loadings ──────────────────────────────────
loadings = pd.DataFrame(pca.components_.T,
                         index=X.columns,
                         columns=[f"PC{i+1}" for i in range(len(X.columns))])
print("Top features for PC1 (high |loading| = strong contributor):")
print(loadings["PC1"].abs().sort_values(ascending=False).head(5))

💡 PCA is linear — it only captures linear relationships. If your data has non-linear structure (spiral, ring, Swiss roll), PCA will distort it. In that case, use t-SNE or UMAP for visualisation, and kernel PCA for preprocessing.
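
Kernel PCA, mentioned above, performs PCA in an implicit non-linear feature space. A minimal sketch on concentric circles, which linear PCA can only rotate (gamma=10 is a hand-picked illustration value, not a recommendation):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Concentric rings: zero mean in every linear direction,
# so no linear projection can separate them
X_circ, y_circ = make_circles(n_samples=400, factor=0.3, noise=0.05,
                              random_state=42)

X_lin = PCA(n_components=2).fit_transform(X_circ)            # still rings
X_kpca = KernelPCA(n_components=2, kernel="rbf",
                   gamma=10).fit_transform(X_circ)           # rings unfold

# After kernel PCA the two rings differ clearly on the first component
gap = abs(X_kpca[y_circ == 1, 0].mean() - X_kpca[y_circ == 0, 0].mean())
print(f"Ring separation on kernel PC1: {gap:.2f}")
```
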

🌀 t-SNE and UMAP — Non-Linear Visualisation

Visualisation
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA  # used for the 50-dim pre-reduction below
import umap  # pip install umap-learn
import matplotlib.pyplot as plt
import numpy as np

# ── t-SNE ─────────────────────────────────────────────
# Converts high-dimensional distances to probabilities
# Places points in 2D to match those probability distributions
# PRESERVES LOCAL structure (nearby points stay nearby)
# Does NOT preserve global distances (far points are unreliable)
# Main parameter: perplexity (5-50; ~= number of neighbours considered)
# WARNING: every run looks different! Set random_state for reproducibility
# WARNING: slow for N > 10,000. Apply PCA first.

# Best practice: PCA to 50 dims first (faster, denoises)
pca50 = PCA(n_components=50)
X_pca50 = pca50.fit_transform(X_scaled)

tsne = TSNE(
    n_components=2,
    perplexity=30,          # 5-50; higher = more global
    learning_rate="auto",
    max_iter=1000,          # named n_iter in older scikit-learn releases
    init="pca",             # better initialisation than random
    random_state=42,
)
X_tsne = tsne.fit_transform(X_pca50)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1],
                      c=y_clusters, cmap="tab10", s=8, alpha=0.7)
plt.title("t-SNE Visualisation (perplexity=30)")
plt.colorbar(scatter, label="Cluster")

# ── UMAP (faster + preserves global structure better) ─
# Pros over t-SNE:
#   - Much faster (10-100× on large datasets)
#   - Better preservation of global structure than t-SNE
#   - Reproducible with random_state (at the cost of parallelism)
#   - Can transform new data (.transform method, unlike t-SNE)
# Main parameters:
#   n_neighbors: local neighbourhood size (5-50)
#   min_dist: how tightly to pack clusters (0.0-1.0)

reducer = umap.UMAP(
    n_components=2,
    n_neighbors=15,     # 5=local, 50=global structure
    min_dist=0.1,       # 0.0=tighter clusters, 1.0=more spread
    random_state=42,    # note: a fixed seed forces single-threaded UMAP
    n_jobs=-1,          # ignored (set to 1) when random_state is given
)
X_umap = reducer.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1],
                      c=y_clusters, cmap="tab10", s=8, alpha=0.7)
plt.title("UMAP Visualisation")

# UMAP can transform new data (t-SNE cannot)
new_data_2d = reducer.transform(X_test_scaled)  # project test set

⚠️ Treat t-SNE and UMAP plots as visual aids. The 2D coordinates have no absolute meaning — distances between cluster groups are not interpretable. Cluster A appearing "close" to cluster B in t-SNE does not mean they are similar. Use these plots to verify visually that clusters exist, not to measure cluster similarity.

📊 Cluster Evaluation Metrics

Evaluation
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score, adjusted_rand_score)
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# ── Internal metrics (no labels needed) ──────────────

# 1. Silhouette Score: -1 to +1
# a(i) = mean distance to own cluster members
# b(i) = mean distance to nearest OTHER cluster members
# silhouette(i) = (b(i) - a(i)) / max(a(i), b(i))
# Near +1: point is well-separated from other clusters
# Near  0: point is on the boundary between clusters
# Near -1: point may be in the wrong cluster
sil = silhouette_score(X_scaled, labels)
print(f"Silhouette Score: {sil:.4f}  (higher is better, max 1.0)")

# 2. Davies-Bouldin Index: lower is better
# Average ratio of intra-cluster to inter-cluster distance
db = davies_bouldin_score(X_scaled, labels)
print(f"Davies-Bouldin:   {db:.4f}  (lower is better)")

# 3. Calinski-Harabasz (Variance Ratio): higher is better
ch = calinski_harabasz_score(X_scaled, labels)
print(f"Calinski-Harabasz: {ch:.1f}  (higher is better)")

# ── External metrics (need true labels) ───────────────
# Use when you KNOW the true clusters (e.g. from a ground truth)
# Adjusted Rand Index: -0.5 to 1.0 (1.0 = perfect agreement)
ari = adjusted_rand_score(y_true, labels)
print(f"Adjusted Rand Index: {ari:.4f}")

# ── Plot silhouette scores per cluster ────────────────
from sklearn.metrics import silhouette_samples
sample_sil = silhouette_samples(X_scaled, labels)
fig, ax = plt.subplots(figsize=(8, 5))
y_lower = 10
for k in range(n_clusters):
    cluster_sil = np.sort(sample_sil[labels == k])
    y_upper = y_lower + len(cluster_sil)
    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_sil)
    y_lower = y_upper + 5
ax.axvline(sil, color="red", linestyle="--", label=f"Mean = {sil:.3f}")
ax.set(xlabel="Silhouette Coefficient", title="Silhouette Plot per Cluster")
ax.legend()
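
The silhouette formula in the comments above can be sanity-checked by hand against sklearn on a tiny example (the helper below covers the two-cluster case only; with more clusters, b is the minimum over the other clusters):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tiny clusters on a line: {0, 1} and {9, 10}
X_line = np.array([[0.0], [1.0], [9.0], [10.0]])
line_labels = np.array([0, 0, 1, 1])

def silhouette_manual(X, labels):
    """Two-cluster case: b(i) = mean distance to the single other cluster."""
    vals = []
    for i in range(len(X)):
        same  = [j for j in range(len(X)) if labels[j] == labels[i] and j != i]
        other = [j for j in range(len(X)) if labels[j] != labels[i]]
        a = np.mean([np.linalg.norm(X[i] - X[j]) for j in same])
        b = np.mean([np.linalg.norm(X[i] - X[j]) for j in other])
        vals.append((b - a) / max(a, b))
    return float(np.mean(vals))

print(silhouette_manual(X_line, line_labels))   # ≈ 0.889
print(silhouette_score(X_line, line_labels))    # identical
```
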

🏢 RFM Customer Segmentation Pipeline

Real-World Application
import pandas as pd, numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# ── RFM Analysis — standard segmentation framework ───
# R (Recency):   How recently did the customer purchase?
# F (Frequency): How often do they purchase?
# M (Monetary):  How much do they spend?

df_orders = pd.read_csv("online_retail.csv")
df_orders["InvoiceDate"] = pd.to_datetime(df_orders["InvoiceDate"])
df_orders["Revenue"] = df_orders["Quantity"] * df_orders["UnitPrice"]

# Reference date: day after last transaction
reference = df_orders["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = df_orders.groupby("CustomerID").agg(
    Recency   = ("InvoiceDate", lambda x: (reference - x.max()).days),
    Frequency = ("InvoiceNo", "nunique"),
    Monetary  = ("Revenue", "sum")
).reset_index()

# Keep only genuine purchases (drop returns/credits with non-positive revenue)
rfm = rfm[(rfm["Monetary"] > 0) & (rfm["Frequency"] > 0)]

# ── Scale and cluster ─────────────────────────────────
scaler = StandardScaler()
X_rfm = scaler.fit_transform(rfm[["Recency", "Frequency", "Monetary"]])

# Find K using elbow + silhouette
silhouettes = []
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_rfm)
    silhouettes.append(silhouette_score(X_rfm, labels))

best_k = silhouettes.index(max(silhouettes)) + 2
print(f"Best K: {best_k}")

km = KMeans(n_clusters=best_k, n_init=20, random_state=42)
rfm["Cluster"] = km.fit_predict(X_rfm)

# ── Profile clusters ──────────────────────────────────
profile = rfm.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean().round(1)
profile["Size"] = rfm["Cluster"].value_counts().sort_index()
print(profile)

# Interpret:
# Cluster A: Low Recency, High Freq, High Monetary → Champions
# Cluster B: High Recency, Low Freq, Low Monetary  → At Risk
# Cluster C: Medium Recency, Medium Freq            → Potential Loyalists

# ── PCA visualisation of RFM clusters ────────────────
pca2 = PCA(n_components=2)
X_2d = pca2.fit_transform(X_rfm)
plt.figure(figsize=(8, 6))
for c in rfm["Cluster"].unique():
    mask = rfm["Cluster"] == c
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=f"Cluster {c}", s=10, alpha=0.5)
plt.legend()
plt.title(f"RFM Clusters (K={best_k}, PCA 2D)")
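
A common lightweight complement to clustering is quartile-based RFM scoring with pd.qcut. A sketch on a synthetic stand-in frame (rfm_demo, its data, the score bins, and the segment names are all illustrative, though the column names match the rfm frame above):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the rfm frame built above
rng = np.random.default_rng(42)
rfm_demo = pd.DataFrame({
    "Recency":   rng.integers(1, 365, 200),
    "Frequency": rng.integers(1, 50, 200),
    "Monetary":  rng.uniform(10, 5000, 200),
})

# Score each dimension 1-4 by quartile; rank() sidesteps duplicate bin edges
# Recency is reversed: more recent (lower) = better score
rfm_demo["R"] = pd.qcut(rfm_demo["Recency"].rank(method="first"), 4,
                        labels=[4, 3, 2, 1]).astype(int)
rfm_demo["F"] = pd.qcut(rfm_demo["Frequency"].rank(method="first"), 4,
                        labels=[1, 2, 3, 4]).astype(int)
rfm_demo["M"] = pd.qcut(rfm_demo["Monetary"], 4, labels=[1, 2, 3, 4]).astype(int)
rfm_demo["RFM_Score"] = rfm_demo["R"] + rfm_demo["F"] + rfm_demo["M"]

# Name tiers from the combined score (range 3-12; the bins are illustrative)
rfm_demo["Segment"] = pd.cut(rfm_demo["RFM_Score"], bins=[2, 5, 8, 12],
                             labels=["At Risk", "Potential Loyalist", "Champion"])
print(rfm_demo["Segment"].value_counts())
```

Unlike K-Means, quartile scoring is deterministic, trivially explainable to stakeholders, and stable as new customers arrive.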

FREE LEARNING RESOURCES

  • Video: StatQuest — K-Means, PCA, t-SNE (YouTube). Best visual explanations of how each algorithm works. Highly recommended for building intuition.
  • Docs: Scikit-learn Clustering Guide (scikit-learn.org/stable/modules/clustering.html). All sklearn clustering algorithms with comparison table, parameters, and use cases.
  • Docs: UMAP Documentation (umap-learn.readthedocs.io). UMAP parameters, comparison to t-SNE, and production use cases.
  • Dataset: Mall Customers (Kaggle). Classic customer segmentation dataset. Small, visual, perfect for K-Means exploration.
  • Dataset: Olist Brazilian E-Commerce (Kaggle). Real e-commerce data for full RFM analysis. Multiple tables to join and explore.
🛠 Project: Customer Segmentation + Churn Prediction — Advanced · 6–7 days

A two-part project combining unsupervised and supervised ML.

Part A — Customer Segmentation (3–4 days)

  • Load Telco Churn or Mall Customers dataset
  • Compute RFM features (or use existing features)
  • Use elbow + silhouette to find optimal K
  • Profile each cluster with mean feature values
  • Visualise with PCA + t-SNE side by side
  • Name each segment: "High-Value Loyalists", "At-Risk Churners", etc.

Part B — Churn Prediction (2–3 days)

  • Add cluster labels as a feature to the churn prediction dataset
  • Train XGBoost with and without cluster feature
  • Does the cluster feature improve ROC-AUC?
  • Which cluster has the highest churn rate?

Deliverable: Jupyter notebook with all plots + 1-paragraph business recommendation per segment.

🎉 Part 3 Complete!

You've completed all six Classical ML modules (P2-M05 through P3-M10). You can now run EDA on any dataset, build regression and classification models, use gradient boosting, tune hyperparameters, and find hidden structure. Next: Part 4 — LLM APIs, where the AI Engineering path begins.

LAB 1

Finding the Right K

1. On Mall Customers, plot the elbow curve (inertia) and silhouette score for K=2 to 10. Do both methods agree on the best K? What happens when K is too large?
2. Fit K-Means with the optimal K. Profile each cluster: what is the mean Income, Spending Score, and Age? Give each cluster a descriptive name (e.g., "Budget-Conscious Young Adults").
3. Run K-Means 5 times with different random_state values (same K). Do the cluster assignments change? Does the silhouette score change? What does this tell you about K-Means stability?

LAB 2

PCA for Preprocessing

1. Apply PCA to the House Prices numeric features. How many components are needed to explain 95% variance? How many for 99%?
2. Train Ridge regression on: (a) all original features, (b) PCA components explaining 95% variance, (c) PCA components explaining 50% variance. Compare CV RMSE. Does PCA improve or hurt performance?
3. Plot the loadings for PC1 and PC2. Which original features contribute most to each component? Does this match your intuition from the EDA (M05)?

P3-M10 MASTERY CHECKLIST

Part 3 Complete! You have all the Classical ML foundations. Move to Part 4 — LLM APIs to start the AI Engineering path: prompting, structured outputs, streaming, and reliability.
