Part 3 — Classical ML  ·  Module 10 of 28
Unsupervised Learning: K-Means, PCA & t-SNE
Find hidden structure — clustering, dimensionality reduction, and visual exploration
⏱ 2 Weeks 🟡 Intermediate 🔧 scikit-learn · umap-learn · plotly 📋 Prerequisite: P3-M09
🎯 What This Module Covers

Part 3 Finale

Unsupervised learning finds structure in data without labels. It powers customer segmentation, anomaly detection, data compression, and visualisation of high-dimensional datasets. These techniques are used both standalone and as preprocessing steps for supervised models.

  • K-Means clustering — centroid-based, elbow method, silhouette score, cluster profiling
  • DBSCAN — density-based clustering, handles non-spherical clusters, detects noise
  • PCA (Principal Component Analysis) — dimensionality reduction, explained variance, noise removal
  • t-SNE — non-linear dimensionality reduction for visualisation
  • UMAP — faster than t-SNE, better preserves global structure, good for production
  • Customer segmentation pipeline — full RFM analysis and business interpretation

💡 Unsupervised learning results are only as good as your interpretation. K-Means will always find K clusters — whether or not K clusters truly exist in the data. The hard work is validating that the clusters are meaningful, stable, and actionable for the business.
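
A quick experiment makes the point: K-Means returns exactly K clusters even on completely structureless data. A sketch on synthetic uniform points (all names here, such as X_uniform, are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Structureless data: 500 points drawn uniformly from the unit square
rng = np.random.default_rng(42)
X_uniform = rng.uniform(0, 1, size=(500, 2))

km = KMeans(n_clusters=5, n_init=10, random_state=42)
uniform_labels = km.fit_predict(X_uniform)

n_found = len(set(uniform_labels))   # always 5: K is imposed, not discovered
sil = silhouette_score(X_uniform, uniform_labels)
print(f"Clusters found: {n_found}, silhouette: {sil:.2f}")
```

The mediocre silhouette score is the tell: the algorithm partitioned the square, but nothing in the data supports those partitions.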

🎯 K-Means Clustering — From Basics to Production

Foundation
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import pandas as pd, numpy as np

# ── K-Means algorithm ─────────────────────────────────
# 1. Randomly initialise K centroids
# 2. Assign each point to the nearest centroid
# 3. Recompute centroids as mean of assigned points
# 4. Repeat 2-3 until convergence (centroids don't move)
# CRITICAL: K-Means requires feature scaling!

df_seg = pd.read_csv("mall_customers.csv")
features = ["Annual Income (k$)", "Spending Score (1-100)", "Age"]
X = df_seg[features]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ── Finding optimal K ─────────────────────────────────
# Method 1: Elbow method — plot inertia (WCSS) vs K
inertias = []
sil_scores = []
K_range = range(2, 11)

for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)                    # Within-Cluster Sum of Squares
    sil_scores.append(silhouette_score(X_scaled, km.labels_))  # silhouette

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(K_range, inertias, marker="o")
axes[0].set(xlabel="K", ylabel="Inertia (WCSS)", title="Elbow Method")
axes[0].axvline(5, color="red", linestyle="--", label="Chosen K")
axes[1].plot(K_range, sil_scores, marker="o", color="green")
axes[1].set(xlabel="K", ylabel="Silhouette Score", title="Silhouette (higher=better)")
# Silhouette: -1 to 1; near 1 = well separated; near 0 = overlapping

# ── Final model ───────────────────────────────────────
best_k = 5
km_final = KMeans(n_clusters=best_k, n_init=20, random_state=42)
df_seg["Cluster"] = km_final.fit_predict(X_scaled)

# ── Cluster profiling ─────────────────────────────────
cluster_profile = df_seg.groupby("Cluster")[features].mean().round(1)
cluster_sizes   = df_seg["Cluster"].value_counts().sort_index()
cluster_profile["Size"] = cluster_sizes
print(cluster_profile)

# Visualise clusters in 2D (Income vs Spending)
plt.figure(figsize=(8, 6))
for cluster_id in range(best_k):
    mask = df_seg["Cluster"] == cluster_id
    plt.scatter(df_seg.loc[mask, "Annual Income (k$)"],
                df_seg.loc[mask, "Spending Score (1-100)"],
                label=f"Cluster {cluster_id}", s=60, alpha=0.7)
# Plot centroids
centroids = scaler.inverse_transform(km_final.cluster_centers_)
plt.scatter(centroids[:, 0], centroids[:, 1],
            c="black", s=200, marker="X", label="Centroids", zorder=5)
plt.legend()
plt.title(f"K-Means Clusters (K={best_k})")

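The four numbered steps in the comments at the top of this listing can be sketched from scratch in NumPy. This is an illustrative toy implementation, not a replacement for sklearn's (which adds k-means++ initialisation, multiple restarts, and tolerance-based stopping):

```python
import numpy as np

def kmeans_numpy(X, k, n_iter=100, seed=42):
    """Toy K-Means illustrating the 4-step loop (not a sklearn replacement)."""
    rng = np.random.default_rng(seed)
    # 1. Randomly initialise K centroids (here: K distinct data points)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # 4. Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy blobs
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
toy_labels, toy_centroids = kmeans_numpy(X_toy, k=2)
```
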
⚠️ K-Means assumptions: clusters are spherical (equal shape), have similar sizes, and similar densities. Real-world clusters are often none of these. If your scatter plot shows elongated or irregular clusters, use DBSCAN or Gaussian Mixture Models instead. Always visualise before trusting K-Means results.

🏔 DBSCAN — Density-Based Clustering

Arbitrary Shapes
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np

# ── DBSCAN concept ────────────────────────────────────
# A point is a "core point" if it has at least min_samples
# neighbours within radius eps
# Clusters grow by connecting core points within eps
# Points that can't be reached: noise (label = -1)
# Advantages: finds arbitrary-shaped clusters, detects noise
# Parameters: eps (neighbourhood radius), min_samples (density threshold)

X_scaled = StandardScaler().fit_transform(X)

# ── Find optimal eps: k-distance plot ────────────────
# For each point, find its 4th nearest neighbour distance
# eps = knee of the sorted distance plot
k = 4  # min_samples - 1
# n_neighbors = k + 1 because each point's nearest "neighbour" is itself
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X_scaled)
distances, _ = nbrs.kneighbors(X_scaled)
k_distances = np.sort(distances[:, k])[::-1]

plt.figure(figsize=(8, 4))
plt.plot(k_distances)
plt.ylabel("4th Nearest Neighbour Distance")
plt.xlabel("Points sorted by distance")
plt.title("K-Distance Plot: Knee = optimal eps")
# Knee of the curve ≈ optimal eps value

# ── Fit DBSCAN ────────────────────────────────────────
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise    = list(labels).count(-1)
print(f"Clusters: {n_clusters}")
print(f"Noise points: {n_noise} ({n_noise/len(labels):.1%})")

# Visualise
plt.figure(figsize=(8, 6))
unique_labels = set(labels)
colours = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
for label_id, col in zip(unique_labels, colours):
    if label_id == -1: col = "black"  # noise = black
    mask = labels == label_id
    plt.scatter(X_scaled[mask, 0], X_scaled[mask, 1],
                c=[col], s=15, alpha=0.7,
                label="Noise" if label_id == -1 else f"Cluster {label_id}")
plt.legend(fontsize=8)
plt.title(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points")
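
To see the density-based advantage concretely, here is a sketch comparing both algorithms on sklearn's two-moons toy data, scored against the true moon labels with adjusted_rand_score (the eps value is hand-tuned for this dataset):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: non-spherical by construction
X_moons, y_moons = make_moons(n_samples=400, noise=0.05, random_state=42)
X_moons = StandardScaler().fit_transform(X_moons)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_moons)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_moons)

# K-Means cuts each moon in half; DBSCAN follows the curved shapes
print(f"K-Means ARI: {adjusted_rand_score(y_moons, km_labels):.2f}")
print(f"DBSCAN  ARI: {adjusted_rand_score(y_moons, db_labels):.2f}")
```
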

📉 PCA — Dimensionality Reduction

Linear DR
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd  # used below for the loadings table

# ── PCA concept ───────────────────────────────────────
# Finds directions (principal components) of maximum variance
# PC1: direction of highest variance in the data
# PC2: direction of second highest variance, orthogonal to PC1
# Each PC is a linear combination of original features
# Use cases:
#   1. Visualisation: reduce to 2D for scatter plots
#   2. Noise removal: drop low-variance components
#   3. Feature compression: fewer features before SVM or kNN
#   4. Multicollinearity removal for linear models

X = df.select_dtypes("number").dropna()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ── Fit PCA ───────────────────────────────────────────
pca = PCA()  # keep all components to see explained variance
pca.fit(X_scaled)

# ── Explained variance plot ──────────────────────────
explained_var = pca.explained_variance_ratio_
cumulative_var = np.cumsum(explained_var)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].bar(range(1, len(explained_var)+1), explained_var)
axes[0].set(xlabel="Principal Component", ylabel="Explained Variance Ratio",
             title="Variance per Component")
axes[1].plot(range(1, len(cumulative_var)+1), cumulative_var, marker="o")
axes[1].axhline(0.95, color="red", linestyle="--", label="95% threshold")
axes[1].set(xlabel="N Components", ylabel="Cumulative Explained Variance",
             title="How many components for 95% variance?")
axes[1].legend()

# How many components to retain 95% variance?
n_components_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"Components for 95% variance: {n_components_95} / {X_scaled.shape[1]}")

# ── Apply PCA reduction ───────────────────────────────
pca_2 = PCA(n_components=2)
X_2d = pca_2.fit_transform(X_scaled)

pca_95 = PCA(n_components=0.95)  # keep 95% variance automatically
X_95 = pca_95.fit_transform(X_scaled)
print(f"Reduced from {X_scaled.shape[1]} to {X_95.shape[1]} features (95% variance)")

# ── Visualise in 2D ───────────────────────────────────
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1],
                      c=df.loc[X.index, "SalePrice"],  # align with rows kept after dropna
                      cmap="viridis", s=10, alpha=0.5)
plt.colorbar(scatter, label="SalePrice")
plt.xlabel(f"PC1 ({explained_var[0]:.1%} variance)")
plt.ylabel(f"PC2 ({explained_var[1]:.1%} variance)")
plt.title("House Prices: 2D PCA Projection")

# ── Feature loadings ──────────────────────────────────
loadings = pd.DataFrame(pca.components_.T,
                         index=X.columns,
                         columns=[f"PC{i+1}" for i in range(len(X.columns))])
print("Top features for PC1 (high |loading| = strong contributor):")
print(loadings["PC1"].abs().sort_values(ascending=False).head(5))

💡 PCA is linear — it only captures linear relationships. If your data has non-linear structure (spiral, ring, Swiss roll), PCA will distort it. In that case, use t-SNE or UMAP for visualisation, and kernel PCA for preprocessing.
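
Kernel PCA, mentioned above, performs PCA in an implicit non-linear feature space. A minimal sketch on concentric circles, which linear PCA can only rotate (gamma=10 is a hand-picked illustration value, not a recommendation):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Concentric rings: zero mean in every linear direction,
# so no linear projection can separate them
X_circ, y_circ = make_circles(n_samples=400, factor=0.3, noise=0.05,
                              random_state=42)

X_lin = PCA(n_components=2).fit_transform(X_circ)            # still rings
X_kpca = KernelPCA(n_components=2, kernel="rbf",
                   gamma=10).fit_transform(X_circ)           # rings unfold

# After kernel PCA the two rings differ clearly on the first component
gap = abs(X_kpca[y_circ == 1, 0].mean() - X_kpca[y_circ == 0, 0].mean())
print(f"Ring separation on kernel PC1: {gap:.2f}")
```
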

🌀 t-SNE and UMAP — Non-Linear Visualisation

Visualisation
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA  # used for the 50-dim pre-reduction below
import umap  # pip install umap-learn
import matplotlib.pyplot as plt
import numpy as np

# ── t-SNE ─────────────────────────────────────────────
# Converts high-dimensional distances to probabilities
# Places points in 2D to match those probability distributions
# PRESERVES LOCAL structure (nearby points stay nearby)
# Does NOT preserve global distances (far points are unreliable)
# Main parameter: perplexity (5-50; ~= number of neighbours considered)
# WARNING: every run looks different! Set random_state for reproducibility
# WARNING: slow for N > 10,000. Apply PCA first.

# Best practice: PCA to 50 dims first (faster, denoises)
pca50 = PCA(n_components=50)
X_pca50 = pca50.fit_transform(X_scaled)

tsne = TSNE(
    n_components=2,
    perplexity=30,          # 5-50; higher = more global
    learning_rate="auto",
    max_iter=1000,          # named n_iter in older scikit-learn releases
    init="pca",             # better initialisation than random
    random_state=42,
)
X_tsne = tsne.fit_transform(X_pca50)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1],
                      c=y_clusters, cmap="tab10", s=8, alpha=0.7)
plt.title("t-SNE Visualisation (perplexity=30)")
plt.colorbar(scatter, label="Cluster")

# ── UMAP (faster + preserves global structure better) ─
# Pros over t-SNE:
#   - Much faster (10-100× on large datasets)
#   - Better preservation of global structure than t-SNE
#   - Reproducible with random_state (at the cost of parallelism)
#   - Can transform new data (.transform method, unlike t-SNE)
# Main parameters:
#   n_neighbors: local neighbourhood size (5-50)
#   min_dist: how tightly to pack clusters (0.0-1.0)

reducer = umap.UMAP(
    n_components=2,
    n_neighbors=15,     # 5=local, 50=global structure
    min_dist=0.1,       # 0.0=tighter clusters, 1.0=more spread
    random_state=42,    # note: a fixed seed forces single-threaded UMAP
    n_jobs=-1,          # ignored (set to 1) when random_state is given
)
X_umap = reducer.fit_transform(X_scaled)

plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1],
                      c=y_clusters, cmap="tab10", s=8, alpha=0.7)
plt.title("UMAP Visualisation")

# UMAP can transform new data (t-SNE cannot)
new_data_2d = reducer.transform(X_test_scaled)  # project test set

⚠️ Treat t-SNE and UMAP plots as visual aids. The 2D coordinates have no absolute meaning — distances between cluster groups are not interpretable. Cluster A appearing "close" to cluster B in t-SNE does not mean they are similar. Use these plots to verify visually that clusters exist, not to measure cluster similarity.

📊 Cluster Evaluation Metrics

Evaluation
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score, adjusted_rand_score)
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

# ── Internal metrics (no labels needed) ──────────────

# 1. Silhouette Score: -1 to +1
# a(i) = mean distance to own cluster members
# b(i) = mean distance to nearest OTHER cluster members
# silhouette(i) = (b(i) - a(i)) / max(a(i), b(i))
# Near +1: point is well-separated from other clusters
# Near  0: point is on the boundary between clusters
# Near -1: point may be in the wrong cluster
sil = silhouette_score(X_scaled, labels)
print(f"Silhouette Score: {sil:.4f}  (higher is better, max 1.0)")

# 2. Davies-Bouldin Index: lower is better
# Average ratio of intra-cluster to inter-cluster distance
db = davies_bouldin_score(X_scaled, labels)
print(f"Davies-Bouldin:   {db:.4f}  (lower is better)")

# 3. Calinski-Harabasz (Variance Ratio): higher is better
ch = calinski_harabasz_score(X_scaled, labels)
print(f"Calinski-Harabasz: {ch:.1f}  (higher is better)")

# ── External metrics (need true labels) ───────────────
# Use when you KNOW the true clusters (e.g. from a ground truth)
# Adjusted Rand Index: -0.5 to 1.0 (1.0 = perfect agreement)
ari = adjusted_rand_score(y_true, labels)
print(f"Adjusted Rand Index: {ari:.4f}")

# ── Plot silhouette scores per cluster ────────────────
from sklearn.metrics import silhouette_samples
sample_sil = silhouette_samples(X_scaled, labels)
fig, ax = plt.subplots(figsize=(8, 5))
y_lower = 10
for k in range(n_clusters):
    cluster_sil = np.sort(sample_sil[labels == k])
    y_upper = y_lower + len(cluster_sil)
    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_sil)
    y_lower = y_upper + 5
ax.axvline(sil, color="red", linestyle="--", label=f"Mean = {sil:.3f}")
ax.set(xlabel="Silhouette Coefficient", title="Silhouette Plot per Cluster")
ax.legend()
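
The silhouette formula in the comments above can be sanity-checked by hand against sklearn on a tiny example (the helper below covers the two-cluster case only; with more clusters, b is the minimum over the other clusters):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two tiny clusters on a line: {0, 1} and {9, 10}
X_line = np.array([[0.0], [1.0], [9.0], [10.0]])
line_labels = np.array([0, 0, 1, 1])

def silhouette_manual(X, labels):
    """Two-cluster case: b(i) = mean distance to the single other cluster."""
    vals = []
    for i in range(len(X)):
        same  = [j for j in range(len(X)) if labels[j] == labels[i] and j != i]
        other = [j for j in range(len(X)) if labels[j] != labels[i]]
        a = np.mean([np.linalg.norm(X[i] - X[j]) for j in same])
        b = np.mean([np.linalg.norm(X[i] - X[j]) for j in other])
        vals.append((b - a) / max(a, b))
    return float(np.mean(vals))

print(silhouette_manual(X_line, line_labels))   # ≈ 0.889
print(silhouette_score(X_line, line_labels))    # identical
```
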

🏢 RFM Customer Segmentation Pipeline

Real-World Application
import pandas as pd, numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# ── RFM Analysis — standard segmentation framework ───
# R (Recency):   How recently did the customer purchase?
# F (Frequency): How often do they purchase?
# M (Monetary):  How much do they spend?

df_orders = pd.read_csv("online_retail.csv")
df_orders["InvoiceDate"] = pd.to_datetime(df_orders["InvoiceDate"])
df_orders["Revenue"] = df_orders["Quantity"] * df_orders["UnitPrice"]

# Reference date: day after last transaction
reference = df_orders["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = df_orders.groupby("CustomerID").agg(
    Recency   = ("InvoiceDate", lambda x: (reference - x.max()).days),
    Frequency = ("InvoiceNo", "nunique"),
    Monetary  = ("Revenue", "sum")
).reset_index()

# Keep only genuine purchases (drop returns/credits with non-positive revenue)
rfm = rfm[(rfm["Monetary"] > 0) & (rfm["Frequency"] > 0)]

# ── Scale and cluster ─────────────────────────────────
scaler = StandardScaler()
X_rfm = scaler.fit_transform(rfm[["Recency", "Frequency", "Monetary"]])

# Find K using elbow + silhouette
silhouettes = []
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_rfm)
    silhouettes.append(silhouette_score(X_rfm, labels))

best_k = silhouettes.index(max(silhouettes)) + 2
print(f"Best K: {best_k}")

km = KMeans(n_clusters=best_k, n_init=20, random_state=42)
rfm["Cluster"] = km.fit_predict(X_rfm)

# ── Profile clusters ──────────────────────────────────
profile = rfm.groupby("Cluster")[["Recency", "Frequency", "Monetary"]].mean().round(1)
profile["Size"] = rfm["Cluster"].value_counts().sort_index()
print(profile)

# Interpret:
# Cluster A: Low Recency, High Freq, High Monetary → Champions
# Cluster B: High Recency, Low Freq, Low Monetary  → At Risk
# Cluster C: Medium Recency, Medium Freq            → Potential Loyalists

# ── PCA visualisation of RFM clusters ────────────────
pca2 = PCA(n_components=2)
X_2d = pca2.fit_transform(X_rfm)
plt.figure(figsize=(8, 6))
for c in rfm["Cluster"].unique():
    mask = rfm["Cluster"] == c
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=f"Cluster {c}", s=10, alpha=0.5)
plt.legend()
plt.title(f"RFM Clusters (K={best_k}, PCA 2D)")
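
A common lightweight complement to clustering is quartile-based RFM scoring with pd.qcut. A sketch on a synthetic stand-in frame (rfm_demo, its data, the score bins, and the segment names are all illustrative, though the column names match the rfm frame above):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the rfm frame built above
rng = np.random.default_rng(42)
rfm_demo = pd.DataFrame({
    "Recency":   rng.integers(1, 365, 200),
    "Frequency": rng.integers(1, 50, 200),
    "Monetary":  rng.uniform(10, 5000, 200),
})

# Score each dimension 1-4 by quartile; rank() sidesteps duplicate bin edges
# Recency is reversed: more recent (lower) = better score
rfm_demo["R"] = pd.qcut(rfm_demo["Recency"].rank(method="first"), 4,
                        labels=[4, 3, 2, 1]).astype(int)
rfm_demo["F"] = pd.qcut(rfm_demo["Frequency"].rank(method="first"), 4,
                        labels=[1, 2, 3, 4]).astype(int)
rfm_demo["M"] = pd.qcut(rfm_demo["Monetary"], 4, labels=[1, 2, 3, 4]).astype(int)
rfm_demo["RFM_Score"] = rfm_demo["R"] + rfm_demo["F"] + rfm_demo["M"]

# Name tiers from the combined score (range 3-12; the bins are illustrative)
rfm_demo["Segment"] = pd.cut(rfm_demo["RFM_Score"], bins=[2, 5, 8, 12],
                             labels=["At Risk", "Potential Loyalist", "Champion"])
print(rfm_demo["Segment"].value_counts())
```

Unlike K-Means, quartile scoring is deterministic, trivially explainable to stakeholders, and stable as new customers arrive.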

FREE LEARNING RESOURCES

  • Video: StatQuest — K-Means, PCA, t-SNE (YouTube). Best visual explanations of how each algorithm works. Highly recommended for building intuition.
  • Docs: Scikit-learn Clustering Guide (scikit-learn.org/stable/modules/clustering.html). All sklearn clustering algorithms with comparison table, parameters, and use cases.
  • Docs: UMAP Documentation (umap-learn.readthedocs.io). UMAP parameters, comparison to t-SNE, and production use cases.
  • Dataset: Mall Customers (Kaggle). Classic customer segmentation dataset. Small, visual, perfect for K-Means exploration.
  • Dataset: Olist Brazilian E-Commerce (Kaggle). Real e-commerce data for full RFM analysis. Multiple tables to join and explore.
🛠 Project: Customer Segmentation + Churn Prediction — Advanced · 6–7 days

A two-part project combining unsupervised and supervised ML.

Part A — Customer Segmentation (3–4 days)

  • Load Telco Churn or Mall Customers dataset
  • Compute RFM features (or use existing features)
  • Use elbow + silhouette to find optimal K
  • Profile each cluster with mean feature values
  • Visualise with PCA + t-SNE side by side
  • Name each segment: "High-Value Loyalists", "At-Risk Churners", etc.

Part B — Churn Prediction (2–3 days)

  • Add cluster labels as a feature to the churn prediction dataset
  • Train XGBoost with and without cluster feature
  • Does the cluster feature improve ROC-AUC?
  • Which cluster has the highest churn rate?

Deliverable: Jupyter notebook with all plots + 1-paragraph business recommendation per segment.

🎉 Part 3 Complete!

You've completed all six Classical ML modules (P2-M05 through P3-M10). You can now run EDA on any dataset, build regression and classification models, use gradient boosting, tune hyperparameters, and find hidden structure. Next: Part 4 — LLM APIs, where the AI Engineering path begins.

LAB 1

Finding the Right K

1. On Mall Customers, plot the elbow curve (inertia) and silhouette score for K=2 to 10. Do both methods agree on the best K? What happens when K is too large?
2. Fit K-Means with the optimal K. Profile each cluster: what is the mean Income, Spending Score, and Age? Give each cluster a descriptive name (e.g., "Budget-Conscious Young Adults").
3. Run K-Means 5 times with different random_state values (same K). Do the cluster assignments change? Does the silhouette score change? What does this tell you about K-Means stability?

LAB 2

PCA for Preprocessing

1. Apply PCA to the House Prices numeric features. How many components are needed to explain 95% variance? How many for 99%?
2. Train Ridge regression on: (a) all original features, (b) PCA components explaining 95% variance, (c) PCA components explaining 50% variance. Compare CV RMSE. Does PCA improve or hurt performance?
3. Plot the loadings for PC1 and PC2. Which original features contribute most to each component? Does this match your intuition from the EDA (M05)?

P3-M10 MASTERY CHECKLIST

Part 3 Complete! You have all the Classical ML foundations. Move to Part 4 — LLM APIs to start the AI Engineering path: prompting, structured outputs, streaming, and reliability.
