Part 1 — Universal Foundation  ·  Module 02 of 04
NumPy & Pandas Data Toolkit
Vectorised computing and data wrangling — the backbone of all ML work
⏱ 3 Weeks 🟡 Beginner–Intermediate 🔢 NumPy · Pandas 📋 Prerequisite: P1-M01 🛠 Jupyter / Colab
🎯

What This Module Covers

Foundation

NumPy and Pandas are the two libraries you will use in literally every AI/ML project. NumPy provides fast vectorised numerical computing — the engine under PyTorch, Scikit-learn, and Pandas itself. Pandas provides the DataFrame — the universal data table for loading, cleaning, and transforming real-world data.

  • NumPy — ndarray creation, indexing, slicing, broadcasting, vectorised operations, statistical functions
  • Pandas Series — one-dimensional labelled array, the column of a DataFrame
  • Pandas DataFrame — the core data structure: reading CSVs, inspecting data, indexing with .loc/.iloc
  • Data cleaning — handling NaN values, removing duplicates, type conversion
  • Data manipulation — filtering, sorting, groupby, aggregation, merge/join, pivot tables
  • String operations — .str accessor for text data cleaning
  • Datetime handling — pd.to_datetime(), .dt accessor, time-series operations
⚡ SKIP IF: You know Java/C++ arrays and ArrayLists — NumPy arrays are conceptually similar but with vectorised operations (no explicit loops needed). Pandas DataFrame is like a database table or Excel spreadsheet. Skim NumPy basics and focus on Pandas, which is uniquely Python/data-focused with no direct equivalent in compiled languages.
🔗

Why These Two Libraries

Context
Library | What it does | Used in
NumPy | N-dimensional arrays, vectorised math, linear algebra. C-backed — 10–100× faster than Python loops. | Scikit-learn internals, PyTorch tensors, image processing, embedding vectors
Pandas | DataFrame for tabular data. Read CSVs, clean messy data, group/aggregate, merge datasets. | Every ML project for data loading and EDA. Feature engineering pipeline.

💡 NumPy array vs Python list: A Python list can hold mixed types and is stored as pointers to objects. A NumPy array holds a single type in contiguous memory — like a C array. This is why np.sum(arr) is 50× faster than sum(list) for large data.
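
You can check the gap yourself with the standard-library timeit module — a minimal sketch (the exact ratio depends on your machine and array size):

import timeit
import numpy as np

data = list(range(1_000_000))
arr  = np.array(data)

# Each timing runs the summation 10 times
t_list = timeit.timeit(lambda: sum(data), number=10)
t_np   = timeit.timeit(lambda: np.sum(arr), number=10)
print(f"sum(list): {t_list:.3f}s  np.sum(arr): {t_np:.3f}s  speedup: {t_list / t_np:.0f}×")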

🔗

Module Connections

Dependencies
  • P1-M03 (Dev Essentials) — JSON/API responses are converted to DataFrames constantly
  • P2 (Stats & EDA) — all statistical analysis uses Pandas + NumPy directly
  • P3 (Classical ML) — Scikit-learn expects NumPy arrays as input (X, y)
  • P5 (RAG) — document metadata stored as DataFrames before ingestion into vector DBs
  • P7 (Production) — log analysis and monitoring data processed with Pandas
🔢

NumPy Array Fundamentals

Week 1
import numpy as np

# Creating arrays
a = np.array([1, 2, 3, 4, 5])              # from Python list
b = np.zeros((3, 4))                      # 3×4 matrix of zeros
c = np.ones((2, 3), dtype=np.float32)     # with dtype
d = np.arange(0, 10, 2)                   # [0, 2, 4, 6, 8]
e = np.linspace(0, 1, 5)                  # [0, 0.25, 0.5, 0.75, 1.0]
f = np.random.randn(3, 3)                # 3×3 standard normal

# Key attributes
print(a.shape)    # (5,)     — 1D with 5 elements
print(b.shape)    # (3, 4)   — 2D: 3 rows, 4 cols
print(b.dtype)    # float64  — default numeric type
print(b.ndim)     # 2        — number of dimensions
print(b.size)     # 12       — total elements
✂️

Indexing, Slicing and Boolean Masking

Essential
arr = np.array([[1,2,3],[4,5,6],[7,8,9]])

# Indexing — [row, col]
print(arr[0, 1])       # 2  — row 0, col 1
print(arr[-1, -1])     # 9  — last row, last col

# Slicing — [row_start:row_end, col_start:col_end]
print(arr[0:2, 1:])    # [[2,3],[5,6]]  — rows 0-1, cols 1+
print(arr[:, 0])        # [1, 4, 7]  — entire first column
print(arr[1, :])         # [4, 5, 6]  — entire second row

# Boolean masking — critical for data filtering
scores = np.array([55, 72, 88, 43, 95, 61])
mask   = scores > 70
print(mask)             # [F, T, T, F, T, F]
print(scores[mask])     # [72, 88, 95]  — fancy indexing
print(scores[scores > 70])  # same — inline

# Combine conditions
print(scores[(scores > 60) & (scores < 90)])  # [72, 88, 61]

⚠️ NumPy slices are VIEWS, not copies. Modifying a slice modifies the original array. Always use arr.copy() when you need an independent copy. This is the single most common NumPy bug.
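
A short demonstration of the view behaviour described above:

import numpy as np

a = np.arange(10)
view = a[2:5]          # slicing returns a VIEW — shares memory with a
view[0] = 99
print(a)               # [ 0  1 99  3  4  5  6  7  8  9] — the original changed!

safe = a[2:5].copy()   # .copy() allocates independent memory
safe[0] = -1
print(a)               # a is unaffected by this second assignment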

Vectorised Operations and Broadcasting

Performance

Vectorised operations apply element-wise without Python loops — this is where NumPy's speed comes from.

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Element-wise — no loops needed
print(a + b)          # [11, 22, 33, 44]
print(a * b)          # [10, 40, 90, 160]
print(a ** 2)         # [1, 4, 9, 16]
print(np.sqrt(a))     # [1.0, 1.41, 1.73, 2.0]

# Statistical functions
print(np.mean(a))     # 2.5
print(np.std(a))      # 1.118...
print(np.sum(a))      # 10
print(np.min(a), np.max(a), np.argmax(a))  # 1  4  3

# Broadcasting — smaller array stretches to match larger
matrix = np.ones((3, 4))
row    = np.array([1, 2, 3, 4])    # shape (4,)
result = matrix + row               # row broadcast across 3 rows
print(result)
# [[2,3,4,5],
#  [2,3,4,5],
#  [2,3,4,5]]
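
Broadcasting also works along the other axis if the smaller array has an explicit column shape. The rule: trailing dimensions must be equal or 1. A short sketch continuing the example above:

col = np.array([[10], [20], [30]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])         # shape (4,) — treated as (1, 4)
print(col + row)                     # (3,1) and (1,4) broadcast to (3,4)
# [[11 12 13 14]
#  [21 22 23 24]
#  [31 32 33 34]]

# Incompatible trailing dimensions raise an error:
# np.ones((3, 4)) + np.array([1, 2, 3])   # ValueError: (3,4) vs (3,)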
📐

Reshape, Stack and Linear Algebra

ML Prep
# Reshape — change shape without changing data
a = np.arange(12)
print(a.reshape(3, 4))    # 3 rows, 4 cols
print(a.reshape(2, -1))    # -1 = infer (becomes 2×6)
print(a.flatten())           # back to 1D

# Transpose — swap rows and cols
m = np.array([[1,2,3],[4,5,6]])
print(m.T)    # shape (2,3) → (3,2)

# Stacking arrays
x = np.array([1,2,3])
y = np.array([4,5,6])
print(np.vstack([x, y]))    # [[1,2,3],[4,5,6]]  vertical
print(np.hstack([x, y]))    # [1,2,3,4,5,6]  horizontal

# Matrix multiplication — critical for ML
A = np.array([[1,2],[3,4]])
B = np.array([[5,6],[7,8]])
print(A @ B)          # matrix multiply: [[19,22],[43,50]]
print(np.dot(A, B))   # equivalent
print(A * B)          # element-wise (NOT matrix multiply)

💡 Remember: @ is matrix multiplication (dot product). * is element-wise. This distinction is critical — using the wrong one in ML code produces silently wrong results.

📊

Series and DataFrame Basics

Week 2
import pandas as pd
import numpy as np

# Series — 1D labelled array (a single column)
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])    # 20
print(s.dtype)   # int64

# DataFrame — 2D table (rows × columns)
df = pd.DataFrame({
    "name":  ["Alice", "Bob", "Charlie"],
    "score": [85, 92, 78],
    "grade": ["B", "A", "C"]
})

# Loading from files
df = pd.read_csv("students.csv")
df = pd.read_json("data.json")
df = pd.read_excel("report.xlsx")

# First look at a dataset
print(df.head(5))      # first 5 rows
print(df.tail(3))      # last 3 rows
print(df.shape)         # (rows, cols)
print(df.dtypes)        # column types
df.info()               # types + non-null counts — prints directly, no print() needed
print(df.describe())    # count, mean, std, min, quartiles, max
🎯

Indexing — .loc vs .iloc

Essential
# .loc — label-based indexing (use column names, index labels)
df.loc[0]                       # row with index label 0
df.loc[0, "name"]               # cell: row 0, column "name"
df.loc[0:2, ["name","score"]]   # rows 0-2, two columns (INCLUSIVE)

# .iloc — position-based indexing (like NumPy)
df.iloc[0]            # first row
df.iloc[0, 1]         # cell: row 0, column 1
df.iloc[0:3, :2]      # first 3 rows, first 2 cols
df.iloc[-1]           # last row

# Boolean filtering — the most common pattern
high_scores = df[df["score"] > 80]
top_students = df[(df["score"] > 80) & (df["grade"] == "A")]

# Selecting columns
df["name"]             # returns Series
df[["name", "score"]]  # returns DataFrame with 2 cols

⚠️ .loc endpoint is INCLUSIVE, .iloc endpoint is EXCLUSIVE. df.loc[0:3] returns rows 0,1,2,3. df.iloc[0:3] returns rows 0,1,2. This trips up everyone coming from Python slicing.
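
A two-line check you can run against the three-row students DataFrame defined above:

print(len(df.loc[0:2]))     # 3 — labels 0, 1 and 2 (inclusive)
print(len(df.iloc[0:2]))    # 2 — positions 0 and 1 (exclusive)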

🧹

Data Cleaning — The Real Work

Week 2–3

Real-world data is always messy. Expect to spend 60–70% of your time on any ML project cleaning data. Pandas has excellent tools for it.

# Detecting and handling NaN (missing values)
print(df.isnull().sum())          # count NaN per column
print(df.isnull().sum() / len(df))  # percentage missing

df.dropna()                       # drop rows with ANY NaN
df.dropna(subset=["score"])       # drop only if score is NaN
df.dropna(thresh=3)              # keep rows with at least 3 non-NaN
df.fillna(0)                     # fill all NaN with 0
df["score"].fillna(df["score"].mean())  # fill with column mean
df["score"].ffill()              # forward fill (time series)

# Duplicates
df.duplicated().sum()             # count duplicate rows
df.drop_duplicates()              # remove all duplicates
df.drop_duplicates(subset=["name"])  # based on specific cols

# Type conversion
df["score"] = df["score"].astype(float)
df["date"]  = pd.to_datetime(df["date"])
df["grade"] = df["grade"].astype("category")  # saves memory

# String cleaning
df["name"] = df["name"].str.strip().str.lower()
df["email"] = df["email"].str.contains("@")  # returns bool Series

💡 Always use .copy() when creating a subset DataFrame. df_clean = df[df["score"] > 0].copy() — without .copy() you get a SettingWithCopyWarning and changes to df_clean may or may not affect the original. This is Pandas' most confusing behaviour.

🔄

GroupBy — Split, Apply, Combine

Week 3
# groupby() — the most powerful Pandas operation
# Pattern: split data into groups → apply function → combine results

df = pd.DataFrame({
    "city":  ["Mumbai", "Delhi", "Mumbai", "Delhi", "Mumbai"],
    "sales": [100, 200, 150, 180, 120],
    "month": ["Jan", "Jan", "Feb", "Feb", "Mar"]
})

# Basic aggregations
df.groupby("city")["sales"].mean()     # mean sales per city
df.groupby("city")["sales"].sum()      # total sales per city
df.groupby("city")["sales"].count()    # number of records per city

# Multiple aggregations at once
df.groupby("city").agg({
    "sales": ["sum", "mean", "count"]
})

# Group by multiple columns
df.groupby(["city", "month"])["sales"].sum()

# transform — adds group stat back to original rows
df["city_avg"] = df.groupby("city")["sales"].transform("mean")
df["pct_of_city"] = df["sales"] / df["city_avg"]
🔀

Merge, Join and Concat

Week 3
# merge — SQL-style join on a key column
users  = pd.DataFrame({"id": [1,2,3], "name": ["Alice","Bob","Charlie"]})
orders = pd.DataFrame({"user_id": [1,1,2], "amount": [50,75,30]})

pd.merge(orders, users, left_on="user_id", right_on="id", how="left")
# how: "inner"(default), "left", "right", "outer"

# concat — stack DataFrames vertically or horizontally
train = pd.read_csv("train.csv")
test  = pd.read_csv("test.csv")
full  = pd.concat([train, test], ignore_index=True)   # vertical stack

# pivot_table — Excel-style pivot
pivot = df.pivot_table(
    values="sales",
    index="city",
    columns="month",
    aggfunc="sum",
    fill_value=0
)
📅

Datetime Handling

Time Series
# Parse dates on read
df = pd.read_csv("data.csv", parse_dates=["date"])

# Convert string column to datetime
df["date"] = pd.to_datetime(df["date"])

# .dt accessor — extract components
df["year"]    = df["date"].dt.year
df["month"]   = df["date"].dt.month
df["weekday"] = df["date"].dt.day_name()    # "Monday", "Tuesday"...
df["quarter"] = df["date"].dt.quarter

# Rolling window — used for moving averages (COVID 7-day rolling avg)
df["rolling_7"] = df["cases"].rolling(window=7).mean()

# Resample — aggregate by time period
df.set_index("date").resample("M")["sales"].sum()  # monthly totals ("M" becomes "ME" in pandas ≥ 2.2)

The Pandas Power Patterns — Memorise These

Must Know

These patterns appear in virtually every data science and ML project. Learn them until they are automatic.

# 1. Boolean masking — most common filtering pattern
df[df["age"] > 30]
df[(df["age"] > 30) & (df["city"] == "Mumbai")]
df[df["name"].isin(["Alice", "Bob"])]
df[~df["score"].isna()]   # ~ inverts boolean

# 2. Chain operations — readable pipeline
result = (df
    .dropna(subset=["score"])
    .query("score > 60")
    .groupby("city")["sales"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)

# 3. apply() with lambda — transform column values
df["score_normalised"] = df["score"].apply(lambda x: (x - 50) / 50)
df["grade"] = df["score"].apply(lambda x: "A" if x>=90 else "B" if x>=80 else "C")

# 4. Always .copy() on subsets
df_clean = df[df["score"] > 0].copy()

# 5. pd.get_dummies — one-hot encoding (used in every ML project)
df_encoded = pd.get_dummies(df, columns=["city", "grade"])

# 6. value_counts — quick frequency distribution
df["city"].value_counts()
df["city"].value_counts(normalize=True)  # proportions

# 7. nunique — number of unique values per column
df.nunique()   # quick cardinality check before one-hot encoding
🔗

NumPy ↔ Pandas Interoperability

Integration
# DataFrame → NumPy array (for Scikit-learn, PyTorch)
X = df[["age", "score", "income"]].values   # .values returns ndarray
y = df["target"].to_numpy()                   # explicit and preferred
print(X.shape)   # (n_samples, n_features)

# NumPy array → DataFrame
arr = np.random.randn(100, 3)
df2 = pd.DataFrame(arr, columns=["x1", "x2", "x3"])

# Apply NumPy functions to Pandas columns
df["log_income"] = np.log(df["income"] + 1)  # log transform
df["z_score"]   = (df["score"] - df["score"].mean()) / df["score"].std()

# The full ML data prep pipeline
# 1. Load with pd.read_csv
# 2. Clean with Pandas (drop NaN, fix types, remove duplicates)
# 3. Engineer features with Pandas + NumPy
# 4. Encode categoricals with pd.get_dummies or LabelEncoder
# 5. Convert to NumPy with .values or to_numpy()
# 6. Pass to Scikit-learn / PyTorch
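
The six steps above as one compact, runnable sketch — the file name and column names (income, city, target) are hypothetical, so substitute your own:

import numpy as np
import pandas as pd

# 1–2. Load and clean (hypothetical file and columns)
df = pd.read_csv("customers.csv")
df = df.drop_duplicates().dropna(subset=["income", "target"]).copy()
df["income"] = df["income"].astype(float)

# 3. Engineer features
df["log_income"] = np.log(df["income"] + 1)

# 4. Encode categoricals
df = pd.get_dummies(df, columns=["city"])

# 5–6. Convert to NumPy and hand off to Scikit-learn / PyTorch
feature_cols = [c for c in df.columns if c != "target"]
X = df[feature_cols].to_numpy()
y = df["target"].to_numpy()
print(X.shape, y.shape)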
🐌

Performance — When Pandas Gets Slow

Production
  • Never iterate with for loops over DataFrame rows — use vectorised operations, .apply(), or .map()
  • Use categorical dtype for string columns with low cardinality (e.g. city, grade) — cuts memory 10×
  • Read large CSVs in chunks — pd.read_csv(..., chunksize=10000) for files that don't fit in RAM (sketched after the code block below)
  • Use .query() for complex filters — often faster than boolean indexing on large DataFrames
  • Avoid object dtype — mixed types in a column cause it to use object dtype (slow). Always fix types on load.
# Slow — Python loop over rows (never do this)
for i, row in df.iterrows():
    df.at[i, "new_col"] = row["score"] * 2

# Fast — vectorised (1000× faster)
df["new_col"] = df["score"] * 2

# Check memory usage
df.memory_usage(deep=True).sum() / 1024**2   # MB
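
Two of the bullets above as runnable sketches — the file name and the sales/city columns are hypothetical:

# Chunked reading — aggregate a file too large for RAM, piece by piece
total = 0
for chunk in pd.read_csv("big_log.csv", chunksize=10_000):
    total += chunk["sales"].sum()
print(total)

# Categorical dtype — measure the memory saving directly
before = df["city"].memory_usage(deep=True)
df["city"] = df["city"].astype("category")
after = df["city"].memory_usage(deep=True)
print(f"{before:,} bytes → {after:,} bytes")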

3-WEEK STRUCTURED PLAN

Week 1 — NumPy
Topics: Install NumPy. ndarray creation: np.array, np.zeros, np.ones, np.arange, np.linspace, np.random. Array indexing, slicing (2D), boolean masking. Vectorised arithmetic — why no loops are needed. Broadcasting rules. NumPy math: mean, std, sum, dot, reshape, transpose.
Daily tasks — Day 1–2: Compute statistics on a random student score array without any Python loops. Day 3–4: Implement matrix multiplication using np.dot — verify against a manual calculation. Day 5–7: Reshape a 1D sensor-data array into a 2D time-series matrix and extract windows.

Week 2 — Pandas Basics
Topics: Pandas Series vs DataFrame. pd.read_csv(), .head(), .info(), .describe(), .shape. Indexing: .loc[], .iloc[], boolean filtering. Handling NaN: .isnull(), .dropna(), .fillna(). Removing duplicates. Type conversion with .astype() and pd.to_datetime().
Daily tasks — Day 1–2: Load the COVID-19 dataset — write a 10-line "data health report" (shape, dtypes, null counts, value ranges). Day 3–4: Find and handle all missing values — document your strategy (drop vs fill) with justification. Day 5–7: Filter a real DataFrame on multiple conditions — export the result to a new CSV.

Week 3 — Pandas Advanced
Topics: groupby() — the split-apply-combine pattern. .agg(), .transform(). Merging DataFrames: merge(), join(), concat(). Pivot tables: pd.pivot_table(). String operations: .str.lower(), .str.contains(), .str.replace(). Datetime: pd.to_datetime(), .dt.year, .dt.month. Rolling windows.
Daily tasks — Day 1–2: Find the top 5 countries by total COVID cases using groupby + sort. Day 3–4: Merge two datasets on a common key — verify row counts before and after. Day 5–7: Full milestone project — COVID-19 Global Data Analysis (see Milestone Project below).

FREE LEARNING RESOURCES

Type | Resource | Best For
Course | Kaggle Pandas Course (Free, Interactive) | Best hands-on Pandas. Exercises with instant feedback. Complete in Week 2.
Video | NumPy for Beginners — freeCodeCamp (YouTube) | Complete NumPy from scratch. Watch at the start of Week 1.
Docs | Pandas Official Documentation — User Guide | Authoritative reference. "10 Minutes to Pandas" is a must-read in Week 2.
Video | Corey Schafer — Pandas Tutorials (YouTube Playlist) | Deep Pandas tutorials. Best for groupby and merge concepts.
Course | Kaggle NumPy Course (Free) | NumPy fundamentals with practice exercises.
Cheatsheet | Pandas Cheat Sheet — DataCamp (Free PDF) | Quick reference. Print it and keep it beside you during Weeks 2–3.

FREE DATASETS FOR PRACTICE

Dataset | Practice Focus
COVID-19 Global Dataset — Kaggle | Time series, rolling averages, groupby, datetime handling — milestone project
Titanic Dataset — Kaggle (Classic) | Missing values, groupby, boolean filtering practice
World Population Dataset — Kaggle | Merging, pivoting, multi-column groupby
Netflix Shows Dataset — Kaggle | String operations, datetime handling, value_counts

MILESTONE PROJECT

🛠 COVID-19 Global Data Analysis [Beginner] 4–5 days · Week 3

Use the COVID-19 global dataset to demonstrate the full NumPy + Pandas data pipeline. This project is your first real data analysis — the same workflow you will use on every ML project.

Requirements

  • Load and inspect: shape, dtypes, null counts, value ranges per column
  • Clean: handle NaN values, fix date column to datetime, remove duplicates
  • Compute the rolling 7-day average of daily cases per country (see the sketch after this list)
  • Find top 10 countries by total deaths-per-million (merge population data if needed)
  • Identify months with the highest case surges using groupby + datetime
  • Export a cleaned summary CSV with one row per country: total_cases, total_deaths, peak_month, rolling_avg_peak
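
For the rolling-average requirement, a minimal sketch assuming columns named date, country and new_cases (adjust to the dataset's actual column names):

import pandas as pd

df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(["country", "date"])
df["rolling_7"] = (
    df.groupby("country")["new_cases"]
      .transform(lambda s: s.rolling(window=7).mean())
)

transform keeps the result aligned with the original rows, so each row gets its own country's trailing 7-day mean.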

Stretch Goals

  • Compare case trajectories of 5 countries using a pivot table (country vs month)
  • Detect the date of peak cases for each country programmatically
  • Add a simple bar chart using Matplotlib (preview of Part 2 visualisation)

Skills: NumPy operations, Pandas cleaning, groupby, datetime, merge, rolling windows, export

Dataset: COVID-19 Global Dataset — Kaggle

MINI-PROJECTS

🛠 Week 1 — NumPy Statistics Without Loops · 1–2 days

Generate a random 100×5 array of student scores (np.random.randint). Without any Python for/while loops: compute mean, std, min, max per subject; find students scoring above class average in every subject; normalise all scores to 0–1 range using vectorised operations only.
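
A starting sketch for the loop-free approach (axis=0 aggregates over students, axis=1 over subjects; assumes each subject has some spread of scores so the min–max denominator is non-zero):

import numpy as np

scores = np.random.randint(0, 101, size=(100, 5))    # 100 students × 5 subjects

subject_mean = scores.mean(axis=0)                   # shape (5,)
subject_std  = scores.std(axis=0)

# Students above the class average in EVERY subject — broadcasting + .all()
above_all = (scores > subject_mean).all(axis=1)
print(above_all.sum(), "students beat every subject average")

# Min–max normalise each subject's scores to the 0–1 range
normalised = (scores - scores.min(axis=0)) / (scores.max(axis=0) - scores.min(axis=0))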

🛠 Week 2 — Titanic Data Health Report · 2–3 days

Load the Titanic dataset. Produce a printed report covering: (1) overall shape and column types, (2) null percentage per column with fill strategy recommendation, (3) survival rate by sex, class, and embarked port using groupby, (4) age distribution statistics, (5) output a cleaned version with nulls handled.

LAB 1

NumPy Vectorisation — Measure the Speedup

Objective: Directly measure why NumPy vectorised operations replace Python loops in all ML code.

1. Create a Python list and a NumPy array, both containing 1 million random numbers: import random; py_list = [random.random() for _ in range(1_000_000)] and np_arr = np.array(py_list).
2. Time squaring every element using a Python loop: import time; t = time.time(); result = [x**2 for x in py_list]; print(time.time() - t).
3. Time the same operation with NumPy: t = time.time(); result = np_arr ** 2; print(time.time() - t). Record the ratio — it should be 10–100× faster.
4. Now benchmark: (a) Python loop sum, (b) built-in sum(), (c) np.sum(). Print all three times. Explain why the results differ.
5. Test the view-vs-copy behaviour: create a = np.arange(10), then b = a[2:5]. Modify b[0] = 99. Print a. Now do the same with b = a[2:5].copy() and repeat. Document what you observe.
6. Bonus: use np.where() as a vectorised if-else: np.where(arr > 0.5, "high", "low"). Apply this to classify 1000 random scores as pass/fail without any Python loop.
LAB 2

Pandas Data Investigation Pipeline

Objective: Build a reusable function that generates a data quality report for any DataFrame — a tool you will use on every future project.

1. Download the Titanic dataset from Kaggle or use: import pandas as pd; df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")
2. Write a function data_report(df: pd.DataFrame) -> None that prints: shape, dtypes, null count + percentage per column, numeric column statistics (mean, std, min, max), and categorical column value_counts (top 5 per column). A sketch of one possible layout follows this lab.
3. Run data_report on the Titanic dataset. Identify: which columns have missing values, which columns should be dropped, and what the survival rate is.
4. Clean the dataset: fill Age NaN with the median, drop the Cabin column (too many nulls), fill Embarked NaN with the mode. Use .copy() throughout. Verify with isnull().sum() that all nulls are handled.
5. Answer these questions using groupby: (a) What is the survival rate by Sex? (b) What is the survival rate by Pclass? (c) What is the average fare by Pclass and Sex combined?
6. Extension: add a correlation matrix to your data_report — df.select_dtypes(include=np.number).corr(). Print the top 5 feature pairs with the highest absolute correlation.
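
One possible shape for the data_report function from step 2 — a minimal sketch, not the only valid layout:

import pandas as pd

def data_report(df: pd.DataFrame) -> None:
    """Print a quick data-quality summary for any DataFrame."""
    print(f"shape: {df.shape}\n")
    print(df.dtypes, "\n")
    nulls = df.isnull().sum()
    pct = (nulls / len(df) * 100).round(1)
    print(pd.DataFrame({"nulls": nulls, "pct_missing": pct}), "\n")
    print(df.describe(), "\n")                            # numeric statistics
    for col in df.select_dtypes(include="object").columns:
        print(f"-- {col} (top 5 values) --")
        print(df[col].value_counts().head(5), "\n")
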
LAB 3

The SettingWithCopyWarning — Understand It Once and For All

Objective: Understand Pandas' most confusing behaviour — the difference between views and copies — so it never silently breaks your code.

1. Create: df = pd.DataFrame({"a": [1,2,3,4,5], "b": [10,20,30,40,50]}). Then: subset = df[df["a"] > 2]. Try subset["b"] = 99. Observe the warning.
2. Now do: subset = df[df["a"] > 2].copy(). Repeat subset["b"] = 99. Check df — did it change? Confirm that .copy() creates an independent object.
3. Use .loc for safe in-place modification of the original: df.loc[df["a"] > 2, "b"] = 99. This modifies df directly without warnings. Print df to confirm.
4. Write the rule in your own words: when do you use .copy(), and when do you use .loc[]? Document it in a comment block you can paste into future projects.

P1-M02 MASTERY CHECKLIST

When complete: Move to P1-M03 — Developer Essentials (Git, CLI, APIs & Async). The JSON and CSV skills you built here connect directly to calling REST APIs and parsing their responses in M03.
