What This Module Covers
Foundation

NumPy and Pandas are the two libraries you will use in virtually every AI/ML project. NumPy provides fast vectorised numerical computing — the engine under Scikit-learn, Pandas, and the array model that PyTorch tensors mirror. Pandas provides the DataFrame — the universal data table for loading, cleaning, and transforming real-world data.
- NumPy — ndarray creation, indexing, slicing, broadcasting, vectorised operations, statistical functions
- Pandas Series — one-dimensional labelled array, the column of a DataFrame
- Pandas DataFrame — the core data structure: reading CSVs, inspecting data, indexing with .loc/.iloc
- Data cleaning — handling NaN values, removing duplicates, type conversion
- Data manipulation — filtering, sorting, groupby, aggregation, merge/join, pivot tables
- String operations — .str accessor for text data cleaning
- Datetime handling — pd.to_datetime(), .dt accessor, time-series operations
Why These Two Libraries
Context

| Library | What it does | Used in |
|---|---|---|
| NumPy | N-dimensional arrays, vectorised math, linear algebra. C-backed — 10–100× faster than Python loops. | Scikit-learn internals, PyTorch tensors, image processing, embedding vectors |
| Pandas | DataFrame for tabular data. Read CSVs, clean messy data, group/aggregate, merge datasets. | Every ML project for data loading and EDA. Feature engineering pipeline. |
💡 NumPy array vs Python list: A Python list can hold mixed types and is stored as pointers to objects. A NumPy array holds a single type in contiguous memory — like a C array. This is why np.sum(arr) is 50× faster than sum(list) for large data.
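A minimal timing sketch of that gap (exact numbers vary by machine and array size):

```python
import time
import numpy as np

data = list(range(5_000_000))
arr = np.array(data)

t0 = time.perf_counter()
total_py = sum(data)    # Python iterates over boxed int objects
t1 = time.perf_counter()
total_np = np.sum(arr)  # one C loop over contiguous memory
t2 = time.perf_counter()

print(f"sum(list):   {t1 - t0:.4f}s")
print(f"np.sum(arr): {t2 - t1:.4f}s")
```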
Module Connections
Dependencies

- P1-M03 (Dev Essentials) — JSON/API responses are converted to DataFrames constantly
- P2 (Stats & EDA) — all statistical analysis uses Pandas + NumPy directly
- P3 (Classical ML) — Scikit-learn expects NumPy arrays as input (X, y)
- P5 (RAG) — document metadata stored as DataFrames before ingestion into vector DBs
- P7 (Production) — log analysis and monitoring data processed with Pandas
NumPy Array Fundamentals
Week 1

```python
import numpy as np

# Creating arrays
a = np.array([1, 2, 3, 4, 5])          # from a Python list
b = np.zeros((3, 4))                   # 3×4 matrix of zeros
c = np.ones((2, 3), dtype=np.float32)  # with an explicit dtype
d = np.arange(0, 10, 2)                # [0, 2, 4, 6, 8]
e = np.linspace(0, 1, 5)               # [0, 0.25, 0.5, 0.75, 1.0]
f = np.random.randn(3, 3)              # 3×3 standard normal

# Key attributes
print(a.shape)  # (5,) — 1D with 5 elements
print(b.shape)  # (3, 4) — 2D: 3 rows, 4 cols
print(b.dtype)  # float64 — default numeric type
print(b.ndim)   # 2 — number of dimensions
print(b.size)   # 12 — total elements
```
Indexing, Slicing and Boolean Masking
Essential

```python
arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Indexing — [row, col]
print(arr[0, 1])    # 2 — row 0, col 1
print(arr[-1, -1])  # 9 — last row, last col

# Slicing — [row_start:row_end, col_start:col_end]
print(arr[0:2, 1:])  # [[2, 3], [5, 6]] — rows 0–1, cols 1+
print(arr[:, 0])     # [1, 4, 7] — entire first column
print(arr[1, :])     # [4, 5, 6] — entire second row

# Boolean masking — critical for data filtering
scores = np.array([55, 72, 88, 43, 95, 61])
mask = scores > 70
print(mask)                 # [False  True  True False  True False]
print(scores[mask])         # [72, 88, 95] — boolean indexing
print(scores[scores > 70])  # same thing, inline

# Combine conditions
print(scores[(scores > 60) & (scores < 90)])  # [72, 88, 61]
```
⚠️ NumPy basic slices are VIEWS, not copies. Modifying a slice modifies the original array. Always use arr.copy() when you need an independent copy. This is the single most common NumPy bug.
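The trap in miniature:

```python
import numpy as np

a = np.arange(5)        # [0, 1, 2, 3, 4]
view = a[1:4]           # basic slice: a view into the same memory
view[0] = 99
print(a)                # [ 0 99  2  3  4] — the original changed!

b = np.arange(5)
safe = b[1:4].copy()    # independent copy
safe[0] = 99
print(b)                # [0 1 2 3 4] — unchanged
```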
Vectorised Operations and Broadcasting
Performance

Vectorised operations apply element-wise without Python loops — this is where NumPy's speed comes from.
```python
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

# Element-wise — no loops needed
print(a + b)       # [11, 22, 33, 44]
print(a * b)       # [10, 40, 90, 160]
print(a ** 2)      # [1, 4, 9, 16]
print(np.sqrt(a))  # [1.0, 1.41, 1.73, 2.0]

# Statistical functions
print(np.mean(a))  # 2.5
print(np.std(a))   # 1.118...
print(np.sum(a))   # 10
print(np.min(a), np.max(a), np.argmax(a))  # 1 4 3

# Broadcasting — the smaller array stretches to match the larger
matrix = np.ones((3, 4))
row = np.array([1, 2, 3, 4])  # shape (4,)
result = matrix + row         # row broadcast across all 3 rows
print(result)
# [[2, 3, 4, 5],
#  [2, 3, 4, 5],
#  [2, 3, 4, 5]]
```
Reshape, Stack and Linear Algebra
ML Prep

```python
# Reshape — change shape without changing data
a = np.arange(12)
print(a.reshape(3, 4))   # 3 rows, 4 cols
print(a.reshape(2, -1))  # -1 = infer (becomes 2×6)
print(a.flatten())       # back to 1D

# Transpose — swap rows and cols
m = np.array([[1, 2, 3], [4, 5, 6]])
print(m.T)  # shape (2, 3) → (3, 2)

# Stacking arrays
x = np.array([1, 2, 3])
y = np.array([4, 5, 6])
print(np.vstack([x, y]))  # [[1, 2, 3], [4, 5, 6]] — vertical
print(np.hstack([x, y]))  # [1, 2, 3, 4, 5, 6] — horizontal

# Matrix multiplication — critical for ML
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(A @ B)         # matrix multiply: [[19, 22], [43, 50]]
print(np.dot(A, B))  # equivalent
print(A * B)         # element-wise (NOT matrix multiply)
```
💡 Remember: @ is matrix multiplication (dot product). * is element-wise. This distinction is critical — using the wrong one in ML code produces silently wrong results.
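A small illustration of how silent the mistake is (made-up values; both lines run without error, but only the first is the matrix product):

```python
import numpy as np

W = np.array([[1.0, 2.0], [3.0, 4.0]])  # pretend weight matrix
X = np.array([[0.5, 0.5], [1.0, 1.0]])  # pretend batch of inputs

print(W @ X)  # [[2.5, 2.5], [5.5, 5.5]] — matrix product
print(W * X)  # [[0.5, 1.0], [3.0, 4.0]] — element-wise: same shape, wrong numbers
```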
Series and DataFrame Basics
Week 2

```python
import pandas as pd
import numpy as np

# Series — 1D labelled array (a single column)
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])   # 20
print(s.dtype)  # int64

# DataFrame — 2D table (rows × columns)
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "score": [85, 92, 78],
    "grade": ["B", "A", "C"]
})

# Loading from files
df = pd.read_csv("students.csv")
df = pd.read_json("data.json")
df = pd.read_excel("report.xlsx")

# First look at a dataset
print(df.head(5))     # first 5 rows
print(df.tail(3))     # last 3 rows
print(df.shape)       # (rows, cols)
print(df.dtypes)      # column types
print(df.info())      # types + non-null counts
print(df.describe())  # count, mean, std, min, quartiles, max
```
Indexing — .loc vs .iloc
Essential

```python
# .loc — label-based indexing (column names, index labels)
df.loc[0]                       # row with index label 0
df.loc[0, "name"]               # cell: row 0, column "name"
df.loc[0:2, ["name", "score"]]  # rows 0–2, two columns (endpoint INCLUSIVE)

# .iloc — position-based indexing (like NumPy)
df.iloc[0]        # first row
df.iloc[0, 1]     # row 0, column 1
df.iloc[0:3, :2]  # first 3 rows, first 2 cols (endpoint EXCLUSIVE)
df.iloc[-1]       # last row

# Boolean filtering — the most common pattern
high_scores = df[df["score"] > 80]
top_students = df[(df["score"] > 80) & (df["grade"] == "A")]

# Selecting columns
df["name"]             # returns a Series
df[["name", "score"]]  # returns a DataFrame with 2 columns
```
⚠️ .loc endpoint is INCLUSIVE, .iloc endpoint is EXCLUSIVE. df.loc[0:3] returns rows 0,1,2,3. df.iloc[0:3] returns rows 0,1,2. This trips up everyone coming from Python slicing.
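A quick sanity check you can run on any DataFrame with a default integer index:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40, 50]})
print(len(df.loc[0:3]))   # 4 — labels 0, 1, 2, 3 (inclusive)
print(len(df.iloc[0:3]))  # 3 — positions 0, 1, 2 (exclusive)
```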
Data Cleaning — The Real Work
Week 2–3

Real-world data is always messy. Expect to spend 60–70% of your time on any ML project cleaning data. Pandas has excellent tools for it.
```python
# Detecting and handling NaN (missing values)
print(df.isnull().sum())            # count NaN per column
print(df.isnull().sum() / len(df))  # fraction missing per column

df.dropna()                  # drop rows with ANY NaN
df.dropna(subset=["score"])  # drop only if score is NaN
df.dropna(thresh=3)          # keep rows with at least 3 non-NaN values
df.fillna(0)                 # fill all NaN with 0
df["score"].fillna(df["score"].mean())  # fill with the column mean
df["score"].ffill()          # forward fill (time series)

# Duplicates
df.duplicated().sum()                # count duplicate rows
df.drop_duplicates()                 # remove duplicate rows
df.drop_duplicates(subset=["name"])  # based on specific columns

# Type conversion
df["score"] = df["score"].astype(float)
df["date"] = pd.to_datetime(df["date"])
df["grade"] = df["grade"].astype("category")  # saves memory

# String cleaning
df["name"] = df["name"].str.strip().str.lower()
df["valid_email"] = df["email"].str.contains("@")  # boolean Series (new column, so emails aren't overwritten)
```
💡 Always use .copy() when creating a subset DataFrame. df_clean = df[df["score"] > 0].copy() — without .copy() you get a SettingWithCopyWarning and changes to df_clean may or may not affect the original. This is Pandas' most confusing behaviour.
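The same point as a runnable sketch (toy data; whether the warning fires can depend on your pandas version):

```python
import pandas as pd

df = pd.DataFrame({"score": [-5, 10, 20]})

bad = df[df["score"] > 0]          # may be a view — ambiguous
bad["score"] = 100                 # can trigger SettingWithCopyWarning

good = df[df["score"] > 0].copy()  # independent DataFrame
good["score"] = 100                # safe: df is untouched
```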
GroupBy — Split, Apply, Combine
Week 3

```python
# groupby() — the most powerful Pandas operation
# Pattern: split data into groups → apply a function → combine the results
df = pd.DataFrame({
    "city": ["Mumbai", "Delhi", "Mumbai", "Delhi", "Mumbai"],
    "sales": [100, 200, 150, 180, 120],
    "month": ["Jan", "Jan", "Feb", "Feb", "Mar"]
})

# Basic aggregations
df.groupby("city")["sales"].mean()   # mean sales per city
df.groupby("city")["sales"].sum()    # total sales per city
df.groupby("city")["sales"].count()  # number of records per city

# Multiple aggregations at once
df.groupby("city").agg({"sales": ["sum", "mean", "count"]})

# Group by multiple columns
df.groupby(["city", "month"])["sales"].sum()

# transform — broadcasts the group statistic back to the original rows
df["city_avg"] = df.groupby("city")["sales"].transform("mean")
df["pct_of_city"] = df["sales"] / df["city_avg"]
```
Merge, Join and Concat
Week 3

```python
# merge — SQL-style join on a key column
users = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Charlie"]})
orders = pd.DataFrame({"user_id": [1, 1, 2], "amount": [50, 75, 30]})
pd.merge(orders, users, left_on="user_id", right_on="id", how="left")
# how: "inner" (default), "left", "right", "outer"

# concat — stack DataFrames vertically or horizontally
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
full = pd.concat([train, test], ignore_index=True)  # vertical stack

# pivot_table — Excel-style pivot
pivot = df.pivot_table(
    values="sales",
    index="city",
    columns="month",
    aggfunc="sum",
    fill_value=0
)
```
Datetime Handling
Time Series

```python
# Parse dates on read
df = pd.read_csv("data.csv", parse_dates=["date"])

# Convert a string column to datetime
df["date"] = pd.to_datetime(df["date"])

# .dt accessor — extract components
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.day_name()  # "Monday", "Tuesday"...
df["quarter"] = df["date"].dt.quarter

# Rolling window — moving averages (e.g. a COVID 7-day rolling average)
df["rolling_7"] = df["cases"].rolling(window=7).mean()

# Resample — aggregate by time period ("M" = month-end; spelled "ME" in pandas ≥ 2.2)
df.set_index("date").resample("M")["sales"].sum()  # monthly totals
```
The Pandas Power Patterns — Memorise These
Must Know

These patterns appear in virtually every data science and ML project. Practise them until they are automatic.
```python
# 1. Boolean masking — the most common filtering pattern
df[df["age"] > 30]
df[(df["age"] > 30) & (df["city"] == "Mumbai")]
df[df["name"].isin(["Alice", "Bob"])]
df[~df["score"].isna()]  # ~ inverts a boolean mask

# 2. Chained operations — a readable pipeline
result = (df
    .dropna(subset=["score"])
    .query("score > 60")
    .groupby("city")["sales"]
    .mean()
    .sort_values(ascending=False)
    .head(10)
)

# 3. apply() with a lambda — transform column values
df["score_normalised"] = df["score"].apply(lambda x: (x - 50) / 50)
df["grade"] = df["score"].apply(
    lambda x: "A" if x >= 90 else "B" if x >= 80 else "C"
)

# 4. Always .copy() on subsets
df_clean = df[df["score"] > 0].copy()

# 5. pd.get_dummies — one-hot encoding (used in every ML project)
df_encoded = pd.get_dummies(df, columns=["city", "grade"])

# 6. value_counts — quick frequency distribution
df["city"].value_counts()
df["city"].value_counts(normalize=True)  # proportions

# 7. nunique — number of unique values per column
df.nunique()  # quick cardinality check before one-hot encoding
```
NumPy ↔ Pandas Interoperability
Integration

```python
# DataFrame → NumPy array (for Scikit-learn, PyTorch)
X = df[["age", "score", "income"]].values  # .values returns an ndarray
y = df["target"].to_numpy()                # explicit and preferred
print(X.shape)                             # (n_samples, n_features)

# NumPy array → DataFrame
arr = np.random.randn(100, 3)
df2 = pd.DataFrame(arr, columns=["x1", "x2", "x3"])

# Apply NumPy functions to Pandas columns
df["log_income"] = np.log(df["income"] + 1)  # log transform
df["z_score"] = (df["score"] - df["score"].mean()) / df["score"].std()

# The full ML data-prep pipeline:
# 1. Load with pd.read_csv
# 2. Clean with Pandas (drop NaN, fix types, remove duplicates)
# 3. Engineer features with Pandas + NumPy
# 4. Encode categoricals with pd.get_dummies or LabelEncoder
# 5. Convert to NumPy with .values or .to_numpy()
# 6. Pass to Scikit-learn / PyTorch
```
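The six steps above, compressed into one minimal sketch. The file name, column names, and target column are hypothetical placeholders:

```python
import numpy as np
import pandas as pd

# 1–2. Load and clean (hypothetical file and columns)
df = pd.read_csv("customers.csv")
df = df.dropna(subset=["income"]).drop_duplicates()
df["income"] = df["income"].astype(float)

# 3. Engineer a feature
df["log_income"] = np.log(df["income"] + 1)

# 4. One-hot encode categoricals
df = pd.get_dummies(df, columns=["city"])

# 5–6. Convert to NumPy arrays ready for Scikit-learn
X = df.drop(columns=["target"]).to_numpy()
y = df["target"].to_numpy()
```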
Performance — When Pandas Gets Slow
Production

- Never iterate over DataFrame rows with for loops — use vectorised operations, .apply(), or .map()
- Use the categorical dtype for string columns with low cardinality (e.g. city, grade) — can cut memory by 10× or more
- Read large CSVs in chunks — pd.read_csv(..., chunksize=10000) — for files that don't fit in RAM
- Use .query() for complex filters — often faster than boolean indexing on large DataFrames
- Avoid the object dtype — mixed types in a column force it to object dtype (slow). Fix types on load.
```python
# Slow — Python loop over rows (never do this)
for i, row in df.iterrows():
    df.at[i, "new_col"] = row["score"] * 2

# Fast — vectorised (often 1000× faster)
df["new_col"] = df["score"] * 2

# Check memory usage in MB
df.memory_usage(deep=True).sum() / 1024**2
```
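A sketch of the chunked-read pattern from the list above (file name, column names, and chunk size are illustrative):

```python
import pandas as pd

total_sales = 0
for chunk in pd.read_csv("huge_log.csv", chunksize=100_000):
    chunk["city"] = chunk["city"].astype("category")  # shrink string columns
    total_sales += chunk["sales"].sum()               # aggregate per chunk

print(total_sales)
```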
3-WEEK STRUCTURED PLAN
| Week | Topics | Daily Task / Mini-Project |
|---|---|---|
| Week 1 — NumPy | Install NumPy. ndarray creation: np.array, np.zeros, np.ones, np.arange, np.linspace, np.random. Array indexing, slicing (2D), boolean masking. Vectorised arithmetic — why no loops are needed. Broadcasting rules. NumPy math: mean, std, sum, dot, reshape, transpose. | Day 1–2: Compute statistics on a random student score array without any Python loops. Day 3–4: Implement matrix multiplication using np.dot — verify against a manual calculation. Day 5–7: Reshape a 1D sensor-data array into a 2D time-series matrix and extract windows. |
| Week 2 — Pandas Basics | Pandas Series vs DataFrame. pd.read_csv(), .head(), .info(), .describe(), .shape. Indexing: .loc[], .iloc[], boolean filtering. Handling NaN: .isnull(), .dropna(), .fillna(). Removing duplicates. Type conversion with .astype() and pd.to_datetime(). | Day 1–2: Load the COVID-19 dataset — write a 10-line "data health report" (shape, dtypes, null counts, value ranges). Day 3–4: Find and handle all missing values — document your strategy (drop vs fill) with justification. Day 5–7: Filter a real DataFrame on multiple conditions — export the result to a new CSV. |
| Week 3 — Pandas Advanced | groupby() — split-apply-combine pattern. .agg(), .transform(). Merging DataFrames: merge(), join(), concat(). Pivot tables: pd.pivot_table(). String operations: .str.lower(), .str.contains(), .str.replace(). Datetime: pd.to_datetime(), .dt.year, .dt.month. Rolling windows. | Day 1–2: Find the top 5 countries by total COVID cases using groupby + sort. Day 3–4: Merge two datasets on a common key — verify row counts before and after. Day 5–7: Full milestone project — COVID-19 Global Data Analysis (see the Projects tab). |
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Course | Kaggle Pandas Course (Free, Interactive) | Best hands-on Pandas. Exercises with instant feedback. Complete in Week 2. |
| Video | NumPy for Beginners — freeCodeCamp (YouTube) | Complete NumPy from scratch. Watch at start of Week 1. |
| Docs | Pandas Official Documentation — User Guide | Authoritative reference. "10 Minutes to Pandas" is a must-read in Week 2. |
| Video | Corey Schafer — Pandas Tutorials (YouTube Playlist) | Deep Pandas tutorials. Best for groupby and merge concepts. |
| Course | Kaggle NumPy Course (Free) | NumPy fundamentals with practice exercises. |
| Cheatsheet | Pandas Cheat Sheet — DataCamp (Free PDF) | Quick reference. Print and keep beside you during Week 2–3. |
FREE DATASETS FOR PRACTICE
| Type | Dataset | Practice Focus |
|---|---|---|
| Dataset | COVID-19 Global Dataset — Kaggle | Time-series, rolling averages, groupby, datetime handling — Milestone project |
| Dataset | Titanic Dataset — Kaggle (Classic) | Missing values, groupby, boolean filtering practice |
| Dataset | World Population Dataset — Kaggle | Merging, pivoting, multi-column groupby |
| Dataset | Netflix Shows Dataset — Kaggle | String operations, datetime handling, value_counts |
MILESTONE PROJECT
Use the COVID-19 global dataset to demonstrate the full NumPy + Pandas data pipeline. This project is your first real data analysis — the same workflow you will use on every ML project.
Requirements
- Load and inspect: shape, dtypes, null counts, value ranges per column
- Clean: handle NaN values, fix date column to datetime, remove duplicates
- Compute the rolling 7-day average of daily cases per country (see the sketch after this list)
- Find top 10 countries by total deaths-per-million (merge population data if needed)
- Identify months with the highest case surges using groupby + datetime
- Export a cleaned summary CSV with one row per country: total_cases, total_deaths, peak_month, rolling_avg_peak
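A sketch of the per-country rolling average, the trickiest requirement. The file name and column names ("country", "date", "new_cases") are assumptions; adjust them to the actual dataset schema:

```python
import pandas as pd

df = pd.read_csv("covid.csv")  # hypothetical file name — use the Kaggle dataset
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values(["country", "date"])

# Rolling 7-day mean computed within each country, not across the whole frame
df["rolling_7"] = (
    df.groupby("country")["new_cases"]
      .transform(lambda s: s.rolling(window=7).mean())
)
```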
Stretch Goals
- Compare case trajectories of 5 countries using a pivot table (country vs month)
- Detect the date of peak cases for each country programmatically
- Add a simple bar chart using Matplotlib (preview of Part 2 visualisation)
Skills: NumPy operations, Pandas cleaning, groupby, datetime, merge, rolling windows, export
Dataset: COVID-19 Global Dataset — Kaggle
MINI-PROJECTS
Generate a random 100×5 array of student scores (np.random.randint). Without any Python for/while loops: compute mean, std, min, max per subject; find students scoring above class average in every subject; normalise all scores to 0–1 range using vectorised operations only.
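A starter sketch for this one; the axis argument does all the per-subject work:

```python
import numpy as np

scores = np.random.randint(0, 101, size=(100, 5))  # 100 students × 5 subjects

print(scores.mean(axis=0))  # per-subject mean (one value per column)
print(scores.std(axis=0))   # per-subject standard deviation
print(scores.min(axis=0), scores.max(axis=0))

# Students above the class average in every subject
above_all = (scores > scores.mean(axis=0)).all(axis=1)
print(np.where(above_all)[0])  # their row indices

# Min–max normalise each subject to the 0–1 range, fully vectorised
mins, maxs = scores.min(axis=0), scores.max(axis=0)
normalised = (scores - mins) / (maxs - mins)
```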
Load the Titanic dataset. Produce a printed report covering: (1) overall shape and column types, (2) null percentage per column with fill strategy recommendation, (3) survival rate by sex, class, and embarked port using groupby, (4) age distribution statistics, (5) output a cleaned version with nulls handled.
NumPy Vectorisation — Measure the Speedup
Objective: Directly measure why NumPy vectorised operations replace Python loops in all ML code.
1. Create a Python list of one million random floats and its NumPy counterpart: import random; py_list = [random.random() for _ in range(1_000_000)] and np_arr = np.array(py_list).
2. Time squaring the list with a comprehension: import time; t = time.time(); result = [x**2 for x in py_list]; print(time.time() - t).
3. Time the vectorised version: t = time.time(); result = np_arr ** 2; print(time.time() - t). Record the ratio — it should be 10–100× faster.
4. Views vs copies: create a = np.arange(10), then b = a[2:5]. Modify b[0] = 99. Print a. Now do the same with b = a[2:5].copy() and repeat. Document what you observe.
5. Vectorised conditionals: np.where(arr > 0.5, "high", "low"). Apply this to classify 1000 random scores as pass/fail without any Python loop.
Pandas Data Investigation Pipeline

Objective: Build a reusable function that generates a data quality report for any DataFrame — a tool you will use on every future project.
1. Load the Titanic dataset: import pandas as pd; df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv").
2. Write a function data_report(df: pd.DataFrame) -> None that prints: shape, dtypes, null count + percentage per column, numeric column statistics (mean, std, min, max), and the top 5 value_counts for each categorical column.
3. Compute pairwise correlations with df.select_dtypes(include=np.number).corr(). Print the top 5 feature pairs with the highest absolute correlation.
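One possible shape for data_report, as a minimal sketch to build on rather than a finished tool:

```python
import numpy as np
import pandas as pd

def data_report(df: pd.DataFrame) -> None:
    """Print a quick data-quality summary for any DataFrame."""
    print(f"Shape: {df.shape}\n")
    print(f"Dtypes:\n{df.dtypes}\n")

    # Null count and percentage per column
    nulls = df.isnull().sum()
    pct = (nulls / len(df) * 100).round(1)
    print("Nulls per column (count, %):")
    print(pd.DataFrame({"count": nulls, "pct": pct}), "\n")

    # Numeric column statistics
    print("Numeric summary:")
    print(df.select_dtypes(include=np.number).agg(["mean", "std", "min", "max"]), "\n")

    # Top categories per text column
    for col in df.select_dtypes(include="object"):
        print(f"Top values in {col}:\n{df[col].value_counts().head(5)}\n")
```

Usage: call data_report(df) on the Titanic frame from step 1; it should work unchanged on any DataFrame you load in later modules.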
The SettingWithCopyWarning — Understand It Once and For All

Objective: Understand Pandas' most confusing behaviour — the difference between views and copies — so it never silently breaks your code.
df = pd.DataFrame({"a": [1,2,3,4,5], "b": [10,20,30,40,50]}). Then: subset = df[df["a"] > 2]. Try subset["b"] = 99. Observe the warning.subset = df[df["a"] > 2].copy(). Repeat subset["b"] = 99. Check df — did it change? Confirm that .copy() creates an independent object.df.loc[df["a"] > 2, "b"] = 99. This modifies df directly without warnings. Print df to confirm.P1-M02 MASTERY CHECKLIST
- Can create NumPy arrays using np.array, np.zeros, np.ones, np.arange, np.linspace, np.random.randn
- Know the difference between array view and array copy — and when each is created
- Can perform element-wise operations on arrays without any Python loops
- Can explain broadcasting with a concrete example (adding a row vector to a matrix)
- Know the difference between @ (matrix multiply) and * (element-wise) — and why it matters
- Can load a CSV into a DataFrame and immediately inspect it with .head(), .info(), .describe()
- Know the difference between .loc[] (label-based) and .iloc[] (position-based) — including endpoint behaviour
- Can filter a DataFrame with boolean masking using AND (&) and OR (|) conditions
- Can handle NaN values — know when to drop vs fill and which fill strategy to use
- Can perform a groupby aggregation and describe what split-apply-combine means
- Can merge two DataFrames on a key column using inner, left, right, and outer joins
- Can parse a date column with pd.to_datetime and extract year, month, weekday using .dt accessor
- Always use .copy() when creating a subset DataFrame to avoid SettingWithCopyWarning
- Can convert a clean DataFrame to a NumPy array with .values or .to_numpy() for Scikit-learn input
- Completed Lab 1: NumPy vectorisation speedup measurement
- Completed Lab 2: Pandas data investigation pipeline on Titanic dataset
- Completed Lab 3: Understood SettingWithCopyWarning
- Milestone project pushed to GitHub with README
✅ When complete: Move to P1-M03 — Developer Essentials (Git, CLI, APIs & Async). The JSON and CSV skills you built here connect directly to calling REST APIs and parsing their responses in M03.