What This Module Covers
RAG Layer 2
Chunking is the most underestimated part of RAG. A beautiful embedding model and a fast vector DB will still produce terrible retrieval if your chunks are poorly designed. Chunks that are too large dilute the embedding signal. Chunks that are too small lose context. Chunks that break sentences mid-way confuse the LLM. This module teaches you to get it right.
- Chunking strategies — fixed-size, recursive, semantic, document-aware, agentic — when each is appropriate
- Overlap — why you need it and how much to use
- LangChain text splitters — the standard toolkit for chunking
- Document loaders — extracting clean text from PDF, DOCX, HTML, Markdown, code
- Metadata enrichment — adding source, page, section, headings to every chunk for better filtering
- Full ingestion pipeline — load → clean → chunk → embed → store, as a reusable class
Why Chunking Quality Determines RAG Quality
Motivation
# The chunking problem:
#
# DOCUMENT: 10,000 token technical manual about DPDK
#
# BAD: chunk = entire document
#   → embedding averages over everything → signal diluted
#   → 10,000 tokens fills context → too expensive
#
# BAD: chunk = 20 tokens (half a sentence)
#   → embedding has no context → meaningless
#   → "The ring buffer" has no meaning without surrounding text
#
# GOOD: chunk = 300-500 tokens (2-4 paragraphs on one topic)
#   → embedding captures a complete idea
#   → LLM gets enough context to answer
#   → small enough for high precision retrieval

# The overlap problem:
# Without overlap — answers that span chunk boundaries are lost
#   "The mempool must be... [CHUNK BOUNDARY] ...initialised before the port"
#   → Neither chunk contains the complete fact
#
# With 10-20% overlap — boundary-spanning content appears in both chunks
#   → At least one chunk retrieved will contain the complete answer
The Five Chunking Strategies
Decision Framework
Fixed-Size
Split by token or character count, regardless of content structure. Simple and predictable; a minimal token-based sketch follows this framework.
Recursive
Try to split on paragraph breaks, then sentences, then words — preserves natural boundaries when possible.
Semantic
Measure embedding similarity between consecutive sentences — split where similarity drops (topic change).
Document-Aware
Split on structural markers: headings in Markdown, sections in code, HTML tags, PDF pages.
Agentic
LLM decides how to chunk — generates chunk boundaries and summaries. Highest quality, highest cost.
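To make the simplest strategy concrete before the LangChain tooling below, here is a minimal fixed-size splitter sketch, assuming only the tiktoken package. Notice how often its chunks end mid-sentence; that is exactly the weakness the recursive and semantic strategies address.

import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 300, overlap: int = 30) -> list[str]:
    """Hard-cut the token stream every chunk_size tokens, keeping a small overlap."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        start += chunk_size - overlap   # step forward, repeating `overlap` tokens at each boundary
    return chunks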
Chunk Size Guidelines
Calibration
| Chunk Size | Tokens (approx) | Best For | Risk |
|---|---|---|---|
| Tiny | 50–100 | Keyword-heavy fact retrieval | No context — LLM gets fragments |
| Small ✓ | 200–400 | Q&A, facts, customer support | May miss multi-paragraph answers |
| Medium ✓ | 400–800 | Technical docs, general RAG | Good default — balanced precision/recall |
| Large | 800–1500 | Long-form summaries, analysis | Embedding signal diluted, slow search |
| Whole doc | >1500 | Do not use for RAG | Precision collapse — everything matches |
💡 The golden rule: the chunk should be the smallest unit that can fully answer a likely query. If users ask "What is the DPDK mempool?" — the chunk should contain the complete mempool explanation, not just one sentence about it. Test empirically: try chunk sizes 256, 512, 1024 and measure retrieval precision on real queries.
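A rough sketch of that empirical test, assuming a plain-text corpus you supply and a handful of hand-written (query, expected snippet) pairs. It uses Chroma's default local embedding function for brevity; swap in your production embedder for a faithful comparison.

import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter

client = chromadb.Client()                       # in-memory, default local embedder
corpus_text = open("corpus.txt").read()          # placeholder: your documents
test_queries = [("What is the DPDK mempool?", "fixed-size object allocator")]  # placeholder pairs

for size in (256, 512, 1024):
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        model_name="gpt-4", chunk_size=size, chunk_overlap=size // 10
    )
    chunks = splitter.split_text(corpus_text)
    col = client.get_or_create_collection(f"eval_{size}")
    col.add(ids=[str(i) for i in range(len(chunks))], documents=chunks)

    top3_hits = 0
    for query, expected in test_queries:
        docs = col.query(query_texts=[query], n_results=3)["documents"][0]
        top3_hits += any(expected.lower() in d.lower() for d in docs)
    print(f"chunk_size={size}: top-3 hit rate {top3_hits}/{len(test_queries)}")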
Overlap — How Much?
# Overlap = how many tokens repeat between adjacent chunks
# Rule of thumb: 10–20% of chunk size

chunk_size = 500   # tokens
overlap = 50       # tokens — 10% overlap

# Chunk 1: tokens 0-500
# Chunk 2: tokens 450-950   (50 token overlap)
# Chunk 3: tokens 900-1400  (50 token overlap)

# Too little overlap (0): boundary-spanning answers lost
# Too much overlap (50%): doubles storage, slows indexing, redundant retrieval
# Sweet spot: 50-100 tokens for chunk_size=500
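To sanity-check a chunk_size / overlap pair before ingesting anything, a small helper like the hypothetical chunk_windows function below (not part of any library) reproduces the window arithmetic above for an arbitrary document length.

def chunk_windows(total_tokens: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return the (start, end) token offsets produced by a given size/overlap pair."""
    step = chunk_size - overlap
    windows, start = [], 0
    while start < total_tokens:
        end = min(start + chunk_size, total_tokens)
        windows.append((start, end))
        if end == total_tokens:   # reached the end of the document, so stop (no trailing stub)
            break
        start += step
    return windows

print(chunk_windows(1400, chunk_size=500, overlap=50))
# [(0, 500), (450, 950), (900, 1400)]   (matches the example above)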
LangChain Text Splitters — The Standard Toolkit
Production Tools
pip install langchain langchain-text-splitters tiktoken
from langchain_text_splitters import (
RecursiveCharacterTextSplitter,
CharacterTextSplitter,
MarkdownHeaderTextSplitter,
PythonCodeTextSplitter,
TokenTextSplitter,
)
# ── 1. RecursiveCharacterTextSplitter — your default ──
# Tries to split on: \n\n, \n, " ", "" in that order
# Produces naturally bounded chunks (paragraphs, then sentences)
splitter = RecursiveCharacterTextSplitter(
chunk_size=500, # max characters per chunk
chunk_overlap=50, # characters of overlap
length_function=len, # use len(str) — swap for token counter
separators=["\n\n", "\n", " ", ""] # priority order
)
chunks = splitter.split_text(long_text)
print(f"{len(chunks)} chunks, avg length: {sum(len(c) for c in chunks)//len(chunks)}")
# ── 2. Token-based splitting (recommended for LLM context) ──
# Characters are misleading — tokens are what the LLM actually counts
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
model_name="gpt-4", # tokeniser to use
chunk_size=400, # max TOKENS per chunk
chunk_overlap=40 # TOKENS of overlap
)
chunks = splitter.split_text(long_text)
# ── 3. Document splitting — preserves metadata ───────
from langchain_core.documents import Document
docs = [Document(page_content=text, metadata={"source": "dpdk_guide.pdf", "page": 1})]
split_docs = splitter.split_documents(docs)
# Each chunk keeps metadata from parent document
print(split_docs[0].metadata)  # {"source": "dpdk_guide.pdf", "page": 1}
Structure-Aware Splitters
Document-Aware
# ── Markdown — split on headers ───────────────────────
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_docs = md_splitter.split_text(markdown_text)
# Each doc has metadata: {"h1": "DPDK Guide", "h2": "Memory Management"}
# This lets you filter by section during retrieval

# Then apply size-based splitting to large sections
secondary_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
final_chunks = secondary_splitter.split_documents(md_docs)

# ── Python code — split on function/class boundaries ──
python_splitter = PythonCodeTextSplitter(chunk_size=1000, chunk_overlap=0)
code_chunks = python_splitter.split_text(python_source_code)
# Splits at: class def, def, comments, then fallback to characters

# ── Custom separators for any format ──────────────────
# C/C++ code
cpp_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\nvoid ", "\nstatic ", "\nint ", "\n", " "],
    chunk_size=800, chunk_overlap=80
)
# RST documentation
rst_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n\n", "\n\n", ".. ", "\n"],
    chunk_size=500, chunk_overlap=50
)
Semantic Chunking — Split on Topic Changes
Highest Quality
pip install langchain-experimental

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# SemanticChunker splits where embedding similarity drops
# → natural topic boundaries, not arbitrary character counts
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # "percentile" | "standard_deviation" | "interquartile"
    breakpoint_threshold_amount=95           # split where the similarity drop is in the top 5%
)
chunks = semantic_splitter.split_text(long_text)

# Trade-offs vs recursive:
# ✓ Better topical coherence — each chunk covers one idea
# ✓ Variable chunk sizes adapt to content
# ✗ 1 API embedding call per sentence — expensive for large docs
# ✗ Slower — not suitable for real-time ingestion
# ✓ Best for: high-value static corpora, legal/medical docs
⚠️ Semantic chunking makes an embedding API call for every sentence. For a 100-page PDF (~5000 sentences), that is 5000 embedding calls before you even start indexing. Use it for small, high-value corpora where retrieval quality matters more than ingestion speed or cost.
Loading Documents — PDF, DOCX, HTML, Markdown
Source Agnostic
pip install pymupdf python-docx beautifulsoup4 unstructured

# ── PDF — PyMuPDF (fastest, best quality) ────────────
import fitz  # PyMuPDF

def load_pdf(path: str) -> list[dict]:
    """Load PDF, return list of {text, page, source} dicts."""
    doc = fitz.open(path)
    pages = []
    for page_num, page in enumerate(doc):
        text = page.get_text("text")   # plain text extraction
        if text.strip():               # skip blank pages
            pages.append({
                "text": text,
                "page": page_num + 1,
                "source": path
            })
    doc.close()
    return pages

# ── DOCX ─────────────────────────────────────────────
from docx import Document as DocxDocument

def load_docx(path: str) -> str:
    doc = DocxDocument(path)
    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
    return "\n\n".join(paragraphs)

# ── HTML ─────────────────────────────────────────────
from bs4 import BeautifulSoup
import requests

def load_html(url: str) -> str:
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    # Remove noise elements
    for tag in soup.find_all(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

# ── Markdown ─────────────────────────────────────────
from pathlib import Path

def load_markdown(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")

# ── Directory loader — batch ingest ──────────────────
def load_directory(dir_path: str, extensions: tuple[str, ...] = (".txt", ".md", ".pdf")) -> list[dict]:
    docs = []
    for path in Path(dir_path).rglob("*"):
        if path.suffix in extensions and path.is_file():
            try:
                if path.suffix == ".pdf":
                    docs.extend(load_pdf(str(path)))
                else:
                    text = path.read_text(encoding="utf-8", errors="ignore")
                    docs.append({"text": text, "source": str(path)})
            except Exception as e:
                print(f"Failed to load {path}: {e}")
    return docs
Text Cleaning — Remove Noise Before Chunking
Quality Gate
import re
def clean_text(text: str) -> str:
"""Clean extracted text before chunking."""
# Remove excessive whitespace
    text = re.sub(r'[ \t]+', ' ', text)     # collapse runs of spaces/tabs (keep newlines for paragraph splits)
text = re.sub(r'\n{3,}', '\n\n', text) # max 2 consecutive newlines
# Remove PDF artefacts (page numbers, headers, footers)
text = re.sub(r'\nPage \d+ of \d+\n', '\n', text)
text = re.sub(r'\n\d+\n', '\n', text) # standalone page numbers
    # Replace non-printable and non-ASCII characters (note: this also strips accented characters)
    text = re.sub(r'[^\x20-\x7E\n]', ' ', text)
# Remove URLs if not relevant
# text = re.sub(r'https?://\S+', '', text)
return text.strip()
def is_noise_chunk(chunk: str, min_words: int = 10) -> bool:
"""Return True if the chunk is too short or mostly noise to be useful."""
words = chunk.split()
if len(words) < min_words:
return True
# High ratio of non-alphabetic chars = likely table/figure noise
alpha_ratio = sum(1 for c in chunk if c.isalpha()) / max(len(chunk), 1)
if alpha_ratio < 0.4:
return True
return False
# Filter after chunking
chunks = [c for c in raw_chunks if not is_noise_chunk(c)]
Unstructured — Universal Document Parser
Production Grade
pip install "unstructured[all-docs]"

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Auto-detect format and extract elements
# Works for: PDF, DOCX, HTML, PPTX, XLSX, EML, images (with OCR)
elements = partition(filename="document.pdf")

# Each element has a type and metadata
for elem in elements[:5]:
    print(f"{elem.category:15} | {str(elem)[:60]}")
# Title           | DPDK Programmer's Guide
# NarrativeText   | This guide explains the Data Plane Development Kit
# Table           | | Feature | Status | Notes |
# Image           | [Image: figure1.png]

# Chunk by section title — respects document structure
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=800,
    combine_text_under_n_chars=200
)

# Convert to dicts for ingestion
for chunk in chunks:
    text = str(chunk)
    meta = chunk.metadata.to_dict()
    # meta contains: filename, page_number, url, coordinates, etc.
💡 Use Unstructured when document quality matters more than speed. It extracts tables as structured data, ignores headers/footers intelligently, handles multi-column PDFs, and preserves heading hierarchy. The free version handles most formats; the hosted API handles scanned PDFs with OCR.
Metadata — The Secret Weapon of Good RAG
Often Neglected
Every chunk should carry rich metadata. Metadata enables filtering (only search recent docs), post-retrieval validation (show source), and attribution (cite the page). The chunks stored in your vector DB are only as useful as their metadata.
# Minimum metadata per chunk
chunk_metadata = {
    "source": "dpdk-programmers-guide-v23.pdf",
    "source_type": "pdf",        # pdf | html | docx | md | code
    "chunk_idx": 42,             # position in document
    "char_count": 487,
}

# Good metadata per chunk (for serious RAG)
chunk_metadata = {
    "source": "dpdk-programmers-guide-v23.pdf",
    "source_type": "pdf",
    "page": 47,
    "section": "Memory Management",
    "subsection": "Mempool Library",
    "chunk_idx": 42,
    "total_chunks": 380,
    "ingested_at": "2024-03-15T10:30:00Z",
    "doc_version": "23.11",
    "language": "en",
    "token_count": 412,
}

# For web content
web_chunk_metadata = {
    "url": "https://doc.dpdk.org/guides/prog_guide/mempool_lib.html",
    "title": "Mempool Library — DPDK documentation",
    "scraped_at": "2024-03-15",
    "domain": "doc.dpdk.org",
    "section": "Programmer's Guide",
}
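As a sketch of what this metadata buys you at query time, here is how Chroma's where filters might use the fields above; collection is assumed to be an existing collection whose chunks carry this metadata.

# Only search chunks that came from PDFs
results = collection.query(
    query_texts=["How do I initialise a mempool?"],
    n_results=5,
    where={"source_type": "pdf"},
)

# Combine filters: one manual section, one specific doc version
results = collection.query(
    query_texts=["mempool cache size tuning"],
    n_results=5,
    where={"$and": [
        {"section": "Memory Management"},
        {"doc_version": "23.11"},
    ]},
)

# The same metadata drives attribution in the final answer
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(f"[{meta['source']} p.{meta['page']}] {doc[:80]}...")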
Contextual Retrieval — LLM-Generated Chunk Summaries
Anthropic Technique
Anthropic published a technique (2024) that dramatically improves retrieval: before embedding each chunk, prepend a short LLM-generated summary that situates the chunk within the full document. This gives the embedding model more context to work with.
# Contextual Retrieval — add document context to each chunk before embedding
# Cost: 1 cheap LLM call per chunk (use Haiku). Quality gain: significant.
import asyncio
from anthropic import AsyncAnthropic

haiku_client = AsyncAnthropic()   # reads ANTHROPIC_API_KEY from the environment

CONTEXT_PROMPT = """<document>
{full_document}
</document>

The chunk below is part of this document. Write a short 1-2 sentence context
that situates this chunk within the document. Focus on what section this is
from and what concept it explains.

<chunk>
{chunk}
</chunk>

Context:"""

async def add_context(chunk: str, full_doc: str) -> str:
    """Prepend LLM-generated context to chunk before embedding."""
    response = await haiku_client.messages.create(
        model="claude-3-haiku-20240307",   # cheap fast model
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(full_document=full_doc[:3000], chunk=chunk)
        }]
    )
    context = response.content[0].text.strip()
    return f"{context}\n\n{chunk}"   # context-enriched chunk ready to embed

# Apply to all chunks before embedding
async def enrich_chunks(chunks: list[str], full_doc: str) -> list[str]:
    return await asyncio.gather(*[add_context(c, full_doc) for c in chunks])
💡 This technique is worth the cost. Anthropic reported 49% reduction in retrieval failures on their benchmarks. A chunk saying "This section covers DPDK mempool initialisation. The ring buffer..." retrieves far better than a bare chunk starting mid-explanation without context.
Complete Ingestion Pipeline — Production Class
Reusable
import asyncio, hashlib, json, os
from datetime import datetime
from pathlib import Path
from dataclasses import dataclass, field
from typing import Optional
import chromadb
from chromadb.utils import embedding_functions
from langchain_text_splitters import RecursiveCharacterTextSplitter
@dataclass
class IngestionConfig:
chunk_size: int = 500
chunk_overlap: int = 50
min_chunk_len: int = 100 # discard shorter chunks
embedding_model: str = "text-embedding-3-small"
collection_name: str = "documents"
chroma_path: str = "./chroma_db"
add_context: bool = False # enable LLM context enrichment
class DocumentIngestionPipeline:
def __init__(self, config: IngestionConfig = IngestionConfig()):
self.config = config
self.splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
model_name="gpt-4",
chunk_size=config.chunk_size,
chunk_overlap=config.chunk_overlap,
)
self.chroma = chromadb.PersistentClient(path=config.chroma_path)
self.ef = embedding_functions.OpenAIEmbeddingFunction(
api_key=os.environ["OPENAI_API_KEY"],
model_name=config.embedding_model
)
self.collection = self.chroma.get_or_create_collection(
name=config.collection_name,
embedding_function=self.ef,
metadata={"hnsw:space": "cosine"}
)
def _doc_id(self, text: str, meta: dict) -> str:
"""Stable ID based on content hash — prevents duplicate ingestion."""
key = json.dumps({"text": text, "source": meta.get("source", "")})
return hashlib.md5(key.encode()).hexdigest()
def ingest_text(self, text: str, metadata: dict = {}) -> int:
"""Ingest a single text string. Returns number of chunks added."""
# Clean
text = clean_text(text)
if not text.strip():
return 0
# Chunk
chunks = self.splitter.split_text(text)
chunks = [c for c in chunks if len(c) >= self.config.min_chunk_len]
if not chunks:
return 0
# Build IDs, documents, metadatas
ids, docs, metas = [], [], []
for i, chunk in enumerate(chunks):
chunk_meta = {
**metadata,
"chunk_idx": i,
"total_chunks": len(chunks),
"char_count": len(chunk),
"ingested_at": datetime.utcnow().isoformat(),
}
ids.append(self._doc_id(chunk, chunk_meta))
docs.append(chunk)
metas.append(chunk_meta)
        # Add to ChromaDB: upsert overwrites entries with the same content-hash ID, so re-ingesting identical content never creates duplicates
self.collection.upsert(ids=ids, documents=docs, metadatas=metas)
return len(chunks)
def ingest_file(self, path: str) -> int:
"""Auto-detect file type and ingest."""
path = Path(path)
meta = {"source": str(path), "filename": path.name}
if path.suffix == ".pdf":
total = 0
for page in load_pdf(str(path)):
total += self.ingest_text(page["text"], {**meta, "page": page["page"]})
return total
elif path.suffix == ".docx":
text = load_docx(str(path))
elif path.suffix in (".md", ".txt"):
text = path.read_text(encoding="utf-8")
else:
raise ValueError(f"Unsupported file type: {path.suffix}")
return self.ingest_text(text, meta)
def ingest_directory(self, dir_path: str) -> dict:
"""Ingest all supported files in a directory."""
results = {"files": 0, "chunks": 0, "errors": []}
for path in Path(dir_path).rglob("*"):
if path.suffix in (".pdf", ".docx", ".md", ".txt") and path.is_file():
try:
n = self.ingest_file(str(path))
results["chunks"] += n
results["files"] += 1
except Exception as e:
results["errors"].append({"file": str(path), "error": str(e)})
return results
    def query(self, text: str, n_results: int = 5, where: Optional[dict] = None) -> list[dict]:
"""Semantic search — returns list of {text, score, metadata}."""
kwargs = {"query_texts": [text], "n_results": n_results,
"include": ["documents", "distances", "metadatas"]}
if where:
kwargs["where"] = where
results = self.collection.query(**kwargs)
return [
{"text": doc, "score": 1 - dist, "meta": meta}
for doc, dist, meta in zip(
results["documents"][0],
results["distances"][0],
results["metadatas"][0]
)
        ]
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Docs | LangChain Text Splitters — python.langchain.com/docs/concepts/text_splitters | Complete reference for all LangChain splitter types with code examples. |
| Article | Anthropic: Contextual Retrieval — anthropic.com/news/contextual-retrieval | Anthropic's technique for LLM-enriched chunks. Shows 49% fewer retrieval failures. |
| Article | Unstructured: Chunking for RAG Best Practices — unstructured.io/blog | Production-tested chunking strategies including chunk size, overlap, and structure awareness. |
| Article | Weaviate: Chunking Strategies for RAG — weaviate.io/blog | Covers fixed, recursive, and semantic chunking with visual diagrams. |
| Library | Unstructured.io Docs — docs.unstructured.io | Universal document parser. Handles PDF, DOCX, HTML, PPTX with intelligent element extraction. |
MILESTONE PROJECT
Build the reusable DocumentIngestionPipeline class and empirically compare three chunking strategies to understand when each is best.
Part A — Build the Pipeline
- Implement the full DocumentIngestionPipeline class from the Complete Ingestion Pipeline section above
- Support: PDF, DOCX, Markdown, plain text ingestion
- Embed with OpenAI text-embedding-3-small, store in ChromaDB
- Track: files ingested, chunks created, errors, total tokens used
- Ingest at least 20 real documents from any domain you care about (a driver sketch follows this list)
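A sketch of how the finished pipeline might be driven for Part A; the module name, directory path and query are placeholders.

from ingestion import DocumentIngestionPipeline, IngestionConfig   # hypothetical module name

pipeline = DocumentIngestionPipeline(IngestionConfig(chunk_size=500, chunk_overlap=50))
report = pipeline.ingest_directory("./corpus")                     # your 20+ documents
print(f"{report['files']} files, {report['chunks']} chunks, {len(report['errors'])} errors")

for hit in pipeline.query("How does the mempool cache work?", n_results=3):
    print(f"{hit['score']:.3f}  {hit['meta'].get('source')}  {hit['text'][:80]}...")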
Part B — Compare Chunking Strategies
- Take 3 long documents and chunk them three ways: fixed-size (500), recursive (500/50), semantic
- For each strategy, create a separate ChromaDB collection
- Write 10 test queries. For each query, check: (a) does the top-1 chunk contain the answer? (b) do the top-3 contain it? (c) is the returned chunk complete or does it cut off mid-sentence? A scoring harness sketch follows this list.
- Document which strategy works best and why for your document type
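A rough scoring harness sketch for Part B; the collection names and the (query, expected substring) pairs are placeholders, and you should pass the same embedding function used at ingest time if it was not Chroma's default.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
test_queries = [
    ("What does the mempool cache do?", "per-core cache"),
    # ...nine more hand-written pairs
]

for name in ("fixed_500", "recursive_500_50", "semantic"):
    col = client.get_collection(name)        # add embedding_function=... if non-default
    top1 = top3 = truncated = 0
    for query, expected in test_queries:
        docs = col.query(query_texts=[query], n_results=3)["documents"][0]
        top1 += expected.lower() in docs[0].lower()
        top3 += any(expected.lower() in d.lower() for d in docs)
        truncated += not docs[0].rstrip().endswith((".", "!", "?"))   # crude mid-sentence check
    n = len(test_queries)
    print(f"{name:20} top-1 {top1}/{n}   top-3 {top3}/{n}   cut-off top hits {truncated}/{n}")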
Skills: LangChain splitters, PyMuPDF, ChromaDB, batch embedding, metadata design, empirical evaluation
Chunking Parameter Sensitivity — Find the Sweet Spot
Objective: Discover how chunk size and overlap affect retrieval quality on your documents.
Metadata Filtering — See the Quality Jump
Objective: Demonstrate that metadata filtering dramatically improves precision when your collection has multiple sources.
Contextual Retrieval — Measure the Improvement
Objective: Implement Anthropic's contextual retrieval technique and measure how much it improves retrieval.
P5-M16 MASTERY CHECKLIST
- Can explain why chunk size critically affects retrieval quality — too large dilutes signal, too small loses context
- Know the 5 chunking strategies and when to use each: fixed, recursive, semantic, document-aware, agentic
- Can explain why overlap is needed and choose an appropriate overlap amount (10-20% of chunk size)
- Always use RecursiveCharacterTextSplitter as the default — never plain CharacterTextSplitter for prose
- Can use from_tiktoken_encoder() to split by tokens rather than characters
- Can use MarkdownHeaderTextSplitter to preserve section hierarchy as metadata
- Can implement semantic chunking and know when the cost is justified
- Can load clean text from PDF (PyMuPDF), DOCX, HTML, and Markdown
- Can clean extracted text: remove page numbers, excessive whitespace, non-printable characters
- Can filter noise chunks that are too short or low in alphabetic content
- Know what Unstructured.io does and when to use it over simple loaders
- Include rich metadata with every chunk: source, page, section, chunk_idx, ingested_at, token_count
- Can implement the Anthropic contextual retrieval technique and explain the quality tradeoff
- Can build a complete DocumentIngestionPipeline that handles load → clean → chunk → embed → store
- Use content-hash IDs (MD5) to prevent duplicate chunk ingestion
- Completed Lab 1: chunk size sensitivity experiment with scoring
- Completed Lab 2: metadata filtering precision comparison
- Completed Lab 3: contextual retrieval quality measurement
- Milestone project pushed to GitHub with README and chunking comparison findings
✅ When complete: Move to P5-M17 — Retrieval Quality. You now have a solid ingestion pipeline. M17 covers how to improve what comes back from that pipeline: filtering, reranking with Cohere, HyDE, and diagnosing retrieval failures.