What This Module Covers
RAG Layer 2
Chunking is the most underestimated part of RAG. A beautiful embedding model and a fast vector DB will still produce terrible retrieval if your chunks are poorly designed. Chunks that are too large dilute the embedding signal. Chunks that are too small lose context. Chunks that break sentences mid-way confuse the LLM. This module teaches you to get it right.
- Chunking strategies — fixed-size, recursive, semantic, document-aware, agentic — when each is appropriate
- Overlap — why you need it and how much to use
- LangChain text splitters — the standard toolkit for chunking
- Document loaders — extracting clean text from PDF, DOCX, HTML, Markdown, code
- Metadata enrichment — adding source, page, section, headings to every chunk for better filtering
- Full ingestion pipeline — load → clean → chunk → embed → store, as a reusable class
Why Chunking Quality Determines RAG Quality
Motivation
# The chunking problem:
#
# DOCUMENT: 10,000 token technical manual about DPDK
#
# BAD: chunk = entire document
#   → embedding averages over everything → signal diluted
#   → 10,000 tokens fills context → too expensive
#
# BAD: chunk = 20 tokens (half a sentence)
#   → embedding has no context → meaningless
#   → "The ring buffer" has no meaning without surrounding text
#
# GOOD: chunk = 300-500 tokens (2-4 paragraphs on one topic)
#   → embedding captures a complete idea
#   → LLM gets enough context to answer
#   → small enough for high precision retrieval

# The overlap problem:
# Without overlap — answers that span chunk boundaries are lost
#   "The mempool must be... [CHUNK BOUNDARY] ...initialised before the port"
#   → Neither chunk contains the complete fact
#
# With 10-20% overlap — boundary-spanning content appears in both chunks
#   → At least one chunk retrieved will contain the complete answer
The Five Chunking Strategies
Decision Framework
Fixed-Size
Split by token or character count, regardless of content structure. Simple and predictable; a minimal token-based sketch follows this framework.
Recursive
Try to split on paragraph breaks, then sentences, then words — preserves natural boundaries when possible.
Semantic
Measure embedding similarity between consecutive sentences — split where similarity drops (topic change).
Document-Aware
Split on structural markers: headings in Markdown, sections in code, HTML tags, PDF pages.
Agentic
LLM decides how to chunk — generates chunk boundaries and summaries. Highest quality, highest cost.
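To make the simplest strategy concrete before the LangChain tooling below, here is a minimal fixed-size splitter sketch, assuming only the tiktoken package. Notice how often its chunks end mid-sentence; that is exactly the weakness the recursive and semantic strategies address.

import tiktoken

def fixed_size_chunks(text: str, chunk_size: int = 300, overlap: int = 30) -> list[str]:
    """Hard-cut the token stream every chunk_size tokens, keeping a small overlap."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        start += chunk_size - overlap   # step forward, repeating `overlap` tokens at each boundary
    return chunks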
Chunk Size Guidelines
Calibration
| Chunk Size | Tokens (approx) | Best For | Risk |
|---|---|---|---|
| Tiny | 50–100 | Keyword-heavy fact retrieval | No context — LLM gets fragments |
| Small ✓ | 200–400 | Q&A, facts, customer support | May miss multi-paragraph answers |
| Medium ✓ | 400–800 | Technical docs, general RAG | Good default — balanced precision/recall |
| Large | 800–1500 | Long-form summaries, analysis | Embedding signal diluted, slow search |
| Whole doc | >1500 | Do not use for RAG | Precision collapse — everything matches |
💡 The golden rule: the chunk should be the smallest unit that can fully answer a likely query. If users ask "What is the DPDK mempool?" — the chunk should contain the complete mempool explanation, not just one sentence about it. Test empirically: try chunk sizes 256, 512, 1024 and measure retrieval precision on real queries.
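A rough sketch of that empirical test, assuming a plain-text corpus you supply and a handful of hand-written (query, expected snippet) pairs. It uses Chroma's default local embedding function for brevity; swap in your production embedder for a faithful comparison.

import chromadb
from langchain_text_splitters import RecursiveCharacterTextSplitter

client = chromadb.Client()                       # in-memory, default local embedder
corpus_text = open("corpus.txt").read()          # placeholder: your documents
test_queries = [("What is the DPDK mempool?", "fixed-size object allocator")]  # placeholder pairs

for size in (256, 512, 1024):
    splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        model_name="gpt-4", chunk_size=size, chunk_overlap=size // 10
    )
    chunks = splitter.split_text(corpus_text)
    col = client.get_or_create_collection(f"eval_{size}")
    col.add(ids=[str(i) for i in range(len(chunks))], documents=chunks)

    top3_hits = 0
    for query, expected in test_queries:
        docs = col.query(query_texts=[query], n_results=3)["documents"][0]
        top3_hits += any(expected.lower() in d.lower() for d in docs)
    print(f"chunk_size={size}: top-3 hit rate {top3_hits}/{len(test_queries)}")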
Overlap — How Much?
# Overlap = how many tokens repeat between adjacent chunks
# Rule of thumb: 10–20% of chunk size

chunk_size = 500   # tokens
overlap = 50       # tokens — 10% overlap

# Chunk 1: tokens 0-500
# Chunk 2: tokens 450-950   (50 token overlap)
# Chunk 3: tokens 900-1400  (50 token overlap)

# Too little overlap (0): boundary-spanning answers lost
# Too much overlap (50%): doubles storage, slows indexing, redundant retrieval
# Sweet spot: 50-100 tokens for chunk_size=500
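To sanity-check a chunk_size / overlap pair before ingesting anything, a small helper like the hypothetical chunk_windows function below (not part of any library) reproduces the window arithmetic above for an arbitrary document length.

def chunk_windows(total_tokens: int, chunk_size: int, overlap: int) -> list[tuple[int, int]]:
    """Return the (start, end) token offsets produced by a given size/overlap pair."""
    step = chunk_size - overlap
    windows, start = [], 0
    while start < total_tokens:
        end = min(start + chunk_size, total_tokens)
        windows.append((start, end))
        if end == total_tokens:   # reached the end of the document, so stop (no trailing stub)
            break
        start += step
    return windows

print(chunk_windows(1400, chunk_size=500, overlap=50))
# [(0, 500), (450, 950), (900, 1400)]   (matches the example above)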
LangChain Text Splitters — The Standard Toolkit
Production Tools
pip install langchain langchain-text-splitters tiktoken
from langchain_text_splitters import (
RecursiveCharacterTextSplitter,
CharacterTextSplitter,
MarkdownHeaderTextSplitter,
PythonCodeTextSplitter,
TokenTextSplitter,
)
# ── 1. RecursiveCharacterTextSplitter — your default ──
# Tries to split on: \n\n, \n, " ", "" in that order
# Produces naturally bounded chunks (paragraphs, then sentences)
splitter = RecursiveCharacterTextSplitter(
chunk_size=500, # max characters per chunk
chunk_overlap=50, # characters of overlap
length_function=len, # use len(str) — swap for token counter
separators=["\n\n", "\n", " ", ""] # priority order
)
chunks = splitter.split_text(long_text)
print(f"{len(chunks)} chunks, avg length: {sum(len(c) for c in chunks)//len(chunks)}")
# ── 2. Token-based splitting (recommended for LLM context) ──
# Characters are misleading — tokens are what the LLM actually counts
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
model_name="gpt-4", # tokeniser to use
chunk_size=400, # max TOKENS per chunk
chunk_overlap=40 # TOKENS of overlap
)
chunks = splitter.split_text(long_text)
# ── 3. Document splitting — preserves metadata ───────
from langchain_core.documents import Document
docs = [Document(page_content=text, metadata={"source": "dpdk_guide.pdf", "page": 1})]
split_docs = splitter.split_documents(docs)
# Each chunk keeps metadata from parent document
print(split_docs[0].metadata)  # {"source": "dpdk_guide.pdf", "page": 1}
Structure-Aware Splitters
Document-Aware
# ── Markdown — split on headers ───────────────────────
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers_to_split_on = [
    ("#", "h1"),
    ("##", "h2"),
    ("###", "h3"),
]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_docs = md_splitter.split_text(markdown_text)
# Each doc has metadata: {"h1": "DPDK Guide", "h2": "Memory Management"}
# This lets you filter by section during retrieval

# Then apply size-based splitting to large sections
secondary_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
final_chunks = secondary_splitter.split_documents(md_docs)

# ── Python code — split on function/class boundaries ──
python_splitter = PythonCodeTextSplitter(chunk_size=1000, chunk_overlap=0)
code_chunks = python_splitter.split_text(python_source_code)
# Splits at: class def, def, comments, then fallback to characters

# ── Custom separators for any format ──────────────────
# C/C++ code
cpp_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\nvoid ", "\nstatic ", "\nint ", "\n", " "],
    chunk_size=800, chunk_overlap=80
)
# RST documentation
rst_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n\n", "\n\n", ".. ", "\n"],
    chunk_size=500, chunk_overlap=50
)
Semantic Chunking — Split on Topic Changes
Highest Quality
pip install langchain-experimental

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# SemanticChunker splits where embedding similarity drops
# → natural topic boundaries, not arbitrary character counts
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
semantic_splitter = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile",  # "percentile" | "standard_deviation" | "interquartile"
    breakpoint_threshold_amount=95           # split where the similarity drop is in the top 5%
)
chunks = semantic_splitter.split_text(long_text)

# Trade-offs vs recursive:
# ✓ Better topical coherence — each chunk covers one idea
# ✓ Variable chunk sizes adapt to content
# ✗ 1 API embedding call per sentence — expensive for large docs
# ✗ Slower — not suitable for real-time ingestion
# ✓ Best for: high-value static corpora, legal/medical docs
⚠️ Semantic chunking makes an embedding API call for every sentence. For a 100-page PDF (~5000 sentences), that is 5000 embedding calls before you even start indexing. Use it for small, high-value corpora where retrieval quality matters more than ingestion speed or cost.
Loading Documents — PDF, DOCX, HTML, Markdown
Source Agnostic
pip install pymupdf python-docx beautifulsoup4 unstructured

# ── PDF — PyMuPDF (fastest, best quality) ────────────
import fitz  # PyMuPDF

def load_pdf(path: str) -> list[dict]:
    """Load PDF, return list of {text, page, source} dicts."""
    doc = fitz.open(path)
    pages = []
    for page_num, page in enumerate(doc):
        text = page.get_text("text")   # plain text extraction
        if text.strip():               # skip blank pages
            pages.append({
                "text": text,
                "page": page_num + 1,
                "source": path
            })
    doc.close()
    return pages

# ── DOCX ─────────────────────────────────────────────
from docx import Document as DocxDocument

def load_docx(path: str) -> str:
    doc = DocxDocument(path)
    paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
    return "\n\n".join(paragraphs)

# ── HTML ─────────────────────────────────────────────
from bs4 import BeautifulSoup
import requests

def load_html(url: str) -> str:
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    # Remove noise elements
    for tag in soup.find_all(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

# ── Markdown ─────────────────────────────────────────
from pathlib import Path

def load_markdown(path: str) -> str:
    return Path(path).read_text(encoding="utf-8")

# ── Directory loader — batch ingest ──────────────────
def load_directory(dir_path: str, extensions: tuple[str, ...] = (".txt", ".md", ".pdf")) -> list[dict]:
    docs = []
    for path in Path(dir_path).rglob("*"):
        if path.suffix in extensions and path.is_file():
            try:
                if path.suffix == ".pdf":
                    docs.extend(load_pdf(str(path)))
                else:
                    text = path.read_text(encoding="utf-8", errors="ignore")
                    docs.append({"text": text, "source": str(path)})
            except Exception as e:
                print(f"Failed to load {path}: {e}")
    return docs
Text Cleaning — Remove Noise Before Chunking
Quality Gate
import re
def clean_text(text: str) -> str:
"""Clean extracted text before chunking."""
# Remove excessive whitespace
    text = re.sub(r'[ \t]+', ' ', text)     # collapse runs of spaces/tabs (keep newlines for paragraph splits)
text = re.sub(r'\n{3,}', '\n\n', text) # max 2 consecutive newlines
# Remove PDF artefacts (page numbers, headers, footers)
text = re.sub(r'\nPage \d+ of \d+\n', '\n', text)
text = re.sub(r'\n\d+\n', '\n', text) # standalone page numbers
    # Replace non-printable and non-ASCII characters (note: this also strips accented characters)
    text = re.sub(r'[^\x20-\x7E\n]', ' ', text)
# Remove URLs if not relevant
# text = re.sub(r'https?://\S+', '', text)
return text.strip()
def is_noise_chunk(chunk: str, min_words: int = 10) -> bool:
"""Return True if the chunk is too short or mostly noise to be useful."""
words = chunk.split()
if len(words) < min_words:
return True
# High ratio of non-alphabetic chars = likely table/figure noise
alpha_ratio = sum(1 for c in chunk if c.isalpha()) / max(len(chunk), 1)
if alpha_ratio < 0.4:
return True
return False
# Filter after chunking
chunks = [c for c in raw_chunks if not is_noise_chunk(c)]
Unstructured — Universal Document Parser
Production Grade
pip install "unstructured[all-docs]"

from unstructured.partition.auto import partition
from unstructured.chunking.title import chunk_by_title

# Auto-detect format and extract elements
# Works for: PDF, DOCX, HTML, PPTX, XLSX, EML, images (with OCR)
elements = partition(filename="document.pdf")

# Each element has a type and metadata
for elem in elements[:5]:
    print(f"{elem.category:15} | {str(elem)[:60]}")
# Title           | DPDK Programmer's Guide
# NarrativeText   | This guide explains the Data Plane Development Kit
# Table           | | Feature | Status | Notes |
# Image           | [Image: figure1.png]

# Chunk by section title — respects document structure
chunks = chunk_by_title(
    elements,
    max_characters=1500,
    new_after_n_chars=800,
    combine_text_under_n_chars=200
)

# Convert to dicts for ingestion
for chunk in chunks:
    text = str(chunk)
    meta = chunk.metadata.to_dict()
    # meta contains: filename, page_number, url, coordinates, etc.
💡 Use Unstructured when document quality matters more than speed. It extracts tables as structured data, ignores headers/footers intelligently, handles multi-column PDFs, and preserves heading hierarchy. The free version handles most formats; the hosted API handles scanned PDFs with OCR.
Metadata — The Secret Weapon of Good RAG
Often Neglected
Every chunk should carry rich metadata. Metadata enables filtering (only search recent docs), post-retrieval validation (show source), and attribution (cite the page). The chunks stored in your vector DB are only as useful as their metadata.
# Minimum metadata per chunk
chunk_metadata = {
    "source": "dpdk-programmers-guide-v23.pdf",
    "source_type": "pdf",        # pdf | html | docx | md | code
    "chunk_idx": 42,             # position in document
    "char_count": 487,
}

# Good metadata per chunk (for serious RAG)
chunk_metadata = {
    "source": "dpdk-programmers-guide-v23.pdf",
    "source_type": "pdf",
    "page": 47,
    "section": "Memory Management",
    "subsection": "Mempool Library",
    "chunk_idx": 42,
    "total_chunks": 380,
    "ingested_at": "2024-03-15T10:30:00Z",
    "doc_version": "23.11",
    "language": "en",
    "token_count": 412,
}

# For web content
web_chunk_metadata = {
    "url": "https://doc.dpdk.org/guides/prog_guide/mempool_lib.html",
    "title": "Mempool Library — DPDK documentation",
    "scraped_at": "2024-03-15",
    "domain": "doc.dpdk.org",
    "section": "Programmer's Guide",
}
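As a sketch of what this metadata buys you at query time, here is how Chroma's where filters might use the fields above; collection is assumed to be an existing collection whose chunks carry this metadata.

# Only search chunks that came from PDFs
results = collection.query(
    query_texts=["How do I initialise a mempool?"],
    n_results=5,
    where={"source_type": "pdf"},
)

# Combine filters: one manual section, one specific doc version
results = collection.query(
    query_texts=["mempool cache size tuning"],
    n_results=5,
    where={"$and": [
        {"section": "Memory Management"},
        {"doc_version": "23.11"},
    ]},
)

# The same metadata drives attribution in the final answer
for doc, meta in zip(results["documents"][0], results["metadatas"][0]):
    print(f"[{meta['source']} p.{meta['page']}] {doc[:80]}...")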
Contextual Retrieval — LLM-Generated Chunk Summaries
Anthropic Technique
Anthropic published a technique (2024) that dramatically improves retrieval: before embedding each chunk, prepend a short LLM-generated summary that situates the chunk within the full document. This gives the embedding model more context to work with.
# Contextual Retrieval — add document context to each chunk before embedding
# Cost: 1 cheap LLM call per chunk (use Haiku). Quality gain: significant.
import asyncio
from anthropic import AsyncAnthropic

haiku_client = AsyncAnthropic()   # reads ANTHROPIC_API_KEY from the environment

CONTEXT_PROMPT = """<document>
{full_document}
</document>

The chunk below is part of this document. Write a short 1-2 sentence context
that situates this chunk within the document. Focus on what section this is
from and what concept it explains.

<chunk>
{chunk}
</chunk>

Context:"""

async def add_context(chunk: str, full_doc: str) -> str:
    """Prepend LLM-generated context to chunk before embedding."""
    response = await haiku_client.messages.create(
        model="claude-3-haiku-20240307",   # cheap fast model
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(full_document=full_doc[:3000], chunk=chunk)
        }]
    )
    context = response.content[0].text.strip()
    return f"{context}\n\n{chunk}"   # context-enriched chunk ready to embed

# Apply to all chunks before embedding
async def enrich_chunks(chunks: list[str], full_doc: str) -> list[str]:
    return await asyncio.gather(*[add_context(c, full_doc) for c in chunks])
💡 This technique is worth the cost. Anthropic reported 49% reduction in retrieval failures on their benchmarks. A chunk saying "This section covers DPDK mempool initialisation. The ring buffer..." retrieves far better than a bare chunk starting mid-explanation without context.
Complete Ingestion Pipeline — Production Class
Reusable
import asyncio, hashlib, json, os
from datetime import datetime
from pathlib import Path
from dataclasses import dataclass, field
from typing import Optional
import chromadb
from chromadb.utils import embedding_functions
from langchain_text_splitters import RecursiveCharacterTextSplitter
@dataclass
class IngestionConfig:
chunk_size: int = 500
chunk_overlap: int = 50
min_chunk_len: int = 100 # discard shorter chunks
embedding_model: str = "text-embedding-3-small"
collection_name: str = "documents"
chroma_path: str = "./chroma_db"
add_context: bool = False # enable LLM context enrichment
class DocumentIngestionPipeline:
def __init__(self, config: IngestionConfig = IngestionConfig()):
self.config = config
self.splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
model_name="gpt-4",
chunk_size=config.chunk_size,
chunk_overlap=config.chunk_overlap,
)
self.chroma = chromadb.PersistentClient(path=config.chroma_path)
self.ef = embedding_functions.OpenAIEmbeddingFunction(
api_key=os.environ["OPENAI_API_KEY"],
model_name=config.embedding_model
)
self.collection = self.chroma.get_or_create_collection(
name=config.collection_name,
embedding_function=self.ef,
metadata={"hnsw:space": "cosine"}
)
def _doc_id(self, text: str, meta: dict) -> str:
"""Stable ID based on content hash — prevents duplicate ingestion."""
key = json.dumps({"text": text, "source": meta.get("source", "")})
return hashlib.md5(key.encode()).hexdigest()
def ingest_text(self, text: str, metadata: dict = {}) -> int:
"""Ingest a single text string. Returns number of chunks added."""
# Clean
text = clean_text(text)
if not text.strip():
return 0
# Chunk
chunks = self.splitter.split_text(text)
chunks = [c for c in chunks if len(c) >= self.config.min_chunk_len]
if not chunks:
return 0
# Build IDs, documents, metadatas
ids, docs, metas = [], [], []
for i, chunk in enumerate(chunks):
chunk_meta = {
**metadata,
"chunk_idx": i,
"total_chunks": len(chunks),
"char_count": len(chunk),
"ingested_at": datetime.utcnow().isoformat(),
}
ids.append(self._doc_id(chunk, chunk_meta))
docs.append(chunk)
metas.append(chunk_meta)
        # Add to ChromaDB: upsert overwrites entries with the same content-hash ID, so re-ingesting identical content never creates duplicates
self.collection.upsert(ids=ids, documents=docs, metadatas=metas)
return len(chunks)
def ingest_file(self, path: str) -> int:
"""Auto-detect file type and ingest."""
path = Path(path)
meta = {"source": str(path), "filename": path.name}
if path.suffix == ".pdf":
total = 0
for page in load_pdf(str(path)):
total += self.ingest_text(page["text"], {**meta, "page": page["page"]})
return total
elif path.suffix == ".docx":
text = load_docx(str(path))
elif path.suffix in (".md", ".txt"):
text = path.read_text(encoding="utf-8")
else:
raise ValueError(f"Unsupported file type: {path.suffix}")
return self.ingest_text(text, meta)
def ingest_directory(self, dir_path: str) -> dict:
"""Ingest all supported files in a directory."""
results = {"files": 0, "chunks": 0, "errors": []}
for path in Path(dir_path).rglob("*"):
if path.suffix in (".pdf", ".docx", ".md", ".txt") and path.is_file():
try:
n = self.ingest_file(str(path))
results["chunks"] += n
results["files"] += 1
except Exception as e:
results["errors"].append({"file": str(path), "error": str(e)})
return results
    def query(self, text: str, n_results: int = 5, where: Optional[dict] = None) -> list[dict]:
"""Semantic search — returns list of {text, score, metadata}."""
kwargs = {"query_texts": [text], "n_results": n_results,
"include": ["documents", "distances", "metadatas"]}
if where:
kwargs["where"] = where
results = self.collection.query(**kwargs)
return [
{"text": doc, "score": 1 - dist, "meta": meta}
for doc, dist, meta in zip(
results["documents"][0],
results["distances"][0],
results["metadatas"][0]
)
        ]
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Docs | LangChain Text Splitters — python.langchain.com/docs/concepts/text_splitters | Complete reference for all LangChain splitter types with code examples. |
| Article | Anthropic: Contextual Retrieval — anthropic.com/news/contextual-retrieval | Anthropic's technique for LLM-enriched chunks. Shows 49% fewer retrieval failures. |
| Article | Unstructured: Chunking for RAG Best Practices — unstructured.io/blog | Production-tested chunking strategies including chunk size, overlap, and structure awareness. |
| Article | Weaviate: Chunking Strategies for RAG — weaviate.io/blog | Covers fixed, recursive, and semantic chunking with visual diagrams. |
| Library | Unstructured.io Docs — docs.unstructured.io | Universal document parser. Handles PDF, DOCX, HTML, PPTX with intelligent element extraction. |
MILESTONE PROJECT
Build the reusable DocumentIngestionPipeline class and empirically compare three chunking strategies to understand when each is best.
Part A — Build the Pipeline
- Implement the full DocumentIngestionPipeline class from the Complete Ingestion Pipeline section above
- Support: PDF, DOCX, Markdown, plain text ingestion
- Embed with OpenAI text-embedding-3-small, store in ChromaDB
- Track: files ingested, chunks created, errors, total tokens used
- Ingest at least 20 real documents from any domain you care about (a driver sketch follows this list)
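A sketch of how the finished pipeline might be driven for Part A; the module name, directory path and query are placeholders.

from ingestion import DocumentIngestionPipeline, IngestionConfig   # hypothetical module name

pipeline = DocumentIngestionPipeline(IngestionConfig(chunk_size=500, chunk_overlap=50))
report = pipeline.ingest_directory("./corpus")                     # your 20+ documents
print(f"{report['files']} files, {report['chunks']} chunks, {len(report['errors'])} errors")

for hit in pipeline.query("How does the mempool cache work?", n_results=3):
    print(f"{hit['score']:.3f}  {hit['meta'].get('source')}  {hit['text'][:80]}...")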
Part B — Compare Chunking Strategies
- Take 3 long documents and chunk them three ways: fixed-size (500), recursive (500/50), semantic
- For each strategy, create a separate ChromaDB collection
- Write 10 test queries. For each query, check: (a) does the top-1 chunk contain the answer? (b) do the top-3 contain it? (c) is the returned chunk complete or does it cut off mid-sentence? A scoring harness sketch follows this list.
- Document which strategy works best and why for your document type
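A rough scoring harness sketch for Part B; the collection names and the (query, expected substring) pairs are placeholders, and you should pass the same embedding function used at ingest time if it was not Chroma's default.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
test_queries = [
    ("What does the mempool cache do?", "per-core cache"),
    # ...nine more hand-written pairs
]

for name in ("fixed_500", "recursive_500_50", "semantic"):
    col = client.get_collection(name)        # add embedding_function=... if non-default
    top1 = top3 = truncated = 0
    for query, expected in test_queries:
        docs = col.query(query_texts=[query], n_results=3)["documents"][0]
        top1 += expected.lower() in docs[0].lower()
        top3 += any(expected.lower() in d.lower() for d in docs)
        truncated += not docs[0].rstrip().endswith((".", "!", "?"))   # crude mid-sentence check
    n = len(test_queries)
    print(f"{name:20} top-1 {top1}/{n}   top-3 {top3}/{n}   cut-off top hits {truncated}/{n}")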
Skills: LangChain splitters, PyMuPDF, ChromaDB, batch embedding, metadata design, empirical evaluation
Chunking Parameter Sensitivity — Find the Sweet Spot
Objective: Discover how chunk size and overlap affect retrieval quality on your documents.
Metadata Filtering — See the Quality Jump
Objective: Demonstrate that metadata filtering dramatically improves precision when your collection has multiple sources.
Contextual Retrieval — Measure the Improvement
Objective: Implement Anthropic's contextual retrieval technique and measure how much it improves retrieval.
P5-M16 MASTERY CHECKLIST
- Can explain why chunk size critically affects retrieval quality — too large dilutes signal, too small loses context
- Know the 5 chunking strategies and when to use each: fixed, recursive, semantic, document-aware, agentic
- Can explain why overlap is needed and choose an appropriate overlap amount (10-20% of chunk size)
- Always use RecursiveCharacterTextSplitter as the default — never plain CharacterTextSplitter for prose
- Can use from_tiktoken_encoder() to split by tokens rather than characters
- Can use MarkdownHeaderTextSplitter to preserve section hierarchy as metadata
- Can implement semantic chunking and know when the cost is justified
- Can load clean text from PDF (PyMuPDF), DOCX, HTML, and Markdown
- Can clean extracted text: remove page numbers, excessive whitespace, non-printable characters
- Can filter noise chunks that are too short or low in alphabetic content
- Know what Unstructured.io does and when to use it over simple loaders
- Include rich metadata with every chunk: source, page, section, chunk_idx, ingested_at, token_count
- Can implement the Anthropic contextual retrieval technique and explain the quality tradeoff
- Can build a complete DocumentIngestionPipeline that handles load → clean → chunk → embed → store
- Use content-hash IDs (MD5) to prevent duplicate chunk ingestion
- Completed Lab 1: chunk size sensitivity experiment with scoring
- Completed Lab 2: metadata filtering precision comparison
- Completed Lab 3: contextual retrieval quality measurement
- Milestone project pushed to GitHub with README and chunking comparison findings
✅ When complete: Move to P5-M17 — Retrieval Quality. You now have a solid ingestion pipeline. M17 covers how to improve what comes back from that pipeline: filtering, reranking with Cohere, HyDE, and diagnosing retrieval failures.