What This Module Covers
RAG Foundation
RAG (Retrieval-Augmented Generation) lets LLMs answer questions about your own documents. The foundation of RAG is embeddings — mathematical representations of text that capture meaning — and vector databases that store and search them efficiently. This module teaches you everything you need to build the retrieval layer.
- Embeddings — what they are, how they encode meaning, why similar texts produce similar vectors
- Embedding models — OpenAI text-embedding-3, Cohere embed, HuggingFace sentence-transformers
- Similarity metrics — cosine similarity, dot product, Euclidean distance — when to use each
- Vector databases — ChromaDB, Pinecone, Qdrant, pgvector, FAISS — how to choose
- Indexing and querying — adding documents, querying by semantic similarity, filtering with metadata
- Embedding costs and performance — batch embedding, caching, model selection
Where This Fits in RAG
Architecture
```
# The full RAG pipeline — this module covers the RETRIEVAL box
#
# INDEXING (offline):                RETRIEVAL (online, per query):
#
#   Documents                          User Question
#       ↓                                  ↓
#   Chunking (M16)                     Embed question (this module)
#       ↓                                  ↓
#   Embed chunks (this module)         Search vector DB (this module)
#       ↓                                  ↓
#   Store in Vector DB (this)          Top-K chunks returned
#                                          ↓
#                                      Reranking (M17)
#                                          ↓
#                                      LLM generates answer (M18)
```
What Are Embeddings?
Concept First
An embedding is a list of floating-point numbers (a vector) that represents the meaning of a piece of text. The embedding model maps semantically similar texts to nearby points in vector space — so "dog" and "canine" are close together, but "dog" and "database" are far apart.
💡 The key insight: you never look at the actual numbers. The magic is that vector distance corresponds to semantic similarity. Two passages about the same topic will have similar vectors even if they use completely different words — enabling semantic search that keyword search cannot match.
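To make this concrete before the API walkthrough below, here is a minimal sketch using the free sentence-transformers model covered later in this module (the example words and the exact scores are illustrative, not benchmarks):

```python
# Minimal sketch: semantic similarity with a free local model
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode(["dog", "canine", "database"])

# Cosine similarity: "dog" vs "canine" scores far higher than "dog" vs "database"
print(util.cos_sim(vecs[0], vecs[1]).item())   # high — related meanings
print(util.cos_sim(vecs[0], vecs[2]).item())   # low  — unrelated meanings
```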
Generating Embeddings — OpenAI, Cohere, HuggingFace
Code
```python
# pip install openai cohere sentence-transformers

# ── OpenAI Embeddings ─────────────────────────────────
from openai import OpenAI

client = OpenAI()

def embed_openai(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

# Single text
vec = embed_openai(["What is DPDK?"])[0]
print(f"Dimensions: {len(vec)}")   # 1536 for text-embedding-3-small

# Batch — much more efficient (one API call for many texts)
docs = [
    "DPDK is a packet processing framework",
    "VPP runs on DPDK for high-performance networking",
    "Machine learning uses gradient descent",
]
vecs = embed_openai(docs)   # 3 embeddings, 1 API call

# ── Cohere Embeddings ─────────────────────────────────
import cohere

co = cohere.Client()   # COHERE_API_KEY from environment
response = co.embed(
    texts=docs,
    model="embed-english-v3.0",
    input_type="search_document",   # "search_document" for indexing, "search_query" for queries
)
vecs = response.embeddings   # list of lists

# ── HuggingFace Sentence Transformers (free, local) ───
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384 dims, fast, free
vecs = model.encode(docs, show_progress_bar=True)  # numpy arrays
print(vecs.shape)   # (3, 384)

# Better quality, slower:
model = SentenceTransformer("BAAI/bge-large-en-v1.5")   # 1024 dims, SOTA free model
```
| Model | Dims | Cost | Best For |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02/1M tokens | Default choice — great quality, cheap |
| text-embedding-3-large | 3072 | $0.13/1M tokens | Highest quality, higher cost |
| embed-english-v3.0 (Cohere) | 1024 | $0.10/1M tokens | Best with Cohere reranker (M17) |
| all-MiniLM-L6-v2 | 384 | Free (local) | Prototyping, offline, no API cost |
| BAAI/bge-large-en-v1.5 | 1024 | Free (local) | Best free model quality — production with GPU |
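One detail worth knowing when comparing these models: the text-embedding-3 family accepts a `dimensions` parameter that truncates the output vector, trading a little quality for smaller storage and faster search. A quick sketch (512 is an illustrative value, not a recommendation):

```python
# Sketch: shortened embeddings via the dimensions parameter (text-embedding-3 models only)
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    input=["What is DPDK?"],
    model="text-embedding-3-small",
    dimensions=512,   # default is 1536; smaller vectors = cheaper storage, faster search
)
print(len(response.data[0].embedding))   # 512
```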
Embedding Best Practices
Production
```python
# 1. Always batch — never embed one text at a time in a loop

# BAD: 1000 API calls
vecs = [embed_openai([text])[0] for text in texts]

# GOOD: 1 API call per batch (up to 2048 texts per call)
# Batch into chunks of 500 to stay within API limits
def embed_batch(texts: list[str], batch_size: int = 500) -> list[list[float]]:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(input=batch, model="text-embedding-3-small")
        all_embeddings.extend([item.embedding for item in response.data])
    return all_embeddings

# 2. Cache embeddings — never re-embed the same text twice
import hashlib, json, sqlite3

def cached_embed(text: str) -> list[float]:
    key = hashlib.md5(text.encode()).hexdigest()
    with sqlite3.connect("embeddings.db") as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vec TEXT)")
        row = conn.execute("SELECT vec FROM cache WHERE key=?", (key,)).fetchone()
        if row:
            return json.loads(row[0])
        vec = embed_openai([text])[0]
        conn.execute("INSERT INTO cache VALUES (?,?)", (key, json.dumps(vec)))
        return vec

# 3. Use the right input_type (Cohere only)
#    Documents being indexed: input_type="search_document"
#    User queries:            input_type="search_query"
#    Using the wrong type degrades retrieval quality

# 4. Normalise embeddings before cosine similarity (optional but consistent)
import numpy as np

def normalise(vec: list[float]) -> list[float]:
    arr = np.array(vec)
    return (arr / np.linalg.norm(arr)).tolist()
```
Similarity Metrics — Cosine, Dot Product, Euclidean
Core Math
```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Angle between vectors. Range: -1 to 1. 1 = identical direction."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a: list[float], b: list[float]) -> float:
    """Dot product. Equivalent to cosine if vectors are normalised."""
    return float(np.dot(np.array(a), np.array(b)))

def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance. Lower = more similar. Range: 0 to inf."""
    return float(np.linalg.norm(np.array(a) - np.array(b)))

# Demonstrate: semantically similar texts should be close
vecs = embed_openai([
    "DPDK is a fast packet processing framework",
    "FD.io DPDK provides high-speed networking",
    "Machine learning uses gradient descent optimisation",
])
print(cosine_similarity(vecs[0], vecs[1]))   # ~0.91 — very similar
print(cosine_similarity(vecs[0], vecs[2]))   # ~0.18 — very different
```

| Metric | Range | More Similar = | Use When |
|---|---|---|---|
| Cosine Similarity | -1 to 1 | Higher (→ 1) | Default for text. Ignores vector magnitude. |
| Dot Product | −∞ to ∞ | Higher | When vectors are normalised (= cosine). Fastest. |
| Euclidean Distance | 0 to ∞ | Lower (→ 0) | Image embeddings, when magnitude matters. |
💡 Use cosine similarity for text embeddings by default. OpenAI recommends it for text-embedding-3 models. Most vector databases default to cosine. Dot product is equivalent and faster when vectors are L2-normalised — many embedding models output normalised vectors.
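A quick numeric check of that claim: after L2 normalisation, the dot product of two vectors equals their cosine similarity. A self-contained sketch with random stand-in vectors:

```python
# Sketch: for L2-normalised vectors, dot product == cosine similarity
import numpy as np

a = np.random.rand(1536)
b = np.random.rand(1536)
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot_normalised = float(np.dot(a_n, b_n))
print(abs(cosine - dot_normalised) < 1e-9)   # True — the two metrics coincide
```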
Brute-Force vs ANN — How Vector DBs Search at Scale
Performance
```python
# Brute-force: compare query to EVERY stored vector
# O(n × d) — works fine for < 100k vectors, slow for millions
def brute_force_search(query_vec, stored_vecs, top_k=5):
    scores = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(stored_vecs)]
    scores.sort(reverse=True)
    return scores[:top_k]

# ANN (Approximate Nearest Neighbor) — index structures for fast search
#   HNSW  (Hierarchical Navigable Small World) — used by ChromaDB, Qdrant, Weaviate
#   IVF   (Inverted File Index)                — used by FAISS
#   ANNOY                                      — used by Spotify, disk-friendly

# ANN trade-off: slightly approximate results, but 100-1000x faster
# In practice: ANN accuracy is > 99% with the right parameters

# Rule of thumb:
#   < 100k vectors:  brute force is fine (ChromaDB default)
#   100k - 10M:      HNSW index (Qdrant, Weaviate)
#   > 10M:           FAISS IVF or a managed service (Pinecone)
```
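If you want to see an HNSW index outside of a full vector database, the standalone hnswlib package (not otherwise used in this module — treat this as an optional aside) exposes the same idea in a few lines. The parameter values below are common defaults, not tuned numbers:

```python
# Optional sketch: a raw HNSW index with hnswlib (pip install hnswlib)
import hnswlib
import numpy as np

dim = 384
data = np.random.rand(10_000, dim).astype("float32")   # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=10_000, ef_construction=200, M=16)
index.add_items(data, ids=np.arange(10_000))

index.set_ef(50)   # higher ef = better recall, slower queries
labels, distances = index.knn_query(data[:1], k=5)   # approximate top-5 neighbours
print(labels[0], distances[0])
```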
Choosing a Vector Database
Decision Guide
ChromaDB
Zero-setup local vector DB. In-memory or persisted. Perfect for prototyping and small-scale RAG. No server needed.
Pinecone
Fully managed, serverless. Free tier. Best for production with no infra overhead. Up to billions of vectors.
Qdrant
Best open-source production option. Rich filtering, HNSW, Rust performance. Self-host or use Qdrant Cloud.
pgvector
Vector search inside PostgreSQL. Best when your data is already in Postgres. No new infra to manage.
FAISS
Meta's vector similarity library. Not a DB — needs wrapping. Fastest raw search for large in-memory indexes.
| Use Case | Recommended | Why |
|---|---|---|
| Prototype / learning | ChromaDB | pip install, no server, works in 5 lines |
| Production (managed) | Pinecone | No infra, scales to billions, SLA |
| Production (self-hosted) | Qdrant | Best OSS quality, rich filters, Docker deploy |
| Already using Postgres | pgvector | Reuse existing DB, ACID, familiar SQL |
| Max performance (large scale) | FAISS | Fastest raw search, GPU support |
ChromaDB — Start Here for Every RAG Project
Prototype to Production
```python
# pip install chromadb openai
import os
import chromadb
from chromadb.utils import embedding_functions

# ── In-memory (for tests / notebooks) ────────────────
client = chromadb.Client()

# ── Persistent (survives restarts) ───────────────────
client = chromadb.PersistentClient(path="./chroma_db")

# ── Use OpenAI embeddings automatically ──────────────
oai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
)

# Create or get a collection
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=oai_ef,            # auto-embeds on add/query
    metadata={"hnsw:space": "cosine"}     # use cosine similarity
)

# Add documents — Chroma embeds them automatically
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "DPDK is a set of libraries for fast packet processing",
        "VPP uses DPDK for high-performance networking in telecom",
        "Python is a general-purpose programming language",
    ],
    metadatas=[
        {"source": "dpdk_docs", "year": 2024},
        {"source": "vpp_docs", "year": 2024},
        {"source": "python_docs", "year": 2023},
    ],
)

# Query — semantic search
results = collection.query(
    query_texts=["how does packet processing work?"],
    n_results=2,
    include=["documents", "metadatas", "distances"],
)
for doc, meta, dist in zip(
    results["documents"][0], results["metadatas"][0], results["distances"][0]
):
    print(f"Score: {1 - dist:.3f} | Source: {meta['source']} | {doc[:60]}")
```
Metadata Filtering — Combine Semantic + Structured Search
Power Feature
```python
# Filter by metadata BEFORE semantic search
# This is critical for multi-tenant apps or date-filtered search

# Only search within the dpdk_docs source
results = collection.query(
    query_texts=["packet processing"],
    n_results=5,
    where={"source": "dpdk_docs"}      # metadata filter
)

# Numeric comparison filters
results = collection.query(
    query_texts=["networking architecture"],
    n_results=5,
    where={"year": {"$gte": 2024}}     # year >= 2024
)

# Boolean operators
results = collection.query(
    query_texts=["high performance networking"],
    n_results=5,
    where={"$and": [
        {"source": {"$in": ["dpdk_docs", "vpp_docs"]}},
        {"year": {"$gte": 2023}}
    ]}
)

# Update and delete
collection.update(ids=["doc1"], metadatas=[{"year": 2025}])
collection.delete(ids=["doc3"])
print(collection.count())   # current document count
```
Pinecone — Managed Vector DB
Cloud Production
```python
# pip install pinecone-client
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index (one-time setup)
pc.create_index(
    name="my-rag-index",
    dimension=1536,      # must match embedding model dimension
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index("my-rag-index")

# Upsert vectors (create or update) — embed_batch / documents from earlier sections
vectors = embed_batch(documents)
index.upsert(vectors=[
    {
        "id": f"doc_{i}",
        "values": vec,
        "metadata": {"text": doc, "source": "docs", "chunk_idx": i}
    }
    for i, (vec, doc) in enumerate(zip(vectors, documents))
])

# Query
query_vec = embed_openai(["packet processing performance"])[0]
results = index.query(
    vector=query_vec,
    top_k=5,
    include_metadata=True,
    filter={"source": {"$eq": "docs"}}
)
for match in results["matches"]:
    print(f"Score: {match['score']:.3f} | {match['metadata']['text'][:60]}")

# Index stats
print(index.describe_index_stats())
```
Qdrant — Best Self-Hosted Option
OSS Production
```python
# pip install qdrant-client
# Start Qdrant locally with Docker:
#   docker run -p 6333:6333 qdrant/qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue

client = QdrantClient(host="localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upsert points
vectors = embed_batch(documents)
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(
            id=i,
            vector=vec,
            payload={"text": doc, "source": "dpdk_docs"}
        )
        for i, (vec, doc) in enumerate(zip(vectors, documents))
    ]
)

# Semantic search with metadata filter
query_vec = embed_openai(["DPDK performance"])[0]
results = client.search(
    collection_name="docs",
    query_vector=query_vec,
    limit=5,
    query_filter=Filter(
        must=[FieldCondition(key="source", match=MatchValue(value="dpdk_docs"))]
    ),
    with_payload=True
)
for hit in results:
    print(f"Score: {hit.score:.3f} | {hit.payload['text'][:60]}")
```
pgvector — Vector Search in PostgreSQL
SQL + Vectors
```python
# Install the pgvector extension in PostgreSQL:
#   docker run -e POSTGRES_PASSWORD=pass -p 5432:5432 pgvector/pgvector:pg16
# pip install psycopg2-binary pgvector
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://postgres:pass@localhost/ragdb")
register_vector(conn)

# Enable extension and create table
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id SERIAL PRIMARY KEY,
            content TEXT NOT NULL,
            source TEXT,
            embedding vector(1536)
        )
    """)
    cur.execute("CREATE INDEX IF NOT EXISTS docs_embedding_idx ON documents USING ivfflat (embedding vector_cosine_ops)")
conn.commit()

# Insert documents with embeddings
def insert_docs(texts: list[str], source: str):
    vecs = embed_batch(texts)
    with conn.cursor() as cur:
        cur.executemany("""
            INSERT INTO documents (content, source, embedding) VALUES (%s, %s, %s)
        """, [(text, source, np.array(vec)) for text, vec in zip(texts, vecs)])
    conn.commit()

# Semantic search — pure SQL!
def semantic_search(query: str, top_k: int = 5, source: str = None) -> list[dict]:
    query_vec = np.array(embed_openai([query])[0])
    with conn.cursor() as cur:
        cur.execute(f"""
            SELECT content, source, 1 - (embedding <=> %s) AS similarity
            FROM documents
            {"WHERE source = %s" if source else ""}
            ORDER BY embedding <=> %s
            LIMIT %s
        """, [query_vec] + ([source] if source else []) + [query_vec, top_k])
        rows = cur.fetchall()
    return [{"content": r[0], "source": r[1], "similarity": r[2]} for r in rows]

# pgvector distance operators:
#   <->  Euclidean distance
#   <=>  Cosine distance (1 - cosine_similarity)
#   <#>  Negative dot product
```
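The ivfflat index above is the classic choice; pgvector 0.5.0 and newer also support HNSW, which generally gives better recall and does not require representative data in the table before index creation. A sketch, assuming the same `documents` table and connection as above (the `m` / `ef_construction` values are illustrative defaults):

```python
# Sketch: HNSW index in pgvector (requires pgvector >= 0.5.0)
with conn.cursor() as cur:
    cur.execute("""
        CREATE INDEX IF NOT EXISTS docs_embedding_hnsw_idx
        ON documents
        USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    """)
conn.commit()
```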
FAISS — Maximum Performance Library
High Scale
```python
# pip install faiss-cpu   (or faiss-gpu for GPU)
import faiss
import numpy as np

# Build an index
dimension = 1536

# Flat (brute force) — exact, best for < 100k vectors
index = faiss.IndexFlatIP(dimension)   # Inner Product (= cosine for normalised vectors)

# IVF (Inverted File) — fast approximate, for > 100k vectors
nlist = 100   # number of clusters
quantiser = faiss.IndexFlatIP(dimension)
index = faiss.IndexIVFFlat(quantiser, dimension, nlist, faiss.METRIC_INNER_PRODUCT)

# Add vectors (normalised for cosine similarity)
vecs = np.array(embed_batch(documents), dtype='float32')
faiss.normalize_L2(vecs)        # in-place L2 normalisation
if isinstance(index, faiss.IndexIVFFlat):
    index.train(vecs)           # IVF index must be trained first
index.add(vecs)

# Search
query_vec = np.array(embed_openai(["packet processing"]), dtype='float32')
faiss.normalize_L2(query_vec)
distances, indices = index.search(query_vec, 5)
for dist, idx in zip(distances[0], indices[0]):
    if idx != -1:               # -1 means not enough results
        print(f"Score: {dist:.3f} | {documents[idx][:60]}")

# Save and load index
faiss.write_index(index, "docs.faiss")
index = faiss.read_index("docs.faiss")
```
⚠️ FAISS does not store document text — only vectors and integer IDs. You must maintain a separate mapping from FAISS index position → document text (a Python list or SQLite table). This is the most common FAISS mistake.
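A minimal way to avoid that mistake is to keep a plain Python list (or SQLite table) whose positions line up with the order in which vectors were added. A sketch continuing from the FAISS code above (the `embed_batch` / `embed_openai` helpers come from earlier sections):

```python
# Sketch: keep an id -> text mapping alongside the FAISS index
import faiss
import numpy as np

id_to_text: list[str] = []   # position i holds the text of the i-th vector added

def add_documents(index, texts: list[str]):
    vecs = np.array(embed_batch(texts), dtype="float32")
    faiss.normalize_L2(vecs)
    index.add(vecs)
    id_to_text.extend(texts)             # keep the mapping in sync with the index

def search(index, query: str, k: int = 5) -> list[tuple[float, str]]:
    q = np.array(embed_openai([query]), dtype="float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return [(float(s), id_to_text[i]) for s, i in zip(scores[0], ids[0]) if i != -1]
```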
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Article | OpenAI Embeddings Guide — platform.openai.com/docs/guides/embeddings | Best introduction to text embeddings. Covers use cases, models, and similarity metrics with examples. |
| Docs | ChromaDB Documentation — docs.trychroma.com | Complete ChromaDB reference. Start with the Getting Started guide. |
| Docs | Qdrant Documentation — qdrant.tech/documentation | Production-quality vector DB. Excellent filtering and performance documentation. |
| Article | HuggingFace: Getting Started with Embeddings — huggingface.co/blog | Free embedding models with sentence-transformers. Hands-on with real code. |
| Docs | pgvector — github.com/pgvector/pgvector | Vector search in PostgreSQL. README covers all operators and index types. |
MILESTONE PROJECT
Build a complete semantic search engine over a collection of real documents — the foundation layer for your RAG system in M18.
Requirements
- Index at least 50 real documents (PDF or text files from any domain you care about)
- Embed all documents using OpenAI text-embedding-3-small with batch embedding and caching
- Store in ChromaDB with metadata: source, date, category, chunk_idx
- Build a query function: `search(query, top_k=5, filter_source=None)` → returns ranked results with similarity scores (a minimal sketch follows after this list)
- Compare semantic search vs keyword search on 10 queries — show where semantic wins
- FastAPI endpoint: `POST /search` with Pydantic request/response models
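For the `search()` requirement above, one possible shape — a sketch only, assuming the ChromaDB collection and the "source" metadata key from this module's examples; your schema may differ:

```python
# Sketch: milestone search() on top of a ChromaDB collection
def search(query: str, top_k: int = 5, filter_source: str | None = None) -> list[dict]:
    where = {"source": filter_source} if filter_source else None
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        where=where,
        include=["documents", "metadatas", "distances"],
    )
    return [
        {"text": doc, "metadata": meta, "score": 1 - dist}
        for doc, meta, dist in zip(
            results["documents"][0], results["metadatas"][0], results["distances"][0]
        )
    ]
```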
Stretch Goals
- Add a second collection using a free HuggingFace model — compare retrieval quality
- Implement the same search in pgvector — compare query time for 1000 documents
- Add an embedding cache to SQLite — verify zero API calls on re-indexing same documents
Skills: OpenAI embeddings, ChromaDB, batch processing, metadata filtering, FastAPI, Pydantic
Visualise the Embedding Space — Make Semantic Similarity Concrete
Objective: See embeddings as geometry — observe that similar texts cluster together in vector space.
Embed a handful of short texts drawn from two or three distinct topics, compute the pairwise cosine similarity matrix, then reduce the vectors to 2D with `from sklearn.decomposition import PCA; coords = PCA(n_components=2).fit_transform(vecs)`. Print the 2D coordinates for each text. Do texts cluster as expected?
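A rough starting point for this lab — a sketch assuming the `embed_openai` helper from earlier; the example texts are placeholders for whatever topics you choose:

```python
# Lab 1 sketch: similarity matrix + 2D PCA of a few embeddings
import numpy as np
from sklearn.decomposition import PCA

texts = [
    "DPDK accelerates packet processing",        # networking topic
    "VPP is a high-speed software router",       # networking topic
    "Gradient descent minimises a loss",         # ML topic
    "Backpropagation trains neural networks",    # ML topic
]
vecs = np.array(embed_openai(texts))

# Pairwise cosine similarity matrix
normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
print(np.round(normed @ normed.T, 2))

# Reduce to 2D — texts on the same topic should land near each other
coords = PCA(n_components=2).fit_transform(vecs)
for text, (x, y) in zip(texts, coords):
    print(f"({x:+.2f}, {y:+.2f})  {text}")
```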
ChromaDB Full Lifecycle — Index, Query, Filter, Update
Objective: Build fluency with ChromaDB by exercising every operation.
Compare Vector DBs — Same Data, Same Queries
Objective: Run the same workload on two different vector DBs and compare the experience.
Index the same documents in ChromaDB and in Qdrant (start Qdrant locally with `docker run -p 6333:6333 qdrant/qdrant`), run the same queries against both, and compare results and developer experience.
P5-M15 MASTERY CHECKLIST
- Can explain what an embedding is — a vector that captures semantic meaning — without referring to math
- Can generate embeddings using OpenAI, Cohere, and a free HuggingFace sentence-transformers model
- Always batch embed — never call the API in a per-text loop
- Know to cache embeddings to SQLite to avoid re-embedding the same content
- Know the difference between cosine similarity, dot product, and Euclidean distance — and which to use for text
- Can explain ANN (approximate nearest neighbor) and why it is faster than brute-force for large collections
- Can choose the right vector DB for a use case: ChromaDB for prototypes, Qdrant for self-hosted production, Pinecone for managed cloud, pgvector if already using Postgres
- Can create a ChromaDB collection, add documents with metadata, and query with semantic + metadata filtering
- Can use Pinecone to upsert vectors, query by similarity, and filter by metadata
- Can use Qdrant with Docker for self-hosted vector search with filters
- Know the pgvector distance operators: <-> (Euclidean), <=> (cosine), <#> (dot product)
- Know that FAISS does not store document text — you must maintain a separate ID→text mapping
- Know that Cohere embeddings require different input_type for documents ("search_document") vs queries ("search_query")
- Completed Lab 1: visualised embedding space with similarity matrix
- Completed Lab 2: ChromaDB full lifecycle including persistence test
- Completed Lab 3: compared two vector DBs on same workload
- Milestone project pushed to GitHub with README
✅ When complete: Move to P5-M16 — Chunking & Document Ingestion. Now you know how to store and search vectors — next you learn how to prepare documents before embedding them.