What This Module Covers
RAG Foundation
RAG (Retrieval-Augmented Generation) lets LLMs answer questions about your own documents. The foundation of RAG is embeddings — mathematical representations of text that capture meaning — and vector databases that store and search them efficiently. This module teaches you everything you need to build the retrieval layer.
- Embeddings — what they are, how they encode meaning, why similar texts produce similar vectors
- Embedding models — OpenAI text-embedding-3, Cohere embed, HuggingFace sentence-transformers
- Similarity metrics — cosine similarity, dot product, Euclidean distance — when to use each
- Vector databases — ChromaDB, Pinecone, Qdrant, pgvector, FAISS — how to choose
- Indexing and querying — adding documents, querying by semantic similarity, filtering with metadata
- Embedding costs and performance — batch embedding, caching, model selection
Where This Fits in RAG
Architecture
```
# The full RAG pipeline — this module covers the RETRIEVAL box
#
# INDEXING (offline):                RETRIEVAL (online, per query):
#
#   Documents                          User Question
#       ↓                                  ↓
#   Chunking (M16)                     Embed question (this module)
#       ↓                                  ↓
#   Embed chunks (this module)         Search vector DB (this module)
#       ↓                                  ↓
#   Store in Vector DB (this)          Top-K chunks returned
#                                          ↓
#                                      Reranking (M17)
#                                          ↓
#                                      LLM generates answer (M18)
```
What Are Embeddings?
Concept First
An embedding is a list of floating-point numbers (a vector) that represents the meaning of a piece of text. The embedding model maps semantically similar texts to nearby points in vector space — so "dog" and "canine" are close together, but "dog" and "database" are far apart.
💡 The key insight: you never look at the actual numbers. The magic is that vector distance corresponds to semantic similarity. Two passages about the same topic will have similar vectors even if they use completely different words — enabling semantic search that keyword search cannot match.
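To make this concrete before the API walkthrough below, here is a minimal sketch using the free sentence-transformers model covered later in this module (the example words and the exact scores are illustrative, not benchmarks):

```python
# Minimal sketch: semantic similarity with a free local model
# Assumes: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
vecs = model.encode(["dog", "canine", "database"])

# Cosine similarity: "dog" vs "canine" scores far higher than "dog" vs "database"
print(util.cos_sim(vecs[0], vecs[1]).item())   # high — related meanings
print(util.cos_sim(vecs[0], vecs[2]).item())   # low  — unrelated meanings
```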
Generating Embeddings — OpenAI, Cohere, HuggingFace
Code
```python
# pip install openai cohere sentence-transformers

# ── OpenAI Embeddings ─────────────────────────────────
from openai import OpenAI

client = OpenAI()

def embed_openai(texts: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    response = client.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]

# Single text
vec = embed_openai(["What is DPDK?"])[0]
print(f"Dimensions: {len(vec)}")   # 1536 for text-embedding-3-small

# Batch — much more efficient (one API call for many texts)
docs = [
    "DPDK is a packet processing framework",
    "VPP runs on DPDK for high-performance networking",
    "Machine learning uses gradient descent",
]
vecs = embed_openai(docs)   # 3 embeddings, 1 API call

# ── Cohere Embeddings ─────────────────────────────────
import cohere

co = cohere.Client()   # COHERE_API_KEY from environment
response = co.embed(
    texts=docs,
    model="embed-english-v3.0",
    input_type="search_document",   # "search_document" for indexing, "search_query" for queries
)
vecs = response.embeddings   # list of lists

# ── HuggingFace Sentence Transformers (free, local) ───
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384 dims, fast, free
vecs = model.encode(docs, show_progress_bar=True)  # numpy arrays
print(vecs.shape)   # (3, 384)

# Better quality, slower:
model = SentenceTransformer("BAAI/bge-large-en-v1.5")   # 1024 dims, SOTA free model
```
| Model | Dims | Cost | Best For |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02/1M tokens | Default choice — great quality, cheap |
| text-embedding-3-large | 3072 | $0.13/1M tokens | Highest quality, higher cost |
| embed-english-v3.0 (Cohere) | 1024 | $0.10/1M tokens | Best with Cohere reranker (M17) |
| all-MiniLM-L6-v2 | 384 | Free (local) | Prototyping, offline, no API cost |
| BAAI/bge-large-en-v1.5 | 1024 | Free (local) | Best free model quality — production with GPU |
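One detail worth knowing when comparing these models: the text-embedding-3 family accepts a `dimensions` parameter that truncates the output vector, trading a little quality for smaller storage and faster search. A quick sketch (512 is an illustrative value, not a recommendation):

```python
# Sketch: shortened embeddings via the dimensions parameter (text-embedding-3 models only)
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    input=["What is DPDK?"],
    model="text-embedding-3-small",
    dimensions=512,   # default is 1536; smaller vectors = cheaper storage, faster search
)
print(len(response.data[0].embedding))   # 512
```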
Embedding Best Practices
Production
```python
# 1. Always batch — never embed one text at a time in a loop

# BAD: 1000 API calls
vecs = [embed_openai([text])[0] for text in texts]

# GOOD: 1 API call per batch (up to 2048 texts per call)
# Batch into chunks of 500 to stay within API limits
def embed_batch(texts: list[str], batch_size: int = 500) -> list[list[float]]:
    all_embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = client.embeddings.create(input=batch, model="text-embedding-3-small")
        all_embeddings.extend([item.embedding for item in response.data])
    return all_embeddings

# 2. Cache embeddings — never re-embed the same text twice
import hashlib, json, sqlite3

def cached_embed(text: str) -> list[float]:
    key = hashlib.md5(text.encode()).hexdigest()
    with sqlite3.connect("embeddings.db") as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, vec TEXT)")
        row = conn.execute("SELECT vec FROM cache WHERE key=?", (key,)).fetchone()
        if row:
            return json.loads(row[0])
        vec = embed_openai([text])[0]
        conn.execute("INSERT INTO cache VALUES (?,?)", (key, json.dumps(vec)))
        return vec

# 3. Use the right input_type (Cohere only)
#    Documents being indexed: input_type="search_document"
#    User queries:            input_type="search_query"
#    Using the wrong type degrades retrieval quality

# 4. Normalise embeddings before cosine similarity (optional but consistent)
import numpy as np

def normalise(vec: list[float]) -> list[float]:
    arr = np.array(vec)
    return (arr / np.linalg.norm(arr)).tolist()
```
Similarity Metrics — Cosine, Dot Product, Euclidean
Core Math
```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Angle between vectors. Range: -1 to 1. 1 = identical direction."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_product(a: list[float], b: list[float]) -> float:
    """Dot product. Equivalent to cosine if vectors are normalised."""
    return float(np.dot(np.array(a), np.array(b)))

def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance. Lower = more similar. Range: 0 to inf."""
    return float(np.linalg.norm(np.array(a) - np.array(b)))

# Demonstrate: semantically similar texts should be close
vecs = embed_openai([
    "DPDK is a fast packet processing framework",
    "FD.io DPDK provides high-speed networking",
    "Machine learning uses gradient descent optimisation",
])
print(cosine_similarity(vecs[0], vecs[1]))   # ~0.91 — very similar
print(cosine_similarity(vecs[0], vecs[2]))   # ~0.18 — very different
```

| Metric | Range | More Similar = | Use When |
|---|---|---|---|
| Cosine Similarity | -1 to 1 | Higher (→ 1) | Default for text. Ignores vector magnitude. |
| Dot Product | −∞ to ∞ | Higher | When vectors are normalised (= cosine). Fastest. |
| Euclidean Distance | 0 to ∞ | Lower (→ 0) | Image embeddings, when magnitude matters. |
💡 Use cosine similarity for text embeddings by default. OpenAI recommends it for text-embedding-3 models. Most vector databases default to cosine. Dot product is equivalent and faster when vectors are L2-normalised — many embedding models output normalised vectors.
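A quick numeric check of that claim: after L2 normalisation, the dot product of two vectors equals their cosine similarity. A self-contained sketch with random stand-in vectors:

```python
# Sketch: for L2-normalised vectors, dot product == cosine similarity
import numpy as np

a = np.random.rand(1536)
b = np.random.rand(1536)
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)

cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
dot_normalised = float(np.dot(a_n, b_n))
print(abs(cosine - dot_normalised) < 1e-9)   # True — the two metrics coincide
```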
Brute-Force vs ANN — How Vector DBs Search at Scale
Performance
```python
# Brute-force: compare query to EVERY stored vector
# O(n × d) — works fine for < 100k vectors, slow for millions
def brute_force_search(query_vec, stored_vecs, top_k=5):
    scores = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(stored_vecs)]
    scores.sort(reverse=True)
    return scores[:top_k]

# ANN (Approximate Nearest Neighbor) — index structures for fast search
#   HNSW  (Hierarchical Navigable Small World) — used by ChromaDB, Qdrant, Weaviate
#   IVF   (Inverted File Index)                — used by FAISS
#   ANNOY                                      — used by Spotify, disk-friendly

# ANN trade-off: slightly approximate results, but 100-1000x faster
# In practice: ANN accuracy is > 99% with the right parameters

# Rule of thumb:
#   < 100k vectors:  brute force is fine (ChromaDB default)
#   100k - 10M:      HNSW index (Qdrant, Weaviate)
#   > 10M:           FAISS IVF or a managed service (Pinecone)
```
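If you want to see an HNSW index outside of a full vector database, the standalone hnswlib package (not otherwise used in this module — treat this as an optional aside) exposes the same idea in a few lines. The parameter values below are common defaults, not tuned numbers:

```python
# Optional sketch: a raw HNSW index with hnswlib (pip install hnswlib)
import hnswlib
import numpy as np

dim = 384
data = np.random.rand(10_000, dim).astype("float32")   # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=10_000, ef_construction=200, M=16)
index.add_items(data, ids=np.arange(10_000))

index.set_ef(50)   # higher ef = better recall, slower queries
labels, distances = index.knn_query(data[:1], k=5)   # approximate top-5 neighbours
print(labels[0], distances[0])
```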
Choosing a Vector Database
Decision Guide
ChromaDB
Zero-setup local vector DB. In-memory or persisted. Perfect for prototyping and small-scale RAG. No server needed.
Pinecone
Fully managed, serverless. Free tier. Best for production with no infra overhead. Up to billions of vectors.
Qdrant
Best open-source production option. Rich filtering, HNSW, Rust performance. Self-host or use Qdrant Cloud.
pgvector
Vector search inside PostgreSQL. Best when your data is already in Postgres. No new infra to manage.
FAISS
Meta's vector similarity library. Not a DB — needs wrapping. Fastest raw search for large in-memory indexes.
| Use Case | Recommended | Why |
|---|---|---|
| Prototype / learning | ChromaDB | pip install, no server, works in 5 lines |
| Production (managed) | Pinecone | No infra, scales to billions, SLA |
| Production (self-hosted) | Qdrant | Best OSS quality, rich filters, Docker deploy |
| Already using Postgres | pgvector | Reuse existing DB, ACID, familiar SQL |
| Max performance (large scale) | FAISS | Fastest raw search, GPU support |
ChromaDB — Start Here for Every RAG Project
Prototype to Production
```python
# pip install chromadb openai
import os
import chromadb
from chromadb.utils import embedding_functions

# ── In-memory (for tests / notebooks) ────────────────
client = chromadb.Client()

# ── Persistent (survives restarts) ───────────────────
client = chromadb.PersistentClient(path="./chroma_db")

# ── Use OpenAI embeddings automatically ──────────────
oai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small"
)

# Create or get a collection
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=oai_ef,            # auto-embeds on add/query
    metadata={"hnsw:space": "cosine"}     # use cosine similarity
)

# Add documents — Chroma embeds them automatically
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "DPDK is a set of libraries for fast packet processing",
        "VPP uses DPDK for high-performance networking in telecom",
        "Python is a general-purpose programming language",
    ],
    metadatas=[
        {"source": "dpdk_docs", "year": 2024},
        {"source": "vpp_docs", "year": 2024},
        {"source": "python_docs", "year": 2023},
    ],
)

# Query — semantic search
results = collection.query(
    query_texts=["how does packet processing work?"],
    n_results=2,
    include=["documents", "metadatas", "distances"],
)
for doc, meta, dist in zip(
    results["documents"][0], results["metadatas"][0], results["distances"][0]
):
    print(f"Score: {1 - dist:.3f} | Source: {meta['source']} | {doc[:60]}")
```
Metadata Filtering — Combine Semantic + Structured Search
Power Feature
```python
# Filter by metadata BEFORE semantic search
# This is critical for multi-tenant apps or date-filtered search

# Only search within the dpdk_docs source
results = collection.query(
    query_texts=["packet processing"],
    n_results=5,
    where={"source": "dpdk_docs"}      # metadata filter
)

# Numeric comparison filters
results = collection.query(
    query_texts=["networking architecture"],
    n_results=5,
    where={"year": {"$gte": 2024}}     # year >= 2024
)

# Boolean operators
results = collection.query(
    query_texts=["high performance networking"],
    n_results=5,
    where={"$and": [
        {"source": {"$in": ["dpdk_docs", "vpp_docs"]}},
        {"year": {"$gte": 2023}}
    ]}
)

# Update and delete
collection.update(ids=["doc1"], metadatas=[{"year": 2025}])
collection.delete(ids=["doc3"])
print(collection.count())   # current document count
```
Pinecone — Managed Vector DB
Cloud Production
```python
# pip install pinecone-client
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create index (one-time setup)
pc.create_index(
    name="my-rag-index",
    dimension=1536,      # must match embedding model dimension
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1")
)
index = pc.Index("my-rag-index")

# Upsert vectors (create or update) — embed_batch / documents from earlier sections
vectors = embed_batch(documents)
index.upsert(vectors=[
    {
        "id": f"doc_{i}",
        "values": vec,
        "metadata": {"text": doc, "source": "docs", "chunk_idx": i}
    }
    for i, (vec, doc) in enumerate(zip(vectors, documents))
])

# Query
query_vec = embed_openai(["packet processing performance"])[0]
results = index.query(
    vector=query_vec,
    top_k=5,
    include_metadata=True,
    filter={"source": {"$eq": "docs"}}
)
for match in results["matches"]:
    print(f"Score: {match['score']:.3f} | {match['metadata']['text'][:60]}")

# Index stats
print(index.describe_index_stats())
```
Qdrant — Best Self-Hosted Option
OSS Production
```python
# pip install qdrant-client
# Start Qdrant locally with Docker:
#   docker run -p 6333:6333 qdrant/qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue

client = QdrantClient(host="localhost", port=6333)

# Create collection
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# Upsert points
vectors = embed_batch(documents)
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(
            id=i,
            vector=vec,
            payload={"text": doc, "source": "dpdk_docs"}
        )
        for i, (vec, doc) in enumerate(zip(vectors, documents))
    ]
)

# Semantic search with metadata filter
query_vec = embed_openai(["DPDK performance"])[0]
results = client.search(
    collection_name="docs",
    query_vector=query_vec,
    limit=5,
    query_filter=Filter(
        must=[FieldCondition(key="source", match=MatchValue(value="dpdk_docs"))]
    ),
    with_payload=True
)
for hit in results:
    print(f"Score: {hit.score:.3f} | {hit.payload['text'][:60]}")
```
pgvector — Vector Search in PostgreSQL
SQL + Vectors
```python
# Install the pgvector extension in PostgreSQL:
#   docker run -e POSTGRES_PASSWORD=pass -p 5432:5432 pgvector/pgvector:pg16
# pip install psycopg2-binary pgvector
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://postgres:pass@localhost/ragdb")
register_vector(conn)

# Enable extension and create table
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id SERIAL PRIMARY KEY,
            content TEXT NOT NULL,
            source TEXT,
            embedding vector(1536)
        )
    """)
    cur.execute("CREATE INDEX IF NOT EXISTS docs_embedding_idx ON documents USING ivfflat (embedding vector_cosine_ops)")
conn.commit()

# Insert documents with embeddings
def insert_docs(texts: list[str], source: str):
    vecs = embed_batch(texts)
    with conn.cursor() as cur:
        cur.executemany("""
            INSERT INTO documents (content, source, embedding) VALUES (%s, %s, %s)
        """, [(text, source, np.array(vec)) for text, vec in zip(texts, vecs)])
    conn.commit()

# Semantic search — pure SQL!
def semantic_search(query: str, top_k: int = 5, source: str = None) -> list[dict]:
    query_vec = np.array(embed_openai([query])[0])
    with conn.cursor() as cur:
        cur.execute(f"""
            SELECT content, source, 1 - (embedding <=> %s) AS similarity
            FROM documents
            {"WHERE source = %s" if source else ""}
            ORDER BY embedding <=> %s
            LIMIT %s
        """, [query_vec] + ([source] if source else []) + [query_vec, top_k])
        rows = cur.fetchall()
    return [{"content": r[0], "source": r[1], "similarity": r[2]} for r in rows]

# pgvector distance operators:
#   <->  Euclidean distance
#   <=>  Cosine distance (1 - cosine_similarity)
#   <#>  Negative dot product
```
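The ivfflat index above is the classic choice; pgvector 0.5.0 and newer also support HNSW, which generally gives better recall and does not require representative data in the table before index creation. A sketch, assuming the same `documents` table and connection as above (the `m` / `ef_construction` values are illustrative defaults):

```python
# Sketch: HNSW index in pgvector (requires pgvector >= 0.5.0)
with conn.cursor() as cur:
    cur.execute("""
        CREATE INDEX IF NOT EXISTS docs_embedding_hnsw_idx
        ON documents
        USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64)
    """)
conn.commit()
```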
FAISS — Maximum Performance Library
High Scale
```python
# pip install faiss-cpu   (or faiss-gpu for GPU)
import faiss
import numpy as np

# Build an index
dimension = 1536

# Flat (brute force) — exact, best for < 100k vectors
index = faiss.IndexFlatIP(dimension)   # Inner Product (= cosine for normalised vectors)

# IVF (Inverted File) — fast approximate, for > 100k vectors
nlist = 100   # number of clusters
quantiser = faiss.IndexFlatIP(dimension)
index = faiss.IndexIVFFlat(quantiser, dimension, nlist, faiss.METRIC_INNER_PRODUCT)

# Add vectors (normalised for cosine similarity)
vecs = np.array(embed_batch(documents), dtype='float32')
faiss.normalize_L2(vecs)        # in-place L2 normalisation
if isinstance(index, faiss.IndexIVFFlat):
    index.train(vecs)           # IVF index must be trained first
index.add(vecs)

# Search
query_vec = np.array(embed_openai(["packet processing"]), dtype='float32')
faiss.normalize_L2(query_vec)
distances, indices = index.search(query_vec, 5)
for dist, idx in zip(distances[0], indices[0]):
    if idx != -1:               # -1 means not enough results
        print(f"Score: {dist:.3f} | {documents[idx][:60]}")

# Save and load index
faiss.write_index(index, "docs.faiss")
index = faiss.read_index("docs.faiss")
```
⚠️ FAISS does not store document text — only vectors and integer IDs. You must maintain a separate mapping from FAISS index position → document text (a Python list or SQLite table). This is the most common FAISS mistake.
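A minimal way to avoid that mistake is to keep a plain Python list (or SQLite table) whose positions line up with the order in which vectors were added. A sketch continuing from the FAISS code above (the `embed_batch` / `embed_openai` helpers come from earlier sections):

```python
# Sketch: keep an id -> text mapping alongside the FAISS index
import faiss
import numpy as np

id_to_text: list[str] = []   # position i holds the text of the i-th vector added

def add_documents(index, texts: list[str]):
    vecs = np.array(embed_batch(texts), dtype="float32")
    faiss.normalize_L2(vecs)
    index.add(vecs)
    id_to_text.extend(texts)             # keep the mapping in sync with the index

def search(index, query: str, k: int = 5) -> list[tuple[float, str]]:
    q = np.array(embed_openai([query]), dtype="float32")
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return [(float(s), id_to_text[i]) for s, i in zip(scores[0], ids[0]) if i != -1]
```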
FREE LEARNING RESOURCES
| Type | Resource | Best For |
|---|---|---|
| Article | OpenAI Embeddings Guide — platform.openai.com/docs/guides/embeddings | Best introduction to text embeddings. Covers use cases, models, and similarity metrics with examples. |
| Docs | ChromaDB Documentation — docs.trychroma.com | Complete ChromaDB reference. Start with the Getting Started guide. |
| Docs | Qdrant Documentation — qdrant.tech/documentation | Production-quality vector DB. Excellent filtering and performance documentation. |
| Article | HuggingFace: Getting Started with Embeddings — huggingface.co/blog | Free embedding models with sentence-transformers. Hands-on with real code. |
| Docs | pgvector — github.com/pgvector/pgvector | Vector search in PostgreSQL. README covers all operators and index types. |
MILESTONE PROJECT
Build a complete semantic search engine over a collection of real documents — the foundation layer for your RAG system in M18.
Requirements
- Index at least 50 real documents (PDF or text files from any domain you care about)
- Embed all documents using OpenAI text-embedding-3-small with batch embedding and caching
- Store in ChromaDB with metadata: source, date, category, chunk_idx
- Build a query function: `search(query, top_k=5, filter_source=None)` → returns ranked results with similarity scores (a minimal sketch follows after this list)
- Compare semantic search vs keyword search on 10 queries — show where semantic wins
- FastAPI endpoint: `POST /search` with Pydantic request/response models
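For the `search()` requirement above, one possible shape — a sketch only, assuming the ChromaDB collection and the "source" metadata key from this module's examples; your schema may differ:

```python
# Sketch: milestone search() on top of a ChromaDB collection
def search(query: str, top_k: int = 5, filter_source: str | None = None) -> list[dict]:
    where = {"source": filter_source} if filter_source else None
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        where=where,
        include=["documents", "metadatas", "distances"],
    )
    return [
        {"text": doc, "metadata": meta, "score": 1 - dist}
        for doc, meta, dist in zip(
            results["documents"][0], results["metadatas"][0], results["distances"][0]
        )
    ]
```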
Stretch Goals
- Add a second collection using a free HuggingFace model — compare retrieval quality
- Implement the same search in pgvector — compare query time for 1000 documents
- Add an embedding cache to SQLite — verify zero API calls on re-indexing same documents
Skills: OpenAI embeddings, ChromaDB, batch processing, metadata filtering, FastAPI, Pydantic
Visualise the Embedding Space — Make Semantic Similarity Concrete
Objective: See embeddings as geometry — observe that similar texts cluster together in vector space.
Embed a handful of short texts drawn from two or three distinct topics, compute the pairwise cosine similarity matrix, then reduce the vectors to 2D with `from sklearn.decomposition import PCA; coords = PCA(n_components=2).fit_transform(vecs)`. Print the 2D coordinates for each text. Do texts cluster as expected?
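A rough starting point for this lab — a sketch assuming the `embed_openai` helper from earlier; the example texts are placeholders for whatever topics you choose:

```python
# Lab 1 sketch: similarity matrix + 2D PCA of a few embeddings
import numpy as np
from sklearn.decomposition import PCA

texts = [
    "DPDK accelerates packet processing",        # networking topic
    "VPP is a high-speed software router",       # networking topic
    "Gradient descent minimises a loss",         # ML topic
    "Backpropagation trains neural networks",    # ML topic
]
vecs = np.array(embed_openai(texts))

# Pairwise cosine similarity matrix
normed = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
print(np.round(normed @ normed.T, 2))

# Reduce to 2D — texts on the same topic should land near each other
coords = PCA(n_components=2).fit_transform(vecs)
for text, (x, y) in zip(texts, coords):
    print(f"({x:+.2f}, {y:+.2f})  {text}")
```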
ChromaDB Full Lifecycle — Index, Query, Filter, Update
Objective: Build fluency with ChromaDB by exercising every operation.
Compare Vector DBs — Same Data, Same Queries
Objective: Run the same workload on two different vector DBs and compare the experience.
Index the same documents in ChromaDB and in Qdrant (start Qdrant locally with `docker run -p 6333:6333 qdrant/qdrant`), run the same queries against both, and compare results and developer experience.
P5-M15 MASTERY CHECKLIST
- Can explain what an embedding is — a vector that captures semantic meaning — without referring to math
- Can generate embeddings using OpenAI, Cohere, and a free HuggingFace sentence-transformers model
- Always batch embed — never call the API in a per-text loop
- Know to cache embeddings to SQLite to avoid re-embedding the same content
- Know the difference between cosine similarity, dot product, and Euclidean distance — and which to use for text
- Can explain ANN (approximate nearest neighbor) and why it is faster than brute-force for large collections
- Can choose the right vector DB for a use case: ChromaDB for prototypes, Qdrant for self-hosted production, Pinecone for managed cloud, pgvector if already using Postgres
- Can create a ChromaDB collection, add documents with metadata, and query with semantic + metadata filtering
- Can use Pinecone to upsert vectors, query by similarity, and filter by metadata
- Can use Qdrant with Docker for self-hosted vector search with filters
- Know the pgvector distance operators: <-> (Euclidean), <=> (cosine), <#> (dot product)
- Know that FAISS does not store document text — you must maintain a separate ID→text mapping
- Know that Cohere embeddings require different input_type for documents ("search_document") vs queries ("search_query")
- Completed Lab 1: visualised embedding space with similarity matrix
- Completed Lab 2: ChromaDB full lifecycle including persistence test
- Completed Lab 3: compared two vector DBs on same workload
- Milestone project pushed to GitHub with README
✅ When complete: Move to P5-M16 — Chunking & Document Ingestion. Now you know how to store and search vectors — next you learn how to prepare documents before embedding them.