Skip to content

Lab 039: Vector Database ComparisonΒΆ

Level: L300 Path: πŸ“š RAG Time: ~50 min πŸ’° Cost: Free β€” All options have free tiers or local mode

What You'll LearnΒΆ

  • Understand the key differences between major vector databases
  • Compare pgvector, Chroma, Qdrant, and Azure AI Search on the same task
  • Evaluate each on: setup complexity, query speed, filtering, cloud integration
  • Choose the right vector database for your agent's requirements

IntroductionΒΆ

Choosing a vector database is one of the most consequential architecture decisions for a RAG-based agent. The "best" choice depends on your team's existing stack, scale requirements, filtering needs, and cloud strategy.

The candidates:

Database Type Free option Best for
pgvector PostgreSQL extension Azure free tier / local Teams already using PostgreSQL
Chroma Embedded / server Fully open-source Local development, small projects
Qdrant Dedicated vector DB Qdrant Cloud free tier High-scale production, advanced filtering
Azure AI Search Azure service Free tier (1 index) Azure-native, hybrid search, enterprise

PrerequisitesΒΆ

# Install all clients
pip install chromadb qdrant-client openai

# For Azure AI Search (optional)
pip install azure-search-documents

No API keys needed for Chroma (local) and Qdrant (local mode). GitHub Token required for embeddings:

export GITHUB_TOKEN=<your PAT>


Setup: Shared Embedding FunctionΒΆ

All four tests use the same OutdoorGear product data and the same embedding model:

# shared.py
import os
import math
from openai import OpenAI

client = OpenAI(
    base_url="https://models.inference.ai.azure.com",
    api_key=os.environ["GITHUB_TOKEN"],
)

PRODUCTS = [
    {"id": "P001", "name": "TrailBlazer Tent 2P",          "category": "Tents",         "price": 249.99, "weight": 1800},
    {"id": "P002", "name": "Summit Dome 4P",                "category": "Tents",         "price": 549.99, "weight": 3200},
    {"id": "P003", "name": "TrailBlazer Solo",              "category": "Tents",         "price": 299.99, "weight":  850},
    {"id": "P004", "name": "ArcticDown -20Β°C Sleeping Bag", "category": "Sleeping Bags", "price": 389.99, "weight": 1400},
    {"id": "P005", "name": "SummerLight +5Β°C Sleeping Bag", "category": "Sleeping Bags", "price": 149.99, "weight":  700},
    {"id": "P006", "name": "Osprey Atmos 65L Backpack",     "category": "Backpacks",     "price": 289.99, "weight": 1980},
    {"id": "P007", "name": "DayHiker 22L Daypack",          "category": "Backpacks",     "price":  89.99, "weight":  580},
]

def embed(text: str) -> list[float]:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return response.data[0].embedding

def product_text(p: dict) -> str:
    return f"{p['name']}: {p['category']} product, ${p['price']:.2f}, {p['weight']}g"

Option A: Chroma (Local, No Setup)ΒΆ

Chroma is the easiest way to get started β€” pure Python, runs in-memory or on-disk:

# option_a_chroma.py
import chromadb
from shared import PRODUCTS, embed, product_text
import time

print("=== Option A: ChromaDB (Local) ===\n")

# In-memory collection (no persistence β€” great for prototyping)
chroma = chromadb.Client()
collection = chroma.create_collection("outdoorgear_products")

# Ingest
start = time.time()
collection.add(
    ids=[p["id"] for p in PRODUCTS],
    embeddings=[embed(product_text(p)) for p in PRODUCTS],
    documents=[product_text(p) for p in PRODUCTS],
    metadatas=[{"category": p["category"], "price": p["price"]} for p in PRODUCTS],
)
ingest_ms = (time.time() - start) * 1000

# Query
query = "lightweight tent for solo backpacking"
start = time.time()
results = collection.query(
    query_embeddings=[embed(query)],
    n_results=3,
)
query_ms = (time.time() - start) * 1000

print(f"Ingest: {ingest_ms:.0f}ms | Query: {query_ms:.0f}ms")
print(f"\nTop 3 results for '{query}':")
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    similarity = 1 - dist  # Chroma returns distance, not similarity
    print(f"  {similarity:.3f} | {doc}")

# Filtered query (Chroma supports simple metadata filtering)
start = time.time()
filtered = collection.query(
    query_embeddings=[embed(query)],
    n_results=3,
    where={"category": "Tents"},   # ← metadata filter
)
filtered_ms = (time.time() - start) * 1000

print(f"\nFiltered to 'Tents' only ({filtered_ms:.0f}ms):")
for doc in filtered["documents"][0]:
    print(f"  {doc}")

Option B: Qdrant (Local Server Mode)ΒΆ

Qdrant offers advanced filtering and scales to hundreds of millions of vectors:

# Run Qdrant locally with Docker (or use Qdrant Cloud free tier)
docker run -p 6333:6333 qdrant/qdrant
# option_b_qdrant.py
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue, Range
from shared import PRODUCTS, embed, product_text
import time

print("=== Option B: Qdrant (Local) ===\n")

client = QdrantClient("localhost", port=6333)

COLLECTION = "outdoorgear"
VECTOR_SIZE = 1536  # text-embedding-3-small

# Create collection
client.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
)

# Ingest with payload (rich metadata support)
start = time.time()
points = [
    PointStruct(
        id=int(p["id"][1:]),   # P001 β†’ 1
        vector=embed(product_text(p)),
        payload={
            "id":       p["id"],
            "name":     p["name"],
            "category": p["category"],
            "price":    p["price"],
            "weight":   p["weight"],
        },
    )
    for p in PRODUCTS
]
client.upsert(collection_name=COLLECTION, points=points)
ingest_ms = (time.time() - start) * 1000

# Semantic search
query = "something warm for cold winter nights"
start = time.time()
results = client.search(
    collection_name=COLLECTION,
    query_vector=embed(query),
    limit=3,
)
query_ms = (time.time() - start) * 1000

print(f"Ingest: {ingest_ms:.0f}ms | Query: {query_ms:.0f}ms")
print(f"\nTop 3 for '{query}':")
for r in results:
    print(f"  {r.score:.3f} | [{r.payload['id']}] {r.payload['name']}")

# Advanced filter: Tents under $300
start = time.time()
filtered = client.search(
    collection_name=COLLECTION,
    query_vector=embed("lightweight shelter"),
    limit=3,
    query_filter=Filter(
        must=[
            FieldCondition(key="category", match=MatchValue(value="Tents")),
            FieldCondition(key="price", range=Range(lte=300.0)),
        ]
    ),
)
filtered_ms = (time.time() - start) * 1000

print(f"\nTents under $300 ({filtered_ms:.0f}ms):")
for r in filtered:
    print(f"  {r.score:.3f} | [{r.payload['id']}] {r.payload['name']} ${r.payload['price']:.2f}")

Option C: pgvector (Azure PostgreSQL or Local)ΒΆ

See Lab 031 for the full pgvector setup. Quick comparison:

# option_c_pgvector_query.py
import psycopg2
import os
from shared import embed

# Using Azure PostgreSQL with pgvector
conn = psycopg2.connect(
    host=os.environ["PG_HOST"],
    dbname=os.environ["PG_DATABASE"],
    user=os.environ["PG_USER"],
    password=os.environ["PG_PASSWORD"],
    sslmode="require",
)
cur = conn.cursor()

query_vec = embed("lightweight tent for solo backpacking")
query_str = "[" + ",".join(str(v) for v in query_vec) + "]"

# Cosine similarity search using <=> operator
cur.execute("""
    SELECT p.name, p.category, p.price_usd,
           1 - (pe.embedding <=> %s::vector) AS similarity
    FROM product_embeddings pe
    JOIN products p ON p.id = pe.product_id
    ORDER BY pe.embedding <=> %s::vector
    LIMIT 3;
""", [query_str, query_str])

print("=== Option C: pgvector ===")
for name, category, price, sim in cur.fetchall():
    print(f"  {sim:.3f} | {name} ({category}) ${price:.2f}")

Azure AI Search uniquely supports hybrid search: vector + BM25 keyword search combined with semantic reranking:

# option_d_azure_search.py
# pip install azure-search-documents
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from azure.core.credentials import AzureKeyCredential
from shared import embed
import os

client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="outdoorgear-products",
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

query = "lightweight backpacking shelter"
query_vec = embed(query)

# Hybrid search: vector + keyword + semantic reranking
results = client.search(
    search_text=query,           # BM25 keyword search
    vector_queries=[
        VectorizedQuery(
            vector=query_vec,
            k_nearest_neighbors=3,
            fields="embedding",  # the vector field in your index
        )
    ],
    query_type="semantic",       # semantic reranking (requires Semantic tier)
    semantic_configuration_name="default",
    top=3,
)

print("=== Option D: Azure AI Search (Hybrid) ===")
for r in results:
    print(f"  @score={r['@search.score']:.3f} | [{r['id']}] {r['name']}")

Decision FrameworkΒΆ

Start here:
         ↓
Already using PostgreSQL?
  β†’ YES β†’ Use pgvector (Lab 031)
  β†’ NO  ↓

Need Azure-native + hybrid search?
  β†’ YES β†’ Azure AI Search
  β†’ NO  ↓

Local dev / prototype?
  β†’ YES β†’ Chroma (zero setup)
  β†’ NO  ↓

Need advanced filtering + high scale?
  β†’ YES β†’ Qdrant
  β†’ NO  β†’ Chroma or pgvector

Comparison TableΒΆ

pgvector Chroma Qdrant Azure AI Search
Setup Medium (DB needed) Minimal Easy (Docker) Medium (Azure)
Local dev βœ… (Docker) βœ… (in-memory) βœ… (Docker) ❌ (Azure only)
Hybrid search βœ… (manual) ❌ βœ… βœ… (built-in)
Filtering SQL WHERE Basic metadata Advanced Full OData
Scale Moderate (< 1M) Small (< 100K) High (> 100M) High (enterprise)
Azure integration βœ… (managed PG) ❌ Partial βœ… (native)
Cost (free tier) Free tier PG Free Qdrant Cloud free 1 index free

🧠 Knowledge Check¢

1. What is hybrid search, and why is it often better than pure vector search?

Hybrid search combines vector (semantic) search with keyword (BM25) search and ranks results using both signals. Vector search excels at semantic understanding ("warm shelter" β†’ sleeping bag) but can miss exact matches (a specific product ID). BM25 is excellent for exact keyword matches but misses synonyms. Combining them outperforms either alone, especially for product names, SKUs, and specialized terminology.

2. Why would you choose pgvector over a dedicated vector database like Qdrant?

If you already have PostgreSQL as your primary database, pgvector adds vector search without adding another service to operate, maintain, and pay for. The data lives alongside your relational data β€” you can JOIN product records with their embeddings in a single query. For most applications under 1M vectors, pgvector's performance is excellent. Choose Qdrant when you need > 100M vectors or very advanced filtering.

3. What is the IVFFlat index in pgvector, and when should you use HNSW instead?

IVFFlat (Inverted File Index with Flat quantization): fast to build, uses less memory, good for datasets that don't change frequently. Uses approximate search β€” lists parameter controls recall/speed trade-off. HNSW (Hierarchical Navigable Small World): better recall, faster queries, but higher memory usage and slower to build. Use IVFFlat for datasets < 1M that don't change much; use HNSW for frequently updated datasets or when recall is critical. Both require pgvector β‰₯ 0.5.0.


SummaryΒΆ

For the OutdoorGear learning hub scenario (< 10K products, Azure infrastructure, team knows SQL):

Recommended: pgvector on Azure Database for PostgreSQL Flexible Server.

  • No new service to learn
  • SQL + vector in one query
  • Free tier available
  • Production-ready with Lab 031 migrations

Next StepsΒΆ