Lab 039: Vector Database ComparisonΒΆ
What You'll LearnΒΆ
- Understand the key differences between major vector databases
- Compare pgvector, Chroma, Qdrant, and Azure AI Search on the same task
- Evaluate each on: setup complexity, query speed, filtering, cloud integration
- Choose the right vector database for your agent's requirements
IntroductionΒΆ
Choosing a vector database is one of the most consequential architecture decisions for a RAG-based agent. The "best" choice depends on your team's existing stack, scale requirements, filtering needs, and cloud strategy.
The candidates:
| Database | Type | Free option | Best for |
|---|---|---|---|
| pgvector | PostgreSQL extension | Azure free tier / local | Teams already using PostgreSQL |
| Chroma | Embedded / server | Fully open-source | Local development, small projects |
| Qdrant | Dedicated vector DB | Qdrant Cloud free tier | High-scale production, advanced filtering |
| Azure AI Search | Azure service | Free tier (1 index) | Azure-native, hybrid search, enterprise |
PrerequisitesΒΆ
# Install all clients
pip install chromadb qdrant-client openai
# For Azure AI Search (optional)
pip install azure-search-documents
No API keys needed for Chroma (local) and Qdrant (local mode). GitHub Token required for embeddings:
Setup: Shared Embedding FunctionΒΆ
All four tests use the same OutdoorGear product data and the same embedding model:
# shared.py
import os
import math
from openai import OpenAI
client = OpenAI(
base_url="https://models.inference.ai.azure.com",
api_key=os.environ["GITHUB_TOKEN"],
)
PRODUCTS = [
{"id": "P001", "name": "TrailBlazer Tent 2P", "category": "Tents", "price": 249.99, "weight": 1800},
{"id": "P002", "name": "Summit Dome 4P", "category": "Tents", "price": 549.99, "weight": 3200},
{"id": "P003", "name": "TrailBlazer Solo", "category": "Tents", "price": 299.99, "weight": 850},
{"id": "P004", "name": "ArcticDown -20Β°C Sleeping Bag", "category": "Sleeping Bags", "price": 389.99, "weight": 1400},
{"id": "P005", "name": "SummerLight +5Β°C Sleeping Bag", "category": "Sleeping Bags", "price": 149.99, "weight": 700},
{"id": "P006", "name": "Osprey Atmos 65L Backpack", "category": "Backpacks", "price": 289.99, "weight": 1980},
{"id": "P007", "name": "DayHiker 22L Daypack", "category": "Backpacks", "price": 89.99, "weight": 580},
]
def embed(text: str) -> list[float]:
response = client.embeddings.create(model="text-embedding-3-small", input=text)
return response.data[0].embedding
def product_text(p: dict) -> str:
return f"{p['name']}: {p['category']} product, ${p['price']:.2f}, {p['weight']}g"
Option A: Chroma (Local, No Setup)ΒΆ
Chroma is the easiest way to get started β pure Python, runs in-memory or on-disk:
# option_a_chroma.py
import chromadb
from shared import PRODUCTS, embed, product_text
import time
print("=== Option A: ChromaDB (Local) ===\n")
# In-memory collection (no persistence β great for prototyping)
chroma = chromadb.Client()
collection = chroma.create_collection("outdoorgear_products")
# Ingest
start = time.time()
collection.add(
ids=[p["id"] for p in PRODUCTS],
embeddings=[embed(product_text(p)) for p in PRODUCTS],
documents=[product_text(p) for p in PRODUCTS],
metadatas=[{"category": p["category"], "price": p["price"]} for p in PRODUCTS],
)
ingest_ms = (time.time() - start) * 1000
# Query
query = "lightweight tent for solo backpacking"
start = time.time()
results = collection.query(
query_embeddings=[embed(query)],
n_results=3,
)
query_ms = (time.time() - start) * 1000
print(f"Ingest: {ingest_ms:.0f}ms | Query: {query_ms:.0f}ms")
print(f"\nTop 3 results for '{query}':")
for doc, dist in zip(results["documents"][0], results["distances"][0]):
similarity = 1 - dist # Chroma returns distance, not similarity
print(f" {similarity:.3f} | {doc}")
# Filtered query (Chroma supports simple metadata filtering)
start = time.time()
filtered = collection.query(
query_embeddings=[embed(query)],
n_results=3,
where={"category": "Tents"}, # β metadata filter
)
filtered_ms = (time.time() - start) * 1000
print(f"\nFiltered to 'Tents' only ({filtered_ms:.0f}ms):")
for doc in filtered["documents"][0]:
print(f" {doc}")
Option B: Qdrant (Local Server Mode)ΒΆ
Qdrant offers advanced filtering and scales to hundreds of millions of vectors:
# Run Qdrant locally with Docker (or use Qdrant Cloud free tier)
docker run -p 6333:6333 qdrant/qdrant
# option_b_qdrant.py
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue, Range
from shared import PRODUCTS, embed, product_text
import time
print("=== Option B: Qdrant (Local) ===\n")
client = QdrantClient("localhost", port=6333)
COLLECTION = "outdoorgear"
VECTOR_SIZE = 1536 # text-embedding-3-small
# Create collection
client.recreate_collection(
collection_name=COLLECTION,
vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
)
# Ingest with payload (rich metadata support)
start = time.time()
points = [
PointStruct(
id=int(p["id"][1:]), # P001 β 1
vector=embed(product_text(p)),
payload={
"id": p["id"],
"name": p["name"],
"category": p["category"],
"price": p["price"],
"weight": p["weight"],
},
)
for p in PRODUCTS
]
client.upsert(collection_name=COLLECTION, points=points)
ingest_ms = (time.time() - start) * 1000
# Semantic search
query = "something warm for cold winter nights"
start = time.time()
results = client.search(
collection_name=COLLECTION,
query_vector=embed(query),
limit=3,
)
query_ms = (time.time() - start) * 1000
print(f"Ingest: {ingest_ms:.0f}ms | Query: {query_ms:.0f}ms")
print(f"\nTop 3 for '{query}':")
for r in results:
print(f" {r.score:.3f} | [{r.payload['id']}] {r.payload['name']}")
# Advanced filter: Tents under $300
start = time.time()
filtered = client.search(
collection_name=COLLECTION,
query_vector=embed("lightweight shelter"),
limit=3,
query_filter=Filter(
must=[
FieldCondition(key="category", match=MatchValue(value="Tents")),
FieldCondition(key="price", range=Range(lte=300.0)),
]
),
)
filtered_ms = (time.time() - start) * 1000
print(f"\nTents under $300 ({filtered_ms:.0f}ms):")
for r in filtered:
print(f" {r.score:.3f} | [{r.payload['id']}] {r.payload['name']} ${r.payload['price']:.2f}")
Option C: pgvector (Azure PostgreSQL or Local)ΒΆ
See Lab 031 for the full pgvector setup. Quick comparison:
# option_c_pgvector_query.py
import psycopg2
import os
from shared import embed
# Using Azure PostgreSQL with pgvector
conn = psycopg2.connect(
host=os.environ["PG_HOST"],
dbname=os.environ["PG_DATABASE"],
user=os.environ["PG_USER"],
password=os.environ["PG_PASSWORD"],
sslmode="require",
)
cur = conn.cursor()
query_vec = embed("lightweight tent for solo backpacking")
query_str = "[" + ",".join(str(v) for v in query_vec) + "]"
# Cosine similarity search using <=> operator
cur.execute("""
SELECT p.name, p.category, p.price_usd,
1 - (pe.embedding <=> %s::vector) AS similarity
FROM product_embeddings pe
JOIN products p ON p.id = pe.product_id
ORDER BY pe.embedding <=> %s::vector
LIMIT 3;
""", [query_str, query_str])
print("=== Option C: pgvector ===")
for name, category, price, sim in cur.fetchall():
print(f" {sim:.3f} | {name} ({category}) ${price:.2f}")
Option D: Azure AI Search (Hybrid Search)ΒΆ
Azure AI Search uniquely supports hybrid search: vector + BM25 keyword search combined with semantic reranking:
# option_d_azure_search.py
# pip install azure-search-documents
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from azure.core.credentials import AzureKeyCredential
from shared import embed
import os
client = SearchClient(
endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
index_name="outdoorgear-products",
credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)
query = "lightweight backpacking shelter"
query_vec = embed(query)
# Hybrid search: vector + keyword + semantic reranking
results = client.search(
search_text=query, # BM25 keyword search
vector_queries=[
VectorizedQuery(
vector=query_vec,
k_nearest_neighbors=3,
fields="embedding", # the vector field in your index
)
],
query_type="semantic", # semantic reranking (requires Semantic tier)
semantic_configuration_name="default",
top=3,
)
print("=== Option D: Azure AI Search (Hybrid) ===")
for r in results:
print(f" @score={r['@search.score']:.3f} | [{r['id']}] {r['name']}")
Decision FrameworkΒΆ
Start here:
β
Already using PostgreSQL?
β YES β Use pgvector (Lab 031)
β NO β
Need Azure-native + hybrid search?
β YES β Azure AI Search
β NO β
Local dev / prototype?
β YES β Chroma (zero setup)
β NO β
Need advanced filtering + high scale?
β YES β Qdrant
β NO β Chroma or pgvector
Comparison TableΒΆ
| pgvector | Chroma | Qdrant | Azure AI Search | |
|---|---|---|---|---|
| Setup | Medium (DB needed) | Minimal | Easy (Docker) | Medium (Azure) |
| Local dev | β (Docker) | β (in-memory) | β (Docker) | β (Azure only) |
| Hybrid search | β (manual) | β | β | β (built-in) |
| Filtering | SQL WHERE | Basic metadata | Advanced | Full OData |
| Scale | Moderate (< 1M) | Small (< 100K) | High (> 100M) | High (enterprise) |
| Azure integration | β (managed PG) | β | Partial | β (native) |
| Cost (free tier) | Free tier PG | Free | Qdrant Cloud free | 1 index free |
π§ Knowledge CheckΒΆ
1. What is hybrid search, and why is it often better than pure vector search?
Hybrid search combines vector (semantic) search with keyword (BM25) search and ranks results using both signals. Vector search excels at semantic understanding ("warm shelter" β sleeping bag) but can miss exact matches (a specific product ID). BM25 is excellent for exact keyword matches but misses synonyms. Combining them outperforms either alone, especially for product names, SKUs, and specialized terminology.
2. Why would you choose pgvector over a dedicated vector database like Qdrant?
If you already have PostgreSQL as your primary database, pgvector adds vector search without adding another service to operate, maintain, and pay for. The data lives alongside your relational data β you can JOIN product records with their embeddings in a single query. For most applications under 1M vectors, pgvector's performance is excellent. Choose Qdrant when you need > 100M vectors or very advanced filtering.
3. What is the IVFFlat index in pgvector, and when should you use HNSW instead?
IVFFlat (Inverted File Index with Flat quantization): fast to build, uses less memory, good for datasets that don't change frequently. Uses approximate search β lists parameter controls recall/speed trade-off. HNSW (Hierarchical Navigable Small World): better recall, faster queries, but higher memory usage and slower to build. Use IVFFlat for datasets < 1M that don't change much; use HNSW for frequently updated datasets or when recall is critical. Both require pgvector β₯ 0.5.0.
SummaryΒΆ
For the OutdoorGear learning hub scenario (< 10K products, Azure infrastructure, team knows SQL):
Recommended: pgvector on Azure Database for PostgreSQL Flexible Server.
- No new service to learn
- SQL + vector in one query
- Free tier available
- Production-ready with Lab 031 migrations
Next StepsΒΆ
- Build the pgvector setup: β Lab 031 β pgvector Semantic Search
- Full RAG application: β Lab 022 β RAG with GitHub Models + pgvector
- Enterprise RAG with evaluation: β Lab 042 β Enterprise RAG