
Index Architecture: Memory, Quantization, and Approximate Search

MEMORY REQUIREMENTS

Raw storage: 768-dimensional float32 vectors require 768 × 4 bytes = 3KB per embedding. At scale: 1 billion embeddings × 3KB = 3 terabytes just for vector storage. Brute-force search over 3TB is prohibitively slow and expensive. Solutions: vector quantization to reduce storage and approximate search to reduce computation.
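The memory math above is worth being able to reproduce on demand. A quick sanity check:

```python
# Back-of-envelope memory math for raw float32 embeddings.
DIM = 768
BYTES_PER_FLOAT32 = 4
NUM_VECTORS = 1_000_000_000

bytes_per_vector = DIM * BYTES_PER_FLOAT32      # 3072 bytes ~= 3 KB
total_bytes = NUM_VECTORS * bytes_per_vector    # ~3 TB for 1B vectors

print(bytes_per_vector)       # 3072
print(total_bytes / 1e12)     # ~3.07 (TB)
```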

QUANTIZATION OPTIONS

Scalar quantization: Convert each float32 to int8. 4x compression with 2-5% recall loss. Simple implementation, fast inference. Good default choice.
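A minimal sketch of symmetric scalar quantization with a per-vector scale (one of several possible schemes; real libraries often use per-dimension min/max calibration instead):

```python
import numpy as np

def scalar_quantize(x: np.ndarray):
    """Map float32 values to int8 using a per-vector scale (symmetric scheme)."""
    scale = np.abs(x).max() / 127.0
    codes = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction of the original float32 vector."""
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
v = rng.standard_normal(768).astype(np.float32)
q, scale = scalar_quantize(v)
v_hat = dequantize(q, scale)

# int8 storage is 4x smaller; reconstruction error stays small
# relative to the vector's norm.
rel_err = np.linalg.norm(v - v_hat) / np.linalg.norm(v)
```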

Product quantization (PQ): Split each vector into M subvectors, quantize each subvector to one of 256 centroids, and store the centroid index as 1 byte. With M=64 subvectors of 12 dims each: 768 dims → 64 bytes = 48x compression. Distances are computed via pre-computed lookup tables instead of a full dot product. Typical recall loss: 5-10%.
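A rough end-to-end sketch of PQ on synthetic data, with M=64 to match the 64-byte figure above. A few Lloyd iterations stand in for a proper k-means trainer; production libraries train much more carefully:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, M, K = 768, 64, 256       # 64 subvectors, 256 centroids -> 1 byte each
SUB = DIM // M                 # 12 dims per subvector

train = rng.standard_normal((1000, DIM)).astype(np.float32)

# Train one tiny k-means per subspace (a few Lloyd iterations; sketch only).
codebooks = np.empty((M, K, SUB), dtype=np.float32)
for m in range(M):
    sub = train[:, m*SUB:(m+1)*SUB]
    cent = sub[rng.choice(len(sub), K, replace=False)]
    for _ in range(5):
        assign = ((sub[:, None, :] - cent[None]) ** 2).sum(-1).argmin(1)
        for k in np.unique(assign):
            cent[k] = sub[assign == k].mean(0)
    codebooks[m] = cent

def pq_encode(x: np.ndarray) -> np.ndarray:
    """Encode a 768-dim vector as M one-byte centroid indices (64 bytes)."""
    codes = np.empty(M, dtype=np.uint8)
    for m in range(M):
        sub = x[m*SUB:(m+1)*SUB]
        codes[m] = ((codebooks[m] - sub) ** 2).sum(1).argmin()
    return codes

def adc_distance(query: np.ndarray, codes: np.ndarray) -> float:
    """Asymmetric distance: pre-compute an (M, K) table of query-to-centroid
    distances, then sum M lookups instead of a full 768-dim computation."""
    table = ((codebooks - query.reshape(M, 1, SUB)) ** 2).sum(-1)
    return table[np.arange(M), codes].sum()

x = rng.standard_normal(DIM).astype(np.float32)
codes = pq_encode(x)           # 64 bytes instead of 3072
```

The lookup-table trick is the key: for each query the table is built once, and every database vector then costs only M additions.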

Binary quantization: Each dimension becomes 1 bit. 768 dims → 96 bytes = 32x compression. Use Hamming distance for fast comparison. Works for coarse filtering but too lossy for final ranking.
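A sign-based binary code is the simplest variant; a sketch with numpy's bit-packing helpers:

```python
import numpy as np

def binarize(x: np.ndarray) -> np.ndarray:
    """1 bit per dimension (sign), packed into bytes: 768 dims -> 96 bytes."""
    return np.packbits(x > 0)

def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """XOR the packed codes and count the differing bits."""
    return int(np.unpackbits(a ^ b).sum())

rng = np.random.default_rng(0)
v = rng.standard_normal(768).astype(np.float32)
w = rng.standard_normal(768).astype(np.float32)
bv, bw = binarize(v), binarize(w)
```

This is why binary codes suit coarse filtering: Hamming distance over 96 bytes is a handful of XOR-and-popcount instructions, after which surviving candidates are re-ranked with full-precision vectors.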

APPROXIMATE SEARCH ALGORITHMS

HNSW: Build a navigable graph where nearby vectors are connected. Search via greedy graph traversal starting from entry point. High recall (95%+) at 1-5ms latency. Trade-off: memory-intensive (stores full vectors plus graph edges).
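The core of HNSW's query path is greedy traversal. A heavily simplified single-layer sketch on a flat k-nearest-neighbor graph (real HNSW builds a multi-layer graph incrementally and keeps a candidate beam, not just the single best node):

```python
import numpy as np

rng = np.random.default_rng(0)
N, DIM, DEGREE = 500, 32, 8
data = rng.standard_normal((N, DIM)).astype(np.float32)

# Build a crude proximity graph: each node links to its DEGREE nearest neighbors.
d2 = ((data[:, None] - data[None]) ** 2).sum(-1)
neighbors = d2.argsort(1)[:, 1:DEGREE + 1]   # skip index 0 (the node itself)

def greedy_search(query: np.ndarray, entry: int = 0) -> int:
    """Hop to whichever neighbor is closest to the query; stop at a local minimum."""
    cur = entry
    cur_d = ((data[cur] - query) ** 2).sum()
    while True:
        cand = neighbors[cur]
        cand_d = ((data[cand] - query) ** 2).sum(1)
        best = cand_d.argmin()
        if cand_d[best] >= cur_d:
            return cur                        # no neighbor improves: done
        cur, cur_d = cand[best], cand_d[best]

q = rng.standard_normal(DIM).astype(np.float32)
found = greedy_search(q)
```

Each hop moves strictly closer to the query, so the walk terminates; the multi-layer structure and beam width (`efSearch`) in real HNSW exist to keep this greedy walk from stopping at a poor local minimum.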

IVF (Inverted File): Cluster vectors into centroids and store each vector in an inverted list keyed by its nearest centroid. At query time, search only the nearest clusters. Recall scales with the number of clusters probed (nprobe): fewer probes means faster queries but lower recall. Far more memory-efficient than HNSW since no graph is stored; combine IVF with PQ for the best memory efficiency at scale.
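A compact IVF sketch on synthetic data, again with a few Lloyd iterations standing in for real coarse-quantizer training; the list count and probe count here are illustrative, not tuned:

```python
import numpy as np

rng = np.random.default_rng(0)
N, DIM, NLIST, NPROBE = 2000, 64, 16, 4
data = rng.standard_normal((N, DIM)).astype(np.float32)

# "Train" coarse centroids with a few Lloyd iterations (sketch only).
cent = data[rng.choice(N, NLIST, replace=False)]
for _ in range(5):
    assign = ((data[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
    for c in range(NLIST):
        if (assign == c).any():
            cent[c] = data[assign == c].mean(0)

# Inverted lists: vector ids grouped under their nearest centroid.
assign = ((data[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
lists = [np.where(assign == c)[0] for c in range(NLIST)]

def ivf_search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Probe only the NPROBE closest clusters instead of scanning all N vectors."""
    probe = ((cent - query) ** 2).sum(1).argsort()[:NPROBE]
    ids = np.concatenate([lists[c] for c in probe])
    d = ((data[ids] - query) ** 2).sum(1)
    return ids[d.argsort()[:k]]

q = rng.standard_normal(DIM).astype(np.float32)
res = ivf_search(q)
```

With NPROBE=4 of 16 lists, each query scans roughly a quarter of the data; raising NPROBE trades latency for recall.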

⚠️ Key Trade-off: More compression = less memory = lower recall. Always tune by measuring recall@K on held-out validation data. Target: 95% recall with minimum memory footprint for your latency budget.
💡 Key Takeaways
Raw storage: 768-dim = 3KB per vector; 1B vectors = 3TB
PQ: 48x compression with 5-10% recall loss via lookup tables
HNSW: high recall but expensive memory; IVF+PQ: best memory efficiency
📌 Interview Tips
1. Explain the memory math first: 1B × 3KB = 3TB uncompressed, then describe the compression options.
2. Compare HNSW vs IVF: HNSW for latency, IVF+PQ for memory-constrained environments.