
Index Architecture: Memory, Quantization, and Approximate Search

At 100 million documents with 768-dimensional embeddings in float32, raw storage demands roughly 307 GB of memory. Exact nearest-neighbor search requires scanning all vectors, computing dot products or cosine similarities, and sorting the results; at 1,000 queries per second this is infeasible. Production systems compress vectors and use approximate nearest-neighbor indices that sacrifice small amounts of recall for orders-of-magnitude speedups.

Quantization is the primary compression technique. Float16 halves memory to roughly 154 GB with negligible quality loss. Product quantization divides each vector into subvectors, clusters each subspace separately, and stores only cluster IDs plus small codebooks, reducing storage to 5 to 20 bytes per vector. For 100 million items at 16 bytes per vector, total memory drops to roughly 1.6 GB plus codebook overhead, fitting comfortably in RAM. The tradeoff is that distance calculations become approximate, using precomputed lookup tables to estimate true distances. Meta's Faiss library implements product quantization with careful SIMD optimizations, achieving sub-10-millisecond search on CPUs for millions of vectors.

Index structures determine update and search performance. Inverted file (IVF) indices partition the space into coarse clusters using k-means and store vectors in buckets; at search time, the system probes only the nearest clusters, reducing comparisons from 100 million to perhaps 10,000 per query. Hierarchical Navigable Small World (HNSW) builds a multi-layer proximity graph in which each layer is a navigable small-world network: search starts at the sparse top layer and greedily navigates toward the query, descending through the layers to refine. HNSW delivers excellent recall and low latency, but it requires more memory, and while inserts are incremental, deletions typically require tombstones or periodic rebuilds. IVF with product quantization scales to billions of items with daily refreshes, while HNSW suits catalogs needing incremental inserts.

GPU acceleration changes the calculus. On a single GPU, Faiss can serve 1 million to 100 million vectors with sub-10-millisecond query latency at 0.9 recall, using product quantization and IVF. At 1 billion items, sharding across hosts or GPUs becomes necessary. Trade off memory budget, update frequency, and recall targets when choosing an index type; monitor recall on a held-out set and set latency budgets. Many teams maintain backstop keyword indices to catch queries where vector recall drops.
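The IVF-with-product-quantization recipe above maps directly onto Faiss. Below is a minimal sketch of building and querying such an index, assuming the faiss-cpu package and using toy sizes and random vectors in place of real document embeddings; the cluster count, probe count, and subvector settings are illustrative assumptions, not tuned production values.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768                # embedding dimension
m, nbits = 16, 8       # 16 subvectors x 8 bits -> 16 bytes per compressed vector
nlist = 256            # coarse clusters; production indices typically use thousands

# Toy data standing in for real document embeddings (assumption for the sketch).
xb = np.random.rand(100_000, d).astype("float32")
xq = np.random.rand(10, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)                  # coarse quantizer for IVF
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)        # learn k-means centroids and PQ codebooks
index.add(xb)          # store 16-byte codes per vector

index.nprobe = 50      # probe the 50 nearest clusters per query
distances, ids = index.search(xq, 10)             # approximate top-10 neighbors

# Back-of-envelope memory: 100M vectors x 16 bytes ~= 1.6 GB plus codebooks,
# versus 100M x 768 dims x 4 bytes ~= 307 GB for raw float32.
print(ids[0])
```

The main query-time knob is nprobe: probing more clusters raises recall at the cost of latency, which is how the recall and latency budgets discussed above get traded against each other.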
💡 Key Takeaways
At 100 million items with 768 dimensions in float32, raw storage is roughly 307 GB; product quantization compresses to 5 to 20 bytes per vector, reducing to 1.6 GB at 16 bytes per vector
Inverted file (IVF) indices partition space into coarse clusters and probe only nearest buckets at query time, reducing comparisons from millions to thousands
Hierarchical Navigable Small World (HNSW) builds multi-layer proximity graphs, delivering excellent recall and sub-10-millisecond latency but requiring more memory and offering only limited update support (inserts are incremental; deletes typically require tombstones or rebuilds)
Meta's Faiss on GPU achieves sub-10-millisecond query latency for 1 million to 100 million vectors at 0.9 recall using product quantization and IVF; billion-scale corpora require sharding
Approximate search can drop true neighbors; monitor recall on held-out sets and maintain backstop keyword indices to catch failures
Choose IVF with product quantization for large, static or daily-refreshed catalogs with tight memory budgets; choose HNSW for catalogs needing incremental inserts when more memory is available (a minimal sketch follows below)
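For the incremental-insert case in the last takeaway, here is a minimal Faiss HNSW sketch, again with toy sizes and random vectors standing in for real embeddings; the M, efConstruction, and efSearch values are illustrative assumptions rather than tuned settings.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768
index = faiss.IndexHNSWFlat(d, 32)      # 32 graph neighbors per node (M)
index.hnsw.efConstruction = 200         # build-time search width
index.hnsw.efSearch = 64                # query-time search width (recall vs latency knob)

# Initial catalog (toy data standing in for product embeddings).
catalog = np.random.rand(50_000, d).astype("float32")
index.add(catalog)                      # no train() step needed for HNSW

# Later the same day: append new items without rebuilding the index.
new_items = np.random.rand(1_000, d).astype("float32")
index.add(new_items)

queries = np.random.rand(5, d).astype("float32")
distances, ids = index.search(queries, 10)
```

efSearch is the main recall-versus-latency knob at query time; because IndexHNSWFlat stores full float32 vectors alongside the graph, its memory footprint is larger than an IVF-PQ index over the same data.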
📌 Examples
100 million documents at 768 dimensions: product quantization at 16 bytes per vector yields 1.6 GB of memory; IVF probes 50 clusters per query and achieves 8-millisecond latency at 0.88 recall on CPU
1-billion-item catalog: shard across 10 hosts with 100 million items each, assign items to shards via consistent hashing, fan each query out to all shards, and aggregate results in 15 to 25 milliseconds (see the scatter-gather sketch after this list)
HNSW index for a 10-million-product catalog adding 100,000 new items daily: supports online inserts, 5-millisecond query latency, 0.92 recall with a 12 GB memory footprint
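A minimal scatter-gather sketch for the billion-scale sharding example above: each shard holds a slice of the catalog, every query fans out to all shards, and the per-shard top-k results are merged into a global top-k. The shard count, shard size, and flat inner-product indices here are toy stand-ins for per-host IVF-PQ shards.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, k = 768, 10
shard_size, num_shards = 50_000, 4      # toy numbers; the example above uses 10 hosts

# One index per shard; in production each shard lives on its own host or GPU,
# and these flat indices would be IVF-PQ indices instead.
shards = []
for s in range(num_shards):
    idx = faiss.IndexFlatIP(d)
    idx.add(np.random.rand(shard_size, d).astype("float32"))
    shards.append(idx)

def search_sharded(xq):
    """Fan the queries out to every shard, then merge per-shard top-k by score."""
    scores, ids = [], []
    for s, idx in enumerate(shards):
        D, I = idx.search(xq, k)        # local top-k on this shard
        scores.append(D)
        ids.append(I + s * shard_size)  # offset local ids into a global id space
    scores = np.concatenate(scores, axis=1)
    ids = np.concatenate(ids, axis=1)
    order = np.argsort(-scores, axis=1)[:, :k]   # global top-k (higher score = better)
    return np.take_along_axis(scores, order, 1), np.take_along_axis(ids, order, 1)

queries = np.random.rand(2, d).astype("float32")
top_scores, top_ids = search_sharded(queries)
```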