Index Architecture: Memory, Quantization, and Approximate Search
MEMORY REQUIREMENTS
Raw storage: A 768-dimensional float32 vector requires 768 × 4 bytes = 3,072 bytes (~3 KB) per embedding. At scale, 1 billion embeddings × 3 KB ≈ 3 TB for vector storage alone. Brute-force search over 3 TB is prohibitively slow and expensive. Two complementary remedies: vector quantization to shrink storage, and approximate search to cut computation.
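The storage arithmetic above can be checked directly; a tiny sketch (variable names are illustrative):

```python
# Back-of-envelope storage math for raw float32 embeddings.
dims = 768
bytes_per_float32 = 4
bytes_per_vector = dims * bytes_per_float32      # 3,072 bytes (~3 KB)

num_vectors = 1_000_000_000                      # 1 billion embeddings
total_bytes = num_vectors * bytes_per_vector
total_tb = total_bytes / 1e12                    # decimal terabytes

print(bytes_per_vector, round(total_tb, 2))      # ~3 TB of raw vectors
```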
QUANTIZATION OPTIONS
Scalar quantization: Convert each float32 value to an int8 code using a stored scale factor. 4x compression with roughly 2-5% recall loss. Simple to implement, fast at query time. A good default choice.
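A minimal sketch of symmetric scalar quantization, assuming one scale factor per vector (real systems often learn scales per dimension or per batch; the names sq_encode/sq_decode are illustrative):

```python
import numpy as np

def sq_encode(x: np.ndarray):
    # One scale per vector: map the largest magnitude to 127.
    scale = max(float(np.abs(x).max()) / 127.0, 1e-12)  # guard against all-zero input
    codes = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return codes, scale

def sq_decode(codes: np.ndarray, scale: float) -> np.ndarray:
    return codes.astype(np.float32) * scale

rng = np.random.default_rng(0)
v = rng.standard_normal(768).astype(np.float32)
codes, scale = sq_encode(v)
v_hat = sq_decode(codes, scale)

print(v.nbytes, codes.nbytes)   # 3072 vs 768 bytes: the 4x compression
```

The reconstruction error per dimension is bounded by half the scale step, which is where the small recall loss comes from.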
Product quantization (PQ): Split the vector into M subvectors and quantize each subvector to one of 256 centroids, stored as a single byte. With M = 64, 768 dims → 64 bytes = 48x compression. Distances are computed via precomputed per-subspace lookup tables instead of a full dot product. Roughly 5-10% recall loss.
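A sketch of PQ encoding and asymmetric distance computation (ADC). To stay self-contained, the codebooks here are random samples of the data rather than trained k-means centroids, so this illustrates the mechanics (1 byte per subvector, distances via table lookups), not production recall; M = 64 and the helper names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
dims, M, K = 768, 64, 256            # 64 subvectors of 12 dims, 256 centroids each
sub = dims // M

data = rng.standard_normal((1000, dims)).astype(np.float32)
# "Codebooks": K centroids per subspace, sampled from the data for illustration.
codebooks = np.stack([
    data[rng.choice(len(data), K, replace=False), m * sub:(m + 1) * sub]
    for m in range(M)
])                                    # shape (M, K, sub)

def pq_encode(x):
    codes = np.empty(M, dtype=np.uint8)
    for m in range(M):
        d = ((codebooks[m] - x[m * sub:(m + 1) * sub]) ** 2).sum(axis=1)
        codes[m] = d.argmin()
    return codes                      # 64 bytes total

def adc_distance(q, codes):
    # Precompute per-subspace tables of squared distances from the query to
    # every centroid; distance to any database vector is then M lookups.
    tables = ((codebooks - q.reshape(M, 1, sub)) ** 2).sum(axis=2)  # (M, K)
    return tables[np.arange(M), codes].sum()

x, q = data[0], rng.standard_normal(dims).astype(np.float32)
codes = pq_encode(x)
x_hat = np.concatenate([codebooks[m, codes[m]] for m in range(M)])
# ADC equals the exact squared distance to the reconstructed vector:
print(np.isclose(adc_distance(q, codes), ((q - x_hat) ** 2).sum(), rtol=1e-3))
```

The lookup tables are built once per query, so the per-vector cost drops from a 768-dim distance to 64 table lookups and adds.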
Binary quantization: Each dimension becomes 1 bit (its sign). 768 dims → 96 bytes = 32x compression. Comparison uses Hamming distance, which is just XOR plus popcount. Works for coarse first-pass filtering but is too lossy for final ranking.
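A minimal sketch of sign-based binarization and Hamming distance; the byte-level popcount table is one simple way to do it in pure NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(768).astype(np.float32)
b = rng.standard_normal(768).astype(np.float32)

def binarize(x: np.ndarray) -> np.ndarray:
    # Keep only the sign of each dimension, packed 8 bits per byte.
    return np.packbits(x > 0)            # 768 bits -> 96 bytes

# Popcount lookup table for all 256 byte values.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint16)

def hamming(pa: np.ndarray, pb: np.ndarray) -> int:
    # XOR finds differing bits; the table counts them per byte.
    return int(POPCOUNT[np.bitwise_xor(pa, pb)].sum())

pa, pb = binarize(a), binarize(b)
print(pa.nbytes, hamming(pa, pb))        # 96 bytes per vector
```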
APPROXIMATE SEARCH ALGORITHMS
HNSW (Hierarchical Navigable Small World): Build a layered graph in which nearby vectors are connected; search by greedy graph traversal from an entry point in the top layer down to the bottom layer. High recall (95%+) at 1-5 ms latency. Trade-off: memory-intensive, since it stores full vectors plus graph edges.
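A sketch of the core greedy-traversal idea on a single-layer k-NN graph (built by brute force here, purely for illustration). Real HNSW adds the layer hierarchy, incremental graph construction, and a beam of candidates (the ef parameter) instead of this single-path walk:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 32)).astype(np.float32)
k = 8

# Brute-force k-NN graph: each vector links to its k nearest neighbors.
d2 = ((data[:, None, :] - data[None, :, :]) ** 2).sum(-1)
neighbors = np.argsort(d2, axis=1)[:, 1:k + 1]       # exclude self (column 0)

def greedy_search(query, entry=0):
    # Walk to whichever neighbor is closest to the query; stop at a local minimum.
    cur = entry
    cur_d = ((data[cur] - query) ** 2).sum()
    while True:
        cand = neighbors[cur]
        cand_d = ((data[cand] - query) ** 2).sum(axis=1)
        best = cand_d.argmin()
        if cand_d[best] >= cur_d:                    # no closer neighbor: done
            return cur
        cur, cur_d = cand[best], cand_d[best]

q = rng.standard_normal(32).astype(np.float32)
approx = greedy_search(q)
exact = ((data - q) ** 2).sum(axis=1).argmin()
# The greedy walk usually lands at or near the true nearest neighbor.
```

The walk strictly decreases the distance to the query at every step, so it always terminates; the hierarchy and beam in real HNSW exist to avoid the local minima this bare version can get stuck in.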
IVF (Inverted File): Cluster vectors into centroids and store an inverted list of vectors per cluster. At query time, search only the nearest clusters. Recall depends on how many clusters are probed, but memory efficiency is far better. Combining IVF with PQ gives the best memory efficiency at scale.
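A sketch of the IVF idea: a few rounds of plain Lloyd's k-means for the coarse clustering, an inverted list per centroid, and probing of only the nearest clusters at query time. The nlist/nprobe values are illustrative, not tuned:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((2000, 64)).astype(np.float32)
nlist, nprobe = 16, 4

# Coarse quantizer: plain Lloyd's k-means, initialized from data samples.
centroids = data[rng.choice(len(data), nlist, replace=False)].copy()
for _ in range(10):
    assign = ((data[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
    for c in range(nlist):
        members = data[assign == c]
        if len(members):
            centroids[c] = members.mean(0)

# Final assignment and one inverted list of vector ids per centroid.
assign = ((data[:, None] - centroids[None]) ** 2).sum(-1).argmin(1)
invlists = {c: np.flatnonzero(assign == c) for c in range(nlist)}

def ivf_search(q):
    # Rank centroids by distance to the query, scan only the nprobe nearest lists.
    probe = ((centroids - q) ** 2).sum(1).argsort()[:nprobe]
    cand = np.concatenate([invlists[c] for c in probe])
    d = ((data[cand] - q) ** 2).sum(1)
    return cand[d.argmin()]

q = rng.standard_normal(64).astype(np.float32)
print(ivf_search(q))
```

With nprobe = 4 of 16 lists, roughly a quarter of the database is scanned per query; in an IVF+PQ index the vectors inside each list would additionally be stored as PQ codes rather than raw floats.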