
Index Architecture: Memory, Quantization, and Approximate Search

At 100 million documents with 768-dimensional embeddings in float32, raw storage demands roughly 307 GB of memory. Exact nearest-neighbor search requires scanning all vectors, computing dot products or cosine similarities, and sorting the results; at 1,000 queries per second this is infeasible. Production systems compress vectors and use approximate nearest-neighbor indices that sacrifice small amounts of recall for orders-of-magnitude speedups.

Quantization is the primary compression technique. Float16 halves memory to roughly 154 GB with negligible quality loss. Product quantization divides each vector into subvectors, clusters each subspace separately, and stores only cluster IDs plus small codebooks, reducing storage to 5 to 20 bytes per vector. For 100 million items at 16 bytes per vector, total memory drops to roughly 1.6 GB plus codebook overhead, fitting comfortably in RAM. The tradeoff is that distance calculations become approximate, using precomputed lookup tables to estimate true distances. Meta's Faiss library implements product quantization with careful SIMD optimizations, achieving sub-10-millisecond search on CPUs for millions of vectors.

Index structures determine update and search performance. Inverted file (IVF) indices partition the space into coarse clusters using k-means and store vectors in buckets; at search time, the system probes only the nearest clusters, reducing comparisons from 100 million to perhaps 10,000 per query. Hierarchical Navigable Small World (HNSW) builds a multi-layer proximity graph in which each layer is a navigable small-world network: search starts at the sparse top layer and greedily navigates toward the query, descending through the layers to refine. HNSW delivers excellent recall and low latency, but it requires more memory, and while inserts are incremental, deletions typically require tombstones or periodic rebuilds. IVF with product quantization scales to billions of items with daily refreshes, while HNSW suits catalogs needing incremental inserts.

GPU acceleration changes the calculus. On a single GPU, Faiss can serve 1 million to 100 million vectors with sub-10-millisecond query latency at 0.9 recall, using product quantization and IVF. At 1 billion items, sharding across hosts or GPUs becomes necessary. Trade off memory budget, update frequency, and recall targets when choosing an index type; monitor recall on a held-out set and set latency budgets. Many teams maintain backstop keyword indices to catch queries where vector recall drops.
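The IVF-with-product-quantization recipe above maps directly onto Faiss. Below is a minimal sketch of building and querying such an index, assuming the faiss-cpu package and using toy sizes and random vectors in place of real document embeddings; the cluster count, probe count, and subvector settings are illustrative assumptions, not tuned production values.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768                # embedding dimension
m, nbits = 16, 8       # 16 subvectors x 8 bits -> 16 bytes per compressed vector
nlist = 256            # coarse clusters; production indices typically use thousands

# Toy data standing in for real document embeddings (assumption for the sketch).
xb = np.random.rand(100_000, d).astype("float32")
xq = np.random.rand(10, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)                  # coarse quantizer for IVF
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xb)        # learn k-means centroids and PQ codebooks
index.add(xb)          # store 16-byte codes per vector

index.nprobe = 50      # probe the 50 nearest clusters per query
distances, ids = index.search(xq, 10)             # approximate top-10 neighbors

# Back-of-envelope memory: 100M vectors x 16 bytes ~= 1.6 GB plus codebooks,
# versus 100M x 768 dims x 4 bytes ~= 307 GB for raw float32.
print(ids[0])
```

The main query-time knob is nprobe: probing more clusters raises recall at the cost of latency, which is how the recall and latency budgets discussed above get traded against each other.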
💡 Key Takeaways
At 100 million items with 768 dimensions in float32, raw storage is roughly 307 GB; product quantization compresses to 5 to 20 bytes per vector, reducing to 1.6 GB at 16 bytes per vector
Inverted file (IVF) indices partition space into coarse clusters and probe only nearest buckets at query time, reducing comparisons from millions to thousands
Hierarchical Navigable Small World (HNSW) builds multi-layer proximity graphs, delivering excellent recall and sub-10-millisecond latency but requiring more memory and offering only limited update support (inserts are incremental; deletes typically require tombstones or rebuilds)
Meta's Faiss on GPU achieves sub-10-millisecond query latency for 1 million to 100 million vectors at 0.9 recall using product quantization and IVF; billion-scale corpora require sharding
Approximate search can drop true neighbors; monitor recall on held-out sets and maintain backstop keyword indices to catch failures
Choose IVF with product quantization for large, static or daily-refreshed catalogs with tight memory budgets; choose HNSW for catalogs needing incremental inserts when more memory is available (a minimal sketch follows below)
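For the incremental-insert case in the last takeaway, here is a minimal Faiss HNSW sketch, again with toy sizes and random vectors standing in for real embeddings; the M, efConstruction, and efSearch values are illustrative assumptions rather than tuned settings.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 768
index = faiss.IndexHNSWFlat(d, 32)      # 32 graph neighbors per node (M)
index.hnsw.efConstruction = 200         # build-time search width
index.hnsw.efSearch = 64                # query-time search width (recall vs latency knob)

# Initial catalog (toy data standing in for product embeddings).
catalog = np.random.rand(50_000, d).astype("float32")
index.add(catalog)                      # no train() step needed for HNSW

# Later the same day: append new items without rebuilding the index.
new_items = np.random.rand(1_000, d).astype("float32")
index.add(new_items)

queries = np.random.rand(5, d).astype("float32")
distances, ids = index.search(queries, 10)
```

efSearch is the main recall-versus-latency knob at query time; because IndexHNSWFlat stores full float32 vectors alongside the graph, its memory footprint is larger than an IVF-PQ index over the same data.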
📌 Examples
100 million documents at 768 dimensions: product quantization at 16 bytes per vector yields 1.6 GB of memory; IVF probes 50 clusters per query and achieves 8-millisecond latency at 0.88 recall on CPU
1-billion-item catalog: shard across 10 hosts with 100 million items each, assign items to shards via consistent hashing, fan each query out to all shards, and aggregate results in 15 to 25 milliseconds (see the scatter-gather sketch after this list)
HNSW index for a 10-million-product catalog adding 100,000 new items daily: supports online inserts, 5-millisecond query latency, 0.92 recall with a 12 GB memory footprint
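A minimal scatter-gather sketch for the billion-scale sharding example above: each shard holds a slice of the catalog, every query fans out to all shards, and the per-shard top-k results are merged into a global top-k. The shard count, shard size, and flat inner-product indices here are toy stand-ins for per-host IVF-PQ shards.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d, k = 768, 10
shard_size, num_shards = 50_000, 4      # toy numbers; the example above uses 10 hosts

# One index per shard; in production each shard lives on its own host or GPU,
# and these flat indices would be IVF-PQ indices instead.
shards = []
for s in range(num_shards):
    idx = faiss.IndexFlatIP(d)
    idx.add(np.random.rand(shard_size, d).astype("float32"))
    shards.append(idx)

def search_sharded(xq):
    """Fan the queries out to every shard, then merge per-shard top-k by score."""
    scores, ids = [], []
    for s, idx in enumerate(shards):
        D, I = idx.search(xq, k)        # local top-k on this shard
        scores.append(D)
        ids.append(I + s * shard_size)  # offset local ids into a global id space
    scores = np.concatenate(scores, axis=1)
    ids = np.concatenate(ids, axis=1)
    order = np.argsort(-scores, axis=1)[:, :k]   # global top-k (higher score = better)
    return np.take_along_axis(scores, order, 1), np.take_along_axis(ids, order, 1)

queries = np.random.rand(2, d).astype("float32")
top_scores, top_ids = search_sharded(queries)
```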