
Approximate Nearest Neighbor Search: HNSW vs IVF PQ at Billion Scale

Approximate Nearest Neighbor (ANN) search is the foundation of candidate retrieval in recommendation and semantic search. Exact brute-force search with Euclidean distance is feasible up to a few million vectors with Single Instruction Multiple Data (SIMD) or Graphics Processing Unit (GPU) acceleration, but billion-scale indices require approximation algorithms that trade recall for massive latency and memory savings. The two dominant production approaches are Hierarchical Navigable Small World (HNSW) graphs and Inverted File with Product Quantization (IVF PQ), each optimizing a different corner of the latency, memory, and recall triangle.

HNSW builds a multi-layer graph in which each node represents a vector and edges connect approximate nearest neighbors. Search starts at the top layer and greedily navigates toward the query, dropping to lower layers with denser connectivity as it gets closer. This structure delivers high recall, often 0.95 to 0.98 for top-100 queries, with sub-10 ms latency on Central Processing Unit (CPU) for tens to hundreds of millions of vectors. The cost is memory: roughly 1 to 1.5 times the raw vector size, plus 64 to 128 bytes per vector for graph edges depending on the M parameter, which controls neighbor count. For 100 million 128-dimensional float32 vectors, expect 50 to 70 gigabytes (GB) of total memory. Amazon OpenSearch and Pinterest use HNSW for image and product retrieval at this scale.

IVF PQ compresses vectors aggressively to enable billion-scale search on commodity hardware. The index first partitions the vector space into coarse clusters using k-means, creating an inverted file structure. Each vector is then quantized with product quantization: the 128 dimensions are split into 16 subvectors of 8 dimensions each, and each subvector is encoded as an 8-bit code via a learned codebook. This compresses each vector from 512 bytes (128 floats) to just 16 bytes. Search probes a subset of clusters, typically 64 to 256 out of 262,144, and computes approximate distances using the quantized codes. Meta reports 0.9 recall at under 20 ms on CPU for 1 billion vectors with FAISS IVF PQ, using prefetch optimizations to hide memory latency.

The trade-off is nuanced. HNSW wins on latency and recall for datasets that fit in memory, making it ideal for hundreds of millions of high-value items such as products or images. IVF PQ scales to billions by sacrificing some recall and adding quantization error, but keeps infrastructure costs manageable. For a 1 billion vector index, HNSW would require 500 to 700 GB of memory across the cluster, while IVF PQ needs only 16 GB for codes plus overhead for centroids. GPU acceleration helps both but favors IVF PQ, because batch distance computations on quantized codes map well to GPU parallelism, yielding a 5 to 10 times speedup. Google ScaNN reports similar compression and speedup on Tensor Processing Units (TPUs).

Failure modes matter in production. As the model retrains and vector distributions drift, fixed index parameters can degrade recall silently. A system tuned for 0.92 recall might drop to 0.85 after a major model update, degrading recommendation quality without any obvious errors. Monitor recall by running shadow queries against a small exact-search reference on a sampled subset. Rebuilding a billion-scale index takes hours, so use blue-green deployment: build the new index in parallel, validate offline recall, then atomically swap routing.
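The parameters discussed above map directly onto FAISS, the library named in the Meta example. Below is a minimal sketch of building both index types; the corpus is random stand-in data, and the parameter values (M = 32, efSearch = 200, nlist = 4,096, nprobe = 64) are illustrative choices, not the production settings of any system mentioned here.

```python
import numpy as np
import faiss  # https://github.com/facebookresearch/faiss

d = 128                                              # vector dimensionality
xb = np.random.rand(1_000_000, d).astype("float32")  # stand-in corpus
xq = np.random.rand(10, d).astype("float32")         # stand-in queries

# HNSW: full-precision vectors plus a navigable graph (M controls edges per node).
hnsw = faiss.IndexHNSWFlat(d, 32)      # M = 32
hnsw.hnsw.efSearch = 200               # breadth of the greedy search at query time
hnsw.add(xb)

# IVF PQ: coarse k-means clustering plus product quantization
# (16 subvectors x 8 bits = 16 bytes per code instead of 512 bytes of raw float32).
nlist = 4096                           # coarse clusters; on the order of 262k at billion scale
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, 16, 8)
ivfpq.train(xb)                        # learns coarse centroids and PQ codebooks
ivfpq.add(xb)
ivfpq.nprobe = 64                      # clusters probed per query (recall/latency knob)

distances, ids = ivfpq.search(xq, 100) # approximate top-100 neighbors
```

Raising efSearch or nprobe buys recall at the cost of latency; lowering them does the opposite, which is why both knobs need re-validation whenever the embedding model or corpus changes.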
💡 Key Takeaways
HNSW delivers 0.95 to 0.98 recall with 5 to 10 ms latency on CPU for 100 to 500 million vectors but requires 500 to 700 GB memory for 1 billion vectors due to graph overhead of 64 to 128 bytes per node.
IVF PQ compresses vectors from 512 bytes to 16 bytes using product quantization, enabling 1 billion vectors in 16 to 30 GB of memory (see the sizing sketch after the examples). Meta's FAISS achieves 0.9 recall at under 20 ms on CPU by probing 64 to 256 clusters.
The trade-off is recall and latency versus memory cost. HNSW suits high-value datasets that fit in memory, like product catalogs. IVF PQ scales to billions at lower recall, ideal for candidate retrieval where downstream ranking refines results.
GPU acceleration provides 5 to 10 times speedup but increases cost per QPS. Batch queries across requests to amortize GPU overhead, but watch for queueing delays that hurt P99 latency.
Model retraining drifts vector distributions, silently degrading recall, for example from 0.92 to 0.85. Monitor by running shadow exact search on sampled queries (a monitoring sketch follows this list). Rebuild indices with blue-green deployment to avoid routing mismatches during multi-hour rebuilds.
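One way to implement the shadow-query monitoring from the last takeaway, sketched with FAISS: keep an exact (brute-force) reference index over a sampled subset of the corpus, retaining original corpus IDs, and compare its top-k against the production ANN index for a sample of recent queries. The function name, the ID-mapping approach, and the sampling strategy are assumptions for illustration, not a prescribed FAISS recipe.

```python
import numpy as np
import faiss

def shadow_recall_at_k(ann_index, ref_vectors, ref_ids, queries, k=100):
    """Estimate recall@k of a production ANN index against exact search
    over a sampled reference set.

    ref_vectors: float32 [n, d] sample of corpus vectors
    ref_ids:     int64 [n] original corpus IDs of those vectors
    queries:     float32 [m, d] sample of recent production queries
    """
    exact = faiss.IndexIDMap(faiss.IndexFlatL2(ref_vectors.shape[1]))
    exact.add_with_ids(ref_vectors, ref_ids)           # keep original corpus IDs
    _, truth = exact.search(queries, k)                # exact top-k within the sample
    _, approx = ann_index.search(queries, k)           # ANN top-k over the full corpus
    hits = sum(len(set(t) & set(a)) for t, a in zip(truth, approx))
    # Note: this is a proxy. With a small sample, some sampled ground-truth
    # neighbors legitimately fall outside the full-corpus top-k, so track the
    # trend over time (and alert on drops) rather than the absolute value.
    return hits / (len(queries) * k)
```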
📌 Examples
Meta uses FAISS IVF PQ for billion scale item retrieval in recommendation, achieving 0.9 recall at 15 to 20 ms per query on CPU with 262,144 coarse centroids and 16 byte quantized vectors.
Pinterest's HNSW-based image retrieval handles 200 million images with 10 times the throughput of exact search, serving sub-10 ms latencies at 0.96 recall with M = 16 and efSearch = 200.
Amazon OpenSearch supports both HNSW and IVF PQ for product search. A fashion catalog with 50 million items uses HNSW at 8 ms median latency, while a corpus of 2 billion listings uses IVF PQ sharded across 80 nodes.
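To make the memory figures in this article concrete, here is a back-of-envelope sizing sketch. The constants mirror the numbers used above (4-byte float32 components, 64 to 128 bytes of HNSW graph overhead per vector, 16-byte PQ codes, 262,144 coarse centroids); the 8-byte per-vector ID term and the helper functions themselves are illustrative assumptions, not exact accounting for any particular library.

```python
def hnsw_memory_gb(n_vectors, dim=128, graph_bytes_per_node=96):
    # Raw float32 vectors plus per-node edge lists (64-128 B depending on M).
    return n_vectors * (dim * 4 + graph_bytes_per_node) / 1e9

def ivfpq_memory_gb(n_vectors, code_bytes=16, nlist=262_144, dim=128):
    codes = n_vectors * code_bytes     # product-quantized codes
    ids = n_vectors * 8                # per-vector IDs in the inverted lists (assumed 8 B)
    centroids = nlist * dim * 4        # coarse centroid table
    return (codes + ids + centroids) / 1e9

print(hnsw_memory_gb(1_000_000_000))   # ~608 GB, inside the 500-700 GB range cited above
print(ivfpq_memory_gb(1_000_000_000))  # ~24 GB, inside the 16-30 GB range cited above
```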