
Approximate Nearest Neighbor Search: Trading Exactness for Scale

Approximate Nearest Neighbor (ANN) search is the foundation of modern retrieval systems, trading perfect accuracy for dramatic gains in speed and memory efficiency. Instead of exhaustively comparing a query vector against every vector in the database (which becomes impossibly slow at millions of items), ANN algorithms narrow the search to a small candidate set, then rank only those candidates. The core insight is that for most applications, finding the top 10 truly nearest neighbors versus the top 10 approximately nearest neighbors makes little practical difference to the end-user experience: a recommendation system showing you 10 highly relevant videos instead of the 10 absolute most relevant videos is imperceptible to users, but the system cost difference is enormous.

At 100 million vectors with 768 dimensions each (common for text embeddings), storing raw float32 vectors requires approximately 300 GB of memory alone, and exhaustive search would require 100 million distance computations per query. ANN methods like Hierarchical Navigable Small World (HNSW) graphs or Inverted File with Product Quantization (IVF+PQ) reduce this to thousands or tens of thousands of comparisons, delivering results in milliseconds instead of seconds.

The key parameters you control are the recall target (what fraction of true nearest neighbors you retrieve) and the latency budget. Production systems typically target 0.90 to 0.95 recall, meaning they find 90 to 95 of the true top 100 neighbors, which provides excellent relevance while keeping latency under 10 to 50 milliseconds at scale.
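As a concrete sketch of the quantization-based approach, the snippet below builds an IVF+PQ index with FAISS and queries it. The parameter choices here (nlist=1024 partitions, 64 sub-quantizers, nprobe=16) and the random vectors are illustrative assumptions, not values prescribed by the text.

```python
import numpy as np
import faiss  # Meta's ANN library (pip install faiss-cpu)

d = 768             # embedding dimensionality, matching the example above
nlist = 1024        # number of IVF partitions (assumed, tune per corpus)
m, nbits = 64, 8    # PQ: 64 sub-quantizers x 8 bits = 64 bytes per vector,
                    # versus 768 * 4 = 3,072 bytes for a raw float32 vector

# The coarse quantizer assigns vectors to partitions; PQ compresses them.
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

# Toy vectors standing in for real embeddings; train on a representative sample.
xb = np.random.rand(100_000, d).astype("float32")
index.train(xb)
index.add(xb)

# Query time: nprobe sets how many of the nlist partitions get scanned.
# Larger nprobe -> higher recall, higher latency.
index.nprobe = 16
xq = np.random.rand(5, d).astype("float32")
distances, ids = index.search(xq, 10)   # top-10 approximate neighbors per query
print(ids.shape)                        # (5, 10)
```

With these assumed settings, each vector compresses to 64 bytes, so 100 million vectors occupy roughly 6.4 GB of codes (plus index overhead) versus the ~300 GB cited above for raw float32 storage, which is what makes single-machine serving feasible.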
💡 Key Takeaways
ANN narrows search from millions of vectors to thousands of candidates before ranking, reducing computation by 100 to 1,000 times compared to exhaustive search
Production systems target 0.90 to 0.95 recall at top-k results, which provides excellent user experience while enabling sub-10-millisecond latencies at scale
Memory scaling is critical: 100 million vectors at 768 dimensions require approximately 300 GB for raw float32 storage before any indexing overhead
Two dominant paradigms exist: graph-based indices like HNSW that route queries through multi-layer networks, and quantization-based methods like IVF+PQ that partition the space and compress vectors
The recall versus latency trade-off is tunable at query time through parameters like nprobe (how many partitions to search) or efSearch (how many graph nodes to explore); a sketch of this query-time tuning follows this list
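The sketch below illustrates the graph-based side of that trade-off: an HNSW index in FAISS whose efSearch knob is swept at query time and compared against an exact brute-force baseline. The graph parameters (32 neighbors per node, efConstruction=200) and the toy data sizes are assumptions for illustration.

```python
import numpy as np
import faiss

d = 768
xb = np.random.rand(50_000, d).astype("float32")   # toy corpus
xq = np.random.rand(100, d).astype("float32")      # toy queries

# Exact brute-force baseline, used only to measure recall of the ANN index.
exact = faiss.IndexFlatL2(d)
exact.add(xb)
_, true_ids = exact.search(xq, 10)

# HNSW graph index: 32 = neighbors kept per node in the graph (assumed value).
index = faiss.IndexHNSWFlat(d, 32)
index.hnsw.efConstruction = 200   # build-time search effort (assumed value)
index.add(xb)                     # HNSW needs no separate training step

# efSearch controls how many graph nodes are explored per query:
# raising it trades latency for recall, at query time, with no rebuild.
for ef in (16, 64, 256):
    index.hnsw.efSearch = ef
    _, ids = index.search(xq, 10)
    recall = np.mean([len(set(ids[i]) & set(true_ids[i])) / 10.0
                      for i in range(len(xq))])
    print(f"efSearch={ef}: recall@10 ≈ {recall:.3f}")
```

The same pattern applies to IVF indices, where nprobe plays the role of efSearch: the index is built once, and the operator picks the recall/latency operating point per query or per traffic tier.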
📌 Examples
Meta's FAISS powers billion-scale image and text retrieval across Instagram, Facebook Search, and content moderation systems, processing millions of queries per second
Spotify's Annoy enables real-time music recommendations by memory-mapping read-only indices that deliver sub-5-millisecond lookups for playlist generation (see the sketch after these examples)
Pinterest uses ANN to find visually similar pins from a corpus of billions of images, returning results in under 20 milliseconds to power its visual search feature
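For the Annoy example, a minimal sketch of the build-once, memory-map-for-serving pattern is shown below. The dimensionality, metric, tree count, item count, and file name are assumptions for illustration, not details from the text.

```python
import numpy as np
from annoy import AnnoyIndex  # Spotify's ANN library (pip install annoy)

d = 128  # assumed track-embedding dimensionality

# Offline build step: add vectors, build the trees, write the index to disk.
builder = AnnoyIndex(d, "angular")
for i in range(10_000):
    builder.add_item(i, np.random.rand(d).tolist())  # toy vectors
builder.build(50)            # 50 trees (assumed); more trees -> higher recall
builder.save("tracks.ann")

# Serving processes load the file, which Annoy memory-maps read-only, so many
# worker processes can share a single copy of the index via the page cache.
reader = AnnoyIndex(d, "angular")
reader.load("tracks.ann")
print(reader.get_nns_by_item(0, 10))   # 10 nearest tracks to item 0
```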