Semantic Search (Dense Embeddings, ANN)

Why Approximate Nearest Neighbor (ANN) and Core Algorithm Families

Exact nearest neighbor search compares a query vector against every vector in the database, so cost scales linearly with corpus size. For 100 million vectors of 384 dimensions, that means 100 million distance calculations per query. Even with optimized SIMD (Single Instruction, Multiple Data) instructions on modern CPUs, this takes hundreds of milliseconds to seconds, far too slow for interactive search with end-to-end budgets of 50 to 150 milliseconds. Approximate Nearest Neighbor (ANN) algorithms solve this by finding close neighbors quickly with high recall but not perfect accuracy. Instead of checking all 100 million vectors, an ANN index might check only 10,000 to 100,000 candidates and still achieve 90 to 95 percent recall at 10, meaning it finds, on average, 9 to 9.5 of the true top 10 neighbors. Latency drops from seconds to 2 to 20 milliseconds at the 95th percentile, making real-time search practical.

There are three main ANN algorithm families, each with different trade-offs.

Graph-based methods like Hierarchical Navigable Small World (HNSW) build a proximity graph in which each vector connects to its nearest neighbors; search navigates this graph like following highway exits down to local roads. HNSW delivers extremely fast CPU queries, typically 1 to 5 milliseconds median latency at high recall, but uses 1.2 to 2.0 times the raw vector memory for graph links. This makes it ideal for 1 to 20 million vectors per node without compression.

Quantization-based methods like Inverted File (IVF) with Product Quantization compress vectors dramatically. IVF first clusters vectors into partitions using k-means (commonly 4,096 to 65,536 centroids), then searches only the nearest few partitions. Product Quantization then compresses each vector from, for example, 768 floats times 4 bytes equals 3,072 bytes down to 16 to 64 bytes, a 50 to 100 times compression. A 200 million vector corpus that would need 614 GB raw can fit in 10 to 20 GB with quantization. The cost is 1 to 5 percent recall loss and a training phase to learn centroids. Google's ScaNN uses anisotropic quantization to improve this trade-off further.

Tree-based methods like Annoy partition the space recursively using random projection trees. They are simpler to implement and work well for tens of millions of vectors, but generally offer worse recall-versus-latency trade-offs than HNSW or quantization at larger scales. Spotify uses Annoy for audio and track similarity at tens-of-millions scale. For systems with fewer than 1 million vectors, brute force with a batched matrix multiply on GPU can actually be simpler and faster than any ANN structure.
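The contrast between exact search, HNSW, and IVF with Product Quantization can be made concrete with a small experiment. The sketch below uses Meta's FAISS library (faiss-cpu) on synthetic vectors; the corpus size, dimension, and parameter values (32 graph links, 1,024 partitions, 48-byte codes, nprobe of 16) are illustrative assumptions rather than production settings, and recall at 10 is measured against a brute-force flat index as ground truth.

```python
import numpy as np
import faiss  # Meta's FAISS library (pip install faiss-cpu)

d = 384           # embedding dimension (illustrative)
n = 100_000       # corpus size for this sketch, far below production scale
nq = 100          # number of queries

rng = np.random.default_rng(0)
xb = rng.standard_normal((n, d)).astype("float32")
xq = rng.standard_normal((nq, d)).astype("float32")

# Exact (brute-force) search: O(n) distance computations per query, used as ground truth.
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, gt = flat.search(xq, 10)  # true top-10 neighbors

# HNSW: graph index, no training step, fast CPU queries, extra memory for graph links.
hnsw = faiss.IndexHNSWFlat(d, 32)   # 32 links per node
hnsw.hnsw.efSearch = 64             # search-time breadth: higher -> better recall, more latency
hnsw.add(xb)
_, I_hnsw = hnsw.search(xq, 10)

# IVF-PQ: cluster into nlist partitions, compress each vector into m one-byte codes.
nlist, m = 1024, 48                 # 48-byte codes; d must be divisible by m
quantizer = faiss.IndexFlatL2(d)
ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
ivfpq.train(xb)                     # k-means training phase to learn centroids and codebooks
ivfpq.add(xb)
ivfpq.nprobe = 16                   # partitions scanned per query: recall vs latency knob
_, I_ivfpq = ivfpq.search(xq, 10)

def recall_at_10(I, gt):
    # Fraction of the true top-10 neighbors that the ANN index recovered.
    return float(np.mean([len(set(I[i]) & set(gt[i])) / 10 for i in range(len(gt))]))

print("HNSW   recall@10:", recall_at_10(I_hnsw, gt))
print("IVF-PQ recall@10:", recall_at_10(I_ivfpq, gt))
```

Raising efSearch or nprobe trades latency for recall, which is exactly the knob described above: the index decides how many candidates to examine instead of scanning the whole corpus.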
💡 Key Takeaways
Exact nearest neighbor scales linearly and takes hundreds of milliseconds for 100 million vectors; ANN reduces this to 2 to 20 milliseconds with 90 to 95 percent recall at 10
HNSW graph method delivers 1 to 5 millisecond median latency on CPU but uses 1.2 to 2.0 times the raw vector memory for graph links, best for 1 to 20 million vectors per node
IVF with Product Quantization compresses vectors 50 to 100 times (a 768-dimensional float32 vector from 3,072 bytes down to 32 bytes), enabling 200 million vectors in 10 to 20 GB of RAM with 1 to 5 percent recall loss; see the arithmetic sketch after this list
Algorithm choice depends on scale: brute force GPU for under 1 million, HNSW for 1 to 20 million with updates, IVF PQ for 20 million to 1 billion batch workloads
Google ScaNN and Meta FAISS are production systems using quantization for billion scale search; Spotify uses the Annoy tree method for tens of millions of audio tracks
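As a quick check on the compression figures quoted above, the back-of-the-envelope arithmetic looks like this. The 32-byte code size is one common choice; the gap between the code size and the 10 to 20 GB quoted for a real index comes from IVF lists, vector IDs, and centroid tables.

```python
# Illustrative memory arithmetic for 200 million 768-dimensional vectors.
n_vectors = 200_000_000
dim = 768

raw_bytes = n_vectors * dim * 4   # float32: 4 bytes per dimension
pq_bytes = n_vectors * 32         # 32-byte Product Quantization code per vector

print(f"raw float32 storage: {raw_bytes / 1e9:.0f} GB")  # ~614 GB
print(f"PQ codes only:       {pq_bytes / 1e9:.1f} GB")   # ~6.4 GB; full indexes land around 10-20 GB
```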
📌 Examples
Meta FAISS powers billion scale similarity search in recommendation systems, using IVF with 65,536 centroids and 32 byte product quantization codes per vector (a configuration sketch follows this list)
Elasticsearch implements HNSW for dense vector search, enabling hybrid retrieval with keyword filters on tens of millions of documents per shard with 5 to 10 millisecond p95 latency
Microsoft Bing uses graph based indexes (SPTAG and DiskANN) for web search dense retrieval at 10 million to 1 billion scale, reporting better recall versus latency trade-offs than standard IVF PQ
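An IVF PQ configuration like the one in the FAISS example above can be expressed compactly with FAISS's index factory string. This is a sketch of how such an index would be declared, not Meta's actual production code; the embedding dimension of 768 is an assumption, and the training and add calls are shown as placeholders.

```python
import faiss

d = 768  # assumed embedding dimension; must be divisible by the PQ code size

# "IVF65536" = 65,536 coarse k-means centroids; "PQ32" = 32 subquantizers of
# 8 bits each, i.e. a 32-byte compressed code per vector.
index = faiss.index_factory(d, "IVF65536,PQ32")

# Training learns the coarse centroids and PQ codebooks and typically needs a large,
# representative sample (on the order of tens to hundreds of vectors per centroid).
# index.train(training_vectors)
# index.add(corpus_vectors)
# index.nprobe = 64   # partitions scanned per query: the recall vs latency knob
```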