What is Approximate Nearest Neighbor (ANN) Search?
THE EXACT SEARCH PROBLEM
Exact nearest neighbor search compares the query against every vector in the dataset. For 10 million 768-dimensional vectors, that is 10M × 768 ≈ 7.68 billion floating-point operations per query. At 100 GFLOPS, each query takes about 77 milliseconds. For real-time applications that need <10 ms latency, exact search is infeasible at this scale.
The math gets worse as the dataset grows: at 1 billion vectors, a single exact query takes roughly 7.7 seconds. That might be acceptable for overnight batch processing, but real-time recommendations need sub-10 ms latency. You need algorithms that examine only a tiny fraction of the vectors while still finding most of the true neighbors.
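The brute-force baseline being replaced can be sketched in a few lines of NumPy (a toy illustration; the function name `exact_search` is ours, not a library API):

```python
import numpy as np

def exact_search(queries, vectors, k=10):
    """Brute-force k-NN: compute every query-to-vector distance, keep the k smallest."""
    # Squared Euclidean distances, shape (num_queries, num_vectors).
    dists = ((queries[:, None, :] - vectors[None, :, :]) ** 2).sum(-1)
    # Indices of the k nearest vectors per query, sorted by increasing distance.
    return np.argsort(dists, axis=1)[:, :k]

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 32)).astype(np.float32)
queries = vectors[:3] + 0.01          # queries sitting almost on top of vectors 0..2
print(exact_search(queries, vectors, k=1))  # nearest neighbors: vectors 0, 1, 2
```

Every query touches all N vectors; the index structures below exist precisely to avoid this full scan.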
HOW ANN ALGORITHMS WORK
ANN algorithms use index structures to skip most comparisons. Instead of O(N) comparisons, they achieve O(log N) or O(√N) by organizing vectors into hierarchies, clusters, or graphs. The index guides search toward promising regions, pruning most of the search space.
Clustering-based (IVF): Partition vectors into clusters. At query time, find nearest cluster centroids, search only within those clusters. Checking 20 clusters out of 1000 reduces work by 50x.
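A minimal version of this two-stage search might look like the following. This is a toy sketch (plain Lloyd's k-means plus flat inverted lists); production IVF implementations are far more optimized, and the names `build_ivf`, `ivf_search`, and `n_probe` are illustrative:

```python
import numpy as np

def build_ivf(vectors, n_clusters, n_iter=10, seed=0):
    """Toy IVF index: k-means centroids plus an inverted list of ids per cluster."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(n_iter):  # plain Lloyd's k-means
        assign = np.linalg.norm(vectors[:, None] - centroids[None], axis=-1).argmin(1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, n_probe=2, k=5):
    """Search only the n_probe clusters whose centroids are nearest to the query."""
    probe = np.linalg.norm(centroids - query, axis=1).argsort()[:n_probe]
    cand = np.concatenate([lists[c] for c in probe])      # candidate vector ids
    dists = np.linalg.norm(vectors[cand] - query, axis=1)
    return cand[dists.argsort()[:k]]

rng = np.random.default_rng(1)
data = rng.standard_normal((500, 16)).astype(np.float32)
centroids, lists = build_ivf(data, n_clusters=8)
q = data[42]
print(ivf_search(q, data, centroids, lists, n_probe=2, k=3))
```

The `n_probe` parameter is the main recall/latency knob: probing more clusters scans more candidates and finds more true neighbors, at proportionally higher cost.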
Graph-based (HNSW): Build a navigable graph where nearby vectors are connected. Walk the graph from entry point toward query, following edges to closer neighbors. Typically examines 100-300 nodes to find top-10 in a million-vector index.
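The graph walk can be sketched with a single-layer greedy search over a precomputed k-NN graph; real HNSW adds hierarchical layers, incremental construction, and a beam (`ef`) rather than pure greedy descent, so this is a simplified stand-in with illustrative names:

```python
import numpy as np

def greedy_graph_search(query, vectors, neighbors, entry=0):
    """Greedy walk: repeatedly move to whichever neighbor is closest to the
    query, stopping at a node none of whose neighbors are closer.
    neighbors[i] holds the graph edges of node i."""
    current = entry
    current_dist = np.linalg.norm(vectors[current] - query)
    while True:
        nbrs = neighbors[current]
        dists = np.linalg.norm(vectors[nbrs] - query, axis=1)
        best = dists.argmin()
        if dists[best] >= current_dist:
            return current  # local minimum: no neighbor improves on the current node
        current, current_dist = nbrs[best], dists[best]

# Build a toy k-NN graph by brute force (HNSW builds its graph incrementally).
rng = np.random.default_rng(2)
data = rng.standard_normal((200, 8)).astype(np.float32)
all_dists = np.linalg.norm(data[:, None] - data[None], axis=-1)
knn_graph = np.argsort(all_dists, axis=1)[:, 1:9]  # 8 edges per node, self excluded

q = data[17] + 0.001
print(greedy_graph_search(q, data, knn_graph))
```

Each step only computes distances to the current node's few neighbors, which is why the walk touches hundreds of nodes rather than the whole dataset.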
Quantization-based (PQ): Compress vectors to reduce memory and speed up distance computation. 768-dim float32 (3KB) becomes 64-byte code. Distance computed via lookup tables instead of full dot product.
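The mechanics of PQ fit in a short sketch: split each vector into subvectors, quantize each subvector to a small per-subspace codebook, and answer queries by summing precomputed table entries. This toy uses 4 subspaces of 16 centroids for clarity (real PQ typically uses 256 centroids, i.e. one byte, per subspace); `train_pq`, `encode_pq`, and `pq_distances` are illustrative names:

```python
import numpy as np

def train_pq(vectors, m=4, k=16, n_iter=10, seed=0):
    """Train one small k-means codebook per subspace (toy product quantizer)."""
    rng = np.random.default_rng(seed)
    d = vectors.shape[1] // m
    codebooks = []
    for s in range(m):
        sub = vectors[:, s * d:(s + 1) * d]
        cb = sub[rng.choice(len(sub), k, replace=False)].copy()
        for _ in range(n_iter):
            assign = np.linalg.norm(sub[:, None] - cb[None], axis=-1).argmin(1)
            for c in range(k):
                if (assign == c).any():
                    cb[c] = sub[assign == c].mean(0)
        codebooks.append(cb)
    return codebooks

def encode_pq(vectors, codebooks):
    """Replace each subvector with the index of its nearest codebook centroid."""
    m = len(codebooks)
    d = vectors.shape[1] // m
    codes = np.empty((len(vectors), m), dtype=np.uint8)
    for s, cb in enumerate(codebooks):
        sub = vectors[:, s * d:(s + 1) * d]
        codes[:, s] = np.linalg.norm(sub[:, None] - cb[None], axis=-1).argmin(1)
    return codes

def pq_distances(query, codes, codebooks):
    """Asymmetric distance: precompute query-to-centroid tables once, then the
    distance to any encoded vector is just m table lookups plus a sum."""
    m = len(codebooks)
    d = len(query) // m
    tables = [((query[s * d:(s + 1) * d] - cb) ** 2).sum(1)
              for s, cb in enumerate(codebooks)]
    return sum(tables[s][codes[:, s]] for s in range(m))

rng = np.random.default_rng(3)
data = rng.standard_normal((300, 16)).astype(np.float32)
codebooks = train_pq(data)
codes = encode_pq(data, codebooks)   # shape (300, 4) uint8: 4 bytes per vector
approx = pq_distances(data[0], codes, codebooks)
print(approx.argmin())  # the query's own code minimizes every table → 0
```

The table lookups replace per-dimension arithmetic, and the compressed codes mean far more vectors fit in RAM or cache.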
RECALL VS LATENCY TRADEOFF
Recall@K: What fraction of the true K nearest neighbors does ANN return? Recall@10 of 0.95 means 9.5 of the true top-10 are found on average.
More thorough search (more clusters probed, more graph nodes visited) improves recall but increases latency. Typical production targets: recall@10 ≥ 0.95 with latency <10 ms. You tune index parameters to hit this balance for your data distribution.
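Measuring recall@K is straightforward once you have exact results as ground truth; a minimal sketch (the function name `recall_at_k` is ours):

```python
import numpy as np

def recall_at_k(true_neighbors, ann_neighbors):
    """Fraction of the true top-K ids that the ANN result contains, averaged
    over queries. Both arrays have shape (num_queries, K)."""
    hits = [len(set(t) & set(a)) for t, a in zip(true_neighbors, ann_neighbors)]
    return sum(hits) / true_neighbors.size

true_nn = np.array([[1, 2, 3], [4, 5, 6]])  # exact top-3 per query (ground truth)
ann_nn = np.array([[1, 2, 9], [4, 5, 6]])   # ANN returned 9 instead of 3 once
print(recall_at_k(true_nn, ann_nn))  # 5 hits out of 6 → 0.833...
```

In practice you compute the ground truth with a one-off brute-force pass over a sample of queries, then sweep index parameters until this metric clears your target at acceptable latency.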