Inference at Scale with ANN Search
Building The Item Index
After training, run the item tower on every item in the catalog. Embedding a catalog of 10 million items takes 10-30 minutes on a single GPU. Store the resulting vectors in an ANN index. Popular choices: HNSW (Hierarchical Navigable Small World) graphs, IVF (Inverted File) with product quantization, or ScaNN from Google.
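A minimal sketch of this offline step, assuming FAISS as the ANN library (hnswlib or ScaNN would look similar); item_tower and item_batches are stand-ins for your trained model and a batched reader over catalog features:

```python
import numpy as np
import faiss  # assumption: FAISS as the ANN library

DIM = 128   # item embedding dimension
M = 32      # HNSW links per node

def build_item_index(item_tower, item_batches):
    """Run the item tower over the full catalog and build an HNSW index."""
    vectors = []
    for batch in item_batches:                      # batched inference over the catalog
        emb = item_tower(batch)                     # -> (batch_size, DIM)
        vectors.append(np.asarray(emb, dtype="float32"))
    vectors = np.concatenate(vectors)

    # Normalize so L2 nearest neighbors match cosine/inner-product ranking.
    faiss.normalize_L2(vectors)

    index = faiss.IndexHNSWFlat(DIM, M)
    index.hnsw.efConstruction = 200                 # build-time quality/speed knob
    index.add(vectors)
    return index, vectors
```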
HNSW builds a layered graph where each vector connects to its approximate neighbors. To query, start from the entry point in the top layer and greedily navigate toward the query vector, descending layer by layer. With proper tuning, HNSW finds 95% of the true top-100 neighbors while scanning only 0.1% of the index. Memory overhead is 1.5-2x the raw vector storage. For 10M items with 128-dimensional float32 vectors (about 5GB), the index needs 7-10GB of RAM.
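The 95% figure is something you measure rather than assume. A hedged sketch, continuing from the index and vectors built above, that checks recall@100 against exact search and exposes efSearch, the query-time knob that trades scanned nodes for recall:

```python
import numpy as np
import faiss

# Exact (brute-force) baseline to measure HNSW recall against.
exact = faiss.IndexFlatL2(DIM)
exact.add(vectors)

queries = vectors[np.random.choice(len(vectors), 1000, replace=False)]
_, true_ids = exact.search(queries, 100)

index.hnsw.efSearch = 128                 # query-time exploration knob (illustrative value)
_, ann_ids = index.search(queries, 100)

recall = np.mean([len(set(a) & set(t)) / 100.0 for a, t in zip(ann_ids, true_ids)])
print(f"recall@100: {recall:.3f}")        # raise efSearch until this reaches ~0.95
```

In production you would run this check with real user embeddings as queries rather than sampled item vectors, which is only a proxy.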
Request Time Flow
When a user requests recommendations: (1) Gather user features from the feature store, including real-time session data. (2) Run the user tower to get the user embedding (1-5ms on GPU, 5-15ms on CPU). (3) Query the ANN index to retrieve the top-K candidates (5-10ms for K=1000). (4) Return the candidates to the ranking stage.
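A sketch of that request path, with feature_store and user_tower as placeholders for your serving stack and index as the HNSW index built earlier:

```python
import numpy as np
import faiss

def recommend_candidates(user_id, feature_store, user_tower, index, k=1000):
    """Request-time retrieval: features -> user embedding -> ANN lookup."""
    # (1) Gather user features, including real-time session events.
    features = feature_store.get_user_features(user_id)

    # (2) Run the user tower to produce the query embedding.
    user_emb = np.asarray(user_tower(features), dtype="float32").reshape(1, -1)
    faiss.normalize_L2(user_emb)

    # (3) Query the ANN index for the top-K candidate items.
    scores, item_ids = index.search(user_emb, k)

    # (4) Hand the candidates to the ranking stage.
    return list(zip(item_ids[0].tolist(), scores[0].tolist()))
```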
The user tower runs on fresh data for every request. If the user just clicked an item, that click immediately influences their embedding, which enables real-time personalization without reindexing. The item index updates in batch: hourly for high-churn catalogs, daily for stable ones. New items can be indexed immediately via incremental addition to the HNSW graph.
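Incremental indexing can be a small job that runs as items are published. A sketch, assuming the FAISS HNSW index from above and a dict new_item_features mapping catalog ids to feature rows; id_map translates FAISS's sequential internal ids back to catalog ids:

```python
import numpy as np
import faiss

def index_new_items(item_tower, new_item_features, index, id_map):
    """Append freshly published items to the existing HNSW index without a rebuild."""
    catalog_ids = list(new_item_features.keys())
    emb = np.asarray(item_tower(list(new_item_features.values())), dtype="float32")
    faiss.normalize_L2(emb)

    start = index.ntotal          # FAISS assigns the next sequential internal ids
    index.add(emb)
    for offset, catalog_id in enumerate(catalog_ids):
        id_map[start + offset] = catalog_id
```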
Scaling To Billions
For catalogs exceeding 100 million items, single-node ANN becomes impractical. Solutions: (1) Shard the index across multiple machines, query all shards in parallel, and merge the results (see the sketch after this list). (2) Apply a coarse filter first (category, availability) to reduce candidates before ANN search. (3) Use product quantization (PQ) to compress vectors: a 128-dimensional float32 vector shrinks from 512 bytes to 16 bytes, fitting 32x more items per machine.
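A sketch of the fan-out-and-merge pattern for sharded search; shards is a list of (index, id_map) pairs standing in for per-machine ANN services that a production system would reach over RPC:

```python
from concurrent.futures import ThreadPoolExecutor

def search_sharded(shards, user_emb, k=1000):
    """Query every shard in parallel and keep the globally best k results."""
    def query_shard(shard):
        index, id_map = shard
        scores, local_ids = index.search(user_emb, k)
        return [(scores[0][i], id_map[local_ids[0][i]]) for i in range(len(local_ids[0]))]

    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(query_shard, shards))

    # Merge: with L2 distances, smaller is better, so sort ascending and truncate.
    merged = sorted(hit for part in partials for hit in part)
    return merged[:k]
```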
PQ introduces recall loss: instead of 95% recall, you might get 85%. The trade-off is worth it at scale: the compressed codes for a billion items fit in 16GB of RAM versus roughly 500GB for uncompressed float32 vectors. Query latency stays under 10ms because the compressed codes keep far more of the index in CPU cache.
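A compressed-index sketch using FAISS IVF-PQ with illustrative parameters; it reuses the vectors and queries arrays from the earlier sketches and trains the codebooks on the embeddings themselves:

```python
import faiss

NLIST = 4096    # coarse IVF clusters (illustrative)
M_PQ = 16       # sub-quantizers: 128 float32 dims (512 bytes) -> 16 one-byte codes
NBITS = 8       # bits per sub-quantizer code

quantizer = faiss.IndexFlatL2(DIM)
index_pq = faiss.IndexIVFPQ(quantizer, DIM, NLIST, M_PQ, NBITS)

index_pq.train(vectors)   # PQ codebooks need a representative training sample
index_pq.add(vectors)

# nprobe trades recall for latency: scanning more clusters recovers some of the
# recall that quantization gives up.
index_pq.nprobe = 32
scores, item_ids = index_pq.search(queries[:1], 100)
```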