
Inference at Scale with ANN Search

Core Concept
At inference time, compute the user embedding once (1-5ms), then use Approximate Nearest Neighbor (ANN) search to find the top 1000 items from millions (5-10ms). Total retrieval: under 20ms. The key: item embeddings are pre-computed and indexed offline.

Building The Item Index

After training, run the item tower on every item in the catalog. A catalog of 10 million items takes 10-30 minutes on a single GPU. Store these vectors in an ANN index. Popular choices: HNSW (Hierarchical Navigable Small World) graphs, IVF (Inverted File) with product quantization, or ScaNN from Google.
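As a rough sketch of this offline step (assuming FAISS and a NumPy file of item-tower outputs; the file names here are illustrative):

```python
import faiss
import numpy as np

# Item-tower outputs for the whole catalog: shape (num_items, 128), float32.
# L2-normalize so inner product == cosine similarity (a common two-tower setup).
item_embeddings = np.load("item_embeddings.npy").astype("float32")
faiss.normalize_L2(item_embeddings)

d = item_embeddings.shape[1]                                     # embedding dimension, e.g. 128
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)   # M=32 links per node
index.hnsw.efConstruction = 200                                  # build-time beam width (recall vs. build time)

index.add(item_embeddings)                                       # insert every catalog vector
faiss.write_index(index, "item_index.faiss")                     # ship this artifact to the serving fleet
```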

HNSW builds a layered graph in which each vector links to its approximate nearest neighbors. To query, start at the entry point in the top layer and greedily navigate toward the query vector, descending a layer whenever the search stops improving. With proper tuning, HNSW finds 95% of the true top-100 neighbors while scanning only about 0.1% of the index. Memory overhead is 1.5-2x the raw vector storage: for 10M items with 128-dimensional float32 vectors (~5GB), the index needs 7-10GB of RAM.
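On the query side, the main knob is efSearch. A hedged sketch of using it, continuing the FAISS index above (the `user_embedding` variable is assumed to come from the user tower):

```python
# Higher efSearch explores more of the graph: higher recall, higher latency.
# Tune it by comparing against exact top-k results on a sample of queries.
index.hnsw.efSearch = 128

query = user_embedding.reshape(1, -1).astype("float32")   # user_embedding: (128,) from the user tower
faiss.normalize_L2(query)
scores, item_rows = index.search(query, 1000)              # approximate top-1000 item rows
```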

Request Time Flow

When a user requests recommendations: (1) Gather user features from the feature store including real-time session data. (2) Run the user tower to get their embedding (1-5ms on GPU, 5-15ms on CPU). (3) Query the ANN index to retrieve top-K candidates (5-10ms for K=1000). (4) Return candidates to the ranking stage.
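A minimal sketch of that request path, with the feature store client, user tower, and ANN index passed in as dependencies (all helper names are illustrative, not a specific library's API):

```python
import numpy as np

def retrieve_candidates(user_id, session, feature_store, user_tower, ann_index, k=1000):
    """Retrieval stage for one request; the injected helper objects are placeholders."""
    # (1) Gather user features, including real-time session signals.
    features = feature_store.get_user_features(user_id, session)

    # (2) Run the user tower on fresh features (1-5ms on GPU, 5-15ms on CPU).
    user_vec = np.asarray(user_tower.encode(features), dtype="float32").reshape(1, -1)

    # (3) ANN lookup against the pre-built item index (5-10ms for k=1000).
    scores, item_rows = ann_index.search(user_vec, k)

    # (4) Hand (item, score) candidates to the ranking stage.
    return list(zip(item_rows[0].tolist(), scores[0].tolist()))
```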

The user tower runs on fresh data every request. If the user just clicked an item, that click immediately influences their embedding. This enables real-time personalization without reindexing. The item index updates in batch: hourly for high-churn catalogs, daily for stable catalogs. New items get indexed immediately via incremental addition to HNSW.
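New items can be appended to an HNSW index without a full rebuild. A short sketch, where `encode_items` and `id_map` are illustrative placeholders:

```python
# Encode freshly published items with the item tower, then append to the live index.
new_vecs = encode_items(new_items)                 # shape (n_new, 128), float32
faiss.normalize_L2(new_vecs)
index.add(new_vecs)                                # HNSW supports incremental insertion
id_map.extend(item.id for item in new_items)       # FAISS row -> catalog item ID
```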

Scaling To Billions

For catalogs exceeding 100 million items, a single-node ANN index becomes impractical. Solutions: (1) Shard the index across multiple machines, query all shards in parallel, and merge results. (2) Apply a coarse filter first (category, availability) to reduce candidates before ANN search. (3) Use product quantization (PQ) to compress vectors: 128 float32 dimensions (512 bytes) shrink to 16 bytes, fitting 32x more items per machine.
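For point (3), a hedged FAISS sketch of an IVF index with 16-byte PQ codes (the parameter values are illustrative starting points, not tuned numbers):

```python
import faiss

d = 128
# 4096 coarse clusters; PQ16 stores each vector as 16 one-byte codes (32x smaller than float32).
index = faiss.index_factory(d, "IVF4096,PQ16", faiss.METRIC_INNER_PRODUCT)

index.train(item_embeddings[:1_000_000])   # learn centroids and PQ codebooks on a sample
index.add(item_embeddings)                 # compressed storage: ~16 bytes per item
index.nprobe = 32                          # clusters scanned per query (recall vs. latency knob)
```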

PQ introduces recall loss: instead of 95% recall, you might get 85%. The trade-off is worth it at scale. A billion items with PQ fits in 16GB RAM versus 500GB without compression. Query latency stays under 10ms because compressed vectors mean better cache utilization.

✅ Design Principle: Pre-filter before ANN search whenever possible. Geographic availability, category constraints, and stock status can reduce candidates by 10-100x before vector search. This simultaneously improves latency and relevance. A common interview follow-up: "How do you handle filters that change often?" Answer: maintain multiple filtered indexes for common filter combinations, or use metadata filtering during ANN search.
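One way the "multiple filtered indexes" answer can look in practice, as a sketch (the per-country split and variable names are assumptions):

```python
# Offline: build one index per common filter value (e.g. country of availability).
indexes_by_country = {}
for country, vecs in embeddings_by_country.items():       # dict of country -> (n, 128) float32 array
    idx = faiss.IndexHNSWFlat(vecs.shape[1], 32, faiss.METRIC_INNER_PRODUCT)
    idx.add(vecs)
    indexes_by_country[country] = idx

# Online: route each query to the index matching the request's filter.
def retrieve_filtered(user_vec, country, k=1000):
    return indexes_by_country[country].search(user_vec, k)
```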
💡 Key Takeaways
- Serving: compute the user embedding (1-10ms), then run ANN search over pre-indexed items (5-15ms). Total retrieval stays under 50ms even for 100M+ item catalogs.
- Exact search on 100M items at 128 dims is ~12.8B operations per query, i.e. seconds. ANN achieves milliseconds by narrowing candidates via clever data structures.
- HNSW: a graph where nodes connect to similar nodes, navigated in roughly log time. IVF: cluster items, search only nearby clusters. Both trade a small accuracy loss for a huge speedup.
- ANN recall: the fraction of the true top-k present in the ANN results. 90% recall means finding 90 of the best 100. Raising recall from 85% to 95% might double latency from 5ms to 10ms.
- Memory: 200M items × 128 dims × 4 bytes = ~102GB raw; index overhead adds 20-100%. 8-bit quantization cuts storage 4x with ~1-2% quality loss.
- Sharding: the query fans out to all shards, each returns its top-k, and results merge. Latency = slowest shard + merge time. A typical shard holds 10-50M items.
📌 Interview Tips
1. When asked what goes into the user tower: mention recent interactions (last 10-50 items), session context (device, time, location), and precomputed user segments or embeddings.
2. For item tower features: explain the mix of content (text, images, categories) and behavioral (popularity, engagement rates) signals that enables cold-start handling.
3. When discussing updates: explain that item embeddings are batch-updated (hourly/daily) while the user embedding is computed online per request for freshness.