
Production Serving: Precomputing Items and ANN Search

The production serving pattern for two-tower models separates offline batch computation from online request serving. Item embeddings are computed in a daily or hourly batch job over the full catalog, then loaded into a vector index optimized for Maximum Inner Product Search (MIPS) or cosine similarity. At request time, you compute only the user embedding via a single forward pass through the user tower, then query the prebuilt index for the top K nearest neighbors. This asymmetry is what enables sub-50ms retrieval at billion scale.

Approximate Nearest Neighbor (ANN) algorithms are essential because exact brute-force search scales linearly with catalog size. Meta's FAISS and Google's ScaNN are the most common production choices. FAISS on GPU can serve tens of thousands of queries per second per shard with 5 to 15ms p95 latency on 10 million vectors, achieving 85 to 95% recall relative to exact search. ScaNN on CPU reports 2 to 10× speedups over prior methods at 95% recall, with single-digit millisecond latency on 1 to 10 million vectors per shard. Tuning the recall versus latency tradeoff is critical: increasing recall from 85% to 95% can double or triple latency and memory bandwidth.

Memory planning is a key constraint. Raw storage for 200 million items at 128 dimensions in float32 is 200M × 128 × 4 bytes = 102.4 GB. Quantizing to 8-bit integers cuts this to 25.6 GB with minimal accuracy loss. Index overhead from graph structures like Hierarchical Navigable Small World (HNSW) adds 20 to 100% depending on parameters. Spotify manages tens of GB per shard for 100+ million tracks; sharding by locale or category reduces the search space and keeps per-shard memory under 50 GB.

Prefiltering for eligibility, such as geographic availability, age restrictions, or inventory status, before ANN search is crucial: it maintains high recall and avoids wasting compute on ineligible items.
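The offline/online split is concrete enough to sketch end to end. Below is a minimal sketch using FAISS's Python API; the catalog size, dimension, cluster count, and the random arrays standing in for the item-tower and user-tower outputs are illustrative assumptions, not production values.

```python
import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu)

d = 128                 # embedding dimension (assumed)
num_items = 1_000_000   # catalog size for this sketch

# --- Offline batch job: embed the catalog, build an ANN index for MIPS ---
item_embeddings = np.random.rand(num_items, d).astype("float32")  # stand-in for item tower output
faiss.normalize_L2(item_embeddings)  # with unit vectors, inner product equals cosine similarity

nlist = 1024                          # coarse clusters partitioning the catalog
quantizer = faiss.IndexFlatIP(d)      # exact index over the cluster centroids
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(item_embeddings)          # learn centroids over the catalog
index.add(item_embeddings)            # populate; persist with faiss.write_index for serving hosts

# --- Online request path: one user-tower forward pass, one index lookup ---
index.nprobe = 32   # clusters scanned per query: higher means better recall, higher latency
user_embedding = np.random.rand(1, d).astype("float32")  # stand-in for user_tower(request_features)
faiss.normalize_L2(user_embedding)
scores, item_ids = index.search(user_embedding, 100)  # top 100 candidates for downstream ranking
```

Persisting the trained index and reloading it on serving hosts is what keeps the online path to a single forward pass plus one index lookup.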
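The recall versus latency tradeoff is measurable rather than theoretical: compare the ANN index against exact brute-force search over the same vectors. A sketch reusing `d`, `item_embeddings`, `user_embedding`, and `index` from the block above; the nprobe values are arbitrary sweep points.

```python
import time

# Ground truth: exact brute-force top 100 over the same vectors.
exact = faiss.IndexFlatIP(d)
exact.add(item_embeddings)
_, true_ids = exact.search(user_embedding, 100)

# Sweep nprobe and measure recall@100 against the exact results.
for nprobe in (8, 32, 128):
    index.nprobe = nprobe
    start = time.perf_counter()
    _, ann_ids = index.search(user_embedding, 100)
    latency_ms = (time.perf_counter() - start) * 1e3
    recall = len(set(ann_ids[0]) & set(true_ids[0])) / 100
    print(f"nprobe={nprobe:>3}: recall@100={recall:.2f} latency={latency_ms:.2f}ms")
```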
💡 Key Takeaways
Item embeddings are precomputed offline in batch jobs and loaded into Approximate Nearest Neighbor (ANN) indexes like FAISS or ScaNN; online serving computes only the user embedding and queries the index
FAISS on GPU serves tens of thousands of queries per second per shard at 5 to 15ms p95 latency on 10 million vectors with 85 to 95% recall; ScaNN on CPU achieves single digit millisecond latency with 2 to 10× speedup over baselines
Memory for 200 million items at 128 dimensions in float32 is 102.4 GB raw; 8-bit quantization reduces this to 25.6 GB; HNSW index overhead adds 20 to 100%, bringing typical per-shard size to 40 to 50 GB (see the arithmetic sketched after this list)
Tuning ANN recall from 85% to 95% can double or triple query latency and memory bandwidth; production systems balance recall targets against p99 latency budgets of 10 to 50ms
Prefiltering for eligibility like geographic restrictions, inventory status, or age gates before ANN search is critical to maintain high recall and avoid scoring ineligible items
Sharding by locale, category, or user segment reduces the search space per query; Spotify shards by market to keep per-shard size under 50 GB and latency under 5ms
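The capacity math in the takeaways above is simple enough to script as a planning check. A minimal sketch; the 20 to 100% HNSW overhead range is the figure quoted in this section, not a measured value.

```python
num_items = 200_000_000   # catalog size from the takeaway above
dim = 128                 # embedding dimension

float32_gb = num_items * dim * 4 / 1e9   # 4 bytes per float32 component
int8_gb = num_items * dim * 1 / 1e9      # 1 byte per component after 8-bit quantization

# Graph indexes such as HNSW add roughly 20 to 100% on top of the raw vectors.
low_gb, high_gb = int8_gb * 1.2, int8_gb * 2.0

print(f"raw float32:        {float32_gb:.1f} GB")               # 102.4 GB
print(f"8-bit quantized:    {int8_gb:.1f} GB")                  # 25.6 GB
print(f"with HNSW overhead: {low_gb:.1f} to {high_gb:.1f} GB")  # 30.7 to 51.2 GB
```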
📌 Examples
Meta uses FAISS on GPU for billions of items; each shard handles 10 to 100 million vectors with p95 latency of 5 to 15ms at 90% recall; replication across multiple GPU hosts serves 50K+ QPS with automatic failover
Google ScaNN on CPU retrieves the top 100 candidates from 10 million document embeddings in 2 to 10ms per query at 95% recall; prefiltering by language and country reduces the search space by 80% and improves tail latency (one way to implement this routing is sketched below)
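One common way to implement eligibility prefiltering is to partition the index along the filter dimension and route each query to its eligible shard, so ineligible items are never scored. A hedged sketch assuming one FAISS index per market; the market keys and catalog sizes are illustrative, not taken from any real deployment.

```python
import numpy as np
import faiss

d = 128  # embedding dimension (assumed)

# Offline: one index per market, so eligibility is enforced by routing rather
# than by post-filtering ANN results (which would shrink effective recall).
shards = {}
for market, catalog_size in {"US": 500_000, "DE": 200_000}.items():  # illustrative shards
    vecs = np.random.rand(catalog_size, d).astype("float32")  # stand-in for that market's item embeddings
    faiss.normalize_L2(vecs)
    idx = faiss.IndexFlatIP(d)  # flat for brevity; production shards would use IVF or HNSW
    idx.add(vecs)
    shards[market] = idx

# Online: route the query to the user's market, then search as usual.
def retrieve(user_embedding: np.ndarray, market: str, k: int = 100):
    scores, ids = shards[market].search(user_embedding, k)
    return scores[0], ids[0]

user = np.random.rand(1, d).astype("float32")  # stand-in for the user tower output
faiss.normalize_L2(user)
scores, ids = retrieve(user, "US")
print(ids[:10])  # top 10 item ids, all guaranteed eligible for the US market
```

Routing before search keeps recall intact; filtering the top K after search can return fewer than K eligible candidates and silently lowers effective recall.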