Core Concept
Production recommendation systems use a two-stage pipeline: retrieval (fast, broad) followed by ranking (slow, precise). Content and collaborative signals enter at different stages based on their computational cost.
Retrieval Stage
Goal: reduce 10 million items to 1,000 candidates in under 20ms. Methods: ANN search on collaborative embeddings, category filters, and content similarity search. Each retrieval source produces its own candidate list; the lists are then merged and deduplicated.
Content-based retrieval: precompute item embeddings from text, images, and categories. At request time, embed the user's preferences and run an ANN search for similar items. This is fast because the item embeddings are precomputed, so only the user embedding has to be computed per request.
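The request-time flow described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it uses an exact brute-force scan in place of an ANN index, random vectors in place of real content embeddings, and the mean of liked-item vectors as the user embedding. The names `normalize`, `top_k_similar`, `item_vecs`, and `liked` are hypothetical.

```python
import heapq
import math
import random

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k_similar(user_vec, item_vecs, k):
    """Exact brute-force top-k by cosine similarity (stand-in for an ANN index)."""
    scored = ((dot(user_vec, vec), item_id) for item_id, vec in item_vecs.items())
    return [item_id for _, item_id in heapq.nlargest(k, scored)]

# Offline: precompute unit-length item embeddings (random stand-ins here).
random.seed(0)
item_vecs = {
    f"item_{i}": normalize([random.gauss(0, 1) for _ in range(16)])
    for i in range(1000)
}

# Request time: embed the user's preferences as the mean of liked item vectors.
liked = ["item_3", "item_17", "item_256"]
mean_vec = [sum(vals) / len(liked) for vals in zip(*(item_vecs[i] for i in liked))]
user_vec = normalize(mean_vec)

# Candidates come back ordered from most to least similar.
candidates = top_k_similar(user_vec, item_vecs, k=50)
```

The brute-force scan is linear in catalog size, which is why a real system at 10 million items would swap `top_k_similar` for an approximate index; the offline/online split stays the same.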
Ranking Stage
Goal: order 1000 candidates by predicted engagement. Can use complex features: user-item cross features, sequence models, contextual signals. Latency budget: 50-100ms for 1000 items.
Ranking models see both content and collaborative signals as features. Item popularity, user historical engagement, content similarity to the user profile, and collaborative embedding similarity all become input features. A gradient-boosted tree or neural ranker learns how to weight and combine them.
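To make the feature view concrete, here is a small sketch in which each candidate carries the four signals named above and a scorer orders the list. The fixed linear weights are a hypothetical stand-in for a trained model: a gradient-boosted tree or neural ranker would learn the weighting, including nonlinear interactions, from engagement data. The names `CandidateFeatures`, `WEIGHTS`, `score`, and `rank` are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class CandidateFeatures:
    item_id: str
    popularity: float          # e.g. normalized recent engagement count
    user_engagement: float     # user's historical engagement with similar items
    content_similarity: float  # similarity of item content to the user profile
    collab_similarity: float   # collaborative embedding similarity

# Hypothetical hand-set weights; a trained ranker would learn these instead.
WEIGHTS = {
    "popularity": 0.10,
    "user_engagement": 0.30,
    "content_similarity": 0.25,
    "collab_similarity": 0.35,
}

def score(candidate: CandidateFeatures) -> float:
    """Weighted sum of features, standing in for a learned model's prediction."""
    return sum(w * getattr(candidate, name) for name, w in WEIGHTS.items())

def rank(candidates):
    """Order candidates from highest to lowest predicted engagement."""
    return sorted(candidates, key=score, reverse=True)

candidates = [
    CandidateFeatures("a", popularity=0.9, user_engagement=0.1,
                      content_similarity=0.2, collab_similarity=0.1),
    CandidateFeatures("b", popularity=0.2, user_engagement=0.8,
                      content_similarity=0.7, collab_similarity=0.9),
    CandidateFeatures("c", popularity=0.5, user_engagement=0.5,
                      content_similarity=0.5, collab_similarity=0.5),
]
ranked_ids = [c.item_id for c in rank(candidates)]  # ["b", "c", "a"]
```

Item "a" is popular but a poor personal match, so it lands last once the personalized signals are weighted in.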
✅ Best Practice: Use multiple retrieval sources. Collaborative retrieval finds behaviorally similar items. Content retrieval finds semantically similar items. Popularity retrieval ensures some baseline quality. Merge candidates from all sources before ranking.
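One way to merge several retrieval sources is sketched below, assuming each source returns item IDs ordered best first. Round-robin interleaving is an assumption made here so that no single source crowds out the others; the text above only requires merging and deduplicating. The function `merge_candidates` and the sample IDs are hypothetical.

```python
from itertools import zip_longest

def merge_candidates(sources, limit):
    """Interleave ranked lists from each source, dropping duplicates.

    sources: dict mapping source name -> list of item IDs, best first.
    Returns (merged item IDs, dict of item ID -> set of sources that returned it).
    """
    if limit <= 0:
        return [], {}

    # Record which sources proposed each item; useful later as a ranking feature.
    provenance = {}
    for name, items in sources.items():
        for item in items:
            provenance.setdefault(item, set()).add(name)

    merged, seen = [], set()
    # Each tier holds the i-th best item from every source (None when exhausted).
    for tier in zip_longest(*sources.values()):
        for item in tier:
            if item is None or item in seen:
                continue
            seen.add(item)
            merged.append(item)
            if len(merged) == limit:
                return merged, {i: provenance[i] for i in merged}
    return merged, {i: provenance[i] for i in merged}

sources = {
    "collaborative": ["i1", "i2", "i3"],
    "content": ["i2", "i4"],
    "popularity": ["i5", "i1", "i6"],
}
merged, provenance = merge_candidates(sources, limit=5)
# merged == ["i1", "i2", "i5", "i4", "i3"]
```

Keeping the provenance map means the ranking stage can see that an item such as "i1" was proposed by more than one source.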
✓ Stage-one ANN retrieval pulls 500 to 5,000 candidates in 5 to 30ms P95 using quantized indices: 100M items at 256 dims take about 102 GB in float32 (100M × 256 × 4 bytes), which product quantization reduces to roughly 10 to 20 GB per shard with minimal recall loss.
✓ The stage-two re-ranker scores 200 to 1,000 candidates with learned models using rich features (similarity, recency, popularity, diversity) in 50 to 150ms P95, then applies post-rank constraints for policy, safety, and deduplication.
✓ The total end-to-end latency target for interactive surfaces is under 200ms at P95 to P99, with systems sharding by item ID or semantic cluster and replicating hot shards to handle 10,000+ QPS per region.
✓ Freshness is managed through daily or hourly offline index builds plus streaming hot-item updates into small overlay indices that are merged at query time, with shadow-traffic validation before rollout and automatic rollback on regression.
✓ User profiles are computed as recency-weighted sums of engaged item vectors with exponential decay (7 to 14 day half-life) and interaction-specific weights, where purchases count more than views.
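The last takeaway, recency-weighted user profiles, can be sketched as follows. The 10-day half-life is one assumed value inside the 7 to 14 day range, the interaction weights are hypothetical, and the final unit-length normalization is an added convenience for cosine search rather than something stated above.

```python
import math

HALF_LIFE_DAYS = 10.0  # assumed value within the 7 to 14 day range
# Hypothetical interaction weights; purchases count more than views.
INTERACTION_WEIGHTS = {"view": 1.0, "add_to_cart": 3.0, "purchase": 5.0}

def decay(age_days, half_life=HALF_LIFE_DAYS):
    """Exponential decay: the weight halves every `half_life` days."""
    return 0.5 ** (age_days / half_life)

def user_profile(events, item_vecs, half_life=HALF_LIFE_DAYS):
    """Build a profile from events given as (item_id, interaction_type, age_days)."""
    dim = len(next(iter(item_vecs.values())))
    profile = [0.0] * dim
    for item_id, kind, age_days in events:
        weight = INTERACTION_WEIGHTS[kind] * decay(age_days, half_life)
        for j, x in enumerate(item_vecs[item_id]):
            profile[j] += weight * x
    # Optional: normalize to unit length so the profile can be used for cosine search.
    norm = math.sqrt(sum(x * x for x in profile))
    return [x / norm for x in profile] if norm > 0 else profile

item_vecs = {"shoes": [1.0, 0.0], "books": [0.0, 1.0]}
events = [
    ("shoes", "view", 0.0),       # weight 1.0 * 1.0 = 1.0
    ("books", "purchase", 10.0),  # weight 5.0 * 0.5 = 2.5
]
profile = user_profile(events, item_vecs)
```

Even though the purchase is ten days old and has lost half its weight, it still outweighs today's view by 2.5 to 1, so the profile leans toward "books".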