Recommendation Systems · Content-Based Filtering & Hybrid Approaches · Medium · ⏱️ ~3 min

Production Architecture: Two-Stage Retrieval and Re-Ranking Pipeline

Core Concept
Production recommendation systems use a two-stage pipeline: retrieval (fast, broad) followed by ranking (slow, precise). Content and collaborative signals enter at different stages based on their computational cost.

Retrieval Stage

Goal: reduce 10 million items to 1000 candidates in under 20ms. Methods: ANN search on collaborative embeddings, category filters, content similarity search. Each retrieval source produces candidates. Merge and deduplicate.
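The merge-and-deduplicate step can be sketched as follows. This is a minimal illustration, not a production implementation: the function name and the max-score merge policy are assumptions (scores from different sources are not directly comparable, so real systems often normalize per source or defer scoring entirely to the ranker).

```python
from collections import defaultdict

def merge_candidates(sources: dict[str, list[tuple[str, float]]],
                     limit: int = 1000) -> list[str]:
    """Merge (item_id, score) lists from multiple retrieval sources,
    deduplicating and keeping each item's best score."""
    best: dict[str, float] = defaultdict(float)
    for _source, items in sources.items():
        for item_id, score in items:
            best[item_id] = max(best[item_id], score)
    # Sort by best score and truncate to the ranking stage's budget.
    ranked = sorted(best, key=best.get, reverse=True)
    return ranked[:limit]

candidates = merge_candidates({
    "collaborative_ann": [("i1", 0.9), ("i2", 0.7)],
    "content_ann":       [("i2", 0.8), ("i3", 0.6)],
    "popularity":        [("i4", 0.5)],
})
# "i2" appears in two sources but survives only once, with its best score
```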

Content-based retrieval: precompute item embeddings from text, images, categories. At request time, embed user preferences and ANN search for similar items. Fast because item embeddings are precomputed.
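A minimal sketch of that request path, with brute-force cosine search standing in for a real ANN index (FAISS, ScaNN, etc.); the embeddings here are random stand-ins for precomputed content vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
# Offline: precomputed, L2-normalized item embeddings (text/image/category).
item_emb = rng.normal(size=(10_000, 64)).astype(np.float32)
item_emb /= np.linalg.norm(item_emb, axis=1, keepdims=True)

def content_retrieve(user_pref: np.ndarray, k: int = 100) -> np.ndarray:
    """Embed-and-search: brute-force cosine similarity as an ANN stand-in."""
    q = user_pref / np.linalg.norm(user_pref)
    scores = item_emb @ q                    # cosine similarity to all items
    # argpartition gives the top-k in O(n); sort only those k afterwards.
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]
```

The expensive part (embedding every item) happens offline; the per-request work is one user embedding plus one index lookup, which is what makes this stage fast.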

Ranking Stage

Goal: order 1000 candidates by predicted engagement. Can use complex features: user-item cross features, sequence models, contextual signals. Latency budget: 50-100ms for 1000 items.

Ranking models see both content and collaborative signals as features. Item popularity, user historical engagement, content similarity to user profile, and collaborative embedding similarity all become input features. A gradient-boosted tree or neural ranker learns optimal weighting.
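The idea of mixing content and collaborative signals as plain features can be sketched as below. The feature set and labels are synthetic, and a least-squares linear model stands in for the gradient-boosted tree or neural ranker: it learns a weighting over the same mixed features, just linearly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
# One row per (user, candidate) pair: content and collaborative signals
# enter side by side as ordinary features.
X = np.column_stack([
    rng.random(n),   # content similarity to user profile
    rng.random(n),   # collaborative embedding similarity
    rng.random(n),   # item popularity (normalized)
    rng.random(n),   # user's historical engagement rate
])
# Logged engagement labels, synthesized here for the sketch.
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + 0.05 * rng.normal(size=n)

# Fit the stand-in ranker: learn one weight per feature.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

def rank(candidate_features: np.ndarray) -> np.ndarray:
    """Order candidate rows by predicted engagement, best first."""
    return np.argsort(-(candidate_features @ w))
```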

✅ Best Practice: Use multiple retrieval sources. Collaborative retrieval finds behaviorally similar items. Content retrieval finds semantically similar items. Popularity retrieval ensures some baseline quality. Merge candidates from all sources before ranking.
💡 Key Takeaways
- Stage-one ANN retrieval pulls 500 to 5,000 candidates in 5 to 30ms P95 using quantized indices: 100M items at 256 dims drops from ~102 GB float32 to under 10 to 20 GB per shard with product quantization and minimal recall loss.
- Stage-two re-ranker scores 200 to 1,000 candidates with learned models using rich features (similarity, recency, popularity, diversity) in 50 to 150ms P95, applying post-rank constraints for policy, safety, and deduplication.
- Total end-to-end latency target for interactive surfaces is under 200ms P95 to P99, with systems sharding by item ID or semantic clusters and replicating hot shards to handle 10,000+ QPS per region.
- Freshness is managed through daily or hourly offline index builds plus streaming hot-item updates into small overlay indices merged at query time, with shadow-traffic validation before rollout and automatic rollback on regression.
- User profiles are computed as recency-weighted sums of engaged item vectors with exponential decay (7 to 14 day half-life) and interaction-specific weights, where purchases count more than views.
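The user-profile construction from the last takeaway can be sketched directly. The half-life and per-action weights below are illustrative values (the function name and `ACTION_WEIGHT` table are assumptions, not a standard API):

```python
import numpy as np

HALF_LIFE_DAYS = 7.0          # the 7-14 day range suggested above
# Hypothetical interaction-specific weights: purchases count more than views.
ACTION_WEIGHT = {"view": 1.0, "add_to_cart": 3.0, "purchase": 10.0}

def user_profile(events, now_days: float, dim: int = 64) -> np.ndarray:
    """Recency-weighted sum of engaged item vectors with exponential decay.

    events: iterable of (item_vector, action, timestamp_days) tuples.
    """
    profile = np.zeros(dim, dtype=np.float32)
    for vec, action, t in events:
        age = now_days - t
        decay = 0.5 ** (age / HALF_LIFE_DAYS)  # halves every HALF_LIFE_DAYS
        profile += ACTION_WEIGHT[action] * decay * vec
    norm = np.linalg.norm(profile)
    return profile / norm if norm > 0 else profile
```

Normalizing the result keeps it directly usable as the query vector for the content ANN search in the retrieval stage.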
📌 Interview Tips
1. For latency budgets: break down retrieval. Content ANN (3-8ms) and collaborative ANN (3-8ms) can run in parallel and be merged in 1-2ms, for a total under 15ms for thousands of candidates.
2. When asked about index architecture: explain separate indexes for content and CF embeddings, with the merge happening at the candidate-set level (dedup + combine scores).
3. For cold-start queries: mention a content-only fast path when the user has insufficient history, skipping collaborative retrieval entirely for new users.
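The parallel-retrieval pattern from tip 1 can be sketched with a thread pool; the two lookup functions here are stand-ins (with sleeps simulating index latency) rather than real ANN clients:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def content_ann(user_id: str) -> list[tuple[str, float]]:
    time.sleep(0.005)                       # ~5ms content index lookup
    return [("i1", 0.9), ("i2", 0.8)]

def collaborative_ann(user_id: str) -> list[tuple[str, float]]:
    time.sleep(0.005)                       # ~5ms CF index lookup
    return [("i2", 0.7), ("i3", 0.6)]

def retrieve(user_id: str) -> list[str]:
    """Fan out both ANN lookups concurrently, then merge and dedup."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        results = pool.map(lambda f: f(user_id),
                           [content_ann, collaborative_ann])
        best: dict[str, float] = {}
        for items in results:
            for item_id, score in items:
                best[item_id] = max(best.get(item_id, 0.0), score)
    return sorted(best, key=best.get, reverse=True)
```

Because both lookups run concurrently, the wall-clock cost is roughly the slower of the two plus the merge, not their sum.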