Two Stage Retrieval: ANN Candidate Generation + Re-Ranking
WHY TWO STAGES
Real-world ranking needs signals that vectors cannot capture: user history, item freshness, business rules, real-time context. A pure vector search finds semantically similar items but misses these signals. Two-stage retrieval separates fast candidate generation (ANN) from expensive full ranking (ML model with rich features).
The first stage uses ANN to retrieve 200-1000 candidates at high recall (95-99%) within 5-20ms. The second stage applies a full ranking model with 50-200 features, including non-vector signals like click history, inventory status, and user segments. The final top-K (10-50) items are returned to the user.
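A minimal sketch of the two-stage flow, assuming NumPy is available. Exact dot-product search stands in for the ANN index, and a simple callable stands in for the full ranking model; in production these would be an ANN library (e.g. HNSW or IVF) and a trained ML model.

```python
import numpy as np

def ann_candidates(query_vec, item_vecs, n_candidates=500):
    # Stage 1 stand-in: exact dot-product search. A real system would
    # use an ANN index here, trading a few points of recall for speed.
    scores = item_vecs @ query_vec
    return np.argsort(-scores)[:n_candidates]

def rerank(candidate_ids, rank_score, top_k=10):
    # Stage 2: score every candidate with the full ranking model
    # (a stand-in callable here), sort, and keep the top-K.
    return sorted(candidate_ids, key=rank_score, reverse=True)[:top_k]

rng = np.random.default_rng(0)
items = rng.normal(size=(10_000, 64))   # item embeddings
query = rng.normal(size=64)             # query embedding

cands = ann_candidates(query, items, n_candidates=500)
top = rerank(cands, rank_score=lambda i: items[i] @ query, top_k=10)
```

Over-retrieving 500 candidates for a top-10 result gives the re-ranker room to reorder aggressively without losing relevant items.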
CANDIDATE GENERATION (STAGE 1)
ANN retrieves candidates likely to be relevant based on embedding similarity. The key metric is recall: what fraction of the true top-K items appear in the candidate set? If the true #1 item is not in the candidate set, no amount of re-ranking can surface it.
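Candidate-set recall can be measured directly by comparing against an exact (brute-force) top-K computed offline; a small sketch:

```python
def candidate_recall(true_top_k, candidates):
    # Fraction of the true top-K items that the candidate set covers.
    true_set, cand_set = set(true_top_k), set(candidates)
    return len(true_set & cand_set) / len(true_set)

# 9 of the true top-10 items were retrieved:
r = candidate_recall(range(10), [0, 1, 2, 3, 4, 5, 6, 7, 8, 42])
print(r)  # -> 0.9
```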
Over-retrieve to ensure coverage. If final output is top-10, retrieve 100-500 candidates. The ratio depends on how much re-ranking might reorder results. Highly personalized re-rankers need larger candidate sets.
Multiple retrieval paths: query embedding matches item embeddings, but also retrieve by user history similarity, trending items in category, and business-promoted items. Merge candidates from all paths.
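Merging multiple retrieval paths needs deduplication, since the same item can surface from more than one path. One simple policy, sketched below, keeps the order of first appearance (path names are illustrative):

```python
def merge_paths(*paths):
    # Merge candidate lists from several retrieval paths, preserving
    # order of first appearance and dropping duplicate items.
    seen, merged = set(), []
    for path in paths:
        for item in path:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged

query_path = ["a", "b", "c"]      # query-embedding matches
history_path = ["b", "d"]         # user-history similarity
trending_path = ["e", "a"]        # trending in category
print(merge_paths(query_path, history_path, trending_path))
# -> ['a', 'b', 'c', 'd', 'e']
```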
RE-RANKING (STAGE 2)
The re-ranker is a full ML model (gradient-boosted trees, neural network) with access to rich features. Feature categories:
User features: click history, purchase history, demographic segments, session context, real-time behavior.
Item features: popularity, recency, inventory, category, price, margin.
Cross features: user-item affinity scores, collaborative filtering signals, price sensitivity match.
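The three feature categories above combine into one feature vector per (user, item) pair. A hypothetical assembly step, with illustrative field names rather than a real schema:

```python
def build_features(user, item):
    # Assemble user, item, and cross features for one candidate.
    # All field names here are illustrative assumptions.
    return {
        "user_clicks_7d": user["clicks_7d"],        # user feature
        "item_popularity": item["popularity"],      # item feature
        "item_price": item["price"],                # item feature
        # cross feature: user's affinity for this item's category
        "affinity": user["category_affinity"].get(item["category"], 0.0),
    }

user = {"clicks_7d": 12, "category_affinity": {"shoes": 0.8}}
item = {"popularity": 0.7, "price": 59.9, "category": "shoes"}
features = build_features(user, item)
```

In production this lookup hits a feature store; precomputing user features once per request, rather than once per candidate, keeps the 20ms feature-lookup budget realistic.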
The re-ranker computes a score for each candidate, sorts, and returns the top-K. Latency budget: 20-50ms for 500 candidates. This requires efficient feature lookup and model inference.
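Since only the top-K of ~500 candidates survive, a partial selection avoids sorting the full list. One standard-library option is `heapq.nlargest`, which runs in O(n log k) rather than O(n log n):

```python
import heapq

def rerank_top_k(candidates, score_fn, k=10):
    # Select the k highest-scoring candidates without a full sort.
    # score_fn stands in for full ranking-model inference.
    return heapq.nlargest(k, candidates, key=score_fn)

top3 = rerank_top_k(range(100), score_fn=lambda x: x, k=3)
print(top3)  # -> [99, 98, 97]
```

At n=500 the savings are modest; the bigger win in practice is batching model inference over all candidates at once instead of scoring one at a time.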
LATENCY BUDGET ALLOCATION
Total budget: 100ms. Typical split: 10ms ANN retrieval, 20ms feature lookup, 30ms model inference, 40ms buffer for network and processing. If any stage exceeds its budget, the overall latency target is missed.
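The per-stage split can be enforced with simple wall-clock instrumentation. A sketch, assuming the budget figures above (stage names and the `StageTimer` helper are illustrative):

```python
import time

# Per-stage budgets in ms; the remaining 40ms of the 100ms total
# is buffer for network and processing.
BUDGET_MS = {"ann": 10, "features": 20, "inference": 30}

class StageTimer:
    # Record wall-clock time per pipeline stage and flag overruns.
    def __init__(self):
        self.elapsed_ms = {}

    def run(self, name, fn, *args):
        t0 = time.perf_counter()
        result = fn(*args)
        self.elapsed_ms[name] = (time.perf_counter() - t0) * 1000
        return result

    def over_budget(self):
        return {stage: ms for stage, ms in self.elapsed_ms.items()
                if ms > BUDGET_MS.get(stage, float("inf"))}

timer = StageTimer()
_ = timer.run("ann", lambda: sum(range(1000)))  # stand-in for ANN retrieval
overruns = timer.over_budget()
```

Logging `elapsed_ms` per request makes it easy to see which stage is eating the buffer when p99 latency drifts.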