Two Stage Retrieval: Candidate Generation and Re-ranking at Scale
WHY TWO STAGES
Embedding search is fast but approximate—it captures semantic similarity but misses personalization, freshness, and business rules. Rich ML models are accurate but slow—they can incorporate hundreds of features but cannot score millions of items. Two-stage retrieval combines both strengths.
Stage 1 (retrieval): ANN search returns 100-1000 candidates in 5-20ms. The metric is recall—did we retrieve items that are truly relevant? This stage filters millions down to hundreds.
Stage 2 (re-ranking): An ML model scores each candidate using rich features (user history, item popularity, price, freshness) and returns the final top 10-50. The metrics are precision and NDCG—did we rank the relevant items correctly?
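The two stages can be sketched as follows. This is a minimal toy version: the corpus, feature weights, and function names are illustrative assumptions, and in production Stage 1 would be an ANN index (e.g. HNSW or IVF) while Stage 2 would be a learned model rather than a hand-weighted linear score.

```python
import math

# Toy in-memory "index": item id -> embedding vector (hypothetical data).
CORPUS = {
    "item_a": [0.9, 0.1], "item_b": [0.8, 0.2],
    "item_c": [0.1, 0.9], "item_d": [0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def retrieve(query_vec, k):
    """Stage 1: embedding similarity only -- fast, approximate, recall-oriented."""
    ranked = sorted(CORPUS, key=lambda i: cosine(query_vec, CORPUS[i]), reverse=True)
    return ranked[:k]

def rerank(candidates, features, top_n):
    """Stage 2: richer score mixing affinity, popularity, and freshness."""
    def score(item):
        f = features[item]
        # Hand-picked weights stand in for a trained model.
        return 2.0 * f["affinity"] + 1.0 * f["popularity"] - 0.5 * f["staleness"]
    return sorted(candidates, key=score, reverse=True)[:top_n]

features = {
    "item_a": {"affinity": 0.2, "popularity": 0.9, "staleness": 0.1},
    "item_b": {"affinity": 0.9, "popularity": 0.3, "staleness": 0.0},
    "item_c": {"affinity": 0.5, "popularity": 0.5, "staleness": 0.5},
    "item_d": {"affinity": 0.1, "popularity": 0.1, "staleness": 0.9},
}

candidates = retrieve([1.0, 0.0], k=3)          # millions -> hundreds (here: 4 -> 3)
final = rerank(candidates, features, top_n=2)   # hundreds -> final top-N
```

Note how the re-ranker can reorder what Stage 1 returned: an item with middling embedding similarity but strong personalization signals rises to the top.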
CANDIDATE SET SIZING
Retrieve enough candidates that true positives are included. If final output is top-10 recommendations, retrieve 100-500 candidates. The exact ratio depends on embedding quality and re-ranker power.
Under-retrieval risk: true positives missing from candidate set. The re-ranker cannot surface items it never sees—recall is capped at Stage 1. Over-retrieval cost: re-ranker latency increases linearly with candidate count. Find the minimum candidate set that maintains target recall.
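One way to find that minimum is an offline sweep: replay logged queries with known relevant items, measure Stage 1 recall at several candidate counts, and take the smallest count that meets the target. A sketch, with a made-up evaluation set standing in for real logs:

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of truly relevant items present in the top-k candidates."""
    if not relevant:
        return 1.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def min_candidates_for_recall(eval_set, target, k_grid):
    """Smallest k in k_grid whose mean recall across queries >= target."""
    for k in sorted(k_grid):
        mean_recall = sum(
            recall_at_k(ranked, relevant, k) for ranked, relevant in eval_set
        ) / len(eval_set)
        if mean_recall >= target:
            return k
    return None  # no k in the grid reaches the target

# Each entry: (Stage 1 ranking for a query, ground-truth relevant set).
eval_set = [
    (["a", "b", "c", "d", "e"], {"a", "d"}),
    (["x", "y", "z", "a", "b"], {"y", "b"}),
]
k = min_candidates_for_recall(eval_set, target=1.0, k_grid=[1, 2, 4, 5])
```

In practice the grid would span realistic sizes (e.g. 50 to 1000) and the target would be a recall level like 0.95 rather than 1.0, trading a small recall loss for lower re-ranker latency.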
RE-RANKER FEATURES
User features: click history, purchase history, demographic segments, session context, device type.
Item features: popularity score, days since creation, price tier, category, inventory status.
Cross features: user-item affinity scores, collaborative filtering signals, price sensitivity match—personalization signals that embeddings cannot capture.
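Assembling these three feature groups into a single re-ranker input might look like the sketch below. The field names and dict-based lookups are illustrative; a real system would pull these from feature stores with strict latency budgets.

```python
def build_features(user, item, cross):
    """Flatten user, item, and cross features into one re-ranker input row."""
    return {
        # User features
        "u_clicks_7d": float(user["clicks_7d"]),
        "u_is_mobile": 1.0 if user["device"] == "mobile" else 0.0,
        # Item features
        "i_popularity": item["popularity"],
        "i_days_old": float(item["days_old"]),
        "i_in_stock": 1.0 if item["in_stock"] else 0.0,
        # Cross features: user-item signals a single embedding cannot encode
        "x_affinity": cross["user_item_affinity"],
        "x_price_fit": cross["price_sensitivity_match"],
    }

row = build_features(
    {"clicks_7d": 14, "device": "mobile"},
    {"popularity": 0.82, "days_old": 3, "in_stock": True},
    {"user_item_affinity": 0.61, "price_sensitivity_match": 0.4},
)
```

This per-candidate assembly is exactly why re-ranker latency scales linearly with candidate count: every candidate pays the feature-lookup and scoring cost.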