Embeddings & Similarity Search › Embedding Generation (BERT, Sentence-BERT, Graph Embeddings) · Hard · ⏱️ ~3 min

Two Stage Retrieval: Candidate Generation and Re-ranking at Scale

Production retrieval systems decompose into two stages to balance latency, cost, and quality. Stage one uses fast approximate methods to retrieve 500 to 2000 candidates from a large corpus in under 20 milliseconds. Stage two applies more expensive models to re-rank a smaller set of 100 to 200 items in 30 to 80 milliseconds. This architecture exploits the quality versus compute tradeoff: simpler models scale to billions of items with vector indices, while complex models deliver precision on a focused subset.

Candidate generation relies on embeddings and approximate nearest neighbor search. SBERT or two tower models compute query embeddings in 2 to 10 milliseconds on GPU, then search precomputed document or item embeddings in an HNSW or IVF index. At 100 million items, retrieval returns the top 1000 candidates in 10 to 20 milliseconds with 0.88 to 0.92 recall. The tradeoff is that embedding models compress all context into a fixed size vector, losing nuance: for queries requiring precise phrase matching or negation, candidate generation may surface irrelevant results. Blending with keyword search using BM25 provides a safety net. The sketches after this section illustrate the embedding stage, the re-ranking stage, and the hybrid blend.

Re-ranking applies cross encoders or gradient boosted trees that process query and candidate jointly. Cross encoders concatenate query and document, run a full transformer forward pass, and produce a single relevance score. This joint encoding captures fine grained interactions that embedding dot products miss, improving precision by 5 to 15 percent on top results. The cost is 10 to 100 times higher per item: 50 to 150 milliseconds per pair on CPU. Re-ranking only 100 to 200 candidates keeps total latency under budget. Some systems add a third stage that applies business rules, diversity constraints, and real time personalization signals before final ranking.

YouTube and Google Search use variants of this pattern. Candidate generation pulls from multiple sources: embedding search, trending topics, user history. Re-ranking applies neural networks that incorporate position bias, click through rate predictions, and content quality scores. The final ranking merges these signals with freshness, diversity, and policy filters. Latency budgets are strict: user facing search aims for sub 200 millisecond end to end latency, with 150 milliseconds allocated for ranking. At recommendation scale, candidate generation must handle tens of millions of users and hundreds of millions of items, driving heavy investment in embedding infrastructure and distributed indices.
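A minimal sketch of the candidate generation stage, assuming the sentence-transformers and faiss-cpu packages; the all-MiniLM-L6-v2 checkpoint, the toy corpus, and the HNSW parameters are illustrative choices, not the configuration of any specific production system.

```python
# Candidate generation sketch: SBERT embeddings searched with a FAISS HNSW index.
# Model name, corpus, and index parameters below are illustrative.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional sentence embeddings

docs = ["red running shoes", "wireless noise cancelling headphones", "trail running sneakers"]
# Normalized vectors make inner product equal to cosine similarity.
doc_emb = model.encode(docs, normalize_embeddings=True).astype("float32")

index = faiss.IndexHNSWFlat(doc_emb.shape[1], 32, faiss.METRIC_INNER_PRODUCT)  # M=32 graph links
index.hnsw.efSearch = 64  # larger efSearch trades latency for recall
index.add(doc_emb)

query_emb = model.encode(["shoes for running"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(query_emb, 2)  # top-k approximate nearest neighbors
print([(docs[i], round(float(s), 3)) for i, s in zip(ids[0], scores[0])])
```

In a real system the index is built offline over the full corpus, and M and efSearch are tuned against the recall and latency targets quoted above.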
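A corresponding re-ranking sketch using a cross encoder, again via sentence-transformers; the ms-marco-MiniLM-L-6-v2 checkpoint is a public example model, and in practice the candidate list would come from the stage above.

```python
# Re-ranking sketch: a cross encoder jointly encodes each (query, candidate)
# pair with a full transformer forward pass, which is why it is applied only
# to the 100-200 survivors of candidate generation. Checkpoint is illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "shoes for running"
candidates = ["red running shoes", "trail running sneakers", "wireless noise cancelling headphones"]

scores = reranker.predict([(query, c) for c in candidates])  # one forward pass per pair
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
print(reranked)
```

Because cost grows linearly with candidate count, the stage-one cutoff is the main lever for staying inside the 30 to 80 millisecond re-ranking budget.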
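A sketch of the BM25 safety net: normalize both score lists and blend them with a tunable weight. This assumes the rank_bm25 package; the 0.5 weight and the placeholder embedding scores are illustrative, and the weight would normally be tuned offline.

```python
# Hybrid blending sketch: combine lexical BM25 scores with embedding scores
# after min-max normalization, so neither scale dominates the other.
import numpy as np
from rank_bm25 import BM25Okapi

docs = ["red running shoes", "trail running sneakers", "wireless noise cancelling headphones"]
bm25 = BM25Okapi([d.lower().split() for d in docs])

def minmax(x):
    x = np.asarray(x, dtype="float32")
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x)

bm25_scores = bm25.get_scores("running shoes".lower().split())
emb_scores = np.array([0.82, 0.74, 0.11])  # placeholder cosine scores from the ANN stage

blended = 0.5 * minmax(bm25_scores) + 0.5 * minmax(emb_scores)
print(sorted(zip(docs, blended), key=lambda pair: pair[1], reverse=True))
```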
💡 Key Takeaways
Two stage retrieval splits work: stage one retrieves 500 to 2000 candidates in under 20 milliseconds via embedding search; stage two re-ranks 100 to 200 with expensive models in 30 to 80 milliseconds
Candidate generation uses SBERT or two tower embeddings with approximate nearest neighbor indices, achieving 0.88 to 0.92 recall at 10 to 20 milliseconds for 100 million items
Re-ranking applies cross encoders that jointly encode query and candidate, improving precision by 5 to 15 percent but costing 10 to 100 times more compute per item (50 to 150 milliseconds per pair)
YouTube and Google Search use multi-source candidate generation (embeddings, trending, history) followed by neural re-ranking with click through rate prediction, position bias, and content quality signals
Latency budgets are strict: user facing search targets sub 200 millisecond end to end, allocating 150 milliseconds for ranking, driving careful stage partitioning and batching optimizations
Blending embedding retrieval with keyword search (BM25) provides a safety net for queries needing precise phrase matching or negation, nuances that fixed size embeddings compress away
📌 Examples
E-commerce product search: SBERT retrieves 1000 candidates in 18 milliseconds from 50 million products, a gradient boosted tree re-ranks the top 200 in 45 milliseconds using price, reviews, and click features, and the final 20 results are returned in 80 milliseconds total
Google Search: candidate generation pulls from an embedding index, trending topics, and user history; a neural re-ranker scores candidates, then freshness, diversity, and policy filters shape the final ranking
Spotify Discover Weekly: graph embeddings retrieve 500 track candidates per user in 30 milliseconds, and the re-ranker applies a contextual model using listening time of day and skip rate to order the final playlist