
Ranking Cascades: Trading Off Quality and Latency with Multi-Stage Rankers

After retrieval gives you hundreds or thousands of candidates, you face a new constraint: your best ranking model is too expensive to run on all of them within your latency budget. A cross-encoder that jointly encodes query-item pairs might take 5 milliseconds per item; scoring 2000 candidates would cost 10 seconds, blowing past any interactive latency target. The solution is a ranking cascade: a sequence of progressively more accurate but more expensive models, each pruning the candidate set for the next stage.

Stage one is a lightweight ranker, often a shallow multilayer perceptron (MLP) or a late-interaction model like ColBERT that precomputes item representations. It scores all N candidates (say 2000) in 10 to 30 milliseconds total and prunes to M candidates (say 300). Stage two is a heavier model, perhaps a deeper transformer or cross-encoder, that scores those 300 in another 30 to 100 milliseconds and produces the final top K (say 50 items for display). Each stage spends more compute per item but on fewer items, keeping total latency manageable.

The key insight is that most candidates are clearly irrelevant and do not need expensive evaluation. The lightweight first stage is accurate enough to discard the bottom 85% of candidates, so you only pay for the expensive cross-encoder on the top 15%, where fine-grained distinctions matter. Meta and Google publicly describe similar cascades for feed and ads ranking, where initial models run on thousands of items in tens of milliseconds and final rankers run on hundreds in another 50 to 100 milliseconds. In RAG, a common cascade is: retrieve 25 to 100 chunks with dense or hybrid search, then cross-encode every (query, chunk) pair to re-rank and select the top 3 to 10 for the language model. One benchmark reported retrieving 32 chunks and cross-encoding them in a second stage, improving relevance significantly over retrieval-only ranking. The trade-off is clear: cascades add complexity (multiple models to train and serve) but buy quality improvements that a single-stage budget cannot afford.
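A minimal sketch of the two-stage pattern, assuming hypothetical `light_score` and `heavy_score` callables standing in for the shallow MLP and the cross-encoder (in a real system each would be a batched inference call against its own serving endpoint):

```python
from typing import Callable, List, Tuple

def cascade_rank(
    query: str,
    candidates: List[str],
    light_score: Callable[[str, List[str]], List[float]],  # cheap: e.g. shallow MLP / late interaction
    heavy_score: Callable[[str, List[str]], List[float]],  # expensive: e.g. cross-encoder
    m: int = 300,   # survivors of stage one
    k: int = 50,    # final items for display
) -> List[Tuple[str, float]]:
    """Score all candidates cheaply, prune to m, re-score the survivors expensively, return top k."""
    # Stage one: score every candidate with the lightweight model and keep the top m.
    cheap = light_score(query, candidates)
    survivors = [c for _, c in sorted(zip(cheap, candidates), key=lambda p: p[0], reverse=True)[:m]]

    # Stage two: spend the expensive model only on the m survivors, then keep the top k.
    strong = heavy_score(query, survivors)
    ranked = sorted(zip(strong, survivors), key=lambda p: p[0], reverse=True)[:k]
    return [(item, score) for score, item in ranked]
```

Keeping the two scorers as plain callables keeps the skeleton model-agnostic; the cascade structure stays the same whether stage one is an MLP, BM25, or a late-interaction model.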
💡 Key Takeaways
Stage one uses a lightweight model (shallow MLP, late-interaction ranker) to score all N candidates (500 to 2000) in 10 to 30 milliseconds and prune to M candidates (100 to 300), discarding clearly irrelevant items cheaply
Stage two uses an expensive model (deep transformer, cross-encoder) on the pruned M candidates in another 30 to 100 milliseconds, spending more compute per item where fine-grained ranking matters most
The compute trade-off is asymmetric: a cross-encoder might be 20x slower per item than an MLP, but by pruning to 15% of candidates you cut total cost roughly 5x versus running the cross-encoder on everything, while improving quality over the lightweight model alone (see the cost sketch after this list)
RAG cascade example: retrieve 32 chunks with dense search (15ms), cross-encode all 32 (query, chunk) pairs (85ms at roughly 2.7ms per pair), select the top 5 for the language model context. This improved answer relevance by 28% over dense-retrieval ranking alone in benchmarks.
Failure mode: if the first-stage model is too aggressive and prunes truly relevant items, the second stage cannot recover them. Tune stage one for high recall (keep the top 15 to 20 percent) rather than precision to protect overall quality.
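A back-of-the-envelope check of the roughly-5x figure in the trade-off bullet above, in relative cost units (the 20x ratio and 15% keep rate are the illustrative numbers from this section, not measurements):

```python
# Relative-cost model: cascade vs. running the cross-encoder on every candidate.
N = 2000           # candidates coming out of retrieval
mlp_unit = 1.0     # per-item cost of the lightweight stage (arbitrary unit)
ce_unit = 20.0     # cross-encoder assumed ~20x more expensive per item
keep = 0.15        # stage one keeps the top 15%

cross_encoder_everywhere = N * ce_unit              # 40000 units
cascade = N * mlp_unit + N * keep * ce_unit         # 2000 + 6000 = 8000 units
print(cross_encoder_everywhere / cascade)           # 5.0 -> the ~5x saving claimed above
```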
📌 Examples
Meta feed ranking: First-stage MLP scores 3000 posts with 50 features in 20ms (about 0.007ms per post), prunes to 400. Second-stage deep neural network with 200 features and interaction layers scores those 400 in 60ms (0.15ms per post), selects the top 100 for the feed. End-to-end latency stays under 100ms at p95.
LinkedIn job search: Stage one scores 1500 jobs with a gradient-boosted tree model in 15ms, keeps the top 200. Stage two applies a listwise ranking transformer optimizing Normalized Discounted Cumulative Gain (NDCG) on those 200 in 50ms, producing the final 25 jobs displayed.
RAG pipeline: Retrieve 100 chunks via hybrid search (BM25 plus dense). Stage one: lightweight bi-encoder re-ranks all 100 by cosine similarity in 8ms, prunes to 25. Stage two: cross-encoder scores 25 pairs in 70ms (2.8ms per pair), selects the top 3 chunks for GPT-4 context (sketched in code after these examples). Total latency 93ms including retrieval; answer accuracy improves 31% versus retrieval-only ranking.
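A sketch of that second-stage re-rank for RAG, assuming the candidate chunks already came back from the hybrid retriever. It uses the sentence-transformers CrossEncoder class and the public cross-encoder/ms-marco-MiniLM-L-6-v2 checkpoint as one common choice; the cited benchmarks do not specify their exact model.

```python
# Second-stage re-ranking for RAG. `retrieved_chunks` is assumed to come from a
# hybrid (BM25 + dense) retriever upstream; the checkpoint below is illustrative.
from sentence_transformers import CrossEncoder

def rerank_chunks(query: str, retrieved_chunks: list[str], top_k: int = 3) -> list[str]:
    """Jointly score every (query, chunk) pair and keep the top_k chunks for the LLM context."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    # The per-pair joint encoding is the expensive step the cascade reserves for few items.
    scores = model.predict([(query, chunk) for chunk in retrieved_chunks])
    ranked = sorted(zip(scores, retrieved_chunks), key=lambda pair: float(pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:top_k]]
```

In a real service the CrossEncoder would be loaded once at startup and the predict call batched, rather than instantiated per request as in this sketch.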