What is a Retrieval and Ranking Pipeline?
A retrieval and ranking pipeline is a two-stage architecture that solves a fundamental constraint: you cannot apply expensive, accurate models to millions of items within milliseconds. The solution is elegant: split the work into two stages with different optimization goals.
Retrieval (stage one) is optimized for high recall and speed. It uses lightweight signals like sparse lexical matching (BM25), dense semantic embeddings, graph proximity, or collaborative filtering to rapidly narrow millions or billions of candidates down to hundreds or thousands. Think of it as casting a wide net to ensure you do not miss relevant items. This stage typically runs in single-digit to tens of milliseconds.
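As a rough sketch of the retrieval stage, here is a brute-force dense-retrieval function in NumPy; the corpus size, embedding dimension, and candidate count are made-up toy values, and a real system would replace the full scan with an approximate nearest-neighbor index (FAISS, HNSW, ScaNN) to stay within single-digit milliseconds at million-item scale.

```python
import numpy as np

def retrieve_top_k(query_emb: np.ndarray, item_embs: np.ndarray, k: int = 500) -> np.ndarray:
    """Stage one: cheap, recall-oriented candidate generation.

    Scores every item by cosine similarity to the query embedding and keeps
    the top-k. Production systems swap this brute-force scan for an ANN index.
    """
    item_norms = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    query_norm = query_emb / np.linalg.norm(query_emb)
    scores = item_norms @ query_norm
    top_k = np.argpartition(-scores, k)[:k]       # unordered top-k in O(n)
    return top_k[np.argsort(-scores[top_k])]      # sort only the k survivors

# Toy usage: 100k items with 64-dim embeddings (placeholder sizes).
rng = np.random.default_rng(0)
items = rng.normal(size=(100_000, 64)).astype(np.float32)
query = rng.normal(size=64).astype(np.float32)
candidates = retrieve_top_k(query, items, k=500)  # candidate item IDs for the ranker
```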
Ranking (stage two) is optimized for precision. Now that you have a manageable candidate set, you can afford to spend compute on sophisticated models like deep neural rankers, cross-encoders, or listwise optimizers. These evaluate each item deeply, using hundreds of features and complex interactions to predict utility (relevance, engagement, conversion). This is where you decide the final order users will see.
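To make the ranking stage concrete, the sketch below stands in a tiny two-layer scorer for a real deep ranker or cross-encoder: it consumes a richer feature vector per candidate and sorts the small candidate set by predicted utility. The feature count and random weights are purely illustrative.

```python
import numpy as np

def rank_candidates(candidate_features: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Stage two: precision-oriented scoring of the retrieved candidates.

    candidate_features has shape (num_candidates, num_features); features can
    be far richer here than at retrieval time (user-item crosses, recency,
    context). The two-layer MLP is a placeholder for a real deep ranker.
    """
    hidden = np.maximum(candidate_features @ w1, 0.0)  # ReLU hidden layer
    utility = hidden @ w2                              # predicted utility per candidate
    return np.argsort(-utility)                        # best candidates first

# Toy usage: 500 candidates, 128 ranking features, made-up weights.
rng = np.random.default_rng(1)
feats = rng.normal(size=(500, 128)).astype(np.float32)
w1 = rng.normal(size=(128, 32)).astype(np.float32)
w2 = rng.normal(size=32).astype(np.float32)
ranked_order = rank_candidates(feats, w1, w2)          # indices into the candidate set
```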
The trade-off is fundamental: you exchange search space for model capacity. By reducing from millions of items to thousands in retrieval, you free up the compute budget to apply expensive models in ranking. This pattern powers Google Search, YouTube recommendations, Meta's feed and ads, LinkedIn job recommendations, and Retrieval-Augmented Generation (RAG) systems, where you retrieve many document chunks but re-rank to select only the most relevant few for the language model context.
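Put together, the control flow is simply "cheap scorer over everything, expensive scorer over the survivors." The sketch below uses hypothetical scoring callables to show where the compute budget shifts; any real system would plug in an ANN retriever and a trained ranking model.

```python
def two_stage_pipeline(query, all_item_ids, cheap_score, expensive_score,
                       k_retrieve=500, k_final=10):
    """Exchange search space for model capacity: the cheap scorer sees every
    item, the expensive scorer only sees the k_retrieve survivors."""
    candidates = sorted(all_item_ids, key=lambda i: cheap_score(query, i), reverse=True)[:k_retrieve]
    reranked = sorted(candidates, key=lambda i: expensive_score(query, i), reverse=True)
    return reranked[:k_final]

# Toy usage with placeholder scorers (hash-based, purely illustrative).
final_items = two_stage_pipeline(
    query="laptop",
    all_item_ids=range(100_000),
    cheap_score=lambda q, i: hash((q, i)) % 1000,             # stands in for BM25 / embedding dot product
    expensive_score=lambda q, i: hash((q, i, "deep")) % 1000,  # stands in for a deep neural ranker
)
```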
💡 Key Takeaways
• Retrieval optimizes for recall and speed (1 to 10 milliseconds per generator), reducing millions of items to hundreds or thousands using lightweight signals like BM25, dense embeddings, or graph proximity
• Ranking optimizes for precision and quality (30 to 100 milliseconds total), applying expensive deep neural networks or cross-encoders to the smaller candidate set to predict engagement or relevance
• The fundamental trade-off is search space for model capacity: by shrinking the search space 1000x in retrieval, you can afford 1000x more compute per item in ranking
• YouTube uses this pattern to go from millions of videos to a few hundred candidates via lightweight models, then ranks with deep personalized rankers trained on billions of interactions
• In RAG systems, retrieval fetches 25 to 100 document chunks (high recall), then a cross-encoder re-ranker selects the top 3 to 10 chunks (high precision) that fit the language model context window
📌 Examples
YouTube (Google): Candidate generation pulls a few hundred videos from millions using collaborative filtering and topic matching in under 10ms. A deep neural ranker then scores these hundreds with personalized engagement predictions to select the final dozen for the homepage.
Meta Feed Ranking: FAISS-based Approximate Nearest Neighbor (ANN) retrieval over learned embeddings narrows millions of posts to thousands in single-digit milliseconds. DLRM (Deep Learning Recommendation Model) then ranks these thousands with hundreds of features to select the final feed within a total budget of tens of milliseconds.
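As an illustration of the FAISS retrieval step (toy sizes and random embeddings; an exact inner-product index is used here for simplicity, whereas production deployments use approximate variants such as IVF or HNSW to reach single-digit-millisecond latency at this scale):

```python
import faiss
import numpy as np

d, n = 64, 100_000                                     # toy dimensions, not production scale
rng = np.random.default_rng(2)
post_embs = rng.normal(size=(n, d)).astype(np.float32)
faiss.normalize_L2(post_embs)                          # cosine similarity via inner product

index = faiss.IndexFlatIP(d)                           # exact index; swap for IVF/HNSW for ANN at scale
index.add(post_embs)

user_emb = rng.normal(size=(1, d)).astype(np.float32)
faiss.normalize_L2(user_emb)
scores, candidate_ids = index.search(user_emb, 1000)   # thousands of candidates for the downstream ranker
```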
RAG pipeline: Dense retrieval (sentence transformers) fetches the top 25 chunks from a vector database in 15ms. A cross-encoder re-ranker scores all 25 query-chunk pairs in 80ms and selects the top 3 most relevant chunks to pass to GPT-4, improving answer accuracy by 35% compared to retrieval only.
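A minimal sketch of that re-ranking step, assuming the sentence-transformers CrossEncoder API and the publicly available ms-marco MiniLM cross-encoder checkpoint; the vector-database call in the usage comment is hypothetical.

```python
from sentence_transformers import CrossEncoder

def rerank_chunks(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    """Score (query, chunk) pairs jointly with a cross-encoder and keep the best few.

    The cross-encoder is far more accurate than the bi-encoder used for
    retrieval, but too slow to run over the whole corpus, so it only sees
    the 25-100 chunks the retriever returned.
    """
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed checkpoint name
    scores = model.predict([(query, chunk) for chunk in chunks])
    best = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in best[:top_n]]

# Usage sketch: `vector_db.search` is a hypothetical retrieval call.
# retrieved = vector_db.search(query, k=25)
# context_chunks = rerank_chunks(query, retrieved, top_n=3)
```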