Retrieval-Augmented Generation (RAG): Applying Retrieval and Ranking to Language Models
Retrieval-Augmented Generation (RAG) applies the retrieval-and-ranking pipeline pattern to language models, addressing two limits: bounded context windows and knowledge cutoffs. A language model like GPT-4 can only process a fixed context (8K to 128K tokens) and knows nothing about proprietary documents or events after its training cutoff. RAG bridges this gap by retrieving relevant document chunks from an external corpus and injecting them into the prompt, grounding the model's generation in retrieved facts.
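A minimal sketch of the prompt-assembly step makes the grounding idea concrete. Here `retrieve` and `generate` are hypothetical placeholders for a retrieval pipeline and an LLM client, not a specific library's API:

```python
# Minimal RAG prompt assembly: ground the answer in retrieved chunks.
# `retrieve` and `generate` are hypothetical placeholders, not a specific
# library's API.

def build_rag_prompt(question: str, chunks: list[str]) -> str:
    # Number the chunks so the model can refer back to its sources.
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer(question: str, retrieve, generate, top_k: int = 5) -> str:
    chunks = retrieve(question, top_k=top_k)  # stages one and two of the pipeline
    return generate(build_rag_prompt(question, chunks))
```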
The pipeline mirrors recommendation systems. Retrieval (stage one) uses sparse (BM25), dense (sentence transformers, embeddings), or hybrid methods to fetch a high-recall set of candidate chunks (typically 25 to 100) from a vector database or search index in 10 to 30 milliseconds. This casts a wide net to ensure relevant information is not missed. Re-ranking (stage two) applies a cross-encoder to score each (query, chunk) pair and selects the top K chunks (typically 3 to 10) that fit the language model's context window, optimizing for precision. This step is critical: without re-ranking, mediocre chunks dilute the context and degrade answer quality.
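The two-stage pattern can be sketched with the open-source rank_bm25 and sentence-transformers packages. The model names, the 50-candidate / top-5 settings, and the min-max score fusion are illustrative choices within the ranges described above, not a prescribed configuration:

```python
# Two-stage retrieval: hybrid candidate generation (recall), then
# cross-encoder re-ranking (precision). Sketch built on rank_bm25 and
# sentence-transformers; model names and settings are illustrative.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, CrossEncoder

corpus = ["..."]  # your document chunks
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_emb = bi_encoder.encode(corpus, normalize_embeddings=True)
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, n_candidates: int = 50, top_k: int = 5) -> list[str]:
    # Stage 1a: sparse scores catch exact term matches.
    sparse = np.asarray(bm25.get_scores(query.lower().split()))
    # Stage 1b: dense scores capture semantic similarity (cosine on normalized vectors).
    q_emb = bi_encoder.encode(query, normalize_embeddings=True)
    dense = doc_emb @ q_emb

    # Hybrid fusion: min-max normalize each signal, then average.
    def norm(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    hybrid = 0.5 * norm(sparse) + 0.5 * norm(dense)
    candidates = np.argsort(-hybrid)[:n_candidates]

    # Stage 2: cross-encoder scores every (query, chunk) pair, keep the top_k.
    pairs = [(query, corpus[i]) for i in candidates]
    ce_scores = cross_encoder.predict(pairs)
    order = np.argsort(-ce_scores)[:top_k]
    return [corpus[candidates[i]] for i in order]
```

The stage-one fusion only needs to be cheap and recall-oriented; the cross-encoder at stage two is what supplies the precision.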
Numbers from educational benchmarks illustrate the impact. One study found that hybrid retrieval (BM25 plus dense) combined with Hypothetical Document Embeddings (HyDE, generating a hypothetical answer and using it as the retrieval query) maximized accuracy but took 11.7 seconds per query, too slow for interactive use. A more practical configuration retrieved 32 chunks and cross-encoded them, achieving strong accuracy at under 100 milliseconds per query. Another common setup retrieves 25 chunks and re-ranks to select the top 3, improving answer relevance by 35 percent compared to retrieval-only ranking.
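HyDE itself is a small addition on the retrieval side; the cost is the extra generation call. A sketch, where `llm_generate` is a hypothetical stand-in for whatever language-model call is used:

```python
# HyDE: embed a hypothetical answer generated by an LLM instead of the raw
# question, then retrieve chunks near that hypothetical document.
# `llm_generate` is a hypothetical placeholder for an LLM call.

def hyde_query_embedding(question: str, llm_generate, bi_encoder):
    hypothetical = llm_generate(
        f"Write a short passage that answers this question: {question}"
    )
    # The generated passage tends to sit closer to relevant chunks in
    # embedding space than the terse question does.
    return bi_encoder.encode(hypothetical, normalize_embeddings=True)
```

That extra generation call is largely what pushes latency into the multi-second range in the benchmark above, which is why interactive systems often omit it.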
Failure modes are instructive. Training-serving skew (a model trained on 512-token chunks but served with 256-token chunks) caused 20 percent accuracy drops. Poor chunking (splitting mid-sentence or ignoring document structure) degraded retrieval recall by 15 to 30 percent. Over-retrieval (passing 20 mediocre chunks instead of 5 good ones) increased hallucination rates because the model struggled to separate signal from noise. The lesson: RAG is not just about retrieval speed; chunking quality, re-ranking precision, and alignment between training and serving are equally critical.
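A minimal chunker with overlap illustrates the chunking point; the whitespace split is a simplification (production pipelines typically count tokens with the embedding model's own tokenizer), and the sizes are illustrative:

```python
# Fixed-size chunking with overlap, so a sentence that straddles a boundary
# still appears intact in at least one chunk. Whitespace tokenization is a
# simplification; count tokens with the model's tokenizer in practice.

def chunk_tokens(text: str, chunk_size: int = 256, overlap: int = 20) -> list[str]:
    tokens = text.split()
    chunks, start = [], 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
        start += chunk_size - overlap  # step forward, keeping `overlap` tokens shared
    return chunks
```

Reusing the same chunker, with the same parameters, for both indexing and serving is the simplest guard against the training-serving skew described above.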
💡 Key Takeaways
• RAG retrieval targets high recall: fetch 25 to 100 candidate chunks using sparse (BM25), dense (sentence transformers), or hybrid methods in 10 to 30 milliseconds. This ensures relevant information is not missed due to paraphrase or semantic variation.
• Re-ranking targets high precision: a cross-encoder scores all (query, chunk) pairs and selects the top 3 to 10 chunks that fit the language model's context window. This step improves answer accuracy by 30 to 40 percent compared to retrieval-only ranking.
• Chunking is critical: split documents into 175 to 512 token chunks with 10 to 20 token overlaps to preserve context across boundaries. Poor chunking (mid-sentence splits, no overlap) degrades retrieval recall by 15 to 30 percent.
• Hybrid retrieval (BM25 plus dense) is standard because it hedges failure modes: BM25 catches exact term matches, dense captures semantic similarity. One benchmark showed hybrid improving accuracy 18 percent over dense-only.
• Training-serving skew is a common failure: if training uses different chunk sizes, overlap, or text normalization than serving, accuracy drops 15 to 25 percent. Unify preprocessing pipelines to prevent this (see the config sketch after this list).
• Trade-off example: hybrid retrieval plus HyDE maximized accuracy in one study but took 11.7 seconds per query. Practical production systems use hybrid retrieval plus cross-encoder re-ranking at under 100 milliseconds per query for interactive use.
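One way to enforce the unification called for in the training-serving bullet is to define the preprocessing parameters in a single shared object that both the offline indexer and the online query path import. Names and defaults here are illustrative:

```python
# Keep indexing and serving preprocessing in lockstep by defining the
# parameters once and importing this object in both pipelines.
# Names and default values are illustrative, not from the text.
from dataclasses import dataclass

@dataclass(frozen=True)
class PreprocessConfig:
    chunk_size: int = 256   # tokens per chunk
    overlap: int = 20       # tokens shared between adjacent chunks
    lowercase: bool = True
    strip_markup: bool = True

# The offline indexer and the online query path should both build their
# chunker and normalizer from this single object, so a change made at
# indexing time cannot silently diverge from serving.
CONFIG = PreprocessConfig()
```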
📌 Examples
Question answering over documentation: Retrieve the top 50 chunks via hybrid search (BM25 plus Sentence-BERT) in 18ms. A cross-encoder re-ranks all 50 in 80ms and selects the top 5. Pass those 5 chunks (2,000 tokens total) to GPT-4 for answer generation. Answer accuracy is 82 percent (F1 score), versus 58 percent without re-ranking.
Customer support chatbot: Dense retrieval over 100K support articles fetches 32 candidate chunks in 12ms. A lightweight bi-encoder prunes them to 15 in 8ms. A cross-encoder scores the 15 pairs in 45ms and selects the top 3. The LLM generates an answer grounded in those 3 chunks. The hallucination rate drops from 22 percent (retrieval only) to 7 percent (with re-ranking).
Research paper search: Hybrid retrieval (lexical plus citation-graph embeddings) returns 100 chunks in 25ms. A stage-one ranker (late interaction) prunes them to 20 in 10ms. A stage-two cross-encoder selects the top 10 in 60ms. Total latency is 95ms, and Mean Average Precision at 10 (MAP@10) improves from 0.68 (retrieval only) to 0.81 (with the cascade).
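The MAP@10 figures in the last example come from a standard metric that is easy to add to an evaluation harness. A sketch of average precision at K, assuming ground-truth relevant chunk ids are available per query:

```python
# Mean Average Precision at K, the metric cited in the research-paper example.
# `ranked` is the system's ordered result ids; `relevant` is the ground truth.

def average_precision_at_k(ranked: list[str], relevant: set[str], k: int = 10) -> float:
    hits, score = 0, 0.0
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            hits += 1
            score += hits / rank  # precision at this cut-off
    return score / min(len(relevant), k) if relevant else 0.0

def map_at_k(runs: list[list[str]], truths: list[set[str]], k: int = 10) -> float:
    # Average the per-query AP@K over all queries in the evaluation set.
    return sum(average_precision_at_k(r, t, k) for r, t in zip(runs, truths)) / len(runs)
```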