Training Dense Retrievers: Contrastive Learning and Hard Negatives
Training a dense retriever means teaching the model which query-document pairs should be close in embedding space and which should be far apart. Contrastive learning provides the framework: for each training query, you need positive examples (relevant documents) and negative examples (irrelevant documents). The loss function pulls positives closer and pushes negatives farther away, producing a metric space where similarity scores correlate with relevance.
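The text doesn't pin down a specific loss; a common choice is the InfoNCE form, sketched here with a similarity function sim (typically dot product or cosine) and a temperature hyperparameter τ, both assumptions rather than details from the text:

$$
\mathcal{L}(q, d^{+}) = -\log \frac{\exp(\mathrm{sim}(q, d^{+})/\tau)}{\exp(\mathrm{sim}(q, d^{+})/\tau) + \sum_{d^{-} \in N} \exp(\mathrm{sim}(q, d^{-})/\tau)}
$$

Minimizing this loss raises the positive's score relative to every negative in the set N, which is exactly the pull-positives-closer, push-negatives-apart behavior described above.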
In-batch negatives are the most efficient training strategy. In a batch of 64 query-document pairs, each query treats the other 63 documents as negatives, yielding thousands of negative pairs per gradient update at no extra computational cost. The technique works because random documents are usually irrelevant to any given query. However, random negatives are often too easy: the model quickly learns to separate obviously different content but fails to discriminate subtle cases.
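As a concrete sketch, here is what an in-batch negatives loss can look like in PyTorch; the function name, temperature value, and embedding dimensions are illustrative, not from the text:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              doc_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE loss with in-batch negatives.

    query_emb, doc_emb: [batch, dim] tensors where row i of doc_emb is
    the positive passage for row i of query_emb; every other row in the
    batch serves as a negative for that query.
    """
    # Cosine similarity: normalize, then take all pairwise dot products.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    scores = q @ d.T / temperature          # [batch, batch]
    # The correct document for query i sits on the diagonal.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# With a batch of 64 pairs, each query sees 1 positive and 63 negatives.
loss = in_batch_contrastive_loss(torch.randn(64, 768), torch.randn(64, 768))
```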
Hard negatives dramatically improve model quality. These are documents that score highly under simpler methods like Best Match 25 (BM25), or under a previous version of your dense retriever, but are actually not relevant. Mining hard negatives means periodically querying your current index with training queries, retrieving the top candidates, and selecting high-scoring non-relevant passages. Systems typically mine fresh hard negatives every few thousand training steps. Adding just 1 to 3 hard negatives per query alongside in-batch negatives can improve recall at 100 by 5 to 10 percentage points.
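A minimal sketch of the mining loop; `index.search`, `encode_query`, and the id-based bookkeeping are placeholders for whatever ANN store and encoder a real system uses:

```python
import random

def mine_hard_negatives(index, encode_query, train_queries, positives,
                        top_k=100, per_query=2):
    """Mine hard negatives by querying the current index.

    train_queries: {query_id: query_text}
    positives:     {query_id: set of relevant doc ids}
    Assumes index.search returns an iterable of doc ids (hypothetical API).
    """
    hard_negatives = {}
    for qid, text in train_queries.items():
        # Retrieve the model's current top candidates for this query.
        candidates = index.search(encode_query(text), top_k)
        # High-scoring passages that are NOT labeled relevant are the hard
        # negatives: the model currently mistakes them for matches.
        non_relevant = [doc_id for doc_id in candidates
                        if doc_id not in positives[qid]]
        hard_negatives[qid] = random.sample(
            non_relevant, min(per_query, len(non_relevant)))
    return hard_negatives
```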
Knowledge distillation from cross-encoders provides another significant boost. A cross-encoder processes the query and document together with full token-level attention, achieving higher accuracy, but it must score every candidate online, which is infeasible for initial retrieval. You can train a cross-encoder on labeled data or click logs, then use it as a teacher: generate scores for query-document pairs and train your dual encoder to match them. Distillation consistently improves retrieval quality by 3 to 7 points of recall while preserving the dual encoder's speed advantage. Many production systems at Google and Microsoft use this pattern: distill from a strong cross-encoder teacher into a fast dual-encoder student for deployment.
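One common way to implement the score-matching step, sketched under the assumption that teacher and student score the same candidate list per query (temperature and tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_scores: torch.Tensor,
                      teacher_scores: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student score distributions.

    Both tensors are [batch, n_candidates]: for each query, the same
    candidates scored by the cross-encoder teacher (offline) and by the
    dual-encoder student (dot products of its embeddings).
    """
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    # Pushes the student's ranking distribution toward the teacher's.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```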
💡 Key Takeaways
•In-batch negatives provide thousands of negative examples per gradient update by treating the other documents in the batch as negatives; efficient, but often too easy for the model
•Hard negatives mined from BM25 or a previous ANN retriever improve discrimination, typically boosting recall at 100 by 5 to 10 percentage points with just 1 to 3 hard negatives per query
•Cross-encoder distillation transfers knowledge from a model that sees the query and document jointly, improving retrieval quality by 3 to 7 recall points while keeping dual encoder speed
•Hard negatives must be refreshed periodically by re-querying the current index every few thousand training steps so they stay relevant as the model improves
•Training trade-off: hard negative mining and distillation increase training cost and complexity, but both are necessary for competitive production quality
📌 Examples
Meta's DPR trains with one gold positive passage per question plus a BM25 hard negative, using in-batch negatives at batch size 128, reaching 78.4% top-20 (85.4% top-100) retrieval accuracy on Natural Questions
Microsoft Bing trains dual encoders with cross-encoder distillation, where a 12-layer cross-encoder teacher generates scores for 100K query-document pairs per day, which are used to train the dual-encoder student
Training pipeline: mine hard negatives weekly by querying the live index with 50K training queries, retrieving the top 100, filtering out positives, and sampling 2 hard negatives per query for the next training cycle