Training Dense Retrievers: Contrastive Learning and Hard Negatives
Training a dense retriever means teaching the model which query-document pairs should be close in embedding space and which should be far apart. Contrastive learning provides the framework: for each training query, you need positive examples (relevant documents) and negative examples (irrelevant documents). The loss function pulls positives closer and pushes negatives farther away, producing a metric space where similarity scores correlate with relevance.
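The text doesn't pin down a specific loss; a common choice is the InfoNCE form, sketched here with a similarity function sim (typically dot product or cosine) and a temperature hyperparameter τ, both assumptions rather than details from the text:

$$
\mathcal{L}(q, d^{+}) = -\log \frac{\exp(\mathrm{sim}(q, d^{+})/\tau)}{\exp(\mathrm{sim}(q, d^{+})/\tau) + \sum_{d^{-} \in N} \exp(\mathrm{sim}(q, d^{-})/\tau)}
$$

Minimizing this loss raises the positive's score relative to every negative in the set N, which is exactly the pull-positives-closer, push-negatives-apart behavior described above.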
In-batch negatives are the most efficient training strategy. In a batch of 64 query-document pairs, each query treats the other 63 documents as negatives, yielding thousands of negative pairs per gradient update at no extra computational cost. The technique works because random documents are usually irrelevant to any given query. However, random negatives are often too easy: the model quickly learns to separate obviously different content but fails to discriminate subtle cases.
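As a concrete sketch, here is what an in-batch negatives loss can look like in PyTorch; the function name, temperature value, and embedding dimensions are illustrative, not from the text:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor,
                              doc_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE loss with in-batch negatives.

    query_emb, doc_emb: [batch, dim] tensors where row i of doc_emb is
    the positive passage for row i of query_emb; every other row in the
    batch serves as a negative for that query.
    """
    # Cosine similarity: normalize, then take all pairwise dot products.
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    scores = q @ d.T / temperature          # [batch, batch]
    # The correct document for query i sits on the diagonal.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)

# With a batch of 64 pairs, each query sees 1 positive and 63 negatives.
loss = in_batch_contrastive_loss(torch.randn(64, 768), torch.randn(64, 768))
```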
Hard negatives dramatically improve model quality. These are documents that score highly under simpler methods like Best Match 25 (BM25), or under a previous version of your dense retriever, but are actually not relevant. Mining hard negatives means periodically querying your current index with training queries, retrieving the top candidates, and selecting high-scoring non-relevant passages. Systems typically mine fresh hard negatives every few thousand training steps. Adding just 1 to 3 hard negatives per query alongside in-batch negatives can improve recall at 100 by 5 to 10 percentage points.
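A minimal sketch of the mining loop; `index.search`, `encode_query`, and the id-based bookkeeping are placeholders for whatever ANN store and encoder a real system uses:

```python
import random

def mine_hard_negatives(index, encode_query, train_queries, positives,
                        top_k=100, per_query=2):
    """Mine hard negatives by querying the current index.

    train_queries: {query_id: query_text}
    positives:     {query_id: set of relevant doc ids}
    Assumes index.search returns an iterable of doc ids (hypothetical API).
    """
    hard_negatives = {}
    for qid, text in train_queries.items():
        # Retrieve the model's current top candidates for this query.
        candidates = index.search(encode_query(text), top_k)
        # High-scoring passages that are NOT labeled relevant are the hard
        # negatives: the model currently mistakes them for matches.
        non_relevant = [doc_id for doc_id in candidates
                        if doc_id not in positives[qid]]
        hard_negatives[qid] = random.sample(
            non_relevant, min(per_query, len(non_relevant)))
    return hard_negatives
```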
Knowledge distillation from cross-encoders provides another significant boost. A cross-encoder processes the query and document together with full token-level attention, achieving higher accuracy, but it must score every candidate online, which is infeasible for initial retrieval. You can train a cross-encoder on labeled data or click logs, then use it as a teacher: generate scores for query-document pairs and train your dual encoder to match them. Distillation consistently improves retrieval quality by 3 to 7 points of recall while preserving the dual encoder's speed advantage. Many production systems at Google and Microsoft use this pattern: distill from a strong cross-encoder teacher into a fast dual-encoder student for deployment.
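One common way to implement the score-matching step, sketched under the assumption that teacher and student score the same candidate list per query (temperature and tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_scores: torch.Tensor,
                      teacher_scores: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between teacher and student score distributions.

    Both tensors are [batch, n_candidates]: for each query, the same
    candidates scored by the cross-encoder teacher (offline) and by the
    dual-encoder student (dot products of its embeddings).
    """
    teacher_probs = F.softmax(teacher_scores / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_scores / temperature, dim=-1)
    # Pushes the student's ranking distribution toward the teacher's.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```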
💡 Key Takeaways
•In-batch negatives provide thousands of negative examples per gradient update by treating the other documents in the batch as negatives; efficient, but often too easy for the model
•Hard negatives mined from BM25 or a previous ANN retriever improve discrimination, typically boosting recall at 100 by 5 to 10 percentage points with just 1 to 3 hard negatives per query
•Cross-encoder distillation transfers knowledge from a model that sees the query and document jointly, improving retrieval quality by 3 to 7 recall points while keeping dual encoder speed
•Hard negatives must be refreshed periodically by re-querying the current index every few thousand training steps so they stay relevant as the model improves
•Training trade-off: hard negative mining and distillation increase training cost and complexity, but both are necessary for competitive production quality
📌 Examples
Meta's DPR trains with one gold positive passage per question plus a BM25 hard negative, using in-batch negatives at batch size 128, reaching 78.4% top-20 (85.4% top-100) retrieval accuracy on Natural Questions
Microsoft Bing trains dual encoders with cross-encoder distillation, where a 12-layer cross-encoder teacher generates scores for 100K query-document pairs per day, which are used to train the dual-encoder student
Training pipeline: mine hard negatives weekly by querying the live index with 50K training queries, retrieving the top 100, filtering out positives, and sampling 2 hard negatives per query for the next training cycle