ML-Powered Search & Ranking › Dense Retrieval (BERT-based Embeddings) · Medium · ⏱️ ~2 min

Training Dense Retrievers: Contrastive Learning and Hard Negatives

Core Concept
Contrastive learning trains the encoder to pull relevant query-document pairs close together in vector space while pushing irrelevant pairs apart. The loss penalizes the model whenever a negative document scores closer to the query than the positive does.
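This objective is commonly implemented as an InfoNCE-style softmax cross-entropy over query-document similarities. Below is a minimal NumPy sketch for a single query; the function name, the toy vectors, and the temperature value of 0.05 are illustrative assumptions, and embeddings are assumed L2-normalized:

```python
import numpy as np

def info_nce_loss(q, pos, negs, temperature=0.05):
    """InfoNCE-style contrastive loss for one query.

    q: query embedding, pos: positive doc embedding,
    negs: (N, dim) array of negative doc embeddings.
    All vectors are assumed L2-normalized.
    """
    # Similarity of the query to the positive (index 0) and each negative.
    sims = np.concatenate([[q @ pos], negs @ q]) / temperature
    sims -= sims.max()  # subtract max for numerical stability
    # Softmax cross-entropy with the positive as the correct class.
    log_probs = sims - np.log(np.exp(sims).sum())
    return -log_probs[0]

# Toy example: the positive is much closer to the query than the negatives,
# so the loss is near zero.
q = np.array([1.0, 0.0])
pos = np.array([0.9, np.sqrt(1 - 0.81)])
negs = np.array([[0.0, 1.0], [-1.0, 0.0]])
loss = info_nce_loss(q, pos, negs)
```

If the positive and a negative are swapped, the loss grows sharply, which is exactly the gradient signal that pulls positives in and pushes negatives away.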

Training Data Structure

Each training example is a triple: (query, positive document, negative documents). Positives come from click logs, relevance judgments, or QA pairs. Negative selection matters most for quality. Random negatives are easy to distinguish, so the model learns little from them. Hard negatives are documents that look relevant but are not (e.g., a high BM25 score but the wrong answer). Mining hard negatives from the model's own errors during training dramatically improves accuracy (typically 10-20% gains).
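The mining step can be sketched as: score the corpus with the current model and keep the top-ranked documents that are not known positives. The function name and the toy 2-D embeddings below are illustrative assumptions:

```python
import numpy as np

def mine_hard_negatives(query_emb, doc_embs, positive_ids, k=3):
    """Return the k highest-scoring docs that are NOT positives.

    These look relevant to the current model but are labeled irrelevant,
    which makes them the most informative negatives for the next round
    of training.
    """
    scores = doc_embs @ query_emb          # dot-product relevance scores
    ranked = np.argsort(-scores)           # best-scoring docs first
    return [int(i) for i in ranked if i not in positive_ids][:k]

# Toy corpus: doc 1 is the labeled positive; doc 0 scores almost as high,
# so it is the hardest negative.
query_emb = np.array([1.0, 0.0])
doc_embs = np.array([[0.9, 0.1], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
hard = mine_hard_negatives(query_emb, doc_embs, positive_ids={1}, k=2)
```

In practice this loop alternates with training: train, re-mine negatives with the updated encoder, and train again.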

In-Batch Negatives

Efficient negative sampling: treat other positives in the same batch as negatives. With batch size 128, each query gets 127 negatives for free. This works because random documents are unlikely to be relevant. Combine with a few mined hard negatives (1-3 per query) for best results. Larger batches improve training but require more GPU memory.
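With in-batch negatives, the loss becomes a cross-entropy over a B×B similarity matrix in which row i's correct answer is column i. A minimal NumPy sketch, with an assumed temperature of 0.05 and a tiny toy batch:

```python
import numpy as np

def in_batch_negative_loss(Q, D, temperature=0.05):
    """Q, D: (B, dim) embeddings; query i is paired with document i.

    Each query scores all B documents in the batch, so the other B-1
    documents act as free negatives.
    """
    sims = (Q @ D.T) / temperature            # (B, B) similarity matrix
    sims -= sims.max(axis=1, keepdims=True)   # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()         # correct doc on the diagonal

# Toy batch of 3: each query exactly matches its paired document,
# so the loss is near zero.
Q = np.eye(3)
D = np.eye(3)
loss = in_batch_negative_loss(Q, D)
```

Misaligning the pairs (e.g., shifting D by one row) sends the loss up sharply, which is why larger batches, with more competing documents per row, give a stronger training signal at the cost of GPU memory.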

Two-Tower Architecture

Query and document encoders can share weights (siamese) or be separate (asymmetric). Shared weights: simpler, regularizes better, works well for similar-length inputs. Separate weights: query encoder handles short text, document encoder handles long text, each optimized for its task. Document embeddings are computed offline; only query encoding happens at search time (10-50ms latency).
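The offline/online split can be sketched as follows. The `embed` function here is a toy deterministic stand-in for a trained BERT tower (a shared-weight siamese setup would use the same function for both sides, as below; an asymmetric setup would use two different encoders):

```python
import numpy as np

def embed(text, dim=16):
    """Toy deterministic 'encoder' standing in for a trained tower."""
    v = np.zeros(dim)
    for i, ch in enumerate(text.lower()):
        v[(i + ord(ch)) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

# Offline: precompute and store embeddings for the whole corpus.
docs = ["contrastive learning", "hard negative mining", "two tower retrieval"]
doc_matrix = np.stack([embed(d) for d in docs])

# Online: only the query is encoded at search time; scoring is a
# single matrix-vector product against the precomputed index.
def search(query, k=1):
    scores = doc_matrix @ embed(query)
    return [docs[i] for i in np.argsort(-scores)[:k]]
```

Because the expensive document encoding happens offline, query-time cost is one encoder forward pass plus a similarity lookup, which is what keeps search latency in the 10-50ms range.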

💡 Key Takeaways
Contrastive learning pushes positives close, negatives apart; quality depends on negative selection
Hard negatives (high BM25 but wrong) improve accuracy 10-20% over random negatives
In-batch negatives: batch size 128 gives 127 free negatives; combine with 1-3 hard negatives
Two-tower: query and doc encoders can share weights (siamese) or be separate (asymmetric)
Document embeddings precomputed offline; only query encoding at search time (10-50ms)
📌 Interview Tips
1. Explain the hard negatives concept with a BM25 example - shows understanding of what makes training effective
2. Describe the in-batch negatives technique for efficient training
3. Mention two-tower architecture options (siamese vs asymmetric) with trade-offs