Training Dense Retrievers: Contrastive Learning and Hard Negatives
Training Data Structure
Each training example is a (query, positive document, negative documents) triple. Positives come from click logs, relevance judgments, or QA pairs. The choice of negatives matters most for quality. Random negatives: easy to distinguish, so the model learns little. Hard negatives: documents that seem relevant but are not (e.g., a high BM25 score but the wrong answer). Mining hard negatives from the model's own errors during training dramatically improves accuracy (10-20% gains).
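The mining step can be sketched as follows. This is a minimal toy illustration, not a real pipeline: `encode` is a deterministic stand-in for a trained encoder, and `mine_hard_negatives` simply keeps the top-scoring non-positive documents under the current model, which are exactly the lookalikes the model is most likely to confuse with the answer.

```python
import numpy as np

def encode(text: str, dim: int = 8) -> np.ndarray:
    """Toy deterministic encoder standing in for a trained model:
    hashes each word into one of `dim` buckets, then L2-normalizes."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[sum(map(ord, word)) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def mine_hard_negatives(query: str, positive: str, corpus: list[str], k: int = 2) -> list[str]:
    """Score every non-positive document with the current model and
    return the k highest-scoring ones as hard negatives."""
    q = encode(query)
    candidates = [doc for doc in corpus if doc != positive]
    candidates.sort(key=lambda doc: float(q @ encode(doc)), reverse=True)
    return candidates[:k]
```

In a real system the scoring pass runs over a large corpus (often via an ANN index built from the previous checkpoint), and mining is repeated every few epochs so the negatives track the model's current errors.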
In-Batch Negatives
Efficient negative sampling: treat other positives in the same batch as negatives. With batch size 128, each query gets 127 negatives for free. This works because random documents are unlikely to be relevant. Combine with a few mined hard negatives (1-3 per query) for best results. Larger batches improve training but require more GPU memory.
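The in-batch scheme reduces to a softmax cross-entropy over the batch similarity matrix, where each query's matched document (the diagonal) is the target class and every other row is a free negative. A minimal sketch, assuming query and document embeddings are already computed and L2-normalized:

```python
import numpy as np

def in_batch_loss(q_emb: np.ndarray, d_emb: np.ndarray, temperature: float = 0.05) -> float:
    """InfoNCE with in-batch negatives.
    q_emb, d_emb: (B, dim) arrays; row i of d_emb is the positive for
    query i, and the other B-1 rows act as its negatives."""
    sims = q_emb @ d_emb.T / temperature           # (B, B) similarity matrix
    sims -= sims.max(axis=1, keepdims=True)        # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (matched pair) as the target class.
    return float(-np.mean(np.diag(log_probs)))
```

Appending a few mined hard negatives just means extending `d_emb` with extra rows that are negatives for every query; the diagonal targets stay the same.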
Two-Tower Architecture
Query and document encoders can share weights (siamese) or be separate (asymmetric). Shared weights: simpler, regularizes better, works well for similar-length inputs. Separate weights: query encoder handles short text, document encoder handles long text, each optimized for its task. Document embeddings are computed offline; only query encoding happens at search time (10-50ms latency).
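The offline/online split above can be sketched end to end. All names here are hypothetical, and the two towers wrap the same toy featurizer purely to keep the example runnable; in a real asymmetric setup they would be two separately trained networks.

```python
import numpy as np

DIM = 16

def _featurize(text: str) -> np.ndarray:
    """Deterministic toy featurizer standing in for a trained encoder."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        vec[sum(map(ord, word)) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Separate (asymmetric) towers: two entry points, one per input type.
def doc_encode(text: str) -> np.ndarray:
    return _featurize(text)

def query_encode(text: str) -> np.ndarray:
    return _featurize(text)

def build_index(docs: list[str]) -> np.ndarray:
    """Offline step: embed every document once into a (N, DIM) matrix."""
    return np.stack([doc_encode(d) for d in docs])

def search(query: str, index: np.ndarray, k: int = 2):
    """Online step: encode only the query, then score all documents
    with a single matrix-vector product and return the top-k ids."""
    scores = index @ query_encode(query)
    top = np.argsort(-scores)[:k]
    return list(top), scores[top]
```

Because `build_index` runs once ahead of time, the per-query cost at search time is one encoder forward pass plus the similarity scan, which is what keeps latency in the tens of milliseconds (typically with an ANN index replacing the brute-force scan).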