Online vs Offline Hard Negative Mining Architecture
Production systems combine online and offline mining to balance compute efficiency with negative quality. Online mining selects negatives within the current mini-batch during the training step. With data-parallel training across 8 GPUs and a global batch size of 2048 pairs, each anchor can treat all 2047 other examples in the batch as potential negatives via cross-replica gathering. The pairwise similarity matrix is computed once in O(batch_size²) time, then masked to select valid negatives. Batch-hard mining picks the closest negative per anchor, while semi-hard mining restricts the choice to negatives that are farther from the anchor than its positive, avoiding extremely hard (often noisy) cases. This approach is compute-friendly because it reuses already-computed embeddings without extra forward passes.
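A minimal NumPy sketch of in-batch mining under these definitions (embeddings assumed L2-normalized so cosine distance is 1 − dot product; the function name and fallback behavior are illustrative, not from any specific library):

```python
import numpy as np

def mine_in_batch(embeddings, labels, mode="hard"):
    """Select one negative index per anchor from the current batch.

    embeddings: (B, D) L2-normalized vectors; labels: (B,) ints.
    mode="hard" picks the closest negative; mode="semi-hard" picks the
    closest negative still farther than the anchor's nearest positive,
    falling back to batch-hard when no such negative exists.
    """
    B = embeddings.shape[0]
    # One O(B^2) similarity matrix, reused for every anchor.
    sim = embeddings @ embeddings.T
    dist = 1.0 - sim                       # cosine distance
    same = labels[:, None] == labels[None, :]

    neg_idx = np.empty(B, dtype=int)
    for a in range(B):
        neg_mask = ~same[a]
        pos_mask = same[a].copy()
        pos_mask[a] = False                # exclude the anchor itself
        d_pos = dist[a][pos_mask].min() if pos_mask.any() else 0.0

        cand = neg_mask
        if mode == "semi-hard":
            semi = neg_mask & (dist[a] > d_pos)
            if semi.any():                 # fall back to batch-hard otherwise
                cand = semi
        neg_idx[a] = int(np.argmin(np.where(cand, dist[a], np.inf)))
    return neg_idx
```

In a real data-parallel setup, `embeddings` would be the cross-replica-gathered global batch rather than a single GPU's shard.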
Offline mining uses a prior model snapshot or retrieval system to precompute a pool of hard-negative candidates. For a catalog of 100 million items, an offline job encodes the corpus with the previous checkpoint, builds or refreshes an Approximate Nearest Neighbor (ANN) index, then retrieves the top-200 nearest items with different labels for each anchor. These candidates are stored with a time-to-live (TTL) of 24 to 72 hours. During training, each batch samples 2 to 4 mined negatives per anchor from this pool. Offline mining finds better negatives across the full dataset, not just within a batch, capturing confusable items that rarely co-occur in the same mini-batch.
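The offline step can be sketched as follows. This uses a brute-force similarity scan as a stand-in for the ANN query (a production job would hit an HNSW or similar index built from the previous checkpoint's embeddings); the function names and the sampling helper are illustrative:

```python
import numpy as np

def mine_offline_candidates(corpus_emb, corpus_labels,
                            anchor_emb, anchor_labels, top_k=200):
    """Precompute a hard-negative pool per anchor: nearest corpus items
    with a *different* label, by descending cosine similarity."""
    sims = anchor_emb @ corpus_emb.T              # (A, N), unit vectors assumed
    pools = []
    for a in range(anchor_emb.shape[0]):
        # Mask out same-label items before ranking.
        scores = np.where(corpus_labels != anchor_labels[a], sims[a], -np.inf)
        order = np.argsort(-scores)
        order = order[np.isfinite(scores[order])]  # drop masked items entirely
        pools.append(order[:top_k])
    return pools

# At train time, draw 2-4 mined negatives per anchor from the stored pool.
def sample_from_pool(pool, n=2, rng=None):
    rng = rng or np.random.default_rng()
    return rng.choice(pool, size=min(n, len(pool)), replace=False)
```

The pools would then be written to storage with a TTL and refreshed when the job re-runs on a newer checkpoint.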
The trade-off is infrastructure complexity versus negative quality. Online mining requires no extra storage or indexing but is limited to batch diversity: if batches are small (under 256) or poorly shuffled, anchors see few meaningful negatives. Offline mining requires periodic re-encoding, index rebuilding, and storage for candidate pools, adding latency and operational overhead. However, it surfaces negatives from tail classes or rare confusions that online mining misses. Staleness is a key failure mode: as the model drifts during training, yesterday's hard negatives become less relevant. Setting a TTL of 24 to 72 hours and refreshing pools regularly mitigates this.
Many systems layer both strategies. Spotify-style dual encoders use in-batch negatives for base coverage plus offline-mined candidates from skip logs for specific confusions. Pinterest product search combines random negatives (30%), in-batch negatives (50%), and offline-mined candidates (20%) in a curriculum that increases hardness over training. Memory queues like MoCo extend online mining by maintaining 32,000 to 128,000 embeddings from recent batches, giving each anchor access to 60,000+ negatives without the cost of offline indexing.
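The memory-queue idea reduces to a fixed-size FIFO ring buffer of recent-batch embeddings. A minimal sketch (queue size shrunk for illustration; MoCo-scale systems would use 32k-128k slots, and the class name is hypothetical):

```python
import numpy as np

class NegativeQueue:
    """FIFO ring buffer of embeddings from recent batches.

    Each new batch is enqueued after its training step; the oldest
    embeddings are overwritten once the buffer is full, so every anchor
    sees up to `queue_size` extra negatives beyond its own batch.
    """
    def __init__(self, dim, queue_size=8):
        self.buf = np.zeros((queue_size, dim))
        self.size = queue_size
        self.ptr = 0        # next slot to overwrite
        self.filled = 0     # how many slots hold real embeddings

    def enqueue(self, batch_emb):
        for e in batch_emb:
            self.buf[self.ptr] = e
            self.ptr = (self.ptr + 1) % self.size
            self.filled = min(self.filled + 1, self.size)

    def negatives(self):
        """All currently stored embeddings, usable as extra negatives."""
        return self.buf[: self.filled].copy()
```

In MoCo proper the enqueued embeddings come from a slowly updated momentum encoder so that queue entries stay roughly consistent with the current model; that detail is omitted here.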
💡 Key Takeaways
•Online mining with a global batch of 2048 provides 2047 in-batch negatives per anchor with no extra forward passes, but is limited by batch diversity and shuffling quality
•Offline mining retrieves top-200 candidates from a 100-million-item corpus using the previous checkpoint, finding rare confusions that never co-occur in the same batch
•A staleness window of 24 to 72 hours balances freshness against compute cost. Longer windows risk training against outdated model errors as parameters drift
•Memory queues like MoCo maintain 32,000 to 128,000 embeddings from recent batches, extending effective negatives to 60,000+ without offline indexing overhead
•Production systems mix sources: Pinterest uses 30% random, 50% in-batch, and 20% offline-mined negatives in a curriculum that increases the offline proportion over training
•Infrastructure trade-off: online mining adds no storage or latency but misses tail cases; offline mining requires daily re-encoding jobs and ANN index refreshes costing hours of compute
📌 Examples
Spotify dual-encoder training: 8 GPUs, batch 256 per GPU, global batch 2048. A cross-replica gather gives 2047 in-batch negatives; additionally, 2 offline-mined negatives from skip logs are sampled per anchor, for a total of 2049 negatives per step.
Pinterest product search offline miner: a daily job encodes 100 million products using yesterday's checkpoint, builds an HNSW index with 32 connections per layer, retrieves the top-200 visually similar products with different category labels per anchor, and stores the candidates in a distributed cache with a 48-hour TTL.
Face recognition system: online semi-hard mining within a batch of 1024 face images. No offline mining, because the full dataset is 500 million images, making daily re-encoding prohibitively expensive. Instead, stratified sampling ensures each batch has good intra-batch diversity.
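The Pinterest-style curriculum can be made concrete as a schedule over negative-source weights. The 30/50/20 starting ratios come from the example above; the linear schedule shifting weight from random to offline-mined negatives is an illustrative assumption, not a documented Pinterest recipe:

```python
def negative_mix(step, total_steps):
    """Hypothetical curriculum over negative sources.

    Starts at 30% random / 50% in-batch / 20% offline-mined and
    linearly moves the random share onto offline-mined negatives as
    training progresses, increasing overall hardness.
    """
    t = step / total_steps
    random_w = 0.30 * (1.0 - t)           # decays to 0
    offline_w = 0.20 + 0.30 * t           # grows to 0.5
    in_batch_w = 1.0 - random_w - offline_w
    return {"random": random_w, "in_batch": in_batch_w, "offline": offline_w}
```

At each step these weights would determine how many of an anchor's negatives are drawn from each source (e.g. via multinomial sampling over the three pools).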