
Training Two-Tower Models with In-Batch Negatives

Two-tower models are trained with a sampled softmax or contrastive loss, where each positive user–item pair needs many negative items for comparison. The most scalable approach is in-batch negative sampling: within a mini-batch of, say, 512 positive pairs, every other item in the batch serves as a negative for each user. This yields 511 negatives per positive without any additional data fetching, making training feasible on catalogs of millions or billions of items. The loss is typically a softmax over the batch: for user u with positive item i⁺, the model computes exp(u · i⁺) / Σⱼ exp(u · iⱼ), where j ranges over all items in the batch. The model learns to pull the user embedding closer to the positive item and push it away from the batch negatives. Training requires careful shuffling so that users in the same batch have diverse positives; otherwise the negatives become trivial and the model does not learn meaningful distinctions.

A critical side effect is popularity bias. In-batch negatives naturally oversample popular items because those items appear more often in the training data. The model ends up learning scores that correlate with Pointwise Mutual Information (PMI) rather than raw click probability: it predicts how much more likely this user is to click an item than a random user, not the absolute click probability. Without correction, serving will heavily favor head items and miss long-tail content. Google and Meta counteract this with log-frequency corrections, such as subtracting log(item_frequency) from scores or adding learned bias terms.

Production systems also use hard negative mining: negatives are deliberately sampled from the same category or event as the positive, while taking care to avoid false negatives where a "negative" item is actually relevant. eBay samples hard negatives within the same product category; YouTube samples from videos the user was exposed to but did not click. This makes the contrastive task harder and improves offline recall metrics by 2 to 5%.
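To make the loss concrete, here is a minimal PyTorch sketch of in-batch sampled softmax with a log-frequency (logQ) correction. The function name, the temperature parameter, and the assumption that each tower emits one embedding row per positive pair are illustrative, not taken from any specific production system.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb, item_log_freq, temperature=0.05):
    """In-batch sampled softmax with a logQ (log-frequency) correction.

    user_emb:      [B, D] user-tower embeddings for the batch's positive pairs
    item_emb:      [B, D] item-tower embeddings; row k is user k's positive item
    item_log_freq: [B]    log of each item's sampling frequency in the training data
    """
    # Score every user against every item in the batch: a [B, B] matrix.
    # Diagonal entries are positives; off-diagonal entries act as negatives.
    logits = user_emb @ item_emb.T / temperature

    # logQ correction: popular items show up as in-batch negatives more often,
    # so subtract their log frequency to debias the softmax (one common form
    # of the log-frequency correction described above).
    logits = logits - item_log_freq.unsqueeze(0)

    # Each user's positive item sits on the diagonal of the logit matrix.
    labels = torch.arange(user_emb.size(0), device=user_emb.device)
    return F.cross_entropy(logits, labels)
```

With L2-normalized tower outputs, the dot products are cosine similarities, and the temperature controls how sharply the softmax concentrates on the hardest in-batch negatives.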
💡 Key Takeaways
In-batch negative sampling treats all other items in a mini-batch as negatives for each positive pair, giving 511 negatives per positive in a batch of 512 without extra data fetching
Loss is typically sampled softmax: exp(user · positive_item) divided by the sum of exp(user · item_j) over all items j in the batch, pushing the user embedding toward the positive and away from the negatives
Popularity bias emerges because popular items appear more often as negatives, causing scores to align with Pointwise Mutual Information (PMI) instead of raw click probability
Google and Meta correct bias by subtracting log(item_frequency) from scores at serving or learning additive bias terms during training to restore calibration and improve long tail exposure
Hard negative mining samples negatives from the same category or event as the positive while filtering out false negatives, improving offline recall by 2 to 5% at the cost of more complex data pipelines (see the sketch after this list)
Batch construction must shuffle users to ensure diverse positives per batch; otherwise negatives are trivial and the model memorizes spurious patterns
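As referenced in the hard-negative bullet above, the sketch below shows one way category-constrained hard negatives could be drawn while filtering out items the user actually interacted with. The catalog format, function names, and the user_clicked filter are hypothetical; this is not eBay's or YouTube's actual pipeline.

```python
import random
from collections import defaultdict

def build_category_index(catalog):
    """catalog: iterable of (item_id, category) pairs -> {category: [item_ids]}."""
    index = defaultdict(list)
    for item_id, category in catalog:
        index[category].append(item_id)
    return index

def sample_hard_negatives(positive_id, category, category_index,
                          user_clicked, num_negatives=4):
    """Draw negatives from the positive item's own category, skipping items the
    user interacted with so relevant items are not mislabeled as negatives."""
    candidates = [item for item in category_index[category]
                  if item != positive_id and item not in user_clicked]
    return random.sample(candidates, min(num_negatives, len(candidates)))
```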
📌 Examples
YouTube trains with batch size 1024, using 1023 in-batch negatives per positive video click; applies logQ correction at serving by subtracting log(video_impressions) to counter popularity bias and improve coverage of niche content
eBay samples hard negatives from the same product category as the clicked listing during training, and uses constrained negatives based on user search query context to reduce false negatives and improve recall@100 by 4%