
Training Two-Tower Models

Core Concept
Two-tower training teaches the model to place user and item vectors close together when there is a positive interaction (click, purchase) and far apart otherwise. The challenge: you have billions of negative pairs but only millions of positive pairs. How you sample negatives determines model quality.

The Training Signal

For each positive pair (user U clicked item I), you need negatives: items U did not click. With 10 million items, you have roughly 10 million potential negatives per positive. You cannot use all of them; in practice, training samples 100-1,000 negatives per positive.

The loss function says: score(U, I_positive) should be higher than score(U, I_negative) for all negatives. Softmax cross-entropy is common: compute softmax over the positive score and all negative scores, then minimize negative log-likelihood of the positive. This pushes positive scores up and negative scores down simultaneously.
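For concreteness, here is a minimal sketch of that loss, assuming PyTorch and explicitly sampled negatives. The function name, tensor shapes, and cosine-style scoring are illustrative assumptions rather than a reference implementation; the temperature default of 0.1 echoes the tuning advice later in this piece.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(user_emb, pos_item_emb, neg_item_embs, temperature=0.1):
    """Softmax cross-entropy over one positive and K sampled negatives per user.

    user_emb:      [B, D] user-tower outputs
    pos_item_emb:  [B, D] item-tower outputs for the clicked items
    neg_item_embs: [B, K, D] item-tower outputs for K sampled negatives per user
    """
    u = F.normalize(user_emb, dim=-1)
    p = F.normalize(pos_item_emb, dim=-1)
    n = F.normalize(neg_item_embs, dim=-1)

    pos_scores = (u * p).sum(-1, keepdim=True)        # [B, 1]
    neg_scores = torch.einsum('bd,bkd->bk', u, n)     # [B, K]

    # Softmax over [positive, negatives]; the positive is always class 0.
    logits = torch.cat([pos_scores, neg_scores], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```

Minimizing this cross-entropy raises the positive score relative to every sampled negative in a single step, which is exactly the "push positives up, negatives down simultaneously" behavior described above.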

In-Batch Negatives

The simplest approach: within a batch of 512 user-item pairs, use the 511 other items as negatives for each user. The item tower has already produced those embeddings for their own positive pairs, so the extra negatives cost almost nothing. For user U with positive item I, the 511 items from other users become negatives.
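A sketch of that trick, again assuming PyTorch: score every user in the batch against every item in the batch and treat the diagonal as the positives. Names and the temperature default are illustrative.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb, temperature=0.1):
    """In-batch negatives: row i of each tensor is the positive pair (user_i, item_i),
    so for user i the other B-1 items in the batch serve as negatives."""
    u = F.normalize(user_emb, dim=-1)
    v = F.normalize(item_emb, dim=-1)

    # [B, B] score matrix: entry (i, j) scores user i against item j.
    logits = u @ v.t() / temperature

    # The positive for user i sits on the diagonal, i.e. column i.
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```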

The problem: batch negatives are random samples from the interaction distribution. They skew toward popular items and may be too easy. If the model just learns "user U does not want item I because I is a completely different category", it learns nothing useful. You need harder negatives that force the model to make fine distinctions.

Hard Negative Mining

Hard negatives are items similar to the positive that the user did not interact with. If user U clicked a blue Nike running shoe, a hard negative is a blue Adidas running shoe they saw but did not click. The model must learn why U preferred Nike over Adidas, not just "U likes shoes over laptops".

To find hard negatives: after initial training, run the model to find items with high scores that lack positive interactions. These are items the model thinks the user would like but they did not engage with. Mine these as negatives and retrain. This iterative process produces increasingly discriminative models. Two to three rounds of hard negative mining typically improve retrieval recall by 5-15%.
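A simplified mining pass might look like the sketch below. It assumes the catalog is small enough to score with a dense matrix multiply; at production scale this step typically runs through an ANN index instead. Every name here (the function, its arguments, the `positives` dict keyed by user index) is an illustrative assumption.

```python
import torch

@torch.no_grad()
def mine_hard_negatives(user_emb, item_emb, positives, top_k=100, num_hard=10):
    """Mine hard negatives: items the current model scores highly for a user
    but that the user never positively interacted with.

    user_emb:  [U, D] embeddings from the partially trained user tower
    item_emb:  [N, D] embeddings from the partially trained item tower
    positives: dict mapping user index -> set of item indices with positive interactions
    Returns a dict mapping user index -> list of mined hard-negative item indices.
    """
    scores = user_emb @ item_emb.t()                  # [U, N] similarity scores
    top_items = scores.topk(top_k, dim=1).indices     # top-scoring candidates per user

    hard_negatives = {}
    for u, candidates in enumerate(top_items.tolist()):
        seen = positives.get(u, set())
        # Keep the highest-scoring candidates the user did not engage with.
        hard_negatives[u] = [i for i in candidates if i not in seen][:num_hard]
    return hard_negatives
```

The mined items are added to the negative pool for the next training round, and the mine-retrain loop repeats for the two to three rounds mentioned above.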

⚠️ Interview Question: "How do you handle negative sampling?" Start by explaining in-batch negatives (free, efficient), then discuss why hard negatives improve quality (they force fine-grained distinctions). Mention temperature tuning: start at 0.1, tune based on validation recall. If asked about scale, note that 2-3 rounds of hard negative mining typically improve recall by 5-15%.
💡 Key Takeaways
Training goal: high similarity for clicks (positive pairs), low similarity for non-clicks (negative pairs). Model adjusts tower weights to push positives together, negatives apart
Problem: with 100M items, you cannot compare each training example against the full catalog. Computing 100M similarities per training step is prohibitively slow
In-batch negatives: batch of 512 pairs → each user treats 511 other items as negatives. Embeddings already computed, so negatives are free. 100-500x more efficient
Why it works: for each user, model computes similarity to positive item and 511 batch items, then adjusts weights so positive scores highest
Popularity trap: popular items appear as negatives more often, so the model learns to score them low. An item that shows up 1,000 times as a negative but only 100 times as a positive ends up systematically undervalued
Bias fix: track the negative/positive ratio per item and add a correction proportional to log(ratio), as sketched after this list. Improves popular-item recommendations by 10-20%
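The bias fix in the last takeaway is closely related to the standard logQ correction for in-batch sampled softmax. The sketch below shows that named variant rather than the exact ratio-based correction described above, and it assumes you can estimate each batch item's sampling probability from interaction counts.

```python
import torch
import torch.nn.functional as F

def in_batch_loss_with_logq(user_emb, item_emb, item_log_prob, temperature=0.1):
    """In-batch softmax with a logQ-style popularity correction.

    item_log_prob: [B] log of each batch item's estimated sampling probability,
    e.g. log(item_interaction_count / total_interactions). Popular items have a
    larger (less negative) log-probability, so their corrected logits are lowered
    relative to rare items and they stop being over-penalized for appearing as
    in-batch negatives so often.
    """
    u = F.normalize(user_emb, dim=-1)
    v = F.normalize(item_emb, dim=-1)

    logits = u @ v.t() / temperature              # [B, B] raw scores
    logits = logits - item_log_prob.unsqueeze(0)  # correct each column by its item's logQ

    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```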
📌 Interview Tips
1. For latency-focused questions: explain the recall-latency tradeoff in ANN - higher recall requires checking more candidates, increasing latency from 2ms to 15ms+ depending on configuration.
2. When asked about sharding: mention that billion-scale indexes are typically sharded (10-100M vectors per shard), with each shard replicated 3-10x for fault tolerance and load distribution.
3. For capacity planning: give concrete numbers - 64-dim float32 embeddings use 256 bytes per item; 100M items = 25GB raw, plus 2-3x for index structures (see the quick calculation after this list).
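For tip 3, the arithmetic behind those numbers as a quick illustrative script (the 2-3x index overhead is the tip's own assumption):

```python
# Back-of-the-envelope memory estimate for a 100M-item embedding index.
DIM = 64                     # embedding dimension
BYTES_PER_FLOAT32 = 4
NUM_ITEMS = 100_000_000      # 100M items

bytes_per_item = DIM * BYTES_PER_FLOAT32     # 256 bytes per item
raw_gb = NUM_ITEMS * bytes_per_item / 1e9    # ~25.6 GB of raw vectors
index_gb = (2 * raw_gb, 3 * raw_gb)          # assumed 2-3x overhead for index structures

print(f"{bytes_per_item} B/item, {raw_gb:.1f} GB raw, "
      f"{index_gb[0]:.0f}-{index_gb[1]:.0f} GB with index structures")
```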