Recommendation Systems • Two-Tower Models (User/Item Embeddings)
What Are Two Tower Models and Why Use Them?
Two-tower models, also called dual-encoder models, are a neural architecture that learns two separate encoders: one for users and one for items. Each encoder transforms heterogeneous input features (user IDs, browsing history, device type, item metadata, text descriptions) into fixed-size vectors called embeddings. Both towers output vectors in the same latent space, typically 64 to 256 dimensions, where similarity is measured by dot product or cosine similarity.
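As a rough sketch of that architecture, here is what the two towers might look like in PyTorch. The layer sizes, input feature dimensions, and 128-dimensional output are illustrative assumptions, not a reference implementation from any of the systems discussed here.

```python
# Minimal two-tower sketch: two independent encoders mapping different
# feature spaces into one shared embedding space. All dimensions are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Encodes a feature vector into a fixed-size embedding."""
    def __init__(self, input_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # L2-normalize so the dot product equals cosine similarity
        return F.normalize(self.net(features), dim=-1)

user_tower = Tower(input_dim=64)   # e.g. user ID embedding + history + context
item_tower = Tower(input_dim=32)   # e.g. item metadata + text features

users = torch.randn(8, 64)   # batch of 8 users' concatenated features
items = torch.randn(8, 32)   # batch of 8 items' concatenated features

# The towers never interact until this final similarity computation
scores = user_tower(users) @ item_tower(items).T   # shape (8, 8)
```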
The key insight is architectural separation. Because the two towers are independent until the final similarity computation, you can precompute all item embeddings offline once and store them in a vector index. At serving time, you need only one forward pass through the user tower to produce the user embedding, followed by a fast nearest-neighbor search against millions or billions of precomputed item vectors. YouTube retrieves 100 to 1000 candidates in under 50ms at p99 from catalogs of hundreds of millions of videos using this pattern.
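To make the offline/online split concrete, here is a minimal sketch using FAISS as the vector index. The catalog size, embedding dimension, random stand-in embeddings, and candidate count are all illustrative assumptions, not values from any production system.

```python
# Offline/online split for two-tower serving, sketched with FAISS.
import numpy as np
import faiss

d, num_items = 128, 100_000  # embedding dim and catalog size (illustrative)

# Offline: run every item through the item tower once and index the results.
# Random vectors stand in for item-tower outputs here.
item_embeddings = np.random.rand(num_items, d).astype("float32")
faiss.normalize_L2(item_embeddings)   # so inner product = cosine similarity
index = faiss.IndexFlatIP(d)          # exact search; swap in IVF/HNSW at larger scale
index.add(item_embeddings)

# Online: one forward pass through the user tower, then one index query.
user_embedding = np.random.rand(1, d).astype("float32")  # stand-in for user-tower output
faiss.normalize_L2(user_embedding)
scores, candidate_ids = index.search(user_embedding, 500)  # top-500 candidates
```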
This is fundamentally different from cross-encoder models, which mix user and item features early via attention. Cross-encoders are more expressive but require computing a separate score for every user-item pair, making them too slow for initial retrieval. Two-tower models sacrifice some expressiveness for massive speed gains: sub-10ms retrieval over 100 million items on a single machine. Google, Meta, Spotify, and Netflix all use two-tower architectures as their first-stage candidate generator, feeding 200 to 1000 candidates into a slower but more accurate ranking model.
Think of two-tower models as a generalization of classic matrix factorization. Matrix factorization uses only user_id and item_id; two-tower models can incorporate dozens of features (recent click sequences, time of day, user location, item categories, textual descriptions) while keeping the same fast serving pattern, as the sketch below shows.
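Here is a hypothetical sketch of that relationship: matrix factorization falls out as the degenerate two-tower case where each tower collapses to a pure ID-embedding lookup. Vocabulary sizes and the 64-dimensional embedding are made up for illustration.

```python
# Matrix factorization as the degenerate two-tower case: each "tower"
# is just an embedding table over IDs, with no other features or layers.
import torch
import torch.nn as nn

num_users, num_items, embed_dim = 10_000, 50_000, 64  # illustrative sizes

user_tower = nn.Embedding(num_users, embed_dim)  # ID-only user tower
item_tower = nn.Embedding(num_items, embed_dim)  # ID-only item tower

user_ids = torch.tensor([3, 17])
item_ids = torch.tensor([42, 999])

# Same serving pattern as the full model: encode each side, take a dot product
scores = (user_tower(user_ids) * item_tower(item_ids)).sum(dim=-1)
```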
💡 Key Takeaways
• Two independent neural networks encode users and items into the same embedding space, typically 64 to 256 dimensions, where relevance is computed as the dot product or cosine similarity between the vectors
• Item embeddings are precomputed offline and indexed, so online serving only computes one user embedding and performs a fast nearest-neighbor search instead of scoring every item
• YouTube retrieves 100 to 1000 candidates in under 50ms at p99 from hundreds of millions of videos; Meta uses FAISS on billions of items with 5 to 15ms p95 latency per shard on GPU
• The architecture trades expressiveness for speed: no cross-attention between user history and candidate items until the final similarity, but it enables sub-10ms retrieval over 100 million items on a single machine
• Used for first-stage candidate generation at Google, Meta, Netflix, and Spotify, feeding 200 to 1000 candidates into a slower but more accurate cross-feature ranking model
• Generalizes matrix factorization by accepting rich features like click sequences, time context, categories, and text instead of only user_id and item_id
📌 Examples
Spotify encodes user listening history and track metadata into 128-dimensional embeddings, performing 1 to 5ms nearest-neighbor search per shard across 100+ million tracks using Annoy indexes with nightly embedding refreshes
Google Search retrieval computes the user tower online from query text and context, then searches precomputed document embeddings using ScaNN at 2 to 10ms per query with 95% recall on 10 million vectors per shard