Recommendation Systems • Two-Tower Models (User/Item Embeddings)
What Are Two Tower Models and Why Use Them?
Two-tower models, also called dual-encoder models, are a neural architecture that learns two separate encoders: one for users and one for items. Each encoder transforms heterogeneous input features (user IDs, browsing history, device type, item metadata, text descriptions) into fixed-size vectors called embeddings. Both towers output vectors in the same latent space, typically 64 to 256 dimensions, where similarity is measured by dot product or cosine similarity.
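As a rough sketch of that architecture, here is what the two towers might look like in PyTorch. The layer sizes, input feature dimensions, and 128-dimensional output are illustrative assumptions, not a reference implementation from any of the systems discussed here.

```python
# Minimal two-tower sketch: two independent encoders mapping different
# feature spaces into one shared embedding space. All dimensions are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Encodes a feature vector into a fixed-size embedding."""
    def __init__(self, input_dim: int, embed_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # L2-normalize so the dot product equals cosine similarity
        return F.normalize(self.net(features), dim=-1)

user_tower = Tower(input_dim=64)   # e.g. user ID embedding + history + context
item_tower = Tower(input_dim=32)   # e.g. item metadata + text features

users = torch.randn(8, 64)   # batch of 8 users' concatenated features
items = torch.randn(8, 32)   # batch of 8 items' concatenated features

# The towers never interact until this final similarity computation
scores = user_tower(users) @ item_tower(items).T   # shape (8, 8)
```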
The key insight is architectural separation. Because the two towers are independent until the final similarity computation, you can precompute all item embeddings offline once and store them in a vector index. At serving time, you need only one forward pass through the user tower to produce the user embedding, followed by a fast nearest-neighbor search against millions or billions of precomputed item vectors. YouTube retrieves 100 to 1000 candidates in under 50ms at p99 from catalogs of hundreds of millions of videos using this pattern.
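To make the offline/online split concrete, here is a minimal sketch using FAISS as the vector index. The catalog size, embedding dimension, random stand-in embeddings, and candidate count are all illustrative assumptions, not values from any production system.

```python
# Offline/online split for two-tower serving, sketched with FAISS.
import numpy as np
import faiss

d, num_items = 128, 100_000  # embedding dim and catalog size (illustrative)

# Offline: run every item through the item tower once and index the results.
# Random vectors stand in for item-tower outputs here.
item_embeddings = np.random.rand(num_items, d).astype("float32")
faiss.normalize_L2(item_embeddings)   # so inner product = cosine similarity
index = faiss.IndexFlatIP(d)          # exact search; swap in IVF/HNSW at larger scale
index.add(item_embeddings)

# Online: one forward pass through the user tower, then one index query.
user_embedding = np.random.rand(1, d).astype("float32")  # stand-in for user-tower output
faiss.normalize_L2(user_embedding)
scores, candidate_ids = index.search(user_embedding, 500)  # top-500 candidates
```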
This is fundamentally different from cross-encoder models, which mix user and item features early via attention. Cross-encoders are more expressive but require computing a separate score for every user-item pair, making them too slow for initial retrieval. Two-tower models sacrifice some expressiveness for massive speed gains: sub-10ms retrieval over 100 million items on a single machine. Google, Meta, Spotify, and Netflix all use two-tower architectures as their first-stage candidate generator, feeding 200 to 1000 candidates into a slower but more accurate ranking model.
Think of two-tower models as a generalization of classic matrix factorization. Matrix factorization uses only user_id and item_id; two-tower models can incorporate dozens of features (recent click sequences, time of day, user location, item categories, textual descriptions) while keeping the same fast serving pattern, as the sketch below shows.
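Here is a hypothetical sketch of that relationship: matrix factorization falls out as the degenerate two-tower case where each tower collapses to a pure ID-embedding lookup. Vocabulary sizes and the 64-dimensional embedding are made up for illustration.

```python
# Matrix factorization as the degenerate two-tower case: each "tower"
# is just an embedding table over IDs, with no other features or layers.
import torch
import torch.nn as nn

num_users, num_items, embed_dim = 10_000, 50_000, 64  # illustrative sizes

user_tower = nn.Embedding(num_users, embed_dim)  # ID-only user tower
item_tower = nn.Embedding(num_items, embed_dim)  # ID-only item tower

user_ids = torch.tensor([3, 17])
item_ids = torch.tensor([42, 999])

# Same serving pattern as the full model: encode each side, take a dot product
scores = (user_tower(user_ids) * item_tower(item_ids)).sum(dim=-1)
```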
💡 Key Takeaways
• Two independent neural networks encode users and items into the same embedding space, typically 64 to 256 dimensions, where relevance is computed as the dot product or cosine similarity between the vectors
• Item embeddings are precomputed offline and indexed, so online serving only computes one user embedding and performs a fast nearest-neighbor search instead of scoring every item
• YouTube retrieves 100 to 1000 candidates in under 50ms at p99 from hundreds of millions of videos; Meta uses FAISS on billions of items with 5 to 15ms p95 latency per shard on GPU
• The architecture trades expressiveness for speed: no cross-attention between user history and candidate items until the final similarity, but it enables sub-10ms retrieval over 100 million items on a single machine
• Used for first-stage candidate generation at Google, Meta, Netflix, and Spotify, feeding 200 to 1000 candidates into a slower but more accurate cross-feature ranking model
• Generalizes matrix factorization by accepting rich features like click sequences, time context, categories, and text instead of only user_id and item_id
📌 Examples
Spotify encodes user listening history and track metadata into 128-dimensional embeddings, performing 1 to 5ms nearest-neighbor search per shard across 100+ million tracks using Annoy indexes with nightly embedding refreshes
Google Search retrieval computes the user tower online from query text and context, then searches precomputed document embeddings using ScaNN at 2 to 10ms per query with 95% recall on 10 million vectors per shard