
Key Trade-Offs and When to Choose Two-Tower Models

Two-tower models trade expressiveness for speed and scalability. The late-interaction design means user features and item features never cross until the final dot product: there is no attention mechanism that lets the model ask "which items in the user's history are most similar to this candidate item?" Cross-encoder models with early feature interaction can achieve 2 to 5% higher ranking accuracy but require a forward pass per candidate, making them 100 to 1,000× slower. For initial retrieval over millions of items, two-tower is the only practical choice; cross-encoders are reserved for reranking the top 200 to 1,000 candidates.

Embedding dimension directly impacts memory, cache efficiency, and throughput. Increasing from 128 to 256 dimensions doubles the memory footprint and memory bandwidth: for 200 million items, going from 128 to 256 dims increases storage from roughly 26 GB to 52 GB per shard after quantization. Quality gains flatten beyond 128 to 256 dims in most domains; Spotify and YouTube both use 128- to 192-dimensional embeddings. The recommended approach is to start at 64 or 128 dims and increase only if offline recall@K metrics improve meaningfully.

Freshness versus stability is a constant tension. Recomputing item embeddings every hour captures trending content and new items faster, but increases infrastructure cost and can cause score drift that confuses ranking models downstream. Most teams settle on daily item embedding refreshes, with separate real-time signals like view counts or recency fed directly to the ranker. User embeddings are cheap to compute per request, so incorporating the last few interactions in real time (the last 10 to 50 clicks) is common and improves click-through rate (CTR) by 3 to 7% compared to daily aggregated user profiles.

Two-tower models are the right choice when your catalog exceeds 1 million items, you need strict latency budgets under 50 ms at p99 for retrieval, and you can precompute item embeddings offline.
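The storage figures above fall out of simple arithmetic: item count × dimensions × bytes per value. A quick sketch, assuming int8 quantization at one byte per dimension (the function name and the single-shard framing are illustrative):

```python
# Approximate storage for a quantized embedding index.
# Assumes int8 quantization: 1 byte per dimension per item.
def index_size_gb(num_items: int, dims: int, bytes_per_dim: int = 1) -> float:
    return num_items * dims * bytes_per_dim / 1e9

print(index_size_gb(200_000_000, 128))  # ~25.6 GB, i.e. the "roughly 26 GB" figure
print(index_size_gb(200_000_000, 256))  # ~51.2 GB after doubling the dimension
```

The same arithmetic explains why dimension increases are expensive at serving time: memory bandwidth per query scales linearly with dims, not just storage.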
For small catalogs under 100K items, you can brute-force score all candidates with a cross-encoder. For tasks requiring nuanced interaction, like question answering or semantic search over short documents, cross-attention models or rerankers are better. Hybrid approaches combining lexical search with two-tower retrieval are common in search products, handling exact keyword matches alongside semantic similarity.
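The late-interaction design described above reduces per-request scoring to a single matrix-vector product over precomputed item vectors. A minimal NumPy sketch of that serving path (the shapes, random embeddings, and exact top-k are illustrative; production systems swap the last step for an ANN index):

```python
import numpy as np

rng = np.random.default_rng(0)

# Item tower outputs: precomputed offline for the whole catalog,
# L2-normalized so a dot product equals cosine similarity.
item_embs = rng.standard_normal((10_000, 128)).astype(np.float32)
item_embs /= np.linalg.norm(item_embs, axis=1, keepdims=True)

# User tower output: computed once per request.
user_emb = rng.standard_normal(128).astype(np.float32)
user_emb /= np.linalg.norm(user_emb)

# Late interaction: one matrix-vector product scores the entire catalog.
scores = item_embs @ user_emb

# Exact top-k retrieval; at millions of items an ANN index (HNSW, IVF)
# replaces this step and introduces the recall/latency trade-off.
k = 100
top_k = np.argpartition(-scores, k)[:k]
top_k = top_k[np.argsort(-scores[top_k])]
```

A cross-encoder, by contrast, would need one forward pass per (user, item) pair, which is why it only appears after this stage, on the few hundred survivors.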
💡 Key Takeaways
Late interaction means no cross-attention between user history and candidate items; cross-encoders achieve 2 to 5% better ranking accuracy but are 100 to 1,000× slower and only feasible for reranking 200 to 1,000 candidates
Embedding dimensions of 128 to 256 are typical; doubling from 128 to 256 dims doubles memory from roughly 26 GB to 52 GB per shard for 200 million items, with minimal quality gain beyond this range
Daily item embedding refreshes balance freshness and stability; hourly updates capture trends faster but increase cost and cause score drift; user embeddings computed per request with the last 10 to 50 clicks improve CTR by 3 to 7%
Two-tower is optimal when the catalog exceeds 1 million items and the p99 latency budget is under 50 ms; for catalogs under 100K items, brute-force cross-encoder scoring is viable
Hybrid lexical plus two-tower retrieval is standard in search products to handle exact keyword matches alongside semantic similarity; Google and Bing both use this pattern
ANN recall tuning trades latency for quality: increasing recall from 85% to 95% can double query latency from 5 ms to 10 ms but recover 10 to 15% of lost ranking quality
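One common way to fold the last 10 to 50 clicks into the user vector at request time is to blend the daily-aggregated profile with the mean of recent click embeddings. A hypothetical sketch of that idea (the function name, blend formula, and `alpha` weight are illustrative assumptions, not any team's production recipe):

```python
import numpy as np

def realtime_user_embedding(daily_profile, recent_clicks, alpha=0.3):
    """Blend a daily-aggregated user profile with the mean embedding of
    recently clicked items (e.g. the last 10-50 clicks).

    alpha weights the in-session signal; the result is re-normalized so
    downstream dot-product scoring stays on the same scale.
    Hypothetical recipe for illustration only.
    """
    session = np.mean(np.asarray(recent_clicks), axis=0)
    blended = (1.0 - alpha) * np.asarray(daily_profile) + alpha * session
    return blended / np.linalg.norm(blended)
```

Because only the user side changes per request, this keeps the precomputed item index untouched, which is exactly why real-time personalization is cheap on the user tower but expensive on the item tower.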
📌 Examples
Netflix uses 128-dimensional embeddings for title retrieval with daily batch updates; real-time user interactions from the current session are encoded separately and merged at serving time to capture immediate context without full embedding recomputation
eBay refreshes item embeddings nightly for 1.3 billion listings; real-time inventory status and price changes are handled via separate filters before ANN search rather than embedding updates, keeping latency under 20 ms at p95