Two-Tower Models (User/Item Embeddings)

How the Two-Tower Architecture Works

Core Concept
Each tower is a neural network that converts raw features into a fixed-size vector (64-256 dimensions). The user tower processes user features; the item tower processes item features. Both towers output vectors in the same space, so their dot product measures affinity.
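As a minimal sketch of this contract, the snippet below uses random linear maps as stand-ins for the two networks; the input widths (300 and 450) and the 128-dimension output are illustrative assumptions, not values from any particular system.

```python
import numpy as np

# Illustrative stand-ins for the two towers: different inputs, same output space.
EMBED_DIM = 128
rng = np.random.default_rng(0)

W_user = rng.standard_normal((300, EMBED_DIM))  # "user tower" as a single linear map
W_item = rng.standard_normal((450, EMBED_DIM))  # "item tower" as a single linear map

user_vec = rng.standard_normal(300) @ W_user    # user features -> 128-d vector
item_vec = rng.standard_normal(450) @ W_item    # item features -> 128-d vector

affinity = float(user_vec @ item_vec)           # same space, so the dot product is meaningful
```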

What Each Tower Sees

The user tower receives everything known about the user at request time. This includes static features like user ID (mapped to a learnable embedding), demographics, and account age. It also includes dynamic features: recent interactions (last 50 items viewed, clicked, or purchased), current session context (device, location, time of day), and aggregated statistics (average order value, purchase frequency).

The item tower receives everything known about the item. Item ID maps to a learnable embedding that captures collaborative signals from training. Content features include title (passed through a text encoder), category hierarchy, price range, and item age. Image features come from a pre-trained vision model. Behavioral statistics include click rate, conversion rate, and return rate from historical data.
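To make the two input sides concrete, here is a sketch of what request-time feature payloads might look like; every field name and value below is an illustrative assumption, not a required schema.

```python
# Hypothetical request-time feature payloads (field names are assumptions).
user_features = {
    "user_id": 48213,                        # mapped to a learnable embedding
    "account_age_days": 412,                 # static numerical feature
    "recent_item_ids": [901, 77, 4521],      # last items viewed/clicked/purchased
    "device": "mobile",                      # session context, categorical
    "hour_of_day": 21,                       # session context, numerical
    "avg_order_value": 34.50,                # aggregated statistic
    "purchase_frequency_per_month": 2.1,     # aggregated statistic
}

item_features = {
    "item_id": 901,                          # learnable embedding, collaborative signal
    "title": "Wireless noise-cancelling headphones",   # passed through a text encoder
    "category_path": ["electronics", "audio", "headphones"],
    "price": 129.99,                         # numerical, normalized before use
    "item_age_days": 60,
    "image_vector_id": "img-901",            # lookup key for a pre-trained vision embedding
    "historical_ctr": 0.042,                 # behavioral statistic
    "historical_cvr": 0.011,                 # behavioral statistic
}
```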

Inside A Tower

A typical tower stacks 2-4 fully connected layers. First, each input feature becomes a vector: categorical features like category or brand are embedded (learned during training), numerical features like price are normalized, and text features are encoded. All these vectors are concatenated into one long input vector, often 500-2000 dimensions.

This input passes through dense layers with ReLU activations. Layer sizes typically decrease: 1024 → 512 → 256 → 128. The final layer has no activation, allowing positive and negative values in the output embedding. Batch normalization between layers stabilizes training. Dropout (0.1-0.3) prevents overfitting. Total parameters per tower: typically 1-10 million.
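A minimal PyTorch sketch of one such tower, assuming one ID feature, one categorical feature, and a block of numerical features; the vocabulary sizes, embedding widths, and the 512 → 256 → 128 layer pattern are illustrative, and the final layer deliberately has no activation.

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """One tower: embed categorical inputs, concatenate with numerical inputs,
    then compress through dense layers to a fixed-size embedding.
    Vocabulary sizes and feature counts are illustrative assumptions."""

    def __init__(self, num_ids=1_000_000, num_categories=500,
                 num_numeric=16, out_dim=128):
        super().__init__()
        self.id_emb = nn.Embedding(num_ids, 64)          # learned ID embedding
        self.cat_emb = nn.Embedding(num_categories, 32)  # e.g. category or brand
        in_dim = 64 + 32 + num_numeric                   # concatenated input vector
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.BatchNorm1d(512), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(512, 256), nn.BatchNorm1d(256), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(256, out_dim),                     # no activation on the output
        )

    def forward(self, ids, cats, numerics):
        x = torch.cat([self.id_emb(ids), self.cat_emb(cats), numerics], dim=-1)
        return self.mlp(x)

# Example forward pass with a batch of 4 random inputs.
tower = Tower()
emb = tower(torch.randint(0, 1_000_000, (4,)),
            torch.randint(0, 500, (4,)),
            torch.randn(4, 16))
print(emb.shape)  # torch.Size([4, 128])
```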

Computing The Score

User-item affinity is the dot product: score = sum(user_vector[i] * item_vector[i]) for all dimensions. With 128-dimension vectors, this is 128 multiplications and 127 additions. A GPU computes millions of these per second. Higher scores mean stronger predicted affinity.

Some systems normalize vectors to unit length and use cosine similarity instead. This bounds scores between -1 and +1 and ignores vector magnitude. The choice matters: dot product lets the model learn that some users engage more overall and some items are universally popular. Cosine similarity focuses purely on direction, which can be better when you want to emphasize relative preferences over absolute engagement levels.
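The difference is easy to see with stand-in embeddings; the snippet below scores 1,000 random item vectors against one user vector both ways (the numbers are random, so only the mechanics matter).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
user = torch.randn(1, 128)       # stand-in for a user-tower output
items = torch.randn(1000, 128)   # stand-ins for item-tower outputs

# Dot product: unbounded and sensitive to magnitude, so globally popular items
# or high-engagement users can shift scores up across the board.
dot_scores = (items @ user.T).squeeze(1)          # shape (1000,)

# Cosine similarity: normalize both sides to unit length, so only direction
# matters and every score lands in [-1, 1].
cos_scores = (F.normalize(items, dim=-1) @ F.normalize(user, dim=-1).T).squeeze(1)

top_dot = dot_scores.topk(10).indices
top_cos = cos_scores.topk(10).indices             # the two rankings can differ
```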

💡 Interview Tip: When asked "how would you choose embedding dimension?", walk through this calculation: 10M items × 128 dimensions × 4 bytes ≈ 5 GB. Then explain the trade-off: larger dimensions capture more nuance but increase storage and latency linearly. Start with 64-128, benchmark retrieval recall versus latency, and only increase if recall is the bottleneck.
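A quick helper for that back-of-the-envelope estimate, assuming float32 storage for the item index:

```python
def index_memory_gb(num_items: int, dim: int, bytes_per_value: int = 4) -> float:
    """Approximate memory for an item embedding index stored as float32."""
    return num_items * dim * bytes_per_value / 1e9

print(index_memory_gb(10_000_000, 128))  # ~5.1 GB, matching the estimate above
print(index_memory_gb(10_000_000, 256))  # doubling the dimension doubles storage
```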
💡 Key Takeaways
User tower input: demographics, last 50 clicks, categories browsed, time of day, device. Item tower input: category, tags, title words, price, publish date, view count
Towers compress inputs through layers: 512 → 256 → 128. The final 128 numbers are the embedding used for similarity
History encoding option 1 - Average pooling: average the embeddings of the last N items. Simple and fast, but treats old clicks the same as new ones
History encoding option 2 - Weighted average: learn a weight per item (e.g., today's click = 0.7, last month's = 0.1). More expressive, slightly slower
History encoding option 3 - Sequential: process items in order to capture patterns like A→B suggests C. Most expressive, 3-5x slower than averaging (all three options are sketched after this list)
Towers MUST stay separate so item embeddings depend only on item features. Sharing information would require recomputing 100M item embeddings per request
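A compact sketch of the three history-encoding options from the takeaways, using random item embeddings as stand-ins; the decay rate and the choice of a GRU for the sequential option are assumptions for illustration.

```python
import torch

torch.manual_seed(0)
history = torch.randn(50, 128)   # embeddings of the last 50 interacted items

# Option 1: average pooling - every item counts equally.
avg_pooled = history.mean(dim=0)                        # shape (128,)

# Option 2: weighted average - here a simple exponential recency decay;
# in practice the per-item weights can be learned.
ages = torch.arange(50, dtype=torch.float32)            # 0 = most recent item
weights = torch.softmax(-0.1 * ages, dim=0)             # newer items weigh more
weighted = (weights.unsqueeze(1) * history).sum(dim=0)  # shape (128,)

# Option 3: sequential - run the history through a small GRU and keep the
# final hidden state; captures order (A then B suggests C) at extra cost.
gru = torch.nn.GRU(input_size=128, hidden_size=128, batch_first=True)
_, last_hidden = gru(history.unsqueeze(0))              # hidden shape (1, 1, 128)
sequential = last_hidden.squeeze()                      # shape (128,)
```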
📌 Interview Tips
1. When asked about training at scale: explain in-batch negatives (using the 1,023 other items in the batch as negatives) as a compute-efficient alternative to sampling from the full catalog; see the sketch after this list.
2. For interview depth: mention the logQ correction, which subtracts the log of each item's sampling probability from its logit during training to counteract the popularity bias that in-batch negatives introduce.
3. When discussing loss functions: explain that a softmax over dot products teaches the model to score the positive item higher than the negatives in the batch.
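A compact sketch of how these three tips fit together in one training step: an in-batch softmax loss over dot products with a logQ correction. The `item_log_prob` input (log of each item's sampling probability, estimated elsewhere, e.g. from a streaming frequency counter) is an assumed input, and the batch size of 1024 mirrors the 1,023-negatives example above.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb, item_emb, item_log_prob):
    """In-batch negatives: row i's positive item sits at column i, and the
    other B - 1 items in the batch serve as its negatives.
    item_log_prob: log sampling probability per batch item (logQ correction)."""
    logits = user_emb @ item_emb.T                # (B, B) dot-product scores
    logits = logits - item_log_prob.unsqueeze(0)  # subtract logQ from each item's column
    labels = torch.arange(user_emb.size(0))       # diagonal entries are the positives
    return F.cross_entropy(logits, labels)        # softmax over dot products

# Toy usage: a batch of 1024 users and their positive items (random stand-ins).
B, D = 1024, 128
loss = in_batch_softmax_loss(torch.randn(B, D), torch.randn(B, D),
                             torch.log(torch.rand(B)))
print(float(loss))
```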