How Two Tower Architecture Works
What Each Tower Sees
The user tower receives everything known about the user at request time. This includes static features like user ID (mapped to a learnable embedding), demographics, and account age. It also includes dynamic features: recent interactions (last 50 items viewed, clicked, or purchased), current session context (device, location, time of day), and aggregated statistics (average order value, purchase frequency).
The item tower receives everything known about the item. Item ID maps to a learnable embedding that captures collaborative signals from training. Content features include title (passed through a text encoder), category hierarchy, price range, and item age. Image features come from a pre-trained vision model. Behavioral statistics include click rate, conversion rate, and return rate from historical data.
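To make this concrete, the raw inputs to the two towers might look something like the sketch below. Every feature name and value here is hypothetical, chosen only to mirror the categories described above, not taken from any specific system.

```python
# Hypothetical raw inputs at request time (all names and values are illustrative).
user_features = {
    "user_id": 48213907,                    # mapped to a learnable embedding
    "age_bucket": "25-34",                  # demographic, categorical
    "account_age_days": 812,                # static numerical feature
    "recent_item_ids": [193, 88214, 5521],  # recent views/clicks/purchases
    "device": "mobile",                     # current session context
    "hour_of_day": 21,
    "avg_order_value": 42.50,               # aggregated statistic
    "purchase_frequency_30d": 3,
}

item_features = {
    "item_id": 88214,                       # mapped to a learnable embedding
    "title": "Wireless noise-cancelling headphones",   # fed to a text encoder
    "category_path": ["Electronics", "Audio", "Headphones"],
    "price_bucket": "50-100",
    "item_age_days": 140,
    "image_embedding": [0.12, -0.03, 0.56], # first few dims shown, from a vision model
    "click_rate": 0.041,                    # historical behavioral statistics
    "conversion_rate": 0.006,
}
```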
Inside A Tower
A typical tower stacks 2-4 fully connected layers. First, each input feature becomes a vector: categorical features like category or brand are embedded (learned during training), numerical features like price are normalized, and text features are encoded. All these vectors are concatenated into one long input vector, often 500-2000 dimensions.
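That first step, embedding categoricals, normalizing numericals, and concatenating everything into one vector, could look like the PyTorch sketch below. The feature set, embedding sizes, and normalization constants are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Turns raw features into one concatenated input vector (illustrative sketch)."""

    def __init__(self, num_categories=5000, num_brands=2000):
        super().__init__()
        # Categorical features get learnable embedding tables.
        self.category_emb = nn.Embedding(num_categories, 32)
        self.brand_emb = nn.Embedding(num_brands, 32)

    def forward(self, category_ids, brand_ids, price, text_vec):
        # Numerical features are normalized; mean/std here are placeholder
        # values that would normally come from training statistics.
        price_norm = (price - 50.0) / 30.0
        parts = [
            self.category_emb(category_ids),  # (batch, 32)
            self.brand_emb(brand_ids),        # (batch, 32)
            price_norm.unsqueeze(-1),         # (batch, 1)
            text_vec,                         # (batch, text_dim), pre-encoded upstream
        ]
        # One long input vector per example.
        return torch.cat(parts, dim=-1)
```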
This input passes through dense layers with ReLU activations. Layer sizes typically decrease: 1024 -> 512 -> 256 -> 128. The final layer has no activation, allowing positive and negative values in the output embedding. Batch normalization between layers stabilizes training. Dropout (0.1-0.3) prevents overfitting. Total parameters per tower: typically 1-10 million.
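Putting the stack together, a tower might be written as the following PyTorch sketch. The 1024 -> 512 -> 256 -> 128 layer sizes come from the text above; the module name and other details are illustrative.

```python
import torch.nn as nn

class Tower(nn.Module):
    """Maps the concatenated input vector to a 128-d embedding (illustrative sketch)."""

    def __init__(self, input_dim, hidden_dims=(1024, 512, 256),
                 output_dim=128, dropout=0.2):
        super().__init__()
        layers = []
        prev = input_dim
        for h in hidden_dims:
            layers += [
                nn.Linear(prev, h),
                nn.BatchNorm1d(h),    # batch norm between layers stabilizes training
                nn.ReLU(),
                nn.Dropout(dropout),  # 0.1-0.3 to reduce overfitting
            ]
            prev = h
        # Final projection has no activation, so the output embedding can
        # contain both positive and negative values.
        layers.append(nn.Linear(prev, output_dim))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```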
Computing The Score
User-item affinity is the dot product: score = sum(user_vector[i] * item_vector[i]) for all dimensions. With 128-dimensional vectors, this is 128 multiplications and 127 additions. A GPU computes millions of these scores per second. Higher scores mean stronger predicted affinity.
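In practice this scoring is a batched matrix product. The snippet below is a sketch with assumed shapes: 128-dimensional embeddings and 10,000 candidate items.

```python
import torch

user_vec = torch.randn(128)          # output of the user tower for one request
item_vecs = torch.randn(10000, 128)  # item-tower outputs for 10k candidates

# Dot product per item: sum(user_vec[i] * item_vec[i]) over all 128 dimensions.
scores = item_vecs @ user_vec        # shape (10000,); higher = stronger affinity
top_scores, top_idx = torch.topk(scores, k=20)
```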
Some systems normalize vectors to unit length and use cosine similarity instead. This bounds scores between -1 and +1 and ignores vector magnitude. The choice matters: dot product lets the model learn that some users engage more overall and some items are universally popular. Cosine similarity focuses purely on direction, which can be better when you want to emphasize relative preferences over absolute engagement levels.
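Switching to cosine similarity only adds a unit-length normalization before the same dot product, as in this sketch (same assumed shapes as above).

```python
import torch
import torch.nn.functional as F

user_vec = torch.randn(128)
item_vecs = torch.randn(10000, 128)

# Normalize both sides to unit length; the dot product of unit vectors equals
# cosine similarity, bounded in [-1, 1], so vector magnitude is ignored.
user_unit = F.normalize(user_vec, dim=-1)
item_unit = F.normalize(item_vecs, dim=-1)
cosine_scores = item_unit @ user_unit   # shape (10000,)
```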