Failure Modes and Production Operations
Embedding Staleness
User behavior changes faster than models retrain. A user who purchased running shoes yesterday still gets running shoe recommendations today, even if they switched to searching for formal shoes. The user tower uses outdated behavioral features.
The fix: incorporate real-time session features that update instantly, not just historical aggregates. Use the last 10-20 actions within the current session as input features. These reflect immediate intent, even if the model weights are a day old. Typical refresh: user embeddings recompute every request (sub-second freshness for session features), item embeddings rebuild hourly or daily.
Index-Model Mismatch
When you deploy a new model, the item index still contains embeddings from the old model. User embeddings from the new model and item embeddings from the old model live in different vector spaces. Dot products become meaningless. Retrieval recall drops to near-random.
The fix: coordinate model and index deployments. Before switching to a new model, rebuild the entire item index with new embeddings. Use blue-green deployment: serve traffic from the old index while building the new one, then switch atomically. Never mix embeddings from different model versions in the same serving path. Track embedding version as metadata; alert if versions mismatch.
Cold Start Degradation
New users get generic embeddings because the user tower has no historical features to work with. If 30% of traffic is new users, 30% of recommendations are essentially random. Worse: if new users do not engage, they never generate the data needed to improve their embeddings. This creates a retention cliff for new users.
The fix: design specific cold-start paths. Use content-based recommendations for new users until you have 5-10 interactions. Surface popular or trending items that have high base rates. Track cold-start users separately in metrics; compare their engagement to warm users. Set targets: cold-start click rate should be at least 60% of warm-user click rate within the first session.