Retrieval and Ranking Failure Modes in Production
RETRIEVAL BLIND SPOTS
The most dangerous failure mode: retrieval systematically misses good items. If the embedding model was trained on genre co-occurrence, it might place jazz and classical far apart, so jazz fans never see classical recommendations. Detect this by sampling users, running exhaustive scoring offline, and comparing the results to the retrieval candidates. If exhaustive scoring finds items that retrieval missed but the ranker would have ranked highly, you have a blind spot.
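The audit above can be sketched as follows. This is a minimal offline harness, assuming hypothetical `retrieve(user)` and `ranker_score(user, item)` callables stand in for your retrieval stage and final ranker; both names are placeholders, not a real API.

```python
import random

def blind_spot_rate(users, all_items, retrieve, ranker_score, k=100, sample_size=50):
    """Average fraction of each sampled user's true top-k (by exhaustive
    ranker scoring) that the retrieval stage failed to surface."""
    sampled = random.sample(users, min(sample_size, len(users)))
    rates = []
    for user in sampled:
        # Offline exhaustive pass: rank every item, not just retrieved candidates.
        top_k = sorted(all_items, key=lambda i: ranker_score(user, i), reverse=True)[:k]
        candidates = set(retrieve(user))
        missed = [item for item in top_k if item not in candidates]
        rates.append(len(missed) / k)
    return sum(rates) / len(rates)
```

A rate near zero means retrieval is covering what the ranker values; a persistently high rate for some user segment (e.g. jazz listeners) localizes the blind spot.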
STAGE MISALIGNMENT
When stages disagree about quality, the cascade amplifies errors. If L1 ranks by popularity but L3 optimizes for personalization, L1 filters out niche items that L3 would rank highly. Measure the correlation between stage scores for surviving items; correlation below roughly 0.6 indicates misalignment. Fix it by distilling L3 knowledge into L1, training L1 to approximate L3's scores.
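The correlation check can be computed directly from paired stage scores. A minimal sketch using Spearman rank correlation (implemented by hand to stay dependency-free; it ignores ties, which a production version would handle):

```python
def stage_score_correlation(l1_scores, l3_scores):
    """Spearman rank correlation between L1 and L3 scores for the same
    surviving items. Values below ~0.6 suggest the stages disagree."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0.0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(l1_scores), ranks(l3_scores)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Rank correlation is the right fit here because L1 and L3 scores live on different scales; only their orderings need to agree.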
COLD START
New items have no interaction history, so embedding retrieval cannot find them. New users have no profile, so ranking falls back to popularity. For new items, use content-based retrieval; for new users, use onboarding signals or trending content. Typical cold-start windows: 7 to 14 days for items and 3 to 7 days for users before embeddings become reliable.
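One way to wire up those fallbacks is age-based routing at request time. A sketch under assumed data shapes: `user["signup_time"]` and `item["created_at"]` are illustrative field names, and the retriever callables (`embedding_retrieve`, `onboarding_retrieve`, `trending`) are placeholders.

```python
from datetime import datetime, timedelta

# Upper ends of the windows cited above; tune per product.
ITEM_COLD_WINDOW = timedelta(days=14)
USER_COLD_WINDOW = timedelta(days=7)

def retrieve_for_user(user, now, embedding_retrieve, trending, onboarding_retrieve):
    """Route cold-start users away from unreliable embedding retrieval."""
    if now - user["signup_time"] < USER_COLD_WINDOW:
        # New user: no profile yet, so blend onboarding signals with trending.
        return onboarding_retrieve(user) + trending()
    return embedding_retrieve(user)

def eligible_retrievers(item, now):
    """New items are only reachable via content-based retrieval."""
    if now - item["created_at"] < ITEM_COLD_WINDOW:
        return ["content_based"]
    return ["content_based", "embedding"]
```

Keeping the windows as explicit constants makes it easy to A/B test shorter or longer cold-start periods.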
LATENCY SPIKES
P99 latency often exceeds the median by 5 to 10x due to GC pauses or unbalanced candidate sets. When one retriever returns 10,000 candidates instead of 1,000, ranking latency spikes. Protect the pipeline with 50 ms stage timeouts, a 5,000-candidate cap, and circuit breakers for slow retrievers.
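The three protections can be combined in a fan-out wrapper. A simplified sketch: the consecutive-failure threshold and cooldown are illustrative values, and a production breaker would track rolling error rates rather than a simple counter.

```python
import time

MAX_CANDIDATES = 5000    # candidate cap from the text
STAGE_TIMEOUT_S = 0.050  # 50 ms stage budget

class CircuitBreaker:
    """Trips open after `threshold` consecutive slow or failed calls;
    allows a trial call again after `cooldown_s` (half-open)."""
    def __init__(self, threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            self.opened_at = None  # half-open: let one call through
            self.failures = 0
            return True
        return False

    def record(self, elapsed_s, ok=True):
        if ok and elapsed_s <= STAGE_TIMEOUT_S:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()

def gather_candidates(retrievers, breakers, user):
    """Run each retriever behind its breaker; cap the merged candidate set
    so one misbehaving retriever cannot blow up ranking latency."""
    out = []
    for name, fn in retrievers.items():
        breaker = breakers[name]
        if not breaker.allow():
            continue  # retriever is tripped; skip it this request
        start = time.monotonic()
        try:
            out.extend(fn(user))
            breaker.record(time.monotonic() - start, ok=True)
        except Exception:
            breaker.record(time.monotonic() - start, ok=False)
    return out[:MAX_CANDIDATES]
```

Counting a slow-but-successful call as a failure is deliberate: for P99 protection, a retriever that blows the 50 ms budget is as harmful as one that errors.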