Failure Modes: Misaligned Rewards, Training-Serving Skew, and Non-Stationarity
Bandits optimize the reward signal you give them, so misaligned rewards cause catastrophic outcomes. Expedia optimized hero image CTR and discovered the system selected visually striking but misleading images (clickbait) that increased clicks but decreased bookings and increased bounce rate. The short-term proxy (CTR within the session) diverged from the true business goal (conversion). Fixing this requires shifting the reward to conversion directly, using a multi-objective reward (a weighted combination of CTR and conversion rate), or running hierarchical bandits where a short-term proxy is gated by long-term validation.
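A minimal sketch of such a multi-objective reward, assuming hypothetical weights and signal names rather than Expedia's actual formula:

```python
# Minimal sketch of a multi-objective reward; weights and field names are
# hypothetical, not Expedia's actual formula.
def composite_reward(clicked: bool, converted: bool,
                     w_click: float = 0.3, w_convert: float = 0.7) -> float:
    """Blend the fast CTR proxy with the slower conversion signal."""
    return w_click * float(clicked) + w_convert * float(converted)

# Example: a click without a booking earns only the proxy portion of the reward.
assert composite_reward(clicked=True, converted=False) == 0.3
```

Weighting conversion more heavily keeps the fast click signal useful for exploration while anchoring the optimum to the business goal.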
Training-serving skew occurs when the reward you attribute does not correspond to the action you actually took. If you log impressions but clicks arrive delayed or out of order, you may credit the wrong arm. If multiple bandits update shared state without coordination, race conditions corrupt the statistics. Solutions include strict attribution validation in the feedback pipeline (was this arm actually shown to this user?), idempotency keys to deduplicate events, and short reward windows (Udemy's 15 minutes) to keep feedback attributable and avoid session-end ambiguity.
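A minimal sketch of that pipeline hygiene, assuming an in-memory store and hypothetical field names; a production pipeline would back this with a keyed state store and exactly-once delivery:

```python
# Minimal sketch of idempotent reward ingestion with a short reward window.
# The stores and field names below are illustrative assumptions.
seen_event_ids: set[str] = set()          # idempotency keys already processed
impression_ts: dict[str, float] = {}      # request_id -> impression timestamp

def accept_reward_event(event_id: str, request_id: str, event_ts: float,
                        reward_window_s: float = 900.0) -> bool:
    """Accept a feedback event only once, and only inside the reward window."""
    if event_id in seen_event_ids:        # duplicate delivery: ignore
        return False
    seen_event_ids.add(event_id)
    shown_at = impression_ts.get(request_id)
    if shown_at is None:                  # no matching impression was logged
        return False
    return (event_ts - shown_at) <= reward_window_s
```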
Non-stationary environments mean reward distributions shift over time due to seasonality, trends, content fatigue, or behavior changes. UCB with cumulative statistics performs poorly because old data from a different distribution pollutes the current estimates. Epsilon-greedy with a constant step size (a fixed learning rate that weights recent samples more heavily) or a sliding window over the last N samples adapts faster. Thompson Sampling with decayed priors or periodic resets also works. Change-point detection can trigger re-initialization when a distribution shift is detected.
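A minimal sketch of the constant-step-size variant, with illustrative (untuned) parameter values:

```python
import random

# Minimal sketch of epsilon-greedy with a constant step size, so recent rewards
# dominate the value estimates under a drifting distribution.
class NonStationaryEpsilonGreedy:
    def __init__(self, n_arms: int, epsilon: float = 0.1, alpha: float = 0.05):
        self.epsilon = epsilon       # exploration rate
        self.alpha = alpha           # constant step size (fixed learning rate)
        self.q = [0.0] * n_arms      # per-arm value estimates

    def select(self) -> int:
        if random.random() < self.epsilon:
            return random.randrange(len(self.q))                  # explore
        return max(range(len(self.q)), key=self.q.__getitem__)    # exploit

    def update(self, arm: int, reward: float) -> None:
        # Exponential recency weighting: old samples decay geometrically.
        self.q[arm] += self.alpha * (reward - self.q[arm])
```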
Long-tail and sparse-traffic entities never converge. Expedia's small properties lack enough impressions for bandits to learn which hero image is best. New content has zero history (cold start). Solutions include hierarchical Bayesian models that pool statistics across similar items (e.g., all hotels in a city share a prior), traffic-based gating (only run bandits on contexts with >1000 impressions per week), or explore-in-bulk campaigns where you force randomization for a fixed period, then select winners and freeze them.
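A minimal sketch of pooling through a shared Beta prior, assuming a hypothetical prior strength; a full hierarchical Bayesian model would infer that strength from data rather than hard-coding it:

```python
# Minimal sketch of city-level pooling via a shared Beta prior.
# prior_strength is a hypothetical pseudo-count, not a value from the source.
def pooled_prior(city_clicks: int, city_impressions: int,
                 prior_strength: float = 50.0) -> tuple[float, float]:
    """Turn a city's aggregate CTR into a Beta(alpha, beta) prior for its hotels."""
    city_ctr = city_clicks / max(city_impressions, 1)
    return prior_strength * city_ctr, prior_strength * (1.0 - city_ctr)

def posterior(item_clicks: int, item_impressions: int,
              prior: tuple[float, float]) -> tuple[float, float]:
    """A sparse-traffic hotel starts at the city prior and sharpens as its own data arrives."""
    a0, b0 = prior
    return a0 + item_clicks, b0 + (item_impressions - item_clicks)
```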
💡 Key Takeaways
• Misaligned reward signals cause bandits to optimize the wrong objective. Expedia's CTR optimization selected clickbait images that harmed bookings. Fix by using conversion as the direct reward, multi-objective synthesis (a weighted combination of CTR and conversion), or hierarchical bandits with a short-term proxy gated by long-term A/B validation.
• Training-serving skew happens when feedback attribution is incorrect (the wrong arm is credited due to delayed clicks or out-of-order events) or race conditions corrupt shared state. Require strict validation in the streaming pipeline: was this arm actually shown? Use idempotency keys, short reward windows (15 minutes), and atomic increment operations.
• Non-stationary reward distributions (seasonality, trends, content fatigue) make historical data stale. UCB with cumulative statistics lags badly. Use epsilon-greedy with sliding windows (e.g., the last 10,000 samples), Thompson Sampling with decayed priors, or change-point detection to trigger re-initialization when shifts are detected.
• Long-tail, sparse-traffic entities (Expedia's small properties, new content) never accumulate enough samples to converge. Hierarchical Bayesian priors pool statistics across similar items (e.g., hotels in the same city), traffic gating runs bandits only on high-volume contexts (>1000 impressions/week), or explore in bulk then freeze winners.
• Positional bias and interaction effects mean a slate item's reward depends on the other items shown alongside it. Naive bandits over-credit top positions. Solutions include per-position bandits (Scribd), counterfactual correction with propensity scores (see the sketch after this list), or interleaving experiments to isolate item quality from position.
• Adversarial or bot traffic inflates the reward for some arms. Expedia's streaming pipeline filters and deduplicates events. Require anomaly detection (sudden CTR spikes), robust baselines (comparison against historical ranges), and session-based validation to exclude non-human traffic.
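A minimal sketch of the propensity-score correction from the positional-bias point, with illustrative examination probabilities; in practice these would be estimated from randomization logs:

```python
# Minimal sketch of inverse-propensity correction for positional bias.
# The propensity values are illustrative assumptions, not measured data.
POSITION_PROPENSITY = {0: 1.00, 1: 0.60, 2: 0.35}  # P(examined | position)

def debiased_reward(clicked: bool, position: int) -> float:
    """Up-weight clicks from low-attention positions so top slots are not over-credited."""
    return float(clicked) / POSITION_PROPENSITY[position]
```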
📌 Examples
Expedia clickbait failure: Thompson Sampling on CTR selected misleading hero images that increased clicks by 15% but decreased bookings by 8% and increased bounce rate by 12%. The solution was two-phase: bandit exploration on CTR for fast feedback, followed by A/B validation on conversion and bounce rate before adoption.
Udemy non-stationary handling: Composite reward of clicks + enrollments within a 15-minute window. Decayed Thompson Sampling posteriors with periodic re-initialization every quarter to adapt to seasonal shifts in course popularity (New Year's resolutions, back-to-school traffic patterns).
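A minimal sketch of a decayed Thompson Sampling posterior, assuming a hypothetical per-update decay factor; the quarterly re-initialization described above would simply reset alpha and beta to 1:

```python
import random

# Minimal sketch of a decayed Beta posterior for Thompson Sampling.
# The decay factor is an illustrative assumption, not Udemy's actual setting.
class DecayedBetaArm:
    def __init__(self, alpha: float = 1.0, beta: float = 1.0, decay: float = 0.999):
        self.alpha, self.beta, self.decay = alpha, beta, decay

    def sample(self) -> float:
        return random.betavariate(self.alpha, self.beta)   # Thompson draw

    def update(self, reward: float) -> None:
        # Shrink old evidence back toward the uniform prior, then add the new observation.
        self.alpha = 1.0 + self.decay * (self.alpha - 1.0) + reward
        self.beta = 1.0 + self.decay * (self.beta - 1.0) + (1.0 - reward)
```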
Scribd long-tail handling: Low-traffic user segments (<500 daily active users) were pooled into a single "other" segment that shared a bandit. Per-position bandits ran only on segments with >1000 impressions per position per week. Small segments saw static, hand-curated layouts until sufficient traffic accumulated.
Training-serving skew example: A click event arrives 30 seconds after the impression, but the user was shown a different arm in a subsequent request due to rapid browsing. The feedback pipeline validates: for each click event, look up the logged impression within the last 60 seconds and verify the arm ID matches before incrementing that arm's click count.
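A minimal sketch of that validation step, assuming an in-memory impression log keyed by user; a streaming job would keep this in a keyed state store with a TTL:

```python
# Minimal sketch of the click-to-impression join described above.
# The log structure and field names are illustrative assumptions.
impressions: dict[str, list[tuple[str, float]]] = {}  # user_id -> [(arm_id, ts), ...] in time order

def attribute_click(user_id: str, clicked_arm: str, click_ts: float,
                    window_s: float = 60.0) -> bool:
    """Credit the click only if the same arm was shown to this user within the window."""
    for arm_id, shown_ts in reversed(impressions.get(user_id, [])):
        if click_ts - shown_ts > window_s:
            break                      # newest-first scan: older impressions are also out of window
        if arm_id == clicked_arm:
            return True                # arm IDs match: safe to increment this arm's click count
    return False                       # mismatch or stale: drop to avoid crediting the wrong arm
```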