Failure Modes: Misaligned Rewards, Training-Serving Skew, and Non-Stationarity
Bandits optimize the reward signal you give them, so misaligned rewards cause catastrophic outcomes. Expedia optimized hero image CTR and discovered the system selected visually striking but misleading images (clickbait) that increased clicks but decreased bookings and increased bounce rate. The short-term proxy (CTR within the session) diverged from the true business goal (conversion). Fixing this requires shifting the reward to conversion directly, using a multi-objective reward (a weighted combination of CTR and conversion rate), or running hierarchical bandits where a short-term proxy is gated by long-term validation.
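A minimal sketch of such a multi-objective reward, assuming hypothetical weights and signal names rather than Expedia's actual formula:

```python
# Minimal sketch of a multi-objective reward; weights and field names are
# hypothetical, not Expedia's actual formula.
def composite_reward(clicked: bool, converted: bool,
                     w_click: float = 0.3, w_convert: float = 0.7) -> float:
    """Blend the fast CTR proxy with the slower conversion signal."""
    return w_click * float(clicked) + w_convert * float(converted)

# Example: a click without a booking earns only the proxy portion of the reward.
assert composite_reward(clicked=True, converted=False) == 0.3
```

Weighting conversion more heavily keeps the fast click signal useful for exploration while anchoring the optimum to the business goal.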
Training-serving skew occurs when the reward you attribute does not correspond to the action you actually took. If you log impressions but clicks arrive delayed or out of order, you may credit the wrong arm. If multiple bandits update shared state without coordination, race conditions corrupt the statistics. Solutions include strict attribution validation in the feedback pipeline (was this arm actually shown to this user?), idempotency keys to deduplicate events, and short reward windows (Udemy's 15 minutes) to keep feedback attributable and avoid session-end ambiguity.
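A minimal sketch of that pipeline hygiene, assuming an in-memory store and hypothetical field names; a production pipeline would back this with a keyed state store and exactly-once delivery:

```python
# Minimal sketch of idempotent reward ingestion with a short reward window.
# The stores and field names below are illustrative assumptions.
seen_event_ids: set[str] = set()          # idempotency keys already processed
impression_ts: dict[str, float] = {}      # request_id -> impression timestamp

def accept_reward_event(event_id: str, request_id: str, event_ts: float,
                        reward_window_s: float = 900.0) -> bool:
    """Accept a feedback event only once, and only inside the reward window."""
    if event_id in seen_event_ids:        # duplicate delivery: ignore
        return False
    seen_event_ids.add(event_id)
    shown_at = impression_ts.get(request_id)
    if shown_at is None:                  # no matching impression was logged
        return False
    return (event_ts - shown_at) <= reward_window_s
```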
Non-stationary environments mean reward distributions shift over time due to seasonality, trends, content fatigue, or behavior changes. UCB with cumulative statistics performs poorly because old data from a different distribution pollutes the current estimates. Epsilon-greedy with a constant step size (a fixed learning rate that weights recent samples more heavily) or a sliding window over the last N samples adapts faster. Thompson Sampling with decayed priors or periodic resets also works. Change-point detection can trigger re-initialization when a distribution shift is detected.
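A minimal sketch of the constant-step-size variant, with illustrative (untuned) parameter values:

```python
import random

# Minimal sketch of epsilon-greedy with a constant step size, so recent rewards
# dominate the value estimates under a drifting distribution.
class NonStationaryEpsilonGreedy:
    def __init__(self, n_arms: int, epsilon: float = 0.1, alpha: float = 0.05):
        self.epsilon = epsilon       # exploration rate
        self.alpha = alpha           # constant step size (fixed learning rate)
        self.q = [0.0] * n_arms      # per-arm value estimates

    def select(self) -> int:
        if random.random() < self.epsilon:
            return random.randrange(len(self.q))                  # explore
        return max(range(len(self.q)), key=self.q.__getitem__)    # exploit

    def update(self, arm: int, reward: float) -> None:
        # Exponential recency weighting: old samples decay geometrically.
        self.q[arm] += self.alpha * (reward - self.q[arm])
```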
Long-tail and sparse-traffic entities never converge. Expedia's small properties lack enough impressions for bandits to learn which hero image is best. New content has zero history (cold start). Solutions include hierarchical Bayesian models that pool statistics across similar items (e.g., all hotels in a city share a prior), traffic-based gating (only run bandits on contexts with >1000 impressions per week), or explore-in-bulk campaigns where you force randomization for a fixed period, then select winners and freeze them.
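A minimal sketch of pooling through a shared Beta prior, assuming a hypothetical prior strength; a full hierarchical Bayesian model would infer that strength from data rather than hard-coding it:

```python
# Minimal sketch of city-level pooling via a shared Beta prior.
# prior_strength is a hypothetical pseudo-count, not a value from the source.
def pooled_prior(city_clicks: int, city_impressions: int,
                 prior_strength: float = 50.0) -> tuple[float, float]:
    """Turn a city's aggregate CTR into a Beta(alpha, beta) prior for its hotels."""
    city_ctr = city_clicks / max(city_impressions, 1)
    return prior_strength * city_ctr, prior_strength * (1.0 - city_ctr)

def posterior(item_clicks: int, item_impressions: int,
              prior: tuple[float, float]) -> tuple[float, float]:
    """A sparse-traffic hotel starts at the city prior and sharpens as its own data arrives."""
    a0, b0 = prior
    return a0 + item_clicks, b0 + (item_impressions - item_clicks)
```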
💡 Key Takeaways
• Misaligned reward signals cause bandits to optimize the wrong objective. Expedia's CTR optimization selected clickbait images that harmed bookings. Fix by using conversion as the direct reward, multi-objective synthesis (a weighted combination of CTR and conversion), or hierarchical bandits with a short-term proxy gated by long-term A/B validation.
• Training-serving skew happens when feedback attribution is incorrect (the wrong arm is credited due to delayed clicks or out-of-order events) or race conditions corrupt shared state. Require strict validation in the streaming pipeline: was this arm actually shown? Use idempotency keys, short reward windows (15 minutes), and atomic increment operations.
• Non-stationary reward distributions (seasonality, trends, content fatigue) make historical data stale. UCB with cumulative statistics lags badly. Use epsilon-greedy with sliding windows (e.g., the last 10,000 samples), Thompson Sampling with decayed priors, or change-point detection to trigger re-initialization when shifts are detected.
• Long-tail, sparse-traffic entities (Expedia's small properties, new content) never accumulate enough samples to converge. Hierarchical Bayesian priors pool statistics across similar items (e.g., hotels in the same city), traffic gating runs bandits only on high-volume contexts (>1000 impressions/week), or explore in bulk then freeze winners.
• Positional bias and interaction effects mean a slate item's reward depends on the other items shown alongside it. Naive bandits over-credit top positions. Solutions include per-position bandits (Scribd), counterfactual correction with propensity scores (see the sketch after this list), or interleaving experiments to isolate item quality from position.
• Adversarial or bot traffic inflates the reward for some arms. Expedia's streaming pipeline filters and deduplicates events. Require anomaly detection (sudden CTR spikes), robust baselines (comparison against historical ranges), and session-based validation to exclude non-human traffic.
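A minimal sketch of the propensity-score correction from the positional-bias point, with illustrative examination probabilities; in practice these would be estimated from randomization logs:

```python
# Minimal sketch of inverse-propensity correction for positional bias.
# The propensity values are illustrative assumptions, not measured data.
POSITION_PROPENSITY = {0: 1.00, 1: 0.60, 2: 0.35}  # P(examined | position)

def debiased_reward(clicked: bool, position: int) -> float:
    """Up-weight clicks from low-attention positions so top slots are not over-credited."""
    return float(clicked) / POSITION_PROPENSITY[position]
```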
📌 Examples
Expedia clickbait failure: Thompson Sampling on CTR selected misleading hero images that increased clicks by 15% but decreased bookings by 8% and increased bounce rate by 12%. The solution was two-phase: bandit exploration on CTR for fast feedback, followed by A/B validation on conversion and bounce rate before adoption.
Udemy non-stationary handling: Composite reward of clicks + enrollments within a 15-minute window. Decayed Thompson Sampling posteriors with periodic re-initialization every quarter to adapt to seasonal shifts in course popularity (New Year's resolutions, back-to-school traffic patterns).
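A minimal sketch of a decayed Thompson Sampling posterior, assuming a hypothetical per-update decay factor; the quarterly re-initialization described above would simply reset alpha and beta to 1:

```python
import random

# Minimal sketch of a decayed Beta posterior for Thompson Sampling.
# The decay factor is an illustrative assumption, not Udemy's actual setting.
class DecayedBetaArm:
    def __init__(self, alpha: float = 1.0, beta: float = 1.0, decay: float = 0.999):
        self.alpha, self.beta, self.decay = alpha, beta, decay

    def sample(self) -> float:
        return random.betavariate(self.alpha, self.beta)   # Thompson draw

    def update(self, reward: float) -> None:
        # Shrink old evidence back toward the uniform prior, then add the new observation.
        self.alpha = 1.0 + self.decay * (self.alpha - 1.0) + reward
        self.beta = 1.0 + self.decay * (self.beta - 1.0) + (1.0 - reward)
```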
Scribd long-tail handling: Low-traffic user segments (<500 daily active users) were pooled into a single "other" segment that shared a bandit. Per-position bandits ran only on segments with >1000 impressions per position per week. Small segments saw static, hand-curated layouts until sufficient traffic accumulated.
Training-serving skew example: A click event arrives 30 seconds after the impression, but the user was shown a different arm in a subsequent request due to rapid browsing. The feedback pipeline validates: for each click event, look up the logged impression within the last 60 seconds and verify the arm ID matches before incrementing that arm's click count.
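A minimal sketch of that validation step, assuming an in-memory impression log keyed by user; a streaming job would keep this in a keyed state store with a TTL:

```python
# Minimal sketch of the click-to-impression join described above.
# The log structure and field names are illustrative assumptions.
impressions: dict[str, list[tuple[str, float]]] = {}  # user_id -> [(arm_id, ts), ...] in time order

def attribute_click(user_id: str, clicked_arm: str, click_ts: float,
                    window_s: float = 60.0) -> bool:
    """Credit the click only if the same arm was shown to this user within the window."""
    for arm_id, shown_ts in reversed(impressions.get(user_id, [])):
        if click_ts - shown_ts > window_s:
            break                      # newest-first scan: older impressions are also out of window
        if arm_id == clicked_arm:
            return True                # arm IDs match: safe to increment this arm's click count
    return False                       # mismatch or stale: drop to avoid crediting the wrong arm
```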