Definition
Exploration-exploitation trade-off: Should you show items the model confidently predicts the user will like (exploit), or show items with uncertain predictions to learn more (explore)? Pure exploitation creates filter bubbles. Pure exploration annoys users with irrelevant content.
Why Diversity Matters
Users do not want 20 nearly identical recommendations. Even if the model predicts they all score 0.95, showing the same genre or brand 20 times is a poor experience. Diversity improves user satisfaction, catalog utilization, and long-term engagement even if it sacrifices short-term click-through rate.
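One common way to inject diversity is Maximal Marginal Relevance (MMR): greedily pick the item that balances predicted score against similarity to items already selected. This is a minimal sketch, assuming a pairwise item-similarity function is available; the names (`mmr_rerank`, `lambda_`) are illustrative, not from any specific library:

```python
def mmr_rerank(candidates, relevance, similarity, lambda_=0.7, k=10):
    """Greedy Maximal Marginal Relevance re-ranking.

    candidates: list of item ids
    relevance:  dict item -> predicted score
    similarity: func(item_a, item_b) -> similarity in [0, 1]
    lambda_:    trade-off; 1.0 = pure relevance, 0.0 = pure diversity
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def mmr_score(item):
            # Penalize items too similar to anything already picked
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lambda_ * relevance[item] - (1 - lambda_) * max_sim
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lambda_` around 0.5, a slightly lower-scored item from a new genre can outrank a near-duplicate of the top result, which is exactly the trade the section describes.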
Why Exploration Matters
Without exploration, new items never get exposure, and the model's estimates of user preferences never get updated. The system converges to a local optimum based on stale data. Exploration collects new signals that improve future predictions, even at the cost of current engagement.

Multi-Armed Bandits Framework
Think of each item as a slot machine (bandit) with unknown reward probability. You want to maximize total reward. Pulling only the best-known arm misses potentially better arms. Pulling random arms wastes pulls. Bandit algorithms balance this trade-off mathematically.
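A minimal sketch of one classic bandit algorithm, epsilon-greedy, assuming binary (e.g. click/no-click) rewards; the class and parameter names are illustrative:

```python
import random

class EpsilonGreedyBandit:
    """Epsilon-greedy: exploit the best-known arm most of the time,
    pull a random arm with probability epsilon to keep exploring."""

    def __init__(self, n_arms, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = [0] * n_arms
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self):
        if random.random() < self.epsilon:
            return random.randrange(len(self.counts))          # explore
        return max(range(len(self.counts)),
                   key=lambda a: self.values[a])               # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

In a recommender, an "arm" here could be an item, a ranking strategy, or a content slot, with reward defined as a click (or other signal) inside the reward window.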
💡 Key Insight: Diversity and exploration serve different purposes. Diversity improves user experience in a single session. Exploration improves model quality over time. Both require sacrificing short-term engagement metrics for long-term gains.
✓ Multi-armed bandits balance exploration (learning arm values) with exploitation (using the best-known arm), adapting allocation during the experiment unlike fixed A/B splits.
✓ Fast feedback loops are critical. Production systems use 15-minute to 24-hour reward windows; longer delays complicate credit assignment and slow learning.
✓ Bandits beat A/B testing when you have many variants and high opportunity cost. With N variants and K positions, the state space can be enormous (N^K possible layouts).
✓ The action space (arms) can be content types, widget positions, ranking strategies, or individual items. Typical deployments use 10-100 arms per bandit.
✓ Convergence time depends on traffic volume and arm count. Low-traffic segments may take weeks to converge; high-traffic systems can converge in hours to days.
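The adaptive allocation described in the first takeaway can be sketched with Thompson sampling over Bernoulli rewards. This is one standard choice among several (epsilon-greedy and UCB are others), shown here with Beta posteriors and illustrative names:

```python
import random

class ThompsonSamplingBandit:
    """Beta-Bernoulli Thompson sampling: sample a plausible reward rate
    for each arm from its posterior and pull the arm with the highest
    sample. Arms with few observations have wide posteriors, so they
    still get explored; traffic shifts toward winners as evidence
    accumulates."""

    def __init__(self, n_arms):
        self.successes = [1] * n_arms  # Beta(1, 1) uniform prior
        self.failures = [1] * n_arms

    def select_arm(self):
        samples = [random.betavariate(s, f)
                   for s, f in zip(self.successes, self.failures)]
        return max(range(len(samples)), key=lambda a: samples[a])

    def update(self, arm, reward):  # reward in {0, 1}
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```

Unlike a fixed A/B split, allocation shifts automatically: as one arm's posterior concentrates on a higher rate, it wins more of the sampled draws.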
1. When asked about bandits vs A/B testing: explain that bandits adapt during the experiment (allocating more traffic to winners), while A/B testing has fixed splits and waits for statistical significance.
2. For use cases: mention that bandits excel when decisions are reversible and feedback is fast (UI variants, content slots), but A/B is better for irreversible changes (pricing, product features).
3. When discussing exploration: explain the regret framework, where every suboptimal arm pull has an opportunity cost; bandits minimize cumulative regret over time.
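The regret framing in the last point can be made concrete with a small helper; `arm_means` (the true expected rewards, unknown in practice) and `pulls` (the sequence of arms actually chosen) are hypothetical inputs for illustration:

```python
def cumulative_regret(arm_means, pulls):
    """Cumulative regret: for each pull, the opportunity cost of not
    pulling the best arm, i.e. the gap between the best arm's expected
    reward and the pulled arm's expected reward, summed over all pulls."""
    best = max(arm_means)
    return sum(best - arm_means[a] for a in pulls)
```

A good bandit algorithm makes this sum grow sublinearly: as estimates sharpen, it pulls suboptimal arms less and less often.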