
Diversity Constraints and Convergence Monitoring in Production Bandits

Unconstrained bandits optimize short-term reward and often converge to homogeneous, low-diversity recommendations that harm long-term engagement. If the system optimizes CTR alone, it may show only clickbait images or the same content category repeatedly, creating filter bubbles and a poor user experience. Expedia found that pure CTR optimization selected visually striking but misleading hero images that increased clicks but decreased downstream bookings, misaligning the short-term proxy with business goals.

Diversity constraints enforce coverage across categories or content types. Expedia curated up to 10 candidate hero images per property with enforced category diversity (room, lobby, exterior, amenities, etc.) rather than allowing arbitrary images to compete, ensuring users see varied visual information about the property. The tradeoff is that enforcing diversity can slow convergence (more samples are needed across categories) and reduce short-term CTR if diverse arms underperform initially, but it improves downstream metrics like conversion rate and reduces bounce.

Convergence monitoring tells you when the bandit has learned enough to stop exploring. Udemy tracks the "rate of change" in the top-k composition: what percentage of the slate changes between consecutive time windows. Early on, the top 3 units change frequently as the algorithm explores. After sufficient data, changes plateau toward zero as posteriors sharpen and the system exploits the same high-reward arms. Teams use this metric to decide when to taper exploration (decay ε in epsilon-greedy, or rely on posterior shrinkage in Thompson Sampling).

Guardrails prevent exploration from harming key metrics. Expedia required new images to achieve both the highest CTR and statistical significance versus the incumbent before adoption, ensuring no regression. Scribd started with broad randomization for the first week to seed all arms with data, then allowed exploitation. Exploration budgets can also be dynamically capped (e.g., limit exploration traffic to 20% if overall CTR drops below a threshold) or combined with staged rollouts (run the bandit on 10% of traffic initially, then expand after validation).
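The curated-candidate setup described above can be sketched as Beta-Bernoulli Thompson Sampling over a small, category-diverse action space. The candidate names, categories, and counts below are illustrative, not Expedia's actual system:

```python
import random

# Hypothetical curated candidate set: one hero image per category,
# each with a Beta(alpha, beta) posterior over its CTR.
candidates = {
    "room_1":     {"alpha": 1, "beta": 1, "category": "room"},
    "lobby_1":    {"alpha": 1, "beta": 1, "category": "lobby"},
    "exterior_1": {"alpha": 1, "beta": 1, "category": "exterior"},
    "pool_1":     {"alpha": 1, "beta": 1, "category": "pool"},
}

def select_image(candidates, rng=random):
    """Thompson Sampling: draw a CTR estimate from each arm's Beta
    posterior and show the arm with the highest draw."""
    draws = {name: rng.betavariate(arm["alpha"], arm["beta"])
             for name, arm in candidates.items()}
    return max(draws, key=draws.get)

def update(candidates, name, clicked):
    """Bernoulli reward update: click -> alpha + 1, no click -> beta + 1."""
    if clicked:
        candidates[name]["alpha"] += 1
    else:
        candidates[name]["beta"] += 1

# One simulated impression:
shown = select_image(candidates)
update(candidates, shown, clicked=True)
```

Because the action space is curated per category before the bandit runs, diversity is enforced structurally: no matter which arm wins, the candidate pool itself spans room, lobby, exterior, and pool imagery.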
💡 Key Takeaways
Pure CTR optimization creates filter bubbles and homogeneous recommendations. Expedia found CTR favored clickbait images that harmed downstream bookings. Diversity constraints (category quotas, candidate curation) improve long-term metrics at the cost of slower convergence and potentially lower short-term CTR.
Candidate curation limits the action space while enforcing diversity. Expedia used up to 10 images per property with enforced category coverage (room, lobby, exterior, amenities). This prevents collapse to a single content type and ensures users see varied information.
Convergence monitoring via top-k stability (Udemy approach) measures what percentage of the slate changes between time windows. High churn (40%) means the system is still exploring; low churn (2%) means it has converged and is ready to taper exploration. This is more actionable than raw regret, which is hard to measure in practice.
Guardrails prevent exploration from harming business metrics. Expedia required statistical significance and CTR improvement before adopting new winners. Exploration budgets can be dynamically capped (e.g., limit to 20% of traffic if metrics drop) or use staged rollouts (start on 10% traffic, expand after validation).
Multi-phase campaigns balance learning and validation. Expedia ran a one-month exploration phase (Thompson Sampling with an initial random week) followed by two weeks of A/B testing the bandit winner versus control to validate slow metrics (bookings, bounce rate) that weren't part of the fast feedback reward.
Long-tail entities (low-traffic properties in Expedia, niche content) never converge because they lack sufficient samples. Solutions include hierarchical priors (pool statistics across similar items), traffic gating (only run bandits on high-volume contexts), or running exploration in bulk campaigns and then fixing the winners.
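A hierarchical prior can be approximated by empirical pooling: center a new low-traffic item's Beta prior on the pooled CTR of similar high-traffic items. The pooling scheme, sibling statistics, and pseudo-count strength below are assumptions for illustration:

```python
def pooled_prior(sibling_stats, prior_strength=20):
    """Build an informative Beta(alpha, beta) prior for a low-traffic item
    from the pooled CTR of similar items (hypothetical pooling scheme).

    sibling_stats: list of (clicks, impressions) for similar items.
    prior_strength: how many pseudo-observations the prior is worth.
    """
    clicks = sum(c for c, _ in sibling_stats)
    views = sum(v for _, v in sibling_stats)
    pooled_ctr = clicks / views
    # Encode the pooled rate as `prior_strength` pseudo-observations.
    alpha = pooled_ctr * prior_strength
    beta = (1 - pooled_ctr) * prior_strength
    return alpha, beta

# Similar high-traffic properties: (clicks, impressions)
siblings = [(120, 2000), (90, 1800), (150, 2200)]
alpha0, beta0 = pooled_prior(siblings)  # prior centered at pooled CTR 0.06
```

The long-tail item then starts Thompson Sampling from Beta(alpha0, beta0) instead of a flat Beta(1, 1), so its early posterior reflects what similar items already taught the system.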
📌 Examples
Expedia hero image diversity: Up to 10 candidates per property with category diversity (room, lobby, exterior, pool, dining, amenities). Thompson Sampling on CTR but new image required both highest CTR and statistical significance versus incumbent. Follow-up A/B validation phase checked bookings and bounce rate over 2 weeks.
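One way to implement a "highest CTR plus statistical significance" guardrail is a two-proportion z-test against the incumbent. The source does not specify Expedia's exact test, so this is an illustrative sketch with made-up counts:

```python
import math

def guardrail_pass(clicks_new, n_new, clicks_inc, n_inc, z_crit=1.96):
    """Adopt the challenger only if its CTR beats the incumbent AND a
    pooled two-proportion z-test exceeds the ~95% critical value.
    (Illustrative guardrail; not necessarily Expedia's actual test.)"""
    p_new, p_inc = clicks_new / n_new, clicks_inc / n_inc
    if p_new <= p_inc:
        return False  # no CTR improvement, regardless of significance
    p_pool = (clicks_new + clicks_inc) / (n_new + n_inc)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_new + 1 / n_inc))
    z = (p_new - p_inc) / se
    return z > z_crit

# Challenger at 6% CTR vs incumbent at 5% CTR over 10k impressions each:
adopt = guardrail_pass(600, 10000, 500, 10000)
```

A challenger with only a marginal lift (e.g., 5.1% vs 5.0% on the same sample sizes) fails the test, so noisy short-term wins never displace the incumbent.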
Udemy convergence monitoring: Track the percentage change in the top 3 recommendation units week over week. Churn started at 40% (high exploration), decreased to 25% (learning), then 8% (converging), and finally 2% (stable). At 2% churn, the system tapered exploration and locked into exploit mode.
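The Udemy-style churn metric reduces to a set difference between consecutive slates. A minimal sketch, assuming slates are lists of unit IDs (the IDs below are hypothetical):

```python
def topk_churn(prev_slate, curr_slate):
    """Fraction of the current top-k slate absent from the previous slate."""
    prev, curr = set(prev_slate), set(curr_slate)
    return len(curr - prev) / len(curr)

# Illustrative consecutive weekly top-3 slates:
week1 = ["u7", "u3", "u9"]
week2 = ["u7", "u3", "u1"]       # one of three units changed
churn = topk_churn(week1, week2)  # 1/3
```

Tracking this value over time gives the 40% → 25% → 8% → 2% trajectory described above; once it plateaus near zero, exploration can be tapered.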
Scribd exploration budget: First week of full randomization across 42 row types to seed all arms with initial data (cold start). After week 1, bandits exploit based on accumulated statistics. If overall engagement drops below 95% of baseline, exploration traffic is capped at 10% and the majority of traffic is forced to the best-known arms.
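The dynamic exploration budget described in this example can be expressed as a simple gating function. The default and capped fractions and the 95% floor mirror the numbers above but are otherwise illustrative:

```python
def exploration_fraction(current_metric, baseline,
                         default=0.20, capped=0.10, floor_ratio=0.95):
    """Dynamic exploration budget: cut exploration traffic when the
    guardrail metric falls below a fraction of its baseline.
    (Thresholds are illustrative, following the Scribd-style rule.)"""
    if current_metric < floor_ratio * baseline:
        return capped   # metric degraded: force most traffic to exploit
    return default      # metric healthy: keep the normal exploration budget

healthy = exploration_fraction(0.96, 1.0)   # engagement at 96% of baseline
degraded = exploration_fraction(0.90, 1.0)  # engagement at 90% of baseline
```

Each serving decision then routes a `healthy`- or `degraded`-sized fraction of traffic to the exploring policy and the remainder to the current best-known arms.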