Recommendation Systems • Diversity & Exploration (Multi-Armed Bandits) • Easy • ⏱️ ~3 min
What are Multi-Armed Bandits and Why Use Them for Recommendations?
Multi-Armed Bandits (MAB) address the exploration-versus-exploitation dilemma in recommendation systems. The name comes from slot machines ("one-armed bandits"): you face K slot machines, each with an unknown payout rate, and must decide which to pull to maximize your total winnings. In recommendations, each "arm" is a content type, widget, ranking strategy, or item you could show users.
The core challenge is balancing two competing goals. Exploitation means showing what currently performs best based on the data you have (maximize short-term reward). Exploration means trying options with uncertain performance to discover potentially better choices (minimize regret, the opportunity cost versus the best possible strategy). Without exploration, you overfit to historical data and create popularity bias, where new or niche content never gets exposure.
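To make the trade-off concrete, here is a minimal epsilon-greedy sketch. It is not tied to any system mentioned here; the arm names and click probabilities are invented for illustration. With probability epsilon it shows a random arm (exploration), otherwise the arm with the best observed mean reward (exploitation).

```python
import random

# Hypothetical arms: each could be a row type, widget, or ranking strategy.
# True click probabilities are unknown to the policy; they only drive the simulation.
TRUE_CTR = {"editorial_picks": 0.04, "trending_now": 0.06, "because_you_read": 0.09}

EPSILON = 0.1                                # fraction of traffic spent exploring
counts = {arm: 0 for arm in TRUE_CTR}        # impressions per arm
clicks = {arm: 0 for arm in TRUE_CTR}        # clicks per arm

def choose_arm():
    untried = [a for a, c in counts.items() if c == 0]
    if untried:                              # show every arm at least once
        return random.choice(untried)
    if random.random() < EPSILON:            # explore: pick a random arm
        return random.choice(list(TRUE_CTR))
    return max(counts, key=lambda a: clicks[a] / counts[a])  # exploit: best observed CTR

for _ in range(10_000):                      # simulated impressions
    arm = choose_arm()
    counts[arm] += 1
    clicks[arm] += random.random() < TRUE_CTR[arm]

for arm in TRUE_CTR:
    print(arm, counts[arm], round(clicks[arm] / counts[arm], 4))
```

With only 10% of traffic held for exploration, the majority of impressions flow to whatever arm currently looks best, while the exploration slice keeps updating the estimates for the other arms.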
Bandits shine when you have many variants to test and the cost of showing poor options is high. Unlike A/B testing, which splits traffic evenly across all variants for weeks, bandits adapt traffic in real time based on observed performance. At Scribd, testing all possible homepage layouts would require 5×10^15 experiments (combinatorially infeasible), but bandits converged to strong layouts within one week by dynamically allocating more traffic to better-performing rows.
The reward signal is typically a fast feedback metric like Click-Through Rate (CTR) over a 15-minute window rather than a slow metric like 30-day retention. This keeps attribution clean and learning cycles short. Udemy uses clicks and enrollments within 15 minutes as its composite reward, while Expedia optimizes hero image CTR measured within the session.
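As a rough sketch of how such a composite, time-windowed reward could be computed (the event schema, weights, and cutoff below are illustrative assumptions, not Udemy's actual definitions):

```python
from dataclasses import dataclass

WINDOW_SECONDS = 15 * 60                 # fast feedback window after the impression
CLICK_WEIGHT, ENROLL_WEIGHT = 1.0, 5.0   # illustrative weights for the composite reward

@dataclass
class Event:
    kind: str                            # "click" or "enroll"
    seconds_after_impression: float

def composite_reward(events):
    """Sum weighted events that land inside the attribution window."""
    reward = 0.0
    for e in events:
        if e.seconds_after_impression > WINDOW_SECONDS:
            continue                     # too late to attribute to this impression
        if e.kind == "click":
            reward += CLICK_WEIGHT
        elif e.kind == "enroll":
            reward += ENROLL_WEIGHT
    return reward

print(composite_reward([Event("click", 120), Event("enroll", 600), Event("click", 2000)]))
# -> 6.0: the click at 2000 s falls outside the 15-minute window and is ignored
```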
💡 Key Takeaways
• Regret is the opportunity cost versus always picking the best arm. Bandit algorithms aim to minimize cumulative regret over time by balancing exploration of uncertain options with exploitation of known good options (a small worked regret calculation follows this list).
• Fast feedback loops are critical. Production systems use 15-minute reward windows (Udemy) or session-based CTR (Expedia) rather than slow metrics like 30-day retention to keep learning cycles short and attribution clean.
• Bandits beat A/B testing when you have many variants and high opportunity cost. Scribd faced 5×10^15 possible homepage layouts (42 row types across 10 positions), making exhaustive A/B testing infeasible, but bandits converged within one week.
• The action space (arms) can be content types, widget positions, ranking strategies, or individual items. Scribd used 42 arms (row types), Expedia used up to 10 hero images per property, and systems must balance arm count against the traffic needed per arm.
• Cold start and popularity bias are natural consequences of skipping exploration. New content never accumulates data if you only exploit, starving niche items and reinforcing whatever was historically popular regardless of true quality.
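A small worked example of cumulative regret, assuming fixed true mean rewards per arm (unknown to the policy) and a hand-written sequence of pulls chosen purely for illustration:

```python
# Hypothetical true mean rewards per arm (unknown to the policy).
TRUE_MEAN = {"A": 0.05, "B": 0.08, "C": 0.03}
BEST = max(TRUE_MEAN.values())           # 0.08: expected reward of the best arm

# Regret per pull = best possible expected reward minus the pulled arm's expected reward.
pulls = ["A", "B", "C", "B", "B", "A", "B", "B"]   # illustrative choice sequence
cumulative_regret = sum(BEST - TRUE_MEAN[arm] for arm in pulls)
print(round(cumulative_regret, 2))       # 0.03 + 0 + 0.05 + 0 + 0 + 0.03 + 0 + 0 = 0.11
```

A good bandit policy keeps pulling "B" more and more often, so the per-pull regret shrinks and cumulative regret grows slower than it would under a fixed even split.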
📌 Examples
Scribd homepage: 30 total bandits (10 positions × 3 user segments), each with 42 arms (row types). Converged within 1 week, achieved +10% reads from recommendations overall, with the best-performing row seeing a 4× activity uplift.
Udemy slate bandit: Optimizes the top 3 recommendation units (k=3 based on viewport visibility), with a composite reward of clicks + enrollments within a 15-minute window, using Thompson Sampling with exploration decayed over time.
Expedia hero images: Thompson Sampling on CTR with Beta-Bernoulli posteriors, up to 10 candidate images per property with enforced category diversity (room, lobby, exterior), an exploration phase of ~1 month followed by a 2-week A/B validation.
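For reference, a minimal Thompson Sampling loop with Beta-Bernoulli posteriors in the spirit of the Expedia example; the image categories and true CTRs are made up, and production details (priors, candidate generation, diversity constraints) are omitted.

```python
import random

# Hypothetical hero-image candidates with unknown true CTRs (used only to simulate clicks).
TRUE_CTR = {"room": 0.030, "lobby": 0.045, "exterior": 0.025}

# Beta(1, 1) prior per arm: alpha counts clicks, beta counts non-clicks.
alpha = {arm: 1.0 for arm in TRUE_CTR}
beta = {arm: 1.0 for arm in TRUE_CTR}

for _ in range(50_000):                              # simulated impressions
    # Thompson Sampling: draw a CTR from each arm's posterior, show the highest draw.
    sampled = {arm: random.betavariate(alpha[arm], beta[arm]) for arm in TRUE_CTR}
    arm = max(sampled, key=sampled.get)
    click = random.random() < TRUE_CTR[arm]
    # Conjugate Beta-Bernoulli update.
    alpha[arm] += 1.0 if click else 0.0
    beta[arm] += 0.0 if click else 1.0

for arm in TRUE_CTR:
    shown = alpha[arm] + beta[arm] - 2.0             # impressions received by this arm
    print(arm, int(shown), round(alpha[arm] / (alpha[arm] + beta[arm]), 4))
```

Because a weak arm's posterior stays wide, it still receives occasional traffic early on; as evidence accumulates, traffic concentrates on the strongest arm, which is exactly the adaptive allocation behavior described above.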