
Multi-Armed Bandits: Balancing Exploration and Exploitation

Multi-armed bandits solve the fundamental problem of choosing between K options repeatedly to maximize cumulative reward when you don't know which option is best. The name comes from slot machines (one-armed bandits), but real-world applications span recommendation systems, ad selection, and content personalization. Every decision creates tension: do you exploit the option that appears best so far, or explore uncertain options to discover whether they might be even better? The key metric is regret, which measures reward lost compared to always picking the best arm in hindsight. Classic bandit algorithms achieve regret proportional to log T, where T is the number of decisions, which means average regret per decision approaches zero over time.

In production at companies like Netflix, this translates to showing fewer suboptimal hero images as the system learns. Netflix reported 20 to 30 percent differences in click rate between artwork variants, making the exploration investment worthwhile.

Bandits shine when you need fast adaptation with limited traffic. Unlike traditional A/B tests that split traffic evenly for weeks, bandits shift traffic toward better options within hours or days. Yahoo deployed bandits on its Front Page Today Module handling tens of millions of impressions per day, with per-request computation under tens of milliseconds. The system continuously learns which news articles resonate with users while minimizing waste on poor performers.
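Thompson Sampling, one of the two algorithms named in this topic's title, is a common way to implement this exploration/exploitation trade-off for click-style feedback. Below is a minimal sketch assuming Bernoulli (click / no-click) rewards and a Beta posterior per arm; the banner click rates in the demo are made up for illustration and are not taken from the examples in this section.

```python
import random

class ThompsonSampler:
    def __init__(self, n_arms):
        # Beta(1, 1) prior for each arm, tracked as (successes, failures).
        self.successes = [0] * n_arms
        self.failures = [0] * n_arms

    def select_arm(self):
        # Sample a plausible click rate for each arm from its posterior and
        # pick the highest sample. Uncertain arms occasionally produce high
        # samples, which is where the exploration comes from.
        samples = [
            random.betavariate(1 + s, 1 + f)
            for s, f in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # reward is 1 for a click, 0 otherwise.
        if reward:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1

if __name__ == "__main__":
    # Hypothetical true click rates for three hero banners.
    true_rates = [0.04, 0.05, 0.065]
    sampler = ThompsonSampler(n_arms=len(true_rates))
    pulls = [0] * len(true_rates)

    for _ in range(100_000):
        arm = sampler.select_arm()
        reward = 1 if random.random() < true_rates[arm] else 0
        sampler.update(arm, reward)
        pulls[arm] += 1

    print("pulls per arm:", pulls)  # traffic concentrates on the best arm
```

Each decision needs only one Beta draw per arm, which is why per-request budgets of a few milliseconds, like those quoted below, are realistic.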
💡 Key Takeaways
Regret grows as log T for optimal algorithms, meaning average regret per decision approaches zero as you make more decisions over time (a UCB1 sketch follows this list)
Production systems at Yahoo handled tens of millions of impressions per day within a per-request computation budget of tens of milliseconds
Netflix observed 20 to 30 percent click rate differences between artwork variants, justifying exploration cost to find winners
Bandits shift traffic toward better arms earlier than fixed-split A/B tests, cutting days or weeks of wasted traffic on poor performers
Applications include hero image selection at 120 thousand requests per second, news article recommendation, and notification template optimization
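For reference, the log T regret figure in the first takeaway is the classic guarantee associated with UCB-style algorithms. Here is a minimal UCB1 sketch under the same Bernoulli-reward assumption as the Thompson Sampling example above; names are illustrative.

```python
import math

class UCB1:
    def __init__(self, n_arms):
        self.counts = [0] * n_arms    # pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self):
        # Pull each arm once before applying the UCB formula.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        # Empirical mean plus a confidence bonus that shrinks as an arm is
        # pulled more; this optimism under uncertainty drives exploration.
        scores = [
            self.values[arm] + math.sqrt(2 * math.log(total) / self.counts[arm])
            for arm in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental mean update avoids storing the full reward history.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

UCB1 is deterministic given the observed counts, which makes it easy to audit, while Thompson Sampling's randomization tends to spread exploration more smoothly across requests.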
📌 Examples
Commerce homepage selecting 1 of 10 hero banners at 120k requests per second peak with 50ms p99 latency, budgeting 5ms for bandit decision
Yahoo Front Page Today Module serving news recommendations with tens of millions of impressions daily and per request latency under 10ms
Netflix artwork personalization discovering that certain thumbnails drive 20 to 30 percent higher click through rate for the same title