
Multi-Armed Bandits: Balancing Exploration and Exploitation

Definition
Multi-armed bandits are algorithms that balance exploration (trying new options to learn their quality) with exploitation (choosing the best known option). Named after slot machines, bandits solve the problem: how do you find the best option while wasting minimal traffic on inferior choices?

THE PROBLEM WITH A/B TESTING

Traditional A/B testing assigns traffic equally across variants until statistical significance is reached. If you have 10 variants and one is clearly best after day 1, you still waste 90% of traffic on inferior variants for weeks. Bandits front-load learning: as evidence accumulates, they automatically shift traffic toward winners while maintaining enough exploration to detect if conditions change.
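This traffic shift can be sketched with Thompson Sampling (one of the two algorithms in this card's title): each arm keeps a Beta posterior over its click-through rate, and each decision samples from those posteriors and plays the arm with the highest draw. The three CTRs, seed, and horizon below are invented purely for illustration:

```python
import random

# Hypothetical click-through rates for three thumbnail variants,
# unknown to the algorithm; values chosen only for this demo.
TRUE_CTR = [0.04, 0.05, 0.11]
n_arms = len(TRUE_CTR)

# Beta(1, 1) priors: alpha counts successes, beta counts failures.
alpha = [1] * n_arms
beta = [1] * n_arms
pulls = [0] * n_arms

random.seed(42)
for _ in range(10_000):
    # Sample a plausible CTR from each arm's posterior; play the largest.
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
    arm = samples.index(max(samples))

    # Simulate showing that variant and observing a click / no click.
    clicked = 1 if random.random() < TRUE_CTR[arm] else 0
    alpha[arm] += clicked
    beta[arm] += 1 - clicked
    pulls[arm] += 1

print("pulls per arm:", pulls)  # traffic concentrates on the best variant
```

Early on, wide posteriors make all arms plausible (exploration); as evidence accumulates, the best arm's posterior dominates and receives nearly all traffic (exploitation), with no manual stopping rule.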

REGRET: THE CORE METRIC

Bandits optimize for regret: the difference between the reward you actually received and what you would have received by always choosing the best arm. Optimal algorithms achieve logarithmic regret, meaning cumulative regret grows as log(T) over T decisions, so the average regret per decision, log(T)/T, approaches zero. In practice, this means the bandit converges to the optimal arm while minimizing wasted traffic.
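To make regret concrete, here is a minimal UCB1 simulation (the other algorithm named in this card's title) that tracks the expected regret of every pull. The reward rates, horizon, and seed are assumptions for the demo, not real data:

```python
import math
import random

TRUE_RATE = [0.1, 0.2, 0.5]  # hypothetical reward probabilities, unknown to the algorithm
BEST = max(TRUE_RATE)
n_arms = len(TRUE_RATE)

counts = [0] * n_arms        # pulls per arm
totals = [0.0] * n_arms      # summed reward per arm
regret = 0.0                 # expected regret accumulated so far

random.seed(0)
T = 5_000
for t in range(1, T + 1):
    if t <= n_arms:
        arm = t - 1          # initialization: pull each arm once
    else:
        # UCB1 score: empirical mean plus a confidence bonus
        # that shrinks as an arm accumulates pulls.
        scores = [totals[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
                  for i in range(n_arms)]
        arm = scores.index(max(scores))

    reward = 1 if random.random() < TRUE_RATE[arm] else 0
    counts[arm] += 1
    totals[arm] += reward
    regret += BEST - TRUE_RATE[arm]  # expected shortfall of this pull

print(f"pulls per arm: {counts}, cumulative regret: {regret:.1f}")
```

Suboptimal arms are pulled only often enough to keep their confidence bonuses honest, so regret accumulates logarithmically rather than linearly; uniform random allocation over the same horizon would accrue regret proportional to T.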

💡 Key Insight: Bandits trade short-term reward (exploitation) for long-term learning (exploration). The key is doing this efficiently: explore enough to find winners, but not so much that you waste traffic.
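The simplest policy exhibiting this trade-off is epsilon-greedy, a common baseline (not named in this card): reserve a small fixed fraction of traffic for random exploration and exploit the best empirical arm otherwise. The rates, epsilon, and seed below are illustrative assumptions:

```python
import random

TRUE_RATE = [0.1, 0.2, 0.5]  # hypothetical reward probabilities, unknown to the learner
EPSILON = 0.1                # fraction of traffic reserved for exploration
n_arms = len(TRUE_RATE)

counts = [0] * n_arms
totals = [0.0] * n_arms

random.seed(1)
for _ in range(5_000):
    if random.random() < EPSILON or 0 in counts:
        arm = random.randrange(n_arms)  # explore: try a random arm
    else:
        means = [totals[i] / counts[i] for i in range(n_arms)]
        arm = means.index(max(means))   # exploit: best empirical arm so far

    reward = 1 if random.random() < TRUE_RATE[arm] else 0
    counts[arm] += 1
    totals[arm] += reward
```

With a fixed epsilon the exploration never decays, so a constant fraction of traffic is wasted forever; Thompson Sampling and UCB improve on this by shrinking exploration automatically as evidence accumulates.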

COMMON APPLICATIONS

Hero image or thumbnail selection (5-20 variants), notification template optimization, homepage layout testing, and ad creative selection. The sweet spot is 5-50 discrete options where reward feedback is fast (seconds to minutes) and you have high traffic (thousands of decisions per day).

💡 Key Takeaways
Bandits balance exploration and exploitation, automatically shifting traffic toward winners as evidence accumulates
Regret measures wasted opportunity; optimal algorithms achieve log(T) regret, meaning average regret per decision approaches zero
Front-loads learning compared to A/B testing: discovers winners days faster while wasting less traffic on losers
Best for 5-50 discrete options with fast reward feedback and high traffic (thousands of decisions per day)
📌 Interview Tips
1. When asked about online optimization, explain the exploration-exploitation trade-off and why regret is the core metric
2. Compare to A/B testing: equal allocation wastes 90% of traffic on inferior variants, while bandits shift toward winners automatically
3. Mention log(T) regret to show you understand optimal convergence properties of bandit algorithms