
Multi-Armed Bandits: Balancing Exploration and Exploitation

Definition
Multi-armed bandits are algorithms that balance exploration (trying new options to learn their quality) with exploitation (choosing the best known option). Named after slot machines, bandits solve the problem: how do you find the best option while wasting minimal traffic on inferior choices?

THE PROBLEM WITH A/B TESTING

Traditional A/B testing assigns traffic equally across variants until statistical significance is reached. If you have 10 variants and one is clearly best after day 1, you still waste 90% of traffic on inferior variants for weeks. Bandits front-load learning: as evidence accumulates, they automatically shift traffic toward winners while maintaining enough exploration to detect if conditions change.
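This traffic shift can be sketched with Thompson Sampling (one of the two algorithms in this card's title): each arm keeps a Beta posterior over its click-through rate, and each decision samples from those posteriors and plays the arm with the highest draw. The three CTRs, seed, and horizon below are invented purely for illustration:

```python
import random

# Hypothetical click-through rates for three thumbnail variants,
# unknown to the algorithm; values chosen only for this demo.
TRUE_CTR = [0.04, 0.05, 0.11]
n_arms = len(TRUE_CTR)

# Beta(1, 1) priors: alpha counts successes, beta counts failures.
alpha = [1] * n_arms
beta = [1] * n_arms
pulls = [0] * n_arms

random.seed(42)
for _ in range(10_000):
    # Sample a plausible CTR from each arm's posterior; play the largest.
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(n_arms)]
    arm = samples.index(max(samples))

    # Simulate showing that variant and observing a click / no click.
    clicked = 1 if random.random() < TRUE_CTR[arm] else 0
    alpha[arm] += clicked
    beta[arm] += 1 - clicked
    pulls[arm] += 1

print("pulls per arm:", pulls)  # traffic concentrates on the best variant
```

Early on, wide posteriors make all arms plausible (exploration); as evidence accumulates, the best arm's posterior dominates and receives nearly all traffic (exploitation), with no manual stopping rule.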

REGRET: THE CORE METRIC

Bandits optimize for regret: the difference between the reward you actually received and what you would have received by always choosing the best arm. Optimal algorithms achieve logarithmic regret, meaning cumulative regret grows as log(T) over T decisions, so the average regret per decision, log(T)/T, approaches zero. In practice, this means the bandit converges to the optimal arm while minimizing wasted traffic.
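To make regret concrete, here is a minimal UCB1 simulation (the other algorithm named in this card's title) that tracks the expected regret of every pull. The reward rates, horizon, and seed are assumptions for the demo, not real data:

```python
import math
import random

TRUE_RATE = [0.1, 0.2, 0.5]  # hypothetical reward probabilities, unknown to the algorithm
BEST = max(TRUE_RATE)
n_arms = len(TRUE_RATE)

counts = [0] * n_arms        # pulls per arm
totals = [0.0] * n_arms      # summed reward per arm
regret = 0.0                 # expected regret accumulated so far

random.seed(0)
T = 5_000
for t in range(1, T + 1):
    if t <= n_arms:
        arm = t - 1          # initialization: pull each arm once
    else:
        # UCB1 score: empirical mean plus a confidence bonus
        # that shrinks as an arm accumulates pulls.
        scores = [totals[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
                  for i in range(n_arms)]
        arm = scores.index(max(scores))

    reward = 1 if random.random() < TRUE_RATE[arm] else 0
    counts[arm] += 1
    totals[arm] += reward
    regret += BEST - TRUE_RATE[arm]  # expected shortfall of this pull

print(f"pulls per arm: {counts}, cumulative regret: {regret:.1f}")
```

Suboptimal arms are pulled only often enough to keep their confidence bonuses honest, so regret accumulates logarithmically rather than linearly; uniform random allocation over the same horizon would accrue regret proportional to T.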

💡 Key Insight: Bandits trade short-term reward (exploitation) for long-term learning (exploration). The key is doing this efficiently: explore enough to find winners, but not so much that you waste traffic.
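The simplest policy exhibiting this trade-off is epsilon-greedy, a common baseline (not named in this card): reserve a small fixed fraction of traffic for random exploration and exploit the best empirical arm otherwise. The rates, epsilon, and seed below are illustrative assumptions:

```python
import random

TRUE_RATE = [0.1, 0.2, 0.5]  # hypothetical reward probabilities, unknown to the learner
EPSILON = 0.1                # fraction of traffic reserved for exploration
n_arms = len(TRUE_RATE)

counts = [0] * n_arms
totals = [0.0] * n_arms

random.seed(1)
for _ in range(5_000):
    if random.random() < EPSILON or 0 in counts:
        arm = random.randrange(n_arms)  # explore: try a random arm
    else:
        means = [totals[i] / counts[i] for i in range(n_arms)]
        arm = means.index(max(means))   # exploit: best empirical arm so far

    reward = 1 if random.random() < TRUE_RATE[arm] else 0
    counts[arm] += 1
    totals[arm] += reward
```

With a fixed epsilon the exploration never decays, so a constant fraction of traffic is wasted forever; Thompson Sampling and UCB improve on this by shrinking exploration automatically as evidence accumulates.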

COMMON APPLICATIONS

Hero image or thumbnail selection (5-20 variants), notification template optimization, homepage layout testing, and ad creative selection. The sweet spot is 5-50 discrete options where reward feedback is fast (seconds to minutes) and you have high traffic (thousands of decisions per day).

💡 Key Takeaways
Bandits balance exploration and exploitation, automatically shifting traffic toward winners as evidence accumulates
Regret measures wasted opportunity; optimal algorithms achieve log(T) regret, meaning average regret per decision approaches zero
Front-loads learning compared to A/B testing: discovers winners days faster while wasting less traffic on losers
Best for 5-50 discrete options with fast reward feedback and high traffic (thousands of decisions per day)
📌 Interview Tips
1. When asked about online optimization, explain the exploration-exploitation trade-off and why regret is the core metric
2. Compare to A/B testing: equal allocation wastes 90% of traffic on inferior variants, while bandits shift toward winners automatically
3. Mention log(T) regret to show you understand optimal convergence properties of bandit algorithms