A/B Testing & Experimentation › Multi-Armed Bandits (Thompson Sampling, UCB) · Medium · ⏱️ ~2 min

Upper Confidence Bound (UCB): Optimism Under Uncertainty

THE UCB PRINCIPLE

Upper Confidence Bound (UCB) embodies the principle of "optimism in the face of uncertainty." For each arm, it maintains an estimate of the expected reward plus a bonus for uncertainty. Arms that have been tried less frequently have higher uncertainty, so they get a bigger bonus. This bonus naturally decays as you collect more observations.

THE UCB1 FORMULA

The classic UCB1 algorithm selects the arm with the highest score:

score(a) = Q(a) + c × sqrt(ln(t) / N(a))

where Q(a) is the empirical mean reward for arm a, N(a) is the number of times arm a has been pulled, t is the total number of pulls so far, and c controls exploration intensity (typically sqrt(2)).

The uncertainty bonus sqrt(ln(t) / N(a)) shrinks as you pull arm a more often (N(a) grows) but grows slowly with total time (ln(t)). This means an arm's bonus keeps rising while it goes unpulled, so even seemingly weak arms are eventually re-explored.
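The selection rule above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the function name `ucb1_select` and the convention of passing per-arm pull counts and reward sums are assumptions for the example. Arms with zero pulls are played first, since the formula is undefined when N(a) = 0.

```python
import math

def ucb1_select(counts, rewards, c=math.sqrt(2)):
    """Pick an arm by UCB1: empirical mean plus exploration bonus.

    counts[a]  = number of times arm a has been pulled (N(a))
    rewards[a] = sum of rewards observed for arm a
    """
    t = sum(counts)
    # Pull each arm once before applying the formula (avoids N(a) = 0).
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [
        rewards[a] / counts[a] + c * math.sqrt(math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)
```

Note how the bonus can override the empirical mean: an arm pulled only once keeps a large bonus even when its observed mean is mediocre, so it still gets revisited.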

TUNING THE EXPLORATION PARAMETER

The parameter c controls the exploration-exploitation balance. Higher c means more exploration: arms with few trials get larger bonuses. Lower c means more exploitation: the algorithm commits to high performers sooner. In practice, c = sqrt(2) works well for bounded rewards in [0,1]. For unbounded rewards, tune c based on the reward scale.
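A quick arithmetic check makes the effect of c concrete. The specific numbers below (t = 1000 total pulls, arms pulled 10 vs. 500 times, c = 0.5 vs. sqrt(2)) are illustrative choices, not values from the text:

```python
import math

# Exploration bonus c * sqrt(ln(t) / N(a)) after t = 1000 total pulls,
# for an arm pulled 10 times vs. 500 times, at two values of c.
t = 1000
for c in (0.5, math.sqrt(2)):
    for n in (10, 500):
        bonus = c * math.sqrt(math.log(t) / n)
        print(f"c={c:.3f}  N(a)={n:4d}  bonus={bonus:.3f}")
```

With c = sqrt(2), the rarely pulled arm's bonus (about 1.18) exceeds the entire [0,1] reward range, so it will be explored regardless of its observed mean; with c = 0.5, a strong incumbent can suppress it much sooner.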

⚠️ Key Trade-off: UCB is deterministic: given the same history, it always picks the same arm. This makes debugging easy but can be exploited if adversaries observe your choices.
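Determinism is easy to demonstrate: rerunning the selection on a frozen history always yields the same arm. The history values below are hypothetical, chosen only for the illustration.

```python
import math

def ucb1_arm(counts, sums, c=math.sqrt(2)):
    """Deterministic UCB1 choice from a fixed history (no randomness)."""
    t = sum(counts)
    return max(range(len(counts)),
               key=lambda a: sums[a] / counts[a]
               + c * math.sqrt(math.log(t) / counts[a]))

# Hypothetical history: per-arm pull counts and reward sums.
history = ([20, 5, 5], [12.0, 3.5, 2.0])
# Re-running selection on the same history always returns the same arm.
choices = {ucb1_arm(*history) for _ in range(100)}
```

Contrast this with Thompson Sampling, where each decision draws from posterior distributions and the chosen arm can differ across identical histories.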

WHEN UCB WORKS BEST

UCB excels with stationary reward distributions (arms do not change over time), a moderate number of arms (roughly 5-20), and when you want interpretable, reproducible behavior. It struggles with heavy-tailed rewards, where occasional outliers inflate mean estimates.
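A small simulation shows the stationary case working as intended: over time, UCB1 concentrates its pulls on the best arm. The Bernoulli success probabilities, horizon, and seed below are illustrative assumptions.

```python
import math
import random

def run_ucb1(probs, horizon, c=math.sqrt(2), seed=0):
    """Simulate UCB1 on stationary Bernoulli arms; return per-arm pull counts."""
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        untried = [a for a in range(k) if counts[a] == 0]
        if untried:
            # Initialization: pull each arm once.
            arm = untried[0]
        else:
            arm = max(range(k), key=lambda a: sums[a] / counts[a]
                      + c * math.sqrt(math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts

counts = run_ucb1([0.2, 0.5, 0.8], horizon=5000)
# The best arm (p = 0.8) should receive the bulk of the pulls.
```

If you replace the Bernoulli draws with a heavy-tailed reward (e.g. occasional large spikes), the empirical means Q(a) become unstable and this clean convergence degrades, matching the caveat above.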

💡 Key Takeaways
UCB1: score(a) = Q(a) + c × sqrt(ln(t)/N(a)) where uncertainty bonus decays as arm is pulled more
Parameter c controls exploration intensity; sqrt(2) is typical for [0,1] rewards
Deterministic: same history yields same choice, making debugging easy but potentially exploitable
Best for stationary problems with 5-20 arms; struggles with heavy-tailed rewards
📌 Interview Tips
1. When explaining UCB, write out the formula and explain each component: empirical mean, uncertainty bonus, and exploration parameter
2. Mention that UCB is deterministic, which aids debugging but differs from stochastic Thompson Sampling
3. Discuss the trade-off: high c explores more but wastes traffic, low c exploits faster but may miss better arms