Upper Confidence Bound (UCB): Optimism Under Uncertainty
THE UCB PRINCIPLE
Upper Confidence Bound (UCB) embodies the principle of "optimism in the face of uncertainty." For each arm, it maintains an estimate of the expected reward plus a bonus for uncertainty. Arms that have been tried less frequently have higher uncertainty, so they get a bigger bonus. This bonus naturally decays as you collect more observations.
THE UCB1 FORMULA
The classic UCB1 algorithm selects the arm with the highest score:

    score(a) = Q(a) + c × sqrt(ln(t) / N(a))

where Q(a) is the empirical mean reward for arm a, N(a) is the number of times arm a has been pulled, t is the total number of pulls so far, and c controls exploration intensity (typically sqrt(2)).
The uncertainty bonus sqrt(ln(t) / N(a)) shrinks as you pull arm a more often (N(a) grows) but grows slowly with total time (ln(t)). This guarantees that even an arm that currently looks poor is eventually re-explored: a neglected arm's bonus keeps growing with ln(t) until the arm wins another pull.
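A minimal sketch of UCB1 in Python, assuming rewards in [0,1] and incremental mean updates (the function names ucb1_select and update are illustrative, not from any standard library):

```python
import math

def ucb1_select(counts, values, t, c=math.sqrt(2)):
    """Pick the arm maximizing Q(a) + c * sqrt(ln(t) / N(a)).

    counts[a] is N(a); values[a] is the empirical mean Q(a);
    t is the total number of pulls so far.
    """
    # Each arm must be pulled once before its bonus is defined (N(a) > 0).
    for a, n in enumerate(counts):
        if n == 0:
            return a
    scores = [values[a] + c * math.sqrt(math.log(t) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=lambda a: scores[a])

def update(counts, values, arm, reward):
    """Incrementally update the empirical mean Q(arm) after observing a reward."""
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]
```

Run in a loop (select, observe reward, update), UCB1 concentrates pulls on the best arm while still sampling the others at a logarithmic rate.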
TUNING THE EXPLORATION PARAMETER
The parameter c controls the exploration-exploitation balance. Higher c means more exploration: arms with few trials get larger bonuses. Lower c means more exploitation: the algorithm commits to high performers sooner. In practice, c = sqrt(2) works well for rewards bounded in [0,1]. For unbounded rewards, tune c to match the reward scale, since a bonus that is tiny relative to typical rewards does no exploration at all.
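To see concretely how c scales the bonus, the short sketch below (the helper name bonus is mine, not a standard API) evaluates the uncertainty term for a rarely pulled arm versus a frequently pulled one at the same point in time:

```python
import math

def bonus(t, n, c):
    """Uncertainty bonus c * sqrt(ln(t) / n) for an arm pulled n of t times."""
    return c * math.sqrt(math.log(t) / n)

# After 100 total pulls, compare an arm tried 5 times with one tried 50 times.
for c in (0.5, math.sqrt(2), 3.0):
    print(f"c={c:.2f}: bonus(N=5)={bonus(100, 5, c):.3f}, "
          f"bonus(N=50)={bonus(100, 50, c):.3f}")
```

The rarely pulled arm always carries the larger bonus, and raising c widens that gap, which is exactly why larger c values keep under-sampled arms in contention longer.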
WHEN UCB WORKS BEST
UCB excels with stationary reward distributions (arms that do not change over time), a moderate number of arms (roughly 5-20), and when you want interpretable, reproducible behavior. It struggles with heavy-tailed rewards, where occasional outliers inflate the empirical mean estimates.