Core Bandit Algorithms: Epsilon Greedy, UCB, and Thompson Sampling
Three classic algorithms solve the bandit problem with different exploration strategies. Epsilon greedy is the simplest: with probability ε (typically 0.1, i.e. 10% of requests), pick a random arm to explore; otherwise exploit the arm with the highest observed mean reward. It wastes exploration on obviously bad arms, but with constant step size updates it handles non-stationary environments well, making it robust when user preferences shift over time.
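A minimal epsilon-greedy sketch, assuming a fixed set of arms and a constant step size of 0.05; the class and parameter names are illustrative, not from the source.

```python
import random

class EpsilonGreedy:
    def __init__(self, n_arms: int, epsilon: float = 0.1, step_size: float = 0.05):
        self.epsilon = epsilon
        self.step_size = step_size        # constant step size tracks drifting rewards
        self.means = [0.0] * n_arms       # running reward estimate per arm

    def select_arm(self) -> int:
        if random.random() < self.epsilon:            # explore with probability epsilon
            return random.randrange(len(self.means))
        # exploit: arm with the highest observed mean reward
        return max(range(len(self.means)), key=lambda a: self.means[a])

    def update(self, arm: int, reward: float) -> None:
        # Constant-step-size update weights recent rewards more heavily,
        # which helps when user preferences shift over time.
        self.means[arm] += self.step_size * (reward - self.means[arm])
```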
Upper Confidence Bound (UCB) picks the arm with the highest optimistic estimate: mean reward plus an uncertainty bonus that shrinks as you collect more samples for that arm. The classic UCB1 score is mean + sqrt(2 × log(total_pulls) / arm_pulls). This guarantees every arm is eventually tried enough times and yields strong theoretical regret bounds on stationary problems. However, UCB struggles with non-stationary rewards because the uncertainty bonus decays too conservatively, and it becomes impractical with large action spaces (thousands of arms), since each arm must be tried at least once and needs its own pull count and confidence bound.
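A UCB1-style sketch under the stationary-reward assumption; the class name and bookkeeping fields are mine, and each arm is pulled once before the formula applies.

```python
import math

class UCB1:
    def __init__(self, n_arms: int):
        self.counts = [0] * n_arms   # pulls per arm
        self.sums = [0.0] * n_arms   # total reward per arm

    def select_arm(self) -> int:
        # Try every arm once so the confidence bound is defined.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total_pulls = sum(self.counts)

        def score(arm: int) -> float:
            mean = self.sums[arm] / self.counts[arm]
            bonus = math.sqrt(2 * math.log(total_pulls) / self.counts[arm])  # uncertainty bonus
            return mean + bonus

        return max(range(len(self.counts)), key=score)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        self.sums[arm] += reward
```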
Thompson Sampling takes a Bayesian approach: maintain a probability distribution (posterior) over each arm's true reward, draw one sample from each posterior, and pick the arm with the highest sample. For binary rewards like CTR, this uses Beta-Bernoulli conjugate priors (start with Beta(1,1), or add pseudocounts like Beta(10,10) to prevent early overconfidence). Empirically, Thompson Sampling achieves the best regret across diverse problems and extends naturally to contextual features and complex reward models.
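A Beta-Bernoulli Thompson Sampling sketch for binary rewards, using the Beta(10,10) pseudocount prior mentioned above; class and parameter names are illustrative.

```python
import random

class ThompsonSampling:
    def __init__(self, n_arms: int, prior_alpha: float = 10.0, prior_beta: float = 10.0):
        self.alpha = [prior_alpha] * n_arms  # prior pseudocounts + observed successes
        self.beta = [prior_beta] * n_arms    # prior pseudocounts + observed failures

    def select_arm(self) -> int:
        # Draw one sample from each arm's posterior and play the largest draw;
        # wide posteriors produce more variable samples, hence more exploration.
        samples = [random.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return max(range(len(samples)), key=lambda i: samples[i])

    def update(self, arm: int, clicked: bool) -> None:
        if clicked:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1
```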
In production, Thompson Sampling dominates. Udemy uses it as its primary algorithm, with exploration decayed over time. Expedia deployed Thompson Sampling with Beta-Bernoulli posteriors for hero image CTR optimization. The key advantage is that it naturally balances uncertainty (wide posteriors explore more) against performance (high posterior means exploit more) without manual tuning of exploration rates.
💡 Key Takeaways
• Epsilon greedy with ε=0.1 wastes 10% of traffic on random exploration, including obviously bad arms, but constant step size updates make it robust to non-stationary environments where user preferences shift. Simple to implement with O(1) per-request computation.
• UCB has strong theoretical regret bounds (logarithmic) in stationary problems but struggles when rewards change over time because confidence intervals decay too slowly. It becomes impractical with thousands of arms due to per-arm computation of uncertainty bonuses.
• Thompson Sampling empirically outperforms the others in practice and is the production standard at Udemy and Expedia. It naturally allocates more exploration to uncertain arms (wide posteriors) and more exploitation to high-reward arms (high posterior means) without manual tuning.
• For binary rewards like CTR, use Beta-Bernoulli conjugate priors: Beta(clicks+α, no_clicks+β). Start with modest priors like Beta(10,10) rather than Beta(1,1) to prevent early overconfidence and premature convergence after just a few samples.
• All three algorithms require O(1) per-request sampling (pick an arm) and O(1) feedback updates (increment counters), making them suitable for high-throughput production systems. The state needed per arm is just an impression count and a success count or running mean (see the sketch after this list).
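To make the last takeaway concrete, here is a sketch of the per-arm state, assuming binary success/failure feedback; the field names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ArmState:
    impressions: int = 0   # total pulls of this arm
    successes: int = 0     # e.g. clicks

    @property
    def mean(self) -> float:
        return self.successes / self.impressions if self.impressions else 0.0

    def record(self, success: bool) -> None:
        # O(1) feedback update: just increment two counters.
        self.impressions += 1
        self.successes += int(success)
```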
📌 Examples
Udemy Thompson Sampling implementation: Beta-Bernoulli posteriors for slate bandit optimizing top 3 units, composite reward of clicks + enrollments within 15 minutes, decayed exploration over time as posteriors sharpen.
Expedia Thompson Sampling for hero images: Beta(clicks+α, impressions-clicks+β) per image, sampled once per request to select image, streaming feedback pipeline updates counters in real time, required statistical significance versus incumbent before adopting new winner.
Epsilon greedy with ε=0.1 and a sliding window: keep the last 10,000 samples per arm with constant step size updates to handle seasonal traffic patterns, discarding old data to adapt quickly when CTR distributions shift (a sliding-window sketch follows).
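A sliding-window epsilon-greedy sketch along the lines of this example, assuming a 10,000-sample window per arm; names and defaults are illustrative, and the constant step size alternative appears in the earlier epsilon-greedy sketch.

```python
import random
from collections import deque

class SlidingWindowEpsilonGreedy:
    def __init__(self, n_arms: int, epsilon: float = 0.1, window: int = 10_000):
        self.epsilon = epsilon
        # deque(maxlen=...) evicts the oldest sample once the window is full
        self.windows = [deque(maxlen=window) for _ in range(n_arms)]

    def _mean(self, arm: int) -> float:
        samples = self.windows[arm]
        return sum(samples) / len(samples) if samples else 0.0

    def select_arm(self) -> int:
        if random.random() < self.epsilon:
            return random.randrange(len(self.windows))
        return max(range(len(self.windows)), key=self._mean)

    def update(self, arm: int, reward: float) -> None:
        self.windows[arm].append(reward)  # old data falls out of the window automatically
```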