
Core Bandit Algorithms: Epsilon Greedy, UCB, and Thompson Sampling

Core Concept
Epsilon-greedy is the simplest exploration strategy: with probability epsilon, show a random item; otherwise show the best predicted item. Easy to implement, easy to tune, but exploration is untargeted.

How Epsilon-Greedy Works

For each recommendation slot, draw a random number between 0 and 1. If it falls below epsilon (say 0.1), pick a uniformly random item from the catalog; otherwise, pick the highest-scoring item from your model. Typical epsilon values range from 0.05 to 0.15, depending on how much exploration traffic you can afford.
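The rule above can be sketched in a few lines. This is a minimal illustration, not a production implementation; the function name and the example scores are made up for demonstration.

```python
import random

def epsilon_greedy(scores, epsilon=0.1, rng=random):
    """With probability epsilon, return a uniformly random item index;
    otherwise return the index of the highest-scoring item."""
    if rng.random() < epsilon:
        return rng.randrange(len(scores))                    # explore: any item
    return max(range(len(scores)), key=scores.__getitem__)   # exploit: argmax

# Example: with epsilon=0.0 this always exploits and returns index 1.
model_scores = [0.2, 0.9, 0.5]
choice = epsilon_greedy(model_scores, epsilon=0.1)
```

Note that exploration here is uniform over the catalog, which is exactly the inefficiency UCB and Thompson Sampling address.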

UCB: Upper Confidence Bound

UCB adds optimism to uncertainty. For each item, compute score = predicted_value + exploration_bonus, where the bonus is larger for items with fewer observations. Items that have been shown rarely get higher bonuses, which encourages exploration; as observations accumulate, the bonus shrinks and exploitation dominates.
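A common concrete form of this bonus is the UCB1 rule, sqrt(c · ln(t) / n_i), shown in the sketch below. The function name and the exploration constant `c` are illustrative assumptions, not fixed by the text above.

```python
import math

def ucb1_select(means, counts, t, c=2.0):
    """UCB1-style rule: play any unseen arm first; otherwise pick
    argmax_i  mean_i + sqrt(c * ln(t) / counts_i)."""
    for i, n in enumerate(counts):
        if n == 0:               # an unseen arm effectively has an infinite bonus
            return i
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

# Two arms with equal empirical means: the rarely played arm gets the
# larger bonus and is selected.
arm = ucb1_select(means=[0.5, 0.5], counts=[100, 2], t=102)   # -> 1
```

As `counts[i]` grows, the `sqrt(...)` term shrinks toward zero, so the score converges to the empirical mean and exploitation takes over.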

Thompson Sampling

Model uncertainty as a probability distribution over the true reward. For each item, sample from its distribution and pick the item with highest sampled value. Items with high variance (uncertainty) occasionally sample high and get selected. Natural exploration without explicit exploration term. Often outperforms UCB in practice.
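For binary rewards (click / no click), the standard instantiation uses Beta posteriors, as the interview tips below also mention. A minimal sketch, assuming a uniform Beta(1, 1) prior and illustrative counts:

```python
import random

def thompson_select(successes, failures, rng=random):
    """Sample theta_i ~ Beta(s_i + 1, f_i + 1) for each arm (uniform prior),
    then pick the arm with the highest sampled reward estimate."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

# An arm with little data has a wide posterior, so it occasionally samples
# highest and gets shown; a clearly worse, well-observed arm almost never does.
choice = thompson_select(successes=[90, 2], failures=[10, 3])
```

There is no explicit exploration parameter: arms are explored in proportion to the posterior probability that they are the best, and exploration fades as posteriors sharpen.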

⚠️ Interview Pattern: When asked about exploration, explain epsilon-greedy first (simple baseline), then UCB (adds optimism to uncertainty), then Thompson Sampling (probabilistic, state of the art). Show you understand the progression from simple to sophisticated.
💡 Key Takeaways
Epsilon-greedy is simplest (explore with probability ε, typically 5-10%) but inefficient because it explores uniformly even when some arms are clearly inferior.
UCB (Upper Confidence Bound) selects the arm with highest upper confidence bound, providing principled exploration that decreases as observations accumulate.
Thompson Sampling maintains posterior distributions over arm rewards, samples from each, and picks the arm with highest sample. Naturally balances exploration with uncertainty.
Thompson Sampling empirically outperforms other algorithms in practice and is the production standard. It naturally transitions from exploration to exploitation as posteriors sharpen.
All algorithms eventually converge to the optimal arm; the difference is regret accumulated during learning. Thompson Sampling typically achieves near-optimal regret bounds.
📌 Interview Tips
1. When explaining Thompson Sampling: describe Beta(successes+1, failures+1) posteriors for binary outcomes; sample from each arm and pick the highest sample; exploration arises naturally from uncertainty.
2. For interview depth: mention that TS is Bayes-optimal for single-period decisions and achieves near-optimal regret bounds without explicit exploration parameters to tune.
3. When asked about convergence: explain that TS naturally reduces exploration as posteriors sharpen; an arm with 1,000 observations has a tight posterior and is rarely sampled highest if it lags behind.