Core Concept
Epsilon-greedy is the simplest exploration strategy: with probability epsilon, show a random item; otherwise show the best predicted item. Easy to implement, easy to tune, but exploration is untargeted.
How Epsilon-Greedy Works
For each recommendation slot: generate a random number between 0 and 1. If it is below epsilon (say 0.1), pick a random item from the catalog; otherwise, pick the highest-scoring item from your model. Typical epsilon values: 0.05 to 0.15, depending on how much exploration you can afford.
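The rule above can be sketched in a few lines. This is a minimal illustration, assuming model scores arrive as a plain list; the function name `epsilon_greedy` is hypothetical:

```python
import random

def epsilon_greedy(scores, epsilon=0.1):
    """Pick an item index: random with probability epsilon, else the argmax score."""
    if random.random() < epsilon:
        # Explore: uniform random item from the catalog.
        return random.randrange(len(scores))
    # Exploit: highest-scoring item from the model.
    return max(range(len(scores)), key=lambda i: scores[i])
```

Note the exploration step is uniform over the whole catalog, which is exactly why epsilon-greedy is called untargeted: it wastes traffic on items the model is already confident are poor.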
UCB: Upper Confidence Bound
UCB adds optimism to uncertainty. For each item, compute score = predicted_value + exploration_bonus. The bonus is larger for items with fewer observations. Items that have been shown rarely get higher bonuses, encouraging exploration. As observations accumulate, bonus shrinks and exploitation dominates.
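A common concrete form of this bonus is the UCB1 rule, score = mean + sqrt(2 ln t / n_i). A minimal sketch, assuming we track empirical mean rewards and show counts per item (the helper name `ucb1_select` is an assumption, not from the text):

```python
import math

def ucb1_select(values, counts, total_plays):
    """values[i]: empirical mean reward of item i; counts[i]: times item i was shown.

    Returns the index with the highest upper confidence bound.
    """
    def score(i):
        if counts[i] == 0:
            # Unseen items get an infinite bonus, so each is tried at least once.
            return float("inf")
        # Bonus shrinks as counts[i] grows: exploitation gradually dominates.
        bonus = math.sqrt(2 * math.log(total_plays) / counts[i])
        return values[i] + bonus
    return max(range(len(values)), key=score)
```

Items shown rarely get a large bonus and win the argmax; as observations accumulate, the bonus decays like 1/sqrt(n) and the predicted value takes over, matching the behavior described above.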
Thompson Sampling
Model uncertainty as a probability distribution over the true reward. For each item, sample from its distribution and pick the item with highest sampled value. Items with high variance (uncertainty) occasionally sample high and get selected. Natural exploration without explicit exploration term. Often outperforms UCB in practice.
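For binary rewards (click / no click), the standard instantiation keeps a Beta posterior per item. A minimal sketch under that Beta-Bernoulli assumption, with a Beta(1, 1) uniform prior; the function name is illustrative:

```python
import random

def thompson_select(successes, failures):
    """Beta-Bernoulli Thompson sampling.

    successes[i]/failures[i]: observed clicks/non-clicks for item i.
    Sample theta_i ~ Beta(successes[i] + 1, failures[i] + 1), pick the argmax.
    """
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])
```

An item with little data has a wide posterior, so it occasionally samples high and gets shown; as evidence accumulates the posterior sharpens and the item is selected in proportion to how likely it is to actually be best. No explicit exploration term is needed.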
⚠️ Interview Pattern: When asked about exploration, explain epsilon-greedy first (simple baseline), then UCB (adds optimism to uncertainty), then Thompson Sampling (probabilistic, state of the art). Show you understand the progression from simple to sophisticated.
✓ Epsilon-greedy is simplest (explore with probability ε, typically 5-10%) but inefficient because it explores uniformly even when some arms are clearly inferior.
✓ UCB (Upper Confidence Bound) selects the arm with the highest upper confidence bound, providing principled exploration that decreases as observations accumulate.
✓ Thompson Sampling maintains posterior distributions over arm rewards, samples from each, and picks the arm with the highest sample. It naturally balances exploration with uncertainty.
✓ Thompson Sampling often outperforms the alternatives empirically and is a common production choice. It naturally transitions from exploration to exploitation as posteriors sharpen.
✓ All algorithms eventually converge to the optimal arm; the difference is the regret accumulated during learning. Thompson Sampling typically achieves near-optimal regret bounds.
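Regret here means the cumulative gap between the reward of the best arm and the arm actually chosen. A toy simulation of that accumulation, assuming Bernoulli rewards and hypothetical click-through rates (all names and values are illustrative):

```python
import random

def simulate(select, true_ctrs, steps=2000, seed=0):
    """Run a bandit policy and return its cumulative (expected) regret."""
    rng = random.Random(seed)
    n = len(true_ctrs)
    succ, fail = [0] * n, [0] * n
    best = max(true_ctrs)
    regret = 0.0
    for t in range(1, steps + 1):
        i = select(succ, fail, t, rng)
        reward = 1 if rng.random() < true_ctrs[i] else 0
        succ[i] += reward
        fail[i] += 1 - reward
        regret += best - true_ctrs[i]  # expected per-step regret of this pick
    return regret

def eps_select(succ, fail, t, rng, eps=0.1):
    """Epsilon-greedy on empirical means."""
    n = len(succ)
    if rng.random() < eps:
        return rng.randrange(n)
    means = [s / (s + f) if s + f else 0.0 for s, f in zip(succ, fail)]
    return max(range(n), key=lambda i: means[i])

def ts_select(succ, fail, t, rng):
    """Beta-Bernoulli Thompson sampling."""
    return max(range(len(succ)),
               key=lambda i: rng.betavariate(succ[i] + 1, fail[i] + 1))

# Hypothetical arm CTRs; both policies converge toward the 0.05 arm,
# but accumulate different amounts of regret along the way.
arms = [0.05, 0.04, 0.02]
```

Running `simulate(ts_select, arms)` versus `simulate(eps_select, arms)` makes the bullet point concrete: both find the best arm eventually, and the interesting number is the regret each racks up while learning.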