Failure Modes: Delayed Rewards and Nonstationarity
DELAYED REWARDS
Bandits assume rewards arrive quickly after decisions. But conversions may take hours or days. If you optimize for clicks (fast feedback), you may favor clickbait over high-quality content that converts later. The bandit sees the click, updates, and shifts traffic before the purchase (or lack thereof) arrives.
Mitigation: Use proxy rewards that correlate with delayed outcomes. Dwell time > 30 seconds predicts purchase intent. Add-to-cart predicts checkout. Train a model to estimate delayed reward from immediate signals, then use that estimate as the bandit reward. This converts a delayed reward problem into an immediate reward problem.
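The proxy-reward idea can be sketched as follows. This is a minimal illustration, not a production model: it estimates P(purchase | immediate signals) from a hypothetical historical log by simple conditional frequencies, then serves that estimate as the bandit's immediate reward. The signal names (dwell over 30s, add-to-cart) and the log format are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical historical log: (dwell_over_30s, added_to_cart, purchased)
history = [
    (True,  True,  True),
    (True,  True,  False),
    (True,  False, False),
    (False, False, False),
    (True,  True,  True),
    (False, True,  False),
]

# Empirical P(purchase | immediate signals): signals -> [purchases, trials]
counts = defaultdict(lambda: [0, 0])
for dwell, cart, purchased in history:
    counts[(dwell, cart)][0] += purchased
    counts[(dwell, cart)][1] += 1

def proxy_reward(dwell_over_30s, added_to_cart):
    """Estimated delayed reward, computed from immediate signals only."""
    purchases, trials = counts[(dwell_over_30s, added_to_cart)]
    return purchases / trials if trials else 0.0

# Feed the estimate to the bandit now, instead of waiting days for the
# purchase event to arrive.
print(proxy_reward(True, True))   # 2 purchases out of 3 such sessions
```

In practice the frequency table would be replaced by a trained model (e.g. logistic regression over many signals), but the interface is the same: immediate signals in, estimated delayed reward out.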
NONSTATIONARITY
Arms change over time. Seasonal trends, breaking news, and inventory changes mean yesterday's winner may be today's loser. Standard bandits with unbounded memory lock onto stale winners because historical evidence drowns new signals.
Mitigation: Use sliding windows (only consider the last 7 days of data) or exponential decay (weight recent observations more heavily). For Thompson Sampling, shrink each arm's Beta parameters toward the prior once per day: α_new = 1 + decay × (α − 1), and likewise for β. With decay = 0.5^(1/7) ≈ 0.906 per day, evidence loses half its weight every week, so observations from one week ago contribute half as much: a 7-day half-life.
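A minimal sketch of the daily decay step, assuming Beta(1, 1) priors so the parameters shrink toward 1. The starting counts (100 successes, 50 failures) are illustrative:

```python
HALF_LIFE_DAYS = 7
decay = 0.5 ** (1 / HALF_LIFE_DAYS)   # per-day factor, ~0.906

def decay_beta(alpha, beta):
    """Shrink Beta(alpha, beta) toward the Beta(1, 1) prior by one day."""
    return 1 + decay * (alpha - 1), 1 + decay * (beta - 1)

# An arm with 100 observed successes and 50 failures: Beta(101, 51).
alpha, beta = 101.0, 51.0
for _ in range(7):                    # one week of daily decay steps
    alpha, beta = decay_beta(alpha, beta)

# After 7 days the accumulated evidence is halved:
print(alpha, beta)                    # ~51.0, ~26.0
```

Shrinking toward the prior rather than toward zero keeps the posterior proper (α, β ≥ 1) while old evidence fades, so a stale winner can be overtaken as new data arrives.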
HEAVY-TAILED REWARDS
UCB assumes bounded rewards and well-behaved noise. If occasional outliers produce 100x the normal reward, UCB overestimates that arm's mean and overcommits to it.
Mitigation: Use a robust mean estimator (median of means) or switch to Thompson Sampling with heavy-tailed priors that handle outliers gracefully. Another option is to clip rewards at a maximum value, accepting some bias in exchange for lower variance.
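The median-of-means estimator can be sketched in a few lines: split the observed rewards into blocks, average each block, and take the median of the block means, so a single 100x outlier can corrupt at most one block. The block count of 5 and the example reward stream are assumptions for illustration:

```python
import random
import statistics

def median_of_means(rewards, num_blocks=5):
    """Robust mean estimate: median of per-block averages."""
    rewards = list(rewards)
    random.shuffle(rewards)           # break any ordering in the stream
    k = max(1, len(rewards) // num_blocks)
    block_means = [statistics.mean(rewards[i:i + k])
                   for i in range(0, len(rewards), k)]
    return statistics.median(block_means)

# 49 ordinary rewards plus one 100x outlier.
rewards = [1.0] * 49 + [100.0]
print(statistics.mean(rewards))       # 2.98 -- dragged up by the outlier
print(median_of_means(rewards))       # 1.0  -- outlier confined to one block
```

The clipping alternative is even simpler (min(r, cap) before the bandit update); median of means avoids choosing a cap but needs enough samples to fill several blocks.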