
Failure Modes: Delayed Rewards and Nonstationarity

Bandits fail in predictable ways when real-world conditions violate their assumptions. Delayed or censored rewards are a common killer. If conversions arrive hours or days after the decision, a naive bandit sees only the immediate signals and will over-allocate traffic to arms that produce fast but shallow engagement, like clickbait, while underweighting arms that yield delayed value, like high-quality content that drives purchases later. Mitigate this by using proxy rewards that predict the delayed objective, such as weighted engagement signals including dwell time and scroll depth; by using inverse propensity scoring to de-bias delayed feedback; or by implementing delayed-feedback-aware updates that pause exploitation for arms with many pending outcomes.

Nonstationarity breaks the stationary-reward assumption that underpins regret bounds. Seasonality, trends, product launches, or external events shift reward distributions over time. If you accumulate statistics indefinitely, the policy locks onto stale winners and misses new opportunities; for example, a news-recommendation bandit trained on pre-pandemic traffic fails when user behavior shifts dramatically. Mitigate by using sliding windows of days to weeks instead of lifetime statistics, applying exponential-decay weights with a half-life that matches business dynamics (roughly 7 days for e-commerce, 1 day for breaking news), or deploying change-point detection that triggers parameter resets when distributions shift. Netflix and Spotify handle this with daily or hourly retraining windows to catch trending content.

New-arm cold start is another pain point. Without informative priors, both UCB and Thompson Sampling can under-explore new arms when incumbents have strong estimated means built on thousands of trials. A new homepage banner with zero observations might never get selected if existing banners show a 12 percent click-through rate. Mitigate with optimistic priors that give new arms the benefit of the doubt, forced minimum exploration where each arm receives at least 100 trials or 2 percent of traffic, or a brief warm-up period where all arms receive equal traffic for the first 1000 impressions. LinkedIn and Microsoft use forced-exploration budgets to ensure new content gets a fair chance.
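To make the decay and cold-start mitigations concrete, here is a minimal Python sketch, assuming binary (click/no-click) rewards and a Beta-Bernoulli model; the class name, default values, and structure are illustrative, not taken from any particular production system:

```python
import random

class DecayedThompsonBandit:
    """Beta-Bernoulli Thompson Sampling with exponential forgetting (for
    nonstationarity), optimistic priors for new arms (cold start), and a
    small forced-exploration floor so no arm is starved of traffic."""

    def __init__(self, decay=0.999, prior_alpha=2.0, prior_beta=1.0, floor=0.02):
        self.decay = decay              # per-update forgetting factor; decay**N = 0.5 sets the half-life N
        self.prior_alpha = prior_alpha  # alpha > beta makes the prior optimistic for brand-new arms
        self.prior_beta = prior_beta
        self.floor = floor              # minimum share of traffic per arm (e.g. 2 percent)
        self.arms = {}                  # arm_id -> [alpha, beta] pseudo-counts

    def add_arm(self, arm_id):
        self.arms[arm_id] = [self.prior_alpha, self.prior_beta]

    def select(self):
        arm_ids = list(self.arms)
        # Forced exploration: with probability floor * num_arms, pick uniformly,
        # so each arm keeps at least a `floor` share of traffic.
        if random.random() < min(1.0, self.floor * len(arm_ids)):
            return random.choice(arm_ids)
        # Thompson Sampling: sample each posterior and play the best draw.
        draws = {a: random.betavariate(ab[0], ab[1]) for a, ab in self.arms.items()}
        return max(draws, key=draws.get)

    def update(self, chosen_arm, reward):
        # Shrink all pseudo-counts back toward the prior so stale evidence fades,
        # then credit the binary reward to the arm that was actually played.
        for ab in self.arms.values():
            ab[0] = self.prior_alpha + self.decay * (ab[0] - self.prior_alpha)
            ab[1] = self.prior_beta + self.decay * (ab[1] - self.prior_beta)
        self.arms[chosen_arm][0] += reward
        self.arms[chosen_arm][1] += 1 - reward
```

The decay is applied per update, so it is measured in impressions rather than days: pick it so the pseudo-counts halve over roughly a week's worth of impressions for e-commerce, or a day's worth for breaking news.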
💡 Key Takeaways
Delayed rewards bias the policy toward fast-engagement arms like clickbait at the expense of high-quality content whose conversions arrive hours or days later
Nonstationarity from seasonality or trends locks the policy onto stale winners unless you use sliding windows of 1 to 7 days or exponential decay with a matching half-life
New-arm cold start leads to under-exploration when incumbents have thousands of observations; mitigate with optimistic priors or a forced 2 percent minimum traffic share
Context leakage and feedback loops create self-selection bias when policies and user cohorts drift together; always log action probabilities for off-policy monitoring
Heavy-tailed noise causes UCB to underestimate uncertainty and overcommit to lucky arms; use robust mean estimators (see the sketch after this list) or switch to Thompson Sampling with heavy-tailed priors
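The heavy-tailed point can be made concrete with a robust estimate of the mean. Below is a minimal sketch; the function names, block count, and bonus constant are illustrative assumptions rather than part of any standard library:

```python
import math
import statistics

def median_of_means(samples, num_blocks=5):
    """Split the rewards into blocks, average each block, and return the
    median of the block means; a single extreme outlier can corrupt at
    most one block, unlike the plain sample mean."""
    n = len(samples)
    if n < num_blocks:
        return statistics.fmean(samples)
    block_size = n // num_blocks
    block_means = [
        statistics.fmean(samples[i * block_size:(i + 1) * block_size])
        for i in range(num_blocks)
    ]
    return statistics.median(block_means)

def robust_ucb_index(samples, total_pulls, c=2.0):
    """UCB-style score: robust mean plus the usual sqrt(log t / n) bonus."""
    bonus = c * math.sqrt(math.log(max(total_pulls, 2)) / len(samples))
    return median_of_means(samples) + bonus
```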
📌 Examples
E-commerce bandit optimizing for purchases: use dwell time and add-to-cart as proxy rewards (a weighted blend like the sketch after these examples), since the conversion arrives hours later, not seconds
News recommendation during breaking events: exponential decay with a 1-day half-life prevents locking onto yesterday's trending stories that are no longer relevant
New homepage banner launch: force 2 percent traffic for the first 1000 impressions so it competes fairly against incumbent banners with 10k trials and 12 percent CTR
Content provider gaming CTR with clickbait: add secondary metrics like 30-second dwell time and multi-objective scoring to penalize shallow engagement
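The proxy-reward and multi-objective ideas in the first and last examples amount to blending fast signals into one score. A minimal sketch, with illustrative weights that in practice would be fit to predict the delayed conversion:

```python
def proxy_reward(clicked, dwell_seconds, added_to_cart, purchased=None):
    """Blend fast engagement signals into a proxy for the delayed purchase
    objective; the 30-second dwell threshold penalizes clickbait that earns
    clicks but no attention. Weights here are illustrative placeholders."""
    score = 0.0
    if clicked:
        score += 0.1
    if dwell_seconds >= 30:
        score += 0.3
    if added_to_cart:
        score += 0.6
    if purchased is not None:
        # Once the delayed ground truth finally lands, it replaces the proxy.
        score = 1.0 if purchased else 0.0
    return min(score, 1.0)
```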