Failure Modes: Delayed Rewards and Nonstationarity
DELAYED REWARDS
Bandits assume rewards arrive quickly after decisions. But conversions may take hours or days. If you optimize for clicks (fast feedback), you may favor clickbait over high-quality content that converts later. The bandit sees the click, updates, and shifts traffic before the purchase (or lack thereof) arrives.
Mitigation: Use proxy rewards that correlate with delayed outcomes. Dwell time > 30 seconds predicts purchase intent. Add-to-cart predicts checkout. Train a model to estimate delayed reward from immediate signals, then use that estimate as the bandit reward. This converts a delayed reward problem into an immediate reward problem.
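The proxy-reward idea can be sketched as follows. This is a minimal illustration, not a production model: it estimates P(purchase | immediate signals) from a hypothetical historical log by simple conditional frequencies, then serves that estimate as the bandit's immediate reward. The signal names (dwell over 30s, add-to-cart) and the log format are assumptions for the example.

```python
from collections import defaultdict

# Hypothetical historical log: (dwell_over_30s, added_to_cart, purchased)
history = [
    (True,  True,  True),
    (True,  True,  False),
    (True,  False, False),
    (False, False, False),
    (True,  True,  True),
    (False, True,  False),
]

# Empirical P(purchase | immediate signals): signals -> [purchases, trials]
counts = defaultdict(lambda: [0, 0])
for dwell, cart, purchased in history:
    counts[(dwell, cart)][0] += purchased
    counts[(dwell, cart)][1] += 1

def proxy_reward(dwell_over_30s, added_to_cart):
    """Estimated delayed reward, computed from immediate signals only."""
    purchases, trials = counts[(dwell_over_30s, added_to_cart)]
    return purchases / trials if trials else 0.0

# Feed the estimate to the bandit now, instead of waiting days for the
# purchase event to arrive.
print(proxy_reward(True, True))   # 2 purchases out of 3 such sessions
```

In practice the frequency table would be replaced by a trained model (e.g. logistic regression over many signals), but the interface is the same: immediate signals in, estimated delayed reward out.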
NONSTATIONARITY
Arms change over time. Seasonal trends, breaking news, and inventory changes mean yesterday's winner may be today's loser. Standard bandits with unbounded memory lock onto stale winners because historical evidence drowns new signals.
Mitigation: Use sliding windows (only consider the last 7 days of data) or exponential decay (weight recent observations more heavily). For Thompson Sampling, shrink each arm's Beta parameters toward the prior once per day: α_new = 1 + decay × (α − 1), and likewise for β. With decay = 0.5^(1/7) ≈ 0.906 per day, evidence loses half its weight every week, so observations from one week ago contribute half as much: a 7-day half-life.
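A minimal sketch of the daily decay step, assuming Beta(1, 1) priors so the parameters shrink toward 1. The starting counts (100 successes, 50 failures) are illustrative:

```python
HALF_LIFE_DAYS = 7
decay = 0.5 ** (1 / HALF_LIFE_DAYS)   # per-day factor, ~0.906

def decay_beta(alpha, beta):
    """Shrink Beta(alpha, beta) toward the Beta(1, 1) prior by one day."""
    return 1 + decay * (alpha - 1), 1 + decay * (beta - 1)

# An arm with 100 observed successes and 50 failures: Beta(101, 51).
alpha, beta = 101.0, 51.0
for _ in range(7):                    # one week of daily decay steps
    alpha, beta = decay_beta(alpha, beta)

# After 7 days the accumulated evidence is halved:
print(alpha, beta)                    # ~51.0, ~26.0
```

Shrinking toward the prior rather than toward zero keeps the posterior proper (α, β ≥ 1) while old evidence fades, so a stale winner can be overtaken as new data arrives.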
HEAVY-TAILED REWARDS
UCB assumes bounded rewards and well-behaved noise. If occasional outliers produce 100x the normal reward, UCB overestimates that arm's mean and overcommits to it.
Mitigation: Use a robust mean estimator (median of means) or switch to Thompson Sampling with heavy-tailed priors that handle outliers gracefully. Another option is to clip rewards at a maximum value, accepting some bias in exchange for lower variance.
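The median-of-means estimator can be sketched in a few lines: split the observed rewards into blocks, average each block, and take the median of the block means, so a single 100x outlier can corrupt at most one block. The block count of 5 and the example reward stream are assumptions for illustration:

```python
import random
import statistics

def median_of_means(rewards, num_blocks=5):
    """Robust mean estimate: median of per-block averages."""
    rewards = list(rewards)
    random.shuffle(rewards)           # break any ordering in the stream
    k = max(1, len(rewards) // num_blocks)
    block_means = [statistics.mean(rewards[i:i + k])
                   for i in range(0, len(rewards), k)]
    return statistics.median(block_means)

# 49 ordinary rewards plus one 100x outlier.
rewards = [1.0] * 49 + [100.0]
print(statistics.mean(rewards))       # 2.98 -- dragged up by the outlier
print(median_of_means(rewards))       # 1.0  -- outlier confined to one block
```

The clipping alternative is even simpler (min(r, cap) before the bandit update); median of means avoids choosing a cap but needs enough samples to fill several blocks.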