
Production Architecture: Serving Bandits at Scale

Deploying bandits in production requires careful architecture to meet strict latency budgets while processing millions of decisions and learning continuously. The typical flow splits into a hot serving path and a warm update pipeline. The serving tier receives a request with context, scores all candidate arms using in-memory parameters, selects an arm, and logs the decision along with its action probabilities. A separate streaming pipeline aggregates outcomes, computes updates to the arm parameters, and pushes refreshed state to the serving tier every 30 to 60 seconds.

As a concrete example, consider a commerce homepage selecting among 10 hero banners at 120 thousand requests per second peak, with a 50 millisecond p99 end-to-end latency budget; roughly 5 milliseconds of that can go to the bandit decision. With Thompson Sampling over Beta-Bernoulli arms this is straightforward: maintain 10 pairs of integers in RAM for alpha and beta, sample 10 Beta variates per request, and take the argmax. With contextual linear bandits and 50 to 100 features, precompute per-arm inverse covariance matrices and weight vectors; scoring then requires a handful of vector operations, which is sub-millisecond on a single CPU core.

The streaming update job processes 2 million events per minute, aggregates counts and sufficient statistics in one- to five-minute windows, applies exponential decay with a 7-day half-life to handle nonstationarity, and atomically pushes updated parameters to a distributed cache that the serving tier reads.

Safety and monitoring are critical. Always log action propensities to enable Off-Policy Evaluation (OPE) using Inverse Propensity Scoring (IPS) and doubly robust estimators; this lets you estimate the reward lift of a new policy offline before ramping it online. Set hard constraints, such as a minimum of 2 percent traffic per arm during warmup, so that cold-start arms get explored, and enforce ceilings on the exploration rate to limit regret during learning. Track metrics like reward relative to a shadow baseline, exploration rate, arm saturation, and parameter staleness. Alert if the reward distribution shifts sharply or if the entropy of chosen arms collapses prematurely. Provide a kill switch that reverts to a fixed A/B split during incidents.
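A minimal sketch of the hot-path decision for the Beta-Bernoulli case described above: arm parameters live in RAM as (alpha, beta) pairs, each request draws one Beta sample per arm and plays the argmax, and refreshed parameters can be swapped in atomically by the update pipeline. The class name, the 2 percent exploration floor, and the parameter-swap method are illustrative assumptions, not a specific production API.

```python
import numpy as np

class BetaBernoulliBandit:
    """In-memory Thompson Sampling over K arms with Beta(alpha, beta) posteriors."""

    def __init__(self, n_arms: int, min_traffic: float = 0.02, seed: int = 0):
        self.alpha = np.ones(n_arms)    # prior successes + 1
        self.beta = np.ones(n_arms)     # prior failures + 1
        self.min_traffic = min_traffic  # exploration floor per arm during warmup
        self.rng = np.random.default_rng(seed)

    def select(self) -> int:
        # With probability K * min_traffic, pick a uniformly random arm so that
        # every arm receives at least ~min_traffic of traffic while cold.
        k = len(self.alpha)
        if self.rng.random() < k * self.min_traffic:
            return int(self.rng.integers(k))
        # Thompson Sampling: one posterior draw per arm, play the argmax.
        samples = self.rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def load_params(self, alpha: np.ndarray, beta: np.ndarray) -> None:
        # Swap in refreshed parameters pushed by the streaming update pipeline.
        self.alpha, self.beta = alpha.copy(), beta.copy()

bandit = BetaBernoulliBandit(n_arms=10)
arm = bandit.select()  # sub-millisecond: 10 Beta draws plus an argmax
```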
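For the contextual linear case, the text calls for precomputing per-arm inverse covariance matrices and weight vectors so that scoring reduces to a few vector operations. A LinUCB-style scoring sketch under that assumption (the array layout and the exploration multiplier `ucb_alpha` are illustrative):

```python
import numpy as np

def score_arms(x: np.ndarray,
               theta: np.ndarray,    # (K, d) per-arm weight vectors
               a_inv: np.ndarray,    # (K, d, d) per-arm inverse covariance matrices
               ucb_alpha: float = 1.0) -> int:
    """LinUCB-style scoring: mean reward estimate plus an exploration bonus per arm."""
    means = theta @ x                                   # (K,) predicted rewards
    # Uncertainty bonus sqrt(x^T A_k^{-1} x) for each arm, as batched matrix-vector ops.
    bonus = np.sqrt(np.einsum('i,kij,j->k', x, a_inv, x))
    return int(np.argmax(means + ucb_alpha * bonus))

# With 10 arms and 50 features, all matrices fit comfortably in cache and the
# whole call is a handful of vector operations on one CPU core.
d, k = 50, 10
x = np.random.rand(d)
theta = np.zeros((k, d))
a_inv = np.stack([np.eye(d)] * k)
arm = score_arms(x, theta, a_inv)
```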
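The streaming update step can be sketched as folding each aggregation window into the posterior counts with a per-window decay factor derived from the 7-day half-life; the dict-based parameter layout and window size here are assumptions, and the returned state would be what gets pushed atomically to the serving-tier cache.

```python
HALF_LIFE_S = 7 * 24 * 3600   # 7-day half-life for nonstationarity
WINDOW_S = 300                # 5-minute aggregation window

# Per-window multiplicative decay so that accumulated counts halve every 7 days.
DECAY = 0.5 ** (WINDOW_S / HALF_LIFE_S)

def apply_window(alpha, beta, successes, failures):
    """Fold one aggregation window into the Beta parameters with exponential decay.

    alpha/beta: dicts arm_id -> float, current posterior counts.
    successes/failures: dicts arm_id -> int, outcomes observed in this window.
    Returns new dicts, suitable for an atomic push to the serving-tier cache.
    """
    new_alpha, new_beta = {}, {}
    for arm in alpha:
        # Decay only the evidence (counts above the prior of 1), then add new counts.
        new_alpha[arm] = 1.0 + DECAY * (alpha[arm] - 1.0) + successes.get(arm, 0)
        new_beta[arm] = 1.0 + DECAY * (beta[arm] - 1.0) + failures.get(arm, 0)
    return new_alpha, new_beta
```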
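Finally, a sketch of why logged propensities matter: with records of (context, action, propensity, reward) you can estimate a candidate policy's mean reward offline via IPS, or via a doubly robust estimator when a learned reward model is available. The log tuple format and the `target_policy` / `reward_model` callables are assumptions for illustration; for Thompson Sampling, logged propensities are often approximated by Monte Carlo over posterior draws.

```python
def ips_estimate(logs, target_policy):
    """Inverse Propensity Scoring estimate of a new policy's mean reward.

    logs: iterable of (context, action, propensity, reward) from the logging policy.
    target_policy(context, action) -> probability the new policy picks `action`.
    """
    total, n = 0.0, 0
    for context, action, propensity, reward in logs:
        total += (target_policy(context, action) / propensity) * reward
        n += 1
    return total / n

def dr_estimate(logs, target_policy, reward_model, actions):
    """Doubly robust estimate: model-based baseline plus an IPS correction term."""
    total, n = 0.0, 0
    for context, action, propensity, reward in logs:
        # Expected reward of the target policy under the learned reward model.
        baseline = sum(target_policy(context, a) * reward_model(context, a)
                       for a in actions)
        # Importance-weighted correction using the logged action's residual.
        correction = (target_policy(context, action) / propensity
                      * (reward - reward_model(context, action)))
        total += baseline + correction
        n += 1
    return total / n
```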
💡 Key Takeaways
Budget about 5 milliseconds for the bandit decision within a 50 millisecond p99 overall latency budget, using precomputed inverse covariance matrices for sub-millisecond scoring
Streaming pipeline processes 2 million events per minute, aggregates in one to five minute windows, and pushes updates every 30 to 60 seconds
Always log action propensities for Off Policy Evaluation with Inverse Propensity Scoring and doubly robust estimators before online ramp
Set minimum 2 percent traffic per arm during warmup to handle cold start, and exploration rate ceilings to limit regret during learning
Exponential decay with a 7-day half-life handles nonstationarity from seasonality and trends, preventing the policy from locking onto stale winners
📌 Examples
Commerce homepage at 120k QPS with Thompson Sampling: 10 Beta(alpha, beta) pairs in RAM, sample and argmax in under 1 millisecond per request
Contextual linear bandit with 50 features and 10 arms: precompute 10 inverse covariance matrices, score via vector dot products in submillisecond
Yahoo Front Page Today Module: streaming job aggregates millions of impressions, updates LinUCB parameters every few minutes, serves with tens of milliseconds latency