Production Architecture: Serving Bandits at Scale
LATENCY BUDGET
Within a 50ms p99 request budget, the bandit decision should take under 5ms. For Thompson Sampling with 10-20 arms, sampling a Beta distribution per arm and taking the argmax is submillisecond. For UCB, computing the index for each arm is similarly fast. The bottleneck is never the algorithm; it is fetching context features from external services.
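A minimal sketch of the decision step, assuming Bernoulli rewards and a hypothetical 15-arm parameter dict (the arm names and counts here are illustrative, not from the text):

```python
import random
import time

def thompson_select(params):
    """Pick the arm with the largest Beta sample.

    params: dict mapping arm_id -> (alpha, beta) success/failure counts.
    """
    best_arm, best_draw = None, -1.0
    for arm, (a, b) in params.items():
        draw = random.betavariate(a, b)  # one posterior sample per arm
        if draw > best_draw:
            best_arm, best_draw = arm, draw
    return best_arm

# Hypothetical 15-arm setup; the decision itself stays well under the 5ms budget.
params = {f"arm_{i}": (1 + i, 10) for i in range(15)}
start = time.perf_counter()
choice = thompson_select(params)
elapsed_ms = (time.perf_counter() - start) * 1000
```

Because the loop touches only in-memory integers, latency scales linearly with the arm count and remains negligible next to any network call for features.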
STATE MANAGEMENT
Non-contextual bandits need minimal state: for Thompson Sampling, two integers (α, β) per arm. Store in Redis or local memory. Updates can be eventual: a 1-5 minute lag between events and parameter updates is acceptable. For 100k QPS, use a streaming pipeline that aggregates events in micro-batches (e.g., every 30-60 seconds) and pushes updated parameters to serving nodes.
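The micro-batch update path can be sketched as two steps, aggregate then merge; the event tuples and the Beta(1, 1) prior for unseen arms are illustrative assumptions:

```python
from collections import defaultdict

def aggregate_events(events):
    """Collapse raw (arm, reward) events into per-arm success/failure deltas."""
    deltas = defaultdict(lambda: [0, 0])  # arm -> [successes, failures]
    for arm, reward in events:
        if reward:
            deltas[arm][0] += 1
        else:
            deltas[arm][1] += 1
    return deltas

def apply_deltas(params, deltas):
    """Merge one micro-batch into the (alpha, beta) store.

    Serving nodes read params between merges, so a 1-5 minute lag
    between an event and its effect on sampling is expected and fine.
    """
    for arm, (succ, fail) in deltas.items():
        a, b = params.get(arm, (1, 1))  # assume a Beta(1, 1) prior for new arms
        params[arm] = (a + succ, b + fail)
    return params

params = {"arm_a": (1, 1)}
batch = [("arm_a", 1), ("arm_a", 0), ("arm_b", 1)]
params = apply_deltas(params, aggregate_events(batch))
```

In production the `params` dict would live in Redis or be broadcast to serving nodes; the merge logic is the same either way.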
LOGGING FOR OFFLINE EVALUATION
Always log: (1) Action taken. (2) Probability of that action (propensity). (3) Reward observed. (4) Context features if contextual. This enables Off-Policy Evaluation (OPE) using Inverse Propensity Scoring. Without propensities logged, you cannot evaluate new policies offline before deploying them. OPE lets you estimate how a new policy would perform using historical data from the current policy.
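A minimal sketch of the IPS estimator over logs in the shape described above; the record fields and the uniform-random logging policy in the example are assumptions for illustration:

```python
def ips_estimate(logs, new_policy):
    """Inverse Propensity Scoring estimate of new_policy's value.

    logs: records with 'context', 'action', 'propensity', 'reward'
          collected under the logging (current) policy.
    new_policy(context, action) -> probability new_policy assigns to action.
    """
    total = 0.0
    for rec in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the new policy is to take the logged action.
        weight = new_policy(rec["context"], rec["action"]) / rec["propensity"]
        total += weight * rec["reward"]
    return total / len(logs)

# Hypothetical logs from a uniform-random policy over two arms.
logs = [
    {"context": None, "action": "a", "propensity": 0.5, "reward": 1.0},
    {"context": None, "action": "b", "propensity": 0.5, "reward": 0.0},
]
# Candidate policy that always plays arm "a".
greedy_a = lambda ctx, action: 1.0 if action == "a" else 0.0
value = ips_estimate(logs, greedy_a)
```

This is exactly why the propensity must be logged at decision time: without it the weight in the numerator has no denominator, and no reweighting is possible.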
COLD START AND MINIMUM TRAFFIC
New arms have no observations and wide uncertainty. To prevent them from being permanently ignored, set minimum traffic floors: each arm receives at least 2% of impressions during warmup. After an arm has 100-500 observations, remove the floor and let the bandit algorithm take over. This ensures fair evaluation before the bandit has enough data to make informed choices.
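One way to sketch the floor, assuming a 2% forced share per cold arm and a 200-observation warmup threshold (a value inside the 100-500 range above; all names here are illustrative):

```python
import random

def select_with_floor(arms, counts, bandit_select, floor=0.02, warmup=200):
    """Guarantee minimum traffic to under-observed arms.

    arms: list of arm ids; counts: dict arm -> observation count.
    bandit_select: callable returning the bandit's normal choice.
    Each cold arm (fewer than `warmup` observations) gets roughly
    `floor` of impressions; otherwise the bandit decides as usual.
    """
    cold = [a for a in arms if counts.get(a, 0) < warmup]
    if cold and random.random() < floor * len(cold):
        return random.choice(cold)  # forced exploration for a cold arm
    return bandit_select()

arms = ["a", "b", "new"]
counts = {"a": 5000, "b": 4000, "new": 10}
# Bandit stubbed to always prefer "a"; "new" still gets ~2% of traffic.
choice = select_with_floor(arms, counts, bandit_select=lambda: "a")
```

Once every arm's count passes the warmup threshold, the cold list is empty and the function degrades to the plain bandit decision, matching the handoff described above.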