Production Architecture: Serving Bandits at Scale
LATENCY BUDGET
Within a 50ms p99 request budget, the bandit decision should take under 5ms. For Thompson Sampling with 10-20 arms, sampling a Beta distribution per arm and taking the argmax is submillisecond. For UCB, computing the index for each arm is similarly fast. The bottleneck is never the algorithm; it is fetching context features from external services.
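A minimal sketch of the decision step, assuming Bernoulli rewards and a hypothetical 15-arm parameter dict (the arm names and counts here are illustrative, not from the text):

```python
import random
import time

def thompson_select(params):
    """Pick the arm with the largest Beta sample.

    params: dict mapping arm_id -> (alpha, beta) success/failure counts.
    """
    best_arm, best_draw = None, -1.0
    for arm, (a, b) in params.items():
        draw = random.betavariate(a, b)  # one posterior sample per arm
        if draw > best_draw:
            best_arm, best_draw = arm, draw
    return best_arm

# Hypothetical 15-arm setup; the decision itself stays well under the 5ms budget.
params = {f"arm_{i}": (1 + i, 10) for i in range(15)}
start = time.perf_counter()
choice = thompson_select(params)
elapsed_ms = (time.perf_counter() - start) * 1000
```

Because the loop touches only in-memory integers, latency scales linearly with the arm count and remains negligible next to any network call for features.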
STATE MANAGEMENT
Non-contextual bandits need minimal state: for Thompson Sampling, two integers (α, β) per arm. Store in Redis or local memory. Updates can be eventual: a 1-5 minute lag between events and parameter updates is acceptable. For 100k QPS, use a streaming pipeline that aggregates events in micro-batches (e.g., every 30-60 seconds) and pushes updated parameters to serving nodes.
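The micro-batch update path can be sketched as two steps, aggregate then merge; the event tuples and the Beta(1, 1) prior for unseen arms are illustrative assumptions:

```python
from collections import defaultdict

def aggregate_events(events):
    """Collapse raw (arm, reward) events into per-arm success/failure deltas."""
    deltas = defaultdict(lambda: [0, 0])  # arm -> [successes, failures]
    for arm, reward in events:
        if reward:
            deltas[arm][0] += 1
        else:
            deltas[arm][1] += 1
    return deltas

def apply_deltas(params, deltas):
    """Merge one micro-batch into the (alpha, beta) store.

    Serving nodes read params between merges, so a 1-5 minute lag
    between an event and its effect on sampling is expected and fine.
    """
    for arm, (succ, fail) in deltas.items():
        a, b = params.get(arm, (1, 1))  # assume a Beta(1, 1) prior for new arms
        params[arm] = (a + succ, b + fail)
    return params

params = {"arm_a": (1, 1)}
batch = [("arm_a", 1), ("arm_a", 0), ("arm_b", 1)]
params = apply_deltas(params, aggregate_events(batch))
```

In production the `params` dict would live in Redis or be broadcast to serving nodes; the merge logic is the same either way.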
LOGGING FOR OFFLINE EVALUATION
Always log: (1) Action taken. (2) Probability of that action (propensity). (3) Reward observed. (4) Context features if contextual. This enables Off-Policy Evaluation (OPE) using Inverse Propensity Scoring. Without propensities logged, you cannot evaluate new policies offline before deploying them. OPE lets you estimate how a new policy would perform using historical data from the current policy.
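A minimal sketch of the IPS estimator over logs in the shape described above; the record fields and the uniform-random logging policy in the example are assumptions for illustration:

```python
def ips_estimate(logs, new_policy):
    """Inverse Propensity Scoring estimate of new_policy's value.

    logs: records with 'context', 'action', 'propensity', 'reward'
          collected under the logging (current) policy.
    new_policy(context, action) -> probability new_policy assigns to action.
    """
    total = 0.0
    for rec in logs:
        # Reweight each logged reward by how much more (or less) likely
        # the new policy is to take the logged action.
        weight = new_policy(rec["context"], rec["action"]) / rec["propensity"]
        total += weight * rec["reward"]
    return total / len(logs)

# Hypothetical logs from a uniform-random policy over two arms.
logs = [
    {"context": None, "action": "a", "propensity": 0.5, "reward": 1.0},
    {"context": None, "action": "b", "propensity": 0.5, "reward": 0.0},
]
# Candidate policy that always plays arm "a".
greedy_a = lambda ctx, action: 1.0 if action == "a" else 0.0
value = ips_estimate(logs, greedy_a)
```

This is exactly why the propensity must be logged at decision time: without it the weight in the numerator has no denominator, and no reweighting is possible.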
COLD START AND MINIMUM TRAFFIC
New arms have no observations and wide uncertainty. To prevent them from being permanently ignored, set minimum traffic floors: each arm receives at least 2% of impressions during warmup. After an arm has 100-500 observations, remove the floor and let the bandit algorithm take over. This ensures fair evaluation before the bandit has enough data to make informed choices.
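One way to sketch the floor, assuming a 2% forced share per cold arm and a 200-observation warmup threshold (a value inside the 100-500 range above; all names here are illustrative):

```python
import random

def select_with_floor(arms, counts, bandit_select, floor=0.02, warmup=200):
    """Guarantee minimum traffic to under-observed arms.

    arms: list of arm ids; counts: dict arm -> observation count.
    bandit_select: callable returning the bandit's normal choice.
    Each cold arm (fewer than `warmup` observations) gets roughly
    `floor` of impressions; otherwise the bandit decides as usual.
    """
    cold = [a for a in arms if counts.get(a, 0) < warmup]
    if cold and random.random() < floor * len(cold):
        return random.choice(cold)  # forced exploration for a cold arm
    return bandit_select()

arms = ["a", "b", "new"]
counts = {"a": 5000, "b": 4000, "new": 10}
# Bandit stubbed to always prefer "a"; "new" still gets ~2% of traffic.
choice = select_with_floor(arms, counts, bandit_select=lambda: "a")
```

Once every arm's count passes the warmup threshold, the cold list is empty and the function degrades to the plain bandit decision, matching the handoff described above.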