Learn→A/B Testing & Experimentation→Multi-Armed Bandits (Thompson Sampling, UCB)→3 of 6

A/B Testing & Experimentation • Multi-Armed Bandits (Thompson Sampling, UCB)Medium⏱️ ~2 min

Thompson Sampling: Bayesian Probability Matching

THE BAYESIAN APPROACH
Thompson Sampling maintains a probability distribution (posterior) over each arm's true reward rate. Instead of computing a single point estimate, it treats the reward rate as uncertain. For binary rewards (click/no-click), the Beta distribution is the natural choice. Each arm starts with Beta(1,1), which is uniform over [0,1], representing complete uncertainty.
HOW IT WORKS
At each decision: (1) Sample one value from each arm's Beta posterior. (2) Select the arm with the highest sample. (3) Observe the reward. (4) Update that arm's posterior: Beta(α, β) → Beta(α + success, β + failure). The posterior automatically balances exploration and exploitation. Arms with high uncertainty produce samples that sometimes beat the current leader, triggering exploration. As uncertainty shrinks, high-reward arms consistently win.
COMPUTATIONAL COST
Sampling from 10-20 Beta distributions takes under 1 millisecond. The argmax over samples is trivial. This fits easily in a 5-10ms latency budget at 100k+ QPS. Storage is minimal: two integers (α, β) per arm. The entire state for 20 arms fits in 160 bytes.
💡 Key Insight: Thompson Sampling has no exploration parameter to tune. The posterior uncertainty automatically drives exploration. This makes it more robust than UCB when you cannot tune parameters carefully.
INFORMATIVE PRIORS
If you expect an arm to have ~10% click rate, start with Beta(3, 27) instead of Beta(1,1). This encodes prior knowledge without requiring observations. Informative priors help when you have domain knowledge or historical data about similar arms.

💡 Key Takeaways

✓Maintains Beta(α, β) posterior per arm; update rule is Beta(α + success, β + failure)

✓Select arm by sampling from each posterior and taking argmax; no exploration parameter needed

✓Submillisecond computation: 10-20 Beta samples plus argmax at 100k+ QPS

✓Informative priors like Beta(3, 27) encode domain knowledge; useful when you expect ~10% CTR

📌 Interview Tips

1Explain the Bayesian update: after seeing a click, Beta(1,1) becomes Beta(2,1); after seeing no-click, Beta(1,2)

2When asked why Thompson over UCB: no exploration parameter to tune, posterior uncertainty handles exploration automatically

3Mention informative priors to show depth: if historical data suggests 10% CTR, start with Beta(3, 27)

← Back to Multi-Armed Bandits (Thompson Sampling, UCB) Overview