Thompson Sampling: Bayesian Probability Matching
THE BAYESIAN APPROACH
Thompson Sampling maintains a probability distribution (posterior) over each arm's true reward rate. Instead of computing a single point estimate, it treats the reward rate as uncertain. For binary rewards (click/no-click), the Beta distribution is the natural choice. Each arm starts with Beta(1,1), which is uniform over [0,1], representing complete uncertainty.
HOW IT WORKS
At each decision: (1) Sample one value from each arm's Beta posterior. (2) Select the arm with the highest sample. (3) Observe the reward. (4) Update that arm's posterior: Beta(α, β) → Beta(α + success, β + failure). The posterior automatically balances exploration and exploitation. Arms with high uncertainty produce samples that sometimes beat the current leader, triggering exploration. As uncertainty shrinks, high-reward arms consistently win.
COMPUTATIONAL COST
Sampling from 10-20 Beta distributions takes under 1 millisecond. The argmax over samples is trivial. This fits easily in a 5-10ms latency budget at 100k+ QPS. Storage is minimal: two integers (α, β) per arm. The entire state for 20 arms fits in 160 bytes.
INFORMATIVE PRIORS
If you expect an arm to have ~10% click rate, start with Beta(3, 27) instead of Beta(1,1). This encodes prior knowledge without requiring observations. Informative priors help when you have domain knowledge or historical data about similar arms.