A/B Testing & Experimentation • Multi-Armed Bandits (Thompson Sampling, UCB)
Thompson Sampling: Bayesian Probability Matching
Thompson Sampling takes a fundamentally different approach from UCB's deterministic confidence bounds: it maintains a probability distribution over each arm's true reward parameter. On every decision, it samples one value from each arm's posterior distribution, then selects the arm with the highest sampled value. This elegant procedure naturally balances exploration and exploitation: arms with uncertain posteriors produce varied samples, causing exploration, while arms with tight posteriors around high means are selected consistently.
For binary rewards such as click or no click, the classic implementation uses a Beta prior for each arm. Start with Beta(1, 1), which is uniform over [0, 1]. After observing s successes and f failures for an arm, the posterior becomes Beta(1 + s, 1 + f). Sampling from a Beta distribution is fast, making Thompson Sampling practical at high throughput. For a commerce site handling 120,000 requests per second selecting among 10 hero banners, each decision requires sampling 10 Beta distributions and taking the argmax, which is submillisecond work on a single CPU core.
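As a concrete illustration, here is a minimal Beta-Bernoulli Thompson Sampling sketch in Python. The class name, arm count, and simulated click rates are illustrative assumptions, not any particular production system.

```python
import numpy as np

class BetaBernoulliTS:
    """Minimal Beta-Bernoulli Thompson Sampling sketch (names illustrative)."""

    def __init__(self, n_arms: int, prior_alpha: float = 1.0, prior_beta: float = 1.0):
        # Beta(1, 1) prior per arm: uniform over [0, 1].
        self.alpha = np.full(n_arms, prior_alpha)
        self.beta = np.full(n_arms, prior_beta)

    def select_arm(self) -> int:
        # Draw one sample from each arm's posterior, pick the argmax.
        samples = np.random.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, arm: int, reward: int) -> None:
        # Binary reward: a success increments alpha, a failure increments beta.
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward

# Usage: 10 hero banners with hidden (simulated) click rates.
ts = BetaBernoulliTS(n_arms=10)
true_ctr = np.random.uniform(0.02, 0.12, size=10)
for _ in range(10_000):
    arm = ts.select_arm()
    ts.update(arm, reward=int(np.random.rand() < true_ctr[arm]))
print("posterior means:", ts.alpha / (ts.alpha + ts.beta))
```

Note that the per-decision work is exactly what the paragraph above describes: 10 Beta draws and one argmax.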
Thompson Sampling adapts its exploration automatically through posterior uncertainty. A new arm with a Beta(1, 1) posterior has high variance and will occasionally produce high samples, ensuring it gets tried. After 100 observations, the posterior tightens, and the arm is selected only if its mean is competitive. This removes the need to tune an exploration parameter like UCB's c. LinkedIn and Netflix have deployed contextual Thompson Sampling variants in which, instead of a simple Beta per arm, the system maintains a posterior over the weights of a model that predicts reward from context features. Research on Neural Linear methods shows that pairing a pretrained neural feature extractor with a Bayesian linear last layer provides stable online learning for complex contexts.
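The sketch below shows the contextual idea under common simplifying assumptions: a shared Gaussian posterior over linear weights (as in Linear Thompson Sampling), a ridge prior, and a tunable exploration scale v. The class and parameter names are illustrative, not LinkedIn's or Netflix's production code; in the Neural Linear setup, the feature vectors would come from a frozen pretrained network.

```python
import numpy as np

class LinearTS:
    """Bayesian linear Thompson Sampling over a shared weight vector (sketch)."""

    def __init__(self, dim: int, ridge: float = 1.0, v: float = 0.25):
        self.A = ridge * np.eye(dim)  # posterior precision: ridge*I + sum(x x^T)
        self.b = np.zeros(dim)        # accumulated reward-weighted features
        self.v2 = v * v               # exploration scale on the posterior covariance

    def select(self, arm_features: np.ndarray) -> int:
        # arm_features: (n_arms, dim). Sample one weight vector from the
        # Gaussian posterior N(A^-1 b, v^2 A^-1), then score every arm with it.
        mean = np.linalg.solve(self.A, self.b)
        cov = self.v2 * np.linalg.inv(self.A)
        w = np.random.multivariate_normal(mean, cov)
        return int(np.argmax(arm_features @ w))

    def update(self, x: np.ndarray, reward: float) -> None:
        # Rank-one posterior update after observing (features, reward).
        self.A += np.outer(x, x)
        self.b += reward * x

# Usage with hypothetical 16-dim features for 10 arms:
lts = LinearTS(dim=16)
x_arms = np.random.randn(10, 16)  # stand-in for frozen-network features
arm = lts.select(x_arms)
lts.update(x_arms[arm], reward=1.0)
```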
💡 Key Takeaways
•Maintains a Beta(alpha, beta) posterior per arm for binary rewards, updated to Beta(alpha + successes, beta + failures) after observations
•No explicit exploration parameter is needed; posterior uncertainty drives exploration through sampling variance
•Submillisecond per-decision cost for 10 arms at 120k requests per second, requiring only 10 Beta samples and an argmax per request
•Contextual variants maintain a posterior over model weights, with the Neural Linear pattern combining frozen neural features and a Bayesian linear layer
•Works well with informative priors that encode domain knowledge, such as Beta(3, 27) for an arm expected to have roughly a 10 percent click rate (see the sketch below)
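The informative-prior takeaway can be sanity-checked with a few lines of arithmetic; this assumes numpy and simply restates the Beta(3, 27) example above.

```python
import numpy as np

a, b = 3.0, 27.0                      # Beta(3, 27): 3 pseudo-successes, 27 pseudo-failures
mean = a / (a + b)                    # 0.10 -> encodes the ~10 percent expected click rate
std = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))  # ~0.054, still wide enough to explore
print(mean, std)
print(np.random.beta(a, b, size=5))   # plausible click rates drawn from the prior
```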
📌 Examples
Hero banner selection with Beta-Bernoulli Thompson Sampling, where each banner starts with Beta(1, 1) and updates after every impression and click
LinkedIn feed ranking using contextual Thompson Sampling with a posterior over linear weights that map user and item features to engagement probability
Netflix artwork personalization with Neural Linear approach: ResNet features frozen, Bayesian linear layer updated online to adapt to member preferences