Contextual Bandits: Balancing Exploration and Exploitation
THE EXPLORATION PROBLEM
If you only show items that performed well historically, you never learn about new items or changing preferences. A new product with zero clicks looks worse than an old product with a 10% click rate, even if the new one would actually perform better. Contextual bandits solve this by systematically trying uncertain options to gather information while still showing good items most of the time.
HOW CONTEXTUAL BANDITS WORK
Given context (user features, session state, time of day), the bandit predicts the reward (click, purchase) for each possible action (item to show). It then selects an action by balancing expected reward against uncertainty: items with high uncertainty get explored, while items with high expected reward get exploited. Over time, uncertainty shrinks as the system gathers data.
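As a rough illustration, the decision step can be written as "score every candidate action by expected reward plus an uncertainty bonus and pick the best." The names below (select_action, predict_reward, predict_uncertainty, bonus_weight) are hypothetical stand-ins for whatever model the system actually uses, not a specific library API.

    def select_action(context, actions, predict_reward, predict_uncertainty,
                      bonus_weight=1.0):
        # Score each candidate action by expected reward plus an uncertainty bonus,
        # then pick the highest-scoring one. Uncertain actions get a boost, so they
        # are occasionally shown even when their expected reward is lower.
        def score(action):
            return (predict_reward(context, action)
                    + bonus_weight * predict_uncertainty(context, action))
        return max(actions, key=score)

With bonus_weight at zero this reduces to pure exploitation; larger values push the system to explore more aggressively.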
COMMON ALGORITHMS
Epsilon-greedy: show the best-known item 90% of the time and a random item the other 10%. Simple, but it wastes exploration budget on clearly bad items. Thompson Sampling: sample each item's expected reward from its posterior distribution and pick the item with the highest sample. This naturally explores uncertain items while exploiting confident ones. Upper Confidence Bound (UCB): pick the item with the highest optimistic estimate, i.e. expected reward plus an uncertainty bonus. All three converge to optimal behavior; Thompson Sampling often works best in practice. A sketch of all three follows.
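The sketch below shows the three selection rules for a simple click / no-click (Bernoulli) reward, keeping per-item success and failure counts and ignoring context so the comparison stays short. It is an illustration of the ideas under those simplifying assumptions, not a production implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(successes, failures, epsilon=0.1):
        # With probability epsilon pick a random item, otherwise the best
        # smoothed empirical click rate.
        if rng.random() < epsilon:
            return int(rng.integers(len(successes)))
        means = (successes + 1) / (successes + failures + 2)
        return int(np.argmax(means))

    def thompson_sampling(successes, failures):
        # Draw one sample per item from its Beta posterior and play the item
        # with the highest draw; uncertain items win occasionally, confident
        # high performers win most of the time.
        samples = rng.beta(successes + 1, failures + 1)
        return int(np.argmax(samples))

    def ucb(successes, failures):
        # Empirical mean plus a bonus that shrinks as an item accumulates data.
        pulls = successes + failures
        if np.any(pulls == 0):          # try every item at least once
            return int(np.argmax(pulls == 0))
        means = successes / pulls
        bonus = np.sqrt(2.0 * np.log(pulls.sum()) / pulls)
        return int(np.argmax(means + bonus))

Each function takes arrays of per-item counts, e.g. thompson_sampling(np.array([12., 3.]), np.array([88., 9.])), and returns the index of the item to show next.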
REAL-TIME UPDATES
Unlike batch models, bandits update their beliefs after every interaction: a click increases the expected reward for that item in that context, and an ignored impression decreases it. The update takes milliseconds, allowing the system to adapt within a single session.
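A minimal sketch of that update, assuming the Beta-posterior counters from the previous example and a hypothetical coarse context key (say, user segment plus a time-of-day bucket):

    from collections import defaultdict

    # (context_key, item_id) -> [successes, failures]
    counts = defaultdict(lambda: [0, 0])

    def record_feedback(context_key, item_id, clicked):
        # A click nudges the posterior for this item-in-this-context upward;
        # an ignored impression nudges it downward.
        stats = counts[(context_key, item_id)]
        if clicked:
            stats[0] += 1
        else:
            stats[1] += 1

Because the update is just two counter increments, it can run inline on the request path, which is what lets the system shift its behavior within a single session.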