
Contextual Bandits: Balancing Exploration and Exploitation

THE EXPLORATION PROBLEM

If you only show items that performed well historically, you never learn about new items or changing preferences. A new product with zero clicks looks worse than an old product with a 10% click rate, even if the new one would perform better. Contextual bandits solve this by systematically trying uncertain options to gather information while still showing good items most of the time.

HOW CONTEXTUAL BANDITS WORK

Given context (user features, session state, time of day), the bandit predicts reward (click, purchase) for each possible action (item to show). It then selects an action balancing expected reward against uncertainty. High uncertainty items get explored; high expected reward items get exploited. Over time, uncertainty decreases as the system gathers data.
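A minimal sketch of that selection step in Python follows; the `model.predict` interface and the simple additive uncertainty bonus are illustrative assumptions, not a specific library API:

```python
def select_action(context, items, model):
    """Pick the item that best trades off expected reward against uncertainty.

    `model.predict(context, item)` is an assumed interface that returns an
    (expected_reward, uncertainty) pair for one candidate item.
    """
    best_item, best_score = None, float("-inf")
    for item in items:
        mean, uncertainty = model.predict(context, item)
        score = mean + uncertainty  # optimistic score: uncertain items get a boost
        if score > best_score:
            best_item, best_score = item, score
    return best_item
```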

COMMON ALGORITHMS

Epsilon-greedy: Show the best item 90% of the time and a random item 10% of the time. Simple, but it wastes exploration budget on clearly bad items.
Thompson Sampling: Sample from the probability distribution of each item's expected reward and pick the winner. This naturally explores uncertain items while exploiting confident ones.
Upper Confidence Bound (UCB): Pick the item with the highest optimistic estimate (expected reward plus an uncertainty bonus).
All three converge to optimal behavior; Thompson Sampling often works best in practice.
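For concreteness, here is a context-free sketch of all three rules using Beta posteriors over per-item click-through rates. The counts are made up, and a real contextual bandit would condition these estimates on user and session features rather than keeping one counter per item:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-item click / no-click counts; Beta(successes + 1, failures + 1)
# is the posterior over each item's click-through rate.
successes = np.array([12, 0, 4])
failures = np.array([88, 0, 16])

def epsilon_greedy(eps=0.1):
    # 10% of the time pick at random, otherwise pick the best empirical CTR.
    if rng.random() < eps:
        return int(rng.integers(len(successes)))
    ctr = successes / np.maximum(successes + failures, 1)
    return int(np.argmax(ctr))

def thompson_sampling():
    # Draw one plausible CTR per item from its posterior; show the winner of the draw.
    samples = rng.beta(successes + 1, failures + 1)
    return int(np.argmax(samples))

def ucb():
    # Empirical CTR plus an uncertainty bonus that shrinks as an item is shown more.
    n = np.maximum(successes + failures, 1)
    total = n.sum()
    ctr = successes / n
    bonus = np.sqrt(2 * np.log(total) / n)
    return int(np.argmax(ctr + bonus))
```

Note how the brand-new item (zero clicks, zero impressions) gets a large uncertainty bonus under UCB and a wide posterior under Thompson Sampling, so it still gets shown; epsilon-greedy only reaches it through the 10% random slot.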

💡 Key Insight: Bandits are not about finding the single best item. They continuously adapt as user preferences and item performance change.

REAL-TIME UPDATES

Unlike batch models, bandits update their beliefs after every interaction. A click increases the expected reward for that item in that context; an ignore decreases it. These updates happen in milliseconds, allowing the system to adapt within a single session.
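Continuing the hypothetical Beta-posterior counts from the sketch above, the per-interaction update is just a counter bump, which is why it can happen in milliseconds; a contextual system would key or condition these counts (or incrementally update model weights) on the context:

```python
def record_outcome(item, clicked):
    # Fold the observed outcome straight back into the posterior counts,
    # so the very next request already sees the updated belief.
    if clicked:
        successes[item] += 1  # click: expected reward for this item rises
    else:
        failures[item] += 1   # ignore: expected reward for this item falls
```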

💡 Key Takeaways
Bandits explore uncertain options while exploiting known good items to avoid missing better alternatives
Given context, predict reward for each item, then balance expected reward against uncertainty
Epsilon-greedy (10% random), Thompson Sampling (sample from distributions), UCB (optimistic estimates)
Thompson Sampling often works best in practice; all three converge to optimal behavior
Update beliefs after every interaction in milliseconds, adapting within a single session
📌 Interview Tips
1. Explain the new item problem: zero clicks looks worse than a 10% CTR, but the new item might be better
2. Compare algorithms: epsilon-greedy wastes exploration, Thompson Sampling explores intelligently
3. Discuss real-time updates: a click increases expected reward, an ignore decreases it, all in milliseconds