
What are Contextual Bandits in Real-Time Personalization?

Contextual bandits are a one-step reinforcement learning approach that solves a fundamental problem in personalization: when you show a user one recommendation, you only learn whether that item was good, not whether the alternatives would have been better. This is called the partial feedback problem. The system observes context (user features, session signals, time of day), chooses one action from a set of candidates (which video to recommend, which creative to show), and immediately receives a reward for only that chosen action (click, dwell time, conversion). The core logged tuple is (context x, chosen action a, propensity p, reward r). The propensity p is critical: it is the probability with which the policy selected action a. Without an accurate p, you cannot do unbiased offline policy evaluation later. Yahoo reported a 12 to 20 percent relative Click Through Rate (CTR) lift on their front page using LinUCB contextual bandits compared to non-contextual baselines, while keeping latency under 100ms.

Contextual bandits sit between A/B testing and full reinforcement learning. A/B testing finds one global winner slowly (weeks of even traffic splits). Full RL handles multi-step decisions but requires complex credit assignment and struggles with stability. Contextual bandits assume your action does not materially affect future states beyond the immediate reward, making them well suited to high-throughput, low-latency decisions such as which row to show on the Netflix homepage or which notification variant to send.

In production, bandits typically run as a thin decision layer. A candidate generator produces 5 to 50 eligible options using heavier models. The bandit then scores these candidates using lightweight models (linear scorers or small neural networks) and picks one, balancing exploitation (choose the best known option) against exploration (try uncertain options to learn). Typical latency budget: 5 to 20ms for the bandit policy call within a 50 to 150ms total page render budget.
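As a sketch of that decision-layer pattern, the snippet below shows an epsilon-greedy linear bandit that scores a small candidate set against session context and logs the (context, action, propensity, reward) tuple. The feature names, candidate IDs, weights, and epsilon value are illustrative assumptions rather than details from any production system; epsilon-greedy is used here because it yields an explicit propensity for every choice, whereas UCB-style policies such as LinUCB add an uncertainty bonus to the score and are typically evaluated with replay-style methods instead.

```python
import random

EPSILON = 0.1  # exploration rate (illustrative choice)

def score(weights, features):
    """Linear score: dot product of learned weights and context/action features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def choose_action(context, candidates, weights):
    """Epsilon-greedy policy over a small candidate set.

    Returns the chosen candidate and the propensity (probability the policy
    assigned to that choice), which must be logged for off-policy evaluation.
    """
    scores = {c: score(weights, {**context, f"item:{c}": 1.0}) for c in candidates}
    best = max(scores, key=scores.get)

    if random.random() < EPSILON:
        chosen = random.choice(candidates)  # explore: uniform over candidates
    else:
        chosen = best                       # exploit: highest-scoring candidate

    # Probability this policy picks `chosen`: the uniform exploration mass,
    # plus the full greedy mass if `chosen` happens to be the greedy choice.
    propensity = EPSILON / len(candidates) + (1.0 - EPSILON) * (chosen == best)
    return chosen, propensity

# Session-based context: recent signals rather than only a long-term profile.
context = {
    "last_click:comedy": 1.0,
    "dwell_seconds_norm": 0.4,
    "device:mobile": 1.0,
    "hour_of_day_norm": 0.9,
}
candidates = ["row_trending", "row_continue_watching", "row_because_you_watched"]
weights = {"last_click:comedy": 0.8, "item:row_trending": 0.3}  # toy learned weights

action, p = choose_action(context, candidates, weights)
reward = 1.0  # observed later, e.g. click = 1.0, no click = 0.0
logged = {"context": context, "action": action, "propensity": p, "reward": reward}
print(logged)
```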
💡 Key Takeaways
Contextual bandits solve partial feedback: you only see the reward for the displayed option, not for the alternatives. The logged propensity p enables unbiased learning from this biased data (see the inverse propensity scoring sketch after this list).
Production latency budgets: 5 to 20ms p95 for policy decision, 50 to 150ms total UI render. Yahoo achieved 12 to 20 percent CTR lift within these constraints.
Action set size typically 5 to 50 candidates per request after upstream filtering. Keeps decision latency predictable and models lightweight.
Session-based context emphasizes recent signals: last N clicks, scrolls, dwell time, referrer, device, locale, time of day rather than only long-term user profiles.
Trade-off versus A/B testing: bandits give faster learning and per-context personalization but require propensity logging infrastructure and more complex analysis.
Common throughput: 10,000 to 100,000+ decisions per second per service, billions of logged decisions per day across the fleet at companies like Spotify and Meta.
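To make the propensity takeaway concrete, here is a minimal sketch of inverse propensity scoring (IPS), the standard estimator that uses logged propensities to evaluate a new policy offline. The log entries and the `always_trending` target policy are hypothetical examples shaped like the (context, action, propensity, reward) tuple above, not real data.

```python
def ips_estimate(logs, target_policy):
    """Inverse propensity scoring: estimate the average reward a new policy
    would have earned on logged traffic, by reweighting each logged reward by
    how much more (or less) likely the new policy was to take the logged action.
    """
    total = 0.0
    for entry in logs:
        # Probability the *new* policy assigns to the logged action in this context.
        target_prob = target_policy(entry["context"], entry["action"])
        total += entry["reward"] * target_prob / entry["propensity"]
    return total / len(logs)

# Example target policy: always show "row_trending".
def always_trending(context, action):
    return 1.0 if action == "row_trending" else 0.0

logs = [
    {"context": {"device:mobile": 1.0}, "action": "row_trending",
     "propensity": 0.9, "reward": 1.0},
    {"context": {"device:mobile": 0.0}, "action": "row_continue_watching",
     "propensity": 0.05, "reward": 0.0},
]

# Unbiased as long as logged propensities are accurate and the logging policy
# gave nonzero probability to the actions the target policy would take.
print(ips_estimate(logs, always_trending))
```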
📌 Examples
Netflix uses bandits for artwork and row selection decisions where a small set of alternatives must be chosen per session. Decision layer runs in under 100ms p95 on top of heavier candidate generators.
Microsoft Personalizer (Vowpal Wabbit) provides a contextual bandit service supporting sub-50ms p95 latency for inline UI, handling thousands to tens of thousands of rank calls per second with action sets of 5 to 50.
Spotify home page allocation uses contextual bandits to choose which shelf or row variant to show per session, with server side decisioning under 20 to 50ms p95 and tens of billions of decisions daily.