
Production Architecture: Sampler, Parameter Store, and Streaming Feedback

Core Concept
Contextual bandits extend multi-armed bandits by incorporating context (user features, item features, time) into the decision. The reward depends not just on which item you show, but on the context in which you show it.

Why Context Matters

An item that works well for one user segment may fail for another. Morning recommendations differ from evening. Mobile differs from desktop. Contextual bandits learn these variations. Instead of one reward estimate per item, they learn a function mapping (context, item) to reward.

LinUCB Algorithm

Model the reward as a linear function of the context: reward = context · weights + noise. For each item, maintain a weight vector and an uncertainty estimate. Select the item with the highest upper confidence bound (UCB): predicted reward plus an exploration bonus scaled by the uncertainty. Update the weights after observing the actual reward.
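The steps above can be sketched as a minimal LinUCB implementation. Class and parameter names are illustrative; in production the per-arm matrices would live in the parameter store rather than in process memory.

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB sketch: one (A, b) pair per arm.

    A accumulates outer products of observed contexts (plus a ridge prior);
    b accumulates reward-weighted contexts.
    """

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha                               # exploration strength
        self.A = [np.eye(dim) for _ in range(n_arms)]    # d x d per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # d per arm

    def select(self, x):
        """Return the arm with the highest UCB for context vector x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge weight estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # uncertainty bonus
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        """Fold one (context, reward) observation into the chosen arm."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

As the arm accumulates observations in a given direction of context space, `x @ A_inv @ x` shrinks and the exploration bonus fades, leaving the predicted reward to dominate selection.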

Neural Contextual Bandits

Replacing the linear model with a neural network gives more expressive reward prediction, but uncertainty estimation becomes harder. Common approaches: dropout-based uncertainty, ensemble methods, or neural tangent kernel approximations. More powerful, but more complex to implement and tune.
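The ensemble approach can be sketched as follows. For brevity the ensemble members here are ridge-regression models fit on bootstrap resamples rather than neural networks; only the disagreement-as-uncertainty idea carries over, and all names are illustrative.

```python
import numpy as np

# Ensemble-based uncertainty sketch: train several members on bootstrap
# resamples, then score a context by mean prediction plus disagreement.
rng = np.random.default_rng(0)

def fit_member(X, y, l2=1.0):
    """Ridge fit on a bootstrap resample -> one ensemble member's weights."""
    idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
    Xb, yb = X[idx], y[idx]
    d = X.shape[1]
    return np.linalg.solve(Xb.T @ Xb + l2 * np.eye(d), Xb.T @ yb)

def ucb_score(members, x, alpha=1.0):
    """Mean prediction plus std across the ensemble as exploration bonus."""
    preds = np.array([w @ x for w in members])
    return float(preds.mean() + alpha * preds.std())

# Train a 5-member ensemble on noisy linear data.
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)
members = [fit_member(X, y) for _ in range(5)]
```

Contexts far from the training data produce larger disagreement across members, so the bonus term pushes the bandit to explore exactly where the model is least sure. With neural members the fitting step changes, but the scoring rule stays the same.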

💡 Key Trade-off: Contextual bandits require more data than context-free bandits because they learn separate reward functions per context. With sparse data per context, they may underperform simpler approaches. Use when contexts have distinct reward patterns and you have enough data per context.
💡 Key Takeaways
- Sampler must add less than 10ms of latency. Thompson Sampling and UCB are O(1) computation per arm; cache frequently accessed arm statistics in local memory.
- Parameter store needs high-throughput point reads with p99 under 5ms. State per arm is tiny (two integers for Beta-Bernoulli); use Redis with atomic INCR or similar.
- Feedback pipeline is fully async and decoupled from serving: events flow through a message queue, get filtered and validated, then update the parameter store. Nothing blocks the serving path.
- Convergence depends on traffic volume and arm similarity. With 1000 impressions per day per arm, typical bandits converge in 1-2 weeks. Monitor arm selection frequencies for convergence signals.
- Three-component separation (sampler, store, feedback) enables independent scaling and deployment. Algorithm changes do not require feedback infrastructure changes.
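The sampler and parameter-store takeaways can be sketched together: a Thompson Sampling loop whose entire per-arm state is two counters. Here the counters live in local memory; a production system would keep them in Redis and increment them through the async feedback pipeline. Names are illustrative.

```python
import random

class BetaBernoulliBandit:
    """Thompson Sampling over Beta-Bernoulli arms.

    Per-arm state is just two integers (successes + 1, failures + 1),
    matching the tiny parameter-store footprint described above.
    """

    def __init__(self, n_arms):
        self.state = [[1, 1] for _ in range(n_arms)]  # uniform Beta(1, 1) prior

    def select(self):
        """O(1) per arm: draw one sample from each posterior, pick the max."""
        samples = [random.betavariate(a, b) for a, b in self.state]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        """Feedback path: a click increments alpha, a skip increments beta."""
        self.state[arm][0 if reward else 1] += 1
```

Because `update` is a single counter increment, it maps directly onto an atomic INCR in a key-value store, which is what keeps the feedback pipeline decoupled from the serving path.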
📌 Interview Tips
1. When asked about bandit architecture: explain the three-component separation - sampler (samples from posteriors, <10ms), parameter store (maintains arm statistics), feedback pipeline (updates posteriors).
2. For scalability: mention that parameter state per arm is tiny (two integers for Beta-Bernoulli), enabling thousands of bandits (per-position, per-segment) with minimal memory.
3. When discussing latency: explain that the sampler adds <10ms to the request path; heavier computation (posterior updates, convergence monitoring) happens asynchronously in the feedback pipeline.