Production Architecture: Pipelines, Serving, and Evaluation
REAL-TIME FEATURE PIPELINE
Session features must be computed and served within the latency budget. A typical architecture streams user events into a feature store, computes aggregates (last 5 viewed categories, time since last click, session length), and serves them with single-digit-millisecond latency. The feature store keeps a sliding window of recent events per user, typically the last 50 to 100 actions or the last 30 minutes.
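The sliding-window design above can be sketched as a small in-memory store. This is an illustrative toy, not a production feature store; the class and field names (`SessionFeatureStore`, `record`, `features`) and the specific event types (`view`, `click`) are assumptions, and a real system would back this with a streaming platform and a low-latency key-value store.

```python
import time
from collections import deque

WINDOW_SECONDS = 30 * 60   # 30-minute sliding window, per the text
MAX_EVENTS = 100           # upper end of the 50-100 action range

class SessionFeatureStore:
    """Toy per-user sliding window of recent events (assumed API)."""

    def __init__(self):
        # user_id -> deque of (timestamp, event_type, category)
        self._events = {}

    def record(self, user_id, event_type, category=None, ts=None):
        ts = ts if ts is not None else time.time()
        dq = self._events.setdefault(user_id, deque(maxlen=MAX_EVENTS))
        dq.append((ts, event_type, category))

    def features(self, user_id, now=None):
        now = now if now is not None else time.time()
        dq = self._events.get(user_id, deque())
        # Evict events that fell out of the time window.
        while dq and now - dq[0][0] > WINDOW_SECONDS:
            dq.popleft()
        viewed_categories = []
        last_click_ts = None
        for ts, etype, cat in dq:
            if etype == "view" and cat is not None:
                viewed_categories.append(cat)
            elif etype == "click":
                last_click_ts = ts
        return {
            "last_5_categories": viewed_categories[-5:],
            "time_since_last_click": (now - last_click_ts)
                                     if last_click_ts is not None else None,
            "session_length": len(dq),
        }
```

The `deque(maxlen=...)` gives the action-count cap for free; the time-based eviction happens lazily at read time, which keeps writes cheap.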
MODEL SERVING ARCHITECTURE
For session models, inference happens on every action. Keep the model hot in memory, batch predictions over the candidate set, and cache embeddings aggressively. For bandits, maintain a context-to-action mapping that updates after every reward signal. Thompson Sampling requires sampling from posterior distributions; precompute samples for common contexts to reduce latency.
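One way to realize the precomputed-samples idea is a Beta-Bernoulli Thompson Sampler that fills a pool of posterior draws per hot context ahead of time, so the request path only pops a sample instead of calling the random-number generator. This is a minimal sketch under those assumptions; the class name, pool size, and invalidation-on-update policy are all illustrative choices, not the document's prescribed design.

```python
import random
from collections import defaultdict

class PrecomputedThompson:
    """Beta-Bernoulli Thompson Sampling with per-context sample pools
    (illustrative sketch; names and structure are assumptions)."""

    def __init__(self, actions, pool_size=1000, seed=0):
        self.actions = list(actions)
        self.pool_size = pool_size
        self.rng = random.Random(seed)
        # (context, action) -> [alpha, beta], i.e. Beta(1, 1) prior
        self.stats = defaultdict(lambda: [1, 1])
        self.pools = {}  # context -> {action: list of precomputed draws}

    def precompute(self, context):
        """Fill the sample pool for a hot context (run off the request path)."""
        self.pools[context] = {
            a: [self.rng.betavariate(*self.stats[(context, a)])
                for _ in range(self.pool_size)]
            for a in self.actions
        }

    def select(self, context):
        pool = self.pools.get(context)
        samples = {}
        for a in self.actions:
            if pool and pool[a]:
                samples[a] = pool[a].pop()   # fast path: precomputed draw
            else:
                samples[a] = self.rng.betavariate(*self.stats[(context, a)])
        return max(samples, key=samples.get)

    def update(self, context, action, reward):
        alpha_beta = self.stats[(context, action)]
        alpha_beta[0 if reward else 1] += 1
        # The posterior changed, so cached draws are stale; drop them and
        # let a background job call precompute() again.
        self.pools.pop(context, None)
```

Invalidating the pool on every update trades some cache efficiency for correctness; a looser design could tolerate slightly stale samples between refreshes.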
OFFLINE POLICY EVALUATION
Before deploying a new policy, estimate its performance from historical logs. Inverse propensity scoring reweights each logged reward by how likely the new policy would have been to take the same action. If the new policy would show item X in 50% of cases but the old policy showed it in 10%, multiply that reward by 5. This yields unbiased estimates without a live A/B test, provided the logged propensities are accurate and the old policy assigned nonzero probability to every action the new policy might take; when those weights get large, the estimate can have high variance.
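The reweighting above is a one-line estimator once the logs carry the logging propensity. A minimal sketch, assuming logs are tuples of (context, action, reward, logging_propensity) and `new_policy_prob` returns the new policy's probability of the logged action (both names are hypothetical):

```python
def ips_estimate(logs, new_policy_prob):
    """Inverse propensity scoring estimate of a new policy's average reward.

    logs: iterable of (context, action, reward, logging_propensity),
          where logging_propensity is the OLD policy's probability of
          the logged action. Assumes propensities are > 0.
    """
    total = 0.0
    n = 0
    for context, action, reward, logged_p in logs:
        weight = new_policy_prob(context, action) / logged_p
        total += weight * reward
        n += 1
    return total / n

# The worked example from the text: old policy showed item X with
# probability 0.1, the new policy would show it with probability 0.5,
# so the logged reward is scaled by 0.5 / 0.1 = 5.
logs = [("ctx", "X", 1.0, 0.1)]
estimate = ips_estimate(logs, lambda c, a: 0.5)  # -> 5.0
```

In practice the raw estimator is often replaced by a clipped or self-normalized variant to tame the variance of large weights.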
GRADUAL ROLLOUT
Deploy new policies to 1% of traffic first. Monitor click-through rate, conversion, and revenue per session. If metrics are stable after 24 hours, increase to 10%, then 50%, then 100%. This limits the damage from bugs or unexpected behavior.
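The staged schedule above can be encoded as a small gating function that a deployment job evaluates periodically. This is a hedged sketch: the function name, the boolean `metrics_stable` input (which would really come from the CTR/conversion/revenue monitors), and the roll-back-to-zero choice are illustrative assumptions.

```python
STAGES = [0.01, 0.10, 0.50, 1.00]  # 1% -> 10% -> 50% -> 100%

def next_traffic_fraction(current, metrics_stable, hours_at_stage):
    """Advance the rollout one stage after 24 stable hours;
    pull the policy entirely if metrics regress (illustrative policy)."""
    if not metrics_stable:
        return 0.0  # kill switch: revert to the old policy
    if hours_at_stage < 24:
        return current  # keep soaking at the current stage
    i = STAGES.index(current)
    return STAGES[min(i + 1, len(STAGES) - 1)]
```

Holding each stage for a full 24 hours also captures daily traffic seasonality before the exposure grows.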