Architecture and Implementation Patterns for Production Bandits
Production contextual bandit systems follow a layered architecture. The candidate generator or upstream ranker produces a small eligible set (5 to 50 actions) using heavier models (collaborative filtering, embeddings, business rules). This keeps the bandit's action space manageable and latency low. The bandit policy service is a thin, stateless layer that assembles context features, scores actions with a lightweight model (linear or small multilayer perceptron), applies exploration, and returns the chosen action plus propensity. Target latency: 5 to 20ms p95 for the policy call.
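To make the shape of the policy call concrete, here is a minimal sketch assuming a per-action linear scorer and epsilon-greedy exploration; the names (`choose_action`, `PolicyDecision`) and the weight-lookup layout are illustrative, not from any particular system.

```python
import random
from dataclasses import dataclass

import numpy as np


@dataclass
class PolicyDecision:
    action_id: str
    propensity: float   # probability the policy assigned to the chosen action
    model_version: str


def choose_action(context: np.ndarray,
                  candidate_ids: list[str],
                  weights: dict[str, np.ndarray],
                  epsilon: float = 0.03,
                  model_version: str = "v1") -> PolicyDecision:
    """Score the 5 to 50 eligible actions, explore with epsilon-greedy,
    and return the chosen action with its logging propensity."""
    scores = {a: float(context @ weights[a]) for a in candidate_ids}
    greedy = max(scores, key=scores.get)
    n = len(candidate_ids)

    # Exploration: with probability epsilon, pick uniformly at random.
    action = random.choice(candidate_ids) if random.random() < epsilon else greedy

    # Propensity of the realized action under epsilon-greedy.
    propensity = (1 - epsilon) + epsilon / n if action == greedy else epsilon / n
    return PolicyDecision(action, propensity, model_version)
```

Recording the propensity at decision time is what later makes the offline policy comparisons unbiased, so it is returned alongside the action rather than recomputed downstream.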
Context assembly pulls session features (last N interactions, device, locale), precomputed user embeddings from cache (Redis or in memory), and precomputed action features (item metadata, embeddings) from a feature store. Avoid fanout: fetch action features in batch, or precompute per-action scores offline and refresh them every 5 to 15 minutes. For linear models, maintain sufficient statistics (the A matrix and b vector of LinUCB) in shared memory or a distributed cache; updates are O(d²) per action and feasible for d under 100. For neural scorers, serve via TensorFlow Serving or ONNX Runtime with batching; this adds 5 to 10ms of latency but supports more expressive models.
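A minimal sketch of the LinUCB per-action state under these assumptions: A accumulates x xᵀ on top of a ridge prior, b accumulates reward-weighted contexts, and the update is O(d²). The class name `LinUCBArm` and the `alpha` exploration parameter are illustrative.

```python
import numpy as np


class LinUCBArm:
    """Per-action sufficient statistics for LinUCB with context dimension d."""

    def __init__(self, d: int, alpha: float = 1.0):
        self.A = np.eye(d)      # d x d matrix: ridge prior plus sum of x xᵀ
        self.b = np.zeros(d)    # d-vector: sum of reward-weighted contexts
        self.alpha = alpha      # width of the confidence bonus

    def score(self, x: np.ndarray) -> float:
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b                        # ridge-regression estimate
        bonus = self.alpha * np.sqrt(x @ A_inv @ x)   # upper-confidence bonus
        return float(theta @ x + bonus)

    def update(self, x: np.ndarray, reward: float) -> None:
        # O(d^2) per observation: cheap to keep in memory for d under 100.
        self.A += np.outer(x, x)
        self.b += reward * x
```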
Logging and reward collection use a decision ID to join the decision event (context x, action a, propensity p, timestamp, model version) with reward events (click, dwell, conversion) arriving asynchronously. Store in a replayable event stream (Kafka, Kinesis) for offline policy evaluation and model retraining. Support multiple reward signals and windows: immediate (10 second click), short term (5 minute dwell), long term (24 hour conversion). Online learning systems consume the reward stream and update model state in near real time (seconds to minutes) or micro batches (5 to 15 minute aggregation).
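The following sketch shows one plausible shape for the decision and reward events and the offline join on the decision ID; the field names and helper functions are assumptions for illustration, not a fixed schema.

```python
import time
import uuid


def make_decision_event(context_features: dict, action_id: str,
                        propensity: float, model_version: str) -> dict:
    """Event emitted at decision time; decision_id is the join key."""
    return {
        "decision_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "context": context_features,
        "action": action_id,
        "propensity": propensity,
        "model_version": model_version,
    }


def make_reward_event(decision_id: str, signal: str, value: float) -> dict:
    """Reward events arrive asynchronously; signal names one of the reward
    windows, e.g. 'click_10s', 'dwell_5m', 'conversion_24h'."""
    return {
        "decision_id": decision_id,
        "timestamp": time.time(),
        "signal": signal,
        "value": value,
    }


def join_rewards(decisions: list[dict], rewards: list[dict]) -> list[dict]:
    """Offline join of the replayed streams into (x, a, p, rewards) rows."""
    by_id: dict[str, list[dict]] = {}
    for r in rewards:
        by_id.setdefault(r["decision_id"], []).append(r)
    return [{**d, "rewards": by_id.get(d["decision_id"], [])} for d in decisions]
```

In production the same join runs as a streaming aggregation (for online learning) and as a batch replay over the event log (for retraining and evaluation); keeping the decision ID in both paths is what keeps the two views consistent.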
Governance includes a baseline holdout (5 to 10 percent of traffic on a fixed policy for ground truth comparison), guardrail monitors on key metrics (reward rate, diversity, long term KPIs), and circuit breakers that revert to the baseline policy if metrics drop below thresholds. Replay and offline policy evaluation run daily: compare candidate policies using Inverse Propensity Scoring (IPS) or Doubly Robust estimators, and promote a candidate only if its lift is significant and guardrails pass. Typical fleet scale: tens of thousands of decisions per second per service, sharded by user or session key, with stateless policy servers pulling model state from a registry.
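A minimal sketch of IPS (plus its self-normalized variant) over the replayed logs, assuming each row carries the logged context, action, propensity, and a single scalar reward, and that `target_prob` returns the candidate policy's probability of the logged action given the logged context. These names are illustrative; a Doubly Robust estimator would additionally use a learned reward model.

```python
import numpy as np


def ips_estimate(rows: list[dict], target_prob) -> float:
    """Inverse Propensity Scoring: mean of (pi_target(a|x) / p_logged) * reward."""
    weights = np.array([target_prob(r["context"], r["action"]) / r["propensity"]
                        for r in rows])
    rewards = np.array([r["reward"] for r in rows])
    return float(np.mean(weights * rewards))


def snips_estimate(rows: list[dict], target_prob) -> float:
    """Self-normalized IPS: divides by the weight sum, trading a little bias
    for much lower variance when importance weights are large."""
    weights = np.array([target_prob(r["context"], r["action"]) / r["propensity"]
                        for r in rows])
    rewards = np.array([r["reward"] for r in rows])
    return float(np.sum(weights * rewards) / np.sum(weights))
```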
💡 Key Takeaways
• Layered architecture: candidate generator (heavier models) produces 5 to 50 eligible actions, bandit policy service (lightweight) chooses one in 5 to 20ms p95. Avoids large action space fanout.
• Context assembly: session features (last N interactions, device), cached user embeddings, precomputed action features. Batch fetch or refresh every 5 to 15 minutes to avoid per request fanout.
• Linear models: maintain an A (covariance) matrix and b (reward) vector per action, O(d²) updates, feasible in memory for d under 100. Microsecond inference. Neural scorers: add 5 to 10ms via TensorFlow Serving or ONNX Runtime.
• Logging: decision ID joins the decision event (x, a, p, model version) with asynchronous reward events (click, dwell, conversion). Store in Kafka or Kinesis for replay and offline policy evaluation.
• Governance: 5 to 10 percent baseline holdout, guardrail monitors on reward rate and diversity, circuit breakers revert to baseline if metrics drop. Daily offline policy evaluation promotes candidates if lift is significant.
• Typical scale: 10,000 to 100,000+ decisions per second per service, billions per day fleet wide. Shard by user or session key, stateless policy servers, model state in registry or distributed cache.
📌 Examples
Microsoft Personalizer architecture: stateless policy servers pull model from Azure Blob, serve via REST API in sub 50ms p95. Logs to Event Hubs, online learner consumes and updates model every 10 minutes, publishes to blob.
Spotify bandit stack: candidate generator (collaborative filtering) produces 50 shelves, bandit policy service (linear scorer, epsilon greedy ε=0.03) chooses 6 in 15ms p95. Logs to Kafka, Flink job aggregates rewards, updates model every 10 minutes.
Netflix row selection: candidate rows from ranker (50 options), session features from Redis (128d user embedding, last 5 interactions), bandit chooses 1 row in 8ms p95. Daily OPE on past week, promotes new policy if IPS lift exceeds 2 percent and skip rate guardrail passes.