Production Architecture: Pipelines, Serving, and Evaluation
REAL-TIME FEATURE PIPELINE
Session features must be computed and served within the latency budget. A typical architecture streams user events into a feature store, computes aggregates (last 5 viewed categories, time since last click, session length), and serves them with single-digit-millisecond latency. The feature store keeps a sliding window of recent events per user, typically the last 50 to 100 actions or the last 30 minutes.
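The sliding-window design above can be sketched as a small in-memory store. This is an illustrative toy, not a production feature store; the class and field names (`SessionFeatureStore`, `record`, `features`) and the specific event types (`view`, `click`) are assumptions, and a real system would back this with a streaming platform and a low-latency key-value store.

```python
import time
from collections import deque

WINDOW_SECONDS = 30 * 60   # 30-minute sliding window, per the text
MAX_EVENTS = 100           # upper end of the 50-100 action range

class SessionFeatureStore:
    """Toy per-user sliding window of recent events (assumed API)."""

    def __init__(self):
        # user_id -> deque of (timestamp, event_type, category)
        self._events = {}

    def record(self, user_id, event_type, category=None, ts=None):
        ts = ts if ts is not None else time.time()
        dq = self._events.setdefault(user_id, deque(maxlen=MAX_EVENTS))
        dq.append((ts, event_type, category))

    def features(self, user_id, now=None):
        now = now if now is not None else time.time()
        dq = self._events.get(user_id, deque())
        # Evict events that fell out of the time window.
        while dq and now - dq[0][0] > WINDOW_SECONDS:
            dq.popleft()
        viewed_categories = []
        last_click_ts = None
        for ts, etype, cat in dq:
            if etype == "view" and cat is not None:
                viewed_categories.append(cat)
            elif etype == "click":
                last_click_ts = ts
        return {
            "last_5_categories": viewed_categories[-5:],
            "time_since_last_click": (now - last_click_ts)
                                     if last_click_ts is not None else None,
            "session_length": len(dq),
        }
```

The `deque(maxlen=...)` gives the action-count cap for free; the time-based eviction happens lazily at read time, which keeps writes cheap.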
MODEL SERVING ARCHITECTURE
For session models, inference happens on every action. Keep the model hot in memory, batch predictions over the candidate set, and cache embeddings aggressively. For bandits, maintain a context-to-action mapping that updates after every reward signal. Thompson Sampling requires sampling from posterior distributions; precompute samples for common contexts to reduce latency.
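One way to realize the precomputed-samples idea is a Beta-Bernoulli Thompson Sampler that fills a pool of posterior draws per hot context ahead of time, so the request path only pops a sample instead of calling the random-number generator. This is a minimal sketch under those assumptions; the class name, pool size, and invalidation-on-update policy are all illustrative choices, not the document's prescribed design.

```python
import random
from collections import defaultdict

class PrecomputedThompson:
    """Beta-Bernoulli Thompson Sampling with per-context sample pools
    (illustrative sketch; names and structure are assumptions)."""

    def __init__(self, actions, pool_size=1000, seed=0):
        self.actions = list(actions)
        self.pool_size = pool_size
        self.rng = random.Random(seed)
        # (context, action) -> [alpha, beta], i.e. Beta(1, 1) prior
        self.stats = defaultdict(lambda: [1, 1])
        self.pools = {}  # context -> {action: list of precomputed draws}

    def precompute(self, context):
        """Fill the sample pool for a hot context (run off the request path)."""
        self.pools[context] = {
            a: [self.rng.betavariate(*self.stats[(context, a)])
                for _ in range(self.pool_size)]
            for a in self.actions
        }

    def select(self, context):
        pool = self.pools.get(context)
        samples = {}
        for a in self.actions:
            if pool and pool[a]:
                samples[a] = pool[a].pop()   # fast path: precomputed draw
            else:
                samples[a] = self.rng.betavariate(*self.stats[(context, a)])
        return max(samples, key=samples.get)

    def update(self, context, action, reward):
        alpha_beta = self.stats[(context, action)]
        alpha_beta[0 if reward else 1] += 1
        # The posterior changed, so cached draws are stale; drop them and
        # let a background job call precompute() again.
        self.pools.pop(context, None)
```

Invalidating the pool on every update trades some cache efficiency for correctness; a looser design could tolerate slightly stale samples between refreshes.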
OFFLINE POLICY EVALUATION
Before deploying a new policy, estimate its performance from historical logs. Inverse propensity scoring reweights each logged reward by how likely the new policy would have been to take the same action. If the new policy would show item X in 50% of cases but the old policy showed it in 10%, multiply that reward by 5. This yields unbiased estimates without a live A/B test, provided the logged propensities are accurate and the old policy assigned nonzero probability to every action the new policy might take; when those weights get large, the estimate can have high variance.
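The reweighting above is a one-line estimator once the logs carry the logging propensity. A minimal sketch, assuming logs are tuples of (context, action, reward, logging_propensity) and `new_policy_prob` returns the new policy's probability of the logged action (both names are hypothetical):

```python
def ips_estimate(logs, new_policy_prob):
    """Inverse propensity scoring estimate of a new policy's average reward.

    logs: iterable of (context, action, reward, logging_propensity),
          where logging_propensity is the OLD policy's probability of
          the logged action. Assumes propensities are > 0.
    """
    total = 0.0
    n = 0
    for context, action, reward, logged_p in logs:
        weight = new_policy_prob(context, action) / logged_p
        total += weight * reward
        n += 1
    return total / n

# The worked example from the text: old policy showed item X with
# probability 0.1, the new policy would show it with probability 0.5,
# so the logged reward is scaled by 0.5 / 0.1 = 5.
logs = [("ctx", "X", 1.0, 0.1)]
estimate = ips_estimate(logs, lambda c, a: 0.5)  # -> 5.0
```

In practice the raw estimator is often replaced by a clipped or self-normalized variant to tame the variance of large weights.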
GRADUAL ROLLOUT
Deploy new policies to 1% of traffic first. Monitor click-through rate, conversion, and revenue per session. If metrics are stable after 24 hours, increase to 10%, then 50%, then 100%. This limits the damage from bugs or unexpected behavior.
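The staged schedule above can be encoded as a small gating function that a deployment job evaluates periodically. This is a hedged sketch: the function name, the boolean `metrics_stable` input (which would really come from the CTR/conversion/revenue monitors), and the roll-back-to-zero choice are illustrative assumptions.

```python
STAGES = [0.01, 0.10, 0.50, 1.00]  # 1% -> 10% -> 50% -> 100%

def next_traffic_fraction(current, metrics_stable, hours_at_stage):
    """Advance the rollout one stage after 24 stable hours;
    pull the policy entirely if metrics regress (illustrative policy)."""
    if not metrics_stable:
        return 0.0  # kill switch: revert to the old policy
    if hours_at_stage < 24:
        return current  # keep soaking at the current stage
    i = STAGES.index(current)
    return STAGES[min(i + 1, len(STAGES) - 1)]
```

Holding each stage for a full 24 hours also captures daily traffic seasonality before the exposure grows.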