Production Implementation: Training Pipelines and Serving Architecture for Learning to Rank
The Feature Snapshot Problem
When a user clicks a result, you log (query, item, click). But what features did that item have at click time? If item X had 100 reviews when clicked but now has 500, which value should training use? The answer: 100. The user decided based on seeing 100 reviews. Training on 500 teaches the model that "items with 500 reviews get clicked" when users never saw that. This creates a gap between what the model learns and what it encounters at serving time.
The fix: snapshot feature values at impression time, not lookup time. When you log the click, also log every feature value used to rank that item. Store these snapshots in a columnar format for efficient training reads. This doubles storage costs but eliminates an entire class of bugs.
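A minimal sketch of impression-time snapshotting, with hypothetical names (`log_impression`, the record schema, and the feature keys are illustrative, not a specific library's API). The key idea: the ranker logs the exact feature values it scored, so a later click event only needs to carry (query, item_id) and training joins back to the snapshot instead of re-reading a live feature store.

```python
import json
import time

def log_impression(query: str, ranked_items: list, log_sink: list) -> None:
    """Log every ranked candidate with the feature values used to score it.

    Each item dict holds an 'item_id' and the 'features' dict the model
    actually saw at ranking time (e.g. review_count=100, not a later value).
    """
    record = {
        "ts": time.time(),
        "query": query,
        "impressions": [
            {"item_id": item["item_id"], "features": item["features"]}
            for item in ranked_items
        ],
    }
    # In production this would go to a log pipeline feeding columnar
    # storage; a plain list stands in for the sink here.
    log_sink.append(json.dumps(record))

sink = []
log_impression(
    "wireless headphones",
    [{"item_id": "X", "features": {"review_count": 100, "avg_rating": 4.2}}],
    sink,
)
```

When the click for item X arrives later, training reads `review_count=100` from this snapshot, even if the live store now says 500.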
Serving Latency Budget
Users expect results in under 200ms total. Search and ranking typically get 50ms of that budget at the 99th percentile (meaning 99 of 100 requests must finish within 50ms). That budget breaks down into feature lookup (1-5ms) from a fast key-value store and model scoring (10-30ms). The critical optimization: batch all candidates into one model call. Scoring 100 items individually takes 100 × 10ms = 1 second; batching takes 15-20ms. The difference is whether your system works or fails.
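The batching point can be made concrete with a sketch. This assumes a simple linear scorer for illustration; the same shape applies to any model that accepts a matrix of candidates in one call.

```python
import numpy as np

def score_batch(weights: np.ndarray, feature_matrix: np.ndarray) -> np.ndarray:
    """One model call for all candidates.

    feature_matrix has shape (n_candidates, n_features); the result is one
    score per candidate from a single matrix-vector product, rather than
    n_candidates separate model invocations.
    """
    return feature_matrix @ weights

weights = np.array([0.5, 1.5, -0.2])        # hypothetical learned weights
candidates = np.random.rand(100, 3)          # 100 candidates, 3 features each
scores = score_batch(weights, candidates)    # single call covers all 100
order = np.argsort(-scores)                  # rank best-first
```

The per-call overhead (serialization, network hop to the model server, framework dispatch) is paid once for the batch instead of once per item, which is where the 1 second vs 15-20ms gap comes from.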
Safe Model Deployment
Never deploy a new ranking model directly to 100% of traffic. First, run in shadow mode: score every request but do not use the scores. Compare shadow outputs to production. If they differ wildly, investigate. Then ramp gradually: 1% traffic, watch metrics for a day. 5%, another day. 20%, 50%, 100%. At each stage, if click rates or revenue drop more than 2-3%, automatically revert within minutes. This catches bugs that offline evaluation misses.
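The guardrail check at each ramp stage can be sketched as follows. The stage percentages mirror the ramp above; the metric names and the 3% threshold are illustrative assumptions, not a fixed standard.

```python
RAMP_STAGES = [0.01, 0.05, 0.20, 0.50, 1.00]  # fraction of traffic per stage
GUARDRAIL_DROP = 0.03  # auto-revert if a key metric falls more than 3%

def should_revert(baseline: dict, candidate: dict) -> bool:
    """Compare the candidate model's metrics against production baseline.

    Returns True if any guarded metric (hypothetical names here) has
    dropped past the threshold, signaling an automatic rollback.
    """
    for metric in ("click_rate", "revenue_per_session"):
        drop = (baseline[metric] - candidate[metric]) / baseline[metric]
        if drop > GUARDRAIL_DROP:
            return True
    return False

# Example: click rate fell from 0.300 to 0.285, a 5% drop -> revert.
baseline = {"click_rate": 0.300, "revenue_per_session": 1.00}
candidate = {"click_rate": 0.285, "revenue_per_session": 1.00}
```

In a real system this check would run continuously against live counters at each stage, with the revert wired to the traffic-splitting layer so rollback takes minutes, not a deploy cycle.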