
Production Scale Ranking: Latency Budgets and Logging Pipelines

Production search and recommendation systems operate under strict latency constraints while serving massive query volumes. Large properties serve 50 thousand to 200 thousand ranking requests per second across regions, with P95 latency budgets of 100 to 200 milliseconds for web search and under 50 milliseconds for autocomplete. Every millisecond counts when ranking thousands of candidates per request. The ranking flow starts with candidate retrieval pulling tens to thousands of items in milliseconds, followed by feature extraction and model scoring within the latency budget.

Each request logs impression events per item with position, item identifier, features, and a stable request identifier. Later, click events and dwell measurements are joined back to impressions using the request identifier. This creates delayed signals: clicks arrive within seconds, but dwell time requires waiting for return or abandonment events, and late events can trickle in for hours due to network delays or offline mobile sessions. Pipelines reconcile late events using watermarking and backfilling. A typical setup maintains a 24 hour late event handling window with idempotent processing to handle duplicates.

Streaming analytics compute approximate CTR and long dwell rates with minute level latency to detect regressions quickly, while batch jobs compute precise metrics daily with full late event reconciliation. On mobile, backgrounding and app switches complicate dwell measurement; systems use periodic heartbeats and app lifecycle events to approximate reading or viewing time.

Instrumentation must be robust to logging bugs and traffic anomalies. Missing impression events inflate CTR by shrinking the denominator. Incorrect experiment bucketing causes sample ratio mismatch, where treatment and control groups have unequal traffic despite randomization, biasing estimates. Mature systems detect sample ratio mismatch automatically and halt experiments when observed traffic deviates from expected allocation by more than a threshold. Bots and spam inflate clicks, so anomaly detection, user level caps, and robust metrics like long click rate (clicks with dwell greater than a threshold) filter noise.

The complete loop connects ranking, logging, metrics, and experimentation. Models produce rankings with sub 200 millisecond latency at tens of thousands of queries per second. Logging captures impressions and delayed engagement signals. Streaming and batch pipelines compute metrics per cohort and experiment arm. A/B tests measure online impact with statistical rigor. Offline metrics guide fast iteration cycles, while online metrics validate real user value and guard against gaming or misalignment. Historical launch data builds correlation maps between offline and online deltas, which become launch gates to avoid expensive A/B tests on models unlikely to improve online metrics.
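To make the impression join described above concrete, here is a minimal sketch of reconciling delayed clicks and dwell against logged impressions under a 24 hour watermark with idempotent deduplication. The event schema and field names are illustrative assumptions, not any real system's logging format; a production pipeline would do this in a streaming framework over partitioned logs rather than in memory.

```python
WATERMARK_SECONDS = 24 * 3600  # late-event handling window from the text

def join_engagement(impressions, engagement_events, now_ts):
    """Attach late-arriving clicks and dwell to logged impressions.

    Every event is assumed to carry the stable request_id that impressions
    were logged with, so delayed engagement can be joined back. A seen-set of
    (request_id, item_id, type, event_id) keys makes replays idempotent.
    """
    joined = {}   # (request_id, item_id) -> joined record
    seen = set()  # event-level dedup for idempotent reprocessing

    for imp in impressions:
        key = (imp["request_id"], imp["item_id"])
        joined[key] = {**imp, "clicked": False, "dwell_seconds": None}

    for ev in engagement_events:
        # Drop events outside the watermark window instead of waiting forever.
        if now_ts - ev["event_ts"] > WATERMARK_SECONDS:
            continue
        dedup_key = (ev["request_id"], ev["item_id"], ev["type"], ev["event_id"])
        if dedup_key in seen:
            continue  # duplicate delivery, e.g. a retried mobile upload
        seen.add(dedup_key)

        key = (ev["request_id"], ev["item_id"])
        if key not in joined:
            continue  # orphan event with no impression; audit these separately
        if ev["type"] == "click":
            joined[key]["clicked"] = True
        elif ev["type"] == "dwell":
            joined[key]["dwell_seconds"] = ev["seconds"]

    return list(joined.values())
```

The idempotency guard is what lets a batch job safely backfill the same late events a streaming job already saw, which is why the daily precise metrics and the minute-level approximations can disagree slightly but converge.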
💡 Key Takeaways
Production systems serve 50 thousand to 200 thousand ranking QPS with P95 latency budgets of 100 to 200 milliseconds for search and under 50 milliseconds for autocomplete, requiring efficient scoring
Impression and click events are logged with stable request identifiers for joining; dwell time is delayed and requires 24 hour watermark windows to handle late events from network delays and offline sessions
Streaming pipelines compute approximate CTR with minute latency for fast regression detection; batch pipelines compute precise metrics daily with full late event reconciliation and idempotent processing
Sample ratio mismatch from incorrect bucketing biases A/B tests; systems automatically halt experiments when observed traffic allocation deviates significantly from randomization
Logging bugs like missing impression events inflate CTR; bots and spam distort clicks, requiring anomaly detection, user level caps, and robust metrics like long click rate (dwell greater than threshold); a minimal long click rate computation is sketched after this list
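As referenced in the last takeaway, here is a minimal sketch of computing CTR and long click rate with per-user caps over the joined records from the sketch above. The dwell threshold, cap value, and user_id field are illustrative assumptions; real systems tune these against their own traffic.

```python
from collections import defaultdict

LONG_CLICK_DWELL_SECONDS = 30   # assumed long-click threshold
MAX_CLICKS_PER_USER = 50        # hypothetical per-user cap against bots and spam

def robust_click_metrics(joined_records):
    """Compute CTR and long click rate with a per-user click cap.

    Each record is assumed to carry user_id, clicked, and dwell_seconds.
    The cap bounds how much any single account can move the metric, which
    blunts automated traffic without discarding it entirely.
    """
    clicks_by_user = defaultdict(int)
    impressions = clicks = long_clicks = 0

    for rec in joined_records:
        impressions += 1
        if not rec["clicked"]:
            continue
        clicks_by_user[rec["user_id"]] += 1
        if clicks_by_user[rec["user_id"]] > MAX_CLICKS_PER_USER:
            continue  # over the cap: likely automated, stop counting this user
        clicks += 1
        if (rec["dwell_seconds"] or 0) >= LONG_CLICK_DWELL_SECONDS:
            long_clicks += 1

    ctr = clicks / impressions if impressions else 0.0
    long_click_rate = long_clicks / impressions if impressions else 0.0
    return ctr, long_click_rate
```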
📌 Examples
Google Search ranking serves hundreds of thousands of QPS globally with sub 200 millisecond P95 latency, logging impressions per result with position and features for offline model training
Amazon uses 24 hour late event handling for mobile app sessions where users go offline after clicking, backfilling dwell time when the app reconnects
LinkedIn Feed detects sample ratio mismatch by comparing observed treatment versus control traffic to the expected 50/50 split, halting experiments when a chi squared test shows significant deviation (a minimal version of this check is sketched below)
Meta uses periodic heartbeats every few seconds on mobile to approximate dwell time even when users background the app or lose connectivity (a dwell approximation sketch follows the SRM check below)
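To illustrate the chi squared sample ratio mismatch check referenced in the LinkedIn example, here is a minimal sketch using scipy. The alpha threshold and the example counts are illustrative assumptions, not any particular system's configuration; SRM guards typically use a very strict alpha so that only genuine bucketing problems trip the halt.

```python
from scipy.stats import chisquare

def check_sample_ratio_mismatch(treatment_users, control_users,
                                expected_split=(0.5, 0.5), alpha=0.001):
    """Chi squared SRM check against the expected traffic allocation.

    Returns (is_srm, p_value). If is_srm is True, the observed split deviates
    significantly from the expected allocation and the experiment should be
    halted while the bucketing is investigated.
    """
    observed = [treatment_users, control_users]
    total = sum(observed)
    expected = [total * expected_split[0], total * expected_split[1]]
    _, p_value = chisquare(f_obs=observed, f_exp=expected)
    return p_value < alpha, p_value

# 10_000 vs 10_120 users: p ~ 0.4, within normal variation -> keep running
# 10_000 vs 11_000 users: p << 0.001, sample ratio mismatch -> halt the test
```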
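And here is a minimal sketch of approximating dwell time from heartbeats and app lifecycle events, as in the Meta example. The event tuple format, heartbeat interval, and gap-capping rule are assumptions for illustration only.

```python
def approximate_dwell(events, heartbeat_interval_s=5):
    """Approximate dwell from client heartbeats and lifecycle events.

    `events` is an assumed time-ordered list of (timestamp_seconds, kind)
    tuples where kind is "open", "heartbeat", "background", or "foreground".
    Dwell accrues only while the app is in the foreground; gaps much longer
    than the heartbeat interval (missed beats, lost connectivity) are capped
    rather than counted in full.
    """
    max_gap = 3 * heartbeat_interval_s  # tolerate a couple of missed beats
    dwell = 0.0
    last_ts = None
    in_foreground = False

    for ts, kind in events:
        if kind in ("open", "foreground"):
            in_foreground, last_ts = True, ts
        elif kind in ("heartbeat", "background") and in_foreground and last_ts is not None:
            dwell += min(ts - last_ts, max_gap)
            if kind == "background":
                in_foreground, last_ts = False, None
            else:
                last_ts = ts
    return dwell

# e.g. open at t=0, heartbeats at 5, 10, 15, background at 18 -> ~18s dwell
```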