ML-Powered Search & Ranking: Evaluation (NDCG, MRR, CTR, Dwell Time)

Tradeoffs: Offline vs Online, CTR vs Dwell, Single vs Multi Metric

Evaluation tradeoffs shape how teams iterate and what behavior they optimize. Each metric dimension surfaces different aspects of quality and risk, requiring careful balance in production systems.

Offline versus online evaluation is a speed-versus-fidelity tradeoff. Offline metrics computed on historical labels are fast, reproducible, and safe: teams can iterate on model architectures and features with no risk to users, running hundreds of experiments per week. The cost is potential misalignment: labels may be stale, biased by the logging policy, or reflect only one intent type. A model trained on informational-query labels may hurt navigational-query CTR by demoting official homepages. Online evaluation captures real behavior, emergent effects such as changes in exploration, and all intent types, but it is slower, riskier, and requires traffic, guardrails, and proper experimentation infrastructure. Mature teams use offline metrics to filter out obviously bad models, then validate top candidates online.

CTR versus dwell time is attractiveness versus satisfaction. CTR has strong position bias: top-ranked items get more clicks regardless of quality. It is also easily gamed by clickbait titles, misleading thumbnails, or sensational presentation, so a model optimizing only CTR may increase clicks while decreasing long-term retention if users feel misled. Dwell time is closer to true satisfaction for consumption tasks, but it is confounded by page latency, task type, and device. For navigational queries, short dwell can be a success if users instantly find the link they need; for reading or watching tasks, long dwell indicates value. Systems must condition dwell buckets on query intent and task to avoid optimizing the wrong behavior.

Single-metric versus multi-metric optimization addresses the risk of local maxima and unintended consequences. Optimizing only CTR can hurt long-term retention by favoring clickbait; optimizing only dwell can reduce throughput or favor long content that does not convert. Mature systems set a primary metric and constraints on others, for example: ship only if NDCG@10 improves by at least 0.5 percent on head queries AND online session depth does not drop by more than 0.2 percent. E-commerce often uses conversion rate and revenue as primary metrics, with CTR as a diagnostic. This multi-metric approach prevents shipping models that game one metric at the expense of overall user value or business outcomes.

The complete picture requires calibration between offline and online results. Teams maintain a repository of past launches with offline metric deltas and online outcomes, computing segment-level correlations. For example, a 1 percent NDCG@10 improvement on informational queries might correlate with a 0.3 percent CTR lift and a 0.5 percent long-dwell lift. These correlations become launch gates: minimum offline gains per segment before running expensive A/B tests. Gates are updated quarterly as user behavior and product surfaces evolve, keeping the alignment current.
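To make the gate concrete, here is a minimal Python sketch of a multi-metric launch check, assuming the illustrative thresholds from the rule above (a 0.5 percent NDCG@10 lift on head queries as the primary bar, a 0.2 percent session-depth drop as the guardrail). The names and constants are hypothetical, not any particular team's configuration.

```python
from dataclasses import dataclass

# Hypothetical thresholds mirroring the example rule above; real values would
# be tuned per product surface and segment.
NDCG_AT_10_MIN_LIFT = 0.005      # offline: NDCG@10 must improve >= 0.5% on head queries
SESSION_DEPTH_MAX_DROP = -0.002  # online guardrail: session depth may not drop > 0.2%


@dataclass
class ExperimentDeltas:
    """Relative deltas of candidate vs. control, e.g. 0.006 == +0.6%."""
    ndcg_at_10_head: float   # offline NDCG@10 delta on head queries
    session_depth: float     # online session-depth delta from the A/B test


def passes_launch_gate(deltas: ExperimentDeltas) -> bool:
    """Ship only if the primary metric clears its bar AND the guardrail holds."""
    primary_ok = deltas.ndcg_at_10_head >= NDCG_AT_10_MIN_LIFT
    guardrail_ok = deltas.session_depth >= SESSION_DEPTH_MAX_DROP
    return primary_ok and guardrail_ok


# Example: +0.7% NDCG@10 on head queries, -0.1% session depth -> passes the gate
print(passes_launch_gate(ExperimentDeltas(ndcg_at_10_head=0.007, session_depth=-0.001)))
```

In practice each delta would come from the experimentation platform with a confidence interval attached, so a real gate would compare interval bounds rather than point estimates.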
💡 Key Takeaways
Offline evaluation is fast and reproducible but risks misalignment from stale or biased labels; online evaluation captures real behavior but is slower and requires traffic and guardrails
CTR measures attractiveness with strong position bias and is easily gamed by clickbait; dwell time is closer to satisfaction but confounded by latency and task type, requiring conditioning by query intent
Single-metric optimization creates local maxima: CTR-only optimization can hurt retention, dwell-only optimization can reduce throughput; mature systems use a primary metric with guardrail constraints
Multi-metric launch gates like "ship only if NDCG@10 improves at least 0.5 percent AND session depth does not drop more than 0.2 percent" prevent gaming one dimension
Teams track correlation between offline and online deltas per segment, updating launch gates quarterly to reflect evolving alignment as user behavior and surfaces change
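As a rough sketch of the offline-to-online calibration described above, the snippet below computes a per-segment Pearson correlation between offline NDCG@10 deltas and online CTR deltas across past launches. The launch history, segment names, and numbers are made up for illustration, and statistics.correlation requires Python 3.10+.

```python
import statistics

# Hypothetical launch history: for each query segment, (offline NDCG@10 delta,
# online CTR delta) pairs from past A/B-tested launches. Values are illustrative.
past_launches = {
    "informational": [(0.010, 0.003), (0.015, 0.005), (0.008, 0.002), (0.020, 0.007)],
    "navigational":  [(0.012, 0.001), (0.018, 0.002), (0.009, 0.000), (0.014, 0.001)],
}

def offline_online_correlation(history):
    """Pearson correlation between offline and online deltas, per segment."""
    return {
        segment: statistics.correlation([off for off, _ in pairs],
                                        [on for _, on in pairs])
        for segment, pairs in history.items()
    }

# Segments where offline gains translate poorly online would get stricter gates.
print(offline_online_correlation(past_launches))
```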
📌 Examples
Amazon found that a model improving NDCG on labeled data hurt navigational CTR by 2 percent because it moved official brand pages down, requiring segment-specific gates
YouTube optimizes watch time and session duration rather than raw CTR to avoid favoring clickbait thumbnails that increase clicks but reduce completion rate
Google maintains minimum NDCG@10 thresholds per query type (informational, navigational, transactional) before allowing online A/B tests, avoiding wasted experimentation on weak models
LinkedIn uses multi metric gates: new Feed ranking models must improve long dwell rate without reducing session depth by more than a threshold, balancing engagement depth and breadth