Online User Behavior Metrics: CTR and Dwell Time
Online metrics capture what real users actually do, revealing whether your ranking satisfies their needs beyond theoretical relevance. While offline metrics tell you if the ranking looks right given labels, online metrics tell you if users engage and are satisfied.
Click-through rate (CTR) is clicks divided by impressions over a period. For a search baseline CTR of 5 percent, detecting a 5 percent relative lift (0.25 percentage points absolute) requires roughly 120 thousand impressions per experiment arm for 80 percent statistical power, assuming the conventional two-sided 5 percent significance level. High-traffic surfaces like Google Search or the Amazon homepage reach this in minutes; low-traffic surfaces may need days. The critical tradeoff is that CTR measures attractiveness and presentation, not just relevance. It suffers from strong position bias (top results get more clicks regardless of quality) and is easily gamed by clickbait titles or misleading thumbnails. A model that moves genuinely relevant but visually unattractive items higher may see CTR drop despite improving user satisfaction.
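The impression estimate above can be roughly reproduced with the standard two-proportion sample size approximation. This is a minimal sketch, not a definitive power analysis: the function name `impressions_per_arm` is hypothetical, the two-sided 5 percent significance level is an assumption (the text states only the 80 percent power), and impressions are treated as independent, which real traffic often violates because one user generates many impressions.

```python
from math import ceil
from statistics import NormalDist


def impressions_per_arm(baseline_ctr, relative_lift, alpha=0.05, power=0.80):
    """Approximate impressions per arm for a two-sided two-proportion z-test.

    alpha=0.05 is an assumed significance level, not taken from the text.
    """
    p1 = baseline_ctr
    p2 = baseline_ctr * (1.0 + relative_lift)
    # Critical values for the significance level and the desired power.
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two arms.
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 5 percent baseline CTR, 5 percent relative lift (0.25 points absolute).
n = impressions_per_arm(baseline_ctr=0.05, relative_lift=0.05)
print(f"impressions per arm: {n}")
```

Under these assumptions the result lands close to the 120 thousand figure quoted above. Halving the detectable lift roughly quadruples the requirement, because the absolute difference is squared in the denominator.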
Dwell time is the duration a user spends after clicking before returning to the surface or abandoning. It acts as a quality proxy: short dwell (less than 3 seconds) often signals dissatisfaction or accidental clicks, while long dwell (greater than 10 seconds) suggests satisfaction or content consumption. LinkedIn has publicly discussed using dwell time buckets in Feed ranking to distinguish accidental impressions from genuine engagement. YouTube optimizes watch time and session duration rather than raw clicks, since higher CTR without completion hurts satisfaction. Netflix emphasizes member viewing minutes and retention over click rates.
The challenge with dwell time is ambiguity: longer dwell can mean slow page loads or user confusion, while short dwell can mean instant satisfaction for quick answer queries. Mature systems condition dwell metrics by task type, device, and query intent. For feed items, buckets like short (less than 3 seconds) and long (greater than 10 seconds) work well. For videos, long dwell might be defined as watching a meaningful fraction of duration. Amazon Search reports a mix of CTR, add to cart rate, and conversion rate, recognizing that clicks without purchase signal misalignment.
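The bucketing described above might be sketched as follows. The 3 second and 10 second feed thresholds come from the text; the `dwell_bucket` function, the surface names, the "medium" and "not_long" labels, and the 50 percent video fraction are illustrative assumptions, since the text only says "a meaningful fraction of duration".

```python
FEED_SHORT_S = 3.0          # from the text: short dwell is under 3 seconds
FEED_LONG_S = 10.0          # from the text: long dwell is over 10 seconds
VIDEO_LONG_FRACTION = 0.5   # assumption: stand-in for "meaningful fraction"


def dwell_bucket(dwell_s, surface, video_duration_s=None):
    """Map a raw dwell time to a coarse bucket, conditioned on surface type."""
    if surface == "video":
        # Without a valid duration the watched fraction cannot be computed.
        if video_duration_s is None or video_duration_s <= 0:
            return "unknown"
        watched = dwell_s / video_duration_s
        return "long" if watched >= VIDEO_LONG_FRACTION else "not_long"
    # Feed items (and any other surface) use the absolute thresholds.
    if dwell_s < FEED_SHORT_S:
        return "short"
    if dwell_s > FEED_LONG_S:
        return "long"
    return "medium"


print(dwell_bucket(2.0, "feed"))
print(dwell_bucket(40.0, "video", video_duration_s=60.0))
```

Bucketing before aggregation keeps a handful of very long sessions (a tab left open, for instance) from dominating an average. The same pattern extends to conditioning on device or query intent by adding further threshold tables.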
💡 Key Takeaways
•CTR measures attractiveness but suffers from position bias and presentation bias; clickbait can inflate CTR while harming satisfaction, requiring guardrail metrics
•Detecting a 5 percent relative CTR lift from a 5 percent baseline needs approximately 120 thousand impressions per arm at 80 percent power; high QPS surfaces reach this in minutes
•Dwell time is closer to satisfaction for consumption tasks but is confounded by page latency, task type, and device; short dwell can mean instant satisfaction for quick answers
•Dwell buckets like short (less than 3 seconds) and long (greater than 10 seconds) for feeds or meaningful fraction of duration for videos stabilize noisy data and reduce outlier sensitivity
•Mature systems use multi-metric optimization with primary and guardrail metrics: for example, ship only if CTR improves and session depth does not drop by more than 0.2 percent, which guards against optimizing one metric at the expense of the overall experience
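The primary-plus-guardrail rule in the last takeaway could be expressed as a simple check. This is a sketch under stated assumptions: the `should_ship` function is hypothetical, the 0.2 percent threshold is read as a relative drop in session depth, and the comparison uses point estimates only. A production decision would also require the CTR lift to be statistically significant and would typically compare confidence intervals rather than raw means.

```python
def should_ship(ctr_control, ctr_treatment,
                depth_control, depth_treatment,
                max_depth_drop=0.002):
    """Return True if the primary metric improves and the guardrail holds.

    max_depth_drop=0.002 is the 0.2 percent threshold, interpreted here
    as a relative drop in session depth (an assumption).
    """
    ctr_improved = ctr_treatment > ctr_control
    # Relative change in the guardrail metric; negative means a drop.
    depth_change = (depth_treatment - depth_control) / depth_control
    guardrail_ok = depth_change >= -max_depth_drop
    return ctr_improved and guardrail_ok


# CTR up, session depth down 0.1 percent: within the guardrail.
print(should_ship(0.0500, 0.0525, 4.000, 3.996))
# CTR up, session depth down 0.5 percent: guardrail violated.
print(should_ship(0.0500, 0.0525, 4.000, 3.980))
```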
📌 Examples
LinkedIn Feed ranking defines accidental impression thresholds and distinguishes short versus long dwell to correlate with member satisfaction surveys
YouTube optimizes watch time and session duration over raw CTR, since clicks without completion indicate thumbnail clickbait that hurts long term retention
Amazon Search tracks CTR, add to cart rate, and conversion rate together; high CTR with low conversion signals misalignment between search results and purchase intent
Netflix emphasizes viewing minutes and 28 day retention over click rates, as higher CTR without content completion reduces member satisfaction