
Offline vs Online: The Gap Between Training and Reality

Key Insight
Offline metrics (NDCG, MRR) and online metrics (CTR, dwell) often disagree. A model that wins offline may lose online. Understanding this gap is essential for reliable evaluation.

Why Offline and Online Diverge

Offline evaluation uses historical labels, but those labels were collected under a previous ranking policy. If the old system never showed certain items, you have no labels for them. Your new model might surface great items that have no historical labels, scoring poorly offline but delighting users online.
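Below is a minimal sketch (assuming Python and graded relevance labels keyed by item id; the item ids and scores are made up) of how offline NDCG is commonly computed. Items missing from the historical label set default to relevance 0, which is exactly why a new model that surfaces previously unshown items can score poorly offline:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for a ranked list of graded relevances."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_with_missing_labels(ranked_item_ids, labels, k=10):
    """NDCG@k where any item absent from the historical label set gets relevance 0.

    `labels` maps item_id -> graded relevance judged under the OLD ranking policy.
    Items the old system never showed have no entry, so a new model that
    surfaces them is scored as if they were irrelevant.
    """
    gains = [labels.get(item_id, 0.0) for item_id in ranked_item_ids[:k]]
    ideal = sorted(labels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

# Hypothetical example: the new model ranks an unlabeled item ("d9") first
# and is penalized even if users would have loved it.
labels = {"d1": 3.0, "d2": 2.0, "d3": 0.0}  # judged under the old policy
print(ndcg_with_missing_labels(["d9", "d1", "d2"], labels))
```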

Labels also go stale. A product labeled "highly relevant" six months ago might now be out of stock, have bad reviews, or be superseded by better alternatives. Offline metrics say you are improving; online metrics show users bouncing.

Typical Gap Size

Expect 10-30% disagreement between offline and online. If your offline NDCG improves 5%, online CTR might improve 2%, stay flat, or even drop. The correlation is positive but noisy. A model must win offline to be worth testing online, but offline wins do not guarantee online wins.

Track the offline/online correlation over time. If it drops below 0.5, offline improvements have become only weakly predictive of online improvements: your labels need refreshing or your offline setup has systematic bias.
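One simple way to monitor this, sketched below with hypothetical experiment data, is to correlate the offline lift with the online lift across past A/B tests (Spearman rank correlation via scipy):

```python
from scipy.stats import spearmanr

# Hypothetical log of past A/B tests: each entry pairs a candidate model's
# offline NDCG lift (%) with the online CTR lift (%) it actually produced.
experiments = [
    {"offline_ndcg_lift": 5.0, "online_ctr_lift": 2.1},
    {"offline_ndcg_lift": 3.2, "online_ctr_lift": 0.4},
    {"offline_ndcg_lift": 1.1, "online_ctr_lift": -0.3},
    {"offline_ndcg_lift": 7.5, "online_ctr_lift": 3.0},
    {"offline_ndcg_lift": 2.0, "online_ctr_lift": 1.5},
]

offline = [e["offline_ndcg_lift"] for e in experiments]
online = [e["online_ctr_lift"] for e in experiments]

corr, _ = spearmanr(offline, online)
if corr < 0.5:
    print(f"Correlation {corr:.2f}: refresh labels / audit the offline setup")
else:
    print(f"Correlation {corr:.2f}: offline metrics are still predictive")
```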

Bridging the Gap

Fresh labels: Re-label samples regularly using current user behavior, not historical judgments.
Counterfactual evaluation: Use logged data to estimate what would have happened under a different policy, reducing bias from the logging policy (a minimal IPS sketch follows this list).
Holdout sets: Reserve some traffic for random exploration to collect unbiased labels for items the current system never shows.
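Here is a minimal sketch of counterfactual evaluation via inverse propensity scoring (IPS), assuming impression logs record the old policy's propensity for each shown item; `new_policy_prob` is a hypothetical callable standing in for the candidate policy:

```python
def ips_ctr_estimate(logged_impressions, new_policy_prob):
    """IPS estimate of the CTR a new ranking policy would achieve,
    using only impressions logged under the old policy.

    Each logged impression is assumed to record:
      - context: the query/user context
      - item: what the old policy showed
      - click: 1 if the user clicked, else 0
      - logging_prob: probability the old policy showed this item (propensity)

    `new_policy_prob(context, item)` returns the probability the candidate
    policy would show the same item in that context.
    """
    total, n = 0.0, 0
    for imp in logged_impressions:
        # Re-weight each logged click by how much more (or less) likely
        # the new policy is to show that item than the old one was.
        weight = new_policy_prob(imp["context"], imp["item"]) / imp["logging_prob"]
        total += weight * imp["click"]
        n += 1
    return total / n if n else 0.0
```

In practice the propensity weights are usually clipped or self-normalized to control variance when the candidate policy diverges far from the logging policy.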

⚠️ Rule: Never ship based on offline metrics alone. Offline selects candidates for online testing. Online decides what ships to users.
💡 Key Takeaways
Offline and online metrics often disagree by 10-30%. An offline winner may lose online.
Labels go stale: products become unavailable, reviews change, better alternatives emerge.
Offline labels reflect the old policy. New models surfacing previously unseen items score poorly offline but may delight users.
Track offline/online correlation. Below 0.5 means labels need refreshing.
Never ship on offline alone. Offline selects candidates; online decides what ships.
📌 Interview Tips
1. Explain why divergence happens: stale labels, logging policy bias, unseen items without labels.
2. Quantify the gap: expect 10-30% disagreement. 5% offline lift might yield 0-2% online lift.
3. Describe bridging strategies: fresh labels, counterfactual evaluation, exploration holdouts.