Offline vs Online: The Gap Between Training and Reality
Why Offline and Online Diverge
Offline evaluation uses historical labels, but those labels were collected under a previous ranking policy. If the old system never showed certain items, you have no labels for them. Your new model might surface great items that have no historical labels, scoring poorly offline but delighting users online.
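A quick diagnostic is to measure how much of a new model's output falls outside the labeled set; if a large fraction of its top results has no historical label, offline metrics are silently penalizing it. A minimal sketch, assuming the top-k results and historical labels live in two pandas frames (the names new_topk and labels are illustrative, not from any particular system):

```python
import pandas as pd

# Hypothetical inputs: the new model's top-k recommendations, and the
# historical label table collected under the old ranking policy.
new_topk = pd.DataFrame({
    "query_id": [1, 1, 2, 2],
    "item_id":  ["a", "x", "b", "y"],   # "x" and "y" were never shown before
})
labels = pd.DataFrame({
    "query_id": [1, 2],
    "item_id":  ["a", "b"],
    "relevance": [3, 2],
})

# Fraction of the new model's results that have no historical label at all.
merged = new_topk.merge(labels, on=["query_id", "item_id"], how="left")
missing_rate = merged["relevance"].isna().mean()
print(f"{missing_rate:.0%} of new-model results are unlabeled offline")
```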
Labels also go stale. A product labeled "highly relevant" six months ago might now be out of stock, have bad reviews, or be superseded by better alternatives. Offline metrics say you are improving; online metrics show users bouncing.
Typical Gap Size
Expect 10-30% disagreement between offline and online. If your offline NDCG improves 5%, online CTR might improve 2%, stay flat, or even drop. The correlation is positive but noisy. A model must win offline to be worth testing online, but offline wins do not guarantee online wins.
Track the correlation between offline and online lifts across your past experiments. If it drops below roughly 0.5, offline improvements are only weakly predictive of online improvements, which usually means your labels need refreshing or your offline setup has a systematic bias.
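One simple way to track this is to keep a log of past experiments with both their offline and online lifts and compute a rank correlation over it. A minimal sketch using scipy's spearmanr; the lift numbers below are made up for illustration:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical log of past experiments: offline NDCG lift vs. online CTR lift,
# both expressed as relative percentage changes against the control.
offline_lift = np.array([5.0, 2.1, 8.3, 1.2, 4.4, 0.8, 6.0])
online_lift  = np.array([2.0, 0.5, 3.1, -0.4, 1.8, 0.1, 1.2])

rho, pvalue = spearmanr(offline_lift, online_lift)
print(f"offline/online rank correlation: {rho:.2f} (p={pvalue:.2f})")

# A rho drifting below ~0.5 suggests stale labels or a systematic offline bias.
```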
Bridging the Gap
Fresh labels: Re-label samples regularly using current user behavior, not historical judgments.
Counterfactual evaluation: Use logged data to estimate what would have happened under a different policy, reducing bias from the logging policy (see the sketch below).
Holdout sets: Reserve some traffic for random exploration to collect unbiased labels for items the current system never shows.
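For the counterfactual piece, a common estimator is inverse propensity scoring (IPS): reweight each logged reward by the ratio of the new policy's display probability to the logging policy's, usually clipping the weights to control variance. A minimal sketch under those assumptions; the function name and the logged data are illustrative, not tied to any specific library:

```python
import numpy as np

def ips_estimate(rewards, logging_probs, target_probs, clip=10.0):
    """Inverse propensity scoring estimate of the reward a new policy
    would have collected on data logged under an old policy.

    rewards       -- observed reward (e.g. click = 1/0) per logged impression
    logging_probs -- probability the logging policy showed that item
    target_probs  -- probability the new policy would show the same item
    clip          -- cap on importance weights to control variance
    """
    weights = np.clip(target_probs / logging_probs, 0.0, clip)
    return np.mean(weights * rewards)

# Hypothetical logged data: five impressions with clicks and propensities.
rewards       = np.array([1, 0, 0, 1, 0])
logging_probs = np.array([0.50, 0.30, 0.20, 0.10, 0.40])
target_probs  = np.array([0.60, 0.10, 0.30, 0.40, 0.20])

est = ips_estimate(rewards, logging_probs, target_probs)
print(f"estimated CTR under the new policy: {est:.3f}")
```

Clipping trades a little bias for a large variance reduction; without it, a single impression with a tiny logging probability can dominate the estimate.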