Evaluation Pitfalls: Logging Errors, Distribution Shift, and Guardrails
Logging Bugs That Corrupt Metrics
Metrics depend on accurate logging. Common bugs include duplicate events (inflating impressions or clicks), missing events (e.g., a mobile app that fails to log), timestamp misalignment (a click logged before its impression), and sampling errors (a 1% sample that is not representative of full traffic). A logging bug that double-counts impressions cuts your measured CTR in half: you conclude ranking got worse when in fact logging broke.
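As a minimal sketch (the event fields and IDs are made up for illustration, not a real schema), deduplicating impressions by event ID before computing CTR shows how a double-logging bug halves the measured rate:

```python
def ctr(impressions, clicks):
    """Clicks divided by impressions; 0.0 if there are no impressions."""
    return len(clicks) / len(impressions) if impressions else 0.0

# Hypothetical raw event log where a client retry double-logs each impression.
raw_impressions = [
    {"event_id": "imp-1", "ts": 100}, {"event_id": "imp-1", "ts": 100},  # duplicate
    {"event_id": "imp-2", "ts": 105}, {"event_id": "imp-2", "ts": 105},  # duplicate
]
clicks = [{"event_id": "clk-1", "impression_id": "imp-1", "ts": 108}]

# Deduplicate by event_id, keeping one event per ID.
deduped = list({e["event_id"]: e for e in raw_impressions}.values())

print(f"CTR with duplicates: {ctr(raw_impressions, clicks):.2f}")  # 0.25 -- looks like ranking got worse
print(f"CTR after dedup:     {ctr(deduped, clicks):.2f}")          # 0.50 -- the true rate
```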
Always validate logging: compare client-side and server-side event counts, check for duplicates, and verify that timestamps are correctly ordered. Run sanity checks: does the total click count roughly track revenue? Do impression counts match load-balancer traffic?
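A sketch of how such checks might be wired up, assuming a pandas DataFrame of events with hypothetical columns (event_id, event_type, ts, impression_id) and an illustrative tolerance:

```python
import pandas as pd

def validate_logging(events: pd.DataFrame, server_count: int, tolerance: float = 0.02) -> dict:
    """Run basic logging sanity checks and return a dict of findings."""
    findings = {}

    # 1. Client-side vs. server-side counts should agree within a small tolerance.
    client_count = len(events)
    findings["count_mismatch"] = abs(client_count - server_count) / max(server_count, 1) > tolerance

    # 2. Duplicate event IDs inflate impressions or clicks.
    findings["duplicate_events"] = int(events["event_id"].duplicated().sum())

    # 3. Clicks logged before their impression indicate timestamp misalignment.
    imps = events[events["event_type"] == "impression"].set_index("event_id")["ts"]
    clicks = events[events["event_type"] == "click"]
    joined = clicks.join(imps.rename("imp_ts"), on="impression_id")
    findings["clicks_before_impression"] = int((joined["ts"] < joined["imp_ts"]).sum())

    return findings
```

The thresholds and column names are placeholders; the point is that each failure mode named above gets its own explicit check rather than being caught by eyeballing dashboards.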
Distribution Shift
Models are trained on past data but serve future traffic. If the distribution shifts, offline metrics computed on historical data become unreliable. Seasonal shifts (holiday shopping vs. normal weeks), population shifts (new user demographics), and query shifts (trending topics) all change what a good ranking looks like.
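One common way to detect such shifts, offered here as a sketch rather than a prescription, is to compare the serving distribution of a feature or model score against its training-time distribution, for example with the population stability index (PSI); values above roughly 0.2 are often treated as a meaningful shift, though that threshold is a convention, not a rule:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population stability index between a training-time sample and a serving sample.

    Bins are derived from the expected (training) distribution; a small epsilon
    avoids division by zero in empty bins.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the full range
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual)
    eps = 1e-6
    return float(np.sum((a_frac - e_frac) * np.log((a_frac + eps) / (e_frac + eps))))

# Illustrative: serving scores drifted relative to training scores.
rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 10_000)
serve_scores = rng.normal(0.5, 1.2, 10_000)
print(f"PSI = {psi(train_scores, serve_scores):.3f}")  # well above 0.2 -> investigate
```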
A model trained on desktop users may fail on mobile. A model trained on US users may fail in new markets. Monitor segment-level metrics, not just aggregates: a flat overall CTR can hide a 20% drop on mobile offset by a 10% gain on desktop, as in the sketch below.
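The segment names and counts below are made up to mirror that example; weighting segment CTRs by traffic share shows how the aggregate can stay flat while mobile quietly degrades:

```python
# Per-segment impressions and clicks for control vs. treatment (illustrative numbers).
segments = {
    #            (impressions, clicks_control, clicks_treatment)
    "mobile":  (100_000,  5_000,  4_000),   # CTR 5.0% -> 4.0% (a 20% drop)
    "desktop": (200_000, 10_000, 11_000),   # CTR 5.0% -> 5.5% (a 10% gain)
}

def overall_ctr(clicks_index: int) -> float:
    total_imps = sum(imps for imps, _, _ in segments.values())
    total_clicks = sum(row[clicks_index] for row in segments.values())
    return total_clicks / total_imps

print(f"overall control CTR:   {overall_ctr(1):.4f}")  # 0.0500
print(f"overall treatment CTR: {overall_ctr(2):.4f}")  # 0.0500 -- looks flat

for name, (imps, c_ctl, c_trt) in segments.items():
    print(f"{name}: {c_ctl / imps:.4f} -> {c_trt / imps:.4f}")
# mobile:  0.0500 -> 0.0400  (hidden regression)
# desktop: 0.0500 -> 0.0550
```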
Guardrail Metrics
Primary metrics (NDCG, CTR) tell you whether ranking improved. Guardrail metrics catch unintended harm: revenue per session, diversity of results shown, catalog coverage, latency percentiles. A model that improves CTR by 3% but cuts revenue by 10% should not ship. Define guardrails before the experiment starts, and never trade them away for primary-metric gains.
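As a hedged sketch (the metric names, thresholds, and result dict are assumptions for illustration), a ship/no-ship decision can be encoded so that guardrail violations block a launch regardless of primary-metric gains:

```python
# Guardrails defined before the experiment: metric -> worst acceptable relative change.
GUARDRAILS = {
    "revenue_per_session": -0.01,   # no more than a 1% drop
    "result_diversity":    -0.02,
    "p99_latency_ms":      +0.05,   # no more than a 5% increase
}

def ship_decision(primary_lift: float, guardrail_changes: dict) -> bool:
    """Ship only if the primary metric improved AND no guardrail is violated."""
    for metric, limit in GUARDRAILS.items():
        change = guardrail_changes.get(metric, 0.0)
        violated = change > limit if limit > 0 else change < limit
        if violated:
            print(f"blocked: {metric} changed {change:+.1%}, limit {limit:+.1%}")
            return False
    return primary_lift > 0

# The scenario from the text: +3% CTR but -10% revenue per session -> do not ship.
print(ship_decision(0.03, {"revenue_per_session": -0.10, "p99_latency_ms": 0.01}))  # False
```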