
Evaluation Pitfalls: Logging Errors, Distribution Shift, and Guardrails

Key Insight
Evaluation fails silently when logging is broken, distributions shift, or guardrails are missing. These pitfalls corrupt metrics without obvious errors.

Logging Bugs That Corrupt Metrics

Metrics depend on accurate logging. Common bugs: duplicate events (inflating impressions or clicks), missing events (a mobile app fails to log), timestamp misalignment (a click logged before its impression), and sampling errors (a 1% sample is not representative). A logging bug that doubles impressions cuts your measured CTR in half: it looks like the ranking got worse, when in fact logging broke.
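A tiny arithmetic sketch of the duplicate-impression case, with hypothetical counts:

```python
# Hypothetical numbers illustrating the duplicate-impression bug.
clicks = 50
true_impressions = 1_000
true_ctr = clicks / true_impressions          # 0.05

logged_impressions = 2 * true_impressions     # each impression logged twice
measured_ctr = clicks / logged_impressions    # 0.025 -- CTR appears to have halved

print(f"true CTR: {true_ctr:.3f}, measured CTR: {measured_ctr:.3f}")
```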

Always validate logging: compare client-side and server-side counts, check for duplicates, and verify timestamps are ordered correctly. Run sanity checks: does the total click count roughly track revenue? Do impression counts match load-balancer traffic?
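A minimal validation sketch, assuming a pandas event log with `event_id`, `event_type`, `request_id`, and `timestamp` columns (all names are illustrative, not a standard schema):

```python
import pandas as pd

def validate_event_log(events: pd.DataFrame, server_impressions: int) -> dict:
    """Basic sanity checks on a click/impression event log.

    Assumes columns: event_id, event_type ('impression' | 'click'),
    request_id, timestamp (datetime). All names are illustrative.
    """
    checks = {}

    # Duplicate events inflate impression/click counts.
    checks["duplicate_events"] = int(events["event_id"].duplicated().sum())

    # Client-side impression count should roughly match server-side traffic.
    client_impressions = int((events["event_type"] == "impression").sum())
    checks["client_server_ratio"] = client_impressions / server_impressions

    # A click logged before its impression signals timestamp misalignment.
    first_seen = events.pivot_table(
        index="request_id", columns="event_type",
        values="timestamp", aggfunc="min",
    )
    if {"impression", "click"} <= set(first_seen.columns):
        checks["clicks_before_impression"] = int(
            (first_seen["click"] < first_seen["impression"]).sum()
        )
    return checks
```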

Distribution Shift

Models are trained on past data but serve future traffic. If the distribution shifts, metrics become unreliable. Seasonal shifts (holiday shopping vs. normal traffic), population shifts (new user demographics), and query shifts (trending topics) all change what a good ranking looks like.

A model trained on desktop users may fail on mobile. A model trained on US users may fail in new markets. Monitor segment-level metrics, not just aggregates: a flat overall CTR might hide a 20% drop on mobile offset by a 10% gain on desktop.
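A sketch of segment-level monitoring with made-up counts, showing how a flat aggregate can mask a mobile regression (impressions held constant for simplicity):

```python
# Hypothetical per-segment counts before and after a model change.
segments = {
    # segment: (impressions, clicks_before, clicks_after)
    "desktop": (40_000, 2_000, 2_200),   # +10% clicks
    "mobile":  (10_000, 1_000, 800),     # -20% clicks
}

total_before = sum(before for _, before, _ in segments.values())
total_after = sum(after for _, _, after in segments.values())
print(f"aggregate clicks: {total_before} -> {total_after}")  # 3000 -> 3000: looks flat

for name, (imps, before, after) in segments.items():
    ctr_before, ctr_after = before / imps, after / imps
    delta = (ctr_after - ctr_before) / ctr_before
    print(f"{name}: CTR {ctr_before:.3f} -> {ctr_after:.3f} ({delta:+.0%})")
```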

Guardrail Metrics

Primary metrics (NDCG, CTR) tell you whether ranking improved. Guardrail metrics catch unintended harm: revenue per session, diversity of results shown, catalog coverage, and latency percentiles. A model that improves CTR by 3% but cuts revenue by 10% should not ship. Define guardrails before running experiments, and never violate them for primary-metric gains.
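One way to encode this as a pre-registered ship decision. The thresholds and metric names below are assumptions for illustration, not a standard:

```python
# Hypothetical guardrail floors: maximum tolerated relative drop per metric,
# with all deltas signed so that negative means harm.
GUARDRAILS = {
    "revenue_per_session": -0.01,  # tolerate at most a 1% revenue drop
    "result_diversity":    -0.02,
    "latency_headroom":    -0.05,  # e.g. negated p99-latency change
}

def ship_decision(primary_lift: float, guardrail_deltas: dict) -> bool:
    """Ship only if the primary metric improved AND no guardrail is violated."""
    violations = [
        name for name, floor in GUARDRAILS.items()
        if guardrail_deltas.get(name, 0.0) < floor
    ]
    if violations:
        print(f"blocked: guardrail violations {violations}")
        return False
    return primary_lift > 0

# CTR +3% but revenue -10%: blocked regardless of the CTR gain.
print(ship_decision(0.03, {"revenue_per_session": -0.10}))  # False
```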

⚠️ Rule: Before trusting any metric movement, ask: is logging correct? Has distribution shifted? Are guardrails passing?
💡 Key Takeaways
Logging bugs corrupt metrics silently: duplicates inflate impressions, missing events hide clicks, timestamps misalign.
Validate logging: compare client/server counts, check for duplicates, sanity-check against revenue/traffic.
Distribution shift makes past data unreliable: seasonal, demographic, and query patterns change.
Monitor segment-level metrics. Flat aggregates can hide a 20% drop in one segment offset by gains elsewhere.
Guardrails (revenue, diversity, latency) catch unintended harm. Never violate for primary metric gains.
📌 Interview Tips
1. Give a logging-bug example: duplicated impressions cut measured CTR in half. It looks like the ranking regressed; actually, logging broke.
2. Explain distribution shift with segments: desktop gains masking mobile losses in a flat aggregate.
3. Describe guardrails: CTR +3% but revenue -10% should not ship. Define guardrails before running experiments.