Failure Modes: Propensity Errors, Format Changes, and Delayed Loops

PROPENSITY ESTIMATION ERRORS
If propensity estimates are wrong, IPS makes things worse rather than better, because every example is weighted by the inverse of a wrong number. Common causes: using the production model's propensity when a different model actually served the traffic (training-serving skew), not accounting for the position randomization policy, and ignoring user-level personalization in the propensity calculation. Validate propensities by comparing the estimated distribution of impressions against the empirical one.
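The validation step can be sketched as follows. This is a minimal illustration, not a production check: it assumes each logged event stores the full estimated probability for every candidate item plus the item actually shown, and the function name `propensity_gap` and the event layout are hypothetical.

```python
from collections import defaultdict


def propensity_gap(events):
    """Compare empirical impression share with the share the propensities predict.

    events: list of (estimated_probs, shown_item), where estimated_probs maps
    each candidate item to its estimated probability of being shown
    (hypothetical log layout, assumed for this sketch).
    Returns {item: observed_share - expected_share}.
    """
    n = len(events)
    if n == 0:
        return {}
    expected = defaultdict(float)  # sum of estimated probabilities per item
    observed = defaultdict(float)  # count of actual impressions per item
    for probs, shown in events:
        for item, p in probs.items():
            expected[item] += p
        observed[shown] += 1
    items = set(expected) | set(observed)
    return {item: observed[item] / n - expected[item] / n for item in items}


# Propensities claim a 50/50 split, but item "a" was shown 3 times out of 4.
events = [({"a": 0.5, "b": 0.5}, "a")] * 3 + [({"a": 0.5, "b": 0.5}, "b")]
gaps = propensity_gap(events)
suspicious = sorted(item for item, gap in gaps.items() if abs(gap) > 0.1)
```

A gap near zero for every item is consistent with well-calibrated propensities. Large gaps suggest one of the causes above; the 0.1 threshold is an arbitrary illustration and should be chosen relative to sample size.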
DISPLAY FORMAT CHANGES
Position bias curves change when display format changes. Moving from a list to a grid changes which positions get attention. Adding a carousel above the main list shifts all position curves down. If you apply an old position model to a new format, debiasing is wrong. Remeasure position curves after any UI change and retrain position models.
⚠️ Warning: Mobile and desktop have different position bias curves. A model trained on desktop data will misbehave on mobile traffic. Segment by device type.
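One way to remeasure position curves per segment is sketched below. It assumes the impressions come from randomized traffic (items placed at positions at random), so that click-through rate by position reflects position alone; the function name and the tuple layout are illustrative assumptions.

```python
from collections import defaultdict


def position_curves(impressions):
    """Estimate relative position bias per (device, display_format) segment.

    impressions: iterable of (device, display_format, position, clicked),
    assumed to come from position-randomized traffic.
    Returns {(device, display_format): {position: CTR relative to position 1}}.
    """
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for device, fmt, pos, clicked in impressions:
        shown[(device, fmt, pos)] += 1
        clicks[(device, fmt, pos)] += int(clicked)

    ctr = defaultdict(dict)
    for (device, fmt, pos), n in shown.items():
        ctr[(device, fmt)][pos] = clicks[(device, fmt, pos)] / n

    curves = {}
    for segment, by_pos in ctr.items():
        top = by_pos.get(1)
        if not top:
            continue  # cannot normalize without clicks at position 1
        curves[segment] = {pos: rate / top for pos, rate in sorted(by_pos.items())}
    return curves


logs = (
    [("desktop", "list", 1, True)] * 2 + [("desktop", "list", 1, False)] * 2
    + [("desktop", "list", 2, True)] * 1 + [("desktop", "list", 2, False)] * 3
    + [("mobile", "list", 1, True)] * 2
    + [("mobile", "list", 2, True)] * 1 + [("mobile", "list", 2, False)] * 3
)
curves = position_curves(logs)
```

Keying the curve on both device and display format means a UI change shows up as a new segment with its own curve instead of silently reusing the old one.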
EXPLORATION GONE WRONG
Too much exploration (over 10% of traffic) visibly hurts user experience and triggers complaints. Too little (under 1%) leaves you blind, with too little unbiased data to measure bias at all. Unbalanced exploration (always exploring the same item types) creates new biases. Monitor exploration coverage: are all item categories getting explored proportionally? Is exploration distributed across user segments?
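A coverage monitor along these lines might look like the sketch below. The 1% and 10% bounds come from the text above; the `ratio_floor` parameter, the function name, and the input layout are assumptions made for illustration.

```python
from collections import Counter


def exploration_report(impressions, catalog_share,
                       min_rate=0.01, max_rate=0.10, ratio_floor=0.5):
    """Check the overall exploration rate and per-category balance.

    impressions: list of (category, is_exploration) pairs.
    catalog_share: {category: fraction of the catalog in that category}.
    ratio_floor (hypothetical): a category is flagged when its share of
    exploration traffic falls below ratio_floor * its catalog share.
    """
    total = len(impressions)
    explored = [cat for cat, is_exp in impressions if is_exp]
    rate = len(explored) / total if total else 0.0
    counts = Counter(explored)

    under_explored = []
    for cat, share in catalog_share.items():
        explore_share = counts[cat] / len(explored) if explored else 0.0
        if explore_share < ratio_floor * share:
            under_explored.append(cat)

    return {
        "rate": rate,
        "rate_ok": min_rate <= rate <= max_rate,
        "under_explored": sorted(under_explored),
    }


traffic = (
    [("music", False)] * 95
    + [("music", True)] * 4
    + [("news", True)] * 1
)
report = exploration_report(
    traffic, {"music": 0.4, "news": 0.3, "sports": 0.3}
)
```

The same function can be run with user segments in place of item categories to answer the second question, whether exploration is spread across users.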
DELAYED FEEDBACK LOOPS
Some feedback loops take months to manifest. The model slowly narrows its recommendations, but daily metrics look fine. By the time engagement drops, the problem is severe. Track catalog coverage over 90-day windows. If coverage trends down consistently, you have a slow feedback loop even if daily metrics are stable.
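The coverage tracking described above can be sketched as follows. As a simplification, this uses non-overlapping 90-day windows and treats "trending down consistently" as a strict decrease across at least three consecutive windows; both choices, along with the function names, are assumptions for illustration.

```python
def coverage_by_window(impressions, catalog_size, window_days=90):
    """Fraction of the catalog shown at least once in each window.

    impressions: iterable of (day_index, item_id), with day_index counted
    from the start of the log. Windows with no impressions are skipped.
    """
    seen = {}
    for day, item in impressions:
        seen.setdefault(day // window_days, set()).add(item)
    return [len(seen[w]) / catalog_size for w in sorted(seen)]


def is_trending_down(coverages, min_windows=3):
    """True if coverage strictly decreases across every consecutive window."""
    if len(coverages) < min_windows:
        return False
    return all(later < earlier
               for earlier, later in zip(coverages, coverages[1:]))


# Catalog of 10 items; each window surfaces fewer distinct items.
log = (
    [(d, item) for d in (0, 45) for item in range(8)]
    + [(100, item) for item in range(6)]
    + [(200, item) for item in range(4)]
)
coverages = coverage_by_window(log, catalog_size=10)
slow_loop = is_trending_down(coverages)
```

Because the check looks only at distinct items surfaced, it can fire while daily engagement metrics remain flat, which is the situation the section warns about.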

💡 Key Takeaways

✓ Wrong propensity makes IPS worse: validate by comparing estimated vs empirical impression distribution

✓ UI changes invalidate position curves: list to grid, adding carousel, all require remeasurement

✓ Mobile and desktop have different position bias: segment by device type

✓ Exploration over 10% hurts UX visibly; under 1% leaves you blind; monitor category coverage

✓ Slow feedback loops take months: track 90-day catalog coverage trends even when daily metrics look fine

📌 Interview Tips

1. Describe training-serving skew: production model changed but training still uses old propensity

2. Explain format change: list to grid shifts attention from position 5 to position 6

3. Discuss delayed detection: daily engagement stable but 90-day catalog coverage dropping 2% per month
