Failure Modes: Propensity Errors, Format Changes, and Delayed Loops
PROPENSITY ESTIMATION ERRORS
If propensity estimates are wrong, IPS can make things worse instead of better, because every sample is reweighted by the inverse of an incorrect probability. Common causes: logging propensities from the production model when a different model actually served the traffic (training-serving skew), not accounting for the position randomization policy, and ignoring user-level personalization in the propensity calculation. Validate propensities by comparing the estimated distribution of impressions against the empirical one.
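One way to run that comparison is sketched below. It assumes a hypothetical log format in which each record stores the action that was shown plus the estimated propensity of every candidate action; the function name and the record layout are illustrative, not taken from any particular logging system.

```python
from collections import Counter


def propensity_gap(records):
    """Compare empirical impression shares with the shares implied by
    the logged propensities.

    records: list of (shown_action, {action: estimated_propensity}).
    Returns {action: empirical_share - expected_share}.
    """
    n = len(records)
    if n == 0:
        raise ValueError("need at least one logged impression")

    # How often each action was actually shown.
    empirical = Counter(shown for shown, _ in records)

    # How often each action should have been shown if the estimated
    # propensities were correct: the sum of its propensities over all logs.
    expected = Counter()
    for _, propensities in records:
        for action, p in propensities.items():
            expected[action] += p

    actions = set(empirical) | set(expected)
    return {a: (empirical[a] - expected[a]) / n for a in actions}


# Toy data (assumed): the estimates claim a 50/50 split, but "a" was shown 3 of 4 times.
logs = [("a", {"a": 0.5, "b": 0.5})] * 3 + [("b", {"a": 0.5, "b": 0.5})]
gaps = propensity_gap(logs)
```

Gaps close to zero for every action are consistent with well-calibrated propensities. A large positive or negative gap for an action suggests the logged propensities do not describe the policy that actually served the traffic.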
DISPLAY FORMAT CHANGES
Position bias curves change when the display format changes. Moving from a list to a grid changes which positions get attention. Adding a carousel above the main list shifts all position curves down. If you apply an old position model to a new format, the debiasing corrects for attention patterns that no longer exist and is therefore wrong. Remeasure position curves after any UI change and retrain position models.
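A minimal sketch of remeasuring a position curve follows. It assumes the impressions come from traffic where item order was randomized, so click-through differences between positions reflect position alone; the function names and the drift check are illustrative assumptions, not a prescribed method.

```python
from collections import defaultdict


def position_curve(impressions):
    """Estimate relative attention per position.

    impressions: iterable of (position, clicked) pairs, assumed to come
    from randomized-order traffic.
    Returns {position: CTR at position / CTR at the top position}.
    """
    shown = defaultdict(int)
    clicks = defaultdict(int)
    for position, clicked in impressions:
        shown[position] += 1
        clicks[position] += 1 if clicked else 0

    ctr = {pos: clicks[pos] / shown[pos] for pos in shown}
    top = ctr[min(ctr)]  # CTR of the first (lowest-numbered) position
    if top == 0:
        raise ValueError("no clicks at the top position; cannot normalize")
    return {pos: ctr[pos] / top for pos in sorted(ctr)}


def max_curve_drift(old_curve, new_curve):
    """Largest absolute change over positions present in both curves."""
    shared = set(old_curve) & set(new_curve)
    return max(abs(old_curve[p] - new_curve[p]) for p in shared)


# Toy data (assumed): position 1 gets 5 clicks in 10, position 2 gets 2 in 10.
data = [(1, True)] * 5 + [(1, False)] * 5 + [(2, True)] * 2 + [(2, False)] * 8
curve = position_curve(data)
drift = max_curve_drift({1: 1.0, 2: 0.7}, curve)
```

Comparing the curve measured before a UI change with the one measured after gives a rough signal for whether the position model needs retraining. The threshold for "too much drift" depends on the product and is left to the reader.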
EXPLORATION GONE WRONG
Too much exploration (over 10%) visibly hurts user experience and triggers complaints. Too little (under 1%) leaves you blind. Unbalanced exploration (always exploring the same item types) creates new biases. Monitor exploration coverage: are all item categories getting explored proportionally? Is exploration distributed across user segments?
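The coverage questions above can be turned into a simple report, sketched here. Two assumptions are baked in: "proportionally" is read as proportional to each category's share of the catalog, and a category counts as under-explored when its share of exploration traffic falls below half of its catalog share. The 1% and 10% bounds come from the text; everything else (names, the `tolerance` parameter, the data layout) is hypothetical.

```python
from collections import Counter


def exploration_report(impressions, catalog_share, low=0.01, high=0.10, tolerance=0.5):
    """Summarize exploration rate and per-category balance.

    impressions: list of (category, is_exploration) pairs.
    catalog_share: {category: fraction of the catalog in that category}.
    """
    total = len(impressions)
    explored = Counter(cat for cat, is_explore in impressions if is_explore)
    n_explore = sum(explored.values())
    rate = n_explore / total if total else 0.0

    if rate > high:
        status = "too_much"
    elif rate < low:
        status = "too_little"
    else:
        status = "ok"

    # Flag categories whose share of exploration traffic is well below
    # their share of the catalog (assumed definition of "proportional").
    under_explored = []
    for category, share in catalog_share.items():
        explore_share = explored[category] / n_explore if n_explore else 0.0
        if explore_share < tolerance * share:
            under_explored.append(category)

    return {"rate": rate, "status": status, "under_explored": sorted(under_explored)}


# Toy data (assumed): 100 impressions, 5 of them exploratory, none for "games".
traffic = [("books", False)] * 95 + [("books", True)] * 4 + [("music", True)]
report = exploration_report(traffic, {"books": 0.5, "music": 0.3, "games": 0.2})
```

The same function can be run per user segment by filtering the impressions first, which covers the second question about how exploration is distributed across segments.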
DELAYED FEEDBACK LOOPS
Some feedback loops take months to manifest. The model slowly narrows its recommendations, but daily metrics look fine. By the time engagement drops, the problem is severe. Track catalog coverage (the share of the catalog that actually gets recommended) over 90-day windows. If coverage trends down consistently, you have a slow feedback loop even if daily metrics are stable.
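A rough sketch of that coverage check is shown below. It assumes recommendation logs are available as (day index, item ID) pairs, uses non-overlapping 90-day windows, and treats "trends down consistently" as every window being lower than the one before it. All three choices are assumptions made for illustration; rolling windows or a fitted slope would be equally reasonable.

```python
from collections import defaultdict


def coverage_by_window(events, catalog_size, window_days=90):
    """Fraction of the catalog recommended at least once per window.

    events: iterable of (day_index, item_id) pairs, day_index starting at 0.
    Windows with no events at all are skipped rather than reported as zero.
    """
    items_per_window = defaultdict(set)
    for day, item in events:
        items_per_window[day // window_days].add(item)
    return [len(items_per_window[w]) / catalog_size for w in sorted(items_per_window)]


def is_trending_down(coverage, min_windows=3):
    """True if coverage fell in every consecutive pair of windows."""
    if len(coverage) < min_windows:
        return False  # not enough history to call it a trend
    return all(later < earlier for earlier, later in zip(coverage, coverage[1:]))


# Toy data (assumed): a 10-item catalog whose recommended set shrinks each window.
log = (
    [(0, i) for i in (1, 2, 3, 4)]      # days 0-89
    + [(100, i) for i in (1, 2, 3)]     # days 90-179
    + [(200, i) for i in (1, 2)]        # days 180-269
)
coverage = coverage_by_window(log, catalog_size=10)
alert = is_trending_down(coverage)
```

Because each window spans 90 days, the check needs most of a year of logs before it can fire, which matches the slow nature of the problem it is meant to catch.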