What Are the Critical Failure Modes in Bias-Aware Ranking?
Failure Mode: Impression Logging Errors
Server-side impression logging treats every item returned by the API as seen by the user. In reality, users on infinite-scroll feeds rarely scroll past the first 5-10 items. If you log all 50 returned items as impressions, 40 of them become false negatives: the model learns that these unseen items are irrelevant, even though users never had a chance to consider them. The fix is client-side viewability tracking that logs an impression only when at least 50% of an item's pixels are visible for at least 1 second.
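The viewability rule can be sketched as a small state machine that tracks how long each item has stayed above the visibility threshold. This is a minimal illustration, not a real client SDK: the `ViewabilityTracker` class, its `observe` method, and the event shape are all assumptions; in practice the visibility events would come from something like a browser IntersectionObserver.

```python
from dataclasses import dataclass, field

# Illustrative thresholds from the text: >= 50% of pixels visible
# for >= 1 continuous second counts as an impression.
MIN_VISIBLE_FRACTION = 0.5
MIN_VISIBLE_SECONDS = 1.0

@dataclass
class ViewabilityTracker:
    visible_since: dict = field(default_factory=dict)  # item_id -> first timestamp above threshold
    impressions: set = field(default_factory=set)      # item_ids logged as seen

    def observe(self, item_id: str, visible_fraction: float, now: float) -> None:
        """Call on every scroll/visibility event for an item."""
        if visible_fraction >= MIN_VISIBLE_FRACTION:
            start = self.visible_since.setdefault(item_id, now)
            if now - start >= MIN_VISIBLE_SECONDS:
                self.impressions.add(item_id)  # dwell threshold met: log once
        else:
            self.visible_since.pop(item_id, None)  # dropped below 50%: reset dwell timer

tracker = ViewabilityTracker()
tracker.observe("item_7", 0.8, now=0.0)
tracker.observe("item_7", 0.8, now=1.2)   # >= 1s at >= 50% visible -> impression
tracker.observe("item_42", 0.3, now=1.2)  # never crosses the visibility threshold
```

Under this rule, `item_42` never becomes a false negative in the training data, because it is never logged as an impression at all.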
Failure Mode: Propensity Model Staleness
Propensity estimates are computed from historical data, but user behavior changes over time: new UI layouts change how far users scroll, mobile and desktop users have different examination patterns, and seasonal shifts affect engagement. If you trained propensities on data from three months ago, they may no longer reflect current behavior. A propensity curve showing 15% examination at position 8 might show 25% after a UI redesign. Stale propensities make your IPS weights wrong, reintroducing the very bias you set out to remove.
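The distortion is easy to quantify, since the IPS weight is just the inverse of the examination probability. A minimal sketch using the numbers from the text (15% examination at position 8 before a redesign, 25% after); the `ips_weight` helper and its clipping value are illustrative assumptions:

```python
# How a stale propensity distorts the IPS weight at one position.

def ips_weight(propensity: float, clip: float = 100.0) -> float:
    """Inverse-propensity weight, clipped to bound variance."""
    return min(1.0 / propensity, clip)

stale_p = 0.15   # examination probability from 3-month-old logs
fresh_p = 0.25   # examination probability after the UI redesign

w_stale = ips_weight(stale_p)        # ~6.67
w_fresh = ips_weight(fresh_p)        # 4.0

# Every click at position 8 is over-weighted by ~67%, so the model
# systematically over-promotes items that happened to be clicked there.
overweight_ratio = w_stale / w_fresh  # ~1.67
```

This is why stale propensities do not merely weaken the correction: they apply a systematically wrong correction, which is the bias problem in a new form.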
Failure Mode: Population Shift
Propensities are often estimated over all users, but different user segments scroll differently: power users examine 20 positions, casual users examine 3. If your traffic mix shifts toward casual users, average examination at deeper positions drops. Propensities averaged over the population then under-correct for casual users (their true examination probabilities at deep positions are lower, so their inverse-propensity weights should be higher) and over-correct for power users (applying weights that are too high). Segment-specific propensity estimation helps, but it adds complexity and requires enough data per segment.
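One way to handle the per-segment data requirement is to estimate propensities per (segment, position) cell and fall back to the global estimate when a cell is sparse. A hedged sketch, assuming examination labels are available from something like result randomization; the segment names, log tuple shape, and the 1,000-impression threshold are illustrative assumptions:

```python
from collections import defaultdict

MIN_IMPRESSIONS_PER_CELL = 1000  # below this, back off to the global estimate

def estimate_propensities(logs):
    """logs: iterable of (segment, position, examined) tuples, where
    `examined` is 1 if the user examined the result, else 0."""
    seen = defaultdict(int)        # (segment, position) -> impression count
    examined = defaultdict(int)    # (segment, position) -> examination count
    global_seen = defaultdict(int)
    global_examined = defaultdict(int)
    for segment, pos, was_examined in logs:
        seen[(segment, pos)] += 1
        examined[(segment, pos)] += was_examined
        global_seen[pos] += 1
        global_examined[pos] += was_examined

    def propensity(segment, pos):
        n = seen[(segment, pos)]
        if n >= MIN_IMPRESSIONS_PER_CELL:
            return examined[(segment, pos)] / n
        # Sparse cell: fall back to the all-user estimate for this position.
        return global_examined[pos] / max(global_seen[pos], 1)

    return propensity

# Synthetic logs: a well-sampled power-user cell and a sparse casual-user cell.
logs = ([("power", 3, 1)] * 1600 + [("power", 3, 0)] * 400 +
        [("casual", 3, 1)] * 30 + [("casual", 3, 0)] * 70)
p = estimate_propensities(logs)
p_power = p("power", 3)    # 0.8, from its own 2,000-impression cell
p_casual = p("casual", 3)  # falls back to the pooled all-user estimate
```

The fallback trades segment fidelity for variance control: sparse segments inherit the population bias described above, but avoid noisy weights from tiny samples.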