
Debiasing Through Position-Aware Learning (PAL) and Factorization

Position-Aware Learning (PAL) tackles position bias by explicitly factorizing click probability into two independent components: visibility and relevance. The core insight is that p(click|item, position) = p(seen|position, context) × p(click|item, seen). By modeling these separately, you can rank with the relevance component alone and discard the position-dependent visibility term.

In a production implementation, you train two modules. The visibility module learns p(seen|position, context) from exposure versus non-exposure signals: viewport-visible impressions versus server-side insertions, dwell-time thresholds, or explicit attention tracking. This module captures the position curve: on mobile, position 1 might have 95% visibility, position 3 drops to 60%, and position 10 falls to 10%. The relevance module learns p(click|item, seen) using only post-exposure clicks, effectively asking "given that the user saw this item, how relevant is it?"

An alternative additive approach decomposes the score as s(item, position) = f(item) + g(position, context), where f captures intrinsic relevance and g captures position- and context-specific calibration. You train f and g jointly but rank using only f. Google Search and Ads systems use variants of this for per-slot calibration: if position 1 on desktop carries a +0.5 additive boost to the log-odds of a click and position 5 carries a -0.3 penalty, you apply these g(position, context) corrections when estimating true relevance or computing auction values, but rank using the raw f(item) scores.

The critical implementation detail: at inference, you must either use only the relevance module (factorization approach) or hold position constant or drop g(position, context) entirely (additive approach). Tree-based models are brittle here because they learn discrete splits on position that collapse when the position feature is removed, causing unpredictable ranking changes. Neural networks generalize better when position features are zeroed at inference because their learned representations are smoother.
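To make the factorized setup concrete, here is a minimal PAL-style sketch in PyTorch. The framework choice, tower sizes, feature dimensions, and the dummy batch are all illustrative assumptions, not details from the text: the product of the two heads is fit to logged clicks, while ranking reads the relevance head alone.

```python
import torch
import torch.nn as nn

class PALModel(nn.Module):
    """Factorized CTR sketch: p(click) = p(seen|position, context) * p(click|item, seen)."""

    def __init__(self, item_dim: int, context_dim: int, n_positions: int):
        super().__init__()
        # Relevance tower: p(click|item, seen). Deliberately sees no position features.
        self.relevance = nn.Sequential(
            nn.Linear(item_dim, 64), nn.ReLU(), nn.Linear(64, 1)
        )
        # Visibility tower: p(seen|position, context). Deliberately sees no item features.
        self.pos_emb = nn.Embedding(n_positions, 8)
        self.visibility = nn.Sequential(
            nn.Linear(8 + context_dim, 32), nn.ReLU(), nn.Linear(32, 1)
        )

    def forward(self, item_feats, context_feats, positions):
        p_rel = torch.sigmoid(self.relevance(item_feats)).squeeze(-1)
        vis_in = torch.cat([self.pos_emb(positions), context_feats], dim=-1)
        p_seen = torch.sigmoid(self.visibility(vis_in)).squeeze(-1)
        return p_seen * p_rel  # training target: overall click probability

    def rank_scores(self, item_feats):
        # Inference path: relevance only; position never enters the ranking score.
        return torch.sigmoid(self.relevance(item_feats)).squeeze(-1)

# Dummy batch with illustrative shapes.
B = 256
item_feats = torch.randn(B, 32)
context_feats = torch.randn(B, 4)
positions = torch.randint(0, 20, (B,))      # logged display positions
clicks = torch.randint(0, 2, (B,)).float()  # observed clicks at those positions

model = PALModel(item_dim=32, context_dim=4, n_positions=20)
loss = nn.BCELoss()(model(item_feats, context_feats, positions), clicks)
loss.backward()

ranking = torch.argsort(model.rank_scores(item_feats), descending=True)
```

Because the relevance tower never receives position as an input, there is nothing to zero out at serving time, which sidesteps the train/serve mismatch described above.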
💡 Key Takeaways
Factorization approach: p(click|item, position) = p(seen|position, context) × p(click|item, seen). Train separate visibility and relevance modules; use only relevance for ranking.
Additive decomposition: s(item, position) = f(item) + g(position, context). Use f for ranking and g for calibration in ads auctions or per-slot adjustments (see the sketch after this list).
Visibility module learns position curves from viewport impressions: position 1 at 95% visibility, position 3 at 60%, position 10 at 10% on mobile surfaces.
Neural networks generalize better than tree models when position features are removed at inference. Trees learn brittle discrete splits (e.g., "if position < 3, add +2 to score") that collapse unpredictably.
Google Ads applies per-slot calibration curves g(position, context) to correct pCTR for auction ranking, preventing overbidding on top slots whose clicks are inflated by position effects alone.
Critical failure mode: training with position as a feature and then zeroing it at inference can cause ranking instability. Better to exclude position from the relevance tower entirely, or to use late fusion with careful ablation.
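As referenced in the additive-decomposition takeaway above, here is a matching sketch of s(item, position) = f(item) + g(position, context), again in PyTorch with illustrative layer sizes and a dummy batch. Training fits the summed log-odds to observed clicks; inference drops g and ranks by f alone.

```python
import torch
import torch.nn as nn

class AdditiveDebiasModel(nn.Module):
    """Additive sketch: s = f(item) + g(position, context); rank with f alone."""

    def __init__(self, item_dim: int, context_dim: int, n_positions: int):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(item_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        self.pos_emb = nn.Embedding(n_positions, 8)
        self.g = nn.Sequential(nn.Linear(8 + context_dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, item_feats, context_feats, positions):
        f_score = self.f(item_feats).squeeze(-1)  # intrinsic relevance, in log-odds
        g_in = torch.cat([self.pos_emb(positions), context_feats], dim=-1)
        g_score = self.g(g_in).squeeze(-1)        # position/context calibration, in log-odds
        return f_score + g_score, f_score

model = AdditiveDebiasModel(item_dim=32, context_dim=4, n_positions=20)

# Training: the summed log-odds s = f + g is fit to clicks at logged positions.
B = 256
item_feats = torch.randn(B, 32)
context_feats = torch.randn(B, 4)
positions = torch.randint(0, 20, (B,))
clicks = torch.randint(0, 2, (B,)).float()
s, _ = model(item_feats, context_feats, positions)
nn.BCEWithLogitsLoss()(s, clicks).backward()

# Inference: drop g entirely and rank by f, so position never touches the ranking.
_, f_only = model(item_feats, context_feats, positions)
ranking = torch.argsort(f_only, descending=True)
```

One caveat: f and g in this form are only identified up to a constant offset, so a common practice (an assumption here, not something the text prescribes) is to anchor g, for example by constraining it to zero at a reference position.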
📌 Examples
Netflix model training: Visibility module trained on viewport dwell time (seen if >0.5 seconds in view). Relevance module trained on play rate given seen. At inference, only relevance scores rank titles, avoiding homepage row position leakage.
Pinterest Ads calibration: Additive model s = f(pin, user) + g(slot, device). On the mobile feed, slot 1 gets +0.6 log-odds, slot 5 gets -0.2. The auction ranks by f only; billing uses s with the g correction to avoid overcharging for position lift (worked numerically in the sketch after these examples).
Meta News Feed: A dense neural tower for relevance excludes retrieval-rank features. A separate shallow network learns per-module and per-position adjustments. The final ranker combines both, but exploration slices use relevance only to measure true quality.
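To show how a per-slot log-odds correction like the one in the Pinterest example plays out numerically, here is a small pure-Python sketch. The +0.6 and -0.2 offsets come from the example above; the logit/sigmoid helpers and the 10% base CTR are illustrative assumptions.

```python
import math

def logit(p: float) -> float:
    return math.log(p / (1.0 - p))

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative per-slot log-odds offsets g(slot) from the Pinterest-style example.
G = {1: 0.6, 5: -0.2}

def calibrated_pctr(f_logit: float, slot: int) -> float:
    """Predicted CTR once the position effect g(slot) is added back in."""
    return sigmoid(f_logit + G.get(slot, 0.0))

# A pin whose intrinsic relevance f corresponds to a 10% CTR (logit ~ -2.20):
f = logit(0.10)
print(calibrated_pctr(f, slot=1))  # ~0.168: lift from slot 1 visibility
print(calibrated_pctr(f, slot=5))  # ~0.083: penalty at slot 5

# The auction compares f across pins; billing uses the calibrated pCTR so the
# advertiser is not charged for click lift that came from the slot itself.
```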