ML-Powered Search & Ranking • Relevance Feedback (Click Models, Position Bias)
What is Inverse Propensity Scoring and When Does It Fail?
Inverse Propensity Scoring (IPS) is a counterfactual learning technique that reweights training examples by the inverse probability that each example would have been observed. If an item at position 8 has only a 10 percent chance of being seen, and it was clicked, that click gets a weight of 1 / 0.10 = 10. This corrects for position bias by amplifying observations from rarely examined positions.
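A minimal sketch of that reweighting, using an illustrative propensity value rather than anything measured:

```python
# Minimal sketch of IPS reweighting: each observed training example is
# weighted by the inverse of its examination propensity.
# The propensity value below is illustrative, not measured.

def ips_weight(propensity: float) -> float:
    """Inverse propensity weight: 1 / P(item was seen at its position)."""
    return 1.0 / propensity

# A click at position 8 with a 10% chance of being seen gets weight 10.
print(ips_weight(0.10))  # 10.0
```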
IPS is statistically unbiased if propensities are known exactly. In practice, you estimate propensities from randomized exploration cohorts. For example, run RandTopN shuffling on 2 percent of traffic at 20,000 queries per second (QPS). This yields 400 QPS of randomized traffic, or about 34 million impressions per day. Fit a parametric curve p(seen | position, context) per surface and device. Apply these weights during training: weight = 1 / p(seen).
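One way this estimation step might look, assuming a simple log of (position, clicked) pairs from the RandTopN cohort and a power-law curve p(pos) = pos^(-eta); the function name, log schema, and curve family are assumptions for illustration, not a prescribed pipeline:

```python
# Hedged sketch: estimating a parametric position-propensity curve from
# randomized-exploration logs. Assumes every position 1..max_pos appears
# in the log at least once.
import numpy as np

def fit_position_propensity(positions: np.ndarray, clicks: np.ndarray):
    """Estimate p(seen | position), relative to position 1, under randomization.

    With randomized rankings, relevance is (approximately) independent of
    position, so CTR(pos) / CTR(1) estimates the examination probability.
    A power-law curve p(pos) = pos**(-eta) is then fit in log space.
    """
    max_pos = int(positions.max())
    ctr = np.array([clicks[positions == p].mean() for p in range(1, max_pos + 1)])
    examination = ctr / ctr[0]                      # normalize to position 1
    pos = np.arange(1, max_pos + 1)
    # Fit log(examination) = -eta * log(pos) by least squares.
    slope = np.polyfit(np.log(pos), np.log(np.clip(examination, 1e-6, None)), 1)[0]
    eta = -slope
    return lambda p: float(p) ** (-eta)

# Example: on real randomized logs, propensity_fn(10) might come out near 0.08.
```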
The critical failure mode is variance explosion. Rare positions or contexts can have very low propensities, creating weights of 50, 100, or more. These large weights cause unstable gradients and high noise in stochastic gradient descent. A single mislabeled example with weight 100 can dominate a minibatch. The effective sample size, defined as ESS = (Σw)² / Σw² (the squared sum of weights over the sum of squared weights), can collapse from millions to hundreds.
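A quick way to monitor this collapse is to compute the effective sample size directly from the weights; the example values below are illustrative:

```python
import numpy as np

def effective_sample_size(weights: np.ndarray) -> float:
    """ESS = (sum of weights)^2 / (sum of squared weights).

    Equal weights give ESS == len(weights); a few huge weights shrink it sharply.
    """
    return weights.sum() ** 2 / (weights ** 2).sum()

uniform = np.ones(1_000_000)
print(effective_sample_size(uniform))   # 1,000,000.0

skewed = np.ones(1_000_000)
skewed[:500] = 100.0                    # a handful of weight-100 examples
print(effective_sample_size(skewed))    # far below the nominal 1,000,000
```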
Practical mitigations include weight clipping, self-normalized IPS, and large batch sizes. Clip weights at a maximum, for example min(weight, 20), which reintroduces some bias but dramatically reduces variance. Self-normalized IPS divides each weight by the sum of weights in the batch; this introduces a small finite-sample bias (it is only asymptotically unbiased) but stabilizes training. Monitor effective sample size and expand exploration if it drops too low. If you have only 1 percent exploration and try to debias positions 1 through 20, variance will be unmanageable. Consider hybrid approaches that use IPS for the top 5 positions and architectural separation for deeper positions.
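A sketch of clipped, self-normalized IPS applied within one minibatch; the clip threshold of 20 follows the text, while the function name and the toy batch are illustrative assumptions:

```python
import numpy as np

def snips_minibatch_loss(losses: np.ndarray,
                         propensities: np.ndarray,
                         clip_max: float = 20.0) -> float:
    """Clipped, self-normalized IPS estimate of the average loss in a minibatch.

    Clipping caps the influence of very-low-propensity examples (adding bias,
    removing variance); dividing by the sum of in-batch weights is the
    self-normalized estimator, which is consistent but not exactly unbiased.
    """
    weights = np.minimum(1.0 / propensities, clip_max)
    return float((weights * losses).sum() / weights.sum())

# Illustrative minibatch: per-example losses and estimated examination propensities.
losses = np.array([0.7, 0.2, 1.3, 0.4])
props = np.array([0.90, 0.30, 0.02, 0.60])  # 0.02 would give weight 50; clipped to 20
print(snips_minibatch_loss(losses, props))
```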
💡 Key Takeaways
• IPS reweights training examples by 1 / p(seen at position), upweighting rare observations to correct position bias
• Statistically unbiased if propensities are accurate, but variance can explode for low-propensity events in positions rarely examined
• At 20,000 QPS with 2 percent exploration, you get 400 QPS of randomized traffic, or about 34 million impressions daily, to estimate propensities
• Weight clipping at a maximum of 10 to 20 reintroduces bias but prevents single examples from dominating gradients and destabilizing training
• Effective sample size can collapse from millions to hundreds when propensities are low, requiring expanded exploration or hybrid approaches
📌 Examples
Estimate p(seen | position 1) = 0.90, p(seen | position 5) = 0.30, p(seen | position 10) = 0.08. A click at position 10 gets weight 1 / 0.08 = 12.5, amplifying its signal. A non-click at position 1 gets weight 1 / 0.90 ≈ 1.11, a minimal adjustment.
Airbnb search uses self-normalized IPS to debias click models. They clip weights at 20 and normalize within each minibatch to stabilize training. This reduces variance while maintaining approximate unbiasedness over many batches.