
Label Engineering: Creating Training Labels From Implicit Feedback

Core Concept
Label engineering transforms raw user signals (clicks, purchases, dwell time) into training labels that reflect true relevance rather than display effects. This is feature engineering for your labels, not your inputs.

Raw Signals Are Biased Labels

A click at position 1 doesn't mean the same thing as a click at position 10: position 1 gets roughly 10x more clicks regardless of quality. If you use raw clicks as positive labels, you train the model to predict position, not relevance. Label engineering starts by recognizing that raw signals are contaminated by presentation effects: position, device, time of day, and surrounding items.

Propensity-Weighted Labels

Create a label weight based on display propensity. Run 1-5% exploration traffic with randomized positions and build a position-to-propensity lookup, P(click|position). Weight each training example by 1/propensity: a click at position 10 (propensity 0.05) gets weight 20, while a click at position 1 (propensity 0.5) gets weight 2. This rebalances training to approximate what clicks would look like if all items were shown equally.
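A minimal sketch of this idea, assuming an exploration log of (position, clicked) pairs from the randomized traffic; the function names and the propensity clipping floor are illustrative assumptions, not a prescribed implementation:

```python
def estimate_propensities(exploration_log):
    """Estimate P(click | position) from randomized exploration traffic."""
    clicks, impressions = {}, {}
    for position, clicked in exploration_log:
        impressions[position] = impressions.get(position, 0) + 1
        clicks[position] = clicks.get(position, 0) + int(clicked)
    return {pos: clicks[pos] / impressions[pos] for pos in impressions}

def inverse_propensity_weight(position, propensities, min_propensity=0.01):
    """Weight = 1/propensity; clip small propensities to avoid exploding weights."""
    p = max(propensities.get(position, min_propensity), min_propensity)
    return 1.0 / p

propensities = {1: 0.5, 10: 0.05}
print(inverse_propensity_weight(1, propensities))   # 2.0
print(inverse_propensity_weight(10, propensities))  # 20.0
```

The clipping floor is a practical detail: rarely shown positions get tiny propensity estimates, and without a floor a handful of deep-position clicks can dominate training.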

Multi-Signal Label Aggregation

Single signals are noisy, so combine multiple user actions: label = 0.3 × click + 0.5 × add_to_cart + 1.0 × purchase. Different signals have different noise levels and business value: clicks are high volume but noisy; purchases are low volume but high confidence. Dwell time refines the click signal further: a click with 30+ seconds of dwell is stronger evidence than a 2-second bounce. The weights become tunable hyperparameters.
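A sketch of that aggregation using the weights from the formula above; the dwell threshold and signal names are illustrative assumptions:

```python
SIGNAL_WEIGHTS = {"click": 0.3, "add_to_cart": 0.5, "purchase": 1.0}
MIN_DWELL_SECONDS = 30  # below this, a click is treated as a bounce

def aggregate_label(click, add_to_cart, purchase, dwell_seconds=0.0):
    """Combine binary signals into a graded relevance label.

    Weights are tunable hyperparameters; dwell time gates the click
    signal so a 2-second bounce contributes nothing.
    """
    effective_click = click and dwell_seconds >= MIN_DWELL_SECONDS
    return (SIGNAL_WEIGHTS["click"] * int(effective_click)
            + SIGNAL_WEIGHTS["add_to_cart"] * int(add_to_cart)
            + SIGNAL_WEIGHTS["purchase"] * int(purchase))

# Long-dwell click + add-to-cart + purchase: 0.3 + 0.5 + 1.0
print(aggregate_label(True, True, True, dwell_seconds=45))   # 1.8
# Quick bounce: contributes nothing
print(aggregate_label(True, False, False, dwell_seconds=2))  # 0.0
```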

Position as Feature vs Position for Debiasing

Position can serve two distinct roles. (1) Include position as an input feature during training and set it to a constant (e.g., position 1) at serving, so the model learns to factor out position effects. (2) Use position only for label weighting and never as a feature, so the model trains on debiased labels but never sees position. Approach 1 requires careful implementation to avoid leakage; approach 2 needs accurate propensity estimates. Most production systems use both.
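A sketch of the combined setup many production systems use, with inverse-propensity label weights and position as a feature that is pinned to a constant at inference; the feature schema and helper names are assumptions for illustration:

```python
SERVING_POSITION = 1  # constant fed at inference so position effects cancel out

def training_row(features, position, clicked, propensities, min_p=0.01):
    """Training example: position is an input feature AND sets the sample weight."""
    row = dict(features, position=position)
    weight = 1.0 / max(propensities.get(position, min_p), min_p)
    return row, float(clicked), weight

def serving_row(features):
    """Inference example: same schema, but position pinned to the constant."""
    return dict(features, position=SERVING_POSITION)

# The model sees real positions during training, but every request
# is scored as if the item were displayed at position 1.
row, label, weight = training_row({"bm25": 2.1}, position=10,
                                  clicked=True, propensities={10: 0.05})
print(row, label, weight)  # {'bm25': 2.1, 'position': 10} 1.0 20.0
```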

💡 Key Takeaways
- Raw user signals (clicks) are biased by position, device, and presentation effects; they need engineering before use as labels
- Propensity-weighted labels use 1/P(click|position) to rebalance training data as if all items were shown equally
- Multi-signal aggregation combines clicks, purchases, and dwell time with different weights reflecting confidence and business value
- Position can be used as an input feature (set to a constant at serving) or only for label weighting (propensity scores)
- Most production systems combine both: propensity-weighted labels AND position as a feature during training
📌 Interview Tips
1. Frame label engineering as "feature engineering for your labels"; this shows understanding that labels themselves need engineering
2. Give a specific formula: label = 0.3 × click + 0.5 × add_to_cart + 1.0 × purchase, with weights as tunable hyperparameters
3. Explain the two uses of position: as an input feature (set constant at serving) vs for label weighting only (never as a feature)