
Feature Drift Detection with PSI and Distribution Metrics

Feature drift occurs when the statistical distribution of input features shifts between training time and inference time, or gradually over the lifetime of a deployed model. Unlike label drift, which tracks changes in outcomes, feature drift focuses on inputs and can silently degrade model performance even when aggregate prediction accuracy metrics look stable.

The Population Stability Index (PSI) is the workhorse metric for drift detection, quantifying distribution divergence between a reference window (typically training data or a recent baseline) and a comparison window (current production traffic). PSI bins feature values (10 bins for continuous features, the actual categories for categorical features), computes the proportion of samples in each bin for the reference and comparison distributions, then sums the weighted log ratio: PSI = Σᵢ (cᵢ − rᵢ) × ln(cᵢ / rᵢ), where rᵢ and cᵢ are the fractions of reference and comparison samples falling in bin i. The result is a single scalar: PSI below 0.1 indicates no significant change, 0.1 to 0.2 signals moderate drift warranting investigation, and above 0.2 indicates severe drift requiring action such as retraining or a feature engineering review. Netflix recommendation models monitor PSI on 30 to 50 key features with alerts at 0.1 (warn) and 0.2 (critical), recomputing every hour on the latest 24-hour window against a 7-day training baseline.

PSI has limitations that teams supplement with additional metrics. First, it treats all bins equally regardless of their importance to the model: drift in rarely used tail values contributes as much as drift in the dense middle of the distribution, even though predictions may only be sensitive to the middle. Second, PSI requires binning choices that affect sensitivity: too few bins miss granular shifts, too many create noise. Third, it is univariate and does not catch multivariate drift where individual features look stable but the joint distribution shifts. To address these gaps, production systems add Kullback-Leibler (KL) divergence for asymmetric comparison when the reference is ground truth, Jensen-Shannon (JS) divergence for symmetric comparison, Wasserstein distance for distributions where ordering matters, and correlation matrix drift to catch joint distribution changes.

Implementation at scale requires efficiency tricks. Computing exact histograms over billions of events is expensive, so systems use streaming sketches: t-digest for quantile-based binning (deriving 10 bins from approximate percentiles with roughly 1 kilobyte of state per feature), count-min sketch for categorical frequency estimation with sublinear memory, and reservoir sampling to maintain a representative sample of 10,000 to 100,000 examples for detailed analysis. The Airbnb pricing model monitors 80 features at 15-minute granularity across 500,000 pricing requests per hour, computing PSI from t-digest sketches with 3 milliseconds of per-feature overhead. When PSI exceeds 0.15 on any feature, the system triggers detailed analysis on the sampled reservoir to identify which bins shifted and estimates model impact by re-scoring the sample with the current model.
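A minimal sketch of the PSI computation described above, using 10 quantile-based bins derived from the reference window. The function name, the epsilon floor for empty bins, and the toy mean-shift data are illustrative choices rather than a fixed standard:

```python
import numpy as np

def compute_psi(reference, comparison, n_bins=10, eps=1e-4):
    """Population Stability Index between a reference and a comparison sample.

    Bin edges come from reference quantiles so each bin holds roughly an
    equal share of the reference data; `eps` floors empty bins so the log
    ratio stays defined.
    """
    # Inner edges at the 10%, 20%, ..., 90% reference quantiles -> 10 bins.
    inner_edges = np.percentile(reference, np.linspace(0, 100, n_bins + 1)[1:-1])

    # searchsorted maps each value to a bin index 0..n_bins-1, so production
    # values outside the training range land in the first or last bin.
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference), minlength=n_bins)
    cmp_counts = np.bincount(np.searchsorted(inner_edges, comparison), minlength=n_bins)

    ref_pct = np.maximum(ref_counts / ref_counts.sum(), eps)
    cmp_pct = np.maximum(cmp_counts / cmp_counts.sum(), eps)
    return float(np.sum((cmp_pct - ref_pct) * np.log(cmp_pct / ref_pct)))

# Toy check: a mean shift from 0.3 to 0.5, mapped to the alerting bands above.
rng = np.random.default_rng(0)
reference = rng.normal(0.3, 0.1, 100_000)   # stand-in for the training window
comparison = rng.normal(0.5, 0.1, 50_000)   # stand-in for current production traffic
psi = compute_psi(reference, comparison)
level = "stable" if psi < 0.1 else "moderate drift" if psi < 0.2 else "severe drift"
print(f"PSI={psi:.3f} ({level})")
```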
💡 Key Takeaways
Population Stability Index (PSI) quantifies drift via binned distribution comparison: PSI less than 0.1 indicates stability, 0.1 to 0.2 signals moderate drift, greater than 0.2 requires action like retraining
Netflix monitors PSI on 30 to 50 features hourly comparing latest 24 hour window to 7 day training baseline with alert thresholds at 0.1 warn and 0.2 critical for operational response
PSI limitations include equal weighting of all bins regardless of model sensitivity, sensitivity to binning choices (too few miss shifts, too many add noise), and blindness to multivariate joint distribution changes
Supplement PSI with Kullback-Leibler divergence for asymmetric ground-truth comparison, Jensen-Shannon for symmetric cases, Wasserstein for ordered distributions, and correlation matrices for joint drift (see the divergence sketch after this list)
Streaming implementation uses t-digest for approximate quantile binning at roughly 1 kilobyte per feature, count-min sketch for categorical frequencies, and reservoir sampling of 10,000 to 100,000 examples for deep-dive analysis (see the reservoir-sampling sketch after this list)
Airbnb monitors 80 features at 15-minute granularity on 500,000 requests per hour using t-digest with 3 milliseconds of per-feature overhead, triggering sampled re-scoring when PSI exceeds 0.15 to estimate model impact
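A rough sketch of how the supplementary metrics might sit alongside PSI, assuming scipy is available. The binned_pcts helper mirrors the quantile-binning convention from the PSI sketch above, and the correlation-drift summary (max absolute change in pairwise correlation) is one illustrative convention, not part of any named system:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy, wasserstein_distance

def binned_pcts(reference, comparison, n_bins=10, eps=1e-4):
    """Shared reference-quantile bins, returned as proportion vectors."""
    edges = np.percentile(reference, np.linspace(0, 100, n_bins + 1)[1:-1])
    ref_counts = np.bincount(np.searchsorted(edges, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(edges, comparison), minlength=n_bins)
    return (np.maximum(ref_counts / ref_counts.sum(), eps),
            np.maximum(cur_counts / cur_counts.sum(), eps))

rng = np.random.default_rng(1)
ref = rng.normal(0.3, 0.10, 50_000)   # reference window
cur = rng.normal(0.4, 0.15, 50_000)   # current production window

ref_pct, cur_pct = binned_pcts(ref, cur)
kl = entropy(cur_pct, ref_pct)             # KL(current || reference): asymmetric, reference as ground truth
js = jensenshannon(ref_pct, cur_pct) ** 2  # symmetric; scipy returns the distance (sqrt), so square it
wass = wasserstein_distance(ref, cur)      # respects ordering and magnitude of the shift

# Joint drift: compare pairwise feature correlations between the two windows.
ref_feats = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], 20_000)
cur_feats = rng.multivariate_normal([0, 0], [[1.0, 0.2], [0.2, 1.0]], 20_000)
corr_drift = np.max(np.abs(np.corrcoef(ref_feats.T) - np.corrcoef(cur_feats.T)))

print(f"KL={kl:.3f}  JS={js:.3f}  Wasserstein={wass:.3f}  max corr shift={corr_drift:.2f}")
```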
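And a minimal sketch of the reservoir sampling mentioned above (Vitter's Algorithm R), which keeps a fixed-size, uniformly representative sample of the stream for later deep-dive analysis. The class name, capacity, and placeholder stream are illustrative:

```python
import random

class Reservoir:
    """Uniform fixed-size sample of an unbounded stream (Algorithm R)."""

    def __init__(self, capacity=10_000, seed=0):
        self.capacity = capacity
        self.sample = []
        self.seen = 0
        self._rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(item)
        else:
            # Replace an existing element with probability capacity / seen,
            # which keeps every stream element equally likely to be retained.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = item

# Feed feature values as they arrive; when PSI crosses the alert threshold,
# re-score `reservoir.sample` with the current model to estimate impact.
stream = random.Random(1)
reservoir = Reservoir(capacity=10_000)
for _ in range(1_000_000):
    reservoir.add(stream.gauss(0.3, 0.1))  # stand-in for a streamed feature value
print(f"kept {len(reservoir.sample)} of {reservoir.seen} events")
```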
📌 Examples
Meta News Feed ranking: monitors PSI on engagement_score, time_since_post, author_follower_count every 15 minutes; when PSI on engagement_score exceeded 0.22 during a viral event, reservoir-sample analysis showed a shift in the feature's mean from 0.3 to 0.5, impacting 12 percent of predictions by more than 20 percent
Uber ETA prediction: tracks PSI on traffic_density, historical_speed, time_of_day using 10 bins per feature; a PSI of 0.18 on traffic_density during a holiday weekend triggered an investigation revealing that the model, trained pre-pandemic, underestimated new traffic patterns
Airbnb pricing: computes PSI on occupancy_rate, neighborhood_demand, seasonality_index every hour; supplements with Wasserstein distance on price_per_night distribution (ordering matters) and correlation matrix on (occupancy_rate, neighborhood_demand) pair to catch joint shifts