
Statistical Tests for Drift Detection

Different statistical tests suit different feature types and sensitivity requirements. For continuous features, the Kolmogorov-Smirnov (KS) test measures the maximum difference between cumulative distribution functions and is sensitive to any distributional change. The Wasserstein distance (also called Earth Mover's Distance) measures how much probability mass must be moved to transform one distribution into the other, making it interpretable in feature units. The Population Stability Index (PSI) compares binned histograms and is widely used because it is fast to compute and comes with interpretable thresholds.

For categorical features, the Chi-square test compares observed versus expected frequencies across categories but requires a large sample size in each bin. PSI also works for categoricals by treating each category as a bin. High-cardinality categoricals pose special challenges because long-tail categories violate test assumptions; practical solutions include merging rare categories into an "Other" bucket or tracking only the top N categories plus summary statistics such as HyperLogLog cardinality and entropy for the tail.

Multivariate methods detect joint distribution shifts that univariate tests miss. Maximum Mean Discrepancy (MMD) with a Radial Basis Function (RBF) kernel compares distributions in a high-dimensional feature space. Adversarial validation trains a lightweight classifier to distinguish reference data from current data; if the model achieves an AUC significantly above 0.5 (random guessing), drift is present, and an AUC above 0.7 is a practical red flag. The feature importances of this classifier directly identify which features are driving the drift.

The multiple-testing problem becomes severe at scale. With 300 features monitored across 50 segments in 2 overlapping time windows, you run 30,000 tests per evaluation cycle. Without correction, even a 5% false-positive rate would generate 1,500 false alarms. Production systems therefore apply False Discovery Rate (FDR) control using Benjamini-Hochberg correction at 5% to 10%, require effect-size thresholds alongside statistical significance, and cluster correlated features into families to reduce redundant alerts.
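To make the continuous-feature tests concrete, here is a minimal sketch using SciPy; the psi helper and the synthetic reference/current arrays are illustrative assumptions, not a production implementation.

```python
import numpy as np
from scipy import stats

def psi(reference, current, bins=10):
    """Population Stability Index over quantile bins of the reference sample (illustrative)."""
    edges = np.quantile(reference, np.linspace(0, 1, bins + 1))[1:-1]  # interior bin edges
    ref_frac = np.bincount(np.searchsorted(edges, reference), minlength=bins) / len(reference)
    cur_frac = np.bincount(np.searchsorted(edges, current), minlength=bins) / len(current)
    ref_frac = np.clip(ref_frac, 1e-6, None)  # avoid log(0) on empty bins
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)  # training-time feature values (synthetic)
current = rng.normal(0.3, 1.2, 10_000)    # serving-time values with a mean and scale shift

ks_stat, ks_p = stats.ks_2samp(reference, current)
w_dist = stats.wasserstein_distance(reference, current)  # interpretable in feature units
print(f"KS={ks_stat:.3f} (p={ks_p:.1e})  Wasserstein={w_dist:.3f}  PSI={psi(reference, current):.3f}")
```

Quantile bins taken from the reference sample keep every bin populated, which is why PSI stays stable even when the serving distribution shifts into the tails.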
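For high-cardinality categoricals, here is a sketch of the "Other"-bucket approach described above; the synthetic long-tail data and the top-20 cutoff are assumptions for illustration.

```python
from collections import Counter
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
cats = [f"cat_{i}" for i in range(500)]                # hypothetical high-cardinality feature
weights = rng.pareto(1.5, len(cats)) + 1e-9
weights /= weights.sum()                               # long-tailed category frequencies
reference_values = rng.choice(cats, 20_000, p=weights)
current_values = rng.choice(cats, 20_000, p=np.roll(weights, 5))  # perturbed distribution

# Keep the top 20 reference categories; fold everything else into "Other".
top = [c for c, _ in Counter(reference_values).most_common(20)]
keep = set(top)

def bucket_counts(values):
    c = Counter(v if v in keep else "Other" for v in values)
    return np.array([c.get(k, 0) for k in top + ["Other"]], dtype=float)

ref_counts, cur_counts = bucket_counts(reference_values), bucket_counts(current_values)
# Expected frequencies from the reference mix, scaled to the current sample size.
expected = np.clip(ref_counts, 0.5, None)
expected = expected / expected.sum() * cur_counts.sum()
chi2, pvalue = stats.chisquare(cur_counts, f_exp=expected)
print(f"chi2={chi2:.1f}  p={pvalue:.2e}")
```

Merging the tail keeps every expected count large enough for the Chi-square approximation to hold; the tail itself can still be summarized separately via cardinality and entropy.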
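A minimal MMD sketch using scikit-learn's RBF kernel, with a small permutation test for significance; the sample sizes, the default gamma of 1/n_features, and the 100-permutation budget are illustrative choices.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd2_rbf(X, Y, gamma=None):
    """Biased estimate of squared MMD between samples X and Y under an RBF kernel."""
    return (rbf_kernel(X, X, gamma=gamma).mean()
            + rbf_kernel(Y, Y, gamma=gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma=gamma).mean())

rng = np.random.default_rng(2)
X = rng.normal(0.00, 1.0, (500, 32))  # e.g., PCA-projected reference embeddings (synthetic)
Y = rng.normal(0.05, 1.1, (500, 32))  # current embeddings with a joint shift
observed = mmd2_rbf(X, Y)

# Permutation test: shuffle the pooled rows to estimate the no-drift null distribution.
pooled = np.vstack([X, Y])
null = []
for _ in range(100):
    idx = rng.permutation(len(pooled))
    null.append(mmd2_rbf(pooled[idx[:len(X)]], pooled[idx[len(X):]]))
p_value = (1 + sum(n >= observed for n in null)) / (1 + len(null))  # add-one avoids p = 0
print(f"MMD^2={observed:.5f}  permutation p={p_value:.3f}")
```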
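Adversarial validation is equally short to sketch with scikit-learn; the gradient-boosted classifier, the 5-fold scheme, and the synthetic single-feature shift are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def adversarial_validation(reference, current, feature_names):
    """Train a classifier to separate reference rows from current rows."""
    X = np.vstack([reference, current])
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(current))])
    clf = GradientBoostingClassifier(n_estimators=100, max_depth=3)
    # Out-of-fold probabilities so the AUC is not inflated by memorization.
    scores = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    auc = roc_auc_score(y, scores)
    clf.fit(X, y)  # refit on all rows to read feature importances
    drivers = sorted(zip(feature_names, clf.feature_importances_), key=lambda kv: -kv[1])
    return auc, drivers

rng = np.random.default_rng(3)
names = [f"f{i}" for i in range(10)]
reference = rng.normal(0, 1, (5_000, 10))
current = rng.normal(0, 1, (5_000, 10))
current[:, 2] += 0.5  # drift a single feature so the classifier has a real signal
auc, drivers = adversarial_validation(reference, current, names)
print(f"AUC={auc:.3f}  top drivers={drivers[:3]}")
```

An AUC near 0.5 means the two samples are indistinguishable; past roughly 0.7, the top importances point straight at the drifting features (f2 in this synthetic example).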
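Finally, a from-scratch Benjamini-Hochberg sketch on simulated p-values at the 30,000-test scale described above (statsmodels' multipletests with method="fdr_bh" is an off-the-shelf equivalent); the null/shifted mixture is made up for illustration.

```python
import numpy as np

def benjamini_hochberg(pvalues, alpha=0.05):
    """Boolean mask of rejected tests under Benjamini-Hochberg FDR control at level alpha."""
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    # Largest k with p_(k) <= (k / m) * alpha; reject the k smallest p-values.
    below = p[order] <= (np.arange(1, m + 1) / m) * alpha
    k = int(np.nonzero(below)[0].max()) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

# 30,000 simulated tests: mostly true nulls (uniform p-values) plus 300 genuine shifts.
rng = np.random.default_rng(4)
pvals = np.concatenate([rng.uniform(size=29_700), rng.beta(1, 500, size=300)])
flagged = benjamini_hochberg(pvals, alpha=0.05)
print(f"uncorrected p<0.05: {(pvals < 0.05).sum()}  flagged after FDR control: {flagged.sum()}")
```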
💡 Key Takeaways
The Kolmogorov-Smirnov test is sensitive at large sample sizes, the Population Stability Index provides interpretable thresholds (0.1 minor, 0.25 significant), and the Wasserstein distance is meaningful in feature units
High-cardinality categoricals require special handling: merge rare categories into an "Other" bucket, track the top 10-20 categories explicitly, and use HyperLogLog for cardinality and entropy for the tail distribution
Adversarial validation trains a classifier to distinguish reference from current data; AUC > 0.7 indicates meaningful drift, and feature importances identify root causes
With 300 features × 50 segments × 2 windows = 30,000 tests, False Discovery Rate control via Benjamini-Hochberg at 5% is essential to avoid alert fatigue from false positives
Combining statistical significance (p < 0.01) with effect-size thresholds (PSI > 0.25 or Wasserstein > 0.1 × interquartile range) separates real drift from noise
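That last combined decision rule fits in a single predicate; a minimal sketch follows, where the function name is hypothetical and the default thresholds mirror the takeaway above.

```python
def is_actionable_drift(p_value, psi_value, wasserstein, ref_iqr,
                        p_cut=0.01, psi_cut=0.25, w_frac=0.1):
    """Alert only when a shift is both statistically significant and practically large."""
    significant = p_value < p_cut
    large_effect = psi_value > psi_cut or wasserstein > w_frac * ref_iqr
    return significant and large_effect

# At large sample sizes almost everything is "significant"; the effect-size clause
# is what keeps tiny, harmless shifts from paging anyone.
```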
📌 Examples
Meta feed ranking monitors embedding drift using Maximum Mean Discrepancy on 32-dimensional Principal Component Analysis (PCA) projections of 768-dimensional sentence embeddings, avoiding expensive full-dimensional comparisons
Uber fraud detection runs adversarial validation on 30 features; when AUC reaches 0.75, feature importances reveal that the transaction-amount distribution and merchant-category frequencies are the primary drift drivers
Netflix's recommendation system applies Benjamini-Hochberg correction across 600 simultaneous tests (60 features × 10 segments), maintaining a 5% False Discovery Rate while catching real drift in the viewing-time distribution