Definition
Feature monitoring tracks the health of model inputs and outputs in production along three critical axes: drift detection (distribution shifts from training baseline), missing value tracking (upstream data quality issues), and outlier monitoring (extreme values indicating data corruption or attacks).
Why These Three Axes Matter
Drift detection identifies when feature distributions shift from their training baseline, potentially degrading model performance. Missing value tracking catches upstream data quality issues or pipeline failures that could break predictions. Outlier monitoring flags extreme values that may indicate data corruption, edge case inputs, or adversarial attacks requiring special handling.
The Silent Degradation Problem
Without feature monitoring, model degradation goes unnoticed until business metrics decline. A gradual drift in user behavior can erode recommendation quality by 3 to 5 percent over weeks; by the time product teams notice lower engagement, months of revenue have been lost. Proactive monitoring catches drift early, enabling retraining before users are affected.
Production Implementation
Production feature monitoring computes statistical summaries (mean, variance, quantiles, null rate) over sliding time windows and compares them against baseline distributions. Population Stability Index (PSI) and Kolmogorov-Smirnov (K-S) tests quantify drift magnitude. Alerting triggers when metrics exceed thresholds for sustained, configurable durations, balancing sensitivity against alert fatigue.
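The PSI comparison described above can be sketched in a few lines. This is an illustrative implementation, not a specific library's API: the function name, the fixed-width binning derived from the baseline sample, and the smoothing constant are all assumptions.

```python
import math
import random

def psi(baseline, current, bins=10):
    """Population Stability Index between two samples of a numeric feature.

    Bins are derived from the baseline sample's range; counts are smoothed
    so empty bins do not produce infinities. (Sketch; fixed-width binning
    and the smoothing constant are assumptions.)
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp out-of-range values into the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        total = len(sample)
        # Small smoothing term avoids log(0) and division by zero.
        return [(c + 1e-4) / (total + 1e-4 * bins) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

# Identical distributions score near 0; a mean shift pushes PSI past 0.2.
random.seed(0)
base = [random.gauss(0, 1) for _ in range(5000)]
same = [random.gauss(0, 1) for _ in range(5000)]
shifted = [random.gauss(0.8, 1) for _ in range(5000)]
print(round(psi(base, same), 3))     # small, well under the 0.1 threshold
print(round(psi(base, shifted), 3))  # large, above the 0.2 threshold
```

In production the `current` sample would come from a sliding window rather than a list in memory, but the comparison against the training baseline is the same.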
Coverage Strategy
Monitor all features feeding high-value models. For large feature spaces (hundreds of features), prioritize based on feature-importance scores from model interpretation. Critical features with high SHAP values warrant tighter thresholds and faster detection windows than low-importance features.
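One way to operationalize this tiering is a small policy function mapping an importance score (e.g. mean |SHAP|) to monitoring parameters. The function name, the importance cutoffs, and the threshold/window values below are hypothetical defaults, not prescribed values.

```python
def monitoring_tier(importance, high=0.05, medium=0.01):
    """Map a feature-importance score to a (PSI threshold, window) pair.

    Hypothetical sketch: cutoffs and tier settings are assumptions to be
    tuned per model, not recommended constants.
    """
    if importance >= high:
        # Critical features: tight threshold, fast detection window.
        return {"psi_threshold": 0.1, "window_minutes": 5}
    if importance >= medium:
        return {"psi_threshold": 0.2, "window_minutes": 60}
    # Low-importance features: loose threshold, daily check.
    return {"psi_threshold": 0.25, "window_minutes": 1440}

print(monitoring_tier(0.12))   # critical tier: 5-minute window
print(monitoring_tier(0.002))  # low-importance tier: daily window
```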
✓ Three monitoring axes: drift (distribution shifts from training), missing values (data quality issues), and outliers (extreme values indicating corruption or edge cases)
✓ Baseline capture at training time includes count, missing rate, mean, variance, quantiles (1/5/50/95/99), and histograms for numerical features; top-K frequencies, entropy, and cardinality for categorical features
✓ Statistical tests for comparison: Population Stability Index (PSI; threshold 0.1 slight, 0.2 significant), Kolmogorov-Smirnov (K-S) for continuous distributions, Wasserstein distance for shape-sensitive shifts, Statistical Process Control (SPC) rules such as 1 point beyond 3 sigma
✓ Scale example: a 25k QPS system with 150 features processes 375k updates per second at 1:10 sampling and uses 30 MB memory per model with approximate algorithms (t-digest, HyperLogLog)
✓ Time-aware monitoring uses multiple windows: 5 minutes for near-real-time alerts (60 to 180 second latency), 1 to 24 hours for trend analysis and seasonality detection
✓ Segment-aware slicing by country, platform, or cohort prevents Simpson's paradox; Netflix uses seasonality-aware baselines to reduce false positives by over 70% versus static training baselines
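The training-time baseline capture in the checklist can be sketched as two small functions, one per feature type. This is a minimal illustration using only the standard library; the function names and the returned dictionary layout are assumptions (production systems would also persist histograms and use streaming sketches rather than in-memory lists).

```python
import math
from collections import Counter
from statistics import mean, pvariance, quantiles

def numeric_baseline(values):
    """Baseline stats for a numeric feature; None marks a missing value."""
    present = [v for v in values if v is not None]
    # quantiles(n=100) returns 99 cut points; q[p-1] is the p-th percentile.
    q = quantiles(present, n=100)
    return {
        "count": len(values),
        "missing_rate": 1 - len(present) / len(values),
        "mean": mean(present),
        "variance": pvariance(present),
        "quantiles": {p: q[p - 1] for p in (1, 5, 50, 95, 99)},
    }

def categorical_baseline(values, top_k=10):
    """Baseline stats for a categorical feature: top-K, entropy, cardinality."""
    present = [v for v in values if v is not None]
    freq = Counter(present)
    total = len(present)
    return {
        "count": len(values),
        "missing_rate": 1 - len(present) / len(values),
        "top_k": freq.most_common(top_k),
        "entropy": -sum(c / total * math.log2(c / total) for c in freq.values()),
        "cardinality": len(freq),
    }

nb = numeric_baseline([1.0, 2.0, None, 4.0] * 250)
cb = categorical_baseline(["a", "a", "b", None])
print(nb["missing_rate"], cb["top_k"], cb["cardinality"])
```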
1. Netflix recommendations: dimensional metrics with high-cardinality tags, control-chart alerting, dynamic 7-day rolling baselines by hour of day to handle 5 to 8% oscillation in acceptance rate across time zones
2. Uber fraud detection at 5k transactions per second: robust metrics (median, Median Absolute Deviation) for heavy-tailed features, tail exceedance-rate monitoring (target P(X > Q99.9) < 0.05%), circuit breakers when card_country null rate exceeds 2%
3. Airbnb Zipline: stores training-time feature distributions alongside features, validates serving distributions against baselines during deployment, feature-level data contracts prevent schema drift from silently propagating
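The robust-metrics idea in the Uber example (median and Median Absolute Deviation instead of mean and standard deviation) can be shown in a few lines. This is a generic sketch of the technique, not Uber's implementation; the 1.4826 constant rescales MAD to a standard deviation under a normality assumption.

```python
from statistics import median

def robust_zscore(x, sample):
    """Outlier score from median and MAD, stable on heavy-tailed features.

    Unlike a mean/stddev z-score, a single extreme value in `sample`
    barely moves the median or the MAD, so the score stays meaningful.
    """
    m = median(sample)
    mad = median(abs(v - m) for v in sample)
    return (x - m) / (1.4826 * mad)

# The 500 drags a mean/stddev z-score down; median/MAD stay anchored
# to the bulk of the data, so the outlier still scores as extreme.
sample = [10, 11, 9, 10, 12, 11, 10, 9, 500]
print(round(robust_zscore(500, sample), 1))  # 330.5 (clear outlier)
print(round(robust_zscore(11, sample), 1))   # 0.7 (in-distribution)
```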