Feature Monitoring (Drift, Missing Values, Outliers)

What is Feature Monitoring and Why Track Drift, Missing Values, and Outliers?

Feature monitoring tracks the health of model inputs and outputs in production along three critical axes. Drift detection identifies when feature distributions shift away from their training baseline, potentially degrading model performance. Missing-value tracking catches upstream data-quality issues or pipeline failures that could break predictions. Outlier monitoring flags extreme values that may indicate data corruption, schema changes, or legitimate edge cases the model never saw during training.

The core pattern establishes a training-time baseline for each feature by computing summary statistics and distributions from the final training dataset. During inference, the system continuously summarizes live traffic over rolling windows (for example, 5 minutes for alerts, 1 to 24 hours for trends) and compares these live windows to the baseline using statistical tests. For numerical features, track count, missing rate, mean, variance, robust quantiles (1st, 5th, 50th, 95th, and 99th percentiles), and compact histograms. For categorical features, maintain top-K frequency tables, entropy, new-category rate, and estimated cardinality using HyperLogLog sketches.

At scale, this becomes a streaming-aggregation challenge. A recommendations model at 25,000 queries per second (QPS) with 150 features generates 1.5 million predictions and 225 million feature observations per minute. Even with 1:10 sampling, the system processes 375,000 updates per second. Using approximate algorithms such as t-digest for quantiles (under 1% error, with logarithmic-time updates) and fixed-bin histograms (roughly 100 bins at 200 bytes each), memory per feature per segment stays around 20 kilobytes. For 150 features across 10 active segments, total memory is roughly 30 megabytes per model, making the workload CPU-bound rather than memory-bound.

Production systems at Netflix use dimensional metrics with high-cardinality tags for real-time alerting via control charts, with seasonality-aware baselines reducing false positives from consumer traffic waves. Uber's Michelangelo platform includes training-serving skew checks and distribution-shift monitoring as part of model promotion, segmenting by market and city to localize drift before global rollout.
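To make the baseline step concrete, here is a minimal sketch of capturing the numerical-feature statistics described above with NumPy. The names (`compute_baseline`, `hist_counts`, `N_BINS`) are illustrative rather than from any particular monitoring library, and the fixed-bin histogram stands in for the more compact sketches a production system would use.

```python
import numpy as np

QUANTILES = [0.01, 0.05, 0.50, 0.95, 0.99]  # the 1/5/50/95/99th percentiles
N_BINS = 100  # fixed-bin histogram, roughly as described above

def compute_baseline(values: np.ndarray) -> dict:
    """Summarize one numeric feature column from the final training set."""
    missing = np.isnan(values)
    present = values[~missing]
    counts, edges = np.histogram(present, bins=N_BINS)
    return {
        "count": int(values.size),
        "missing_rate": float(missing.mean()),
        "mean": float(present.mean()),
        "variance": float(present.var()),
        "quantiles": dict(zip(QUANTILES, np.quantile(present, QUANTILES))),
        "hist_counts": counts,  # compared against live rolling windows
        "hist_edges": edges,
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.normal(100.0, 15.0, size=1_000_000)
    train[rng.random(train.size) < 0.002] = np.nan  # simulate missing values
    baseline = compute_baseline(train)
    print(baseline["missing_rate"], baseline["quantiles"][0.99])
```

A live monitoring job would run the same summarization over each rolling window of serving traffic and compare the results against this stored baseline.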
💡 Key Takeaways
Three monitoring axes: drift (distribution shifts from training), missing values (data quality issues), and outliers (extreme values indicating corruption or edge cases)
Baseline capture at training time includes count, missing rate, mean, variance, quantiles (1/5/50/95/99), histograms for numerical features; top K frequencies, entropy, cardinality for categorical features
Statistical tests for comparison: Population Stability Index (PSI, threshold 0.1 slight, 0.2 significant), Kolmogorov-Smirnov (K-S) for continuous distributions, Wasserstein for shape-sensitive shifts, and Statistical Process Control (SPC) rules such as one point beyond 3 sigma (see the PSI sketch after this list)
Scale example: 25k QPS system with 150 features processes 375k updates per second at 1:10 sampling, uses 30 MB memory per model with approximate algorithms (t-digest, HyperLogLog)
Time aware monitoring uses multiple windows: 5 minute for near real time alerts (60 to 180 second latency), 1 to 24 hour for trend analysis and seasonality detection
Segment aware slicing by country, platform, or cohort prevents Simpson's paradox; Netflix uses seasonality aware baselines to reduce false positives by over 70% versus static training baselines
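As a worked example of the tests above, the sketch below computes the Population Stability Index against the stored baseline histogram, using the standard formula PSI = Σᵢ (qᵢ − pᵢ) ln(qᵢ / pᵢ) over shared bins and the 0.1/0.2 thresholds from the takeaways. The function names `psi` and `drift_status` are hypothetical; the SciPy calls in the closing comment show where the K-S and Wasserstein comparisons would slot in.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance  # alternatives to PSI

def psi(baseline_counts, live_values, edges, eps=1e-6):
    """PSI = sum_i (q_i - p_i) * ln(q_i / p_i) over shared histogram bins."""
    live_counts, _ = np.histogram(live_values, bins=edges)
    p = baseline_counts / baseline_counts.sum()   # expected (training) share
    q = live_counts / max(live_counts.sum(), 1)   # actual (live-window) share
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)  # guard empty bins
    return float(np.sum((q - p) * np.log(q / p)))

def drift_status(score: float) -> str:
    if score < 0.1:
        return "stable"
    return "slight shift" if score < 0.2 else "significant shift"

# For continuous features with raw samples retained, the K-S and
# Wasserstein tests mentioned above can be applied directly:
#   stat, p_value = ks_2samp(training_sample, live_sample)
#   distance = wasserstein_distance(training_sample, live_sample)
```

Reusing the baseline's bin edges for the live histogram is what makes the comparison apples-to-apples; the epsilon clip keeps empty bins from producing infinite log terms.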
📌 Examples
Netflix recommendations: dimensional metrics with high cardinality tags, control chart alerting, dynamic 7 day rolling baselines by hour of day to handle 5 to 8% oscillation in acceptance rate across time zones
Uber fraud detection at 5k transactions per second: robust metrics (median, Median Absolute Deviation) for heavy-tailed features, tail exceedance rate monitoring (target P(X > Q99.9) < 0.05%), circuit breakers when card_country null rate exceeds 2% (a minimal breaker sketch follows these examples)
Airbnb Zipline: stores training time feature distributions alongside features, validates serving distributions against baselines during deployment, feature level data contracts prevent schema drift from silently propagating
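The circuit-breaker pattern in the Uber example can be sketched as a rolling null-rate check. The class below is a hypothetical illustration, not Uber's actual implementation: the 2% threshold comes from the example above, while the class name and the 50,000-observation window size are assumptions.

```python
from collections import deque

class NullRateBreaker:
    """Trips when a feature's missing rate over a rolling window exceeds
    a threshold (e.g. card_country null rate > 2%), so callers can fall
    back to a safe default instead of serving on corrupted input."""

    def __init__(self, window_size: int = 50_000, threshold: float = 0.02):
        self.window = deque(maxlen=window_size)  # 1 = missing, 0 = present
        self.threshold = threshold
        self.nulls = 0
        self.open = False

    def observe(self, value) -> bool:
        bit = 1 if value is None else 0
        if len(self.window) == self.window.maxlen:
            self.nulls -= self.window[0]  # account for the evicted oldest bit
        self.window.append(bit)
        self.nulls += bit
        self.open = (self.nulls / len(self.window)) > self.threshold
        return self.open  # True => stop trusting this feature
```

Maintaining the running null count makes each observation O(1), so a per-feature breaker like this can sit directly in the feature-fetch path even at thousands of transactions per second.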