Statistical Methods for Drift Detection and Alerting
THRESHOLD-BASED ALERTING
The simplest approach: alert when a metric crosses a fixed threshold. Accuracy < 85%, alert. P99 latency > 100ms, alert.
Advantages: Easy to understand, easy to implement, fast to evaluate.
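Disadvantages: Does not account for normal variation. A metric that routinely fluctuates between 87% and 90% should not alert every time it touches 87%: a threshold set inside the normal range fires on noise, while one set well below it misses real degradation. Requires careful threshold tuning per metric.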
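Improvement: use percentile-based thresholds. Alert when the metric falls below the 5th percentile of its historical values rather than below a fixed number. The threshold then adapts to the metric's natural variation. Note that, by construction, roughly 5% of normal observations also fall below this line, so choose the percentile and the history window with that false-alarm rate in mind.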
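A percentile-based threshold can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production implementation: the function names, the default of the 5th percentile, and the idea that history is a plain list of past metric values are all assumptions made for the example.

```python
import statistics


def percentile_threshold(history, pct=5):
    """Return the pct-th percentile of the historical metric values.

    `history` is assumed to be a list of past values of one metric,
    collected while the system was behaving normally.
    """
    # quantiles(n=100) returns 99 cut points; cut point pct-1 is the
    # pct-th percentile. "inclusive" treats history as the full population.
    cuts = statistics.quantiles(history, n=100, method="inclusive")
    return cuts[pct - 1]


def should_alert(current, history, pct=5):
    """Alert when the current value is below the historical percentile."""
    return current < percentile_threshold(history, pct)
```

With a history spread evenly between 0.87 and 0.90, a reading of 0.88 would not alert, while a reading at the very bottom of that range would.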
STATISTICAL PROCESS CONTROL
Apply statistical methods to detect when metrics deviate from expected behavior.
Control charts: Track the metric's mean and standard deviation over a baseline period. Alert when a new value falls outside mean ± 3σ. This is the Shewhart control chart, a long-established industrial quality control method.
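The mean ± 3σ rule might look like the following sketch. It assumes the baseline is a list of values recorded while the metric was known to be healthy, and it uses the sample standard deviation; the function names and the parameter k are hypothetical.

```python
import statistics


def control_limits(baseline, k=3.0):
    """Return (lower, upper) control limits: mean -/+ k standard deviations."""
    mean = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)  # sample standard deviation
    return mean - k * sigma, mean + k * sigma


def out_of_control(value, baseline, k=3.0):
    """True when value falls outside the control limits of the baseline."""
    lower, upper = control_limits(baseline, k)
    return value < lower or value > upper
```

In practice the limits would usually be computed once from a fixed baseline window and recomputed only deliberately, so that a slow drift does not quietly widen them.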
CUSUM (Cumulative Sum): Detects small sustained shifts that single-point thresholds miss. Accumulates deviations from target; alerts when cumulative sum exceeds threshold. Good for gradual degradation.
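As a rough sketch, a two-sided tabular CUSUM could be written as below. The parameter names are assumptions for illustration: target is the expected metric value, slack is an allowance subtracted from each deviation so that ordinary noise does not accumulate, and threshold is the decision limit on the cumulative sum.

```python
def cusum(values, target, slack, threshold):
    """Return the index of the first CUSUM alarm, or None if there is none.

    Keeps two sums: one accumulating upward deviations from target,
    one accumulating downward deviations. Each sum is clamped at zero.
    """
    s_hi = 0.0  # accumulated evidence of an upward shift
    s_lo = 0.0  # accumulated evidence of a downward shift
    for i, x in enumerate(values):
        s_hi = max(0.0, s_hi + (x - target - slack))
        s_lo = max(0.0, s_lo + (target - x - slack))
        if s_hi > threshold or s_lo > threshold:
            return i
    return None
```

For example, with target=0.90, slack=0.005 and threshold=0.05, an accuracy series that drops from 0.90 to a steady 0.88 raises an alarm a few observations after the drop, even though 0.88 never crosses a fixed 85% threshold.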
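Page-Hinkley test: Similar to CUSUM, but accumulates deviations from the running mean of the stream instead of from a fixed target, and alerts when the cumulative deviation moves more than a threshold away from its running minimum. The reference level therefore adapts to the data; the detection threshold itself is still a fixed parameter. Useful when no reliable target value is known in advance.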
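A minimal sketch of the Page-Hinkley test for detecting an increase (for instance in an error rate) is shown below. The names delta (tolerated magnitude of change) and lam (detection threshold) are assumptions for the example. To detect a drop in a metric such as accuracy, one option is to feed in the negated values or the corresponding error rate.

```python
def page_hinkley(values, delta=0.005, lam=0.05):
    """Return the index of the first Page-Hinkley alarm, or None.

    Tracks the running mean of the stream, accumulates each value's
    deviation from that mean (minus the tolerance delta), and alarms when
    the cumulative sum rises more than lam above its running minimum.
    """
    mean = 0.0     # running mean of all values seen so far
    cum = 0.0      # cumulative deviation from the running mean
    cum_min = 0.0  # smallest cumulative deviation seen so far
    for n, x in enumerate(values, start=1):
        mean += (x - mean) / n
        cum += x - mean - delta
        cum_min = min(cum_min, cum)
        if cum - cum_min > lam:
            return n - 1  # zero-based index of the alarming observation
    return None
```

Unlike the CUSUM sketch, no target value is passed in: the running mean serves as the reference, at the cost of the reference itself being pulled toward the new level after a shift.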
ANOMALY DETECTION FOR ALERTS
Train a model on historical metric values. Flag current values that are anomalous given history. More sophisticated than fixed thresholds.
Approaches: Isolation Forest on metric vectors, autoencoder reconstruction error, Prophet for time series with seasonality.
Trade-off: More sophisticated detection catches more issues, but its alerts are harder to interpret and debug. Start simple, and add complexity only when simple methods miss real problems.