
Continuous Monitoring for Drift, Bias, and Policy Violations

Continuous monitoring is the operational control loop that detects when models drift out of compliance or performance boundaries. Unlike quarterly batch audits, production monitoring runs in near real time (typically 5 to 15 minute windows) to catch harmful changes before they accumulate significant impact. Key metrics include data drift (input distribution shift), prediction drift (output distribution shift), performance degradation (accuracy, Area Under the Curve or AUC, precision, recall), and bias metrics across protected groups (demographic parity difference, equalized odds). For a recommendation model serving 50,000 Requests Per Second (RPS), this means computing Population Stability Index (PSI) and Kullback-Leibler (KL) divergence on sampled traffic and triggering alerts when thresholds are breached.

Data drift is measured by comparing recent feature distributions to a reference baseline, typically the training distribution. PSI quantifies this shift: less than 0.1 indicates stability, 0.1 to 0.2 warrants investigation, and greater than 0.2 signals significant drift requiring intervention. Compute PSI per feature every 5 minutes on a sliding window and alert if it exceeds 0.2 for three consecutive windows (15 minutes of sustained drift). At 50,000 RPS, sample 1 percent of traffic (500 RPS) to keep computation tractable. More sophisticated systems use KL divergence or Kolmogorov-Smirnov tests. For high dimensional embeddings, monitor the distribution of norms or principal components rather than raw dimensions.

Performance monitoring requires delayed labels: fraud labels may arrive hours later, while ad Click Through Rate (CTR) is known within seconds. Design your monitoring pipeline to join predictions with ground truth as labels arrive. Compute rolling metrics (AUC, precision at threshold, recall) over 1 hour and 24 hour windows, and alert if 1 hour AUC drops more than 5 points below the 24 hour moving average, indicating a sudden performance cliff. For systems without timely labels, use proxy metrics like prediction confidence distributions or agreement rates with a shadow challenger model. Netflix monitors user engagement (play rate, completion rate) as a proxy for recommendation quality.

Bias monitoring checks fairness across protected attributes. Compute metrics like demographic parity (difference in positive prediction rates across groups), equalized odds (differences in True Positive Rate or TPR and False Positive Rate or FPR), or calibration disparity. Meta's internal tools evaluate fairness at scale by sampling predictions, joining them with user demographics, and computing subgroup metrics every hour. Alert if the demographic parity difference exceeds 5 percent or if subgroup AUC drops more than 3 points below the overall AUC. The main challenge is denominator instability for small groups: use confidence intervals and require minimum sample sizes (for instance, 1,000 examples per group per window) before alerting.

Policy gates automate responses. Define stop-the-line thresholds: if PSI exceeds 0.3 for 15 minutes or subgroup AUC drops below 0.75, automatically divert traffic to the last known good model within 2 minutes. Maintain a rollback stack of the last three approved model versions. Trigger human-in-the-loop escalation with an incident runbook that includes diagnostic queries (which features drifted?), rollback procedures, and stakeholder notification.
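A minimal sketch of such a policy gate is shown below, assuming a per-window evaluation loop feeds it the worst feature PSI and the worst subgroup AUC. The thresholds mirror the ones above, while `divert_traffic_to` and `page_oncall` are hypothetical stand-ins for whatever serving and incident tooling is actually in place.

```python
from collections import deque

# Placeholder hooks for the serving layer and incident tooling (hypothetical).
def divert_traffic_to(model_version: str) -> None:
    print(f"ROLLBACK: routing traffic to {model_version}")

def page_oncall(reason: str, runbook: str) -> None:
    print(f"PAGE: {reason} (runbook: {runbook})")

class PolicyGate:
    """Stop-the-line gate evaluated once per 5-minute monitoring window."""

    def __init__(self, psi_threshold: float = 0.3, sustained_windows: int = 3,
                 min_subgroup_auc: float = 0.75):
        self.psi_threshold = psi_threshold
        self.sustained_windows = sustained_windows   # 3 x 5 min = 15 min sustained
        self.min_subgroup_auc = min_subgroup_auc
        self.rollback_stack = deque(maxlen=3)        # last three approved versions
        self._psi_breach_streak = 0

    def register_approved_model(self, version: str) -> None:
        self.rollback_stack.append(version)

    def evaluate(self, max_feature_psi: float, worst_subgroup_auc: float) -> None:
        # Require a sustained PSI breach so transient spikes don't trigger rollback.
        self._psi_breach_streak = (self._psi_breach_streak + 1
                                   if max_feature_psi > self.psi_threshold else 0)
        sustained_drift = self._psi_breach_streak >= self.sustained_windows
        fairness_breach = worst_subgroup_auc < self.min_subgroup_auc

        if (sustained_drift or fairness_breach) and self.rollback_stack:
            last_known_good = self.rollback_stack[-1]  # most recent entry in the approved-version stack
            divert_traffic_to(last_known_good)         # 2-minute diversion target in practice
            page_oncall(reason="sustained_drift" if sustained_drift else "subgroup_auc_breach",
                        runbook="diagnose drifted features, confirm rollback, notify stakeholders")
```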
Microsoft's Responsible AI workflows integrate monitoring alerts with governance dashboards that surface issues to model owners and compliance officers in real time.
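The drift signal that feeds those gates and dashboards reduces to a short calculation. Below is a minimal PSI sketch for a single numeric feature, assuming quantile bins derived from the training sample; the function name, bin count, and epsilon are illustrative choices, not a prescribed implementation.

```python
import numpy as np

def compute_psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference (training) sample and a
    current (production) sample of one numeric feature. Assumes a reasonably
    continuous feature; categorical features would compare category proportions."""
    # Bin edges come from the reference distribution; quantiles keep bins populated.
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    # Clip production values into the reference range so nothing falls outside the bins.
    clipped = np.clip(current, edges[0], edges[-1])

    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_pct = np.histogram(clipped, bins=edges)[0] / len(current)

    eps = 1e-6  # avoid log(0) when a bin is empty
    return float(np.sum((cur_pct - ref_pct) * np.log((cur_pct + eps) / (ref_pct + eps))))

# Thresholds from the text: <0.1 stable, 0.1-0.2 investigate, >0.2 significant drift.
# In the monitoring loop, compute_psi runs per feature every 5 minutes on the
# roughly 1 percent traffic sample and alerts after three consecutive windows above 0.2.
```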
💡 Key Takeaways
Population Stability Index (PSI) quantifies data drift with thresholds: less than 0.1 is stable, 0.1 to 0.2 warrants investigation, greater than 0.2 requires intervention such as retraining or traffic diversion to a previous model version
At 50,000 Requests Per Second (RPS), sample 1 percent of traffic (500 RPS) for drift computation every 5 minutes to keep costs tractable; alert only if PSI exceeds 0.2 for three consecutive windows to avoid false positives from transient spikes
Performance monitoring joins predictions with delayed ground truth labels (fraud confirmed hours later, Click Through Rate or CTR known in seconds); compute rolling Area Under the Curve (AUC) over 1 hour and 24 hour windows and alert if the 1 hour value drops more than 5 points below the 24 hour baseline
Bias metrics like demographic parity difference (positive rate gap across groups) and equalized odds (True Positive Rate or TPR and False Positive Rate or FPR gaps) are computed hourly; alert if the parity gap exceeds 5 percent or subgroup AUC drops more than 3 points, and require a minimum of 1,000 samples per group to avoid denominator instability (see the sketch after this list)
Automated policy gates divert traffic to the last known good model within 2 minutes when thresholds are breached (PSI greater than 0.3 for 15 minutes, subgroup AUC below 0.75); maintain a rollback stack of the last three approved versions for fast recovery
For systems without timely labels (long term outcomes), use proxy metrics like prediction confidence distributions, shadow model agreement rates, or user engagement signals (Netflix monitors play rate and completion rate as proxies for recommendation quality)
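As referenced in the bias takeaway above, here is a minimal sketch of the hourly subgroup check, assuming a pandas frame of sampled predictions already joined with demographics. The column names (group, y_true, y_pred) and the frame-based layout are assumptions; the 5 percent parity gap and 1,000-sample minimum come from the text.

```python
import pandas as pd

MIN_GROUP_SAMPLES = 1000        # skip groups too small for stable denominators
PARITY_GAP_THRESHOLD = 0.05     # 5 percentage-point gap in positive prediction rate

def hourly_bias_report(df: pd.DataFrame) -> dict:
    """df: one row per sampled prediction with columns
    'group' (protected attribute), 'y_true' (label), 'y_pred' (binary prediction)."""
    per_group = {}
    for group, g in df.groupby("group"):
        if len(g) < MIN_GROUP_SAMPLES:
            continue                       # wait for more data before judging this group
        per_group[group] = {
            "positive_rate": g["y_pred"].mean(),
            # Equalized-odds components: TPR and FPR per group.
            "tpr": g.loc[g["y_true"] == 1, "y_pred"].mean(),
            "fpr": g.loc[g["y_true"] == 0, "y_pred"].mean(),
        }

    rates = [m["positive_rate"] for m in per_group.values()]
    parity_gap = (max(rates) - min(rates)) if len(rates) >= 2 else 0.0
    return {
        "per_group": per_group,
        "demographic_parity_gap": parity_gap,
        "alert": parity_gap > PARITY_GAP_THRESHOLD,
    }
```

The subgroup AUC check follows the same grouping pattern, scoring each group on the raw model scores and comparing it against the overall AUC before alerting on a 3-point gap.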
📌 Examples
A fraud detection system computes PSI on the transaction_amount and merchant_category features every 5 minutes, sampling 500 RPS out of 50,000 total; an alert fires when PSI = 0.25 is sustained for 3 windows (15 minutes), and the incident runbook triggers rollback to model v3.1 within 2 minutes
Meta fairness monitoring samples 10,000 predictions per hour, joins them with user demographics (age, gender, region), and computes demographic parity as positive_rate_groupA minus positive_rate_groupB; an alert fires if the difference exceeds 5 percent, escalating to a Responsible AI review board for investigation and potential model retraining
A Netflix recommendation model without ground truth labels monitors prediction confidence (entropy of the top 10 scores) and user engagement (play rate within 24 hours); a sudden drop in play rate from 65 percent to 55 percent triggers an alert, and investigation finds that an upstream data pipeline dropped a key feature, causing drift (see the sketch below)
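A minimal sketch of the label-free proxy checks in that last example: per-request entropy over the top 10 recommendation scores plus a simple play-rate drop rule. The function names and the 5-point drop threshold are illustrative, chosen so the 65 to 55 percent drop described above would fire.

```python
import numpy as np

def topk_score_entropy(scores: np.ndarray, k: int = 10) -> float:
    """Entropy of the normalized top-k recommendation scores for one request
    (assumes non-negative scores). A sudden shift in this distribution can flag
    upstream feature loss long before engagement labels arrive."""
    top = np.sort(scores)[-k:]
    p = top / (top.sum() + 1e-12)
    return float(-np.sum(p * np.log(p + 1e-12)))

def play_rate_alert(recent_play_rate: float, baseline_play_rate: float,
                    max_point_drop: float = 0.05) -> bool:
    """Alert when the rolling play rate falls more than 5 percentage points
    below its baseline (a 65 percent -> 55 percent drop would trigger this)."""
    return recent_play_rate < baseline_play_rate - max_point_drop

# Example: compare a 24-hour play rate against its trailing baseline.
print(play_rate_alert(recent_play_rate=0.55, baseline_play_rate=0.65))  # True
```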