ML Infrastructure & MLOps › Model Governance (Compliance, Auditability) · Hard · ⏱️ ~3 min

Continuous Monitoring for Drift, Bias, and Policy Violations

Definition
Continuous monitoring is the operational control loop that detects drift, bias, and policy violations in near real time, catching harmful changes before their impact accumulates.

DATA DRIFT MEASUREMENT

Compare live feature distributions to the training baseline. PSI thresholds: <0.1 stable, 0.1-0.2 investigate, >0.2 significant drift. Compute per feature every 5 minutes on a sliding window. At 50K RPS, sample 1% of traffic to keep computation tractable. Use KL divergence or Kolmogorov-Smirnov (KS) tests when more sensitivity is needed.
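As a minimal sketch, PSI can be computed against quantile bins derived from the training baseline (the function name, bin count, and epsilon guard are illustrative choices, not from the source):

```python
import numpy as np

def psi(baseline, current, bins=10):
    # Bin edges come from baseline quantiles, so expected mass is ~uniform
    # and no bin is empty by construction on the baseline side.
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))[1:-1]
    expected = np.bincount(np.searchsorted(edges, baseline), minlength=bins) / len(baseline)
    actual = np.bincount(np.searchsorted(edges, current), minlength=bins) / len(current)
    eps = 1e-6  # guard against log(0) when a live-traffic bin is empty
    expected = np.clip(expected, eps, None)
    actual = np.clip(actual, eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))
```

Run per feature on each 5-minute sample of the 1% traffic slice and compare the result against the 0.1/0.2 thresholds above.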

PERFORMANCE MONITORING

Join predictions with ground truth as labels arrive (fraud labels: hours, CTR: seconds). Compute rolling AUC, precision, recall over 1h and 24h windows. Alert if 1h AUC drops >5 points below 24h average—indicates sudden performance cliff.
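A sketch of that alert condition, using a simple rank-based AUC so it is self-contained (function names and the `(labels, scores)` window shape are hypothetical; the 5-point gap comes from the text):

```python
def auc(labels, scores):
    # Rank-based AUC: probability a positive outranks a negative, ties = 0.5
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_cliff_alert(window_1h, window_24h, gap=0.05):
    """Each window is a (labels, scores) pair of label-joined predictions.
    Fire when the 1h AUC drops more than `gap` below the 24h baseline,
    i.e. a sudden performance cliff rather than slow decay."""
    return auc(*window_24h) - auc(*window_1h) > gap
```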

💡 Insight: Monitor high-dimensional embeddings by tracking distribution of norms or principal components, not raw dimensions—computationally tractable at scale.
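One way to apply that insight, sketched below: summarize each batch of embeddings by a few scalar norm statistics and feed those into the same drift checks as any other feature (the function name and chosen percentile are illustrative):

```python
import numpy as np

def embedding_norm_stats(embs):
    """Track the distribution of L2 norms instead of per-dimension
    histograms; a handful of scalars scales to any embedding width."""
    norms = np.linalg.norm(embs, axis=1)
    return {"mean": float(norms.mean()),
            "p95": float(np.percentile(norms, 95))}
```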

BIAS MONITORING

Check fairness across protected attributes: demographic parity (prediction rate difference), equalized odds (TPR/FPR difference), calibration disparity. Compute subgroup metrics hourly by joining predictions with demographics.
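The demographic parity check can be sketched as the maximum gap in positive-prediction rate across groups (the function name is hypothetical; the `min_samples` guard reflects the minimum-sample-size caveat discussed in the takeaways below):

```python
def demographic_parity_gap(preds, groups, min_samples=1):
    """Max difference in positive-prediction rate across groups.
    In production, set min_samples (e.g. 1000) to skip groups too
    small for a stable rate estimate."""
    counts = {}
    for p, g in zip(preds, groups):
        n, pos = counts.get(g, (0, 0))
        counts[g] = (n + 1, pos + p)
    rates = [pos / n for n, pos in counts.values() if n >= min_samples]
    return max(rates) - min(rates)
```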

AUTOMATED POLICY GATES

Define thresholds: if PSI >0.3 for 15 minutes or subgroup AUC <0.75, auto-rollback within 2 minutes. Maintain a rollback stack of the last 3 approved versions. Trigger on-call escalation with a diagnostic runbook.
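A minimal gate sketch, assuming 5-minute evaluation windows (so 15 minutes = 3 consecutive breaches) and treating the top of the version stack as the currently serving model; the class and method names are illustrative:

```python
from collections import deque

class RollbackGate:
    """Auto-rollback when PSI or subgroup-AUC thresholds breach.
    Thresholds mirror the text: PSI > 0.3 sustained for 3 windows,
    or subgroup AUC < 0.75 (immediate)."""
    def __init__(self, psi_limit=0.3, auc_floor=0.75, windows=3):
        self.psi_limit, self.auc_floor, self.windows = psi_limit, auc_floor, windows
        self.breaches = 0
        self.stack = deque(maxlen=3)  # last 3 approved versions; top = serving

    def approve(self, version):
        self.stack.append(version)

    def check(self, psi_value, subgroup_auc):
        # Returns the version rolled back to, or None if no action taken.
        if subgroup_auc < self.auc_floor:
            return self._rollback()
        self.breaches = self.breaches + 1 if psi_value > self.psi_limit else 0
        if self.breaches >= self.windows:
            return self._rollback()
        return None

    def _rollback(self):
        self.breaches = 0
        if len(self.stack) >= 2:
            self.stack.pop()        # drop the breaching version
        return self.stack[-1] if self.stack else None
```

In a real system, `check` would run inside the monitoring job and its return value would drive the traffic-routing change and the escalation page.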

⚠️ Trade-off: Sensitive thresholds catch issues faster but cause false alarms. Tune based on historical incident data and business tolerance for false positives.
💡 Key Takeaways
- Population Stability Index (PSI) quantifies data drift: below 0.1 is stable, 0.1 to 0.2 warrants investigation, and above 0.2 requires intervention such as retraining or diverting traffic to a previous model version.
- At 50,000 requests per second (RPS), sample 1 percent of traffic (500 RPS) for drift computation every 5 minutes to keep costs tractable; alert only if PSI exceeds 0.2 for three consecutive windows, to avoid false positives from transient spikes.
- Performance monitoring joins predictions with delayed ground-truth labels (fraud confirmed hours later; click-through rate (CTR) known in seconds), computes rolling Area Under the Curve (AUC) over 1-hour and 24-hour windows, and alerts if the 1-hour AUC drops more than 5 points below the 24-hour baseline.
- Bias metrics such as demographic parity difference (positive-rate gap across groups) and equalized odds (True Positive Rate (TPR) and False Positive Rate (FPR) gaps) are computed hourly; alert if the parity gap exceeds 5 percent or subgroup AUC drops more than 3 points, and require at least 1,000 samples per group to avoid denominator instability.
- Automated policy gates divert traffic to the last known-good model within 2 minutes when thresholds breach (PSI above 0.3 for 15 minutes, subgroup AUC below 0.75); maintain a rollback stack of the last three approved versions for fast recovery.
- For systems without timely labels (long-term outcomes), use proxy metrics such as prediction-confidence distributions, shadow-model agreement rates, or user-engagement signals (Netflix monitors play rate and completion rate as proxies for recommendation quality).
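The prediction-confidence proxy can be sketched as the Shannon entropy of the normalized top-k scores (the top-10 window mirrors the recommendation example below; the function name is illustrative):

```python
import math

def topk_entropy(scores, k=10):
    """Entropy of the normalized top-k scores. A collapsing distribution
    (one item dominating) is a label-free drift signal when ground truth
    arrives too late to monitor directly."""
    top = sorted(scores, reverse=True)[:k]
    total = sum(top)
    probs = [s / total for s in top]
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A uniform top-k yields the maximum entropy log(k); a sudden sustained drop toward 0 suggests the score distribution has collapsed and is worth investigating.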
📌 Interview Tips
1. A fraud detection system computes PSI on the transaction_amount and merchant_category features every 5 minutes, sampling 500 RPS from 50K total; an alert fires when PSI = 0.25 is sustained for 3 windows (15 min), and the incident runbook triggers rollback to model v3.1 within 2 minutes.
2. Meta's fairness monitoring samples 10,000 predictions per hour, joins them with user demographics (age, gender, region), and computes demographic parity as positive_rate_groupA minus positive_rate_groupB; it alerts if the difference exceeds 5% and escalates to a Responsible AI review board for investigation and potential model retraining.
3. A Netflix recommendation model without ground-truth labels monitors prediction confidence (entropy of the top-10 scores) and user engagement (play rate within 24 hours); a sudden drop in play rate from 65% to 55% triggers an alert, and investigation finds an upstream data pipeline dropped a key feature, causing drift.