
Two-Tier Monitoring: Service Health vs. Model Quality

Production ML systems require two separate monitoring planes that operate on different timescales and serve different purposes. Service health protects the end-user experience through infrastructure metrics; model quality guards statistical performance and business outcomes. Conflating the two produces either slow incident response or excessive false alarms.

Service-health Service Level Objectives (SLOs) track latency, throughput, error rates, and resource utilization. These must alert fast, within seconds to minutes, because they directly impact user experience. Google Ads auction scoring must complete within 100 milliseconds end to end, with individual model inference budgets of 10 to 20 milliseconds at p99; a latency spike to 150 milliseconds breaks the auction deadline and loses revenue immediately. Facebook content moderation models have hard error-rate budgets where failure rates above 0.1% for 5 consecutive minutes trigger automatic traffic shifts to backup models.

Model-quality SLOs track prediction accuracy, calibration, fairness, and business Key Performance Indicators (KPIs). These alert more slowly, respecting label delay and statistical significance. Uber ETA predictions receive ground-truth arrival times within minutes to hours, so hourly windows work; ad conversion models with 7-to-28-day attribution windows need patient monitoring that accumulates events before alerting. Firing alerts on noisy short windows creates fatigue, while waiting too long allows degradation to compound.

The solution is layered detection: immediate proxies plus delayed confirmation. At Netflix scale, serving thousands of Queries Per Second (QPS) per cluster, predictions are logged asynchronously to avoid adding latency. Monitoring jobs run every 5 minutes on recent windows, checking score distributions and click-through proxies, while definitive outcome metrics like viewing hours and retention run daily on backfilled data once labels stabilize. This two-speed system catches service regressions in under 10 minutes, while model-quality alerts arrive within 24 hours with statistical confidence.
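A minimal sketch of this two-speed pattern in Python. The latency budget, the drift threshold, the use of a Population Stability Index (PSI) as the proxy, and all function names are illustrative assumptions rather than values or code from the systems above; the point is only that the fast plane compares a scalar against a tight budget while the slow plane compares distributions.

```python
"""Two-speed monitoring sketch: a fast service-health check plus a slower
model-quality proxy check. All thresholds, window sizes, and names here are
illustrative assumptions."""
import numpy as np

LATENCY_P99_BUDGET_MS = 20.0     # fast plane: evaluated every ~30 seconds
SCORE_DRIFT_PSI_LIMIT = 0.2      # slow plane: evaluated every ~5 minutes


def service_health_breached(p99_latency_ms: float) -> bool:
    """Fast plane: page immediately when the latency SLO is blown."""
    return p99_latency_ms > LATENCY_P99_BUDGET_MS


def population_stability_index(baseline: np.ndarray, current: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between the baseline and the most recent window of model scores,
    assuming scores are probabilities in [0, 1]."""
    base, _ = np.histogram(baseline, bins=bins, range=(0.0, 1.0))
    curr, _ = np.histogram(current, bins=bins, range=(0.0, 1.0))
    base = np.clip(base / len(baseline), 1e-6, None)
    curr = np.clip(curr / len(current), 1e-6, None)
    return float(np.sum((curr - base) * np.log(curr / base)))


def model_quality_proxy_breached(baseline_scores: np.ndarray,
                                 recent_scores: np.ndarray) -> bool:
    """Slow plane: flag drift in the serving score distribution long before
    delayed ground-truth labels arrive."""
    return population_stability_index(baseline_scores,
                                      recent_scores) > SCORE_DRIFT_PSI_LIMIT
```

The fast check gates on a single scalar and can run on every scrape; the proxy check needs enough events per window for the distribution comparison to be meaningful, which is why it runs on batched windows rather than per request.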
💡 Key Takeaways
Service SLOs protect user experience with subsecond to minute detection. Meta ads ranking alerts within 2 minutes if p99 scoring latency exceeds 20 milliseconds, automatically routing traffic to a faster fallback model.
Model quality SLOs respect statistical reality. DoorDash delivery time models wait for 3 consecutive hourly windows showing a 5% median error increase before alerting, avoiding false alarms from single outlier windows (a minimal sketch of this consecutive-window rule follows this list).
Proxy metrics bridge the gap. LinkedIn feed ranking uses 15-minute windows of click-through rate calibration and score entropy as early signals, validated against session-depth and return-rate metrics computed daily.
Label delay dictates monitoring speed. Stripe fraud models get dispute labels 30 to 90 days late, so they monitor rule trigger rates and manual review rates as same day proxies, with definitive fraud recall computed quarterly.
Canary deployments test both planes. Uber routes 2% of traffic to new ETA models, requiring both service metrics within 1% of baseline and prediction error within 3% after 2 hours before expanding to 10%, then 50%, then 100% (see the gate sketch after this list).
Sampling reduces cost without losing signal. Google scales monitoring by logging 1% of predictions with full feature detail for distribution analysis, and 10% with scores only for outcome metrics, saving petabytes monthly while maintaining statistical power for weekly cohorts (a tiered-sampling sketch also follows this list).
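The consecutive-window rule from the DoorDash-style takeaway reduces to a small amount of state: remember the verdict of the last N windows and fire only when all of them breached. A hedged sketch, with the 5% threshold and 3-window requirement taken from that bullet and everything else (class and argument names) assumed:

```python
"""Consecutive-window quality alert: fire only after N successive windows
breach the threshold, so a single noisy window cannot page anyone."""
from collections import deque


class ConsecutiveWindowAlert:
    def __init__(self, threshold_pct: float = 5.0, windows_required: int = 3):
        self.threshold_pct = threshold_pct
        self.breaches = deque(maxlen=windows_required)

    def observe(self, median_error_increase_pct: float) -> bool:
        """Call once per hourly window with the median-error increase vs. baseline.
        Returns True only when every one of the last N windows breached."""
        self.breaches.append(median_error_increase_pct > self.threshold_pct)
        return len(self.breaches) == self.breaches.maxlen and all(self.breaches)


# Example: only the final window pages, after three sustained breaches.
alert = ConsecutiveWindowAlert()
for increase in [6.1, 1.2, 5.4, 5.8, 7.0]:
    if alert.observe(increase):
        print("page on-call: sustained model-quality degradation")
```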
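The canary progression in the Uber-style takeaway is essentially a gate evaluated once per stage: the new model advances only while both planes stay inside tolerance. A sketch under assumptions, with the 1%/3% tolerances and the 2% → 10% → 50% → 100% ladder taken from that bullet and the metric names invented for illustration:

```python
"""Canary expansion gate: advance one traffic stage only when BOTH the
service-health and model-quality deltas (canary vs. baseline) are within
tolerance; otherwise roll the canary back entirely."""
from dataclasses import dataclass

TRAFFIC_STAGES = [0.02, 0.10, 0.50, 1.00]   # 2% -> 10% -> 50% -> 100%


@dataclass
class CanaryDeltas:
    latency_pct: float            # p99 latency delta vs. baseline
    error_rate_pct: float         # service error-rate delta vs. baseline
    prediction_error_pct: float   # model prediction-error delta vs. baseline


def next_traffic_fraction(current: float, deltas: CanaryDeltas) -> float:
    """Evaluate after the stage's soak period (e.g. 2 hours) and return the
    traffic fraction for the next stage, or 0.0 to roll back."""
    service_ok = deltas.latency_pct <= 1.0 and deltas.error_rate_pct <= 1.0
    quality_ok = deltas.prediction_error_pct <= 3.0
    if not (service_ok and quality_ok):
        return 0.0                             # kill the canary, keep baseline
    stage = TRAFFIC_STAGES.index(current)
    return TRAFFIC_STAGES[min(stage + 1, len(TRAFFIC_STAGES) - 1)]
```

A real controller would also enforce the minimum soak time and a minimum traffic volume per stage before evaluating the gate.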
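The tiered sampling in the Google-style takeaway can be made deterministic by hashing a request identifier into buckets, so the 1% full-detail tier is a strict subset of the 10% score-only tier and a given request is always logged the same way. The rates follow that bullet; the field names and bucket scheme are assumptions:

```python
"""Tiered prediction logging: ~1% of requests keep full feature detail for
distribution analysis, ~10% keep scores only for outcome metrics, and the
rest are dropped. Hash-based bucketing makes the sample deterministic."""
import hashlib
from typing import Optional


def _bucket(request_id: str, buckets: int = 1000) -> int:
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def build_log_record(request_id: str, score: float,
                     features: dict) -> Optional[dict]:
    bucket = _bucket(request_id)
    if bucket < 10:      # buckets 0-9: ~1%, full feature detail
        return {"request_id": request_id, "score": score, "features": features}
    if bucket < 100:     # buckets 10-99: ~10% logged in total, score only
        return {"request_id": request_id, "score": score}
    return None          # remaining ~90%: not logged
```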
📌 Examples
Netflix recommendation model monitoring splits into online checks every 5 minutes on score distributions over 50,000-event windows (which detected a feature pipeline bug within 20 minutes) and offline daily evaluation of viewing hours per recommended title using backfilled watch data.
Meta ads auction monitors p99 inference latency with 30-second alert windows, paging when latency hits 22 milliseconds for 2 consecutive windows. Model quality uses 6-hour windows with a 5-million-impression minimum, checking click-through rate (CTR) calibration before conversion data arrives days later.
Airbnb pricing model monitoring tracks service health with 1-minute error-rate SLOs, model proxies with hourly booking-rate correlation checks requiring 10,000 searches per market, and ground truth with weekly revenue-per-search metrics computed after stays complete 30 days later.