Label Delay and Feedback Windows in Production Monitoring
Most ML systems lack immediate ground truth. Labels arrive hours, days, or months after predictions, creating a monitoring gap in which degradation can compound before detection. Production systems bridge this gap with proxy metrics, delayed label joins, and observation windows tuned to domain-specific feedback latency.
The challenge is that label delay varies widely by domain. Uber ETA predictions get ground-truth arrival times within minutes to hours, enabling tight feedback loops and hourly model evaluation. Ad conversion models wait 7 to 28 days for purchase attribution, making daily accuracy monitoring impossible without proxies. Credit risk models in lending see default labels 6 to 24 months late, requiring entirely proxy-based early warning systems that are validated only on quarterly cohorts once enough outcomes materialize.
Production architectures log predictions with timestamps and identifiers, then asynchronously join labels as they arrive. Google Ads stores prediction cohorts in a feature store, joins conversion events as they trickle in over 28 days, and computes cumulative metrics on fixed observation windows. A model deployed on January 1 has its predictions evaluated on February 1, after collecting 30 days of conversions. This creates reporting lag but enables definitive measurement. Meanwhile, minute-level proxy metrics like click-through rate calibration and output entropy distributions fire alerts within 5 minutes, giving early warning that is validated days later.
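To make the log-then-join pattern concrete, here is a minimal sketch assuming hypothetical in-memory `Prediction` and `Label` records; a production system would run this join in a warehouse or feature store, but the logic is the same: join whatever labels have arrived, and only compute definitive metrics once a cohort's observation window has closed.

```python
import datetime as dt
from dataclasses import dataclass

OBSERVATION_WINDOW = dt.timedelta(days=30)

@dataclass
class Prediction:
    pred_id: str
    model_version: str
    predicted_at: dt.datetime
    score: float

@dataclass
class Label:
    pred_id: str
    value: int                      # e.g. 1 = converted, 0 = did not convert
    labeled_at: dt.datetime

def evaluate_cohort(predictions, labels, model_version, now):
    """Join whatever labels have arrived onto a prediction cohort and report
    coverage plus positive rate, but only once every prediction's 30-day
    observation window has closed."""
    cohort = [p for p in predictions if p.model_version == model_version]
    if not cohort:
        return None
    if any(now < p.predicted_at + OBSERVATION_WINDOW for p in cohort):
        return None                 # window still open: metrics would be biased toward fast responders

    arrived = {l.pred_id: l.value for l in labels}
    joined = [(p.score, arrived.get(p.pred_id)) for p in cohort]
    labeled = [(s, y) for s, y in joined if y is not None]
    return {
        "n_predictions": len(joined),
        "label_coverage": len(labeled) / len(joined),
        "positive_rate": sum(y for _, y in labeled) / max(len(labeled), 1),
    }
```

The guard that skips evaluation while any window is still open is the important part: it is what turns slow, definitive measurement into something you can trust, while the fast proxy metrics carry the alerting load in the meantime.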
Censoring and missing labels complicate analysis. Not all predictions get labels. Users who don't convert within 28 days might convert on day 29, or never. Naive accuracy computed only on labeled examples is biased toward fast responders. LinkedIn handles this by tracking label arrival rates per segment, flagging when label coverage drops from typical 40% to 25%, indicating an upstream tracking issue rather than model degradation. They compute metrics on matched windows, comparing week 1 of the new model to week 1 of the old model with identical observation periods to ensure fair evaluation.
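A minimal sketch of that label-coverage check, assuming hypothetical per-segment baselines; a real system would estimate baselines from historical label arrival rates rather than hard-coding them.

```python
# Assumed typical 24-hour label coverage per segment (illustrative values).
BASELINE_COVERAGE = {"feed": 0.40, "email": 0.35}
COVERAGE_ALERT_RATIO = 0.7          # alert if coverage falls below 70% of baseline

def check_label_coverage(predictions_by_segment, labels_by_segment):
    """Flag segments whose label arrival rate drops well below baseline, which
    usually indicates an upstream tracking bug rather than model degradation."""
    alerts = []
    for segment, n_predictions in predictions_by_segment.items():
        coverage = labels_by_segment.get(segment, 0) / max(n_predictions, 1)
        baseline = BASELINE_COVERAGE.get(segment)
        if baseline and coverage < COVERAGE_ALERT_RATIO * baseline:
            alerts.append(
                f"{segment}: label coverage {coverage:.0%} vs baseline {baseline:.0%} "
                "-- check logging/attribution before investigating the model"
            )
    return alerts

# Example: a drop from the 40% baseline to 25% coverage trips the alert.
print(check_label_coverage({"feed": 10_000}, {"feed": 2_500}))
```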
💡 Key Takeaways
•Proxy metrics detect faster but need validation. Facebook Ads uses click-through rate as a proxy, alerting within 30 minutes if CTR drops 5%, but waits for 7-day conversion data before deciding to roll back, avoiding false alarms from CTR noise that does not affect revenue.
•Observation windows must match across comparisons. Spotify recommendation metrics compare the new model's day 1 to 7 performance against the old model's day 1 to 7 performance with equal label coverage, not against the old model's final-week performance, which has more complete labels.
•Label arrival rate monitoring catches data issues. Amazon product recommendations track what percentage of impressions get click labels within 24 hours. A drop from 30% to 18% coverage indicated a client logging bug, not model degradation, saving days of misdirected debugging.
•Censoring bias affects long-tail events. Zillow home sale price predictions see sale labels arrive over 30 to 180 days. Naive accuracy on early sales overweights fast-selling homes, which are systematically different, requiring survival analysis or inverse propensity weighting to debias metrics (see the weighting sketch after this list).
•Incremental metric updates reduce waste. Rather than recomputing accuracy from scratch daily, DoorDash incrementally updates AUC by adding newly arrived labels to rolling 28-day cohorts, cutting compute cost by 80% while maintaining hourly freshness (a histogram-based variant is sketched after this list).
•Cold-start predictions need separate tracking. LinkedIn job recommendations for new users lack historical engagement features. They report new-user metrics separately with 48-hour observation windows, catching cases where cold-start accuracy degrades 12% while overall metrics stay flat because warm users dominate.
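As referenced in the censoring-bias takeaway, here is a hedged sketch of inverse propensity weighting for an error metric. The per-segment label-delay CDFs are hypothetical placeholders; in practice they would be estimated from historical label arrival times (for example with a Kaplan–Meier estimator), and the segment is assumed to capture why some examples get labeled faster.

```python
def ipw_mean_abs_error(examples, delay_cdf_by_segment):
    """Mean absolute error over labeled examples, reweighting each by
    1 / P(label observed within the elapsed time for its segment), so
    slow-to-label segments are not underrepresented in the metric."""
    num = den = 0.0
    for ex in examples:   # ex: {"segment", "days_elapsed", "prediction", "label" (None if censored)}
        if ex["label"] is None:
            continue                                    # still censored: no direct contribution
        cdf = delay_cdf_by_segment[ex["segment"]]       # {days: P(delay <= days)}
        p_obs = max((p for d, p in cdf.items() if d <= ex["days_elapsed"]), default=0.0)
        if p_obs <= 0:
            continue
        weight = 1.0 / p_obs
        num += weight * abs(ex["prediction"] - ex["label"])
        den += weight
    return num / den if den else None

# Assumed delay distributions: condos sell (and get labeled) much faster than farms,
# so early farm labels receive larger weights.
delay_cdf_by_segment = {
    "condo": {30: 0.55, 90: 0.90, 180: 1.00},
    "farm":  {30: 0.10, 90: 0.45, 180: 0.85},
}
```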
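And one way to keep AUC fresh as labels trickle in, sketched with bucketed score histograms so each new label is an O(1) update and AUC is recomputed in O(buckets). This is an illustrative technique rather than DoorDash's actual implementation; a true rolling 28-day cohort would also decrement counts as labels age out of the window.

```python
class StreamingAUC:
    """Approximate AUC maintained incrementally from bucketed score histograms.
    Assumes scores in [0, 1]; bucketing introduces a small approximation error."""

    def __init__(self, n_buckets: int = 1000):
        self.n = n_buckets
        self.pos = [0] * n_buckets
        self.neg = [0] * n_buckets

    def add(self, score: float, label: int) -> None:
        """Called whenever a delayed label arrives for a logged prediction."""
        bucket = min(int(score * self.n), self.n - 1)
        (self.pos if label == 1 else self.neg)[bucket] += 1

    def auc(self) -> float:
        """P(random positive scores above random negative), counting ties as 0.5."""
        total_pos, total_neg = sum(self.pos), sum(self.neg)
        if total_pos == 0 or total_neg == 0:
            return float("nan")
        correct = 0.0
        neg_below = 0                       # negatives in strictly lower buckets
        for b in range(self.n):
            correct += self.pos[b] * (neg_below + 0.5 * self.neg[b])
            neg_below += self.neg[b]
        return correct / (total_pos * total_neg)
```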
📌 Examples
Meta ads conversion models log 50 billion predictions daily with 28-day attribution. They compute proxy CTR metrics on 5-minute windows with 1-million-impression minimums, and definitive conversion lift on rolling 28-day cohorts updated daily as labels arrive, with alerts requiring the proxy and the delayed metric to agree before action is taken.
The Uber Eats delivery-time model gets ground truth within 30 to 90 minutes. They run evaluation pipelines every hour on predictions made 2 hours earlier, achieving sub-3-hour detection of accuracy regressions while ensuring 90% label coverage per city.
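A sketch of that lagged hourly evaluation loop, with hypothetical fetch/emit callables and a per-city coverage gate; the lag and threshold values mirror the example above but are otherwise assumptions.

```python
import datetime as dt

LAG = dt.timedelta(hours=2)
MIN_COVERAGE = 0.90

def hourly_evaluation(now, fetch_predictions, fetch_labels, emit_metric, alert):
    """Evaluate the prediction hour that ended two hours ago, city by city,
    emitting a metric only when label coverage has reached the threshold."""
    window_end = now.replace(minute=0, second=0, microsecond=0) - LAG
    window_start = window_end - dt.timedelta(hours=1)
    preds = fetch_predictions(window_start, window_end)    # {pred_id: (city, predicted_eta_minutes)}
    labels = fetch_labels(window_start, window_end)        # {pred_id: actual_eta_minutes}

    by_city = {}
    for pred_id, (city, eta) in preds.items():
        by_city.setdefault(city, []).append((eta, labels.get(pred_id)))

    for city, rows in by_city.items():
        labeled = [(p, y) for p, y in rows if y is not None]
        coverage = len(labeled) / max(len(rows), 1)
        if coverage < MIN_COVERAGE:
            alert(f"{city}: label coverage {coverage:.0%} below {MIN_COVERAGE:.0%}, skipping this window")
            continue
        mae = sum(abs(p - y) for p, y in labeled) / len(labeled)
        emit_metric(city=city, window_start=window_start, mae=mae, coverage=coverage)
```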
Netflix's recommendation system tracks streaming hours per recommended title, with labels arriving over 7 days as users binge. They monitor an immediate proxy (play rate within the first 10 minutes), an intermediate signal (completion rate at 24 hours, with 60% label coverage), and final viewing hours at 7 days with 95% coverage, using all three signals to triangulate model quality.
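A small configuration-style sketch of the multi-horizon triangulation idea; the signal names, horizons, and coverage targets are illustrative assumptions rather than Netflix's actual setup.

```python
# Illustrative multi-horizon signal configuration.
MONITORING_SIGNALS = {
    "play_rate":       {"horizon": "10m", "expected_coverage": 1.00, "role": "proxy"},
    "completion_rate": {"horizon": "24h", "expected_coverage": 0.60, "role": "intermediate"},
    "viewing_hours":   {"horizon": "7d",  "expected_coverage": 0.95, "role": "definitive"},
}

def should_act(degraded: dict) -> bool:
    """Act (e.g. roll back) only when the fast proxy and the definitive delayed
    signal both show degradation, so proxy noise alone never triggers a rollback."""
    return degraded.get("play_rate", False) and degraded.get("viewing_hours", False)
```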