
Cost, Scale, and Trade-off Analysis

Sampling rate determines both statistical power and infrastructure cost. For a high volume service at 50,000 requests per second, monitoring every event would generate 4.3 billion events per day. Sampling at 0.1% to 1% is typically sufficient; at 0.2% sampling, you capture 100 events per second (6,000 per minute), which accumulates 360,000 events in a 1 hour tumbling window. This provides ample statistical power for stable drift tests across dozens of features and segments while reducing log volume by 99.8%. Storage cost drops from approximately 200 terabytes per month to 400 gigabytes per month, and compute cost for drift analysis drops from tens of thousands of dollars to hundreds of dollars per month. Feature coverage versus test explosion is a critical cost trade off. Monitoring every feature across every segment creates combinatorial explosion. With 300 raw features and 50 segments, naive full cross monitoring requires 15,000 feature by segment combinations tested every window. Practical mitigations include monitoring only top K features by importance (typically 50 to 100 features ranked by Shapley Additive Explanations (SHAP) values), aggregating long tail segments into Other buckets, rotating deep checks on lower priority features weekly rather than per window, and using feature families to group correlated features. Batch versus streaming processing trades detection latency against infrastructure complexity and cost. Batch drift pipelines run daily or hourly using simple scheduled jobs, are easy to debug and replay, and cost approximately $5,000 to $10,000 per month for a system processing 100 million transactions per day. However, batch detection extends incident blast radius; a pipeline bug or model degradation can affect users for 12 to 24 hours before detection. Streaming pipelines using Kafka or Flink provide 5 to 15 minute detection latency, containing incidents within minutes, but require stateful stream processing expertise, cost 3x to 5x more (approximately $20,000 to $40,000 per month), and introduce operational complexity around backpressure, late arriving data, and watermark management. The ultimate trade off is sensitivity versus alert fatigue. Tight thresholds (PSI greater than 0.1, KS p less than 0.001) catch subtle shifts but generate dozens to hundreds of alerts per day at high traffic volumes, overwhelming on call engineers. Coarse thresholds (PSI greater than 0.5, KS p less than 0.01 with Bonferroni correction) miss real problems. Production systems find the sweet spot by layering: broad sensitive monitoring feeds a dashboard for exploration, medium thresholds (PSI greater than 0.25) generate informational alerts, and only high severity combinations (large effect size AND multiple windows AND business KPI impact) page humans. Meta has publicly discussed tuning alert thresholds to maintain 2 to 5 actionable alerts per day per ML system while catching 95% plus of incidents validated post hoc.
💡 Key Takeaways
Sampling 0.1% to 1% of traffic reduces storage from 200 TB to 400 GB per month (a 99.8% reduction), while 0.2% sampling at 50,000 requests per second still yields 360,000 events per hour for stable statistical tests; see the sizing sketch after this list
Monitoring the top 50 to 100 features by SHAP importance instead of all 300 reduces the test count per window from 15,000 to between 500 and 1,000, cutting compute cost by 90%+ with minimal recall loss
Streaming pipelines cost 3x to 5x more than batch ($20,000 to $40,000 versus $5,000 to $10,000 per month) but cut detection latency from 12-24 hours to 5-15 minutes, containing the incident blast radius
Alert tiering prevents fatigue: sensitive thresholds (PSI > 0.1) feed dashboards, medium thresholds (PSI > 0.25) generate info alerts, and only high-severity combinations (effect size AND persistence AND KPI drop) page humans
Meta tunes to 2 to 5 actionable alerts per day per ML system while maintaining 95%+ validated incident recall by requiring multiple signals (drift plus business-metric degradation plus the percentage of traffic affected)
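The sizing figures in the first takeaway can be reproduced with a short back-of-envelope helper. This is a sketch under stated assumptions: the 1,500-byte event size is an assumption chosen to roughly match the 200 TB / 400 GB figures above, and sampling_plan is a hypothetical helper, not part of any monitoring library.

```python
# Back-of-envelope sizing sketch. The 1,500-byte event size is an assumption
# picked to roughly reproduce the ~200 TB -> ~400 GB figures quoted above.
def sampling_plan(requests_per_sec: float,
                  sample_rate: float,
                  window_minutes: float,
                  bytes_per_event: int = 1_500) -> dict:
    """Events per window and monthly storage at a given sampling rate."""
    sampled_per_sec = requests_per_sec * sample_rate
    return {
        "events_per_sec": sampled_per_sec,
        "events_per_window": sampled_per_sec * window_minutes * 60,
        "monthly_storage_gb": sampled_per_sec * 86_400 * 30 * bytes_per_event / 1e9,
    }


# 50,000 req/s sampled at 0.2% with a 1-hour tumbling window:
# ~100 events/s, ~360,000 events per window, roughly 390 GB of logs per month.
print(sampling_plan(50_000, 0.002, 60))
```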
📌 Examples
Uber fraud detection samples 0.3% of 120 million daily transactions, generating 360,000 samples per day across 25 regions; batch pipeline costs approximately $8,000 per month and completes nightly drift checks in under 20 minutes
Netflix recommendation moved from daily batch (24 hour detection) to 30 minute sliding windows (10 minute detection); infrastructure cost increased from $12,000 to $45,000 per month but reduced incident impact from 20 million affected sessions to 2 million
Airbnb pricing monitors top 60 features by SHAP importance out of 280 total features; this covers 92% of model explanation while reducing per window test count from 14,000 to 600, keeping compute under 50 CPU seconds per run