Streaming vs Batch Monitoring: Latency, Cost, and Complexity Tradeoffs
Monitoring architecture sits on a spectrum between streaming (real-time aggregation) and batch (periodic computation). Streaming monitoring emits a lightweight event for every inference and maintains rolling-window aggregates in memory, giving 1 to 5 minute detection latency. Batch monitoring periodically processes logs (hourly or daily), computing feature statistics from cold storage with detection latency of hours to days. The choice hinges on risk tolerance, throughput, and operational complexity.
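To make the streaming side concrete, here is a minimal sketch of the per-inference event such a system might emit; the field names, the `emit_monitoring_event` helper, and the 1:10 sampling default are illustrative assumptions, not any specific vendor's API.

```python
import json
import random
import time

def emit_monitoring_event(model_id, features, prediction, sample_rate=0.1):
    """Emit a lightweight monitoring event for a sampled fraction of inferences.

    A downstream aggregator folds these events into rolling-window statistics
    (histograms, quantile sketches); the serving path only pays for sampling
    and serialization.
    """
    if random.random() > sample_rate:      # 1:10 sampling keeps hot-path cost low
        return None
    event = {
        "ts": time.time(),
        "model_id": model_id,
        "features": features,              # raw feature values; binned downstream
        "prediction": prediction,
    }
    return json.dumps(event)               # in practice, published to a message bus
```

A batch setup skips this event entirely and instead reads the prediction logs the service already writes.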
Streaming provides sub-minute alert latency, which is critical for high-risk systems like fraud detection or real-time bidding where bad predictions directly cost money. A fraud model at 5,000 transactions per second (TPS) with streaming monitoring can detect and circuit-break a compromised feature within 60 to 180 seconds, limiting exposure to roughly 300,000 to 900,000 affected transactions; batch monitoring with 1-hour latency would expose 18 million transactions. The cost is serving-path complexity and CPU overhead. With 150 features and 1:10 sampling, the system processes 75,000 feature updates per second (5k TPS × 150 features ÷ 10). Histogram and quantile updates consume 1 to 2% of serving CPU; at scale, this requires dedicated aggregator instances (typically two 8-vCPU instances with headroom).
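The throughput and exposure figures above follow from simple arithmetic; a quick back-of-the-envelope check:

```python
tps = 5_000                 # fraud model throughput (transactions per second)
n_features = 150            # monitored features per request
sampling = 10               # 1:10 sampling

updates_per_second = tps * n_features / sampling   # 75,000 feature updates/s

exposure_streaming = (tps * 60, tps * 180)         # 300,000 to 900,000 transactions
exposure_hourly_batch = tps * 3600                 # 18,000,000 transactions

print(updates_per_second, exposure_streaming, exposure_hourly_batch)
```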
Batch monitoring is simpler and cheaper, suited to low-risk or low-volume workloads. A weekly retrained recommendation model with offline metrics might use daily batch monitoring, computing full feature distributions from sampled logs in the data warehouse. Detection latency of 12 to 24 hours is acceptable when the retraining cadence is 7 days and model changes are gradual. Marginal serving cost drops to zero; monitoring runs as a scheduled Spark or Beam job on existing compute clusters. The tradeoff is delayed incident detection and coarser granularity (daily snapshots miss intraday patterns).
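As an illustration of the batch side, the sketch below scores per-feature drift with a Population Stability Index (PSI) between a baseline sample and the current day's sampled logs; the choice of PSI, the 0.2 threshold, and the DataFrame inputs are assumptions for the example, not a prescribed method.

```python
import numpy as np
import pandas as pd

def psi(baseline: pd.Series, current: pd.Series, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current-day distribution."""
    b = baseline.dropna().to_numpy()
    c = current.dropna().to_numpy()
    edges = np.histogram_bin_edges(b, bins=bins)
    b_frac = np.histogram(b, bins=edges)[0] / len(b)
    # Clip current values into the baseline range so out-of-range drift still counts.
    c_frac = np.histogram(np.clip(c, edges[0], edges[-1]), bins=edges)[0] / len(c)
    b_frac = np.clip(b_frac, 1e-6, None)
    c_frac = np.clip(c_frac, 1e-6, None)
    return float(np.sum((c_frac - b_frac) * np.log(c_frac / b_frac)))

# Typical use in a scheduled job after pulling sampled logs from the warehouse:
# scores = {f: psi(baseline_df[f], today_df[f]) for f in feature_cols}
# drifted = [f for f, s in scores.items() if s > 0.2]   # threshold is illustrative
```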
Hybrid architectures are common in production. Uber's Michelangelo runs streaming monitoring for critical-path features (user location, time of day, identity features) with 5-minute windows and Statistical Process Control (SPC) alerting, while batch monitoring handles secondary features and long-term trend analysis with daily windows. Netflix uses streaming for prediction distribution and acceptance rate (business-critical, needing fast detection of feedback loops), paired with batch monitoring for detailed per-feature drift analysis and correlation studies that inform retraining decisions.
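A rough sketch of the hybrid pattern follows, assuming a hypothetical tiering of features and a simple SPC control band over recent 5-minute windows; the feature names, window counts, and 3-sigma rule are illustrative, not taken from Uber's or Netflix's systems.

```python
from statistics import mean, stdev

# Hypothetical tiering: critical-path features are streamed, everything else is batched.
STREAMING_FEATURES = {"user_location", "request_time", "identity_score"}

def route(feature_name: str) -> str:
    return "streaming" if feature_name in STREAMING_FEATURES else "batch"

def spc_alert(window_history: list[float], current: float, k: float = 3.0) -> bool:
    """Alert when the current 5-minute window statistic leaves the control band
    (mean +/- k * sigma) built from recently closed windows, e.g. the last 24 hours."""
    if len(window_history) < 12:        # need enough windows for a stable baseline
        return False
    mu, sigma = mean(window_history), stdev(window_history)
    return sigma > 0 and abs(current - mu) > k * sigma
```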
💡 Key Takeaways
• Streaming monitoring achieves 1 to 5 minute detection latency by maintaining in-memory rolling aggregates; a fraud system at 5k TPS detects issues within 60 to 180 seconds, limiting exposure to 300k to 900k transactions versus 18M with hourly batch
• CPU overhead for streaming: 1 to 2% of serving resources at 1:10 sampling; 5k TPS with 150 features requires two 8-vCPU aggregator instances to handle 75k updates per second, with headroom for approximate algorithms (t-digest, HyperLogLog; see the sketch after this list)
• Batch monitoring processes logs periodically (hourly to daily) at zero marginal serving cost, suitable for low-risk workloads where 12 to 24 hour detection latency is acceptable (weekly retrained models, low-volume services)
• Sampling strategies: start with 1:10 for high QPS (reduces CPU by 90%) and validate drift metric stability versus the full stream; low-traffic segments may need 1:1 sampling or extended windows to reach a minimum of 5k events per window
• Hybrid pattern: stream critical-path features (identity, location, price) with 5-minute windows and SPC alerts; batch monitor secondary features with daily windows for trend analysis and correlation studies
• Storage tradeoff: streaming retains only rolling state (roughly 30 MB per model), while batch requires sampled logs (0.1 to 1% of traffic) with 30 to 90 day retention for forensics, adding terabytes for high-throughput systems
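The approximate sketches mentioned above are what keep rolling state small; below is a minimal sketch of per-window quantile tracking, assuming the third-party `tdigest` Python package (the `RollingQuantileMonitor` class, window counts, and shift metric are illustrative assumptions). A HyperLogLog sketch would play the same role for cardinality, e.g. distinct categorical values per window.

```python
from collections import deque
from tdigest import TDigest   # third-party package: pip install tdigest

class RollingQuantileMonitor:
    """Tracks approximate quantiles of one feature over rolling 5-minute windows."""

    def __init__(self, windows_kept: int = 12):
        self.windows = deque(maxlen=windows_kept)   # ~1 hour of 5-minute windows
        self.current = TDigest()

    def observe(self, value: float) -> None:
        self.current.update(value)                  # constant-ish memory per window

    def roll(self) -> None:
        """Close the current window and start a new one (call every 5 minutes)."""
        self.windows.append(self.current)
        self.current = TDigest()

    def p99_shift(self, reference_p99: float) -> float:
        """Relative shift of the latest closed window's p99 versus a training baseline."""
        if not self.windows:
            return 0.0
        latest = self.windows[-1].percentile(99)
        return abs(latest - reference_p99) / max(abs(reference_p99), 1e-9)
```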
📌 Examples
Uber Michelangelo: streaming for critical path features (user location, request time) with 5 min windows, batch for secondary features and daily trend dashboards; segmented by city to localize drift before global rollout
Netflix ranking: streaming monitors prediction distribution and acceptance rate (feedback loop risk, needs fast detection), batch monitoring for detailed per feature drift analysis across thousands of features, informing weekly retrain decisions
Airbnb pricing model: batch daily monitoring sufficient for 24 to 72 hour label delays; streaming overlay for prediction drift and approval rate as early warning, triggering investigation before labels arrive