Streaming vs Batch Monitoring: Latency, Cost, and Complexity Tradeoffs
The Architecture Spectrum
Monitoring architecture sits on a spectrum between streaming (real-time aggregation) and batch (periodic computation). The choice affects detection latency, infrastructure cost, and operational complexity. Most production systems use a combination tuned to feature criticality and freshness requirements.
Streaming Monitoring
Emits a lightweight event for every inference, maintaining rolling-window aggregates in memory with 1 to 5 minute detection latency. Requires always-on infrastructure (Flink, Kafka Streams, or custom services) that processes every request. Cost scales with request volume: at 10,000 QPS, processing 864 million events per day demands significant compute and storage. Best suited for high-value, latency-sensitive features where catching drift within minutes justifies the cost.
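The rolling-window aggregation described above can be sketched without a streaming framework: keep recent events in memory, evict anything older than the window, and expose a running statistic for drift checks. A minimal single-feature sketch (class and method names are illustrative, not from Flink or Kafka Streams):

```python
import time
from collections import deque

class RollingWindowMonitor:
    """In-memory rolling-window aggregate for one feature.

    Stores (timestamp, value) events for the last `window_seconds` and
    maintains a running sum, so the current mean is O(1) to read and can
    be compared against a baseline on every check.
    """

    def __init__(self, window_seconds=300):  # 5-minute window
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value), oldest first
        self.total = 0.0

    def record(self, value, now=None):
        """Ingest one inference event and evict expired ones."""
        now = time.time() if now is None else now
        self.events.append((now, value))
        self.total += value
        self._evict(now)

    def _evict(self, now):
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value

    def mean(self, now=None):
        """Current windowed mean, or None if the window is empty."""
        self._evict(time.time() if now is None else now)
        return self.total / len(self.events) if self.events else None
```

A production service would shard such state by feature and partition it across workers; the eviction-on-read pattern keeps memory bounded by window size rather than total traffic.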
Batch Monitoring
Periodically processes logs (hourly or daily), computing feature statistics from cold storage with detection latency ranging from 1 to 24 hours. Dramatically cheaper since compute runs only during batch windows and storage uses commodity object stores. For features where hourly detection suffices, batch monitoring costs 10 to 50x less than streaming equivalents. Most features at most companies can tolerate batch monitoring.
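A batch job of this shape typically histograms each feature from the day's logs and compares it against a baseline with a drift metric. One common choice is the Population Stability Index; a self-contained sketch (the PSI thresholds in the docstring are widely used rules of thumb, not from this document):

```python
import math

def bin_fractions(values, edges):
    """Histogram `values` into bins defined by sorted ascending `edges`
    and return normalized fractions (len(edges) + 1 bins)."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        idx = sum(v >= e for e in edges)  # index of the bin v falls into
        counts[idx] += 1
    n = max(len(values), 1)
    return [c / n for c in counts]

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Rule-of-thumb reading: < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 significant shift. `eps` guards against empty bins.
    """
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total
```

In a real pipeline the baseline fractions come from a reference window (e.g. training data or last month), current values are read from object storage, and the job alerts when PSI crosses a threshold.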
Tiered Strategy
Implement tiered monitoring where critical features (top 10 by importance, fraud signals, safety features) use streaming with 5-minute detection. Important features (top 50) use hourly batch. Remaining features use daily batch. This stratification optimizes cost while maintaining rapid detection for high-impact signals.
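The tiering above is usually expressed as configuration that routes each feature to a pipeline. A minimal sketch, with hypothetical feature names and an explicit daily-batch catch-all:

```python
# Illustrative tier config: feature names and latency targets are examples,
# matching the critical / important / remaining split described above.
TIERS = {
    "streaming":    {"detection_latency_s": 300,   "features": {"fraud_score", "txn_amount"}},
    "hourly_batch": {"detection_latency_s": 3600,  "features": {"session_length", "device_type"}},
    "daily_batch":  {"detection_latency_s": 86400, "features": None},  # None = catch-all
}

def tier_for(feature):
    """Route a feature to its monitoring tier; unlisted features
    fall through to daily batch."""
    for name, cfg in TIERS.items():
        if cfg["features"] is not None and feature in cfg["features"]:
            return name
    return "daily_batch"
```

Keeping the mapping in one config makes the cost/latency tradeoff auditable: promoting a feature to streaming is a one-line change with a visible infrastructure price.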
Log Sampling Trade-offs
For extremely high volume systems, sample logs for batch monitoring (1 to 10 percent sample rates). Statistical significance degrades with sampling: detecting 5 percent drift requires larger samples than detecting 20 percent drift. Adaptive sampling oversamples rare events (errors, outliers) while undersampling common patterns.
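The sample-size point can be made concrete with standard power analysis for a two-proportion z-test: the required sample grows roughly with the inverse square of the drift you want to detect. A sketch (defaults assume 5 percent significance and 80 percent power; the formula is standard, not specific to any monitoring tool):

```python
import math

def required_sample_size(p_baseline, relative_drift, z_alpha=1.96, z_beta=0.84):
    """Approximate per-window sample size to detect a relative shift in a
    proportion with a two-proportion z-test.

    z_alpha=1.96 and z_beta=0.84 correspond to a 5% two-sided significance
    level and 80% power. Smaller drifts need quadratically more samples.
    """
    p1 = p_baseline
    p2 = p_baseline * (1 + relative_drift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)
```

For a feature with a 10 percent baseline rate, detecting a 5 percent relative drift needs on the order of tens of thousands of sampled events per window, while a 20 percent drift needs only a few thousand, which is why aggressive sampling is safe for coarse drift detection but not for subtle shifts.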