Data Quality & Validation • Data Anomaly Detection
Production Scale Detection Architecture
The System Landscape: At large-scale companies, data flows from user-facing services generating 100,000 events per second at peak, through streaming and batch pipelines, into data lakes storing terabytes daily. Downstream, hundreds of dashboards and dozens of machine learning models depend on this data under strict Service Level Agreements (SLAs): 15-minute latency for operational analytics, 1 hour for core ML features. Anomaly detection sits as an observability layer across this entire flow.
Three Layer Architecture: Production systems typically implement detection in three stages. First, a metrics collection layer runs lightweight profiling jobs that compute row counts, null ratios, distinct counts, and distribution statistics. These profilers execute immediately after each pipeline step, adding 30 to 90 seconds of overhead. Second, a feature store maintains historical metrics as time series, storing 90 to 365 days of data. Third, detection algorithms (rules, statistical models, or ML services) consume these time series and emit anomaly decisions.
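To make the first layer concrete, here is a minimal profiling sketch in Python. It assumes the batch output is available as a pandas DataFrame with store_id and order_value columns and that metrics land in a simple append-only JSON-lines file; the names (profile_batch, METRICS_PATH) are illustrative, not a specific product's API.

```python
# Minimal profiling sketch: compute per-batch metrics and append them to a
# time-series store. Names (profile_batch, METRICS_PATH) are illustrative.
import json
import time
from pathlib import Path

import pandas as pd

METRICS_PATH = Path("metrics/orders_fact_profile.jsonl")  # hypothetical store

def profile_batch(df: pd.DataFrame, table: str) -> dict:
    """Compute the lightweight metrics the collection layer tracks."""
    return {
        "table": table,
        "ts": time.time(),
        "row_count": int(len(df)),
        "null_ratio": float(df.isna().mean().mean()),        # overall null fraction
        "distinct_store_count": int(df["store_id"].nunique()),
        "avg_order_value": float(df["order_value"].mean()),
        "p95_order_value": float(df["order_value"].quantile(0.95)),
    }

def write_metrics(metrics: dict) -> None:
    """Append one metrics row; detectors later read this file as a time series."""
    METRICS_PATH.parent.mkdir(parents=True, exist_ok=True)
    with METRICS_PATH.open("a") as f:
        f.write(json.dumps(metrics) + "\n")
```

Because the profiler only scans the newly written partition, it adds seconds of overhead per pipeline step rather than minutes.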
Real World Example: An e-commerce company runs hourly batch jobs aggregating orders per store into a fact table, typically writing 5 to 6 million rows per hour. Immediately after each batch completes, a profiling job computes row count, distinct store count, average order value, and 95th percentile order value, completing in under 60 seconds. Using 30 days of history, the detector predicts row count should be between 4.8 and 6.2 million. When a deployment bug excludes one country and row count drops to 3.5 million, detection fires within 5 minutes and automatically halts dependent jobs, preventing corrupt aggregates from reaching dashboards and ML models.
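A minimal sketch of the detection side of this example, assuming the profiler has already written about 30 days of hourly row counts. The 3-sigma band and the halt_downstream() hook are illustrative stand-ins for whatever model and orchestrator integration a real system uses.

```python
# Minimal detection sketch for the row-count example above. The 3-sigma band
# and halt_downstream() are illustrative, not a specific product's API.
import pandas as pd

def expected_range(history: pd.Series, k: float = 3.0) -> tuple[float, float]:
    """Derive an expected band from recent history (mean +/- k standard deviations)."""
    mean, std = history.mean(), history.std()
    return mean - k * std, mean + k * std

def check_row_count(history: pd.Series, observed: int) -> bool:
    low, high = expected_range(history)
    if observed < low or observed > high:
        print(f"ANOMALY: row_count={observed:,} outside [{low:,.0f}, {high:,.0f}]")
        halt_downstream()   # pause dependent jobs before bad data spreads
        return False
    return True

def halt_downstream() -> None:
    # Placeholder: in production this would pause dependent tasks in the
    # orchestrator so corrupt aggregates never reach dashboards or ML features.
    pass
```

With the 30-day history from the example, check_row_count(history, 3_500_000) would trip the alert, while a normal 5.1M-row hour would pass.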
Streaming vs Batch Trade-off: Salesforce moved from batch-based checks (finding problems in days) to streaming detection (alerts in minutes) by building detectors on top of their log pipeline. They process metrics with p99 latency below 2 minutes, enabling SRE-style response times. However, streaming requires stateful processing, careful backpressure management, and 3x to 5x more infrastructure cost compared to batch detection that runs once per hour.
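For intuition, a stripped-down version of a stateful streaming check is sketched below: count events per tumbling window and alert when the rate leaves an expected band. A real deployment would run inside a stream processor with checkpointing and backpressure handling; the window size and band values here are illustrative.

```python
# Sketch of a stateful streaming check: count events per 60-second tumbling
# window per topic and alert when the count falls outside an expected band.
# Band values are illustrative (roughly 80k-120k events/sec at peak).
from collections import defaultdict

WINDOW_SECONDS = 60
EXPECTED_MIN, EXPECTED_MAX = 80_000 * 60, 120_000 * 60   # events per window

def detect(event_stream):
    """event_stream yields (timestamp_seconds, topic) tuples in arrival order."""
    counts = defaultdict(int)
    current_window = None
    for ts, topic in event_stream:
        window = int(ts // WINDOW_SECONDS)
        if current_window is not None and window != current_window:
            # Window closed: check every topic's count before starting the next one.
            for t, n in counts.items():
                if not (EXPECTED_MIN <= n <= EXPECTED_MAX):
                    print(f"ALERT: {t} saw {n:,} events in window {current_window}")
            counts.clear()
        current_window = window
        counts[topic] += 1
    # Note: the final partial window is not flushed in this sketch.
```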
1. Ingest Layer: Monitor event rates per topic, schema versions, and parsing errors. Catch upstream bugs within minutes using streaming detection at p95 latency under 2 minutes.
2. Transformation Layer: Profile intermediate tables for row counts, join drop rates, and distribution shifts. Detect when a join unexpectedly drops 30% of keys due to logic bugs (see the sketch after this list).
3. Warehouse Layer: Monitor business aggregates like daily revenue, order conversion rate, and active devices. Alert on metric shifts that indicate real business problems.
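As referenced in the transformation layer item above, a join drop-rate check can be sketched in a few lines; the column names and the 5% threshold are illustrative.

```python
# Transformation-layer sketch: measure how many keys a join dropped and flag
# when the drop rate exceeds a threshold. Names and threshold are illustrative.
import pandas as pd

def join_drop_rate(left: pd.DataFrame, joined: pd.DataFrame, key: str) -> float:
    """Fraction of left-side keys that did not survive the join."""
    left_keys = left[key].nunique()
    joined_keys = joined[key].nunique()
    return 1.0 - joined_keys / left_keys if left_keys else 0.0

def check_join(left: pd.DataFrame, joined: pd.DataFrame, key: str,
               max_drop_rate: float = 0.05) -> bool:
    rate = join_drop_rate(left, joined, key)
    if rate > max_drop_rate:
        print(f"ALERT: join dropped {rate:.0%} of '{key}' keys "
              f"(threshold {max_drop_rate:.0%})")
        return False
    return True
```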
💡 Key Takeaways
✓ Three-layer architecture separates concerns: metrics collection (30 to 90s overhead per job), historical storage (90 to 365 days), and detection algorithms
✓ Detection spans ingest (event rates, schemas), transformation (row counts, join quality), and warehouse (business metrics), each with different latency requirements
✓ Streaming detection enables sub-2-minute alerts at 100k events/sec but costs 3x to 5x more than batch detection that runs once per hour
✓ Production systems add 30 to 90 seconds of profiling overhead per batch, writing metrics to time-series stores for model training and anomaly comparison
📌 Examples
1. E-commerce hourly jobs write 5 to 6M rows into a per-store order aggregation. The profiler computes metrics in under 60 seconds, and the detector compares against a 30-day baseline (4.8M to 6.2M expected range). When a bug drops the count to 3.5M, the alert fires within 5 minutes total.
2. Salesforce streams metrics from its log pipeline with p99 latency under 2 minutes, enabling operational response. They batch requests to the ML model service and use horizontal scaling behind load balancers to handle metric volume.
3. An IoT system splits the work: Raspberry Pi devices run simple range checks locally with millisecond response times and forward summarized metrics to the cloud, where heavier models analyze trends across hundreds of devices.
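A minimal sketch of the on-device half of example 3, assuming temperature-style readings arrive in small batches; the valid range and the summary fields are illustrative.

```python
# On-device sketch for example 3: a millisecond-cheap range check runs locally,
# and only summarized metrics are forwarded to the heavier cloud-side models.
# The bounds and summary fields are illustrative.
import statistics

TEMP_MIN_C, TEMP_MAX_C = -20.0, 60.0   # hypothetical valid sensor range

def check_reading(value_c: float) -> bool:
    """Local rule: flag readings outside the physically plausible range."""
    return TEMP_MIN_C <= value_c <= TEMP_MAX_C

def summarize(readings: list[float]) -> dict:
    """Summary the device ships to the cloud instead of raw readings."""
    return {
        "count": len(readings),
        "mean": statistics.fmean(readings),
        "min": min(readings),
        "max": max(readings),
        "out_of_range": sum(not check_reading(r) for r in readings),
    }
```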