
How Anomaly Detection Works: Rules vs Models

The Two Approaches: Data anomaly detection uses either static, rule-based checks or dynamic, model-based approaches. Understanding when to use each is critical for building reliable detection systems.

Rule-Based Detection: You define explicit thresholds for each metric, for example "alert if row count is below 4 million or above 7 million" or "flag if the user_id null ratio exceeds 0.5%". These rules are deterministic and easy to explain to stakeholders: when an alert fires, you can point to the exact threshold that was violated. The problem is that data evolves. Your daily row count might grow from 5 million to 15 million over six months due to product growth. Now your upper threshold of 7 million triggers false positives constantly, requiring manual updates to every rule.

Model-Based Detection: Instead of fixed thresholds, the system learns what normal looks like from historical data. For example, AWS Glue Data Quality analyzes past runs to build baselines, then predicts an expected range for the next batch. If your row count has been growing 2% per week for the last 30 days, the model expects next week's count to be around 5.1 million, not the static 5 million from a month ago. It automatically adapts to trends and seasonality.
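The contrast can be sketched in a few lines. This is a minimal, illustrative example (the function names and the simple mean/standard-deviation model are assumptions, not any specific tool's API): a static rule goes stale as volume grows, while even a naive learned baseline tracks the trend.

```python
from statistics import mean, stdev

def rule_based_check(row_count, low=4_000_000, high=7_000_000):
    """Static thresholds: deterministic, but must be updated by hand as data grows."""
    return low <= row_count <= high

def model_based_check(row_count, history, k=3.0):
    """Learn 'normal' from past runs: accept values within mean +/- k std deviations."""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma <= row_count <= mu + k * sigma

# 30 weeks of history with ~2% weekly growth, starting at 5M rows
history = [5_000_000 * 1.02 ** w for w in range(30)]

print(rule_based_check(8_800_000))            # False: the static 7M bound is stale
print(model_based_check(8_800_000, history))  # True: within the learned range
```

A real system would use a more sophisticated model than a symmetric standard-deviation band, but the core difference is the same: the bounds come from the data, not from a hand-edited rule.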
Detection Latency Comparison

- Batch detection: under 5 minutes
- Streaming detection: under 2 minutes
The Learning Process: Model-based systems require a warmup period. AWS Glue needs at least three historical data points before it can start predicting, but accuracy improves significantly with more history. A typical setup uses 30 to 90 days of past metrics to establish baselines. The system captures patterns like weekday versus weekend traffic (which might vary by 40%), monthly payment cycles, or gradual growth trends. When a new metric falls outside the predicted bounds, it triggers an anomaly.

Handling Seasonality: This is where models shine. An e-commerce site might see 10x traffic during Black Friday. A static rule would either miss real anomalies during normal periods (if set too loose) or trigger constant false alarms during peak events (if set too tight). A model trained on full-year cycles learns that November spikes are normal and adjusts expected ranges accordingly.
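One simple way to capture weekday-versus-weekend seasonality is to keep a separate baseline per day of the week. The sketch below is illustrative only (it is not AWS Glue's actual algorithm); it also mimics the warmup requirement by refusing to judge a day with fewer than three historical points.

```python
from collections import defaultdict
from statistics import mean, stdev

def build_baselines(history):
    """history: list of (weekday, row_count) pairs; weekday 0=Mon .. 6=Sun."""
    by_day = defaultdict(list)
    for weekday, count in history:
        by_day[weekday].append(count)
    # Require at least 3 points per weekday before producing a baseline
    return {d: (mean(v), stdev(v)) for d, v in by_day.items() if len(v) >= 3}

def is_anomaly(weekday, row_count, baselines, k=3.0):
    if weekday not in baselines:
        return False  # still in warmup: too little history to judge
    mu, sigma = baselines[weekday]
    return abs(row_count - mu) > k * sigma

# 8 weeks of history: ~5M rows on weekdays, ~3M on weekends, with mild drift
history = [(d, (3_000_000 if d >= 5 else 5_000_000) + 10_000 * w)
           for w in range(8) for d in range(7)]
baselines = build_baselines(history)

print(is_anomaly(5, 3_050_000, baselines))  # False: a normal Saturday dip
print(is_anomaly(5, 5_000_000, baselines))  # True: weekday volume on a weekend
```

A single global threshold would have to span both the 3M weekend level and the 5M weekday level, masking real anomalies; per-weekday baselines keep each expected range tight.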
💡 Key Takeaways
Rule-based detection uses fixed thresholds (row count between 4M and 7M) that are easy to explain but require manual updates as data evolves
Model-based detection learns from 30 to 90 days of history, automatically adapting to growth trends (a 2% weekly increase) and seasonality (weekday vs weekend patterns)
AWS Glue requires a minimum of 3 historical runs to start predicting, with accuracy improving as more data points accumulate
Detection latency varies: batch systems check after each job (under 5 minutes), while streaming systems detect within 1 to 2 minutes at 10k to 100k events per second
📌 Examples
1. A retail pipeline writes 5M rows daily during normal weeks but 50M during Black Friday. Rule-based detection with a fixed 7M upper bound would fail. Model-based detection trained on yearly data recognizes November spikes as normal.
2. User signups grow 15% month over month. A static rule flagging row counts above 1.2M becomes obsolete in three months. An adaptive model automatically adjusts the expected range from 1.2M to 1.8M.
3. Weekend traffic drops 40% compared to weekdays. Model-based detection learns this pattern and expects 3M rows on Saturday versus 5M on Tuesday, avoiding false positives.
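The second example, a 15% month-over-month growth trend, can be handled by fitting the trend and projecting the next month's expected value. This is a hypothetical sketch (a production system would use a proper forecasting model): it fits a log-linear trend, since constant percentage growth is linear in log space.

```python
import math
from statistics import mean

def expected_next(history):
    """Fit log-linear growth to monthly counts and project the next month."""
    xs = list(range(len(history)))
    ys = [math.log(c) for c in history]  # % growth is linear in log space
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return math.exp(intercept + slope * len(history))

# Six months of signups growing 15% month over month, starting at 1M
monthly = [1_000_000 * 1.15 ** m for m in range(6)]
print(round(expected_next(monthly)))  # projects ~15% above the latest month
```

A static threshold set from the first month would already be obsolete, while the projected value keeps the expected range moving with the trend.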
← Back to Data Anomaly Detection Overview