Data Anomaly Detection

What is Data Anomaly Detection?

Definition
Data Anomaly Detection identifies unexpected patterns or deviations in data quality, volume, distribution, or business metrics that indicate potential pipeline failures, bugs, or data corruption.
The Core Problem: Infrastructure monitoring tells you whether servers are up and queries are running, but it cannot tell you whether your data is wrong. Your database can be healthy and every job can complete successfully while a bug silently drops 30% of records or fills a critical column with nulls. Meanwhile, downstream dashboards show misleading numbers and machine learning models degrade without anyone noticing for hours or days. This is especially dangerous because bad data often looks normal at first glance. A recommender system fed corrupted events might show green infrastructure metrics while serving poor recommendations to millions of users. An analytics dashboard might report revenue figures that are computed correctly but based on incomplete input data.

What Gets Monitored: Data anomaly detection focuses on the data itself rather than the infrastructure. Common metrics include row counts per batch (did today's hourly job write 5 million rows as expected, or only 3 million?), null ratios in key columns (is user_id suddenly 15% null instead of the usual 0.1%?), distinct value counts (did the number of unique stores drop from 5,000 to 500?), distribution statistics such as the mean or 95th percentile, schema changes (did a required field disappear?), and business metrics such as daily active users or order conversion rates.

Why Traditional Monitoring Fails: System metrics like CPU, memory, and query execution time cannot catch data quality issues. A pipeline can run perfectly from an infrastructure perspective while producing garbage output. You need separate detection focused on data characteristics, which is why companies build dedicated anomaly detection layers that profile datasets and flag unexpected behavior before corrupt data propagates to critical systems.
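To make the monitored metrics above concrete, here is a minimal profiling sketch in Python, assuming pandas is available. The column names (user_id, store_id, amount), the baseline values, and the 30% relative tolerance are illustrative assumptions, not part of any particular tool.

```python
# Minimal sketch of per-batch data profiling, assuming pandas is installed.
# Column names, baseline values, and the tolerance are illustrative.
import pandas as pd

def profile_batch(df: pd.DataFrame) -> dict:
    """Compute per-batch data-quality metrics of the kind described above."""
    return {
        "row_count": len(df),
        "user_id_null_ratio": df["user_id"].isna().mean(),
        "distinct_stores": df["store_id"].nunique(),
        "amount_p95": df["amount"].quantile(0.95),
    }

def check_against_baseline(metrics: dict, baseline: dict, tolerance: float = 0.3) -> list[str]:
    """Flag any metric that deviates from its baseline by more than `tolerance` (relative)."""
    alerts = []
    for name, expected in baseline.items():
        actual = metrics[name]
        if expected == 0:
            continue  # skip zero baselines to avoid division by zero
        if abs(actual - expected) / expected > tolerance:
            alerts.append(f"{name}: expected ~{expected}, got {actual}")
    return alerts

# Toy batch standing in for today's output; the baseline stands in for
# yesterday's profile of the same dataset.
batch = pd.DataFrame({
    "user_id": [1, 2, None, 4, 5],
    "store_id": [10, 10, 20, 30, 30],
    "amount": [5.0, 7.5, 3.2, 9.9, 4.1],
})
baseline = {"row_count": 5_000_000, "user_id_null_ratio": 0.001,
            "distinct_stores": 5_000, "amount_p95": 9.5}
print(check_against_baseline(profile_batch(batch), baseline))
```

In practice, the baseline would be derived from recent batches or a seasonality-aware model rather than hard-coded, and a triggered alert would halt downstream jobs or page an on-call engineer instead of just printing.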
💡 Key Takeaways
Infrastructure monitoring shows system health (CPU, memory, query success) but cannot detect data quality problems like missing records, corrupt values, or distribution shifts
Typical monitored metrics include row counts, null ratios, distinct value counts, distribution percentiles, schema changes, and business aggregates like revenue or active users
Anomalies can be point-based (a single spike), contextual (unusual for this time or region), or collective (an abnormal sequence, such as steadily growing lag); a minimal point-anomaly check is sketched after this list
Detection must happen before bad data propagates to dashboards, machine learning models, or downstream systems where it causes real business impact
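To illustrate the point-based case from the takeaways above, the sketch below flags a batch whose metric value (here, hourly row count) deviates sharply from a rolling window of recent values. The 24-sample window and z-score threshold of 3 are illustrative choices, not a prescribed algorithm.

```python
# Minimal sketch of a point-anomaly check on a metric history (e.g. hourly row counts),
# using a rolling mean/std z-score. Window size and threshold are illustrative.
from statistics import mean, stdev

def is_point_anomaly(history: list[float], latest: float,
                     window: int = 24, z_threshold: float = 3.0) -> bool:
    recent = history[-window:]
    if len(recent) < 2:
        return False                      # not enough history to judge
    mu, sigma = mean(recent), stdev(recent)
    if sigma == 0:
        return latest != mu               # any deviation from a constant series is suspect
    return abs(latest - mu) / sigma > z_threshold

# Hourly row counts hover around 5.5M, then the latest batch writes only 3.5M.
hourly_rows = [5.5e6, 5.4e6, 5.6e6, 5.5e6, 5.5e6, 5.6e6, 5.4e6, 5.5e6]
print(is_point_anomaly(hourly_rows, 3.5e6))   # True -> alert and halt dependent jobs

# For contextual anomalies, compare against the same hour-of-day or same region
# instead of a global window; for collective anomalies, test a run of values
# (e.g. lag growing for N consecutive intervals) rather than a single point.
```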
📌 Examples
1. A bug excludes one country from an aggregation job. Infrastructure looks healthy with all jobs completing successfully, but the row count drops from 5.5M to 3.5M. Anomaly detection flags this within 5 minutes and halts dependent jobs.
2. An upstream service stops sending user_id for mobile events. The null ratio in user_id jumps from 0.1% to 15%, breaking user attribution. Detection catches this before the corrupted batch reaches the ML feature store.
3. A schema change removes a required field. Infrastructure metrics show normal throughput, but downstream models fail because they expect that field. Schema validation detects the missing field immediately (a minimal version of this check is sketched after these examples).
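The following sketch shows minimal versions of the checks behind examples 2 and 3: a null-ratio guard on user_id and a required-field schema check run before a batch is released downstream. The field names, the 1% threshold, and the validate_batch helper are hypothetical.

```python
# Minimal sketch of a pre-release batch gate: required-field schema check plus
# null-ratio guard. Field names and thresholds are illustrative assumptions.
REQUIRED_FIELDS = {"user_id", "event_type", "timestamp"}
MAX_NULL_RATIO = {"user_id": 0.01}   # alert if more than 1% of user_id values are missing

def validate_batch(records: list[dict]) -> list[str]:
    errors = []
    if not records:
        return ["empty batch"]
    # Schema check: every required field must be present.
    missing = REQUIRED_FIELDS - set(records[0].keys())
    if missing:
        errors.append(f"missing required fields: {sorted(missing)}")
    # Null-ratio check on key columns.
    for field, max_ratio in MAX_NULL_RATIO.items():
        nulls = sum(1 for r in records if r.get(field) is None)
        ratio = nulls / len(records)
        if ratio > max_ratio:
            errors.append(f"{field} null ratio {ratio:.1%} exceeds {max_ratio:.1%}")
    return errors

# Toy batch reproducing example 2: 15% of events arrive without a user_id.
batch = (
    [{"user_id": None, "event_type": "click", "timestamp": 1_700_000_000}] * 15
    + [{"user_id": 42, "event_type": "click", "timestamp": 1_700_000_000}] * 85
)
print(validate_batch(batch))   # ['user_id null ratio 15.0% exceeds 1.0%']
```

Real systems typically express such rules declaratively (an expected schema plus per-column constraints) so the same checks can run automatically on every batch before it reaches dashboards or feature stores.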