
Data Quality Monitoring and SLA Enforcement

The Observability Challenge: At scale, you cannot manually verify data quality. With tens of thousands of tables and petabytes of data processed daily, you need automated systems that continuously monitor freshness, completeness, accuracy, and distribution shifts. This is where governance intersects with data quality.

A production quality monitoring system tracks three categories of metrics for each dataset. Freshness metrics measure when data was last updated: for a core revenue table, you might require hourly updates with p95 pipeline completion under 10 minutes. Completeness metrics check null rates and record counts: a daily user activity summary should have less than a 0.1 percent null rate on key dimensions. Distribution metrics detect anomalies: if average order value suddenly jumps by 3 standard deviations, something is wrong.
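These expectations are typically declared as governance metadata attached to the dataset rather than hard-coded in pipelines. A minimal sketch in Python of what such a declaration could look like; the `QualityExpectation` structure and field names are illustrative, not any specific tool's API:

```python
from dataclasses import dataclass, field
from datetime import timedelta

@dataclass
class QualityExpectation:
    """Declarative quality expectations attached to a dataset's governance metadata."""
    dataset: str
    freshness_sla: timedelta                                   # max allowed time since last successful update
    p95_completion: timedelta                                  # pipeline latency target
    max_null_rate: dict = field(default_factory=dict)          # column -> max allowed null fraction
    distribution_checks: list = field(default_factory=list)    # (column, max z-score vs. 30-day baseline)

# Expectations for the core revenue table described above (values mirror the prose, names are made up).
revenue_expectations = QualityExpectation(
    dataset="warehouse.core.revenue_hourly",
    freshness_sla=timedelta(hours=1),
    p95_completion=timedelta(minutes=10),
    max_null_rate={"revenue_amount": 0.001, "currency": 0.001},  # < 0.1% nulls on key dimensions
    distribution_checks=[("avg_order_value", 3.0)],              # alert beyond 3 standard deviations
)
```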
Quality Monitoring Impact: 30-minute alert threshold · < 0.1% null rate target
How Enforcement Works: Pipelines emit metrics about record counts, null rates, distribution histograms, and latency to a central quality service. A rules engine compares these against expectations defined in governance metadata. Rules have severity levels: if a core fact table's freshness exceeds its SLA by 30 minutes, the system pages the data owner and posts to incident channels; if a non-critical dimension table shows a 5 percent increase in null values, it logs a warning but does not block.

Here's where it gets interesting: quality monitoring must balance availability against correctness. Aggressive enforcement (failing pipelines on minor schema or distribution changes) improves correctness but reduces availability. A streaming feature pipeline that blocks on slight schema drift may degrade machine learning serving availability below the desired 99.9 percent. More tolerant policies keep systems up but risk silent quality regressions.

Real Production Pattern: Netflix and Uber describe integrating quality status directly into data portals. Before using a dataset, you see its status: green (fresh and healthy), yellow (SLA warning), or red (SLA violation or quality failure). This prevents analysts from building dashboards on stale data and ML engineers from training models on datasets with active quality issues.
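A rough sketch of the severity-based evaluation described above, assuming pipelines report a simple metrics snapshot to a central service; the function names, grace window, and thresholds are illustrative assumptions, not a particular vendor's rules engine:

```python
from datetime import datetime, timedelta, timezone

SEVERITY_PAGE, SEVERITY_WARN = "page", "warn"

def evaluate_freshness(last_update: datetime, sla: timedelta,
                       grace: timedelta = timedelta(minutes=30)):
    """Return (severity, message) or None. Pages only once the SLA is exceeded by the grace window."""
    lag = datetime.now(timezone.utc) - last_update
    if lag > sla + grace:
        return SEVERITY_PAGE, f"freshness lag {lag} exceeds SLA {sla} by more than {grace}"
    if lag > sla:
        return SEVERITY_WARN, f"freshness lag {lag} exceeds SLA {sla}"
    return None

def evaluate_null_rate(observed: float, baseline: float, critical: bool):
    """Critical tables page on a 5-point null-rate increase; non-critical tables only log a warning."""
    if observed - baseline > 0.05:
        return (SEVERITY_PAGE if critical else SEVERITY_WARN,
                f"null rate rose from {baseline:.1%} to {observed:.1%}")
    return None

# Example: a non-critical dimension table whose null rate jumped ~6 points -> warning, not a page.
print(evaluate_null_rate(observed=0.07, baseline=0.01, critical=False))
```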
⚠️ Common Pitfall: The most dangerous failure mode is false confidence: a dashboard shows green status, but the underlying rules are incomplete. For example, a revenue table passes basic null checks, yet a change in source field semantics goes uncaught, causing 5 percent revenue under-reporting for a week. Column-level lineage and semantic versioning help catch these issues.
The Implementation Trade-off: Rich quality monitoring adds overhead. Capturing detailed distribution histograms for every column in every partition can increase pipeline runtime by 5 to 10 percent and storage by similar amounts. Some teams implement tiered monitoring: critical datasets get full monitoring, while exploratory datasets get basic checks only. The governance metadata specifies which tier each dataset belongs to, driving automated monitoring configuration.
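One way to express the tiered approach is a monitoring tier recorded in governance metadata that drives which checks run for each dataset. A sketch under that assumption; the tier names and settings are invented for illustration:

```python
# Monitoring tiers mapped to the checks they enable (tier names and settings are illustrative).
MONITORING_TIERS = {
    "critical": {
        "freshness": True,
        "null_rate": True,
        "distribution_histograms": True,   # full per-column histograms; the ~5-10% overhead case
        "row_count_anomaly": True,
    },
    "standard": {
        "freshness": True,
        "null_rate": True,
        "distribution_histograms": False,
        "row_count_anomaly": True,
    },
    "exploratory": {
        "freshness": True,                 # basic checks only
        "null_rate": False,
        "distribution_histograms": False,
        "row_count_anomaly": False,
    },
}

def checks_for(dataset_metadata: dict) -> dict:
    """Resolve the check configuration for a dataset from its governance metadata."""
    tier = dataset_metadata.get("monitoring_tier", "standard")
    return MONITORING_TIERS[tier]

# Example: an exploratory scratch table only gets freshness checks.
print(checks_for({"name": "scratch.ad_hoc_analysis", "monitoring_tier": "exploratory"}))
```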
💡 Key Takeaways
Data quality monitoring tracks freshness (update recency), completeness (null rates), and distribution (anomaly detection) for each dataset
Production systems use severity-based alerting: a core table missing its 30-minute SLA triggers pages, while a non-critical table with a 5% null increase only logs a warning
Aggressive quality enforcement improves correctness but reduces availability, creating a fundamental trade-off between failing fast and keeping systems running
Quality status is exposed in data portals (green/yellow/red indicators) so users know freshness and health before building dashboards or training models
Detailed monitoring adds 5 to 10 percent overhead in runtime and storage, leading to tiered monitoring approaches where critical datasets get full checks
📌 Examples
1. A revenue table with hourly updates requires: p95 completion under 10 minutes, less than 0.1% null rate on revenue amount, and distribution within 3 standard deviations of a 30 day moving average
2. When a pipeline processes user activity events, completeness checks verify expected record counts per region per hour, alerting if any region shows more than 20% deviation from historical patterns (see the sketch below)
3. A feature store serving ML models might tolerate 5 minute freshness SLA violations for non-critical features but require sub-minute recovery for features affecting real-time bidding systems
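A possible shape for the per-region completeness check in example 2, with the 20% deviation rule hard-coded for clarity; the function name, region keys, and counts are made up:

```python
def completeness_alerts(observed_counts: dict, baseline_counts: dict, max_deviation: float = 0.20):
    """Compare per-region hourly record counts against a historical baseline.

    observed_counts / baseline_counts map region -> record count for the same hour of day.
    Returns a list of (region, deviation) pairs exceeding the allowed deviation.
    """
    alerts = []
    for region, baseline in baseline_counts.items():
        observed = observed_counts.get(region, 0)
        if baseline == 0:
            continue  # no history to compare against; skip rather than divide by zero
        deviation = abs(observed - baseline) / baseline
        if deviation > max_deviation:
            alerts.append((region, deviation))
    return alerts

# Example: EU traffic dropped ~40% versus the historical hourly baseline -> alert fires for "eu".
print(completeness_alerts(
    observed_counts={"us": 1_020_000, "eu": 480_000, "apac": 310_000},
    baseline_counts={"us": 1_000_000, "eu": 800_000, "apac": 300_000},
))
```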