
Failure Modes and Edge Cases in Data Quality

When Quality Metrics Lie: The most dangerous data quality failures are those that pass all your checks. Your accuracy validators show green, completeness is 100 percent, consistency constraints hold, but the data is fundamentally wrong. Understanding these edge cases is critical for building resilient systems.

The Semantic Accuracy Trap: A mobile Software Development Kit (SDK) bug swaps the latitude and longitude fields. All values are numeric and within valid ranges: latitudes between -90 and 90, longitudes between -180 and 180. Schema validation passes. Range checks pass. But every location is wrong by hundreds or thousands of kilometers: users in New York appear in the Indian Ocean. Detecting this requires statistical monitoring, not just rule-based validation. You need to track distribution shifts: median location per city, geographic heatmaps, or the correlation between declared and inferred location. When the median coordinates for "New York" users suddenly shift into the ocean, you alert. The lesson: accuracy is not just "valid values" but "semantically correct values."
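To make the distribution-shift idea concrete, here is a minimal sketch of such a check (not a specific tool or a prescribed implementation): it compares the median reported coordinates per declared city against a historical baseline and flags large shifts. The baseline values, event shape, and drift threshold are illustrative assumptions.

```python
from statistics import median

# Hypothetical historical baselines: median (lat, lon) per declared city.
CITY_BASELINE = {
    "new_york": (40.71, -74.01),
    "london": (51.51, -0.13),
}

def check_city_centroid_drift(events, max_drift_degrees=1.0):
    """Flag cities whose median reported location drifts far from baseline.

    `events` is an iterable of dicts with 'city', 'lat', 'lon' keys.
    A swapped lat/lon bug shows up as a drift of tens of degrees even
    though every individual value passes range validation.
    """
    by_city = {}
    for e in events:
        by_city.setdefault(e["city"], []).append((e["lat"], e["lon"]))

    alerts = []
    for city, coords in by_city.items():
        baseline = CITY_BASELINE.get(city)
        if baseline is None:
            continue  # no baseline yet for this city
        med_lat = median(lat for lat, _ in coords)
        med_lon = median(lon for _, lon in coords)
        drift = max(abs(med_lat - baseline[0]), abs(med_lon - baseline[1]))
        if drift > max_drift_degrees:
            alerts.append({"city": city, "median": (med_lat, med_lon), "drift": drift})
    return alerts
```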
❗ Remember: Unit mismatches are another semantic failure. A client starts sending amounts in cents where the server expects dollars. All values still pass as positive numbers, but aggregates like average fare jump by 100x overnight. Static validation cannot catch this without domain knowledge.
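A similarly hedged sketch for the unit-mismatch case: compare an aggregate such as the median fare against a trailing baseline and flag order-of-magnitude shifts. The threshold and the way the baseline is produced are assumptions for illustration.

```python
from statistics import median

def check_unit_drift(current_values, baseline_median, ratio_threshold=5.0):
    """Flag a likely unit mismatch (e.g. cents vs dollars).

    A cents-vs-dollars swap moves the median by roughly 100x in either
    direction, which no per-record range check will notice.
    """
    if not current_values or baseline_median <= 0:
        return None
    current_median = median(current_values)
    ratio = current_median / baseline_median
    if ratio > ratio_threshold or ratio < 1.0 / ratio_threshold:
        return {"current_median": current_median,
                "baseline_median": baseline_median,
                "ratio": ratio}
    return None
```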
The Hidden Completeness Failure: Global event volume looks normal: 2 million events per 5-minute window, right on baseline. But a single Kafka partition is stuck, losing 10,000 events per minute for 5 percent of users. Aggregate monitoring misses this completely; the failure is hidden in the distribution.

Edge cases multiply with legitimate traffic spikes. A sports betting service sees 5x normal traffic during a championship game. A naive completeness alert tuned to "3x deviation from baseline" fires constantly, training operators to ignore alerts. You need contextual baselines: scheduled events, day-of-week patterns, seasonal trends. Without them, alert fatigue destroys your monitoring effectiveness.
[Figure: Partition failure hidden in aggregate metrics — 2M total events look normal while 1 stuck partition impacts 5% of users.]
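To illustrate the per-partition idea, here is a rough sketch that checks each partition's event count against a contextual baseline for the same time bucket, independent of the topic-level total. The data structures and threshold are assumptions, not a prescribed implementation.

```python
def check_partition_completeness(partition_counts, baselines, min_ratio=0.5):
    """Compare each partition's observed count to a contextual baseline.

    `partition_counts` maps partition id -> events observed in the window.
    `baselines` maps partition id -> expected events for the same
    day-of-week/hour bucket (how baselines are built is up to you).
    A stuck partition stands out even when the topic-level total is normal.
    """
    alerts = []
    for partition, expected in baselines.items():
        observed = partition_counts.get(partition, 0)
        if expected > 0 and observed / expected < min_ratio:
            alerts.append({"partition": partition,
                           "observed": observed,
                           "expected": expected})
    return alerts

# Hypothetical usage: the 2M topic total looks fine, but partition 7 is stuck.
# alerts = check_partition_completeness(counts_last_5m, baselines_for_bucket)
```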
Consistency Across Schema Evolution: An orders system adds a new status, partially_refunded. The reporting system still expects binary refunded-or-not states. Now a single order appears as "completed" in one dashboard and as "refunded" in another that incorrectly maps partially_refunded to fully refunded. Both systems pass their internal consistency checks; the failure only appears when you compare cross-system aggregates. This happens constantly when multiple teams evolve schemas independently. Prevention requires data contracts: explicit agreements on field meanings, valid values, and evolution rules. Breaking changes must go through review and coordinated deployment.

Clock Skew and Temporal Inconsistency: Systems that rely on client timestamps for ordering events break when device clocks are skewed: "delivered" before "shipped", a purchase timestamp 10 minutes in the future. This breaks state machine invariants and downstream Service Level Objectives (SLOs). The fix is to use server-side timestamps for ordering and treat client timestamps as metadata only. But this introduces its own edge case: if ingestion lags by 20 minutes during an outage, all events get recent server timestamps and appear "out of order" relative to actual occurrence time. You need both: the client timestamp for business logic, the server timestamp for technical ordering (see the sketch at the end of this section).

The False Precision Problem: The data looks perfect: accuracy to the cent, 100 percent completeness, all consistency constraints hold. But a key business rule changed: a promo code is now applied differently, and the metric definition did not update. Revenue reports are precisely wrong. This is a governance failure, not a data quality failure in the traditional sense, and it causes extremely hard-to-debug issues because all your quality dashboards show green. Prevention requires treating metric definitions as code: versioned, reviewed, and deployed with the same rigor as application logic. Changes to business rules must trigger metric definition updates in lockstep.
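A minimal sketch of the dual-timestamp approach described above, assuming a simple in-memory event model: server timestamps drive technical ordering, client timestamps remain available for business logic, and large disagreements are flagged rather than dropped. The names and skew tolerance are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

MAX_CLOCK_SKEW = timedelta(minutes=5)  # illustrative tolerance

@dataclass
class Event:
    event_id: str
    client_ts: datetime   # what the device claims; used for business logic
    server_ts: datetime   # assigned at ingestion; used for technical ordering

def order_for_processing(events):
    """Order events by server timestamp, never by client timestamp."""
    return sorted(events, key=lambda e: e.server_ts)

def flag_clock_skew(event):
    """Flag events whose client clock disagrees badly with the server clock.

    Flagged events keep flowing; downstream consumers treat the client
    timestamp as untrusted metadata instead of rejecting the event.
    """
    return abs(event.client_ts - event.server_ts) > MAX_CLOCK_SKEW

# Hypothetical usage: a client clock running 10 minutes fast gets flagged.
# e = Event("e1",
#           client_ts=datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc),
#           server_ts=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc))
# flag_clock_skew(e)  # -> True
```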
💡 Key Takeaways
Semantic accuracy failures pass validation but are fundamentally wrong, like swapped coordinates or unit mismatches causing 100x aggregate errors
Partition-level failures hide in aggregate metrics, losing data for a subset of users while total volume appears normal
Schema evolution across independent teams creates consistency violations when systems map new values differently without coordinated contracts
Client timestamp reliance creates temporal inconsistencies from clock skew, requiring a dual-timestamp strategy for business logic versus technical ordering
False precision occurs when data is technically correct but semantically wrong due to business rule changes not reflected in metric definitions
📌 Examples
1. A mobile SDK swaps latitude and longitude. All values pass range validation, but user locations shift by thousands of kilometers, detected only through statistical distribution monitoring.
2. A single stuck Kafka partition loses 10,000 events per minute, affecting 5 percent of users, while aggregate topic throughput of 2 million events per 5 minutes looks completely normal.
3. An orders system adds a partially_refunded status, but the reporting system maps it to fully refunded, causing a revenue discrepancy of $50,000 per day between dashboards despite both passing consistency checks.
4. A client with a clock 10 minutes fast creates a purchase event in the future, breaking downstream fraud detection rules that assume temporal ordering.