Data Quality Dimensions (Accuracy, Completeness, Consistency)

What are Data Quality Dimensions?

Definition
Data Quality Dimensions are measurable properties that determine whether data is fit for its intended use. The three foundational dimensions are accuracy, completeness, and consistency.
Bad data at scale drives bad decisions at scale. A pricing model trained on corrupted data can lose millions per day; a recommendation engine fed incomplete events shows users irrelevant content. Data quality dimensions turn vague complaints like "the data is bad" into explicit, measurable properties you can monitor and enforce.

Accuracy: Is It Correct?
Accuracy measures correctness relative to the real world. If an orders table shows a user paid $9.99 but the payment processor charged $19.99, that data is inaccurate. This goes beyond format validation: a value can be syntactically valid but semantically wrong. Typical checks include comparing derived data against trusted sources, or validating realistic ranges such as latitude between -90 and 90 and user age between 13 and 120.

Completeness: Is Everything Present?
Completeness measures whether all expected data arrived. A dataset can be accurate but incomplete: every order that did arrive may be perfectly recorded, while 1 percent is missing due to Kafka lag or a broken ingestion job. Completeness is quantified as a percentage, for example, 99.9 percent of expected events for a given hour actually arrived in the warehouse.

Consistency: Does It Agree With Itself?
Consistency measures whether data aligns across different views or systems. Within a table, this covers constraints like unique identifiers and referential integrity. Across systems, it means user state in the data warehouse matches the source systems within expected lag. Some delay is tolerable; permanent contradictions, such as an order marked "refunded" in one table and "completed" in another forever, are not.
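As a rough sketch, each dimension maps to a different kind of check: range validation for accuracy, a count against an expected baseline for completeness, and cross-system reconciliation for consistency. The field names, ranges, and thresholds below are illustrative assumptions, not values prescribed above.

```python
# Minimal sketch of one check per dimension; field names, ranges, and the
# 99.9% threshold are illustrative assumptions.

def check_accuracy(rows):
    """Flag rows whose values fall outside semantically plausible ranges."""
    return [
        r for r in rows
        if not (-90 <= r["latitude"] <= 90)
        or not (-180 <= r["longitude"] <= 180)
        or not (13 <= r["age"] <= 120)
    ]

def check_completeness(arrived_count, expected_count, threshold=0.999):
    """Compare arrived records against an expected baseline for the window."""
    ratio = arrived_count / expected_count
    return ratio >= threshold, ratio

def check_consistency(warehouse_state, source_state):
    """Reconcile per-entity state between the warehouse and the source system."""
    return {
        key for key in source_state
        if warehouse_state.get(key) != source_state[key]
    }

# Toy data: swapped coordinates still pass the range check, but age 200 does not.
rows = [{"latitude": 48.1, "longitude": 11.5, "age": 34},
        {"latitude": 11.5, "longitude": 48.1, "age": 200}]
print(check_accuracy(rows))                                            # second row flagged
print(check_completeness(arrived_count=1_200_000, expected_count=2_000_000))  # (False, 0.6)
print(check_consistency({"user_1": "active"}, {"user_1": "suspended"}))        # {'user_1'}
```

Note how the swapped-coordinates row slips past the range check: accuracy often needs comparison against a trusted source, not just bounds validation.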
✓ In Practice: These dimensions are not abstract theory. Companies like Uber, processing 20 billion events per day, define explicit Service Level Objectives (SLOs) for each dimension per dataset, treating violations like availability incidents.
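One way to make "an SLO per dimension per dataset" concrete is a small registry that pairs each dataset with its targets and flags breaches the way an availability monitor would. The dataset name, targets, and structure below are hypothetical, not any particular company's tooling.

```python
# Hypothetical per-dataset, per-dimension SLO registry; names and targets are made up.
SLOS = {
    "orders_events": {
        "completeness": {"target": 0.999, "window_minutes": 30},
        "accuracy":     {"target": 0.9995},
        "consistency":  {"target": 0.999},
    },
}

def evaluate_slo(dataset, dimension, observed):
    """Return an incident-style record when an observed value breaches its SLO target."""
    target = SLOS[dataset][dimension]["target"]
    status = "violated" if observed < target else "ok"
    return {"dataset": dataset, "dimension": dimension,
            "observed": observed, "target": target, "status": status}

print(evaluate_slo("orders_events", "completeness", observed=0.6))
# -> {'dataset': 'orders_events', 'dimension': 'completeness', ..., 'status': 'violated'}
```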
💡 Key Takeaways
Accuracy means semantic correctness relative to real-world truth, not just syntactic validity of data types or formats
Completeness measures what percentage of expected records actually arrived, typically tracked per time window and data source
Consistency ensures data agrees with itself within tables through constraints and across systems through reconciliation
Each dimension requires different enforcement strategies: accuracy at ingestion, completeness through counting, consistency via audits
Production systems define explicit SLOs per dimension per dataset, such as 99.9 percent completeness within 30 minutes
📌 Examples
1. Accuracy violation: A mobile SDK bug swaps latitude and longitude. Values pass numeric range checks, but all locations are wrong by hundreds of kilometers.
2. Completeness issue: 2 million events were expected between 10:00 and 10:05 UTC based on the historical baseline, but only 1.2 million arrived due to partition lag.
3. Consistency failure: A user profile shows status as active in the cache but suspended in the source database, creating contradictory application behavior.