How Data Quality Dimensions Work Across the Pipeline
The Core Challenge:
At scale, data flows through multiple stages: ingestion from clients, event streaming, processing, warehousing, and serving to analytics or machine learning. Each stage can degrade quality. The key insight is that you enforce and monitor each dimension at different points in this pipeline, not just once.
Accuracy at Ingestion:
For a ride-sharing app processing millions of trip events, accuracy starts at the edge. Each event includes coordinates, timestamps, a fare, and identifiers. Input validation enforces basic rules: the timestamp is not more than 5 minutes in the future, coordinates fall within city bounds, and the fare is non-negative. This catches obviously incorrect data before it spreads downstream. Deeper semantic checks happen later, such as "total fare equals the sum of its components within 1 cent" or "trip duration matches distance given typical speed distributions."
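A minimal sketch of what such ingestion-time checks might look like. The event fields, city bounds, and thresholds here are hypothetical placeholders, not a real schema:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical service-area bounding box and tolerances; a real system would
# load these from per-city configuration.
CITY_BOUNDS = {"min_lat": 40.4, "max_lat": 41.0, "min_lon": -74.3, "max_lon": -73.7}
MAX_FUTURE_SKEW = timedelta(minutes=5)

def validate_trip_event(event: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the event passes."""
    errors = []

    # Timestamp must not be more than 5 minutes in the future.
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    if ts > datetime.now(timezone.utc) + MAX_FUTURE_SKEW:
        errors.append("timestamp too far in the future")

    # Coordinates must fall inside the service area.
    if not (CITY_BOUNDS["min_lat"] <= event["lat"] <= CITY_BOUNDS["max_lat"]
            and CITY_BOUNDS["min_lon"] <= event["lon"] <= CITY_BOUNDS["max_lon"]):
        errors.append("coordinates outside service area")

    # Fare must be non-negative.
    if event["fare_cents"] < 0:
        errors.append("negative fare")

    # Deeper semantic check: components must sum to the total within 1 cent.
    components = event.get("fare_components_cents")
    if components is not None and abs(sum(components) - event["fare_cents"]) > 1:
        errors.append("fare components do not sum to total")

    return errors
```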
Completeness Through Counting:
A common pattern is event count and cardinality checks per topic, per partition, per time window. If you usually see around 2 million events between 10:00 and 10:05 UTC and today you see only 1.2 million, an alert triggers. At companies like Netflix, warehouse ingestion compares transaction counts against source Online Transaction Processing (OLTP) systems every hour. SLAs define expectations: "99.95 percent of source transactions must arrive within 30 minutes."
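A sketch of the baseline comparison behind that alert. The 20 percent drop threshold and the idea of a trailing-average baseline are illustrative assumptions:

```python
def check_window_completeness(actual_count: int,
                              baseline_count: int,
                              max_drop_ratio: float = 0.2) -> bool:
    """Return True if the window looks complete, False if an alert should fire.

    baseline_count is the expected volume for this time window, e.g. a trailing
    average of the same 5-minute slot over previous weeks.
    """
    if baseline_count == 0:
        return True  # no baseline yet; skip the check
    drop = (baseline_count - actual_count) / baseline_count
    return drop <= max_drop_ratio

# The example from the text: ~2M events expected, 1.2M observed -> 40% drop -> alert.
if not check_window_completeness(actual_count=1_200_000, baseline_count=2_000_000):
    print("ALERT: event volume dropped more than 20% below baseline for this window")
```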
Consistency Through Reconciliation:
Data gets denormalized and replicated across systems. A user profile exists in OLTP, cache, Kafka, and warehouse. With eventual consistency, you accept timing differences. But invariants matter: "no orphaned orders without a user" and "status transitions are valid." At Meta scale, consistency checks run as data audits that reconcile aggregates between systems, like total ad spend per advertiser between billing and reporting. These audits run daily or hourly, and discrepancies above 0.1 percent trigger incident review.
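A simplified sketch of such a reconciliation audit, assuming each system has already produced per-advertiser spend aggregates and using the 0.1 percent threshold from the text:

```python
def reconcile_totals(billing_totals: dict[str, float],
                     reporting_totals: dict[str, float],
                     max_relative_gap: float = 0.001) -> list[str]:
    """Compare per-advertiser spend between two systems.

    Returns the advertisers whose relative discrepancy exceeds the threshold
    (0.1 percent here), which would trigger an incident review.
    """
    flagged = []
    for advertiser in billing_totals.keys() | reporting_totals.keys():
        billed = billing_totals.get(advertiser, 0.0)
        reported = reporting_totals.get(advertiser, 0.0)
        base = max(abs(billed), abs(reported), 1e-9)  # avoid division by zero
        if abs(billed - reported) / base > max_relative_gap:
            flagged.append(advertiser)
    return flagged

# A 0.2% gap on a single advertiser gets flagged for review.
print(reconcile_totals({"acme": 10_000.00}, {"acme": 10_020.00}))  # ['acme']
```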
Mobile Clients (generate events)
  ↓
Ingestion Layer (schema + range checks)
  ↓
Stream Processing (cardinality counting)
  ↓
Warehouse + Serving (reconciliation audits)
Completeness Detection Timeline: NORMAL (2M events) → TODAY (1.2M events) → ACTION (alert fired)
💡 Key Takeaways
✓ Accuracy is enforced earliest, at ingestion, with schema validation and range checks, catching obviously incorrect data before it spreads
✓ Completeness monitoring uses baseline comparisons per time window, alerting when actual counts deviate significantly from expected patterns
✓ Consistency requires cross-system reconciliation jobs that compare aggregates, accepting small lag but flagging permanent contradictions
✓ Each pipeline stage applies appropriate checks: lightweight validation at high-throughput ingestion, heavier audits in batch warehouse processing
✓ Production systems treat data quality violations like availability incidents, with on-call rotations responding to breaches of defined SLOs
📌 Examples
1. Ingestion accuracy: Ride event validation rejects a timestamp 10 minutes in the future or coordinates outside the service area before writing to Kafka.
2. Completeness monitoring: Warehouse ingestion compares the hourly transaction count from the source database against arrived events, alerting if the gap exceeds 0.05 percent.
3. Consistency audit: A nightly job compares total revenue per country from the checkout service versus the data warehouse, opening an incident if the difference exceeds 0.5 percent.