Data Quality Dimensions (Accuracy, Completeness, Consistency)

How Data Quality Dimensions Work Across the Pipeline

The Core Challenge: At scale, data flows through multiple stages: ingestion from clients, event streaming, processing, warehousing, and serving to analytics or machine learning. Each stage can degrade quality. The key insight is that you enforce and monitor each dimension at different points in this pipeline, not just once.
Pipeline stages and the checks applied at each: Mobile Clients (generate events) → Ingestion Layer (schema + range checks) → Stream Processing (cardinality counting) → Warehouse + Serving (reconciliation audits)
Accuracy at Ingestion: For a ride-sharing app processing millions of trip events, accuracy starts at the edge. Each event includes coordinates, timestamps, fare, and identifiers. Input validation enforces basic rules: the timestamp is no more than 5 minutes in the future, coordinates fall within city bounds, and the fare is non-negative. This catches obviously incorrect data before it spreads downstream. Deeper semantic checks happen later, such as "total fare equals the sum of its components within 1 cent" or "trip duration matches distance given typical speed distributions."
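These ingestion-time checks are cheap enough to run inline on every event. A minimal sketch in Python, assuming hypothetical field names (timestamp, lat, lon, fare_cents) and illustrative city bounds; schema validation of required fields and types would typically happen first in the serialization layer:

```python
from datetime import datetime, timedelta, timezone

# Illustrative bounds and skew limit; real values would come from per-city config.
CITY_BOUNDS = {"lat": (40.45, 40.95), "lon": (-74.30, -73.65)}
MAX_FUTURE_SKEW = timedelta(minutes=5)

def validate_trip_event(event: dict) -> list:
    """Return a list of rule violations; an empty list means the event passes ingestion checks."""
    errors = []

    # Rule 1: timestamp no more than 5 minutes in the future.
    ts = datetime.fromtimestamp(event["timestamp"], tz=timezone.utc)
    if ts > datetime.now(timezone.utc) + MAX_FUTURE_SKEW:
        errors.append("timestamp too far in the future")

    # Rule 2: coordinates within the service area.
    lat_lo, lat_hi = CITY_BOUNDS["lat"]
    lon_lo, lon_hi = CITY_BOUNDS["lon"]
    if not (lat_lo <= event["lat"] <= lat_hi and lon_lo <= event["lon"] <= lon_hi):
        errors.append("coordinates outside service area")

    # Rule 3: fare is non-negative.
    if event["fare_cents"] < 0:
        errors.append("negative fare")

    return errors

# An event that passes all checks; one that fails would be rejected (or routed to a
# dead-letter topic) before it is written to Kafka.
event = {"timestamp": 1700000000, "lat": 40.71, "lon": -74.00, "fare_cents": 1250}
print(validate_trip_event(event))  # [] -> accepted
```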
Completeness Through Counting: A common pattern is event count and cardinality checks per topic, per partition, per time window. If you usually see around 2 million events between 10:00 and 10:05 UTC and today you see only 1.2 million, an alert triggers. At companies like Netflix, warehouse ingestion compares transaction counts against source Online Transaction Processing (OLTP) systems every hour. SLAs define expectations: "99.95 percent of source transactions must arrive within 30 minutes."
Completeness detection timeline: normal window ≈ 2M events → today 1.2M events → alert fired.
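A minimal sketch of the count-based completeness check, assuming the baseline comes from historical averages for the same window and using an illustrative 20 percent shortfall tolerance:

```python
def is_window_complete(actual_count: int, baseline_count: int, tolerance: float = 0.2) -> bool:
    """True if the window's event volume is within tolerance of the historical baseline."""
    if baseline_count == 0:
        return actual_count == 0
    shortfall = (baseline_count - actual_count) / baseline_count
    return shortfall <= tolerance

# Baseline of ~2M events for the 10:00-10:05 UTC window; today only 1.2M arrived (40% shortfall).
if not is_window_complete(actual_count=1_200_000, baseline_count=2_000_000):
    print("ALERT: event volume 40% below baseline for 10:00-10:05 UTC")
```

In production the same check runs per topic and per partition so a single stalled partition is visible even when the overall count looks healthy.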
Consistency Through Reconciliation: Data gets denormalized and replicated across systems. A user profile exists in OLTP, cache, Kafka, and warehouse. With eventual consistency, you accept timing differences. But invariants matter: "no orphaned orders without a user" and "status transitions are valid." At Meta scale, consistency checks run as data audits that reconcile aggregates between systems, like total ad spend per advertiser between billing and reporting. These audits run daily or hourly, and discrepancies above 0.1 percent trigger incident review.
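A minimal sketch of such a reconciliation audit, assuming per-advertiser ad-spend aggregates have already been pulled from the billing and reporting systems, and using the 0.1 percent discrepancy threshold mentioned above (names and numbers are illustrative):

```python
def reconcile_spend(billing: dict, reporting: dict, threshold: float = 0.001) -> list:
    """Return advertisers whose billing vs. reporting totals differ by more than the threshold (0.1%)."""
    flagged = []
    for advertiser, billed in billing.items():
        reported = reporting.get(advertiser, 0.0)
        denom = max(abs(billed), 1e-9)  # guard against division by zero
        if abs(billed - reported) / denom > threshold:
            flagged.append(advertiser)
    return flagged

# Daily aggregates pulled from each system; a real audit would also scan for
# advertisers present in reporting but missing from billing.
billing_totals   = {"adv_1": 10_000.00, "adv_2": 5_250.00}
reporting_totals = {"adv_1": 10_000.00, "adv_2": 5_180.00}  # ~1.3% low -> flagged
print(reconcile_spend(billing_totals, reporting_totals))  # ['adv_2']
```

Flagged advertisers feed into the incident-review process rather than being auto-corrected, since the audit cannot tell which system is wrong.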
💡 Key Takeaways
Accuracy is enforced earliest at ingestion with schema validation and range checks, catching obviously incorrect data before it spreads
Completeness monitoring uses baseline comparisons per time window, alerting when actual counts deviate significantly from expected patterns
Consistency requires cross-system reconciliation jobs that compare aggregates, accepting small lag but flagging permanent contradictions
Each pipeline stage applies appropriate checks: lightweight validation at high-throughput ingestion, heavier audits in batch warehouse processing
Production systems treat data quality violations like availability incidents, with on-call rotations responding to breaches of defined SLOs
📌 Examples
1. Ingestion accuracy: Ride event validation rejects a timestamp 10 minutes in the future or coordinates outside the service area before writing to Kafka.
2. Completeness monitoring: Warehouse ingestion compares the hourly transaction count from the source database to arrived events, alerting if the gap exceeds 0.05 percent.
3. Consistency audit: A nightly job compares total revenue per country from the checkout service versus the data warehouse, opening an incident if the difference exceeds 0.5 percent.