Data Quality & Validation • Data Reconciliation TechniquesMedium⏱️ ~3 min
How Reconciliation Works at Scale
The Architecture Pattern: Production reconciliation systems follow a configuration-driven pipeline. Teams define which tables or entities to reconcile, how to match records (the join key), and which columns or metrics to compare. An orchestrator triggers jobs per entity or per source pair on a schedule, typically hourly or daily.
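To make the configuration-driven idea concrete, here is a minimal sketch of what one entity's reconciliation config might look like. The layout, source and field names, tolerances, and thresholds are illustrative assumptions, not any specific tool's schema.

```python
# Hypothetical reconciliation config; an orchestrator would iterate over
# entries like this and launch one job per entity or source pair.
RECONCILIATION_CONFIG = [
    {
        "entity": "orders",
        "left_source": "onprem.billing.orders",   # illustrative table names
        "right_source": "cloud.billing.orders",
        "join_keys": ["order_id"],
        "columns": {
            "amount":     {"rule": "tolerance", "tolerance": 0.01},
            "currency":   {"rule": "equality"},
            "updated_at": {"rule": "tolerance_seconds", "tolerance": 5},
        },
        "schedule": "hourly",
        "thresholds": {"green": 0.999, "amber": 0.98},  # red below amber
    },
    # ...one entry per table or entity pair
]
```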
Each reconciliation job runs in three stages in a distributed compute environment such as Spark; a code sketch of all three stages follows the list below.
1. Preprocess: Extract data from each source, limited by partitions such as day or hour to manage volume. Normalize types and formats into a canonical schema, resolving time zones, numeric precision, and encodings. Then perform a distributed join on the configured keys.
2. Compare: Walk each joined row and apply rules per column: check equality, apply tolerances (currency within 0.01, timestamps within 5 seconds), or compute similarity scores. Generate per-row mismatch flags.
3. Postprocess: Aggregate mismatch counts per column and compute match percentages. Apply thresholds to produce health labels (green at 99.9% or above, amber at 98% or above, red below). Store results in a metrics store and trigger alerts when thresholds are breached.
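Here is a minimal PySpark sketch of those three stages for an orders reconciliation. It assumes both sides expose an orders table with order_id, amount, currency, and updated_at, partitioned by a ds column; the table names, partition filter, and tolerances are illustrative assumptions, not a reference implementation.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders-reconciliation").getOrCreate()

# 1. Preprocess: extract one partition from each source and normalize types
# into a canonical schema (decimal amounts, upper-case currency, timestamps).
def load_orders(table: str):
    return (
        spark.read.table(table)
        .where(F.col("ds") == "2024-01-01")          # assumed partition column
        .select(
            "order_id",
            F.col("amount").cast("decimal(18,2)").alias("amount"),
            F.upper(F.col("currency")).alias("currency"),
            F.col("updated_at").cast("timestamp").alias("updated_at"),
        )
    )

left = load_orders("onprem.orders").alias("l")
right = load_orders("cloud.orders").alias("r")

# Distributed join on the configured key; rows missing on one side surface
# as nulls and are counted as mismatches below.
joined = left.join(right, on="order_id", how="full_outer")

# 2. Compare: per-column rules -- numeric tolerance, exact equality, and a
# 5-second timestamp tolerance; coalesce() turns nulls into mismatch flags.
flags = joined.select(
    "order_id",
    F.coalesce(F.abs(F.col("l.amount") - F.col("r.amount")) <= 0.01,
               F.lit(False)).alias("amount_ok"),
    F.coalesce(F.col("l.currency") == F.col("r.currency"),
               F.lit(False)).alias("currency_ok"),
    F.coalesce(F.abs(F.col("l.updated_at").cast("long")
                     - F.col("r.updated_at").cast("long")) <= 5,
               F.lit(False)).alias("updated_at_ok"),
)

# 3. Postprocess: per-column match rates plus an overall row match rate.
summary = flags.agg(
    F.avg(F.col("amount_ok").cast("double")).alias("amount_match_rate"),
    F.avg(F.col("currency_ok").cast("double")).alias("currency_match_rate"),
    F.avg(F.col("updated_at_ok").cast("double")).alias("updated_at_match_rate"),
    F.avg((F.col("amount_ok") & F.col("currency_ok") & F.col("updated_at_ok"))
          .cast("double")).alias("row_match_rate"),
)

# Health label from the overall match rate (green >= 99.9%, amber >= 98%).
row_rate = summary.collect()[0]["row_match_rate"]
label = "green" if row_rate >= 0.999 else "amber" if row_rate >= 0.98 else "red"
print(f"orders reconciliation: {row_rate:.4%} rows matched -> {label}")
```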
Real-World Scale Example: Consider reconciling an on-premises billing system against a new cloud-based system. You have hundreds of tables with billions of rows in total. A Spark cluster processes each table, extracting data from both sides and performing distributed joins. For a single orders table with 500 million records, the system might find that 99.97% of amount values match, 99.95% of currency values match, and 99.8% of records match across all checked columns.
The Integration Story: Reconciliation is not a standalone tool. At companies like Netflix and Airbnb, reconciliation checks are wired into workflow orchestrators: yesterday's batch ETL output is not published to business dashboards until reconciliation validates that the aggregates are consistent. This tight integration means data quality issues are caught before they impact decisions.
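As a hedged illustration of that wiring, the sketch below gates a dashboard-publishing task behind a reconciliation check in an Airflow DAG (assuming Airflow 2.4+). The DAG name, task names, and the 99.9% threshold are illustrative, and fetch_match_rate is a stub standing in for a real metrics-store query.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator


def fetch_match_rate(entity: str) -> float:
    # Hypothetical stub: a real version would query the metrics store written
    # by the reconciliation job's postprocess stage.
    return 0.9997


def reconciliation_passed() -> bool:
    # Returning False short-circuits everything downstream, so dashboards
    # are never refreshed from unvalidated data.
    return fetch_match_rate("orders") >= 0.999


def publish_dashboards() -> None:
    # Placeholder for refreshing business-facing aggregates or extracts.
    print("publishing dashboards from validated data")


with DAG(
    dag_id="billing_reconciliation_gate",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    gate = ShortCircuitOperator(
        task_id="check_reconciliation",
        python_callable=reconciliation_passed,
    )
    publish = PythonOperator(
        task_id="publish_dashboards",
        python_callable=publish_dashboards,
    )
    # Dashboards only update once the reconciliation gate passes.
    gate >> publish
```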
Typical Performance Metrics
- P50 latency: 5-10 minutes
- P99 target: under 30 minutes
- Throughput: 1M+ rows per minute
"Reconciliation is tightly integrated with ingestion, transformation, monitoring, and incident management. It's part of your system reliability story, not an offline audit tool."
💡 Key Takeaways
✓ Reconciliation jobs run as distributed compute pipelines with three stages: preprocess (extract and normalize), compare (apply per-column rules), and postprocess (aggregate metrics and alert)
✓ At scale, systems process millions of records per minute, targeting p50 latency of 5 to 10 minutes and p99 under 30 minutes for hourly reconciliation jobs running on medium-sized clusters
✓ Results are aggregated at three levels: per-column match rates (99.97% of amount values), per-table match rates (99.8% overall), and health labels (green/amber/red based on thresholds)
✓ Production systems integrate reconciliation into workflow orchestrators, blocking downstream dashboards or reports until data quality is validated
📌 Examples
1. A billing reconciliation between on-premises and cloud systems with 500 million order records might use Spark to join on order_id, compare amount and currency fields, and produce column-level summaries showing a 99.97% match rate on amounts
2. Netflix-style data platforms wire reconciliation checks into Airflow DAGs, ensuring yesterday's ETL job validates aggregates before updating executive dashboards