How Data Contracts Work at Scale
The Scale Challenge: At companies processing thousands of events per second across hundreds of microservices, uncontrolled changes cascade into chaos. An e-commerce order service handling 5,000 writes per second publishes OrderPlaced events to both real-time fraud models and batch analytics. Without a contract, renaming a field silently corrupts downstream aggregates within minutes at 10,000 events per second.
Different SLAs for Different Needs: Fraud scoring needs events within 500 milliseconds at p99 latency to block suspicious transactions, which drives architectural choices: streaming ingestion with low-latency storage. Batch analytics, by contrast, only needs data fresh within 15 minutes of the event timestamp but requires 99.9 percent daily completeness; the relaxed latency allows cheaper batch ingestion.
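A minimal sketch of how these two tiers could be declared alongside a contract. The ConsumerSLA class and its field names are illustrative, not from any specific platform.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ConsumerSLA:
    """Illustrative SLA declaration attached to a data contract consumer."""
    consumer: str
    latency_p99_ms: Optional[int] = None      # end-to-end event latency target
    freshness_minutes: Optional[int] = None   # max lag behind the event timestamp
    completeness_pct: Optional[float] = None  # required daily completeness

# Fraud scoring: tight latency, served by streaming ingestion and low-latency storage.
fraud_sla = ConsumerSLA(consumer="fraud_scoring", latency_p99_ms=500)

# Batch analytics: relaxed freshness but strict completeness, served by batch loads.
analytics_sla = ConsumerSLA(
    consumer="batch_analytics",
    freshness_minutes=15,
    completeness_pct=99.9,
)
```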
Runtime Enforcement: At ingestion time, validators check incoming data against contracts. For streaming systems, schema-aware consumers reject invalid messages or route them to quarantine topics. For batch warehouse loads, validation steps sample or scan the data, checking schema, null rates, uniqueness constraints, and value ranges. Companies like LinkedIn and Netflix build observability layers that compute Service Level Indicators (SLIs) such as freshness, row counts, and anomaly scores for each dataset.
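A simplified batch validator along these lines, assuming records arrive as Python dicts. The contract structure and quarantine handling are invented for illustration.

```python
# Illustrative contract spec: required fields, unique keys, and value ranges.
CONTRACT = {
    "required": ["order_id", "amount_cents", "placed_at"],
    "unique": ["order_id"],
    "ranges": {"amount_cents": (0, 10_000_000)},
}

def validate_batch(records, contract):
    """Route each record to the valid set or, with its errors, to quarantine."""
    valid, quarantine = [], []
    seen = set()
    for rec in records:
        errors = []
        for field in contract["required"]:
            if rec.get(field) is None:
                errors.append(f"missing or null field: {field}")
        for field, (lo, hi) in contract["ranges"].items():
            value = rec.get(field)
            if value is not None and not (lo <= value <= hi):
                errors.append(f"{field}={value} outside [{lo}, {hi}]")
        for field in contract["unique"]:
            key = (field, rec.get(field))
            if key in seen:
                errors.append(f"duplicate value for {field}: {rec.get(field)}")
            seen.add(key)
        if errors:
            quarantine.append({"record": rec, "errors": errors})
        else:
            valid.append(rec)
    return valid, quarantine

def null_rates(records, fields):
    """Dataset-level SLI: fraction of missing/null values per field."""
    n = max(len(records), 1)
    return {f: sum(1 for r in records if r.get(f) is None) / n for f in fields}
```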
The Contract Registry: Producers register schemas in a central catalog with machine-readable compatibility rules. When the order service wants to change amount from cents to dollars, the contract specifies backward-compatible changes only, with a 90-day deprecation window for breaking changes. The registry might be built on schema registry tooling that integrates with the streaming platform.
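A sketch of producer-side registration against a Confluent-style Schema Registry REST API; the registry URL, subject name, and Avro schema are placeholders.

```python
import json
import requests  # assumes the requests package is installed

REGISTRY_URL = "http://schema-registry:8081"   # placeholder address
SUBJECT = "orders.OrderPlaced-value"           # hypothetical subject name
HEADERS = {"Content-Type": "application/vnd.schemaregistry.v1+json"}

order_placed_v1 = {
    "type": "record",
    "name": "OrderPlaced",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount_cents", "type": "long"},
        {"name": "placed_at", "type": "long"},
    ],
}

# Pin the compatibility policy for this subject: backward-compatible changes only.
requests.put(
    f"{REGISTRY_URL}/config/{SUBJECT}",
    headers=HEADERS,
    data=json.dumps({"compatibility": "BACKWARD"}),
).raise_for_status()

# Register version 1; later versions that violate the policy will be rejected.
resp = requests.post(
    f"{REGISTRY_URL}/subjects/{SUBJECT}/versions",
    headers=HEADERS,
    data=json.dumps({"schema": json.dumps(order_placed_v1)}),
)
resp.raise_for_status()
print("registered schema id:", resp.json()["id"])
```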
Real-Time Fraud vs Batch Analytics: 500ms p99 fraud SLA vs 15min analytics freshness SLA
✓ In Practice: Producer Continuous Integration (CI) pipelines validate schema changes against contracts before deployment. Incompatible changes fail the build, shifting failures left to development time instead of 2 AM on-call incidents.
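One way a CI step might implement that gate, again assuming a Confluent-style registry; exiting nonzero fails the build.

```python
import json
import sys
import requests

REGISTRY_URL = "http://schema-registry:8081"   # placeholder address
SUBJECT = "orders.OrderPlaced-value"           # hypothetical subject name
HEADERS = {"Content-Type": "application/vnd.schemaregistry.v1+json"}

def is_compatible(candidate_schema: dict) -> bool:
    """Ask the registry whether the proposed schema is compatible with the latest version."""
    resp = requests.post(
        f"{REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest",
        headers=HEADERS,
        data=json.dumps({"schema": json.dumps(candidate_schema)}),
    )
    resp.raise_for_status()
    return resp.json()["is_compatible"]

if __name__ == "__main__":
    with open(sys.argv[1]) as f:   # path to the proposed schema file
        candidate = json.load(f)
    if not is_compatible(candidate):
        print("Incompatible schema change; failing the build.")
        sys.exit(1)
    print("Schema change is backward compatible.")
```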
Organizational Integration: Downstream applications declare dependencies on specific contract versions. A recommendation service, for example, states that it depends on user_profile.v3 with p95 2-minute freshness and recent_activity.v2 with 99.5 percent completeness. Platform teams can then see the blast radius of any change and prioritize which sources need the strongest guarantees.
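A hedged sketch of how such dependency declarations might look and how a platform team could compute blast radius from them; the manifest format is invented for illustration.

```python
# Hypothetical consumer-declared dependencies on contract versions and SLAs.
DEPENDENCIES = [
    {
        "consumer": "recommendation_service",
        "contract": "user_profile.v3",
        "sla": {"freshness_p95_minutes": 2},
    },
    {
        "consumer": "recommendation_service",
        "contract": "recent_activity.v2",
        "sla": {"completeness_pct": 99.5},
    },
    {
        "consumer": "fraud_scoring",
        "contract": "orders.OrderPlaced.v1",
        "sla": {"latency_p99_ms": 500},
    },
]

def blast_radius(contract: str, deps=DEPENDENCIES):
    """List every consumer affected by a change to the given contract."""
    return sorted({d["consumer"] for d in deps if d["contract"] == contract})

print(blast_radius("user_profile.v3"))  # ['recommendation_service']
```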
💡 Key Takeaways
✓ At 10,000 events per second scale, schema changes without contracts corrupt dashboards and ML models within minutes
✓ Different consumers need different SLAs: fraud detection requires 500ms p99 latency, batch analytics needs 15 minute freshness with 99.9% completeness
✓ Producer CI pipelines validate schema changes against contracts before deployment, failing builds for incompatible changes
✓ Runtime validators at ingestion check schemas, null rates, uniqueness, and value ranges, routing invalid data to quarantine topics
✓ Downstream applications declare explicit dependencies on contract versions and required SLAs, helping platform teams prioritize infrastructure investments
📌 Examples
1. Order service at 5,000 writes/sec publishes to fraud models (500ms p99 SLA) and batch analytics (15min freshness SLA)
2. Schema registry enforces backward compatibility with a 90-day deprecation window: changing amount from cents to dollars requires a new version plus a deprecation period
3. Recommendation service depends on user_profile.v3 (p95 2min freshness) and recent_activity.v2 (99.5% completeness)