Implementation: Building the Contract Infrastructure
The Four Layer Architecture: Production grade contract systems need a registry layer, validation layer, observability layer, and governance layer. Each serves a distinct purpose in the contract lifecycle.
Layer One: Contract Registry
Producers define schemas, field level metadata, semantic descriptions, and SLA definitions in a central system. Contracts reference specific storage like message topics, event queues, or warehouse tables. Versioning is explicit: order_events.v1 versus order_events.v2 with machine readable compatibility rules.
Layer Two: Validation Pipeline
1. CI Time Validation: When developers change schemas, CI jobs validate against declared contracts. Checks include prohibited actions like dropping required fields, changing data types, or altering semantic units without versioning.
2. Ingestion Time Validation: Runtime validators check incoming data. Streaming systems use schema aware consumers that reject malformed messages. Batch systems run validation before loading: sampling or full scans checking schema, null percentages, uniqueness, and value range constraints.
3. Quarantine Handling: Invalid data routes to quarantine topics or tables for investigation rather than corrupting downstream systems.
Layer Three: Observability and SLIs
Service Level Indicators (SLIs) are computed continuously. For a clickstream topic, SLIs include ingestion lag at p95 and p99 percentiles, daily completeness (actual events versus expected from upstream counters), schema violation rate, and derived table availability. Each SLI has a target: "99 percent of events available within 10 minutes" or "schema violations under 0.1 percent."
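A minimal sketch of how the SLIs listed above might be computed from one day of data. The function names, the input shape (a list of per-event lags in seconds), and the `SliReport` type are assumptions for illustration, not part of any specific observability stack; only the targets (10 minutes, 99 percent, 0.1 percent) come from the text.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class SliReport:
    lag_p95_s: float            # 95th percentile ingestion lag, seconds
    lag_p99_s: float            # 99th percentile ingestion lag, seconds
    within_target_ratio: float  # share of events available within the lag target
    completeness: float         # actual events / expected events from upstream counters
    violation_rate: float       # schema violations / actual events


def compute_slis(lags_s, expected_count, violation_count, lag_target_s=600.0):
    """Compute SLIs from per-event ingestion lags (hypothetical input shape)."""
    cuts = quantiles(lags_s, n=100, method="inclusive")  # 99 percentile cut points
    actual = len(lags_s)
    return SliReport(
        lag_p95_s=cuts[94],
        lag_p99_s=cuts[98],
        within_target_ratio=sum(lag <= lag_target_s for lag in lags_s) / actual,
        completeness=actual / expected_count,
        violation_rate=violation_count / actual,
    )


def meets_targets(report, min_within=0.99, max_violation_rate=0.001):
    """Targets from the text: 99 percent within 10 minutes, violations under 0.1 percent."""
    return (report.within_target_ratio >= min_within
            and report.violation_rate < max_violation_rate)


# Synthetic day: 985 events arrive in 30 s, 15 events take 15 minutes.
report = compute_slis([30.0] * 985 + [900.0] * 15, expected_count=1000, violation_count=0)
print(report, meets_targets(report))
```

In this synthetic data the mean lag is about 43 seconds and p95 is 30 seconds, yet only 98.5 percent of events land within 10 minutes, so the target is missed. The p99 value is what exposes the slow tail.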
⚠️ Common Pitfall: Tracking only averages hides systematic failures. A pipeline promising 03:00 UTC delivery at 99.9% success can look healthy in aggregate yet miss the deadline at every month end, when volume spikes 3x. Use percentile based metrics and error budget tracking, and break results down by period so recurring failures stand out.
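The pitfall can be made concrete with a small sketch on synthetic data. The run records, the segment names, and the assumption that every month-end run is late are all hypothetical; the point is only that an aggregate rate and a per-segment rate can tell very different stories.

```python
import calendar
from datetime import date, timedelta


def is_month_end(d: date) -> bool:
    """True when d is the last calendar day of its month."""
    return d.day == calendar.monthrange(d.year, d.month)[1]


def on_time_rates(runs):
    """runs: iterable of (date, delivered_on_time). Returns (overall, by_segment)."""
    totals = {"month_end": [0, 0], "other": [0, 0]}  # segment -> [on_time, total]
    for day, on_time in runs:
        bucket = totals["month_end" if is_month_end(day) else "other"]
        bucket[0] += int(on_time)
        bucket[1] += 1
    on_time_all = sum(b[0] for b in totals.values())
    total_all = sum(b[1] for b in totals.values())
    by_segment = {name: b[0] / b[1] for name, b in totals.items() if b[1]}
    return on_time_all / total_all, by_segment


# Synthetic year: every daily run is on time except the last day of each month.
start = date(2023, 1, 1)
days = [start + timedelta(days=i) for i in range(365)]
runs = [(d, not is_month_end(d)) for d in days]

overall, by_segment = on_time_rates(runs)
print(f"overall on-time rate: {overall:.3f}")
print(by_segment)
```

The overall rate here is 353 of 365 runs, which looks tolerable on a dashboard, while the month-end segment is on time zero percent of the time.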
Layer Four: Governance Process
A data platform team owns the registry, common libraries, validators, and observability stack. Domain teams own their contracts and meet SLAs. Governance handles escalations: breaking change requests, SLA renegotiations when traffic grows from 10,000 to 100,000 events per second, or ownership transfers during team reorganizations.
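Whether a proposed schema change counts as a breaking change, and so needs a new version or an escalation, can be decided mechanically. The sketch below applies the three rules named under CI Time Validation (dropped required fields, changed data types, changed semantic units). `FieldSpec`, `breaking_changes`, and the example fields are hypothetical names, not the API of any particular schema registry.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    dtype: str
    required: bool = True
    unit: Optional[str] = None  # semantic unit, e.g. a currency


def breaking_changes(old: Dict[str, FieldSpec], new: Dict[str, FieldSpec]) -> List[str]:
    """Return the reasons a candidate schema is incompatible with the old one."""
    problems = []
    for name, spec in old.items():
        if name not in new:
            if spec.required:
                problems.append(f"dropped required field '{name}'")
            continue  # dropping an optional field is allowed in this sketch
        candidate = new[name]
        if candidate.dtype != spec.dtype:
            problems.append(f"type of '{name}' changed {spec.dtype} -> {candidate.dtype}")
        if candidate.unit != spec.unit:
            problems.append(f"unit of '{name}' changed {spec.unit} -> {candidate.unit}")
    return problems


# Hypothetical contract for order_events.v1 and a proposed change to it.
order_events_v1 = {
    "order_id": FieldSpec("string"),
    "amount": FieldSpec("decimal", unit="USD"),
    "coupon": FieldSpec("string", required=False),
}
proposed = {
    "order_id": FieldSpec("string"),
    "amount": FieldSpec("decimal", unit="cents"),
}

problems = breaking_changes(order_events_v1, proposed)
# Any finding means the change should ship as order_events.v2 or be escalated.
print(problems)
```

A CI job could fail the build whenever the returned list is non-empty, leaving only deliberate version bumps for the governance process to review.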
Error Budgets Drive Priority: If a pipeline burns its quarterly error budget (for example, 5 SLA violations against a budget of 2), the producers or the platform team must prioritize reliability work over new features. This creates accountability similar to site reliability engineering practices.
💡 Key Takeaways
✓ Four architectural layers: Registry for schemas and SLAs, Validation at CI and ingestion time, Observability for SLI tracking, Governance for escalations
✓ CI validation blocks incompatible schema changes before deployment; ingestion validation routes bad data to quarantine instead of corrupting downstream systems
✓ SLIs must track percentiles, not averages: a pipeline might report a 99.9% success rate overall but fail systematically at month end during 3x volume spikes
✓ Error budgets quantify allowed deviation per quarter; burning the budget triggers mandatory prioritization of reliability work
✓ Platform team owns infrastructure; domain teams own their contracts and are responsible for meeting SLAs; governance handles breaking changes and ownership transfers
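The ingestion time validation and quarantine routing summarized above might look like the following sketch. The contract shape (field name mapped to an expected Python type) and the function names are assumptions for illustration, and plain lists stand in for the quarantine topics or tables a real pipeline would write to.

```python
from typing import Dict, List, Tuple


def validate_record(record: dict, required: Dict[str, type]) -> List[str]:
    """Check one record against required fields and their expected types."""
    errors = []
    for field, expected_type in required.items():
        value = record.get(field)
        if value is None:
            errors.append(f"missing required field '{field}'")
        elif not isinstance(value, expected_type):
            errors.append(f"'{field}' should be {expected_type.__name__}")
    return errors


def route(records: List[dict], required: Dict[str, type]) -> Tuple[List[dict], List[dict]]:
    """Split a batch into accepted records and quarantined records with reasons."""
    accepted, quarantined = [], []
    for record in records:
        errors = validate_record(record, required)
        if errors:
            # Keep the original payload and the reasons so it can be investigated.
            quarantined.append({"record": record, "errors": errors})
        else:
            accepted.append(record)
    return accepted, quarantined


# Hypothetical contract and batch.
contract = {"order_id": str, "amount": float}
batch = [
    {"order_id": "o-1", "amount": 19.99},
    {"order_id": "o-2"},                 # missing amount
    {"order_id": 3, "amount": 5.0},      # wrong type for order_id
]

accepted, quarantined = route(batch, contract)
print(len(accepted), len(quarantined))
```

Storing the failure reasons next to each quarantined record keeps the investigation step cheap, and the ratio of quarantined to total records gives the schema violation rate SLI directly.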
📌 Examples
1. SLI configuration for clickstream: ingestion lag p95/p99, daily completeness rate, schema violation rate < 0.1%, derived sessions table availability
2. Error budget enforcement: pipeline with 2 violation budget that fails SLA 5 times in a quarter must pause features for reliability work
3. Governance escalation: traffic growing from 10k to 100k events/sec requires SLA renegotiation and infrastructure scaling
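The error budget enforcement in example 2 can be sketched as a small counter. `ErrorBudget` and `next_priority` are hypothetical names, and treating the budget as exceeded only once violations go past the allowed count is an assumption; a team could equally decide to act when the budget is merely used up.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    allowed_violations: int  # SLA violations permitted per quarter
    violations: int = 0

    def record_violation(self) -> None:
        self.violations += 1

    @property
    def remaining(self) -> int:
        """Violations left in the budget; negative once it is overspent."""
        return self.allowed_violations - self.violations

    @property
    def exceeded(self) -> bool:
        # Assumption: the budget is burned once violations go past the allowance.
        return self.violations > self.allowed_violations


def next_priority(budget: ErrorBudget) -> str:
    """Reliability work takes priority over features once the budget is exceeded."""
    return "reliability" if budget.exceeded else "features"


# Example 2: a budget of 2 violations and 5 SLA misses in the quarter.
budget = ErrorBudget(allowed_violations=2)
for _ in range(5):
    budget.record_violation()

print(budget.remaining, next_priority(budget))
```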