
Implementation: Building the Contract Infrastructure

The Four Layer Architecture: Production grade contract systems need a registry layer, validation layer, observability layer, and governance layer. Each serves a distinct purpose in the contract lifecycle.

Layer One: Contract Registry

Producers define schemas, field level metadata, semantic descriptions, and SLA definitions in a central system. Contracts reference specific storage like message topics, event queues, or warehouse tables. Versioning is explicit: order_events.v1 versus order_events.v2 with machine readable compatibility rules.

Layer Two: Validation Pipeline
1. CI Time Validation: When developers change schemas, CI jobs validate the change against the declared contracts. The checks flag prohibited changes such as dropping required fields, changing data types, or altering semantic units without a version bump.
2. Ingestion Time Validation: Runtime validators check incoming data. Streaming systems use schema aware consumers that reject malformed messages. Batch systems run validation before loading, using sampling or full scans to check schema, null percentages, uniqueness, and value range constraints.
3. Quarantine Handling: Invalid data is routed to quarantine topics or tables for investigation rather than corrupting downstream systems.
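
The validation steps above can be sketched in a few functions. This is a minimal illustration, not a reference implementation: `FieldSpec`, `ORDER_EVENTS_V1`, `breaking_changes`, `validate_record`, and `route` are hypothetical names, the contract fields are invented for the example, and a real system would read contracts from the registry rather than from an in-memory dict.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FieldSpec:
    """One field in a contract (hypothetical shape, for illustration only)."""
    type: type
    required: bool = True
    unit: Optional[str] = None  # semantic unit, e.g. "cents"


Contract = Dict[str, FieldSpec]

# Assumed example contract; the field names are not from any real system.
ORDER_EVENTS_V1: Contract = {
    "order_id": FieldSpec(str),
    "amount": FieldSpec(int, unit="cents"),
    "coupon": FieldSpec(str, required=False),
}


def breaking_changes(old: Contract, new: Contract) -> List[str]:
    """CI time check: list changes that should not ship without a new version."""
    problems = []
    for name, spec in old.items():
        if name not in new:
            if spec.required:
                problems.append(f"dropped required field '{name}'")
            continue
        if new[name].type is not spec.type:
            problems.append(f"changed type of '{name}'")
        if new[name].unit != spec.unit:
            problems.append(f"changed unit of '{name}'")
    return problems


def validate_record(record: Dict[str, Any], contract: Contract) -> List[str]:
    """Ingestion time check: return the violations found in one record."""
    errors = []
    for name, spec in contract.items():
        value = record.get(name)
        if value is None:
            if spec.required:
                errors.append(f"missing required field '{name}'")
        elif not isinstance(value, spec.type):
            errors.append(f"'{name}' is not {spec.type.__name__}")
    return errors


def route(records: List[Dict[str, Any]], contract: Contract) -> Tuple[list, list]:
    """Split a batch into accepted records and quarantined (record, errors) pairs."""
    accepted, quarantined = [], []
    for record in records:
        errors = validate_record(record, contract)
        if errors:
            quarantined.append((record, errors))
        else:
            accepted.append(record)
    return accepted, quarantined
```

In CI, a non-empty result from `breaking_changes` would fail the build unless the change is published as a new version such as order_events.v2. At ingestion, the quarantined pairs keep the violation list next to each record so that investigation does not require re-running validation.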
Layer Three: Observability and SLIs

Service Level Indicators (SLIs) are computed continuously. For a clickstream topic, SLIs include ingestion lag at the 95th and 99th percentiles (p95 and p99), daily completeness (actual events versus the count expected from upstream counters), schema violation rate, and derived table availability. Each SLI has a target: "99 percent of events available within 10 minutes" or "schema violations under 0.1 percent."
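
As a rough sketch of how such SLIs might be computed and compared with their targets, the following uses a nearest-rank percentile over per-event ingestion lags. The function names, the input shape, and the result keys are assumptions made for illustration. The two targets encoded in `check_targets` are the ones quoted above: 10 minutes is expressed as 600 seconds and 0.1 percent as 0.001.

```python
import math
from typing import Dict, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def compute_slis(lags_sec: List[float], received: int, expected: int,
                 violations: int) -> Dict[str, float]:
    """Compute SLIs for one window (hypothetical inputs: lags in seconds, event counts)."""
    return {
        "lag_p95_sec": percentile(lags_sec, 95),
        "lag_p99_sec": percentile(lags_sec, 99),
        "completeness": received / expected,      # actual versus upstream counter
        "violation_rate": violations / received,  # share of events failing the schema
    }


def check_targets(slis: Dict[str, float]) -> Dict[str, bool]:
    """Compare SLIs with the two example targets from the text."""
    return {
        "lag_p99_within_10_min": slis["lag_p99_sec"] <= 600,
        "violations_under_0.1_pct": slis["violation_rate"] < 0.001,
    }


# Usage: 100 lag samples where two events arrived very late.
slis = compute_slis([30] * 98 + [900, 1200], received=99_950,
                    expected=100_000, violations=50)
status = check_targets(slis)
```

Here p95 is still 30 seconds, but p99 is 900 seconds, so the lag target is reported as missed even though most events arrived quickly.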
⚠️ Common Pitfall: Tracking only averages hides systematic failures. A pipeline promising 03:00 UTC delivery at 99.9% success might fail every month end when volume spikes 3x. Use percentile based metrics and error budget tracking.
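
A small worked example makes the pitfall concrete. The numbers are invented for illustration: a 30 day month in which the job finishes 30 minutes before the 03:00 UTC deadline on every day except month end, when it finishes 60 minutes late.

```python
from statistics import mean

# Hypothetical month: minutes past the 03:00 UTC deadline per day (negative = early).
delays_min = [-30] * 29 + [60]  # the last entry is the month end run

average_delay = mean(delays_min)  # -27: on average the job looks comfortably early
on_time_rate = sum(d <= 0 for d in delays_min) / len(delays_min)  # 29 of 30 days
late_days = [day for day, d in enumerate(delays_min, start=1) if d > 0]
```

The average suggests a 27 minute safety margin, yet the on-time rate is about 96.7 percent, well below a 99.9 percent target, and the only late day is always day 30. Tracking the on-time rate and the late days directly exposes the pattern that the average hides.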
Layer Four: Governance Process

A data platform team owns the registry, common libraries, validators, and observability stack. Domain teams own their contracts and are responsible for meeting their SLAs. Governance handles escalations: breaking change requests, SLA renegotiations when traffic grows from 10,000 to 100,000 events per second, or ownership transfers during team reorganizations.

Error Budgets Drive Priority: If a pipeline burns its quarterly error budget (for example, it violates its SLA 5 times when the budget allows 2 violations), the producers or the platform team must prioritize reliability work over new features. This creates accountability similar to site reliability engineering practices.
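
The error budget rule can be expressed as a small piece of bookkeeping. This is a sketch under assumptions: the `ErrorBudget` class and its attribute names are hypothetical, and the budget is treated as overspent once violations exceed the allowed count for the quarter.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Quarterly SLA violation budget for one pipeline (hypothetical model)."""
    allowed_violations: int
    violations: int = 0

    def record_violation(self) -> None:
        self.violations += 1

    @property
    def remaining(self) -> int:
        # Negative once the pipeline has violated its SLA more often than allowed.
        return self.allowed_violations - self.violations

    @property
    def overspent(self) -> bool:
        return self.violations > self.allowed_violations

    def next_priority(self) -> str:
        """Assumed policy: reliability work takes over once the budget is overspent."""
        return "reliability" if self.overspent else "features"


# Usage: the example from the text, 5 violations against a budget of 2.
budget = ErrorBudget(allowed_violations=2)
for _ in range(5):
    budget.record_violation()
```

After the loop, `budget.remaining` is -3 and `budget.next_priority()` returns "reliability", which is the signal that feature work should pause.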
💡 Key Takeaways
Four architectural layers: Registry for schemas and SLAs, Validation at CI and ingestion time, Observability for SLI tracking, Governance for escalations
CI validation blocks incompatible schema changes before deployment; ingestion validation routes bad data to quarantine instead of corrupting downstream
SLIs must track percentiles, not averages: a pipeline with a 99.9% target can look healthy on average yet fail systematically at every month end when volume spikes 3x
Error budgets quantify allowed deviation per quarter; burning the budget triggers mandatory reliability work prioritization
The platform team owns the infrastructure; domain teams own their contracts and are responsible for meeting their SLAs; governance handles breaking changes and ownership transfers
📌 Examples
1. SLI configuration for clickstream: ingestion lag p95/p99, daily completeness rate, schema violation rate < 0.1%, derived sessions table availability
2. Error budget enforcement: pipeline with 2 violation budget that fails SLA 5 times in a quarter must pause features for reliability work
3. Governance escalation: traffic growing from 10k to 100k events/sec requires SLA renegotiation and infrastructure scaling