Data Contracts and Expectation-Based Monitoring
WHAT ARE DATA CONTRACTS
A data contract is a formal specification of what data should look like. It defines expected schema (column names, types), value constraints (ranges, allowed categories), freshness requirements, and statistical properties (expected distributions, correlations).
Contracts make implicit assumptions explicit. Instead of hoping data is correct, you define what correct means and validate against it. When contracts are violated, you get alerts before bad data reaches the model.
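To make the idea concrete, here is a minimal sketch of a contract written as plain Python data. The names ColumnSpec and DataContract, the "plan" column, and its categories are hypothetical and chosen only for illustration; they are not the API of any particular tool. The sketch covers schema, value constraints, and freshness, and leaves statistical properties out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ColumnSpec:
    """Expected type and value constraints for one column (hypothetical)."""
    dtype: type
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    allowed: Optional[frozenset] = None


@dataclass
class DataContract:
    """Schema, value constraints, and a freshness requirement in one place."""
    columns: dict        # column name -> ColumnSpec
    max_age: timedelta   # how old the data may be before it counts as stale

    def violations(self, rows, last_updated, now=None):
        """Return a list of human-readable violations (empty means valid)."""
        now = now or datetime.now(timezone.utc)
        problems = []
        if now - last_updated > self.max_age:
            problems.append("stale: data older than max_age")
        for i, row in enumerate(rows):
            for name, spec in self.columns.items():
                if name not in row:
                    problems.append(f"row {i}: missing column {name}")
                    continue
                value = row[name]
                # Report at most one problem per value, most basic first.
                if not isinstance(value, spec.dtype):
                    problems.append(f"row {i}: {name} has wrong type")
                elif spec.min_value is not None and value < spec.min_value:
                    problems.append(f"row {i}: {name} below minimum")
                elif spec.max_value is not None and value > spec.max_value:
                    problems.append(f"row {i}: {name} above maximum")
                elif spec.allowed is not None and value not in spec.allowed:
                    problems.append(f"row {i}: {name} not an allowed category")
        return problems


# Illustrative contract; column names and limits are assumptions.
contract = DataContract(
    columns={
        "user_age": ColumnSpec(int, min_value=0, max_value=150),
        "plan": ColumnSpec(str, allowed=frozenset({"free", "pro"})),
    },
    max_age=timedelta(hours=24),
)
```

Calling contract.violations(rows, last_updated) on each incoming batch returns an empty list when the batch satisfies the contract, so the caller can alert on anything else before the data reaches the model.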
EXPECTATION-BASED MONITORING
Expectations are specific, testable conditions. Examples: user_age BETWEEN 0 AND 150, null_rate(email) < 0.01, unique_count(user_id) > 100000.
Tools like Great Expectations and dbt tests encode expectations as code. Each expectation runs against incoming data. Failures trigger alerts or block pipelines.
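The three example expectations can be sketched in plain Python, without any framework, to show the shape of "expectations as code". This is not the Great Expectations or dbt API. The helper names (null_rate, unique_count, build_expectations, failed_expectations) are assumptions for illustration, and rows are assumed to be dictionaries.

```python
def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def unique_count(rows, column):
    """Number of distinct non-null values in the column."""
    return len({r[column] for r in rows if r.get(column) is not None})


def ages_in_range(rows, low=0, high=150):
    """True when every non-null user_age lies in [low, high]."""
    return all(low <= r["user_age"] <= high
               for r in rows if r.get("user_age") is not None)


def build_expectations(min_unique_users=100_000):
    """Return (name, check) pairs; each check takes the batch of rows."""
    return [
        ("user_age BETWEEN 0 AND 150", ages_in_range),
        ("null_rate(email) < 0.01",
         lambda rows: null_rate(rows, "email") < 0.01),
        ("unique_count(user_id) > %d" % min_unique_users,
         lambda rows: unique_count(rows, "user_id") > min_unique_users),
    ]


def failed_expectations(rows, expectations):
    """Run every expectation and return the names of those that failed."""
    return [name for name, check in expectations if not check(rows)]
```

A pipeline step could call failed_expectations on each incoming batch and either send an alert or raise an exception to block downstream stages when the returned list is non-empty.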
BUILDING EFFECTIVE CONTRACTS
Start from training data: Profile your training data. What were the value ranges? Null rates? Cardinalities? Use these as baseline expectations.
Add domain knowledge: Some constraints cannot be learned from training data but follow from the domain. User ages cannot be negative. For most products, prices should not exceed $1M.
Allow for expected variation: Do not set constraints too tight. A null rate that normally varies between 4% and 6% around a 5% baseline should not trigger alerts. Set thresholds with a buffer for normal variation.
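One way to turn a profiled baseline into a threshold with a buffer is sketched below. It assumes you have null rates measured on several historical (training-period) batches; the function names and the default margin of 0.005 are illustrative assumptions, not recommended values.

```python
def tolerance_band(historical_rates, margin=0.005):
    """Build an alert band from observed rates, widened by a safety margin.

    historical_rates: null rates measured on past batches, e.g. [0.04, 0.05, 0.06].
    margin: extra slack added on each side so normal variation does not alert.
    """
    low = max(0.0, min(historical_rates) - margin)
    high = min(1.0, max(historical_rates) + margin)
    return low, high


def needs_alert(observed_rate, band):
    """True when the observed rate falls outside the tolerance band."""
    low, high = band
    return not (low <= observed_rate <= high)


# Baseline around 5% that varied between 4% and 6% in past batches.
band = tolerance_band([0.04, 0.05, 0.06])
print(needs_alert(0.045, band))  # inside the band, no alert expected
print(needs_alert(0.10, band))   # well outside the band, alert expected
```

Using the observed minimum and maximum keeps the band tied to what the training data actually looked like, while the margin absorbs small day-to-day fluctuations. Hard domain limits, such as a non-negative age, would still be checked separately as fixed constraints.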
CONTRACT EVOLUTION
Contracts are not static. As products evolve, expectations change. New categories appear. Value ranges expand. Review and update contracts quarterly or when major changes ship.
Version contracts alongside data schemas. When the schema changes, update the contract with it. Keep the contract history for debugging (what were the expectations when this bug occurred?).
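A small sketch of keeping contract history so the debugging question can be answered later. The ContractHistory class and its method names are hypothetical, and in practice the history might simply live in version control next to the schema. The example dates and expectation strings are made up for illustration.

```python
from bisect import bisect_right
from datetime import date


class ContractHistory:
    """Keeps contract versions ordered by the date they took effect."""

    def __init__(self):
        self._effective = []  # sorted effective-from dates
        self._versions = []   # contract versions, parallel to _effective

    def publish(self, effective_from, schema_version, expectations):
        """Record a contract version tied to a schema version."""
        idx = bisect_right(self._effective, effective_from)
        self._effective.insert(idx, effective_from)
        self._versions.insert(idx, {
            "schema_version": schema_version,
            "expectations": list(expectations),
        })

    def in_effect_on(self, day):
        """Return the contract version active on the given day, or None."""
        idx = bisect_right(self._effective, day)
        return self._versions[idx - 1] if idx else None


history = ContractHistory()
history.publish(date(2024, 1, 1), "v1", ["user_age BETWEEN 0 AND 150"])
history.publish(date(2024, 4, 1), "v2", ["user_age BETWEEN 0 AND 150",
                                         "null_rate(email) < 0.01"])

# Which expectations applied when a bug occurred on 2024-02-15?
print(history.in_effect_on(date(2024, 2, 15))["schema_version"])  # v1
```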