Expand and Contract Pattern for Safe Schema Evolution
The Deployment Problem:
How do you change a schema in a live system with millions of events per second and hundreds of consumers, some of which deploy monthly? The naive approach is to change the schema and hope everyone upgrades quickly. In practice, this causes outages, data loss, and weeks of incident response.
The expand and contract pattern is a disciplined three-phase approach, used at companies like LinkedIn and Netflix, for evolving schemas safely over months. It allows gradual migration without breaking existing consumers.
Phase One: Expand:
First, you expand the schema in a backward-compatible way. If you want to rename total to total_amount, you do not remove total yet. Instead, you add total_amount as a new optional field and write both fields with the same value. Producers now emit dual-format data.
This phase requires no consumer changes. Old consumers continue reading total. New consumers can start reading total_amount. Both work simultaneously. You run in this dual mode for a migration window, typically measured in weeks or months depending on your deployment cadence.
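The dual-write step can be sketched in a few lines. This is a minimal illustration assuming JSON-encoded events; the function names and the order_id field are hypothetical and not taken from any particular producer library.

```python
import json
from decimal import Decimal


def build_payment_event(order_id: str, total: Decimal) -> dict:
    """Build an expand-phase event that carries both the old and the new field."""
    amount = str(total)  # serialize once so both fields are guaranteed identical
    return {
        "order_id": order_id,    # hypothetical key field, for illustration only
        "total": amount,         # legacy field, still read by old consumers
        "total_amount": amount,  # new field, optional during the expand phase
    }


def legacy_consumer(raw: bytes) -> str:
    # An unmodified consumer: it only knows about `total`.
    return json.loads(raw)["total"]


def new_consumer(raw: bytes) -> str:
    # An upgraded consumer: it reads the new field name.
    return json.loads(raw)["total_amount"]


raw = json.dumps(build_payment_event("o-1", Decimal("19.99"))).encode()
print(legacy_consumer(raw), new_consumer(raw))
```

Writing both fields from a single serialized value keeps them from drifting apart, which matters because old and new consumers will be compared against each other during the migration window.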
For a high-volume topic at 200 thousand events per second, writing both fields increases payload size by perhaps 10 to 20 percent when the duplicated numeric field sits in a small record; the exact figure depends on record size and encoding. This costs extra network and storage but buys you safety. The alternative is coordinating a lockstep deployment across hundreds of consumers, which is impractical at that scale.
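The cost of dual writing is easy to estimate before committing to it. The sketch below assumes an 8-byte duplicated numeric field in a 60-byte record; both sizes are illustrative assumptions, and real numbers depend on the serialization format, compression, and replication factor.

```python
def dual_write_overhead(record_bytes: int, extra_bytes: int) -> float:
    """Fractional payload growth from duplicating one field."""
    return extra_bytes / record_bytes


def extra_storage_bytes(events_per_sec: int, extra_bytes: int, days: int) -> int:
    """Raw extra bytes written over the window, before compression or replication."""
    seconds = days * 24 * 60 * 60
    return events_per_sec * extra_bytes * seconds


# Assumed sizes: an 8-byte duplicated numeric field in a 60-byte record.
overhead = dual_write_overhead(record_bytes=60, extra_bytes=8)
extra = extra_storage_bytes(events_per_sec=200_000, extra_bytes=8, days=28)
print(f"{overhead:.1%} larger payload, {extra / 1e12:.2f} TB extra over 28 days")
```

Under these assumptions the duplicated field adds a few terabytes of raw data over a four-week window, which is usually small next to the cost of an outage.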
Phase Two: Migrate Consumers:
While the schema is still expanded, you update consumers to read the new field. This happens gradually. Critical services upgrade first, then batch jobs, then long-tail BI dashboards. You monitor consumer lag and error rates to confirm that nothing is breaking.
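A migrated consumer can read defensively while the rollout is in progress. The helper below is a hypothetical sketch: it prefers total_amount and falls back to total, which covers events written before the producer started dual writing.

```python
from typing import Any, Mapping


def read_total_amount(event: Mapping[str, Any]) -> Any:
    """Prefer the new field; fall back to the legacy one for older events."""
    if event.get("total_amount") is not None:
        return event["total_amount"]
    if "total" in event:
        return event["total"]
    raise KeyError("event has neither total_amount nor total")


print(read_total_amount({"total": 12}))                      # pre-expand event
print(read_total_amount({"total": 12, "total_amount": 12}))  # dual-format event
```

The explicit `is not None` check means a legitimate value of zero in total_amount is still honored rather than silently replaced by the legacy field.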
Some consumers may never upgrade. For these, you maintain the old field indefinitely or deprecate it with a long sunset period. At Netflix scale, some internal tools deploy quarterly. You cannot block schema evolution on their schedule, so you design for partial migration.
✓ In Practice: LinkedIn uses governance tools to track which consumers read which fields. Before contracting (removing old fields), they run reports showing zero reads of the deprecated field over a 30 day window. Only then is it safe to proceed to phase three.
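The zero-reads check can be approximated with a simple query over field-access records. This sketch assumes a hypothetical read log of (consumer, field, day) tuples; it does not describe LinkedIn's actual tooling, whose details the text does not give.

```python
from datetime import date, timedelta


def remaining_readers(read_log, field, today, window_days=30):
    """Consumers that read `field` within the last `window_days` days."""
    cutoff = today - timedelta(days=window_days)
    return sorted({c for c, f, day in read_log if f == field and day > cutoff})


def safe_to_contract(read_log, field, today, window_days=30):
    """True only if nobody read `field` inside the observation window."""
    return not remaining_readers(read_log, field, today, window_days)


# Hypothetical read log: (consumer name, field read, day of the read).
read_log = [
    ("billing", "total_amount", date(2024, 6, 29)),
    ("bi-dashboard", "total", date(2024, 6, 10)),
    ("ledger", "total", date(2024, 5, 1)),
]
today = date(2024, 6, 30)
print(safe_to_contract(read_log, "total", today))
print(remaining_readers(read_log, "total", today))
```

Returning the names of the remaining readers, rather than only a boolean, gives the schema owner a concrete list of teams to contact before the contract phase.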
Phase Three: Contract:
After all critical consumers have migrated and you have verified zero usage of the old field, you contract the schema by removing total. Producers stop writing it. The schema is now in its target state, with only total_amount.
This phase requires careful validation. You check audit logs, consumer metrics, and run contract tests to ensure no hidden dependencies. A premature contract will break unmigrated consumers, forcing a rollback and restarting the entire process.
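One form of contract test compares the fields each consumer declares it needs against the target schema. The consumer names and the registry dictionary below are hypothetical; in practice this information might come from a schema registry or from consumer-owned test suites.

```python
# Fields that will exist once the contract phase completes.
TARGET_SCHEMA_FIELDS = {"order_id", "total_amount"}

# Hypothetical registry of the fields each consumer requires.
CONSUMER_REQUIRED_FIELDS = {
    "billing": {"order_id", "total_amount"},
    "fraud-check": {"order_id", "total_amount"},
    "legacy-report": {"order_id", "total"},
}


def broken_consumers(schema_fields, required_by_consumer):
    """Map each consumer to the required fields the target schema lacks."""
    return {
        name: sorted(fields - schema_fields)
        for name, fields in required_by_consumer.items()
        if fields - schema_fields
    }


blockers = broken_consumers(TARGET_SCHEMA_FIELDS, CONSUMER_REQUIRED_FIELDS)
print(blockers or "safe to contract")
```

An empty result is a necessary condition for contracting, not a sufficient one: it only covers dependencies that consumers have declared, so it complements the audit logs and read metrics rather than replacing them.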
Trade Offs and Costs:
Expand and contract is slow and expensive. For a high value schema change, the full cycle may take three months. You pay storage and compute costs for dual format data during the expand phase. You also pay coordination costs, as schema changes require governance review and consumer communication.
The alternative is faster but riskier. You could use breaking changes with version bumps, creating payment_event_v1 and payment_event_v2 as separate topics. This avoids the expand phase but fragments your data model. Downstream joins and aggregations now need to union across versions, adding complexity forever.
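The lasting cost of version bumps shows up in every downstream job, which has to normalize each version before combining them. The sketch below assumes payment_event_v1 carries total and payment_event_v2 carries total_amount; that mapping is an illustrative assumption, not something the topic names guarantee.

```python
def normalize(topic: str, event: dict) -> dict:
    """Map an event from either topic version onto one common shape."""
    if topic == "payment_event_v1":
        return {"order_id": event["order_id"], "total_amount": event["total"]}
    if topic == "payment_event_v2":
        return {"order_id": event["order_id"], "total_amount": event["total_amount"]}
    raise ValueError(f"unknown topic version: {topic}")


def unified_total(streams: dict) -> int:
    """Sum amounts across all topic versions after normalizing each event."""
    return sum(
        normalize(topic, event)["total_amount"]
        for topic, events in streams.items()
        for event in events
    )


streams = {
    "payment_event_v1": [{"order_id": "a", "total": 10}, {"order_id": "b", "total": 5}],
    "payment_event_v2": [{"order_id": "c", "total_amount": 7}],
}
print(unified_total(streams))
```

Every new version adds another branch to `normalize`, and every aggregation that forgets to call it silently undercounts, which is the complexity the paragraph above warns about.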
Most production systems use expand and contract for core shared schemas and accept breaking version bumps for experimental or low value data. The decision depends on how many consumers you have and how critical correctness is. A payments pipeline will invest months in expand and contract. A debug logging pipeline will break and fix.
💡 Key Takeaways
• The expand phase adds new fields while keeping old ones, writing dual-format data that can increase payload size by 10 to 20 percent for the migration window
• The migration phase updates consumers gradually over weeks or months, with governance tools tracking which consumers read which fields before contraction is allowed
• The contract phase removes old fields only after verifying zero usage over 30 days, preventing premature removal that would break unmigrated consumers
• A full expand and contract cycle for a high-value schema can take about 3 months at companies like LinkedIn and Netflix; the extra storage for dual-format data at 200 thousand events per second is the price of safety
• The alternative of version bumping (payment_event_v1, payment_event_v2) avoids the expand phase but fragments the data model, requiring downstream joins to union across versions forever
📌 Examples
LinkedIn payment event rename: Expand by adding total_amount while keeping total for 6 weeks. Migrate 200 consumers over 8 weeks, monitoring reads. Contract after 30 days of zero total reads. Full cycle: 14 weeks.
Netflix high volume topic: At 200k events/sec, an expand phase writing both total and total_amount adds up to 20% payload size. Over a 4 week expand phase, duplicating a 4-byte field adds roughly 1.9 TB of raw data (200k * 4 bytes * 60 sec * 60 min * 24 hr * 28 days ≈ 1.94 TB), before replication and serialization overhead.