Failure Modes: Backfills, Hot Keys, and Drift in Denormalized Systems
Schema Evolution Challenge
Adding a field to denormalized data requires backfilling existing rows. A new discount_percentage field on 100 million product rows takes hours to backfill. During the backfill, some rows have the field and others do not, so application code must handle both. Strategy: deploy code that handles the missing field gracefully, backfill incrementally (batches of 10,000 rows with pauses so the database is not overwhelmed), verify that every row was updated, then make the field required. In a normalized schema the new field lives in one place; in a denormalized system it must be written into millions of copies.
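The batched backfill can be sketched as follows. This is a minimal illustration using SQLite as a stand-in for the production database; the `products` table, the default value of 0.0, and the batch/pause parameters are assumptions for the example, not a prescribed implementation.

```python
import sqlite3
import time

def backfill_discount(conn, batch_size=10_000, pause_s=0.5):
    """Backfill the new column in small batches, pausing between batches
    so the write load never overwhelms the database."""
    while True:
        # Find the next batch of rows still missing the field.
        rows = conn.execute(
            "SELECT id FROM products WHERE discount_percentage IS NULL LIMIT ?",
            (batch_size,),
        ).fetchall()
        if not rows:
            break  # nothing left to backfill: every row now has the field
        conn.executemany(
            "UPDATE products SET discount_percentage = 0.0 WHERE id = ?",
            rows,
        )
        conn.commit()
        time.sleep(pause_s)  # breathing room between batches

# Demo on a tiny in-memory table standing in for the 100M-row catalog.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, "
    "discount_percentage REAL)"
)
conn.executemany("INSERT INTO products (name) VALUES (?)",
                 [(f"p{i}",) for i in range(25)])
backfill_discount(conn, batch_size=10, pause_s=0)
remaining = conn.execute(
    "SELECT COUNT(*) FROM products WHERE discount_percentage IS NULL"
).fetchone()[0]
```

The loop doubles as the verification step: it only exits once a query finds no rows missing the field, after which the column can safely be made required.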
Storage Cost Multiplication
Denormalization multiplies storage. A normalized product catalog might be 5 GB. Denormalizing into a search index, a recommendation cache, a listing cache, and a mobile API cache could total 25+ GB, a 5x increase. At cloud storage rates of $20-50/TB/month, that multiplier turns each TB of source data into roughly 5 TB stored, costing $100-250/month. Factor this into architecture decisions. The justification is usually compute savings: avoiding expensive joins saves more than the storage costs.
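The arithmetic is worth making explicit. A small sketch, using the 5x multiplier and the $20-50/TB/month rate range from the paragraph above; treating the denormalized footprint as a flat multiple of the source is a simplification, since real copies vary in size per store.

```python
def monthly_storage_cost(source_tb, multiplier, rate_per_tb_month):
    """Monthly storage bill once the source is fanned out into
    denormalized copies totalling `multiplier` times the source size."""
    return source_tb * multiplier * rate_per_tb_month

# 1 TB of source data at a 5x denormalization multiplier,
# evaluated at both ends of the $20-50/TB/month range:
low = monthly_storage_cost(1, 5, 20)
high = monthly_storage_cost(1, 5, 50)
```

Running this for 1 TB of source data gives a $100-250/month range, which is the number to weigh against the compute saved by avoiding joins.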
Debugging Stale Data
When users report seeing incorrect data, is the source wrong or is propagation lagging? Build observability: log event publish timestamps, consumer processing timestamps, and current source values alongside denormalized values. When investigating, query both source and denormalized stores with the same ID and compare. If source shows $15 but denormalized shows $10, the issue is propagation lag or a stuck consumer. Include version numbers to determine when the denormalized copy was last updated.
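The compare-and-diagnose step can be expressed as a small helper. This is a sketch under the assumption that every row carries a monotonically increasing version number, as suggested above; the field names (`price`, `version`) and the plain-dict records are hypothetical.

```python
def diagnose_staleness(source_row, denorm_row):
    """Compare a source record with its denormalized copy, using the
    version number to tell propagation lag from genuine divergence."""
    if source_row["price"] == denorm_row["price"]:
        return "consistent"
    if denorm_row["version"] < source_row["version"]:
        # Copy is behind the source: lag or a stuck consumer.
        return "stale: propagation lag or stuck consumer"
    # Same version but different values: the write path itself is suspect.
    return "divergent: same version, different value"

# The $15-vs-$10 scenario from the text: same ID queried in both stores.
source = {"id": 42, "price": 15.00, "version": 7}
denorm = {"id": 42, "price": 10.00, "version": 5}
verdict = diagnose_staleness(source, denorm)
```

Logging the event-publish and consumer-processing timestamps alongside this verdict then tells you how far behind the copy is, not just that it is behind.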
Handling Deletes
Deleting source data must propagate to all denormalized copies. If a product is discontinued but its denormalized listing remains, users see ghost products. The propagation pattern: the source delete publishes a tombstone event (a marker indicating deletion), and consumers remove the denormalized row. If any consumer misses the tombstone, the ghost persists. Defensive design: prefer is_active flags over hard deletes so a missed update leaves a filterable row rather than a ghost, and run periodic reconciliation jobs that compare source and denormalized stores and remove orphans.
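The reconciliation job described above can be sketched in a few lines. This is a minimal in-memory illustration; plain dicts stand in for the source database and the denormalized store, and the `is_active` flag follows the soft-delete suggestion.

```python
def reconcile(source, denorm_store):
    """Periodic reconciliation: drop denormalized rows whose source record
    is gone or marked inactive, so ghost products cannot persist."""
    live_ids = {pid for pid, row in source.items() if row.get("is_active")}
    orphans = [pid for pid in list(denorm_store) if pid not in live_ids]
    for pid in orphans:
        del denorm_store[pid]  # remove the stale denormalized copy
    return orphans

source = {
    1: {"name": "widget", "is_active": True},
    2: {"name": "gadget", "is_active": False},  # discontinued (soft delete)
}
# Store still holds the discontinued product plus a row whose source
# record was hard-deleted entirely: a ghost the tombstone never reached.
denorm_store = {1: {"name": "widget"},
                2: {"name": "gadget"},
                3: {"name": "ghost"}}
removed = reconcile(source, denorm_store)
```

Note that the job catches both failure modes: rows soft-deleted at the source and rows whose tombstone event a consumer missed.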