
Consistency Models and Convergence in Denormalized Systems

The Staleness Problem

Denormalized data becomes stale the moment its source changes. If a product price updates at 10:00:00 but the denormalized product listing cache updates at 10:00:05, users see the old price for 5 seconds. This staleness window is unavoidable in async propagation. The design question: what staleness can your domain tolerate? Social feeds can accept seconds to minutes. Price displays might tolerate seconds. Inventory availability for checkout should be real-time (no denormalization).

Sync vs Async Propagation

Synchronous updates (update denormalized copies in the same transaction as the source write) guarantee consistency but add latency and new failure modes. If a write updates 10 denormalized stores, any single failure rolls back the entire transaction. Asynchronous updates (publish change events; consumers update the copies) are faster and more resilient but introduce staleness. Most production systems choose async with explicit staleness SLOs: for example, 95% of changes visible within 5 seconds and 99% within 30 seconds.
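An async SLO is only meaningful if it is measured. A minimal sketch of checking observed propagation lags (source commit time to projection apply time) against the 5-second p95 / 30-second p99 targets; the `percentile` helper and the sample lag values are illustrative:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a list of lag samples (seconds)."""
    ordered = sorted(values)
    index = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[index]

def meets_slo(lags, p95_limit=5.0, p99_limit=30.0):
    """True if 95% of changes are visible within 5s and 99% within 30s."""
    return (percentile(lags, 95) <= p95_limit and
            percentile(lags, 99) <= p99_limit)

# Illustrative telemetry: 19 fast propagations plus one 12-second spike.
lags = [0.1 * i for i in range(1, 20)] + [12.0]
print(meets_slo(lags))  # True: the spike fits inside the p99 budget
```

The spike blows past the p95 target for one sample but stays inside the p99 budget, which is exactly the flexibility a two-tier SLO buys.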

Change Data Capture Pattern

Change Data Capture (CDC) reads the database transaction log to publish change events. Every committed write to the normalized table automatically becomes an event in a message stream. Consumers process events and update denormalized stores. This is more reliable than application-level dual writes because the database guarantees the log captures all changes. If the consumer crashes, it resumes from its last checkpoint. Typical propagation lag: 100-500 ms under normal load, seconds during spikes.
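A minimal sketch of the consumer side of CDC, assuming an in-memory list stands in for the change stream and a dict for the checkpoint store (in a real deployment both would be durable, e.g. Kafka offsets); all names here are illustrative:

```python
# Simulated transaction-log events, each with a monotonically
# increasing offset assigned by the stream.
log = [
    {"offset": 0, "table": "products", "id": 1, "price": 10},
    {"offset": 1, "table": "products", "id": 1, "price": 15},
    {"offset": 2, "table": "products", "id": 2, "price": 7},
]

denormalized = {}             # projection: product id -> latest row
checkpoint = {"offset": -1}   # last offset successfully applied

def consume(events):
    """Apply events past the checkpoint; advance it only after applying."""
    for event in events:
        if event["offset"] <= checkpoint["offset"]:
            continue  # already applied before a crash/restart
        denormalized[event["id"]] = event
        checkpoint["offset"] = event["offset"]

consume(log)  # first run
consume(log)  # simulated restart: replayed events are skipped
print(denormalized[1]["price"])  # 15
```

Because the checkpoint advances only after the projection write, a crash between the two causes a replay, never a gap, which is why the projection update must also be idempotent.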

Handling Out-of-Order Updates

Distributed message systems may deliver events out of order. If a price changes from $10 → $15 → $12, events might arrive as $15, $10, $12. Without versioning, the denormalized store shows $10 (last processed) instead of $12 (actual). Solution: include a version number or timestamp in each event. Consumers apply only updates with a version greater than the one currently stored; older events are discarded. This makes consumers idempotent: processing the same event twice produces the same result.
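The versioning rule can be sketched directly; the event and store shapes below are assumptions for illustration:

```python
store = {}  # product id -> {"price": ..., "version": ...}

def apply_event(event):
    """Version-gated, idempotent upsert: apply only strictly newer events."""
    current = store.get(event["id"], {"version": 0})
    if event["version"] <= current["version"]:
        return False  # stale or duplicate: discard
    store[event["id"]] = {"price": event["price"],
                          "version": event["version"]}
    return True

# Price changes $10 (v1) -> $15 (v2) -> $12 (v3), delivered out of order:
for e in [{"id": 1, "price": 15, "version": 2},
          {"id": 1, "price": 10, "version": 1},   # stale, discarded
          {"id": 1, "price": 12, "version": 3}]:
    apply_event(e)

print(store[1]["price"])  # 12
```

Replaying any of these events a second time returns False and leaves the store unchanged, which is the idempotence property the section describes.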

💡 Key Takeaways
Eventual consistency staleness budgets must be explicit: Meta and Pinterest target 95% of changes visible within 5 seconds and 99% within 30 seconds; measure end-to-end lag from source event timestamp to projection apply time
Cross-region replication lag consumes the budget: 100-300 ms at p95 for inter-region replication plus queueing delay leaves little room for processing; small batch sizes (10-100 events) and parallel consumers are needed to stay under a 5-second p95 SLO
Idempotent updates with monotonic versioning prevent divergence: consumers apply only newer versions and discard stale out-of-order events; for counters, shard increments across 64-128 buckets and reconcile every 10-60 minutes to correct drift (typically low single-digit percent without reconciliation)
Dual writes (the application updates source and projection separately) diverge on partial failure; the outbox pattern fixes this: write the source row and the event to an outbox table in a single transaction, then let a separate process publish to the change stream with at-least-once delivery
Read-your-writes consistency across regions needs sticky routing (read from the write region) or version tokens (the client passes its write version; the read blocks until the projection catches up); an alternative is explicit staleness indicators in the UI showing data freshness
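The outbox pattern from the takeaways can be sketched with SQLite: the source update and the outbox event share one transaction, so a partial failure can never record one without the other. Table names and the relay step are illustrative:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, price REAL)")
conn.execute("CREATE TABLE outbox "
             "(seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)")
conn.execute("INSERT INTO products VALUES (1, 10.0)")
conn.commit()

def update_price(product_id, new_price):
    """Write the source row and the change event atomically."""
    with conn:  # one transaction: both statements commit or both roll back
        conn.execute("UPDATE products SET price = ? WHERE id = ?",
                     (new_price, product_id))
        conn.execute("INSERT INTO outbox (payload) VALUES (?)",
                     (json.dumps({"id": product_id, "price": new_price}),))

update_price(1, 12.0)

# A separate relay process would poll the outbox and publish to the change
# stream, deleting rows only after an acknowledged publish (at-least-once).
events = [json.loads(row[0]) for row in
          conn.execute("SELECT payload FROM outbox ORDER BY seq")]
print(events)  # [{'id': 1, 'price': 12.0}]
```

The relay may publish an event twice (crash after publish, before delete), which is why downstream consumers still need the version-gated idempotent apply described earlier.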
📌 Interview Tips
1. Meta feed convergence: change data capture from the normalized social graph publishes to Kafka; consumers apply to per-user denormalized feed rows with version numbers; end-to-end lag is monitored at p50 (under 1 second), p95 (under 5 seconds), and p99 (under 30 seconds); periodic reconciliation jobs scan for drift and repair from the event log
2. Pinterest homefeed pipeline: the outbox pattern on normalized pin/board updates ensures all changes are captured; Kafka consumers process in parallel per user shard; idempotent upserts keyed on entity version prevent out-of-order application; a 5-second p95 staleness SLO is met by tuning batch size to 50 events and provisioning 500 consumer instances
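The sharded-counter-with-reconciliation idea from the takeaways, as a minimal sketch (the shard count, in-memory storage, and job cadence are all illustrative; in production the shards would be separate rows or keys):

```python
import random

NUM_SHARDS = 64
shards = [0] * NUM_SHARDS   # e.g. 64 counter rows instead of one hot row
materialized_total = 0      # denormalized total shown to readers

def increment():
    """Spread increments across shards to avoid a single hot key."""
    shards[random.randrange(NUM_SHARDS)] += 1

def reconcile():
    """Periodic job (e.g. every 10-60 minutes): recompute from shards."""
    global materialized_total
    materialized_total = sum(shards)

for _ in range(1000):
    increment()
reconcile()
print(materialized_total)  # 1000
```

Between reconciliation runs the materialized total drifts from the shard sum; the periodic recompute bounds that drift, matching the takeaway's point that unreconciled counters typically drift by low single-digit percentages.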