
Multi-Region Replication and Failure Modes

Multi-datacenter deployments replicate data across geographic regions for low-latency local reads and disaster recovery. An active-active topology accepts writes in any region and replicates them asynchronously to the others, keeping local latency in the single-digit milliseconds while surviving the loss of an entire datacenter. Each region maintains its own replication factor (commonly 3 per region), so a logical write with replication factor 3 across 3 datacenters becomes 9 physical writes, 3x the write amplification of a single-region deployment. Conflict resolution uses last-write-wins based on timestamps, which requires tight Network Time Protocol (NTP) discipline, keeping clock skew under 100 ms, to avoid lost updates.

Netflix runs hundreds of clusters across multiple Amazon Web Services (AWS) regions handling trillions of operations per day with LOCAL_QUORUM consistency (2 of 3 local replicas), giving 5 to 15 ms p99 write latency and sub-10 ms reads. Cross-region replication completes asynchronously in 50 to 200 ms depending on Wide Area Network (WAN) distance. Total storage is 9x the logical data (3 regions times 3 replicas) plus compaction headroom. Apple iCloud similarly runs multi-region active-active with tens of thousands of nodes and 100+ petabytes stored.

Failure modes include split-brain scenarios, where a network partition isolates datacenters, both sides accept writes independently, and reconciliation later discovers conflicting data. Read-replica lag in asynchronous replication can fall 10+ seconds behind during peak load or network congestion, so users see stale data immediately after their own writes (read-your-writes violations). Cascading failures occur when one region fails and its traffic shifts to the remaining regions, overloading them if capacity headroom is insufficient; regions commonly need 50 to 100 percent extra capacity to absorb failover load.

Anti-entropy repair keeps replicas synchronized by periodically comparing Merkle tree hashes across replicas and streaming the differences. A full repair of multi-terabyte nodes can take hours to days and saturate network and disk, causing elevated tail latencies if not throttled properly. Netflix and Apple automate repair with incremental, token-range-based repairs run weekly, throttled to 50 to 100 MB/s per stream to limit impact on foreground traffic.
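For concreteness, here is a minimal sketch of this write path using the DataStax Python driver (cassandra-driver); the contact point, datacenter names, keyspace, and table are hypothetical. The keyspace's NetworkTopologyStrategy places 3 replicas in each region, and LOCAL_QUORUM means the client waits on only 2 local acknowledgements while remote regions apply the write asynchronously.

```python
# Minimal active-active write sketch with the DataStax cassandra-driver.
# Contact point, datacenter names, keyspace, and table are hypothetical.
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy
from cassandra import ConsistencyLevel

# Route requests to coordinators in the local datacenter and wait for
# LOCAL_QUORUM (2 of 3 local replicas); other regions apply the write async.
local_profile = ExecutionProfile(
    load_balancing_policy=DCAwareRoundRobinPolicy(local_dc="us-east"),
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
cluster = Cluster(["10.0.0.1"],
                  execution_profiles={EXEC_PROFILE_DEFAULT: local_profile})
session = cluster.connect()

# RF=3 per region across three regions: each logical write lands on 9 replicas.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS profiles WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'us-east': 3, 'eu-west': 3, 'ap-south': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS profiles.users (
        user_id text PRIMARY KEY, display_name text)
""")

# The client unblocks after the 2 local acks (typically 5 to 15 ms p99);
# cross-region replicas catch up 50 to 200 ms later.
session.execute(
    "INSERT INTO profiles.users (user_id, display_name) VALUES (%s, %s)",
    ("user-123", "Ada"),
)
```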
💡 Key Takeaways
Active-active multi-region with LOCAL_QUORUM gives 5 to 15 ms local writes and survives datacenter failures, but requires 9x storage (3 regions times replication factor 3) and 9 physical writes per logical write
Asynchronous cross-region replication completes in 50 to 200 ms depending on WAN distance; read-replica lag can reach 10+ seconds under load, causing stale reads after writes
Last-write-wins conflict resolution requires tight NTP synchronization with clock skew under 100 ms; skew causes lost updates when concurrent writes carry reversed timestamps
Split brain during a network partition allows both sides to accept writes independently; reconciliation discovers conflicts, resolves them by timestamp ordering, and can silently lose updates
Cascading failures occur when one region fails and its traffic shifts to the others; each region needs 50 to 100 percent extra capacity headroom to absorb failover load without overload
Anti-entropy repair streams multi-terabyte datasets, taking hours to days; Netflix and Apple throttle streams to 50 to 100 MB/s and run incremental repairs weekly to limit impact on foreground queries (a toy sketch of the Merkle comparison step follows below)
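To illustrate only the comparison step of anti-entropy repair (not Cassandra's actual repair code, which uses hierarchical Merkle trees), here is a toy sketch with flat per-range digests: each replica hashes the rows in every token range, and only ranges whose digests differ need to be streamed.

```python
# Toy anti-entropy sketch (illustration only, not Cassandra's repair code):
# hash each token range on two replicas and stream only the ranges that differ.
import hashlib

def range_digest(rows):
    """Stable digest of a token range's rows (list of (key, value, timestamp))."""
    h = hashlib.sha256()
    for key, value, ts in sorted(rows):
        h.update(f"{key}|{value}|{ts}".encode())
    return h.hexdigest()

def ranges_to_stream(local_ranges, remote_digests):
    """Compare per-range digests; return token ranges whose data diverges."""
    return [
        token_range
        for token_range, rows in local_ranges.items()
        if range_digest(rows) != remote_digests.get(token_range)
    ]

# Replica A's data, grouped by token range, vs. replica B's published digests.
replica_a = {
    (0, 100): [("k1", "v1", 1000), ("k2", "v2", 1001)],
    (100, 200): [("k9", "v9", 1005)],
}
replica_b_digests = {
    (0, 100): range_digest([("k1", "v1", 1000), ("k2", "v2", 1001)]),  # in sync
    (100, 200): range_digest([("k9", "stale", 900)]),                  # diverged
}
print(ranges_to_stream(replica_a, replica_b_digests))  # -> [(100, 200)]
```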
📌 Examples
Netflix multi-region at scale: Hundreds of clusters across AWS regions handling 10+ million ops/sec with sub-20 ms p99, using LOCAL_QUORUM per region and async replication. Each logical write becomes 9 physical writes (3 DCs times RF=3); storage overhead is 9x, and with roughly 30 percent compaction headroom the total is about 12x.
Split-brain incident: A network partition isolates the US and EU datacenters for 5 minutes, and both keep accepting writes. A user updates their profile in the US (timestamp 10:05:00.100), then updates it again in the EU, but clock skew stamps the later write 10:05:00.050. When the partition heals, last-write-wins picks the US update as "newer" and the EU update, the real latest write, is lost.
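A minimal sketch of why the later EU write loses: last-write-wins compares only the embedded timestamps, so a skewed clock can rank the older value as newer. The class and column values below are illustrative, not from any driver API.

```python
# Last-write-wins reconciliation sketch: the replica keeps whichever version
# carries the larger timestamp, regardless of real-world ordering.
from dataclasses import dataclass

@dataclass
class Version:
    value: str
    timestamp_ms: int  # coordinator-assigned write timestamp

def lww_merge(a: Version, b: Version) -> Version:
    # Timestamp alone decides; real systems break exact ties by comparing values.
    return a if a.timestamp_ms >= b.timestamp_ms else b

# US write stamped 10:05:00.100; the *later* EU write is stamped 10:05:00.050
# because the EU coordinator's clock runs ~50 ms behind.
us_write = Version("display_name=Alice (US edit)", timestamp_ms=1_700_000_000_100)
eu_write = Version("display_name=Alicia (EU edit, later)", timestamp_ms=1_700_000_000_050)

winner = lww_merge(us_write, eu_write)
print(winner.value)  # -> "display_name=Alice (US edit)"; the newer EU edit is lost
```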
Read-replica lag violation: A user posts a comment; the LOCAL_QUORUM write succeeds in US-East in 5 ms. Asynchronous replication to the EU-West replicas is delayed 15 seconds by WAN congestion. A read routed to EU-West immediately afterward hits the local replica and does not see the comment for 15 seconds.
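A small simulation of that timeline (the class and helper names are hypothetical): the lagging replica applies a mutation only after the replication delay has elapsed, so any read issued before that returns the stale state.

```python
# Read-your-writes violation sketch: the EU replica applies a mutation only
# after the replication delay, so earlier local reads return stale data.
class LaggingReplica:
    def __init__(self, replication_delay_s: float):
        self.delay = replication_delay_s
        self.pending = []   # (apply_at, key, value) not yet visible locally
        self.data = {}

    def replicate(self, now: float, key: str, value: str):
        # Mutation arrives over the WAN and becomes visible at now + delay.
        self.pending.append((now + self.delay, key, value))

    def read(self, now: float, key: str):
        # Apply every mutation whose arrival time has passed, then serve the read.
        for entry in list(self.pending):
            apply_at, k, v = entry
            if apply_at <= now:
                self.data[k] = v
                self.pending.remove(entry)
        return self.data.get(key)

eu_west = LaggingReplica(replication_delay_s=15.0)   # 15 s of WAN congestion
eu_west.replicate(now=0.0, key="comment:42", value="Great post!")  # US-East write at t=0

print(eu_west.read(now=0.005, key="comment:42"))   # -> None (5 ms later: not visible)
print(eu_west.read(now=16.0, key="comment:42"))    # -> "Great post!" (after the lag)
```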
Cascading failure from region loss: Three-region deployment (US, EU, Asia), each sized for 40 percent of global traffic with 60 percent peak headroom. The US region fails and its 40 percent of traffic shifts to EU and Asia. EU now serves 60 percent of global traffic (40 original plus 20 shifted), consuming nearly all of its provisioned capacity; p99 latency spikes and timeouts cascade.
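The capacity arithmetic behind that example, under the assumption that "sized for 40 percent with 60 percent headroom" means a provisioned capacity of 0.40 x 1.6 = 0.64 of global traffic per region and that the failed region's load splits evenly across survivors:

```python
# Failover capacity arithmetic for the three-region example above.
# Assumption: capacity per region = 0.40 * 1.6 = 0.64 of global traffic,
# and the failed region's traffic splits evenly across the survivors.
normal_share = {"US": 0.40, "EU": 0.40, "Asia": 0.20}
capacity_per_region = 0.40 * 1.60          # 0.64 of global traffic

failed = "US"
survivors = [r for r in normal_share if r != failed]
shifted = normal_share[failed] / len(survivors)   # 0.20 shifted to each survivor

for region in survivors:
    load = normal_share[region] + shifted
    utilization = load / capacity_per_region
    print(f"{region}: load={load:.0%} of global traffic, "
          f"utilization={utilization:.0%} of capacity")
# EU:   load=60% of global traffic, utilization=94% of capacity
# Asia: load=40% of global traffic, utilization=63% of capacity
```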