
Multi-Region Replication and Failure Modes

Active-Active Multi-Region

Multi-datacenter deployments replicate data across geographic regions for low-latency local reads and disaster recovery. In an active-active topology, any region accepts writes and replicates them asynchronously to the others. Writing at LOCAL_QUORUM (a majority of replicas in the local datacenter) achieves 5-15ms write latency while surviving the loss of an entire datacenter.
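A minimal sketch of the LOCAL_QUORUM math, assuming RF=3 per region: only acknowledgments from the local datacenter gate the write, which is why it stays fast and survives remote-region outages.

```python
# Sketch: LOCAL_QUORUM arithmetic for a hypothetical RF=3-per-region deployment.
def local_quorum(rf: int) -> int:
    """Number of replicas in the *local* datacenter that must ack a write."""
    return rf // 2 + 1

def write_succeeds(local_acks: int, rf: int) -> bool:
    # Remote regions replicate asynchronously; their availability never
    # blocks the write -- only local acks are counted.
    return local_acks >= local_quorum(rf)

print(local_quorum(3))       # 2 of 3 local replicas must ack
print(write_succeeds(2, 3))  # True, even if both remote regions are unreachable
```

This is why a whole-datacenter failure elsewhere does not affect local write latency: the coordinator never waits on the WAN.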

Storage overhead is significant. With replication factor 3 (RF=3, three copies per region) across 3 datacenters, each logical write becomes 9 physical writes. Total storage is 9x the logical data plus ~30% compaction headroom, roughly 12x overall. Cross-region replication completes asynchronously in 50-200ms depending on WAN (Wide Area Network) distance.

Failure Modes

Split brain: A network partition isolates datacenters, and both sides accept writes independently. When the partition heals, conflicting writes are reconciled by last-write-wins timestamp comparison, potentially losing updates if clocks are skewed.
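A toy last-write-wins reconciliation makes the failure mode concrete; the timestamps and values here are invented for illustration.

```python
# Minimal last-write-wins merge, as applied after a partition heals.
# Each replica holds (timestamp_micros, value); the highest timestamp wins,
# so a skewed clock can silently discard the logically newer write.
def lww_merge(a: tuple, b: tuple) -> tuple:
    return a if a[0] >= b[0] else b

region_a = (1_700_000_000_000_500, "new-value")  # written second, correct clock
region_b = (1_700_000_000_002_000, "old-value")  # written first, clock runs fast

print(lww_merge(region_a, region_b))  # the stale write wins; the update is lost
```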

Replica lag: Asynchronous replication can fall 10+ seconds behind during peak load. A user who reads from a local replica immediately after writing may see stale data (a read-your-writes violation).
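A toy freshness check, under the assumption that each replica tracks the timestamp it has applied mutations through, shows when the violation occurs:

```python
# Toy model of replica lag: a read is "fresh" only if the local replica has
# applied mutations at least as new as the client's last write timestamp.
def sees_own_write(write_ts: float, replica_applied_ts: float) -> bool:
    return replica_applied_ts >= write_ts

# User writes at t=100s; the local replica is 15s behind (applied through t=85s).
print(sees_own_write(100.0, 85.0))  # False -> read-your-writes violation
```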

Cascading failure: One region fails and its traffic shifts to the remaining regions. Without 50-100% extra capacity per region, the survivors overload, p99 latency spikes, and timeouts cascade.
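The load shift can be sketched with simple fractions, assuming traffic is spread evenly across regions:

```python
# Per-region share of global traffic after one of n regions fails.
def surviving_load_fraction(n_regions: int) -> float:
    return 1.0 / (n_regions - 1)

# 3 regions: each normally serves 1/3 of traffic; after one failure the
# survivors serve 1/2 each. A region sized only for its 1/3 share would
# then run at 150% utilization -- the start of a cascading overload.
print(surviving_load_fraction(3))  # 0.5
```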

Anti-Entropy Repair

Replicas can diverge due to failed writes, network issues, or partial failures. Anti-entropy repair periodically compares Merkle trees (hash trees summarizing data segments) across replicas and streams differences. Full repair of multi-terabyte nodes takes hours to days, saturating network and disk. Throttle streams to 50-100 MB/s and run incremental repairs weekly to limit impact on foreground queries.
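The comparison step can be sketched with flat per-range hashes; real Cassandra builds a binary Merkle tree per token range, but the flat version (an assumption made here for brevity) keeps the idea visible: only mismatched ranges are streamed.

```python
import hashlib

# Sketch of anti-entropy comparison: hash fixed key ranges on each replica,
# compare the digests, and stream only the ranges that differ.
def range_hashes(data: dict, num_ranges: int = 4) -> list:
    buckets = [[] for _ in range(num_ranges)]
    for key in sorted(data):
        buckets[key % num_ranges].append(f"{key}={data[key]}")
    return [hashlib.sha256("|".join(b).encode()).hexdigest() for b in buckets]

def ranges_to_repair(replica_a: dict, replica_b: dict) -> list:
    ha, hb = range_hashes(replica_a), range_hashes(replica_b)
    return [i for i, (x, y) in enumerate(zip(ha, hb)) if x != y]

a = {1: "x", 2: "y", 3: "z", 4: "w"}
b = {1: "x", 2: "STALE", 3: "z", 4: "w"}
print(ranges_to_repair(a, b))  # [2] -> only the range holding key 2 streams
```

The win is bandwidth: instead of shipping terabytes to compare replicas, nodes exchange small digests and stream only divergent ranges.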

Key Trade-off: Multi-region active-active provides always-on operation at the cost of 9-12x storage overhead, eventual consistency windows of 50-200ms, and operational complexity of repair and conflict resolution.
💡 Key Takeaways
Active-active with LOCAL_QUORUM gives 5-15ms local writes while surviving datacenter failures; cross-region replication completes in 50-200ms
Storage overhead: 3 regions x RF=3 = 9x logical data, plus 30% compaction headroom = ~12x total physical storage
Split brain during partition allows both sides to accept writes; reconciliation uses last-write-wins which can lose updates if clocks skewed
Replica lag can reach 10+ seconds under load causing read-your-writes violations (user writes, then reads stale data from local replica)
Cascading failure when one region fails requires 50-100% extra capacity per surviving region to absorb redirected traffic
Anti-entropy repair compares Merkle trees and streams differences; throttle to 50-100 MB/s to avoid impacting foreground queries
📌 Interview Tips
1. Calculate storage overhead: 100GB logical data, 3 regions, RF=3 = 900GB replicated + 270GB compaction headroom = 1.17TB total. Budget ~12x logical for multi-region.
2. Explain a read-your-writes violation: user posts a comment (LOCAL_QUORUM succeeds in 5ms), then immediately refreshes the page and reads from a local replica still 15s behind. The comment is not visible.
3. Design for region failure: with 3 regions, each normally serves ~33% of global traffic; if one fails, the survivors absorb 50% each. Sizing each region at 50% of global traffic (50% headroom) leaves survivors exactly saturated; provision ~100% headroom (capacity ≈ 67% of global) so survivors run at ~75% and degrade gracefully.
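The region-failure sizing in tip 3 can be checked with one function; `capacity_fraction` (each region's capacity as a fraction of global traffic) is a parameter introduced here for illustration.

```python
# Utilization of each surviving region after one of n evenly-loaded regions fails.
def survivor_utilization(n_regions: int, capacity_fraction: float) -> float:
    load_per_survivor = 1.0 / (n_regions - 1)   # share of global traffic
    return load_per_survivor / capacity_fraction

print(round(survivor_utilization(3, 0.50), 2))  # 1.0  -> 50% headroom: saturated
print(round(survivor_utilization(3, 0.67), 2))  # 0.75 -> ~100% headroom: graceful
```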