
Cross Region Replication: Network Realities and Capacity Planning

Cross region replication makes network distance the dominant factor in both steady state lag and recovery time after lag spikes. Network Round Trip Time (RTT) varies dramatically by geography: within an Availability Zone (AZ), RTT is 0.1 to 0.5 milliseconds; cross AZ in the same region is 1 to 2 milliseconds; US East to West is 60 to 90 milliseconds; transatlantic is 70 to 120 milliseconds; transpacific is 120 to 200+ milliseconds. These figures are set by distance and routing, not by software, so no amount of tuning removes them. For asynchronous replication, the one way delay (roughly half the RTT) adds directly to the minimum achievable lag. Even with infinite bandwidth and zero processing time, a write committed on the US East Coast cannot be visible on the West Coast sooner than about 30 milliseconds after commit, and the leader cannot confirm that the replica has it in less than a full RTT of 60 milliseconds or more, because signals in fiber optic cable travel at roughly two thirds the speed of light.

Bandwidth and throughput constraints are the second critical planning factor. If your leader generates 200 Megabytes per second of write log output and you apply 2x compression, you still need at least 800 Megabits per second (100 Megabytes per second) of effective network capacity to each replica region. With three downstream regions, aggregate egress must handle at least 2.4 Gigabits per second plus protocol overhead (typically 10 to 20% for TCP/IP, encryption, and acknowledgments). Cloud provider cross region data transfer costs are substantial: AWS charges approximately $0.02 per Gigabyte for inter region transfer. At 200 Megabytes per second uncompressed (17.28 Terabytes per day), egress to three regions costs approximately $1,000 per day or $30,000 per month just for replication bandwidth; 2x compression on the wire roughly halves that bill.

Capacity planning also requires calculating drain time: the time needed to clear a backlog. If the leader sustains a 300 Megabytes per second write rate for 10 minutes due to a batch job, it writes 180 Gigabytes of log. Treating all of it as backlog (a worst case that ignores whatever ships during the burst), draining 180 Gigabytes over a 1 Gigabit per second (125 Megabytes per second) sustained link takes 1,440 seconds or 24 minutes, and new writes arriving during the drain stretch it further. Followers in that region stay behind for at least those 24 minutes.

If your Service Level Objective (SLO) for cross region lag is 10 seconds at p99, you must do one of three things: provision 10x the average bandwidth (10 Gigabits per second capable links) to absorb bursts, implement strict write throttling to cap burst rates, or accept SLO violations during batch operations. Production systems often combine these: provision 3x to 5x average capacity, apply backpressure at 2x the average write rate, and schedule bulk operations during low traffic periods with explicit lag budget allowances.
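The bandwidth, cost, and drain time arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not a planning tool: the function names are made up for illustration, decimal units are assumed (1 Gigabyte = 1,000 Megabytes), and the $0.02 per Gigabyte default is the approximate rate quoted above, not a current price list.

```python
SECONDS_PER_DAY = 86_400


def required_link_mbps(log_mb_per_s: float, compression_ratio: float) -> float:
    """Megabits per second needed to each replica region after compression."""
    return log_mb_per_s / compression_ratio * 8


def daily_egress_cost_usd(wire_mb_per_s: float, regions: int,
                          usd_per_gb: float = 0.02) -> float:
    """Daily transfer cost for shipping wire_mb_per_s to every region.

    usd_per_gb is an assumed flat rate; real pricing is tiered and varies.
    """
    gb_per_day = wire_mb_per_s * SECONDS_PER_DAY / 1_000
    return gb_per_day * regions * usd_per_gb


def drain_seconds(backlog_gb: float, link_mb_per_s: float,
                  incoming_mb_per_s: float = 0.0) -> float:
    """Time to clear a backlog; only spare link capacity drains it."""
    spare = link_mb_per_s - incoming_mb_per_s
    if spare <= 0:
        return float("inf")  # the link cannot keep up, so the backlog never clears
    return backlog_gb * 1_000 / spare


if __name__ == "__main__":
    print(required_link_mbps(200, 2))                # per-region link, Mbps
    print(round(daily_egress_cost_usd(200, 3), 2))   # uncompressed, 3 regions
    print(drain_seconds(180, 125) / 60)              # minutes to drain 180 GB
```

Passing a non-zero `incoming_mb_per_s` shows why the 24 minute figure is a lower bound: with 100 Megabytes per second of new writes still arriving, only 25 Megabytes per second of the link is left for draining and the same backlog takes two hours to clear.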
💡 Key Takeaways
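Network RTT sets a physical floor on lag: with a US East to West RTT of 60 to 90 milliseconds, a write cannot appear on the other coast sooner than the one way delay (roughly 30 milliseconds) or be confirmed back to the leader in under 60 milliseconds; transpacific RTT is 120 to 200+ milliseconds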
Bandwidth requirements scale with regions: 200 Megabytes per second of log generation to 3 regions requires 2.4+ Gigabits per second of aggregate egress capacity with 2x compression; at the uncompressed rate, cloud transfer fees come to approximately $30,000 per month
Drain time formula: backlog bytes divided by effective throughput equals recovery time; 180 Gigabytes over 1 Gigabit per second takes 24 minutes, setting minimum lag during recovery
Production systems provision 3x to 5x average bandwidth to handle burst writes without violating lag SLOs; 10x provisioning enables sub second recovery from typical bursts but increases cost proportionally
Write rate exceeding follower apply throughput causes unbounded lag growth; if leader sustains 50,000 operations per second but follower can only apply 30,000 operations per second, lag grows by 20,000 operations per second indefinitely
Partial network impairments (packet loss reducing effective throughput from 1 Gigabit per second to 200 Megabits per second) grow lag silently over hours before crossing alert thresholds; monitor both lag and link utilization metrics
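The unbounded growth case in the takeaways can be made concrete with a short calculation. The sketch below assumes constant rates and a follower that starts fully caught up; the function names are illustrative and not taken from any replication system.

```python
def backlog_growth_ops_per_s(write_ops_per_s: float, apply_ops_per_s: float) -> float:
    """Operations added to the follower's backlog each second (never negative)."""
    return max(0.0, write_ops_per_s - apply_ops_per_s)


def time_lag_seconds(elapsed_s: float, write_ops_per_s: float,
                     apply_ops_per_s: float) -> float:
    """How far behind in wall-clock time the follower is after elapsed_s.

    After t seconds the follower has applied apply * t operations, which the
    leader had already produced by time apply * t / write. The gap between
    that moment and t is the lag.
    """
    if write_ops_per_s <= apply_ops_per_s:
        return 0.0  # the follower keeps up
    return elapsed_s * (1 - apply_ops_per_s / write_ops_per_s)


def seconds_until_backlog(threshold_ops: float, write_ops_per_s: float,
                          apply_ops_per_s: float) -> float:
    """Time until the backlog reaches threshold_ops, e.g. an alert threshold."""
    growth = backlog_growth_ops_per_s(write_ops_per_s, apply_ops_per_s)
    return float("inf") if growth == 0 else threshold_ops / growth
```

With the rates from the takeaway (50,000 writes per second against 30,000 applied per second), the backlog grows by 20,000 operations per second and the follower falls a further 0.4 seconds behind for every second that passes, so ten minutes of sustained load leaves it about four minutes stale.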
📌 Examples
Netflix provisions cross region replication with sufficient capacity to maintain under 5 second lag during regional traffic failovers when write rates spike 3x to 5x due to rerouted traffic from a failing region
A global SaaS provider generates 150 Megabytes per second of write log on average; during month end reporting batch jobs, writes spike to 600 Megabytes per second for 15 minutes; with 5x provisioned capacity (750 Megabytes per second links), lag stays under 30 seconds; without headroom, lag would exceed 10 minutes
An e-commerce platform replicating from US to EU and APAC regions calculates 17.28 Terabytes per day at 200 Megabytes per second; cross region egress to 2 regions costs $20,000 per month at $0.02 per Gigabyte, driving a decision to replicate only critical transaction data cross region and keep analytics logs regional
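The effect of headroom in the SaaS example can be replayed with a one second time step simulation. This is a rough sketch under simplifying assumptions: write rates are flat within each phase, the link delivers its full rated throughput, and the function name and parameters are hypothetical.

```python
def peak_backlog_mb(link_mb_per_s: float, baseline_mb_per_s: float,
                    burst_mb_per_s: float, burst_seconds: int,
                    total_seconds: int) -> float:
    """Largest unsent backlog (in Megabytes) seen during the simulated window."""
    backlog = 0.0
    peak = 0.0
    for second in range(total_seconds):
        # The burst occupies the first burst_seconds, then traffic returns to baseline.
        write_rate = burst_mb_per_s if second < burst_seconds else baseline_mb_per_s
        # Each second the leader adds write_rate and the link removes up to link_mb_per_s.
        backlog = max(0.0, backlog + write_rate - link_mb_per_s)
        peak = max(peak, backlog)
    return peak


if __name__ == "__main__":
    burst = dict(baseline_mb_per_s=150, burst_mb_per_s=600,
                 burst_seconds=15 * 60, total_seconds=60 * 60)
    for label, link in [("5x headroom", 750), ("no headroom", 150)]:
        print(label, peak_backlog_mb(link, **burst), "MB peak backlog")
```

With a 750 Megabytes per second link the 600 Megabytes per second burst never queues, while a link sized to the 150 Megabytes per second average accumulates 405 Gigabytes of backlog by the end of the burst and, with baseline traffic consuming the whole link afterwards, never drains it.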