Replication Modes: Synchronous, Asynchronous, and Semi-Synchronous Trade-offs

Replication mode determines the fundamental consistency-versus-latency trade-off in distributed databases. Synchronous replication blocks the leader's commit acknowledgment until one or more replicas confirm that they have received and durably stored the write. This guarantees that committed data exists on multiple nodes (a Recovery Point Objective, or RPO, of zero), but write latency increases by at least one Round Trip Time (RTT) to the slowest synchronous replica. Within an Availability Zone (AZ), this adds 0.1 to 0.5 milliseconds; cross-AZ within the same region, 1 to 2 milliseconds. Across geographic regions, synchronous replication adds 60 to 90 milliseconds for US East to West, 70 to 120 milliseconds transatlantic, or 120 to 200+ milliseconds transpacific. Synchronous replication also reduces availability during network partitions, because the leader cannot commit writes when it cannot reach a quorum of replicas.
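
The latency rule above can be sketched as a small calculation. This is a simplified model, not a measurement: it assumes the leader waits for every synchronous replica, so the slowest one sets the added latency. The function name, the 1.0 millisecond local commit cost, and the specific RTT values are illustrative choices picked from the ranges quoted above.

```python
from typing import Sequence


def sync_write_latency_ms(local_commit_ms: float,
                          sync_replica_rtts_ms: Sequence[float]) -> float:
    """Estimate commit latency when the leader waits for all synchronous replicas.

    Assumption: acknowledgments are requested in parallel, so the added
    latency is the RTT to the slowest synchronous replica, not the sum.
    """
    if not sync_replica_rtts_ms:
        # No synchronous replicas: the commit costs only the local write.
        return local_commit_ms
    return local_commit_ms + max(sync_replica_rtts_ms)


# Illustrative RTTs (milliseconds) chosen from the ranges in the text.
same_az_only = sync_write_latency_ms(1.0, [0.3])
with_cross_region = sync_write_latency_ms(1.0, [0.3, 1.5, 75.0])

print(f"same AZ only:      {same_az_only:.1f} ms")
print(f"with cross-region: {with_cross_region:.1f} ms")
```

Adding a single cross-region synchronous replica dominates the result, which is why the slowest replica, rather than the average, is the number to watch.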

Asynchronous replication decouples write latency from replication entirely. The leader acknowledges a write immediately after its local commit, then streams changes to followers in the background. This delivers the lowest possible write latency and maintains availability even when replicas are unreachable. However, read-your-writes consistency can be violated (you might not see your own recent write if you read from a lagging replica), monotonic reads can fail (you might see newer data, then older data on a subsequent read), and RPO is greater than zero on failover (recently acknowledged writes may be lost if the leader fails before replication completes). Under healthy conditions, asynchronous cross-region pipelines typically maintain steady-state lag under 1 to 2 seconds, but lag can spike to tens of seconds or minutes during bursty writes, long-running transactions, or partial network impairments.
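
To make the stale-read and data-loss risks concrete, here is a toy in-memory model of asynchronous replication. It is a sketch under simplifying assumptions (one follower, an append-only log, replication driven by an explicit call); the class and method names are hypothetical and do not correspond to any particular database.

```python
class AsyncLeader:
    """Toy leader that acknowledges writes before the follower has them."""

    def __init__(self):
        self.log = []           # writes already acknowledged to clients
        self.follower_log = []  # writes the follower has received so far

    def write(self, value):
        # Acknowledge right after the local commit; no follower round trip.
        self.log.append(value)
        return len(self.log)

    def replicate(self, max_entries):
        # Background shipping: send up to max_entries pending writes.
        start = len(self.follower_log)
        self.follower_log.extend(self.log[start:start + max_entries])

    def lag(self):
        # Number of acknowledged writes the follower has not yet received.
        return len(self.log) - len(self.follower_log)

    def failover(self):
        # Promote the follower; return the acknowledged writes that are lost.
        return self.log[len(self.follower_log):]


leader = AsyncLeader()
for i in range(1, 6):
    leader.write(f"w{i}")
leader.replicate(max_entries=3)  # the follower falls behind by two writes

print("latest on leader:  ", leader.log[-1])
print("latest on follower:", leader.follower_log[-1])  # stale read
print("lag (writes):      ", leader.lag())
print("lost on failover:  ", leader.failover())
```

A client that wrote `w5` and then reads from the follower sees `w3` as the newest value, and a failover at this moment discards `w4` and `w5` even though both were acknowledged.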

Semi-synchronous and quorum strategies occupy the middle ground. A common production pattern is to commit after achieving a local quorum (for example, 2 out of 3 replicas in the same region) to keep p99 write latency low (1 to 3 milliseconds), while streaming asynchronously to remote regions for disaster recovery. This provides strong consistency and low RPO within a region while maintaining high availability across regions. The key is matching replication mode to data criticality and acceptable latency. Critical metadata with low write Queries Per Second (QPS), such as user permissions or billing records, may justify synchronous cross-region replication despite the 100+ millisecond write penalty. High-QPS data such as content feeds, analytics events, or social activity streams requires asynchronous replication to sustain throughput, accepting eventual consistency and bounded data loss windows.
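
The local-quorum pattern can be sketched as follows. This is a simplified model, assuming acknowledgments are requested in parallel and that only replicas in the leader's region count toward the quorum; remote regions are streamed to asynchronously and therefore do not appear in the calculation. The names `quorum_commit_latency_ms` and `QuorumUnavailable`, the replica labels, and the RTT values are hypothetical. Whether the leader's own write counts as one of the votes varies by system, so `acks_required` is left as a parameter.

```python
from typing import Mapping, Optional


class QuorumUnavailable(Exception):
    """Raised when too few local replicas are reachable to commit."""


def quorum_commit_latency_ms(local_rtts_ms: Mapping[str, Optional[float]],
                             acks_required: int) -> float:
    """Added commit latency when waiting for `acks_required` local acks.

    An RTT of None marks a replica that is currently unreachable.
    """
    if acks_required < 1:
        raise ValueError("acks_required must be at least 1")
    reachable = sorted(rtt for rtt in local_rtts_ms.values() if rtt is not None)
    if len(reachable) < acks_required:
        raise QuorumUnavailable(
            f"need {acks_required} acks, only {len(reachable)} replicas reachable")
    # The commit completes when the k-th fastest acknowledgment arrives.
    return reachable[acks_required - 1]


# Three replicas in the leader's region (illustrative RTTs in milliseconds).
healthy = {"az-a": 0.3, "az-b": 1.2, "az-c": 1.8}
print(quorum_commit_latency_ms(healthy, acks_required=2))

# One AZ down: the quorum still forms from the two remaining replicas.
one_az_down = {"az-a": 0.3, "az-b": None, "az-c": 1.8}
print(quorum_commit_latency_ms(one_az_down, acks_required=2))

# Two AZs down: writes block, which is the availability cost of waiting.
try:
    quorum_commit_latency_ms({"az-a": 0.3, "az-b": None, "az-c": None}, 2)
except QuorumUnavailable as exc:
    print("cannot commit:", exc)
```

Because the commit waits only for the second-fastest local acknowledgment, a slow or failed third replica does not affect write latency until a second replica also becomes unreachable.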

💡 Key Takeaways

✓ Synchronous cross-region replication adds at least 60 milliseconds (US East to West) and up to 200+ milliseconds (transpacific) per write, making it viable only for low-QPS critical metadata.

✓ Asynchronous replication adds zero write latency but introduces an RPO greater than zero: a leader failure can lose all writes not yet replicated, typically 1 to 2 seconds of data under normal conditions.

✓ A semi-synchronous quorum within a region (for example, waiting for 2 of 3 local replicas) provides strong consistency and low RPO locally with only 1 to 3 milliseconds of added latency, while maintaining cross-region availability.

✓ Write throughput limits: if the leader's write rate exceeds follower apply throughput by 20,000 operations per second, the replication backlog grows by 20,000 operations every second, without bound, until backpressure applies or capacity is added.

✓ Synchronous replication reduces availability because the system cannot commit writes during network partitions that isolate the leader from its quorum of replicas.

✓ Systems like Netflix use local quorums for low-latency strong consistency within a region and accept asynchronous cross-region replication with bounded staleness during regional failovers.
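
The write throughput takeaway above can be worked through as arithmetic. In this sketch, only the 20,000 operations per second gap comes from the text; the absolute rates (120,000 writes and 100,000 applies per second), the 60 second burst, and the function names are hypothetical values chosen for illustration.

```python
def backlog_ops(write_rate: float, apply_rate: float, seconds: float) -> float:
    """Unapplied operations accumulated after `seconds`, starting from zero."""
    return float(max(0.0, (write_rate - apply_rate) * seconds))


def drain_seconds(backlog: float, write_rate: float, apply_rate: float) -> float:
    """Time to clear a backlog; infinite unless applies outpace writes."""
    if apply_rate <= write_rate:
        return float("inf")
    return backlog / (apply_rate - write_rate)


# Leader writes 120,000 ops/s, follower applies 100,000 ops/s, for 60 seconds.
backlog = backlog_ops(write_rate=120_000, apply_rate=100_000, seconds=60)
print(f"backlog after 60 s: {backlog:,.0f} ops")

# Expressed as time: how far behind the follower is at its own apply rate.
print(f"lag in seconds:     {backlog / 100_000:.0f} s")

# If writes later drop to 80,000 ops/s, the follower gains 20,000 ops/s.
print(f"time to drain:      {drain_seconds(backlog, 80_000, 100_000):.0f} s")
```

With these numbers, one minute of overload leaves the follower 1,200,000 operations (12 seconds) behind, and it needs a further minute at the reduced write rate to catch up. While writes stay above the apply rate, the drain time is infinite.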

📌 Interview Tips

1. A payment processing system uses synchronous replication across 3 AZs in the same region (2 of 3 quorum), accepting 1 to 2 milliseconds of added latency to guarantee zero data loss on a single-AZ failure, while using asynchronous replication to a disaster recovery region.

2. A social media feed uses fully asynchronous replication to sustain 500,000 writes per second across regions; users may see their own posts with a 1 to 2 second delay when reading from a different region, which is acceptable for this use case.

3. Box's multitenant storage uses asynchronous replication to offload 75% of read traffic to replicas while implementing read-after-write guarantees via position-based tokens, avoiding the write latency penalty of synchronous replication.
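
The position-based token idea in the last tip can be sketched generically. This is not Box's implementation, whose details the text does not give; it is a minimal illustration assuming the leader returns its log position on each write and a replica serves a read only once it has applied at least that position. All class and function names are hypothetical.

```python
class Leader:
    def __init__(self):
        self.log = []   # ordered (key, value) entries
        self.data = {}  # current state on the leader

    def write(self, key, value):
        self.log.append((key, value))
        self.data[key] = value
        return len(self.log)  # token: log position of this write


class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0  # highest log position applied so far

    def apply_from(self, leader, upto):
        # Apply pending entries up to position `upto`, in log order.
        upto = min(upto, len(leader.log))
        for key, value in leader.log[self.applied:upto]:
            self.data[key] = value
        self.applied = max(self.applied, upto)


def read(key, token, replica, leader):
    """Serve from the replica only if it has caught up to the caller's token."""
    if replica.applied >= token:
        return replica.data.get(key), "replica"
    # Replica is behind this client's last write: fall back to the leader.
    return leader.data.get(key), "leader"


leader, replica = Leader(), Replica()
token = leader.write("profile:42", "new-avatar")

print(read("profile:42", token, replica, leader))  # replica has not caught up
replica.apply_from(leader, upto=token)
print(read("profile:42", token, replica, leader))  # replica is now safe to use
```

Clients that hold no token, or an old one, can still be served by lagging replicas, so most read traffic stays off the leader; only reads that immediately follow a write pay for the fallback.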
