Replication & Consistency • Leader-Follower Replication • Easy • ⏱️ ~3 min
Leader-Follower Replication Architecture and Write Path
Leader-follower replication is a distributed data architecture in which a single leader node coordinates all writes to enforce a total order of updates, while follower nodes replicate those changes to serve read traffic and provide redundancy. When a client submits a write, it goes exclusively to the leader, which appends the operation to a replication log with a monotonically increasing Log Sequence Number (LSN) or offset, persists it locally (typically with fsync), and then ships the log entry to followers. Followers apply these entries in the exact order received so that they converge to the leader's state. This design eliminates write conflicts by construction, since only one node accepts writes.
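To make the write path concrete, here is a minimal, illustrative Python sketch; the Leader, Follower, and LogEntry names are hypothetical and not any particular system's API.

```python
# Minimal single-leader replication sketch (illustrative only; names are hypothetical).
from dataclasses import dataclass, field

@dataclass
class LogEntry:
    lsn: int          # monotonically increasing log sequence number
    operation: str    # the write to apply, e.g. "SET balance=100"

@dataclass
class Follower:
    log: list = field(default_factory=list)
    applied_lsn: int = 0

    def replicate(self, entry: LogEntry) -> int:
        # Followers apply entries strictly in LSN order to converge on the leader's state.
        assert entry.lsn == self.applied_lsn + 1, "gap or reorder detected"
        self.log.append(entry)
        self.applied_lsn = entry.lsn
        return self.applied_lsn  # acknowledgment back to the leader

@dataclass
class Leader:
    followers: list
    log: list = field(default_factory=list)
    next_lsn: int = 1

    def write(self, operation: str) -> LogEntry:
        # 1. Append to the local replication log (a real system would fsync here).
        entry = LogEntry(lsn=self.next_lsn, operation=operation)
        self.log.append(entry)
        self.next_lsn += 1
        # 2. Ship the entry to followers and collect acknowledgments.
        acks = [f.replicate(entry) for f in self.followers]
        # 3. Declare the write committed once the acknowledgment policy is satisfied
        #    (here: every follower has applied the entry).
        assert all(a >= entry.lsn for a in acks)
        return entry

leader = Leader(followers=[Follower(), Follower()])
leader.write("SET balance=100")
```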
The critical control flow is: client writes to the leader → leader appends to its log and persists locally → leader ships log entries to followers → followers apply them in order → leader declares the write committed based on its acknowledgment policy. Reads can be served from the leader for strong consistency or from followers for read scalability, accepting potential staleness equal to the replication lag (typically milliseconds on healthy networks, but it can stretch to seconds under load). At production scale, systems partition data into shards, each with its own leader and follower set. For example, Kafka splits topics into partitions, each with one leader that can handle writes exceeding 100,000 messages per second per partition, while Elasticsearch splits indexes into shards whose primaries commonly process over 100,000 documents per second across a cluster.
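The read side can be pictured with a small routing sketch; the staleness budget and fallback policy below are assumptions for illustration, not the behavior of any specific system.

```python
import random

# Illustrative read routing: strongly consistent reads go to the leader,
# scalable reads go to a follower as long as its replication lag is tolerable.
MAX_TOLERABLE_LAG_MS = 500  # assumed application-level staleness budget

def route_read(require_strong: bool, leader: str, followers: list, lag_ms: dict) -> str:
    if require_strong:
        return leader  # always fresh, but adds load to the leader
    fresh_enough = [f for f in followers if lag_ms[f] <= MAX_TOLERABLE_LAG_MS]
    # Fall back to the leader if every follower is too far behind.
    return random.choice(fresh_enough) if fresh_enough else leader

followers = ["replica-1", "replica-2"]
lag_ms = {"replica-1": 40, "replica-2": 900}
print(route_read(False, "leader", followers, lag_ms))  # -> "replica-1"
```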
The leader must handle several responsibilities beyond just processing writes: tracking which followers are caught up, deciding when a write is durable enough to acknowledge to clients, detecting failed followers, and coordinating log compaction or cleanup. Followers must apply changes efficiently, detect when they fall too far behind, and participate in leader election when the current leader fails. This clear separation of responsibilities makes the system simpler to reason about than multi-leader or leaderless designs, but it creates a single point of coordination that can become a bottleneck.
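A hedged sketch of that leader-side bookkeeping: per-follower progress tracking, a majority-acknowledgment commit point, and a lag alert. The threshold and the majority policy are illustrative assumptions, not a specific system's rules.

```python
# Hypothetical leader-side bookkeeping; thresholds and policy are illustrative assumptions.
LAG_ALERT_ENTRIES = 1000   # flag a follower once it trails the leader by this many entries

def commit_index(leader_lsn: int, follower_acked: dict) -> int:
    """Highest LSN acknowledged by a majority of the replica set (leader counts as one member)."""
    acked = sorted(list(follower_acked.values()) + [leader_lsn], reverse=True)
    majority = len(acked) // 2 + 1
    return acked[majority - 1]

def lagging_followers(leader_lsn: int, follower_acked: dict) -> list:
    """Followers that have fallen far enough behind to need catch-up or removal."""
    return [name for name, lsn in follower_acked.items()
            if leader_lsn - lsn > LAG_ALERT_ENTRIES]

acked = {"follower-1": 4200, "follower-2": 4199, "follower-3": 2900}
print(commit_index(4201, acked))        # 4199: held by the leader plus two followers (3 of 4)
print(lagging_followers(4201, acked))   # ['follower-3']
```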
💡 Key Takeaways
•Single leader per shard enforces total write order and eliminates conflicts by construction, trading write scalability for simplicity. Systems like MongoDB and PostgreSQL use one leader per dataset or shard.
•Leader assigns monotonically increasing LSN or offset to each write, persists locally, then replicates to followers. Kafka leaders handle over 100,000 messages/sec per partition with this model.
•Followers apply log entries in exact order to converge to leader state. Replication lag typically ranges from sub-millisecond on a LAN to seconds under heavy load or network issues.
•Reads from leader provide strong consistency while reads from followers trade freshness for scalability. Elasticsearch clusters commonly serve 10x more read traffic from replicas than from primaries.
•Horizontal write scaling requires sharding data across multiple leaders, each managing a subset of keys. This works well for partitionable workloads but complicates cross-shard transactions.
•Leader is the single point of coordination for its shard. If the leader fails, writes halt until a new leader is elected, typically taking 5 to 30 seconds in production systems for detection, election, and client rerouting (see the sketch after this list).
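As a back-of-the-envelope illustration of where those seconds go under heartbeat-based failure detection, the numbers below are assumptions, not measurements of any system.

```python
# Rough failover timeline under heartbeat-based detection (all values are assumptions).
HEARTBEAT_INTERVAL_S = 2.0      # leader heartbeats followers this often
MISSED_BEATS_TO_SUSPECT = 3     # followers suspect the leader after this many missed beats
ELECTION_S = 2.0                # time to run the election and have a new leader step up
CLIENT_REROUTE_S = 3.0          # time for clients to discover and connect to the new leader

detection_s = HEARTBEAT_INTERVAL_S * MISSED_BEATS_TO_SUSPECT
total_write_outage_s = detection_s + ELECTION_S + CLIENT_REROUTE_S
print(f"writes unavailable for roughly {total_write_outage_s:.0f}s")  # ~11s
```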
📌 Examples
PostgreSQL streaming replication: Primary appends to its Write-Ahead Log (WAL), the walreceiver process on the standby fetches WAL over a TCP connection, and the startup process applies the changes to data files. With synchronous_commit set to 'on' and one synchronous standby, commit latency increases by one Round Trip Time (RTT) to the standby, typically 1 to 5 ms within the same datacenter.
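One hedged way to observe this pipeline from the primary is to query the standard pg_stat_replication view, for example via psycopg2; the connection string below is a placeholder.

```python
# Inspect standby progress from the primary; pg_stat_replication is a standard
# PostgreSQL view. The connection parameters are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=postgres user=postgres host=primary.example")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name, state, sync_state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication;
    """)
    for name, state, sync_state, lag_bytes in cur.fetchall():
        # sync_state is 'sync' for a synchronous standby, 'async' otherwise.
        print(f"{name}: {state}/{sync_state}, ~{lag_bytes} bytes behind")
```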
Kafka partition leadership: Each partition has one leader broker and multiple follower brokers forming the In-Sync Replica (ISR) set. The producer sends a record to the leader, the leader appends it to its local log, and followers fetch it via the replication protocol. With acks=all, the leader waits for all ISR members to acknowledge before confirming to the producer. LinkedIn reported handling millions of messages per second across clusters with replication factor 3 and intra-datacenter p99 latency in the tens of milliseconds.
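A minimal producer sketch with acks=all using the confluent-kafka Python client; the broker address, topic name, and payload are placeholders.

```python
# Producer that waits for the full ISR to acknowledge each record.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "broker-1:9092",  # placeholder broker address
    "acks": "all",                 # leader confirms only after all in-sync replicas have the record
    "enable.idempotence": True,    # avoids duplicates on retries (requires acks=all)
})

def on_delivery(err, msg):
    if err is not None:
        print(f"delivery failed: {err}")
    else:
        print(f"committed to {msg.topic()}[{msg.partition()}] at offset {msg.offset()}")

producer.produce("orders", key=b"order-42", value=b'{"amount": 100}', on_delivery=on_delivery)
producer.flush()  # block until the broker has acknowledged (or reported an error)
```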
MongoDB replica set: Primary accepts writes and appends them to the oplog (operations log); secondaries tail the oplog and apply the operations. With writeConcern majority, the primary waits until a majority of voting members have replicated the write before acknowledging. The default election timeout is 10 seconds; typical failover completes in 10 to 20 seconds including detection and the new primary stepping up.
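A small PyMongo sketch of a majority-acknowledged write; the connection string, database, and collection names are placeholders, and the wtimeout value is an illustrative assumption.

```python
# Write acknowledged only after a majority of voting members have replicated it.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://primary.example:27017/?replicaSet=rs0")  # placeholder URI
orders = client.shop.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority", wtimeout=5000),  # 5 s timeout is an assumed budget
)

# insert_one returns only after the majority acknowledgment (or raises on wtimeout).
result = orders.insert_one({"order_id": 42, "amount": 100})
print("replicated to a majority, _id =", result.inserted_id)
```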