Production Implementation: Write Path, Reads, and Operational Observability
The steady-state write path in both Raft and Multi-Paxos follows a carefully orchestrated sequence designed to balance safety, latency, and throughput. When a client submits a write, the leader first appends the entry to its local write-ahead log and fsyncs it for durability, then sends AppendEntries (Raft) or Accept (Paxos) messages to all followers in parallel. Once a majority acknowledge persistence, the leader commits the entry, returns success to the client, and asynchronously advances the commit index on followers so they can apply the entry to their state machines. Pipelining multiple client requests before waiting for acknowledgments and batching entries to amortize fsync cost are critical optimizations: without batching, fsync latency limits throughput to approximately 500 to 1000 operations per second on typical hardware, while 10 to 50 millisecond batch windows can raise throughput to 5000 to 10000 operations per second while keeping p99 latency acceptable.
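As a concrete illustration of that sequence, the Go sketch below walks one entry through the leader-side path: local append and fsync, parallel replication, commit once a majority of the cluster has persisted the entry, and an immediate client response. The `Entry`, `Peer`, and `Leader` types and the quorum arithmetic are illustrative assumptions, not code from any real Raft or Paxos implementation.

```go
// Illustrative leader-side write path: append locally, fsync, replicate to
// followers in parallel, and commit once a majority of the cluster (counting
// the leader) has persisted the entry. All types and helpers are hypothetical.
package consensus

import "sync"

type Entry struct {
	Term  uint64
	Index uint64
	Data  []byte
}

// Peer abstracts a follower; AppendEntries returns true once the follower
// has persisted (fsynced) the entries.
type Peer interface {
	AppendEntries(entries []Entry) bool
}

type Leader struct {
	mu          sync.Mutex
	log         []Entry
	commitIndex uint64
	peers       []Peer
	fsync       func(entries []Entry) error // durably writes entries to the local WAL
}

// Propose runs one write through the steady-state path and reports whether
// the entry committed. In practice entries are batched and pipelined rather
// than handled one at a time.
func (l *Leader) Propose(e Entry) bool {
	// 1. Append to the local log and fsync before replicating.
	l.mu.Lock()
	l.log = append(l.log, e)
	l.mu.Unlock()
	if err := l.fsync([]Entry{e}); err != nil {
		return false
	}

	// 2. Replicate to all followers in parallel.
	acks := make(chan bool, len(l.peers))
	for _, p := range l.peers {
		go func(p Peer) { acks <- p.AppendEntries([]Entry{e}) }(p)
	}

	// 3. Commit once a majority of the cluster has persisted the entry;
	//    the leader's own fsync counts as one vote.
	needed := (len(l.peers)+1)/2 + 1
	got := 1
	for i := 0; i < len(l.peers) && got < needed; i++ {
		if <-acks {
			got++
		}
	}
	if got < needed {
		return false
	}
	l.mu.Lock()
	if e.Index > l.commitIndex {
		l.commitIndex = e.Index
	}
	l.mu.Unlock()
	// 4. Respond to the client now; followers learn the new commit index
	//    asynchronously on subsequent AppendEntries or heartbeats.
	return true
}
```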
Read path implementation depends on consistency requirements and directly impacts both latency and availability. Linearizable reads require proving that the leader still holds authority, typically via a read-index mechanism, where the leader confirms with a quorum that no new leader has been elected, or via leader leases, where the leader serves reads locally as long as its lease has not expired and clocks are sufficiently synchronized. Google Spanner uses TrueTime's bounded uncertainty windows to provide linearizable reads without quorum checks, relying on GPS and atomic-clock hardware to bound clock skew to single-digit milliseconds. For applications that tolerate bounded staleness, follower reads offer much lower latency: any replica can serve data with a timestamp guarantee, such as reads at least 5 seconds old, trading recency for geographic distribution and load balancing.
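The read-index mechanism described above can be sketched as three steps: capture the current commit index, confirm leadership with a quorum heartbeat, and wait for the state machine to apply up to that index before reading locally. The Go sketch below shows this flow under assumed helper functions (`commitIndex`, `quorumCheck`, `waitApplied`); it is a simplified outline, not an actual library API.

```go
// Illustrative read-index flow for linearizable reads served by the leader
// without writing to the log. Types and helpers are hypothetical.
package consensus

import "errors"

type ReadLeader struct {
	commitIndex  func() uint64           // current commit index
	appliedIndex func() uint64           // highest index applied to the state machine
	quorumCheck  func() bool             // heartbeat round confirming we are still leader
	waitApplied  func(idx uint64)        // blocks until appliedIndex >= idx
	readState    func(key string) []byte // reads from the local state machine
}

var errNotLeader = errors.New("leadership lost; client should retry against the new leader")

// LinearizableRead returns a value that reflects every write committed
// before the read began.
func (l *ReadLeader) LinearizableRead(key string) ([]byte, error) {
	// Snapshot the commit index before checking leadership.
	readIndex := l.commitIndex()

	// A quorum of heartbeat responses proves no newer leader has been elected,
	// so serving at readIndex cannot miss a committed write.
	if !l.quorumCheck() {
		return nil, errNotLeader
	}

	// Wait for the local state machine to catch up to the read index,
	// then serve the read without touching the log.
	l.waitApplied(readIndex)
	return l.readState(key), nil
}
```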
Operational observability is essential for maintaining production consensus clusters. Key metrics include commit latency at the p50, p95, and p99 percentiles; alert when p99 exceeds 100 milliseconds within a local area network or 500 milliseconds cross-region, as either indicates degradation. Election rate should remain under 1 per minute in steady state; frequent elections indicate timeouts caused by garbage collection pauses, slow disks, or network issues. Track fsync latency separately and alert when its p99 exceeds 50 milliseconds, which suggests disk degradation. Monitor replication lag between leader and followers; lag exceeding 10 seconds indicates a serious issue requiring manual intervention. Snapshot and compaction times should be tracked to ensure background operations do not interfere with foreground latency. Comparing the applied index with the committed index reveals whether state machines are falling behind in applying committed entries, which causes memory pressure from buffered entries.
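To make those thresholds concrete, the sketch below collects them into a single structure and runs a simple evaluation pass over a metrics snapshot. The threshold values mirror the numbers in the text; the `MetricsSnapshot` and `Evaluate` names are hypothetical, and a real deployment would encode these as alerting rules in its monitoring system rather than in application code.

```go
// Illustrative alert thresholds for a consensus cluster and a simple
// evaluation pass over a point-in-time metrics snapshot.
package consensus

import "time"

type MetricsSnapshot struct {
	CommitLatencyP99 time.Duration
	FsyncLatencyP99  time.Duration
	ElectionsPerMin  float64
	ReplicationLag   time.Duration
	CommittedIndex   uint64
	AppliedIndex     uint64
}

type Thresholds struct {
	CommitP99LAN    time.Duration // within a local area network
	CommitP99WAN    time.Duration // cross-region
	FsyncP99        time.Duration
	ElectionsPerMin float64
	ReplicationLag  time.Duration
}

// DefaultThresholds mirrors the values discussed in the text.
var DefaultThresholds = Thresholds{
	CommitP99LAN:    100 * time.Millisecond,
	CommitP99WAN:    500 * time.Millisecond,
	FsyncP99:        50 * time.Millisecond,
	ElectionsPerMin: 1,
	ReplicationLag:  10 * time.Second,
}

// Evaluate returns a human-readable alert for each metric over its threshold.
func Evaluate(m MetricsSnapshot, t Thresholds, crossRegion bool) []string {
	var alerts []string
	commitLimit := t.CommitP99LAN
	if crossRegion {
		commitLimit = t.CommitP99WAN
	}
	if m.CommitLatencyP99 > commitLimit {
		alerts = append(alerts, "p99 commit latency above threshold")
	}
	if m.FsyncLatencyP99 > t.FsyncP99 {
		alerts = append(alerts, "p99 fsync latency above threshold: possible disk degradation")
	}
	if m.ElectionsPerMin > t.ElectionsPerMin {
		alerts = append(alerts, "frequent elections: check GC pauses, disks, and network")
	}
	if m.ReplicationLag > t.ReplicationLag {
		alerts = append(alerts, "replication lag exceeds threshold: manual intervention likely needed")
	}
	if m.CommittedIndex > m.AppliedIndex {
		// A real alert would require the gap to persist or exceed a size
		// threshold; transient gaps are normal.
		alerts = append(alerts, "applied index behind committed index: state machine falling behind")
	}
	return alerts
}
```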
💡 Key Takeaways
• Steady-state writes follow leader append and fsync, parallel follower replication, commit at majority acknowledgment, then client response, with pipelining and batching essential for throughput beyond 500 to 1000 operations per second
• Batching entries into 10 to 50 millisecond windows amortizes fsync cost to sustain 5000 to 10000 operations per second, but longer batches degrade p99 latency and can cause client timeout cascades
• Linearizable reads require read-index quorum checks or leader leases with bounded clock skew; Google Spanner uses TrueTime with GPS and atomic clocks to bound uncertainty to single-digit milliseconds, enabling local reads
• Follower reads with bounded staleness, such as at least 5 seconds old, provide lower latency and geographic distribution for read-heavy workloads without contacting the leader or requiring quorum checks
• Critical observability metrics include p99 commit latency with 100 millisecond local area network and 500 millisecond cross-region thresholds, election rate under 1 per minute, and fsync p99 under 50 milliseconds
• Log compaction and snapshot generation must be rate limited and isolated from foreground I/O paths; heavy compaction can delay leader heartbeats, triggering elections and causing write unavailability
📌 Examples
etcd implements batching by accumulating entries for up to 10 milliseconds or until 10000 entries before performing fsync and replication. This increases throughput from approximately 1000 to 5000 operations per second while keeping p99 latency under 50 milliseconds in low latency environments. The batch interval is tunable based on latency sensitivity.
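The accumulate-then-flush pattern described here can be sketched as a small batcher that flushes when either a count limit or a delay limit is reached. This is an illustrative reimplementation of the pattern, not etcd's actual code; the `Batcher` type, its parameters, and the `flush` callback are assumptions.

```go
// Illustrative batch accumulator: entries are buffered and flushed (one fsync
// plus replication per batch) when the batch reaches maxEntries or when
// maxDelay has elapsed since the first buffered entry.
package consensus

import (
	"sync"
	"time"
)

type Batcher struct {
	mu         sync.Mutex
	buf        [][]byte
	maxEntries int
	maxDelay   time.Duration
	timer      *time.Timer
	flush      func(batch [][]byte) // performs fsync and replication for a whole batch
}

func NewBatcher(maxEntries int, maxDelay time.Duration, flush func([][]byte)) *Batcher {
	return &Batcher{maxEntries: maxEntries, maxDelay: maxDelay, flush: flush}
}

// Add buffers one entry and flushes when either limit is hit.
func (b *Batcher) Add(entry []byte) {
	b.mu.Lock()
	defer b.mu.Unlock()

	b.buf = append(b.buf, entry)
	if len(b.buf) == 1 {
		// First entry of a new batch: start the delay timer.
		b.timer = time.AfterFunc(b.maxDelay, b.onTimer)
	}
	if len(b.buf) >= b.maxEntries {
		b.timer.Stop()
		b.doFlush()
	}
}

func (b *Batcher) onTimer() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.doFlush()
}

// doFlush must be called with b.mu held.
func (b *Batcher) doFlush() {
	if len(b.buf) == 0 {
		return // already flushed by the count trigger
	}
	batch := b.buf
	b.buf = nil
	b.flush(batch) // one fsync amortized over the whole batch
}
```

Constructing it as `NewBatcher(10000, 10*time.Millisecond, replicateAndFsync)` mirrors the limits quoted above; lowering `maxDelay` trades throughput for tighter p99 latency.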
Google Chubby heavily optimizes for reads through client-side caching with leases. Clients cache lock ownership and file metadata with renewable leases, avoiding master contact for repeated reads. Only lock acquisitions, releases, and lease renewals contact the master, keeping it available despite serving thousands of clients. Cache invalidations are pushed via keepalive messages.
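A lease-backed client cache of the kind described can be sketched as follows: reads are served locally while the lease is valid, and the master is contacted only on a miss, an expired lease, or after a pushed invalidation. The `CachingClient` type and the master's `Read` signature are hypothetical, not Chubby's actual protocol.

```go
// Illustrative lease-backed client cache: repeated reads never touch the
// master while the lease is valid; invalidations pushed on keepalives evict
// entries so the next read refreshes from the master.
package consensus

import (
	"sync"
	"time"
)

type cacheEntry struct {
	value       []byte
	leaseExpiry time.Time
}

type CachingClient struct {
	mu    sync.Mutex
	cache map[string]cacheEntry
	// Hypothetical master API: returns the value and a lease duration during
	// which the client may serve it from cache without contacting the master.
	master interface {
		Read(path string) (value []byte, lease time.Duration, err error)
	}
}

// Get serves from cache while the lease is valid, otherwise refreshes from the master.
func (c *CachingClient) Get(path string) ([]byte, error) {
	c.mu.Lock()
	e, ok := c.cache[path]
	c.mu.Unlock()
	if ok && time.Now().Before(e.leaseExpiry) {
		return e.value, nil // served locally, no master round trip
	}

	value, lease, err := c.master.Read(path)
	if err != nil {
		return nil, err
	}
	c.mu.Lock()
	if c.cache == nil {
		c.cache = make(map[string]cacheEntry)
	}
	c.cache[path] = cacheEntry{value: value, leaseExpiry: time.Now().Add(lease)}
	c.mu.Unlock()
	return value, nil
}

// Invalidate is called when the master pushes an invalidation on a keepalive.
func (c *CachingClient) Invalidate(path string) {
	c.mu.Lock()
	delete(c.cache, path)
	c.mu.Unlock()
}
```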
A production CockroachDB cluster monitoring dashboard tracks per-range commit latency heatmaps that reveal hot ranges with elevated p99. Operators use this to identify and split contended ranges, such as monotonically increasing primary keys that funnel all writes into a single range. Replication lag graphs show when followers fall behind due to snapshots or slow disks, triggering investigation before data loss occurs.