Replication & Consistency • Multi-Leader Replication
Implementation Details: Write Pipeline, Monitoring, and Capacity Planning
The write and replication pipeline in production multi-leader systems follows a consistent pattern. Leaders persist writes locally to a write-ahead log (WAL) for durability before acknowledging to clients, achieving single-digit-millisecond local latency. Each write is assigned version metadata: an origin ID (which leader or Region), a monotonic logical sequence number per origin, and causality markers (vector clock entries or hybrid logical clock timestamps). Replication to peer leaders happens asynchronously and at least once, meaning the same operation may arrive multiple times during retries after transient network failures. This requires idempotency: each operation carries a unique operation ID, and replicas use upsert semantics or check operation IDs against a deduplication window (typically sized to cover the maximum expected replication lag plus a safety margin, such as 10 minutes) to avoid applying duplicate operations.
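As a minimal sketch of the apply-side idempotency described above (the `ReplicaApplier` class, its in-memory `seen_ops` map, and the window constant are illustrative assumptions, not any particular database's implementation):

```python
import time

DEDUP_WINDOW_SECONDS = 600  # max expected replication lag plus a safety margin (~10 minutes)

class ReplicaApplier:
    """Applies replicated operations that may arrive more than once (at-least-once delivery)."""

    def __init__(self):
        self.store = {}     # key -> (value, version_metadata)
        self.seen_ops = {}  # operation_id -> arrival time; this is the deduplication window

    def apply(self, op_id, key, value, version_metadata, now=None):
        now = time.time() if now is None else now
        self._expire(now)
        if op_id in self.seen_ops:
            return False    # duplicate delivery during a retry; already applied, so ignore
        self.seen_ops[op_id] = now
        self.store[key] = (value, version_metadata)  # upsert semantics
        return True

    def _expire(self, now):
        cutoff = now - DEDUP_WINDOW_SECONDS
        self.seen_ops = {op: ts for op, ts in self.seen_ops.items() if ts >= cutoff}
```

A real system would likely persist the deduplication state alongside the data so that a replica restart does not reopen the window.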
Conflict detection happens on the apply path when the incoming write's causality marker does not descend from the local replica's current version for that key. For example, if the local replica holds version vector {Tokyo: 5, London: 3} and receives a write from London with version {Tokyo: 4, London: 4}, neither version dominates (the local vector is ahead on Tokyo but behind on London), indicating a conflict. The system then invokes the configured resolution strategy: for LWW, keep the write with the higher timestamp, using a deterministic tiebreaker; for CRDTs, merge state (union for sets, per-origin maximum for counters); for application-level merge, invoke custom logic. The merged result is persisted with an updated version vector that dominates both conflicting versions, ensuring convergence.
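The dominance check and an LWW-style resolution might look like the following sketch, using the Tokyo/London vectors from the example above (the dictionary layout and function names are assumptions for illustration):

```python
def dominates(a, b):
    """True if version vector a has seen at least everything in b (component-wise >=)."""
    return all(a.get(origin, 0) >= n for origin, n in b.items())

def resolve_lww(local, incoming):
    """Apply-path resolution: keep the causal winner, else last-writer-wins on concurrent writes."""
    if dominates(incoming["vv"], local["vv"]):
        return incoming                                  # incoming causally supersedes local
    if dominates(local["vv"], incoming["vv"]):
        return local                                     # stale or duplicate replay; keep local
    # Concurrent writes: neither vector dominates, so pick the higher timestamp
    # (origin name as deterministic tiebreaker) and merge the vectors so the stored
    # result dominates both sides and every replica converges to the same state.
    winner = max(local, incoming, key=lambda w: (w["ts"], w["origin"]))
    merged = {o: max(local["vv"].get(o, 0), incoming["vv"].get(o, 0))
              for o in local["vv"].keys() | incoming["vv"].keys()}
    return {**winner, "vv": merged}

local    = {"value": "cart-A", "ts": 1700000000123, "origin": "Tokyo",  "vv": {"Tokyo": 5, "London": 3}}
incoming = {"value": "cart-B", "ts": 1700000000456, "origin": "London", "vv": {"Tokyo": 4, "London": 4}}
print(resolve_lww(local, incoming))  # conflict detected; LWW keeps cart-B with vv {Tokyo: 5, London: 4}
```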
Monitoring and capacity planning are critical for operational health. Track replication lag per peer as sequence-number distance rather than just time, because time-based lag can hide queueing: a 100 ms lag might represent 10 operations or 10,000 operations depending on write rate. Alert when p95 and p99 lag exceed your staleness Service Level Objective (SLO), such as 500 ms for social media timelines or 5 seconds for profile updates. Measure conflict rate per keyspace and per Region pair; spikes indicate hot keys, skewed routing, or clock issues. Observe end-to-end convergence time by writing a test record in Region A and polling Region B until it is visible, recording payload size to correlate with WAN saturation. For cross-Region transfer capacity planning, egress roughly equals write rate × average item size × (number of Regions − 1), times a multiplier for secondary indexes and change streams. A worked example: 10,000 writes per second of 1 KB with 2 global secondary indexes across 3 Regions produces 10,000 × 1 KB × 2 peers × 3 (base plus two indexes) = 60 MB per second of egress, approximately 5 TB per day. Provision WAN bandwidth with 50 to 100 percent headroom for p99 bursts and retransmissions, and throttle replication so it does not starve foreground user traffic.
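The egress estimate reduces to a small helper; this sketch reproduces the 60 MB/s worked example above using decimal units (1 MB = 1000 KB), with function and parameter names chosen for illustration:

```python
def cross_region_egress_mb_s(writes_per_s, item_kb, regions, gsi_count=0, stream_consumers=0):
    """Baseline cross-Region egress: write rate x item size x (Regions - 1),
    times (1 + secondary indexes + change-stream consumers)."""
    peers = regions - 1
    multiplier = 1 + gsi_count + stream_consumers
    return writes_per_s * item_kb * peers * multiplier / 1000  # KB/s -> MB/s

mb_s = cross_region_egress_mb_s(writes_per_s=10_000, item_kb=1, regions=3, gsi_count=2)
tb_day = mb_s * 86_400 / 1_000_000
print(f"{mb_s:.0f} MB/s (~{tb_day:.1f} TB/day)")  # 60 MB/s, ~5.2 TB/day
provisioned = mb_s * 1.5                          # 50% headroom for p99 bursts and retransmissions
```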
💡 Key Takeaways
• Leaders persist writes to a write-ahead log (WAL) before acknowledging clients, then assign per-origin monotonic sequence numbers and causality metadata (vector clock or HLC) for conflict detection during asynchronous replication
• Replication is at-least-once and therefore requires idempotency; operations carry unique IDs and replicas deduplicate within a window covering maximum lag plus a margin (typically 10 minutes) to prevent duplicate side effects during retries
• Conflict detection compares causality markers on apply: if the incoming version vector {Tokyo: 4, London: 4} and the local version {Tokyo: 5, London: 3} are incomparable (neither dominates), a conflict exists requiring resolution logic
• Monitor replication lag as sequence-number distance, not just time lag; 100 ms of time lag at 100 ops per second is 10 operations behind, but at 10,000 ops per second it is 1,000 operations behind, with very different recovery characteristics (see the sketch after this list)
• Cross-Region egress capacity planning: baseline is write rate × item size × (Regions − 1); multiply by (1 + number of global secondary indexes + change stream consumers); 10k writes per second × 1 KB × 2 peers × 3 (base plus 2 indexes) = 60 MB per second, or about 5 TB per day
• Alert on conflict rate spikes per keyspace indicating hot keys or clock skew; measure end-to-end convergence time (write in Region A, detect in Region B) correlated with payload size to identify WAN saturation before it impacts user experience
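A sketch of lag tracked as sequence-number distance, as the fourth takeaway above suggests (the `PeerLag` structure and the example numbers are illustrative, not a specific system's metrics):

```python
from dataclasses import dataclass

@dataclass
class PeerLag:
    peer: str
    local_highest_seq: int    # highest sequence number assigned locally for this origin
    peer_applied_seq: int     # last sequence number the peer acknowledged applying
    write_rate_ops_s: float   # recent local write rate

    @property
    def lag_ops(self):
        return self.local_highest_seq - self.peer_applied_seq

    @property
    def lag_seconds(self):
        return self.lag_ops / self.write_rate_ops_s if self.write_rate_ops_s else float("inf")

# Same 100 ms of time lag, very different backlogs to drain:
low  = PeerLag("eu-west-1", local_highest_seq=1_000,     peer_applied_seq=990,     write_rate_ops_s=100)
high = PeerLag("eu-west-1", local_highest_seq=1_000_000, peer_applied_seq=999_000, write_rate_ops_s=10_000)
for lag in (low, high):
    print(f"{lag.peer}: {lag.lag_ops} ops behind (~{lag.lag_seconds * 1000:.0f} ms)")
```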
📌 Examples
AWS DynamoDB Global Tables replication: a write to us-east-1 is persisted to the local write-ahead log in under 5 ms, assigned a microsecond timestamp and Region origin ID, then streamed via DynamoDB Streams to replication workers; workers apply it to eu-west-1 with an idempotency check against a 10-minute deduplication window using the operation ID
Amazon Dynamo internal monitoring: each node tracks per-peer sequence-number watermarks; lag is measured as the local highest sequence minus the peer's last applied sequence; an alert fires when p99 lag exceeds 1,000 operations or 5 seconds for any peer, triggering investigation of network saturation or downstream throttling
Capacity planning for an e-commerce catalog: 50,000 product updates per second averaging 2 KB each across 4 Regions with 3 global secondary indexes produces 50k × 2 KB × 3 peers × 4 (base plus 3 indexes) = 1.2 GB per second of cross-Region egress; at $0.02 per GB for inter-Region transfer, that is approximately $2,000 per day in replication costs alone
Convergence time SLO: a social media post written in us-west-2 must appear in timeline reads from ap-southeast-1 within a p99 of 800 ms; monitoring continuously writes test posts and measures cross-Region visibility, alerting when p99 exceeds 800 ms, which indicates replication lag or network degradation
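A convergence probe like the one in the last example can be written against any store by injecting Region-specific read/write hooks; this sketch is a generic illustration, not a particular SDK's API:

```python
import time
import uuid

def measure_convergence(write_in_region_a, read_in_region_b, timeout_s=5.0, poll_interval_s=0.05):
    """Write a canary record in one Region and poll another until it becomes visible.
    write_in_region_a(key, value) and read_in_region_b(key) are caller-supplied hooks
    around whatever store is being probed."""
    key, value = f"canary-{uuid.uuid4()}", str(time.time())
    start = time.monotonic()
    write_in_region_a(key, value)
    while time.monotonic() - start < timeout_s:
        if read_in_region_b(key) == value:
            return time.monotonic() - start   # convergence time in seconds
        time.sleep(poll_interval_s)
    return None                                # did not converge within the probe window

# Feed samples into a latency histogram and alert when p99 exceeds the staleness SLO (e.g. 800 ms).
```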