Message Queues & Streaming › Kafka/Event Streaming Architecture · Medium · ⏱️ ~3 min

Replication Factor and Acknowledgment Modes: Durability vs Latency Tradeoffs

Every partition is replicated across multiple brokers for durability, with replication factor 3 across failure domains (racks or availability zones) as the production standard. One replica is elected leader and handles all writes and, by default, all reads; followers continuously fetch from the leader to stay current. The replicas that are caught up within a configurable lag threshold form the in-sync replica (ISR) set; a healthy partition has an ISR size equal to the replication factor, meaning every replica is current.

Producers configure acknowledgment semantics to trade latency for safety. Leader-only acks (acks=1) return success once the leader writes to its local log, achieving produce latencies around 5 to 10 milliseconds within a single availability zone but risking data loss if the leader fails before followers replicate. Quorum acks (acks=all: every ISR member must acknowledge) raise latency to 10 to 20 milliseconds within a zone while the leader waits for follower confirmation, but guarantee durability as long as at least one ISR replica survives. Cross-availability-zone replication adds 1 to 5 milliseconds; cross-region replication adds 50 to 200 milliseconds, driven by wide-area-network round-trip time.

The failure mode to watch is ISR shrink: followers that fall behind due to network issues or overload drop out of the ISR. When the ISR falls below the configured minimum (often 2 out of 3), writes with quorum acks block entirely, triggering producer timeouts and unavailability, so ISR count per partition and under-replicated partition metrics must be monitored closely. The alternative, unclean leader election (promoting an out-of-sync replica), can truncate already acknowledged writes and violate durability promises. In practice, most teams accept quorum acks and size clusters to maintain ISR health through normal load spikes. LinkedIn and Netflix both use replication factor 3 with quorum acks as the default, prioritizing correctness over the last few milliseconds of latency. OpenAI's multi-primary design accepts leader-only acks in exchange for higher availability, then handles rare duplicates and gaps downstream with idempotency and deduplication logic.
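As a sketch of how these tradeoffs surface in producer configuration, assuming the standard Java client; the broker addresses, topic name, and payload below are illustrative placeholders, not taken from any of the systems mentioned above:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker list; replace with your cluster.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092,broker-2:9092,broker-3:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Quorum acks: the leader waits for all in-sync replicas before acknowledging.
        // Switching to "1" (leader-only) lowers latency but risks loss on leader failure.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        // Idempotent producer: per-producer sequence numbers let brokers drop duplicate retries.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        // Bound the total time a send may spend retrying before it fails.
        props.put(ProducerConfig.DELIVERY_TIMEOUT_MS_CONFIG, "120000");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payment-events", "order-123", "{\"amountCents\":4200}"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            // Durability-first posture: surface failures instead of dropping them silently.
                            exception.printStackTrace();
                        }
                    });
        }
    }
}
```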
💡 Key Takeaways
Replication factor 3 across failure domains is the standard: it survives the loss of a single broker or availability zone. Higher factors (5) are rare due to cost and write amplification, and are used mainly for critical metadata topics.
ISR health determines availability: when the ISR shrinks below the min.insync.replicas setting (typically 2), quorum writes block. Monitor under-replicated partitions and ISR shrink/expand rates, and alert on any partition whose ISR stays below the replication factor for more than 60 seconds; see the ISR health-check sketch after this list.
Cross-availability-zone replication adds 1 to 5 milliseconds; cross-region adds 50 to 200 milliseconds. Many teams run regional clusters and replicate asynchronously across regions to avoid paying synchronous wide-area-network latency on every write.
Unclean leader election trades correctness for availability: promoting an out-of-sync replica restores writes sooner but can truncate acknowledged data. Disable unclean election in production for topics requiring durability (financial transactions, orders) and accept unavailability until an ISR replica recovers; the topic-configuration sketch after this list shows the relevant settings.
Idempotent producers prevent duplicates on retry: enabling idempotence assigns per-producer sequence numbers, letting brokers deduplicate retries after transient network failures. It complements quorum acks (the client requires acks=all when idempotence is enabled) and, combined with transactions, underpins exactly-once semantics.
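A minimal sketch of the matching topic-level settings, assuming the Java AdminClient; the topic name, partition count, and broker address are illustrative assumptions:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateDurableTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Replication factor 3 across failure domains; the partition count is illustrative.
            NewTopic topic = new NewTopic("payment-events", 12, (short) 3)
                    .configs(Map.of(
                            // Quorum writes block once the ISR drops below 2, instead of losing data.
                            "min.insync.replicas", "2",
                            // Never promote an out-of-sync replica; wait for an ISR member to recover.
                            "unclean.leader.election.enable", "false"));

            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```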
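And a rough sketch of the ISR health check described above, assuming a recent Java AdminClient (3.1+ for allTopicNames); the topic name and broker address are again placeholders. In practice most teams scrape the broker's UnderReplicatedPartitions metric, but the same check can be expressed through the admin API:

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class IsrHealthCheckSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin.describeTopics(List.of("payment-events"))
                    .allTopicNames().get()
                    .get("payment-events");

            description.partitions().forEach(partition -> {
                int replicas = partition.replicas().size();
                int isr = partition.isr().size();
                if (isr < replicas) {
                    // Under-replicated partition: alert if this persists for more than ~60 seconds.
                    System.out.printf("partition %d under-replicated: ISR %d of %d%n",
                            partition.partition(), isr, replicas);
                }
            });
        }
    }
}
```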
📌 Examples
A financial services platform uses quorum acks with idempotent producers and disables unclean election for payment event topics. They accept 15 to 20 millisecond p99 produce latency in exchange for zero data loss guarantees, even during availability zone failures.
Netflix configures replication factor 3 with quorum acks as default. During a datacenter network partition, ISR shrunk to 1 on some partitions; writes blocked for 90 seconds until network healed and followers rejoined ISR, preventing any data loss.
OpenAI runs multiple primary clusters with leader-only acks to maximize availability during primary failovers. They accept rare duplicates when overlapping source streams are merged and handle deduplication in downstream sinks using idempotency keys.