Erasure Coding Read and Write Path: Performance Trade-offs
The write path for EC differs significantly based on write size. Full stripe writes are optimal: compute all p parity shards from the k data shards and write all n shards in parallel, achieving the best throughput. For partial updates affecting fewer than k data shards, you must perform a Read Modify Write (RMW): read the stripe's k data shards, recompute the parity shards, then write back the modified data plus the new parity. This RMW inflates IO by 2x to 3x and increases tail latency dramatically. Systems mitigate this by buffering small writes and coalescing them into full stripes, or by using log structured approaches.
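A rough sketch of that IO accounting, assuming the read-all-k-data-shards RMW strategy described above; the function names and the 6+3, 8 MB parameters are illustrative, not tied to any particular system.

```python
# Back-of-envelope IO accounting for EC writes: full stripe write vs.
# read-modify-write partial update. Assumes the RMW reads all k data
# shards, recomputes parity in memory, and writes back the dirty data
# shards plus all p parity shards.

def full_stripe_write_io(k: int, p: int, shard_bytes: int) -> dict:
    """Full stripe write: compute p parity shards, write all n = k + p shards."""
    return {"read": 0, "write": (k + p) * shard_bytes}

def rmw_partial_write_io(k: int, p: int, shard_bytes: int, dirty_data_shards: int) -> dict:
    """Partial update: read k data shards, recompute parity (CPU only),
    write back the modified data shards plus all p parity shards."""
    return {
        "read": k * shard_bytes,
        "write": (dirty_data_shards + p) * shard_bytes,
    }

if __name__ == "__main__":
    k, p, shard = 6, 3, 8 * 2**20            # 6+3 stripe, 8 MB shards
    full = full_stripe_write_io(k, p, shard)
    rmw = rmw_partial_write_io(k, p, shard, dirty_data_shards=1)
    payload = 1 * 2**20                       # a 1 MB application write
    amp = (rmw["read"] + rmw["write"]) / payload
    print(f"full stripe: write {full['write'] / 2**20:.0f} MB")
    print(f"RMW:         read {rmw['read'] / 2**20:.0f} MB + "
          f"write {rmw['write'] / 2**20:.0f} MB "
          f"({amp:.0f}x the 1 MB payload)")
```

With these parameters the sketch prints 48 MB of reads plus 32 MB of writes to persist a 1 MB update, the same accounting as the RMW example later in this section.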
The read path has two modes. Healthy reads with systematic MDS codes like Reed Solomon read the data shards directly without decoding, matching replication performance. Degraded reads, triggered when shards are missing or slow, must read any k available shards and reconstruct the missing data by decoding. This creates a max of k fanout problem: the read completes only when the slowest of the k fetches returns, so composite latency behaves like the maximum of k independent samples of per node latency. To hit a composite p99 target, each node must meet that target at roughly log10(k) additional nines of its own latency distribution. For k=10, you need per node p99.9 to achieve composite p99.
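A minimal sketch of the max of k percentile math, assuming independent and identically distributed per node latencies; the helper name is illustrative.

```python
# Minimal sketch: which per-node percentile must stay under the latency
# budget so that the composite (the max of k parallel shard reads) still
# meets a target percentile. Assumes independent, identically distributed
# per-node latencies.

def required_per_node_quantile(composite_quantile: float, k: int) -> float:
    """If max-of-k must be under the budget with probability
    composite_quantile, each of the k independent reads must be under it
    with probability composite_quantile ** (1 / k)."""
    return composite_quantile ** (1.0 / k)

if __name__ == "__main__":
    for k in (4, 10, 16):
        q = required_per_node_quantile(0.99, k)
        print(f"k={k:2d}: composite p99 needs per-node ~p{q * 100:.2f}")
    # k=10 prints ~p99.90, i.e. roughly one extra nine per node
```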
Repair bandwidth directly impacts durability. Replacing 10 TB of data at 1 Gigabit per second (Gbps), about 125 Megabytes per second (MB/s), takes 22 to 24 hours of sustained transfer. At 10 Gbps, this drops to 2 to 2.5 hours. Longer repair windows rapidly increase the probability that p+1 overlapping failures accumulate before repair completes, sharply reducing durability. To maintain 11 nines durability for 100 Petabytes (PB) with 0.5% AFR, you need enough bandwidth to rebuild approximately 0.5 PB per day in steady state, plus surge capacity for failure bursts.
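A quick conversion sketch for sizing repair windows, assuming decimal units (1 TB = 10^12 bytes, 1 Gbps = 10^9 bits per second) and a fully utilized link; the function and its parameters are illustrative.

```python
# How long a rebuild takes at a given sustained network rate, plus the
# aggregate rate implied by a daily rebuild budget. Uses decimal units.

def rebuild_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Hours to move data_tb terabytes at link_gbps gigabits/s of usable bandwidth."""
    bytes_total = data_tb * 1e12
    bytes_per_s = link_gbps * 1e9 / 8 * efficiency
    return bytes_total / bytes_per_s / 3600

if __name__ == "__main__":
    print(f"10 TB @  1 Gbps: {rebuild_hours(10, 1):.1f} h")    # ~22 h
    print(f"10 TB @ 10 Gbps: {rebuild_hours(10, 10):.1f} h")   # ~2.2 h
    # Aggregate budget from the text: 0.5 PB/day of rebuild traffic
    print(f"0.5 PB/day ≈ {0.5e15 / 86400 / 1e9:.1f} GB/s aggregate")
```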
Shard placement across failure domains is critical. Place the shards of a stripe in distinct failure domains: different disks, controllers, nodes, racks, or Availability Zones (AZs). Cross AZ or cross region placement improves durability against correlated failures like rack outages or power events, but increases read latency and egress costs. Many production systems place at least one shard per AZ across three or more AZs. The tradeoff is durability and blast radius isolation versus latency and cross domain bandwidth costs.
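A placement sketch, assuming a simple round robin policy across hypothetical AZ names; it spreads the n shards of one stripe and checks how many whole AZ losses the stripe survives.

```python
# Spread the n = k + p shards of one stripe round-robin across AZs, then
# check how many whole-AZ failures the stripe tolerates given p parity
# shards. AZ names and the round-robin policy are illustrative only.

from collections import Counter

def place_round_robin(n_shards: int, zones: list[str]) -> list[str]:
    """Assign shard i to zone i mod len(zones)."""
    return [zones[i % len(zones)] for i in range(n_shards)]

def az_losses_survivable(placement: list[str], parity: int) -> int:
    """Worst case: losing the most-loaded AZs first, how many whole AZs
    can fail before more than `parity` shards are gone?"""
    per_az = sorted(Counter(placement).values(), reverse=True)
    lost, azs_down = 0, 0
    for shards_in_az in per_az:
        if lost + shards_in_az > parity:
            break
        lost += shards_in_az
        azs_down += 1
    return azs_down

if __name__ == "__main__":
    k, p = 8, 4
    placement = place_round_robin(k + p, ["az-a", "az-b", "az-c", "az-d"])
    print(Counter(placement))                          # 3 shards per AZ
    print("whole-AZ failures survivable:",
          az_losses_survivable(placement, parity=p))   # 1
```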
💡 Key Takeaways
•Full stripe writes compute p parity shards and write all n shards in parallel for best throughput; partial updates require Read Modify Write (RMW): reading k data shards, recomputing parity, and writing back the modified data plus new parity
•Healthy reads with systematic codes access data shards directly; degraded reads must fan out to k shards and decode, amplifying tail latency from per node p99 to approximately the max of k samples
•For k=10 reconstruction, achieving composite p99 latency requires each node to meet roughly p99.9 tail performance due to max of k fanout amplification
•Repair bandwidth directly controls durability: 10 TB rebuild at 1 Gbps takes 22 to 24 hours versus 2 to 2.5 hours at 10 Gbps; longer windows exponentially increase p+1 failure risk
•Shard placement across failure domains (disks, nodes, racks, AZs) improves durability against correlated failures but increases cross domain latency and egress bandwidth costs
•Production systems place at least one shard per AZ across three or more AZs; wider placement trades latency and cost for better blast radius isolation
📌 Examples
RMW example: updating 1 MB in a 6+3 stripe with 8 MB shards requires reading all 6 data shards (48 MB), recomputing the 3 parity shards in memory, and writing back 1 data + 3 parity shards (32 MB), about 80 MB of IO to persist a 1 MB update
Tail latency: in a 10+4 EC scheme where each node has a 50ms p99, a degraded read fanning out to 10 nodes sees a composite p99 closer to 80 to 100ms because it must wait for the slowest response (see the simulation sketch after these examples)
Repair bandwidth budget: maintaining 11 nines for 100 PB with 0.5% AFR requires rebuilding approximately 0.5 PB per day or 6 GB/s aggregate repair bandwidth across the cluster
Cross AZ placement: placing 8+4 shards across 4 AZs (3 shards per AZ) tolerates losing one full AZ (3 shards lost, 1 shard of tolerance remaining) but incurs cross AZ latency of 1 to 5ms and egress costs
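A Monte Carlo sketch of the 10+4 tail latency example above, assuming a lognormal per node latency with a 10ms median tuned so the per node p99 is about 50ms; the distribution and its parameters are assumptions for illustration, not measurements.

```python
# Simulate a degraded read that waits for the slowest of k shard fetches.
# Per-node latency is modeled as lognormal (assumption) with median 10 ms
# and sigma chosen so the per-node p99 is ~50 ms.

import math
import random

def composite_p99(k: int, trials: int = 200_000, seed: int = 1) -> float:
    rng = random.Random(seed)
    median_ms = 10.0
    sigma = math.log(50.0 / median_ms) / 2.326    # force per-node p99 ≈ 50 ms
    maxima = []
    for _ in range(trials):
        # the degraded read completes only when the slowest of k fetches returns
        maxima.append(max(median_ms * math.exp(rng.gauss(0.0, sigma))
                          for _ in range(k)))
    maxima.sort()
    return maxima[int(0.99 * trials)]

if __name__ == "__main__":
    print(f"composite p99 for k=10: {composite_p99(10):.0f} ms")  # roughly 80-100 ms
```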