
Erasure Coding Failure Modes and Operational Challenges

Correlated failures break the independence assumption underlying EC durability models. Rack or AZ outages, firmware bugs, power events, heat spikes, or batch manufacturing defects can take multiple shards in the same stripe offline simultaneously. If you model failures as independent with a 0.5% AFR but a rack failure takes 3 shards of a 10+4 stripe offline at once, only 1 shard of tolerance remains instead of 4. Mitigation requires placing shards across truly independent failure domains and explicitly modeling correlated failure probabilities rather than assuming independent and identically distributed (IID) annual failure rates.

The repair window is the most critical operational parameter. Data loss occurs only when p+1 shards of the same stripe fail before repair completes, so long repair windows caused by limited bandwidth, stragglers, or throttling sharply reduce effective durability. Repair storms after large outages amplify this risk: if a rack failure takes 100 shards offline, all 100 must be repaired within the window before any affected stripe accumulates more than p failures. At 1 Gbps per repair stream, parallelism and aggregate bandwidth become the bottleneck, and systems must prioritize stripes that already have multiple failures to reduce marginal risk.

Silent data corruption from bit rot, media errors, or DMA bugs can poison both data and parity shards. If corrupted data is used to compute parity during writes or to reconstruct during reads, the corruption spreads. Mitigation requires end-to-end checksums per chunk, regular scrubbing that reads, verifies, and repairs against parity, and cryptographic hashes to catch subtle corruption. Backblaze reported that scrubbing every shard quarterly at 17+3 catches and repairs corruption before it accumulates to p+1 shards in a stripe.

Small writes and partial-stripe updates suffer severe RMW amplification. Updating 1 MB of an 8 MB chunk in a 10+4 stripe requires reading 10 data shards (80 MB), recomputing 4 parity shards (32 MB), and writing 1 data plus 4 parity shards (40 MB), inflating 1 MB of user writes into 152 MB of IO, and the wide fanout explodes tail latency. Systems mitigate by buffering writes into full stripes, log-structuring, or keeping hot, write-heavy data on replication and migrating it to EC after it cools.
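To make the repair-window effect concrete, here is a minimal Python sketch (illustrative, not from the source) that estimates the probability of losing a k+p stripe before repair completes. It assumes an exponential per-disk failure model and independent failures among the surviving shards, and uses the 0.5% AFR and 10+4 geometry from the text with a hypothetical 24-hour repair window.

```python
import math

def p_fail_within(afr: float, window_hours: float) -> float:
    """Probability that one shard's disk fails inside the repair window,
    assuming an exponential failure model parameterized by the annualized
    failure rate (AFR)."""
    rate_per_hour = -math.log(1.0 - afr) / (365 * 24)
    return 1.0 - math.exp(-rate_per_hour * window_hours)

def p_stripe_loss(k: int, p: int, shards_down: int,
                  afr: float, window_hours: float) -> float:
    """Probability that a k+p stripe with `shards_down` shards already offline
    loses data before repair completes, i.e. more than (p - shards_down) of
    the surviving shards also fail within the window (assumed independent)."""
    survivors = k + p - shards_down
    budget = p - shards_down  # additional failures the stripe can still absorb
    q = p_fail_within(afr, window_hours)
    return sum(
        math.comb(survivors, extra) * q**extra * (1 - q)**(survivors - extra)
        for extra in range(budget + 1, survivors + 1)
    )

# Healthy 10+4 stripe vs. the same stripe after a rack outage removes 3 shards.
for down in (0, 3):
    print(f"{down} shards down -> P(loss in window) ~ "
          f"{p_stripe_loss(10, 4, down, 0.005, 24):.1e}")
```

With these illustrative numbers, the pre-damaged stripe is many orders of magnitude more likely to be lost during the window, which is why repair schedulers prioritize stripes that already carry multiple failures.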
💡 Key Takeaways
Correlated failures (rack, AZ outages, firmware bugs, power events) can take multiple shards from the same stripe offline simultaneously, violating independence assumptions and sharply reducing effective fault tolerance
Repair window directly controls durability: if p+1 shards fail before repair completes, data is lost; repair storms after large outages create high risk periods requiring prioritization of stripes with multiple failures
Silent data corruption from bit rot or media errors can poison parity; mitigation requires end to end checksums, regular scrubbing (Backblaze scrubs quarterly at 17+3), and cryptographic hashes
Small writes cause Read Modify Write (RMW) amplification: 1 MB update in 10+4 EC with 8 MB chunks generates 152 MB of IO (read 80 MB, compute 32 MB, write 40 MB) and explodes tail latency
Degraded reads and tail latency amplification: a missing or slow shard forces reconstruction from k shards, so read latency becomes the max of k parallel fetches; for k=10, hitting a composite p99 requires roughly per-node p99.9 performance (see the sketch after this list)
Misplacing multiple shards from the same stripe within one failure domain silently undermines the resilience guarantee; mitigation requires placement auditors and anti-affinity enforcement during both writes and repairs
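The quantile arithmetic behind the degraded-read takeaway above, as a short sketch; it assumes shard fetch latencies are independent and identically distributed, which is the simplest model consistent with the text.

```python
# A degraded read must wait for the slowest of k parallel shard fetches,
# so all k fetches have to beat the deadline for the read to beat it.
k = 10
stripe_target = 0.99                 # desired stripe-level p99
per_node = stripe_target ** (1 / k)  # quantile each shard must hit individually
print(f"per-node quantile required: p{per_node * 100:.2f}")          # ~p99.90

# Conversely, per-node p99 alone only delivers ~p90 at the stripe level.
print(f"stripe quantile from per-node p99: p{0.99 ** k * 100:.1f}")  # ~p90.4
```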
📌 Examples
Rack failure scenario: rack outage takes 3 nodes offline; if a 10+4 stripe has 3 shards on those nodes, fault tolerance drops from 4 to 1 shard immediately; another failure before repair causes data loss
Repair storm: 100 shards fail in a rack outage; at 1 Gbps and 10 TB per shard, serial repair takes 100 × 22 hours ≈ 91 days; the system must parallelize across roughly two dozen streams to complete within a 4-day window
Silent corruption: bit flip in data shard D1 goes undetected; D1 used to recompute parity P1 during write, poisoning P1; later reconstruction from D1 or P1 returns corrupted data; checksum per chunk catches this
RMW penalty: database with 4 KB random writes on 10+4 EC with 8 MB chunks; each 4 KB write triggers 80 MB read + 32 MB compute + 40 MB write = 152 MB of IO, a write amplification of roughly 38,900x; solution: buffer to full 8 MB stripes or use replication for the hot tier (see the sketch after these examples)
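The sketch below reproduces the RMW and repair-storm arithmetic from the examples above. It is illustrative only: the naive full-stripe read-modify-write path, the 1 Gbps repair stream, and the 10 TB shard size are the assumptions stated in the text, and recomputed parity bytes are counted toward the total to match the text's 152 MB accounting.

```python
import math

def rmw_io_mb(k: int, p: int, chunk_mb: float, user_write_mb: float) -> dict:
    """IO generated by a naive partial-stripe read-modify-write in k+p EC:
    read all k data shards, recompute all p parity shards, then write back
    the one modified data shard plus all p parity shards."""
    read = k * chunk_mb
    compute = p * chunk_mb          # counted toward the total, as in the text
    write = (1 + p) * chunk_mb
    total = read + compute + write
    return {"read": read, "compute": compute, "write": write,
            "total": total, "amplification": round(total / user_write_mb)}

print(rmw_io_mb(10, 4, chunk_mb=8, user_write_mb=1))         # 152 MB for a 1 MB update
print(rmw_io_mb(10, 4, chunk_mb=8, user_write_mb=4 / 1024))  # ~38,900x for a 4 KB update

# Repair-storm sizing: a 10 TB shard over a 1 Gbps stream takes ~22 hours,
# so 100 lost shards need roughly two dozen parallel streams to finish
# inside a 4-day window.
hours_per_shard = 10e12 * 8 / 1e9 / 3600
streams = math.ceil(100 * hours_per_shard / (4 * 24))
print(f"{hours_per_shard:.1f} h per shard, {streams} parallel streams needed")
```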