Object Storage & Blob Storage › Erasure Coding & Durability · Hard · ⏱️ ~3 min

Erasure Coding Failure Modes and Operational Challenges

Silent Data Corruption

Disks occasionally return wrong data without reporting an error. Cosmic rays flip bits. Firmware bugs corrupt sectors. A study of production systems found roughly 1 in 10^14 bits corrupted silently per year. At petabyte scale, that works out to multiple corruption events annually.
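The arithmetic behind that claim can be checked in a few lines (assuming the quoted rate of 1 silently corrupted bit per 10^14 bits per year):

```python
# Back-of-envelope check of the silent-corruption claim, assuming the
# quoted rate of 1 corrupted bit per 1e14 bits per year.
rate = 1 / 1e14                      # corrupted bits per bit-year
petabyte_bits = 1e15 * 8             # 1 PB = 1e15 bytes = 8e15 bits
events_per_year = petabyte_bits * rate
print(f"~{events_per_year:.0f} silent corruption events per year at 1 PB")  # ~80
```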

Erasure coding detects corruption during decoding: if any fragment is corrupted, the parity relationships between fragments no longer hold. But detection happens only at read time, so data can sit corrupted for months until it is accessed. Some systems add a checksum to each fragment, catching corruption immediately on read without decoding the full stripe.
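A minimal sketch of the per-fragment checksum idea, using SHA-256; the function names and hash choice here are illustrative, not any particular system's API:

```python
import hashlib

def write_fragment(data: bytes) -> tuple[bytes, str]:
    """At write time, store a SHA-256 digest alongside each fragment."""
    return data, hashlib.sha256(data).hexdigest()

def read_fragment(data: bytes, stored_digest: str) -> bytes:
    """Verify the checksum on every read, without decoding the stripe."""
    if hashlib.sha256(data).hexdigest() != stored_digest:
        raise IOError("fragment corrupted: checksum mismatch")
    return data

frag, digest = write_fragment(b"fragment payload")
assert read_fragment(frag, digest) == b"fragment payload"
```

The same digests also let a background scrubber sweep cold data and catch corruption before a read ever hits it.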

Correlated Failures

Durability calculations assume independent failures. Reality disagrees. A rack loses power, taking all disks with it. A firmware bug affects all drives of the same model. A network partition isolates all nodes in one datacenter. If your 14 fragments all live in one rack, one power failure loses everything.

Defense requires placement constraints. Fragments must span multiple failure domains: different racks, different power circuits, different datacenters. The placement system must understand physical topology and enforce the distribution. This adds complexity and sometimes increases latency (cross-rack writes are slower than same-rack writes).
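A greedy placement sketch under an assumed cap on fragments per rack; real placement services also weigh capacity, load, and datacenter-level topology:

```python
from collections import defaultdict

def place_fragments(nodes, num_fragments, max_per_rack):
    """Greedily choose nodes so that no rack holds more than
    max_per_rack fragments. nodes is a list of (node_id, rack_id) pairs."""
    per_rack = defaultdict(int)
    chosen = []
    for node_id, rack_id in nodes:
        if per_rack[rack_id] < max_per_rack:
            chosen.append(node_id)
            per_rack[rack_id] += 1
            if len(chosen) == num_fragments:
                return chosen
    raise RuntimeError("not enough failure domains for the requested spread")

# 14 fragments over 5 racks of 4 nodes, at most 3 fragments per rack:
# losing any single rack costs at most 3 fragments, within a 10+4 scheme's margin.
nodes = [(f"node-{r}-{i}", f"rack-{r}") for r in range(5) for i in range(4)]
placement = place_fragments(nodes, num_fragments=14, max_per_rack=3)
```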

Slow Repair Death Spiral

When a disk fails, the system starts rebuilding lost fragments. Rebuilding consumes network bandwidth and disk IO. If rebuild is slow, more disks fail while rebuild is in progress. Each additional failure adds more rebuild work, slowing everything further. At some point, the system cannot keep up.

The critical metric is Mean Time To Repair (MTTR) versus Mean Time Between Failures (MTBF). If MTTR exceeds the window of tolerable failures, data loss becomes likely. A 10+4 scheme tolerates 4 lost fragments; with a 24-hour repair time, it survives as long as fewer than 4 of a stripe's disks fail per day. At 10,000 disks with a 2% annual failure rate, expect about 0.55 failures per day on average, well within margin. But failure bursts happen.
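A rough Poisson model puts a number on burst risk; note it assumes independent failures, exactly the assumption correlated failures break, so real burst risk is higher:

```python
import math

disks, annual_failure_rate = 10_000, 0.02
lam = disks * annual_failure_rate / 365   # expected failures per day, about 0.55

def poisson_at_least(k, lam):
    """P(X >= k) for a Poisson-distributed daily failure count."""
    return 1 - sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))

# Chance that a single 24-hour repair window sees 4 or more failures,
# enough to threaten a 10+4 stripe in the worst-case placement.
p_burst = poisson_at_least(4, lam)
print(f"{lam:.2f} failures/day; P(4+ in one day) = {p_burst:.4f}")
```

Even this idealized model gives a small but non-negligible daily burst probability; shrinking MTTR shrinks the window in which a burst can accumulate.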

Operational Mistakes

Human error causes more data loss than disk failure: deleting the wrong data, misconfiguring replication, or running out of space and having writes fail silently. Erasure coding protects against hardware failure, not against rm -rf on production.

Defense requires multiple layers: immutable backups, versioning, deletion delays, access controls, and tested restore procedures. Erasure coding is one layer of defense, not complete protection.
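As an illustration of one of those layers, a deletion-delay sketch: deletes only tombstone the object, and data is reclaimed after a grace period, leaving a window to undo mistakes. The class, method names, and 7-day grace period are invented for illustration:

```python
import time

class SoftDeleteStore:
    """Deletion-delay sketch: a delete tombstones the object; the bytes
    are reclaimed only after a grace period, so mistakes can be undone."""
    GRACE_SECONDS = 7 * 24 * 3600      # hypothetical 7-day delay

    def __init__(self):
        self._objects = {}
        self._tombstones = {}          # key -> deletion timestamp

    def put(self, key, value):
        self._objects[key] = value

    def delete(self, key):
        self._tombstones[key] = time.time()   # mark, don't destroy

    def undelete(self, key):
        self._tombstones.pop(key, None)       # cheap recovery inside the window

    def get(self, key):
        if key in self._tombstones:
            raise KeyError(f"{key} is pending deletion")
        return self._objects[key]

    def reap(self, now=None):
        """Permanently drop objects whose grace period has expired."""
        now = time.time() if now is None else now
        for key, deleted_at in list(self._tombstones.items()):
            if now - deleted_at >= self.GRACE_SECONDS:
                del self._objects[key]
                del self._tombstones[key]
```

Versioning and immutable backups play the same role at coarser granularity: they trade storage for a recovery path that survives human error.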

⚠️ Key Trade-off: Erasure coding protects against disk failures but not against correlated failures, slow repairs, or operational mistakes. Defense in depth requires geographic distribution, fast repair, and immutable backups.
💡 Key Takeaways
Silent data corruption (1 in 10^14 bits/year) is detected during decoding but can exist undetected until read
Correlated failures (rack power loss, firmware bugs) defeat independence assumptions; fragments must be placed across failure domains
Slow repair death spiral: if MTTR exceeds failure rate tolerance, rebuild cannot keep up and data loss becomes likely
At 10,000 disks with 2% annual failure rate, expect 0.5 failures/day average with dangerous failure bursts
Operational mistakes (wrong deletion, misconfiguration) cause more data loss than disk failures; erasure coding does not protect against human error
📌 Interview Tips
1. When discussing durability, immediately caveat the eleven-nines claim. That assumes independent failures, fast repairs, and no operational mistakes. Real durability is lower.
2. Explain placement constraints concretely. A 14-fragment stripe should span at least 3 racks and ideally 2 datacenters. The placement service must track physical topology and enforce distribution.
3. Ask about restore testing. Backups that have never been restored are not backups. Suggest quarterly restore drills to verify the recovery process actually works.