Local Reconstruction Codes: Reducing Repair Bandwidth
The Repair Bandwidth Problem
Standard Reed-Solomon coding has a painful property: repairing one lost fragment requires reading k surviving fragments. In a 10+4 scheme, rebuilding one dead disk means reading 10x the amount of data being repaired: repairing 1TB requires reading 10TB from other disks.
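A quick sketch of this read amplification. The helper `rs_repair_read_tb` is hypothetical, but the 10+4 figures match the example above:

```python
# Back-of-envelope repair cost for a k+m Reed-Solomon scheme.
# rs_repair_read_tb is an illustrative helper, not a real library call.

def rs_repair_read_tb(k: int, fragment_tb: float) -> float:
    """Rebuilding one fragment reads k surviving fragments of equal size."""
    return k * fragment_tb

# 10+4 scheme with 1 TB fragments: repairing 1 TB reads 10 TB.
print(rs_repair_read_tb(k=10, fragment_tb=1.0))  # 10.0
```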
This creates network bottlenecks during repairs. A busy network slows repairs, extending the window where data is vulnerable to additional failures. Slow repairs reduce effective durability.
How LRC Reduces Repair Cost
LRC groups the data fragments and adds a local parity for each group. A 10+2+2 LRC scheme has 10 data fragments, 2 local parity fragments (each covering a group of 5 data fragments), and 2 global parity fragments. Recovering one data fragment requires reading only 5 fragments (the 4 surviving members of its local group plus the local parity) instead of all 10.
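A minimal sketch of local-group repair, assuming each local parity is the bitwise XOR of its group's data fragments (a common LRC construction; the fragment contents below are made up):

```python
# Local-group repair sketch: a lost fragment in a group of 5 is rebuilt
# from the 4 surviving group members plus the local parity (5 reads total).

def xor_bytes(parts):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(parts[0]))
    for p in parts:
        for i, b in enumerate(p):
            out[i] ^= b
    return bytes(out)

# One local group of 5 data fragments plus its local (XOR) parity.
group = [bytes([g * 17 + i for i in range(4)]) for g in range(5)]
local_parity = xor_bytes(group)

# Fragment 2 is lost; recover it from the 4 survivors + local parity,
# instead of reading 10 fragments under plain Reed-Solomon.
survivors = group[:2] + group[3:]
recovered = xor_bytes(survivors + [local_parity])
assert recovered == group[2]
```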
Repair bandwidth drops by 2x in this example. For larger schemes such as 16+4 with 4 local groups (each local parity covering 4 data fragments), repair bandwidth drops by 4x. Faster repairs mean shorter vulnerability windows and higher effective durability.
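The savings follow directly from the group size. A quick check of the numbers above (function names are illustrative):

```python
def repair_reads_rs(k: int) -> int:
    # Plain Reed-Solomon: read any k surviving fragments.
    return k

def repair_reads_lrc(k: int, groups: int) -> int:
    # LRC: read the (k/groups - 1) surviving group members + 1 local parity.
    return k // groups

print(repair_reads_rs(10), repair_reads_lrc(10, 2))  # 10 5 -> 2x savings
print(repair_reads_rs(16), repair_reads_lrc(16, 4))  # 16 4 -> 4x savings
```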
The Storage Overhead Trade-off
Local parity fragments cost additional storage. A 10+4 Reed-Solomon scheme has 1.4x overhead; a 10+2+2 LRC scheme also has 1.4x overhead (14 total fragments for 10 data), but with different failure tolerance. The LRC scheme tolerates fewer worst-case simultaneous failures (any three arbitrary failures, versus any four for 10+4 RS, though it still survives many four-failure patterns) in exchange for faster single-failure recovery.
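The overhead arithmetic, for both schemes in the text (`storage_overhead` is an illustrative helper):

```python
def storage_overhead(data: int, parity: int) -> float:
    # Total fragments stored per unit of user data.
    return (data + parity) / data

print(storage_overhead(10, 4))      # 1.4  (10+4 Reed-Solomon)
print(storage_overhead(10, 2 + 2))  # 1.4  (10+2+2 LRC: same overhead)
```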
The optimal scheme depends on failure patterns. Single-disk failures are common; multiple simultaneous failures are rare. Optimizing for fast single-failure recovery (LRC) often provides better real-world durability than maximizing simultaneous-failure tolerance (standard RS).
When LRC Makes Sense
Large scale storage systems with thousands of disks see constant single disk failures. A system with 10,000 disks and 2% annual failure rate sees 200 failures per year, roughly one every two days. Fast repair dominates durability calculations.