
Choosing Erasure Coding Schemes: k, p, and Stripe Geometry

Selecting k (data shards) and p (parity shards) balances storage overhead, fault tolerance, latency, and repair cost. Storage overhead is p/k: 4+2 has 50% overhead, 10+4 has 40%, and 17+3 has 17.6%. Larger k lowers overhead but worsens read fanout during degraded mode and increases read-modify-write (RMW) cost for partial writes; smaller k improves latency and simplifies operations at the cost of higher overhead. Common production patterns are 4+2, 6+3, 8+4, 10+4, and 17+3, chosen based on workload characteristics.

Fault tolerance is p: you can lose up to p shards without data loss, and higher p increases durability but also overhead. The key insight is that durability depends not just on p but on the repair window and failure rate. A 6+3 scheme with fast repairs (2 hour window) can achieve higher durability than 10+4 with slow repairs (48 hour window), because the probability of p+1 failures within 2 hours is far lower than within 48 hours, even though 10+4 has higher nominal fault tolerance.

Chunk size (the size of each shard, typically 1 to 8 MB) trades metadata overhead against parallelism. Larger chunks reduce metadata (fewer stripes to track) and syscall overhead but decrease parallel IO opportunities and waste space for small objects; smaller chunks improve failure isolation (one bad sector corrupts less data) and pipeline depth but increase metadata cost. Most production systems use 4 to 8 MB chunks. Stripe alignment matters: objects much smaller than k × chunk size suffer from partial stripes with wasted padding.

Practical decision rules:
Use 3x replication for the hot tier: small objects (under 1 MB), low latency requirements (p99 under 10ms), and write heavy workloads.
Use 6+3 or 8+4 EC for the warm tier: medium objects (1 to 100 MB), moderate latency tolerance (p99 under 50ms), and read heavy or balanced workloads.
Use 10+4, 12+4, or 17+3 EC for the cold tier: large objects (over 100 MB), high latency tolerance (p99 over 100ms), and append only archives.
Keep k under 10 for latency sensitive reads to limit tail amplification; larger k is acceptable for cold archives.
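The arithmetic behind these tradeoffs is easy to sanity-check. Below is a rough back-of-envelope sketch (not part of the original material): it computes the p/k overhead for each scheme and a crude annual stripe-loss estimate that assumes independent shard failures at a hypothetical 2% annualized failure rate per shard, counting a stripe as lost only when more than p shards fail inside one repair window. Real durability models also account for correlated failures, latent sector errors, and rebuild load.

```python
from math import comb

def storage_overhead(k, p):
    """Parity overhead as a fraction of stored data: p/k."""
    return p / k

def annual_stripe_loss_prob(k, p, repair_hours, afr=0.02):
    """Crude durability estimate under independent shard failures.

    A stripe is lost if more than p of its k+p shards fail within one
    repair window. afr is an assumed 2% annualized failure rate per shard.
    """
    n = k + p
    q = afr * repair_hours / 8760  # per-shard failure probability within one window
    # Binomial tail: probability of p+1 or more failures in a single window
    loss_per_window = sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(p + 1, n + 1))
    windows_per_year = 8760 / repair_hours
    return loss_per_window * windows_per_year

for k, p, repair in [(4, 2, 24), (6, 3, 2), (8, 4, 24), (10, 4, 48), (17, 3, 24)]:
    print(f"{k}+{p}: overhead {storage_overhead(k, p):.1%}, "
          f"~{annual_stripe_loss_prob(k, p, repair):.1e} stripe-loss probability/yr "
          f"with {repair}h repairs")
```

With these assumed numbers, 6+3 repaired within 2 hours comes out more than an order of magnitude more durable per year than 10+4 repaired within 48 hours, which is the repair-window effect described above.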
💡 Key Takeaways
Storage overhead is p/k: 4+2 has 50%, 10+4 has 40%, 17+3 has 17.6%; larger k reduces overhead but increases read fanout latency and RMW costs for small writes
Fault tolerance p controls how many shards can fail; but effective durability depends on repair window: 6+3 with 2 hour repairs beats 10+4 with 48 hour repairs due to lower p+1 failure probability
Common production schemes: 4+2 and 6+3 for warm tier, 8+4 and 10+4 for balanced warm/cold, 12+4 and 17+3 for cold archives with large objects
Chunk size (1 to 8 MB typical) trades metadata overhead against parallelism; larger chunks reduce metadata but decrease parallel IO; most systems use 4 to 8 MB chunks
Decision rule: 3x replication for hot tier, small objects (under 1 MB), p99 under 10ms; 6+3 or 8+4 EC for warm tier, 1 to 100 MB objects, p99 under 50ms; 10+4 or 17+3 for cold tier, over 100 MB, p99 over 100ms
Keep k under 10 for latency sensitive reads to limit max-of-k tail amplification (a degraded read waits on the slowest of k shard fetches; see the sketch below); larger k is acceptable for cold archives where throughput matters more than latency
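To see why large k hurts latency sensitive reads, here is a toy Monte Carlo sketch of max-of-k tail amplification. The per-shard latency model (roughly 5 ms typical with a 1% chance of a 50 ms stall) is invented purely for illustration, not taken from the text.

```python
import random

def p99_of_max_of_k(k, trials=100_000, mean_ms=5.0, stall_ms=50.0, stall_frac=0.01):
    """Simulate a read that must wait for the slowest of k shard fetches.

    Per-shard latency is a hypothetical mixture: exponential with mean
    mean_ms, plus a stall_ms penalty with probability stall_frac.
    """
    def shard_latency():
        base = random.expovariate(1.0 / mean_ms)
        return base + (stall_ms if random.random() < stall_frac else 0.0)

    samples = sorted(max(shard_latency() for _ in range(k)) for _ in range(trials))
    return samples[int(0.99 * trials)]

for k in (1, 6, 10, 17):
    print(f"k={k:2d}: p99 of slowest-of-{k} fetch ≈ {p99_of_max_of_k(k):.1f} ms")
```

Because the chance that at least one of the k fetches hits the slow path grows as 1 − 0.99^k, the p99 of the slowest-of-k read climbs quickly with k, which is why small k is preferred when reads are latency sensitive.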
📌 Examples
Hot tier example: 3x replication for 10 KB to 1 MB objects with p99 under 5ms for user facing reads; EC RMW would inflate latency to 50+ ms
Warm tier example: 8+4 EC for 10 MB objects with p99 under 30ms; storage overhead 50% versus 200% for replication; acceptable degraded read latency
Cold tier example: 17+3 EC for 1 GB video files with p99 under 200ms; storage overhead 17.6%; reads are sequential and latency insensitive; optimizing for $/TB
Chunk size tradeoff: 1 MB chunks split a 10 MB object into 10 chunks, inflating metadata; 8 MB chunks yield 1.25 chunks, so the partial last chunk wastes padding; 4 MB is a balanced choice (see the geometry sketch below)
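A small geometry sketch makes the chunk size tradeoff concrete. It assumes a hypothetical full-stripe layout with k = 6 data chunks per stripe and treats the unused remainder of the final stripe as padding; real systems may instead store a short last stripe, and the specific numbers are illustrative only.

```python
from math import ceil

MB = 1024 * 1024

def stripe_geometry(object_bytes, k, chunk_bytes):
    """Chunk and stripe counts for one object under a full-stripe layout
    where each stripe carries k data chunks of chunk_bytes (parity excluded)."""
    data_chunks = ceil(object_bytes / chunk_bytes)      # metadata scales with this
    stripes = ceil(data_chunks / k)
    padding = stripes * k * chunk_bytes - object_bytes  # wasted space if padded out
    return data_chunks, stripes, padding

for chunk_mb in (1, 4, 8):
    chunks, stripes, padding = stripe_geometry(10 * MB, k=6, chunk_bytes=chunk_mb * MB)
    print(f"{chunk_mb} MB chunks, k=6: {chunks} data chunks, {stripes} stripe(s), "
          f"{padding / MB:.0f} MB padding if the last stripe is padded")
```

Smaller chunks mean more chunks (and more metadata) per object but less padding waste; larger chunks mean fewer chunks to track but a mostly empty final stripe for a 10 MB object, which is the tradeoff the example above describes.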