
Compaction, Tombstones, and Read Path Performance

LSM-tree storage generates multiple sorted files over time as memtables flush to disk. Background compaction merges these files, purges deleted data (tombstones), and reclaims space. Without compaction, reads would traverse hundreds of files, pushing latency into seconds. Two main strategies exist: size-tiered compaction creates multiple overlapping files per level (lower write amplification, better for write-heavy workloads), while leveled compaction maintains non-overlapping sorted runs (lower read amplification, better for read-heavy point queries). Write amplification is the ratio of bytes written to disk versus logical bytes written (commonly 5x to 20x depending on strategy); read amplification is the number of files checked per query (1 to 2 for leveled, 5 to 10 for size-tiered).

Tombstones mark deleted data but persist until compaction merges files and determines no older versions exist. Frequent deletes or Time To Live (TTL) expiry generate tombstone storms in which scans traverse millions of tombstones, causing p99 read latency to spike 10x to 100x. A query scanning 1,000 rows might read 50K tombstones along the way if deletions are heavy. Tombstone thresholds (commonly 100K per query) trigger warnings, and large scans can time out entirely.

Compaction stalls occur when write throughput exceeds compaction capacity. Pending compactions accumulate, read amplification increases (more unmerged files to check), and eventually the system applies backpressure, refusing new writes. Netflix provisions disk Input/Output Operations Per Second (IOPS) and keeps disk occupancy under 60 to 70 percent to maintain compaction headroom. Compaction continuously consumes 20 to 50 percent of cluster resources (CPU, disk, network), so misconfigured compaction or undersized hardware causes production incidents, with elevated latencies and timeouts cascading across dependent services.

Operational best practices include monitoring pending compaction tasks, read and write amplification metrics, tombstone scan counts per query, and per-file read counts. Limit TTL churn, design for upserts over deletes where possible, and schedule major compactions during off-peak hours. Provision sufficient IOPS (commonly 10K to 50K per node on Solid State Drives) so compaction does not starve foreground queries.
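To make the tombstone overhead concrete, here is a minimal, purely illustrative Python simulation (not Cassandra's or HBase's actual read path): a scan still has to read and discard every tombstone lying between the live rows it returns, so a roughly 98 percent tombstone ratio turns a 1,000-row query into about 50K reads.

```python
# Illustrative simulation only (not real Cassandra/HBase code): a range scan
# over a merged row stream must read every tombstone to return N live rows,
# so heavy deletes multiply the work per query.
import random

def scan_live_rows(rows, limit):
    """Return up to `limit` live rows; count tombstones touched along the way."""
    live, tombstones_read = [], 0
    for key, value, is_tombstone in rows:
        if is_tombstone:
            tombstones_read += 1           # skipped in the result, but still read
        else:
            live.append((key, value))
            if len(live) >= limit:
                break
    return live, tombstones_read

# Build a stream where ~98% of entries are tombstones (e.g. after mass TTL expiry).
stream = [(i, f"v{i}", random.random() < 0.98) for i in range(200_000)]
live, dead = scan_live_rows(stream, limit=1000)
print(f"returned {len(live)} live rows, read {dead} tombstones to get them")
```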
💡 Key Takeaways
Size-tiered compaction creates overlapping files with 5x to 10x write amplification and 5 to 10 file reads per query; best for write-heavy workloads sustaining 100K+ writes/sec per node
Leveled compaction maintains non-overlapping sorted runs with 10x to 20x write amplification but only 1 to 2 file reads per query; best for read-heavy point lookups needing sub-5 ms p99
Tombstones from deletes or TTL persist until compaction merges files; scanning 1,000 rows can read 50K tombstones, causing p99 latency spikes from 10 ms to 500+ ms
Compaction stalls happen when the write rate exceeds merge capacity; pending compactions accumulate, read amplification increases, and eventually the system refuses writes to prevent resource exhaustion
Netflix keeps disk occupancy under 60 to 70 percent and provisions 20 to 50 percent extra IOPS headroom so background compaction merges do not starve foreground query reads
Operational metrics include pending compaction count (alert above 5 to 10 tasks), read amplification (files per query), tombstone ratio per scan (warn above 10K), and compaction bandwidth (MB/s); a monitoring sketch follows this list
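A hedged sketch of how those alert thresholds might be wired together. The metric names and the example values are hypothetical placeholders; in a real deployment the numbers would come from nodetool/JMX on Cassandra or the HBase metrics endpoint.

```python
# Hypothetical alerting sketch: metric names and values are placeholders, not a
# real Cassandra/HBase API. Thresholds mirror the rules of thumb listed above.
THRESHOLDS = {
    "pending_compactions": 10,        # alert above 5-10 queued tasks
    "files_per_read_p99": 4,          # read amplification: files touched per query
    "tombstones_per_scan_p99": 10_000,
    "disk_occupancy_pct": 70,         # keep headroom for compaction output
}

def check_node(metrics: dict) -> list[str]:
    """Return human-readable alerts for one node's metric snapshot."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name, 0)
        if value > limit:
            alerts.append(f"{name}={value} exceeds {limit}")
    return alerts

# Example node snapshot (fabricated values for illustration only).
node = {"pending_compactions": 23, "files_per_read_p99": 9,
        "tombstones_per_scan_p99": 180_000, "disk_occupancy_pct": 83}
for alert in check_node(node):
    print("ALERT:", alert)
```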
📌 Examples
Tombstone storm at a time-series metrics platform: a Time To Live (TTL) of 7 days caused 100M metrics to expire simultaneously at midnight. The compaction backlog grew to 500 tasks, scans traversed 200K tombstones each, p99 query latency jumped from 20 ms to 5 seconds, and timeouts cascaded. Solution: stagger TTL across a 24-hour window and run incremental compactions hourly.
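One way the staggered-TTL fix could look, as a hedged sketch: instead of giving every write the same 7-day TTL (so rows expire, and tombstone, all at once), add up to 24 hours of random jitter per write. The function and constant names are illustrative, not taken from the incident.

```python
# Hypothetical sketch of the staggered-TTL fix: add random jitter so expiry is
# spread over a 24-hour window rather than landing at a single midnight.
import random

BASE_TTL_SECONDS = 7 * 24 * 3600      # 7-day retention
JITTER_SECONDS = 24 * 3600            # spread expiry across a full day

def jittered_ttl() -> int:
    return BASE_TTL_SECONDS + random.randint(0, JITTER_SECONDS)

# With a driver, the per-row TTL would be applied on each insert, e.g. in CQL:
#   INSERT INTO metrics (id, ts, value) VALUES (?, ?, ?) USING TTL <jittered_ttl()>
for _ in range(3):
    print("ttl for this write:", jittered_ttl())
```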
Compaction stall during a flash sale: an e-commerce write rate spiked from 50K to 500K writes/sec. Size-tiered compaction could not keep pace; pending tasks grew from 2 to 30, and read amplification increased from 5 to 15 files per query. The cluster reached 85 percent disk occupancy and applied backpressure, refusing writes. Solution: add nodes to distribute load and temporarily disable low-priority background jobs.
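Back-of-the-envelope arithmetic for why the stall happens (the per-row size, write amplification, and disk bandwidth below are illustrative assumptions, not figures from the incident): compaction must rewrite each ingested byte several times, so required background bandwidth is roughly ingest rate times write amplification.

```python
# Capacity check with illustrative numbers: compaction rewrites each logical
# byte ~write_amp times, so it needs ingest_rate * write_amp of disk bandwidth.
# If that exceeds what is left after foreground traffic, the backlog grows.
def compaction_keeps_up(ingest_mb_s: float, write_amp: float,
                        disk_mb_s: float, foreground_share: float = 0.5) -> bool:
    required = ingest_mb_s * write_amp                 # MB/s compaction must sustain
    available = disk_mb_s * (1.0 - foreground_share)   # MB/s left for background merges
    print(f"need {required:.0f} MB/s for compaction, have {available:.0f} MB/s")
    return required <= available

# Normal load: 50K writes/sec at ~1 KB per row, roughly 50 MB/s of ingest.
compaction_keeps_up(ingest_mb_s=50, write_amp=8, disk_mb_s=1000)    # keeps up
# Flash sale: 500K writes/sec, roughly 500 MB/s of ingest -> backlog grows.
compaction_keeps_up(ingest_mb_s=500, write_amp=8, disk_mb_s=1000)   # falls behind
```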
Read path with leveled compaction: a point query checks 1 memtable (100 microseconds), 1 Level 0 file (bloom filter false positive, 2 ms wasted disk read), 1 Level 1 file (bloom filter negative, skipped), and 1 Level 2 file (index lookup 100 microseconds, block cache hit, ~0 ms). Total latency 2.2 ms. With size-tiered compaction and 8 overlapping files, the same query needs 8 bloom filter checks and 3 disk reads, totaling 8 to 12 ms.
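The same arithmetic as a small script. The leveled-compaction numbers are copied from the walkthrough; the per-check and per-read costs in the size-tiered case are assumptions chosen to land inside the stated 8 to 12 ms range.

```python
# Worked version of the read-path arithmetic above (all values in milliseconds).
leveled = {
    "memtable lookup":                            0.1,  # 100 microseconds
    "L0 file (bloom false positive, disk read)":  2.0,
    "L1 file (bloom negative, skipped)":          0.0,
    "L2 file (index lookup, block cache hit)":    0.1,
}
print("leveled total:", round(sum(leveled.values()), 1), "ms")     # 2.2 ms

# Size-tiered with 8 overlapping files: 8 bloom checks plus ~3 actual disk reads.
bloom_check_ms, disk_read_ms = 0.05, 3.0   # assumed costs, chosen for illustration
size_tiered_total = 8 * bloom_check_ms + 3 * disk_read_ms
print("size-tiered total: ~", size_tiered_total, "ms")             # ~9.4 ms, within 8-12 ms
```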