Compaction, Tombstones, and Read Path Performance
Compaction Strategies
LSM-trees generate many SSTables as memtables flush to disk. Without compaction (the background process that merges files), reads would have to traverse hundreds of files. Two strategies dominate:
Size-tiered: Groups similarly sized SSTables into tiers, so key ranges overlap within each tier. Lower write amplification (5-10x bytes written per logical write) but higher read amplification (5-10 files checked per query). Best for write-heavy workloads.
Leveled: Maintains non-overlapping sorted runs within each level. Higher write amplification (10-20x) but only 1-2 files checked per query. Best for read-heavy, point-lookup workloads.
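As a sketch of how a size-tiered trigger works, the following (hypothetical function and threshold names; real engines differ) buckets SSTables whose sizes lie within a factor of two of each other and flags any bucket that accumulates enough files for one merge:

```python
# Minimal size-tiered sketch: group SSTables into similar-size buckets
# and compact any bucket that reaches min_threshold files.

def bucket_by_size(sstable_sizes, min_threshold=4):
    """Return lists of SSTable sizes, each eligible for one compaction."""
    buckets = []
    for size in sorted(sstable_sizes):
        for b in buckets:
            avg = sum(b) / len(b)
            if avg / 2 <= size <= avg * 2:   # "similar size": within 2x
                b.append(size)
                break
        else:
            buckets.append([size])
    return [b for b in buckets if len(b) >= min_threshold]

# Four ~100MB flushes form one compaction candidate; the 1000MB file waits.
print(bucket_by_size([100, 110, 95, 105, 1000]))  # [[95, 100, 105, 110]]
```

Each merge rewrites all files in a bucket, which is why bytes written stay low (each byte is rewritten only when its tier fills) at the cost of overlapping files on the read path.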
Tombstone Management
Deletes do not remove data immediately. Instead, a tombstone (deletion marker) is written, and it persists until compaction merges the relevant files and confirms no older versions remain. This can cause tombstone storms: range scans must traverse millions of tombstones before reaching live data. A 1,000-row scan might read 50K tombstones, spiking p99 latency from 10ms to 500ms+.
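The cost asymmetry can be illustrated with a toy merged-iterator scan (hypothetical structure, not any engine's API): the scan pays for every entry it reads, including tombstones it immediately discards.

```python
# Toy range scan over merged entries in key order.
# value of None models a tombstone: skipped, but still read.

def scan(entries, limit):
    """Return up to `limit` live rows plus the tombstone count traversed."""
    live, tombstones_read = [], 0
    for key, value in entries:
        if value is None:
            tombstones_read += 1      # discarded, but I/O and CPU already spent
        else:
            live.append((key, value))
            if len(live) == limit:
                break
    return live, tombstones_read

# 50,000 tombstones precede the first live row: a "1-row" scan touches 50,001 entries.
data = [(i, None) for i in range(50_000)] + [(50_000, "row")]
rows, seen = scan(data, limit=1)
print(len(rows), seen)  # 1 50000
```

The scan's latency tracks entries traversed, not rows returned, which is exactly why tombstone-heavy ranges blow up p99.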
TTL (time-to-live) expirations generate tombstones too. If millions of records expire simultaneously (e.g., at a midnight rollover), tombstone storms cascade. Stagger TTLs across hours and run incremental compactions frequently.
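Staggering can be as simple as adding random jitter to the nominal TTL at write time. A minimal sketch, with assumed parameter values (a 24h base TTL spread over a 4h window):

```python
import random

BASE_TTL_SECONDS = 24 * 3600        # assumed nominal 24h TTL
JITTER_WINDOW_SECONDS = 4 * 3600    # spread expirations over a 4h window

def jittered_ttl():
    """TTL to attach at write time, so expirations don't all land at once."""
    return BASE_TTL_SECONDS + random.randint(0, JITTER_WINDOW_SECONDS)
```

Records written together now expire, and generate tombstones, spread across the window rather than in one burst, giving incremental compaction time to absorb them.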
Compaction Stalls
When write throughput exceeds compaction capacity, pending compaction tasks accumulate. Read amplification grows (more unmerged files to check), and eventually the system applies backpressure, refusing new writes to prevent disk exhaustion. Keep disk occupancy under 60-70% and provision 10K-50K IOPS per node on SSDs.
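A backpressure check like the one described can be sketched as a simple admission gate; the function name and thresholds here are illustrative assumptions, with the disk limit taken from the guidance above:

```python
# Hypothetical write-admission gate: refuse writes when disk occupancy
# or compaction backlog crosses a threshold.

def admit_write(disk_used_fraction, pending_compactions,
                max_disk=0.70, max_pending=20):
    """Return True if a new write should be accepted."""
    if disk_used_fraction >= max_disk:
        return False                  # keep occupancy under 60-70%
    if pending_compactions >= max_pending:
        return False                  # backlog: let compaction catch up
    return True

print(admit_write(0.55, 3))    # True: healthy
print(admit_write(0.75, 3))    # False: disk too full
print(admit_write(0.55, 40))   # False: compaction debt too high
```

Refusing writes early is deliberate: compaction needs free disk headroom to rewrite files, so accepting writes past the threshold only deepens the debt it is trying to pay down.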