
Compaction, Tombstones, and Read Path Performance

LSM-tree storage generates multiple sorted files over time as memtables flush to disk. Background compaction merges these files, purges deleted data (tombstones), and reclaims space. Without compaction, reads would traverse hundreds of files, pushing latency into seconds. Two main strategies exist: size-tiered compaction creates multiple overlapping files per level (lower write amplification, better for write-heavy workloads), while leveled compaction maintains non-overlapping sorted runs (lower read amplification, better for read-heavy point queries). Write amplification is the ratio of bytes written to disk versus logical bytes written (commonly 5x to 20x depending on strategy); read amplification is the number of files checked per query (1 to 2 for leveled, 5 to 10 for size-tiered).

Tombstones mark deleted data but persist until compaction merges files and determines no older versions exist. Frequent deletes or Time To Live (TTL) expiry generate tombstone storms in which scans traverse millions of tombstones, causing p99 read latency to spike 10x to 100x. A query scanning 1,000 rows might read 50K tombstones along the way if deletions are heavy. Tombstone thresholds (commonly 100K per query) trigger warnings, and large scans can time out entirely.

Compaction stalls occur when write throughput exceeds compaction capacity. Pending compactions accumulate, read amplification increases (more unmerged files to check), and eventually the system applies backpressure, refusing new writes. Netflix provisions disk Input/Output Operations Per Second (IOPS) and keeps disk occupancy under 60 to 70 percent to maintain compaction headroom. Compaction continuously consumes 20 to 50 percent of cluster resources (CPU, disk, network), so misconfigured compaction or undersized hardware causes production incidents, with elevated latencies and timeouts cascading across dependent services.

Operational best practices include monitoring pending compaction tasks, read and write amplification metrics, tombstone scan counts per query, and per-file read counts. Limit TTL churn, design for upserts over deletes where possible, and schedule major compactions during off-peak hours. Provision sufficient IOPS (commonly 10K to 50K per node on Solid State Drives) so compaction does not starve foreground queries.
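To make the tombstone overhead concrete, here is a minimal, purely illustrative Python simulation (not Cassandra's or HBase's actual read path): a scan still has to read and discard every tombstone lying between the live rows it returns, so a roughly 98 percent tombstone ratio turns a 1,000-row query into about 50K reads.

```python
# Illustrative simulation only (not real Cassandra/HBase code): a range scan
# over a merged row stream must read every tombstone to return N live rows,
# so heavy deletes multiply the work per query.
import random

def scan_live_rows(rows, limit):
    """Return up to `limit` live rows; count tombstones touched along the way."""
    live, tombstones_read = [], 0
    for key, value, is_tombstone in rows:
        if is_tombstone:
            tombstones_read += 1           # skipped in the result, but still read
        else:
            live.append((key, value))
            if len(live) >= limit:
                break
    return live, tombstones_read

# Build a stream where ~98% of entries are tombstones (e.g. after mass TTL expiry).
stream = [(i, f"v{i}", random.random() < 0.98) for i in range(200_000)]
live, dead = scan_live_rows(stream, limit=1000)
print(f"returned {len(live)} live rows, read {dead} tombstones to get them")
```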
💡 Key Takeaways
Size-tiered compaction creates overlapping files with 5x to 10x write amplification and 5 to 10 file reads per query; best for write-heavy workloads sustaining 100K+ writes/sec per node
Leveled compaction maintains non-overlapping sorted runs with 10x to 20x write amplification but only 1 to 2 file reads per query; best for read-heavy point lookups needing sub-5 ms p99
Tombstones from deletes or TTL persist until compaction merges files; scanning 1,000 rows can read 50K tombstones, causing p99 latency spikes from 10 ms to 500+ ms
Compaction stalls happen when the write rate exceeds merge capacity; pending compactions accumulate, read amplification increases, and eventually the system refuses writes to prevent resource exhaustion
Netflix keeps disk occupancy under 60 to 70 percent and provisions 20 to 50 percent extra IOPS headroom so background compaction merges do not starve foreground query reads
Operational metrics include pending compaction count (alert above 5 to 10 tasks), read amplification (files per query), tombstone ratio per scan (warn above 10K), and compaction bandwidth (MB/s); a monitoring sketch follows this list
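A hedged sketch of how those alert thresholds might be wired together. The metric names and the example values are hypothetical placeholders; in a real deployment the numbers would come from nodetool/JMX on Cassandra or the HBase metrics endpoint.

```python
# Hypothetical alerting sketch: metric names and values are placeholders, not a
# real Cassandra/HBase API. Thresholds mirror the rules of thumb listed above.
THRESHOLDS = {
    "pending_compactions": 10,        # alert above 5-10 queued tasks
    "files_per_read_p99": 4,          # read amplification: files touched per query
    "tombstones_per_scan_p99": 10_000,
    "disk_occupancy_pct": 70,         # keep headroom for compaction output
}

def check_node(metrics: dict) -> list[str]:
    """Return human-readable alerts for one node's metric snapshot."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name, 0)
        if value > limit:
            alerts.append(f"{name}={value} exceeds {limit}")
    return alerts

# Example node snapshot (fabricated values for illustration only).
node = {"pending_compactions": 23, "files_per_read_p99": 9,
        "tombstones_per_scan_p99": 180_000, "disk_occupancy_pct": 83}
for alert in check_node(node):
    print("ALERT:", alert)
```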
📌 Examples
Tombstone storm at a time-series metrics platform: a Time To Live (TTL) of 7 days caused 100M metrics to expire simultaneously at midnight. The compaction backlog grew to 500 tasks, scans traversed 200K tombstones each, p99 query latency jumped from 20 ms to 5 seconds, and timeouts cascaded. Solution: stagger TTL across a 24-hour window and run incremental compactions hourly.
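One way the staggered-TTL fix could look, as a hedged sketch: instead of giving every write the same 7-day TTL (so rows expire, and tombstone, all at once), add up to 24 hours of random jitter per write. The function and constant names are illustrative, not taken from the incident.

```python
# Hypothetical sketch of the staggered-TTL fix: add random jitter so expiry is
# spread over a 24-hour window rather than landing at a single midnight.
import random

BASE_TTL_SECONDS = 7 * 24 * 3600      # 7-day retention
JITTER_SECONDS = 24 * 3600            # spread expiry across a full day

def jittered_ttl() -> int:
    return BASE_TTL_SECONDS + random.randint(0, JITTER_SECONDS)

# With a driver, the per-row TTL would be applied on each insert, e.g. in CQL:
#   INSERT INTO metrics (id, ts, value) VALUES (?, ?, ?) USING TTL <jittered_ttl()>
for _ in range(3):
    print("ttl for this write:", jittered_ttl())
```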
Compaction stall during a flash sale: an e-commerce write rate spiked from 50K to 500K writes/sec. Size-tiered compaction could not keep pace; pending tasks grew from 2 to 30, and read amplification increased from 5 to 15 files per query. The cluster reached 85 percent disk occupancy and applied backpressure, refusing writes. Solution: add nodes to distribute load and temporarily disable low-priority background jobs.
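Back-of-the-envelope arithmetic for why the stall happens (the per-row size, write amplification, and disk bandwidth below are illustrative assumptions, not figures from the incident): compaction must rewrite each ingested byte several times, so required background bandwidth is roughly ingest rate times write amplification.

```python
# Capacity check with illustrative numbers: compaction rewrites each logical
# byte ~write_amp times, so it needs ingest_rate * write_amp of disk bandwidth.
# If that exceeds what is left after foreground traffic, the backlog grows.
def compaction_keeps_up(ingest_mb_s: float, write_amp: float,
                        disk_mb_s: float, foreground_share: float = 0.5) -> bool:
    required = ingest_mb_s * write_amp                 # MB/s compaction must sustain
    available = disk_mb_s * (1.0 - foreground_share)   # MB/s left for background merges
    print(f"need {required:.0f} MB/s for compaction, have {available:.0f} MB/s")
    return required <= available

# Normal load: 50K writes/sec at ~1 KB per row, roughly 50 MB/s of ingest.
compaction_keeps_up(ingest_mb_s=50, write_amp=8, disk_mb_s=1000)    # keeps up
# Flash sale: 500K writes/sec, roughly 500 MB/s of ingest -> backlog grows.
compaction_keeps_up(ingest_mb_s=500, write_amp=8, disk_mb_s=1000)   # falls behind
```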
Read path with leveled compaction: a point query checks 1 memtable (100 microseconds), 1 Level 0 file (bloom filter false positive, 2 ms wasted disk read), 1 Level 1 file (bloom filter negative, skipped), and 1 Level 2 file (index lookup 100 microseconds, block cache hit, ~0 ms). Total latency 2.2 ms. With size-tiered compaction and 8 overlapping files, the same query needs 8 bloom filter checks and 3 disk reads, totaling 8 to 12 ms.
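The same arithmetic as a small script. The leveled-compaction numbers are copied from the walkthrough; the per-check and per-read costs in the size-tiered case are assumptions chosen to land inside the stated 8 to 12 ms range.

```python
# Worked version of the read-path arithmetic above (all values in milliseconds).
leveled = {
    "memtable lookup":                            0.1,  # 100 microseconds
    "L0 file (bloom false positive, disk read)":  2.0,
    "L1 file (bloom negative, skipped)":          0.0,
    "L2 file (index lookup, block cache hit)":    0.1,
}
print("leveled total:", round(sum(leveled.values()), 1), "ms")     # 2.2 ms

# Size-tiered with 8 overlapping files: 8 bloom checks plus ~3 actual disk reads.
bloom_check_ms, disk_read_ms = 0.05, 3.0   # assumed costs, chosen for illustration
size_tiered_total = 8 * bloom_check_ms + 3 * disk_read_ms
print("size-tiered total: ~", size_tiered_total, "ms")             # ~9.4 ms, within 8-12 ms
```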