
When Should You Choose LSM Trees Over B+ Trees?

Log-Structured Merge (LSM) trees optimize for write throughput by batching updates in memory and flushing them sequentially to disk, avoiding the random writes and page-split overhead inherent in B+ trees. Writes first go to an in-memory structure called a memtable (typically a skip list or red-black tree) and an append-only Write-Ahead Log (WAL). When the memtable reaches a threshold, typically 64 to 256 MB, it is flushed to an immutable on-disk file called an SSTable (Sorted String Table). Background compaction merges SSTables across levels to bound the number of files a read must check.

This architecture delivers sustained write throughput of 50K to 100K operations per second on commodity SSDs because writes are always sequential. Google Cloud Bigtable achieves single-digit millisecond p50 read latency for point lookups when the working set is warm, touching only 1 to 3 SSTables after recent compactions. Amazon DynamoDB uses LSM trees and reports single-digit millisecond p50 for both reads and writes at any scale, with Global Secondary Indexes (GSIs) each backed by separate LSM trees.

The tradeoff is read amplification and unpredictable tail latency. A point lookup must check the memtable plus every SSTable that might contain the key. Bloom filters (typically 10 bits per key, yielding a roughly 1 percent false positive rate) skip most negative lookups, but positive hits still require reading from multiple levels. Range scans suffer more because they cannot rely on Bloom filters and must merge-sort results from many SSTables. During heavy write periods, compaction can fall behind, increasing the number of SSTables a read must check from 3 to 10 or more, which directly inflates p99 read latency.

LSM trees shine in write-heavy Key-Value (KV) and NoSQL workloads where point lookups dominate and you can tolerate occasional latency spikes during compaction. B+ trees remain superior for read-heavy Online Transaction Processing (OLTP) with frequent range queries and strict p99 latency requirements, because they offer predictable 3 to 4 page read costs regardless of write load.
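To make these mechanics concrete, here is a minimal Python sketch of the write and read paths, assuming a toy single-level store with no compaction, Bloom filters, or block indexes. The class name `ToyLSM`, the JSON-lines WAL format, and the entry-count flush threshold are illustrative assumptions, not any real engine's API.

```python
import json

class ToyLSM:
    """Toy LSM store: append-only WAL + in-memory memtable + immutable SSTables.

    Real engines (RocksDB, Cassandra, Bigtable) add levels, compaction,
    Bloom filters, and block indexes; this only shows the basic flow.
    """

    def __init__(self, wal_path="wal.log", flush_threshold=4):
        self.wal = open(wal_path, "a")          # sequential, append-only log for durability
        self.memtable = {}                      # stand-in for a skip list / red-black tree
        self.sstables = []                      # flushed, immutable sorted runs (newest first)
        self.flush_threshold = flush_threshold  # counted in entries, not MB, for the toy example

    def put(self, key, value):
        # 1. Durability first: sequential append to the WAL.
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        # 2. Apply the write to the memtable (an overwrite is just a newer version).
        self.memtable[key] = value
        # 3. Flush to an immutable SSTable once the memtable hits its threshold.
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        sstable = sorted(self.memtable.items())  # SSTables are sorted by key
        self.sstables.insert(0, sstable)         # newest run first
        self.memtable = {}

    def get(self, key):
        # Read path: memtable first, then SSTables from newest to oldest.
        # Without Bloom filters every run may need to be probed -- this is
        # the read amplification the section describes.
        if key in self.memtable:
            return self.memtable[key]
        for run in self.sstables:
            for k, v in run:  # a real SSTable would binary-search a block index
                if k == key:
                    return v
        return None

db = ToyLSM()
for i in range(10):
    db.put(f"user:{i}", {"visits": i})
print(db.get("user:3"), len(db.sstables))  # value found in a flushed run; 2 SSTables on disk
```

Note how every disk write in the sketch is an append or a whole-file flush; the random I/O that a B+ tree would spend on page updates and splits simply never happens here, which is the source of the write-throughput advantage described above.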
💡 Key Takeaways
Sequential writes enable 50K to 100K sustained write operations per second on SSDs compared to 10K to 20K for B+ trees with random page updates
Bloom filters at 10 bits per key yield a roughly 1 percent false positive rate (the short calculation after this list reproduces this figure), allowing reads to skip 99 percent of SSTables that do not contain the target key
Read amplification increases during heavy write periods: the normal state checks 1 to 3 SSTables, but compaction debt can push this to 10 or more files, inflating p99 latency by 3x to 5x
Each secondary index in LSM systems is a separate LSM tree, so adding 2 Global Secondary Indexes triples write amplification from 1x to 3x across primary and index trees
Range scans must merge sort results from multiple SSTables without Bloom filter benefits, making them 5x to 10x slower than B+ tree range scans on the same dataset
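The 10-bits-per-key figure follows from the standard Bloom filter false positive approximation p ≈ (1 - e^(-k/(m/n)))^k, where k is the number of hash functions and m/n is bits per key. A quick calculation (the function name is hypothetical) reproduces the roughly 1 percent rate quoted above:

```python
import math

def bloom_false_positive_rate(bits_per_key: float, num_hashes: int) -> float:
    """Standard approximation: p ~= (1 - e^(-k / (m/n)))^k."""
    return (1.0 - math.exp(-num_hashes / bits_per_key)) ** num_hashes

bits_per_key = 10
optimal_k = round(bits_per_key * math.log(2))   # k = (m/n) * ln 2 ~= 7 hash functions
print(optimal_k, bloom_false_positive_rate(bits_per_key, optimal_k))
# 7 0.0081... -> roughly 1 percent, matching the takeaway above
```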
📌 Examples
Google Cloud Bigtable: Single-digit millisecond p50 reads for point lookups when row and column-family working sets are warm; Bloom filters reduce disk touches by skipping negative lookups across 3 SSTable levels
Amazon DynamoDB: At 50K to 100K writes per second, Global Secondary Index lag can reach seconds when compaction and asynchronous propagation saturate, causing temporary read-your-writes inconsistency
Apache Cassandra (LSM): A table with 3 secondary indexes writing at 20K ops/sec generates 80K total write operations (20K primary + 60K index updates), increasing storage Input/Output (I/O) and compaction load proportionally; the sketch below shows the fan-out arithmetic
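The arithmetic in the Cassandra example is simply the base write rate multiplied by one plus the number of secondary indexes, since each index is maintained as its own LSM tree. A one-line sketch (the function name is hypothetical):

```python
def total_write_ops(base_ops_per_sec: int, num_secondary_indexes: int) -> int:
    """Every base write fans out into one primary write plus one write per index."""
    return base_ops_per_sec * (1 + num_secondary_indexes)

print(total_write_ops(20_000, 3))  # 80000 = 20K primary + 60K index updates
```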