
Storage Engine Patterns: In-Memory vs Log-Structured Merge Trees

In-memory key-value stores like Redis keep both the index and the values in RAM for microsecond to sub-millisecond access. A single shard commonly sustains 100,000 to 500,000 operations per second on commodity hardware, with sub-millisecond p50 latency and a p99 in the low single-digit milliseconds. Persistence happens via periodic snapshots (copying the entire dataset to disk) or an append-only log to which each write is appended. The tradeoff is cost: RAM costs 10 to 100 times more per GB-month than solid-state drives, making in-memory stores best suited to hot data, sessions, and derived views under tens of gigabytes per shard.

Log-structured merge trees (LSM trees) optimize for write throughput on disk. Writes append to a commit log for durability, then insert into an in-memory memtable (often a red-black tree or skip list). When the memtable reaches a size threshold (typically 64 MB to 128 MB), it flushes to an immutable sorted file on disk called an SSTable. Reads consult the memtable first, then use bloom filters (probabilistic data structures with a configurable false-positive rate, commonly around 1%) to skip SSTables that cannot contain the key, and finally read the relevant SSTables. DynamoDB and Cassandra use LSM trees to handle millions of writes per second at single-digit to tens-of-milliseconds latency.

Compaction merges SSTables to control read amplification; without it, reads must check dozens of files and latency degrades. Leveled compaction (used by RocksDB) organizes SSTables into levels, each roughly 10 times larger than the one before, limiting reads to one SSTable per level for predictable latency. Size-tiered compaction merges SSTables of similar size, optimizing for write throughput at the cost of higher read amplification. Compaction consumes I/O bandwidth and CPU, so production systems rate-limit it and schedule it off-peak to avoid latency spikes. Tuning bloom-filter false-positive rates, memtable size, and compaction strategy balances write amplification (bytes written to disk per application write), read amplification (SSTables checked per read), and space amplification (storage overhead).

Large values and memory fragmentation plague in-memory stores. Storing megabyte blobs increases tail latency through memory-copy costs and causes allocator fragmentation, leading to garbage-collection pauses in some implementations. Production systems cap value sizes around 1 MB, push large objects to an external blob store, and tune the memory allocator (jemalloc for Redis). Snapshotting an in-memory store through a copy-on-write fork, as Redis does, can also spike latency under heavy write load; taking snapshots on replica nodes keeps that work off the primary serving path.
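As a concrete illustration of the large-value guidance above, here is a minimal Python sketch, assuming the redis-py client and a local Redis instance: values up to 1 MB are stored inline, while larger blobs go to an external object store and only a pointer is kept in Redis. The put_blob_to_object_store and get_blob_from_object_store helpers are hypothetical placeholders for whatever blob-store client you actually use.

```python
import uuid

import redis  # assumes the redis-py client (pip install redis)

MAX_INLINE_BYTES = 1 * 1024 * 1024  # ~1 MB cap on values kept in the key-value store

r = redis.Redis(host="localhost", port=6379)  # assumes a local Redis instance


def put_blob_to_object_store(data: bytes) -> str:
    """Hypothetical helper: upload to S3/GCS/etc. and return the object key."""
    object_key = f"blobs/{uuid.uuid4()}"
    # ... call your blob-store client here ...
    return object_key


def get_blob_from_object_store(object_key: str) -> bytes:
    """Hypothetical helper: download the blob from the external store."""
    raise NotImplementedError(f"wire up your blob-store client for {object_key}")


def cache_set(key: str, value: bytes) -> None:
    """Small values go inline; large blobs go to the object store, with a pointer in Redis."""
    if len(value) <= MAX_INLINE_BYTES:
        r.set(key, value)
    else:
        r.set(key, b"ptr:" + put_blob_to_object_store(value).encode())


def cache_get(key: str) -> bytes | None:
    value = r.get(key)
    if value is not None and value.startswith(b"ptr:"):
        # Large object: resolve the pointer against the external blob store.
        return get_blob_from_object_store(value[4:].decode())
    return value
```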
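To make the LSM write and read paths described in this section concrete, here is a toy Python sketch, not any real engine's implementation: the commit log and SSTables live in memory, a plain dict stands in for the skip-list memtable, and a Python set plays the role of a bloom filter ("definitely not here" checks).

```python
import bisect

MEMTABLE_FLUSH_BYTES = 64 * 1024 * 1024  # mirrors the typical 64 MB threshold above


class TinyLSM:
    """Toy LSM tree: commit log + sorted memtable + immutable 'SSTables' (string keys/values)."""

    def __init__(self):
        self.commit_log = []   # append-only log for durability (stand-in for a file)
        self.memtable = {}     # real engines use a skip list or red-black tree
        self.memtable_bytes = 0
        self.sstables = []     # newest first; each is (keys, values, key_set)

    def put(self, key, value):
        self.commit_log.append((key, value))         # 1. durability first
        self.memtable[key] = value                    # 2. then the memtable
        self.memtable_bytes += len(key) + len(value)
        if self.memtable_bytes >= MEMTABLE_FLUSH_BYTES:
            self._flush()

    def _flush(self):
        items = sorted(self.memtable.items())         # SSTables are sorted by key
        keys = [k for k, _ in items]
        values = [v for _, v in items]
        # set(keys) stands in for a bloom filter attached to the SSTable
        self.sstables.insert(0, (keys, values, set(keys)))
        self.memtable.clear()
        self.memtable_bytes = 0
        # a real engine would also truncate the commit log here

    def get(self, key):
        if key in self.memtable:                      # newest data wins
            return self.memtable[key]
        for keys, values, key_set in self.sstables:   # newest SSTable first
            if key not in key_set:                    # bloom-filter-style skip
                continue
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None
```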
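The tuning trade-offs above also lend themselves to back-of-envelope math: the standard bloom-filter sizing formulas give roughly 9.6 bits per key and about 7 hash functions for a 1% false-positive rate, and under leveled compaction with a 10x fanout the number of levels (roughly the SSTables touched by a point read) grows logarithmically with data size. The 256 MB level-1 size in this sketch is an illustrative assumption, not any engine's quoted default.

```python
import math


def bloom_bits_per_key(false_positive_rate):
    """Bits per key for an optimally sized bloom filter: m/n = -ln(p) / (ln 2)^2."""
    return -math.log(false_positive_rate) / (math.log(2) ** 2)


def bloom_num_hashes(false_positive_rate):
    """Optimal number of hash functions: k = (m/n) * ln 2."""
    return bloom_bits_per_key(false_positive_rate) * math.log(2)


def leveled_read_amplification(total_bytes, level1_bytes, fanout=10):
    """Rough level count (~ SSTables touched per point read) under leveled compaction."""
    return 1 + math.ceil(math.log(total_bytes / level1_bytes, fanout))


p = 0.01
print(f"{bloom_bits_per_key(p):.1f} bits/key, "
      f"{bloom_num_hashes(p):.1f} hash functions")      # ~9.6 bits/key, ~6.6 -> use 7

TB, MB = 2**40, 2**20
print(leveled_read_amplification(1 * TB, 256 * MB))      # ~5 levels for 1 TB of data
```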
💡 Key Takeaways
In-memory stores sustain 100,000 to 500,000 operations per second per shard with sub-millisecond p50 latency, but RAM costs 10 to 100 times more per GB-month than solid-state drives
Log-structured merge trees buffer writes in a memtable (64 MB to 128 MB) and flush it to immutable SSTables on disk, achieving millions of writes per second at roughly 5 to 10 milliseconds latency
Bloom filters with a ~1% false-positive rate let reads skip SSTables; leveled compaction limits reads to one SSTable per level for predictable latency, while size-tiered compaction optimizes for write throughput
Compaction merges SSTables to control read amplification but consumes I/O bandwidth; production systems rate-limit compaction and schedule it off-peak to avoid latency spikes
Large values cause memory fragmentation and tail-latency spikes in in-memory stores; cap values around 1 MB and store larger blobs externally, keeping only pointers in the key-value system
📌 Examples
A single Redis shard handles 500,000 operations per second with sub-millisecond p99 using in-memory storage and asynchronous snapshots to disk, at a total cost around $50/GB/month for memory
DynamoDB's LSM-based storage handled over 100 million requests per second during Prime Day with single-digit-millisecond median latencies and automatic compaction management
Cassandra's leveled compaction strategy limits reads to one SSTable per level (logarithmically many in total data size), providing predictable p99 latencies under 20 milliseconds for multi-terabyte datasets