
Wide-Column Store Data Model and LSM-Tree Storage

Wide-column stores organize data into rows identified by a two-part primary key: a partition key that determines which nodes hold the data, and clustering columns that sort cells within the row. Each row can have sparse, dynamic columns, where a cell exists only if it was written. This is row-oriented storage optimized for key-based access over very wide, sorted rows, not columnar analytics.

Under the hood, most implementations use Log-Structured Merge (LSM) tree storage with three layers: an append-only write-ahead log for durability, in-memory tables (memtables) for fast writes, and immutable sorted files (SSTables) on disk. Writes hit memory first (typically 1 to 3 ms), then flush to disk asynchronously. Background compactions continuously merge SSTables and purge older versions and tombstones to reclaim space. Reads traverse multiple layers, using bloom filters to skip files, indexes to locate blocks, and block caches to avoid disk; a single point read might check the memtable plus 3 to 7 on-disk files and merge the results. The design trades cheap, sequential, append-only writes (no in-place updates) for a more complex read path that requires careful tuning.

At scale, Netflix operates hundreds of clusters with thousands of nodes handling over 10 million operations per second sustained at peak, achieving single-digit-millisecond medians and sub-15-to-20 ms p99 latencies at replication factor 3 on Solid State Drive (SSD) backed nodes with aggressive compaction tuning.
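To make the write and read paths concrete, here is a minimal, self-contained Python sketch of the LSM-tree flow described above: a memtable absorbing writes, flushes into immutable sorted runs, newest-first read merging with a filter to skip files, and a naive compaction. All class and method names are illustrative, not any real driver or storage-engine API; real systems add a write-ahead log, true bloom filters, per-block indexes, and block caches.

```python
import bisect
from typing import Optional

class SSTable:
    """Immutable sorted run on 'disk': sorted keys plus a crude key-presence filter."""
    def __init__(self, items: dict):
        self.keys = sorted(items)                    # sorted so lookups can binary-search
        self.values = [items[k] for k in self.keys]
        self.filter = {hash(k) for k in self.keys}   # stand-in for a real bloom filter

    def maybe_contains(self, key) -> bool:
        return hash(key) in self.filter              # real bloom filters allow false positives

    def get(self, key) -> Optional[str]:
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

class LSMTree:
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict = {}                     # in-memory table, absorbs all writes
        self.sstables: list[SSTable] = []            # immutable runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Append-only: never rewrite on-disk files in place, just overwrite in memory.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into an immutable sorted run and start a fresh one.
        self.sstables.append(SSTable(self.memtable))
        self.memtable = {}

    def get(self, key):
        # Read path: memtable first, then runs newest-first, skipping files
        # whose filter rules the key out.
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sstables):
            if sst.maybe_contains(key):
                value = sst.get(key)
                if value is not None:
                    return value
        return None

    def compact(self):
        # Merge all runs, keeping only the newest version of each key,
        # so future reads touch fewer files.
        merged = {}
        for sst in self.sstables:                    # oldest to newest; newer overwrites older
            merged.update(dict(zip(sst.keys, sst.values)))
        self.sstables = [SSTable(merged)] if merged else []

lsm = LSMTree()
for i in range(10):
    lsm.put(f"user_123:{i:02d}", f"event-{i}")
print(lsm.get("user_123:07"))  # served by merging the memtable and on-disk runs
lsm.compact()                   # fewer runs means fewer files each read must check
```

The usage lines at the bottom show why compaction matters: before it runs, a read may have to consult every run ever flushed; afterward, at most one.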
💡 Key Takeaways
Partition key determines data placement across nodes while clustering columns sort cells within a row, enabling bounded range queries on a single partition
LSM-tree writes are append-only, hitting memory first in 1 to 3 ms and flushing asynchronously to immutable files on disk
Reads merge results from the memtable and 3 to 7 on-disk files, using bloom filters to skip files, indexes to locate blocks, and block caches to avoid disk seeks
Background compactions merge files and purge old versions, trading CPU and disk Input/Output Operations Per Second (IOPS) for space reclamation and read efficiency
Storage overhead is approximately 3x the logical data at replication factor 3, plus 20 to 50 percent transient headroom for compactions (see the capacity sketch after this list)
Netflix achieves sub-20 ms p99 latency at 10+ million ops/sec sustained using SSD-backed nodes, token-aware clients, and carefully tuned compaction strategies
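As a quick check on the storage-overhead takeaway, here is a small arithmetic sketch of the physical capacity implied by a given logical dataset. It simply applies the replication factor and compaction headroom quoted above; it is not a sizing formula from any vendor, and where the headroom is charged (per node vs. cluster-wide) is an assumption.

```python
def physical_capacity_tb(logical_tb: float, replication_factor: int = 3,
                         compaction_headroom: float = 0.5) -> float:
    """Rough cluster capacity: logical data x replication, plus transient
    headroom so compactions can rewrite the largest runs."""
    return logical_tb * replication_factor * (1 + compaction_headroom)

# 100 TB of logical data at replication factor 3:
print(physical_capacity_tb(100, compaction_headroom=0.2))  # 360.0 TB with 20% headroom
print(physical_capacity_tb(100, compaction_headroom=0.5))  # 450.0 TB with 50% headroom
```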
📌 Examples
Time-series data model: partition key is userId plus hourBucket (user_123_2024010115), clustering column is timestamp descending. Querying the latest 100 events for a user hits one partition whose size stays bounded under 200 MB compressed.
Apple iCloud runs tens of thousands of nodes storing 100+ PB across clusters, sustaining millions of writes/sec with p99 latencies in the tens of milliseconds using multi-datacenter active-active replication.
Read amplification example: a single point read checks 1 memtable and 4 SSTables, with bloom filter checks (1 microsecond each), index lookups (100 microseconds each), and block reads (1 to 3 ms each if not cached), totaling roughly 5 to 15 ms without cache hits (see the worked calculation after this list).
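The read-amplification example can be reproduced as a worked calculation. The per-step costs (1 µs bloom-filter checks, 100 µs index lookups, 1 to 3 ms uncached block reads) are the illustrative figures from the bullet above, not measured numbers, and the assumption that every SSTable consulted triggers one uncached block read is mine.

```python
def point_read_latency_ms(sstables_checked: int = 4,
                          bloom_us: float = 1.0,
                          index_us: float = 100.0,
                          block_read_ms: float = 2.0,
                          blocks_read: int = 4) -> float:
    """Approximate point-read latency when nothing is in the block cache."""
    bloom_cost_ms = sstables_checked * bloom_us / 1000   # filter check per SSTable
    index_cost_ms = sstables_checked * index_us / 1000   # index lookup per SSTable consulted
    disk_cost_ms = blocks_read * block_read_ms           # uncached block reads dominate
    return bloom_cost_ms + index_cost_ms + disk_cost_ms

# Four SSTables, one uncached block read each, at 1-3 ms per read:
print(point_read_latency_ms(block_read_ms=1.0))  # ~4.4 ms at the low end
print(point_read_latency_ms(block_read_ms=3.0))  # ~12.4 ms at the high end
```

The result lands in the same ballpark as the 5 to 15 ms quoted above and makes the dominant term obvious: uncached block reads, which is why bloom filters and block caches matter so much on the read path.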