
Wide-Column Store Data Model and LSM-Tree Storage

Wide-column stores organize data into rows identified by a two-part primary key: a partition key that determines which nodes hold the data, and clustering columns that sort cells within the row. Each row can have sparse, dynamic columns, where a cell exists only if it was written. This is row-oriented storage optimized for key-based access over very wide, sorted rows, not columnar analytics.

Under the hood, most implementations use Log-Structured Merge (LSM) tree storage with three layers: an append-only write-ahead log for durability, in-memory tables (memtables) for fast writes, and immutable sorted files (SSTables) on disk. Writes hit memory first (typically 1 to 3 ms), then flush to disk asynchronously. Background compactions continuously merge SSTables and purge older versions and tombstones to reclaim space. Reads traverse multiple layers, using bloom filters to skip files, indexes to locate blocks, and block caches to avoid disk; a single point read might check the memtable plus 3 to 7 on-disk files and merge the results. The design trades cheap, sequential, append-only writes (no in-place updates) for a more complex read path that requires careful tuning.

At scale, Netflix operates hundreds of clusters with thousands of nodes handling over 10 million operations per second sustained at peak, achieving single-digit-millisecond medians and sub-15-to-20 ms p99 latencies at replication factor 3 on Solid State Drive (SSD) backed nodes with aggressive compaction tuning.
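To make the write and read paths concrete, here is a minimal, self-contained Python sketch of the LSM-tree flow described above: a memtable absorbing writes, flushes into immutable sorted runs, newest-first read merging with a filter to skip files, and a naive compaction. All class and method names are illustrative, not any real driver or storage-engine API; real systems add a write-ahead log, true bloom filters, per-block indexes, and block caches.

```python
import bisect
from typing import Optional

class SSTable:
    """Immutable sorted run on 'disk': sorted keys plus a crude key-presence filter."""
    def __init__(self, items: dict):
        self.keys = sorted(items)                    # sorted so lookups can binary-search
        self.values = [items[k] for k in self.keys]
        self.filter = {hash(k) for k in self.keys}   # stand-in for a real bloom filter

    def maybe_contains(self, key) -> bool:
        return hash(key) in self.filter              # real bloom filters allow false positives

    def get(self, key) -> Optional[str]:
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

class LSMTree:
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict = {}                     # in-memory table, absorbs all writes
        self.sstables: list[SSTable] = []            # immutable runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Append-only: never rewrite on-disk files in place, just overwrite in memory.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into an immutable sorted run and start a fresh one.
        self.sstables.append(SSTable(self.memtable))
        self.memtable = {}

    def get(self, key):
        # Read path: memtable first, then runs newest-first, skipping files
        # whose filter rules the key out.
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.sstables):
            if sst.maybe_contains(key):
                value = sst.get(key)
                if value is not None:
                    return value
        return None

    def compact(self):
        # Merge all runs, keeping only the newest version of each key,
        # so future reads touch fewer files.
        merged = {}
        for sst in self.sstables:                    # oldest to newest; newer overwrites older
            merged.update(dict(zip(sst.keys, sst.values)))
        self.sstables = [SSTable(merged)] if merged else []

lsm = LSMTree()
for i in range(10):
    lsm.put(f"user_123:{i:02d}", f"event-{i}")
print(lsm.get("user_123:07"))  # served by merging the memtable and on-disk runs
lsm.compact()                   # fewer runs means fewer files each read must check
```

The usage lines at the bottom show why compaction matters: before it runs, a read may have to consult every run ever flushed; afterward, at most one.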
💡 Key Takeaways
Partition key determines data placement across nodes while clustering columns sort cells within a row, enabling bounded range queries on a single partition
LSM-tree writes are append-only, hitting memory first in 1 to 3 ms and flushing asynchronously to immutable files on disk
Reads merge results from the memtable and 3 to 7 on-disk files, using bloom filters to skip files, indexes to locate blocks, and block caches to avoid disk seeks
Background compactions merge files and purge old versions, trading CPU and disk Input/Output Operations Per Second (IOPS) for space reclamation and read efficiency
Storage overhead is approximately 3x the logical data at replication factor 3, plus 20 to 50 percent transient headroom for compactions (see the capacity sketch after this list)
Netflix achieves sub-20 ms p99 latency at 10+ million ops/sec sustained using SSD-backed nodes, token-aware clients, and carefully tuned compaction strategies
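As a quick check on the storage-overhead takeaway, here is a small arithmetic sketch of the physical capacity implied by a given logical dataset. It simply applies the replication factor and compaction headroom quoted above; it is not a sizing formula from any vendor, and where the headroom is charged (per node vs. cluster-wide) is an assumption.

```python
def physical_capacity_tb(logical_tb: float, replication_factor: int = 3,
                         compaction_headroom: float = 0.5) -> float:
    """Rough cluster capacity: logical data x replication, plus transient
    headroom so compactions can rewrite the largest runs."""
    return logical_tb * replication_factor * (1 + compaction_headroom)

# 100 TB of logical data at replication factor 3:
print(physical_capacity_tb(100, compaction_headroom=0.2))  # 360.0 TB with 20% headroom
print(physical_capacity_tb(100, compaction_headroom=0.5))  # 450.0 TB with 50% headroom
```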
📌 Examples
Time-series data model: partition key is userId plus hourBucket (user_123_2024010115), clustering column is timestamp descending. Querying the latest 100 events for a user hits one partition whose size stays bounded under 200 MB compressed.
Apple iCloud runs tens of thousands of nodes storing 100+ PB across clusters, sustaining millions of writes/sec with p99 latencies in the tens of milliseconds using multi-datacenter active-active replication.
Read amplification example: a single point read checks 1 memtable and 4 SSTables, with bloom filter checks (1 microsecond each), index lookups (100 microseconds each), and block reads (1 to 3 ms each if not cached), totaling roughly 5 to 15 ms without cache hits (see the worked calculation after this list).
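The read-amplification example can be reproduced as a worked calculation. The per-step costs (1 µs bloom-filter checks, 100 µs index lookups, 1 to 3 ms uncached block reads) are the illustrative figures from the bullet above, not measured numbers, and the assumption that every SSTable consulted triggers one uncached block read is mine.

```python
def point_read_latency_ms(sstables_checked: int = 4,
                          bloom_us: float = 1.0,
                          index_us: float = 100.0,
                          block_read_ms: float = 2.0,
                          blocks_read: int = 4) -> float:
    """Approximate point-read latency when nothing is in the block cache."""
    bloom_cost_ms = sstables_checked * bloom_us / 1000   # filter check per SSTable
    index_cost_ms = sstables_checked * index_us / 1000   # index lookup per SSTable consulted
    disk_cost_ms = blocks_read * block_read_ms           # uncached block reads dominate
    return bloom_cost_ms + index_cost_ms + disk_cost_ms

# Four SSTables, one uncached block read each, at 1-3 ms per read:
print(point_read_latency_ms(block_read_ms=1.0))  # ~4.4 ms at the low end
print(point_read_latency_ms(block_read_ms=3.0))  # ~12.4 ms at the high end
```

The result lands in the same ballpark as the 5 to 15 ms quoted above and makes the dominant term obvious: uncached block reads, which is why bloom filters and block caches matter so much on the read path.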