
Query-First Data Modeling and Partition Design

Wide-column stores require query-first schema design: you denormalize data and precompute access patterns into separate tables, each optimized for a specific query. Joins and ad hoc queries cannot be performed efficiently, so every anticipated access pattern needs its own table with a carefully chosen partition key. This is fundamentally different from normalized relational design.

The partition key determines data placement and is the only efficient lookup path. All queries must include the partition key to avoid cluster-wide scatter-gather, which causes unbounded tail latency. Clustering columns provide sorted ordering for range queries, but only within a single partition. Design partitions to be bounded (typically 100 to 200 MB compressed, avoiding multi-GB partitions) so that reads and compactions stay predictable.

Time-series workloads bucket by time (userId plus hourBucket or dayBucket) to prevent unbounded partition growth. Fan-out indexes maintain per-user inbox tables, where each user is a partition containing that user's thread list sorted by timestamp.

Hot-partition failures occur when partition keys are poorly chosen. Monotonic identifiers (sequential user IDs, timestamps as partition keys) concentrate writes on a single node, and celebrity users with millions of followers create read hotspots. Solutions include salting (adding a random prefix to distribute load), composite keys (userId plus dateBucket), and sharding high-traffic entities across multiple partitions.

Apple iCloud handles millions of writes per second across tens of thousands of nodes by distributing user metadata and device state with well-chosen partition keys. The tradeoff is operational simplicity and predictable performance (bounded partitions give consistent p99 latency) versus query flexibility (no ad hoc analytics, joins, or global secondary indexes at scale).
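The sketch below shows what this pattern can look like with the Python cassandra-driver; the timeline keyspace, the posts_by_user table, the local contact point, and the string day_bucket format are illustrative assumptions, not details from the text above.

```python
# Query-first table sketch: one table per access pattern, partitioned by
# (user_id, day_bucket) so "latest posts for a user today" is a single
# bounded-partition read. Keyspace, table, and column names are hypothetical.
import datetime
import uuid

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])           # assumed local test cluster
session = cluster.connect("timeline")      # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS posts_by_user (
        user_id    uuid,
        day_bucket text,        -- e.g. '2024-06-01'; keeps the partition bounded
        posted_at  timestamp,
        body       text,
        PRIMARY KEY ((user_id, day_bucket), posted_at)
    ) WITH CLUSTERING ORDER BY (posted_at DESC)
""")

# Write path: the application computes the time bucket, so no partition
# grows without bound.
insert = session.prepare(
    "INSERT INTO posts_by_user (user_id, day_bucket, posted_at, body) "
    "VALUES (?, ?, ?, ?)")
user_id = uuid.uuid4()
now = datetime.datetime.now(datetime.timezone.utc)
bucket = now.strftime("%Y-%m-%d")
session.execute(insert, (user_id, bucket, now, "hello, wide rows"))

# Read path: the partition key is always supplied, and the clustering order
# returns newest-first without a sort step.
read = session.prepare(
    "SELECT posted_at, body FROM posts_by_user "
    "WHERE user_id = ? AND day_bucket = ? LIMIT 50")
latest = list(session.execute(read, (user_id, bucket)))
```

Each additional access pattern (posts by tag, a follower's home feed) would get its own table with its own partition key, populated at write time rather than queried ad hoc.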
💡 Key Takeaways
Query-first design requires denormalizing data into separate tables for each access pattern, trading storage (2x to 5x duplication is common) for bounded query latency
Partition keys must be included in every query to avoid cluster-wide scatter-gather; a missing partition key causes fan-out to all nodes, with p99 latency spiking 10x to 100x
Bounded partitions (targeting 100 to 200 MB compressed) keep reads predictable and compactions manageable; unbounded partitions cause compaction stalls and cache churn
Time-series workloads require bucketing by time window (hourly or daily) in the partition key to prevent a single partition from growing indefinitely
Hot partitions from monotonic keys or celebrity users concentrate load on a single node; salting with a random prefix (0 to 9) distributes writes across multiple partitions at the cost of scatter reads
Global secondary indexes degenerate into cluster-wide scatter with unbounded tail latency; prefer materialized views written at insert time with their own partition keys, as in the sketch after this list
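As a sketch of that last point, the snippet below fans a write out to two hypothetical tables (messages_by_thread and inbox_by_user), each with its own partition key, instead of relying on a global secondary index; the keyspace, schemas, and send_message helper are assumptions for illustration.

```python
# Insert-time fan-out sketch: the same message is written to every table that
# will later serve a read, so each read stays within one partition.
# Assumed schemas (created elsewhere):
#   messages_by_thread: PRIMARY KEY ((thread_id), sent_at)  -- thread history
#   inbox_by_user:      PRIMARY KEY ((user_id), sent_at)    -- per-user inbox
import datetime

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("messaging")     # assumed keyspace

to_thread = session.prepare(
    "INSERT INTO messages_by_thread (thread_id, sent_at, sender_id, body) "
    "VALUES (?, ?, ?, ?)")
to_inbox = session.prepare(
    "INSERT INTO inbox_by_user (user_id, sent_at, thread_id, body) "
    "VALUES (?, ?, ?, ?)")

def send_message(thread_id, sender_id, recipient_ids, body):
    """Write once per read path: the thread table plus each recipient's inbox."""
    sent_at = datetime.datetime.now(datetime.timezone.utc)
    batch = BatchStatement()               # logged batch keeps the copies in sync
    batch.add(to_thread, (thread_id, sent_at, sender_id, body))
    for user_id in recipient_ids:
        batch.add(to_inbox, (user_id, sent_at, thread_id, body))
    session.execute(batch)
```

A logged batch trades some write latency for keeping the denormalized copies consistent; designs that can tolerate brief divergence often issue the writes asynchronously instead.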
📌 Examples
Netflix timeline service: partition key (userId, dayBucket), clustering by timestamp descending. The query "show the latest 50 posts" hits one partition (80 to 120 MB) in 5 to 8 ms at p99. Day buckets older than 30 days are archived to object storage.
Instagram like counter: the original design used a single partition per post for the like count, and viral posts (millions of likes) overwhelmed that partition. The redesign uses salted counters with partition key (postId, shard) and 100 shards: increment a random shard on each like, sum all shards on read. Writes are distributed across 100 partitions (see the sketch after these examples).
Hot-partition failure: an e-commerce flash sale sends 50K requests/sec to a single product partition. The node owning that partition saturates its CPU, p99 latency jumps from 10 ms to 500 ms, and timeouts cascade. Solution: pre-shard product inventory across 10 partitions before the sale.
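A minimal sketch of the salted-counter remedy from the Instagram example above (the flash-sale fix is the same idea with a fixed shard count); the social keyspace, the post_likes_sharded table, and the 100-shard constant are assumptions, not details from the source.

```python
# Salted (sharded) counter sketch: likes for one post are spread over SHARDS
# partitions, so a viral post's write load lands on many nodes instead of one.
import random

from cassandra.cluster import Cluster

SHARDS = 100                               # assumed shard count

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("social")        # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS post_likes_sharded (
        post_id uuid,
        shard   int,
        likes   counter,
        PRIMARY KEY ((post_id, shard))
    )
""")

incr = session.prepare(
    "UPDATE post_likes_sharded SET likes = likes + 1 "
    "WHERE post_id = ? AND shard = ?")
read = session.prepare(
    "SELECT likes FROM post_likes_sharded WHERE post_id = ? AND shard = ?")

def add_like(post_id):
    # Write path: bump one randomly chosen shard, spreading writes over SHARDS partitions.
    session.execute(incr, (post_id, random.randrange(SHARDS)))

def like_count(post_id):
    # Read path: scatter across all shards and sum -- the cost of salting.
    total = 0
    for shard in range(SHARDS):
        row = session.execute(read, (post_id, shard)).one()
        if row is not None:
            total += row.likes
    return total
```

The read loop runs the shard queries sequentially for clarity; a production path would issue them concurrently (for example via execute_async) or cache the aggregated count.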