
Query-First Data Modeling and Partition Design

Wide-column stores require query-first schema design: you denormalize data and precompute access patterns into separate tables, each optimized for a specific query. Joins and ad hoc queries cannot be performed efficiently, so every anticipated access pattern needs its own table with a carefully chosen partition key. This is fundamentally different from normalized relational design.

The partition key determines data placement and is the only efficient lookup path. All queries must include the partition key to avoid cluster-wide scatter-gather, which causes unbounded tail latency. Clustering columns provide sorted ordering for range queries, but only within a single partition. Design partitions to be bounded (typically 100 to 200 MB compressed, avoiding multi-GB partitions) so that reads and compactions stay predictable.

Time-series workloads bucket by time (userId plus hourBucket or dayBucket) to prevent unbounded partition growth. Fan-out indexes maintain per-user inbox tables, where each user is a partition containing that user's thread list sorted by timestamp.

Hot-partition failures occur when partition keys are poorly chosen. Monotonic identifiers (sequential user IDs, timestamps as partition keys) concentrate writes on a single node, and celebrity users with millions of followers create read hotspots. Solutions include salting (adding a random prefix to distribute load), composite keys (userId plus dateBucket), and sharding high-traffic entities across multiple partitions.

Apple iCloud handles millions of writes per second across tens of thousands of nodes by distributing user metadata and device state with well-chosen partition keys. The tradeoff is operational simplicity and predictable performance (bounded partitions give consistent p99 latency) versus query flexibility (no ad hoc analytics, joins, or global secondary indexes at scale).
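The sketch below shows what this pattern can look like with the Python cassandra-driver; the timeline keyspace, the posts_by_user table, the local contact point, and the string day_bucket format are illustrative assumptions, not details from the text above.

```python
# Query-first table sketch: one table per access pattern, partitioned by
# (user_id, day_bucket) so "latest posts for a user today" is a single
# bounded-partition read. Keyspace, table, and column names are hypothetical.
import datetime
import uuid

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])           # assumed local test cluster
session = cluster.connect("timeline")      # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS posts_by_user (
        user_id    uuid,
        day_bucket text,        -- e.g. '2024-06-01'; keeps the partition bounded
        posted_at  timestamp,
        body       text,
        PRIMARY KEY ((user_id, day_bucket), posted_at)
    ) WITH CLUSTERING ORDER BY (posted_at DESC)
""")

# Write path: the application computes the time bucket, so no partition
# grows without bound.
insert = session.prepare(
    "INSERT INTO posts_by_user (user_id, day_bucket, posted_at, body) "
    "VALUES (?, ?, ?, ?)")
user_id = uuid.uuid4()
now = datetime.datetime.now(datetime.timezone.utc)
bucket = now.strftime("%Y-%m-%d")
session.execute(insert, (user_id, bucket, now, "hello, wide rows"))

# Read path: the partition key is always supplied, and the clustering order
# returns newest-first without a sort step.
read = session.prepare(
    "SELECT posted_at, body FROM posts_by_user "
    "WHERE user_id = ? AND day_bucket = ? LIMIT 50")
latest = list(session.execute(read, (user_id, bucket)))
```

Each additional access pattern (posts by tag, a follower's home feed) would get its own table with its own partition key, populated at write time rather than queried ad hoc.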
💡 Key Takeaways
Query-first design requires denormalizing data into separate tables for each access pattern, trading storage (2x to 5x duplication is common) for bounded query latency
Partition keys must be included in every query to avoid cluster-wide scatter-gather; a missing partition key causes fan-out to all nodes, with p99 latency spiking 10x to 100x
Bounded partitions (targeting 100 to 200 MB compressed) keep reads predictable and compactions manageable; unbounded partitions cause compaction stalls and cache churn
Time-series workloads require bucketing by time window (hourly or daily) in the partition key to prevent a single partition from growing indefinitely
Hot partitions from monotonic keys or celebrity users concentrate load on a single node; salting with a random prefix (0 to 9) distributes writes across multiple partitions at the cost of scatter reads
Global secondary indexes degenerate into cluster-wide scatter with unbounded tail latency; prefer materialized views written at insert time with their own partition keys, as in the sketch after this list
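As a sketch of that last point, the snippet below fans a write out to two hypothetical tables (messages_by_thread and inbox_by_user), each with its own partition key, instead of relying on a global secondary index; the keyspace, schemas, and send_message helper are assumptions for illustration.

```python
# Insert-time fan-out sketch: the same message is written to every table that
# will later serve a read, so each read stays within one partition.
# Assumed schemas (created elsewhere):
#   messages_by_thread: PRIMARY KEY ((thread_id), sent_at)  -- thread history
#   inbox_by_user:      PRIMARY KEY ((user_id), sent_at)    -- per-user inbox
import datetime

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("messaging")     # assumed keyspace

to_thread = session.prepare(
    "INSERT INTO messages_by_thread (thread_id, sent_at, sender_id, body) "
    "VALUES (?, ?, ?, ?)")
to_inbox = session.prepare(
    "INSERT INTO inbox_by_user (user_id, sent_at, thread_id, body) "
    "VALUES (?, ?, ?, ?)")

def send_message(thread_id, sender_id, recipient_ids, body):
    """Write once per read path: the thread table plus each recipient's inbox."""
    sent_at = datetime.datetime.now(datetime.timezone.utc)
    batch = BatchStatement()               # logged batch keeps the copies in sync
    batch.add(to_thread, (thread_id, sent_at, sender_id, body))
    for user_id in recipient_ids:
        batch.add(to_inbox, (user_id, sent_at, thread_id, body))
    session.execute(batch)
```

A logged batch trades some write latency for keeping the denormalized copies consistent; designs that can tolerate brief divergence often issue the writes asynchronously instead.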
📌 Examples
Netflix timeline service: partition key (userId, dayBucket), clustering by timestamp descending. The query "show the latest 50 posts" hits one partition (80 to 120 MB) in 5 to 8 ms at p99. Day buckets older than 30 days are archived to object storage.
Instagram like counter: the original design used a single partition per post for the like count, and viral posts (millions of likes) overwhelmed that partition. The redesign uses salted counters with partition key (postId, shard) and 100 shards: increment a random shard on each like, sum all shards on read. Writes are distributed across 100 partitions (see the sketch after these examples).
Hot-partition failure: an e-commerce flash sale sends 50K requests/sec to a single product partition. The node owning that partition saturates its CPU, p99 latency jumps from 10 ms to 500 ms, and timeouts cascade. Solution: pre-shard product inventory across 10 partitions before the sale.
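A minimal sketch of the salted-counter remedy from the Instagram example above (the flash-sale fix is the same idea with a fixed shard count); the social keyspace, the post_likes_sharded table, and the 100-shard constant are assumptions, not details from the source.

```python
# Salted (sharded) counter sketch: likes for one post are spread over SHARDS
# partitions, so a viral post's write load lands on many nodes instead of one.
import random

from cassandra.cluster import Cluster

SHARDS = 100                               # assumed shard count

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("social")        # assumed keyspace

session.execute("""
    CREATE TABLE IF NOT EXISTS post_likes_sharded (
        post_id uuid,
        shard   int,
        likes   counter,
        PRIMARY KEY ((post_id, shard))
    )
""")

incr = session.prepare(
    "UPDATE post_likes_sharded SET likes = likes + 1 "
    "WHERE post_id = ? AND shard = ?")
read = session.prepare(
    "SELECT likes FROM post_likes_sharded WHERE post_id = ? AND shard = ?")

def add_like(post_id):
    # Write path: bump one randomly chosen shard, spreading writes over SHARDS partitions.
    session.execute(incr, (post_id, random.randrange(SHARDS)))

def like_count(post_id):
    # Read path: scatter across all shards and sum -- the cost of salting.
    total = 0
    for shard in range(SHARDS):
        row = session.execute(read, (post_id, shard)).one()
        if row is not None:
            total += row.likes
    return total
```

The read loop runs the shard queries sequentially for clarity; a production path would issue them concurrently (for example via execute_async) or cache the aggregated count.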