
What is Hash-Based Partitioning and Why Use It?

Hash-based partitioning distributes data records across multiple nodes by applying a deterministic hash function to a partition key (such as user_id or account_id) and mapping the result to a specific shard or physical server. The fundamental pattern is simple: compute partition = hash(key) mod N, where N is the number of partitions. The core goal is uniform distribution, so that each partition receives roughly the same share of keys and traffic, eliminating hotspots and enabling near-linear horizontal scalability. Unlike range partitioning, where adjacent keys stay together, hash partitioning intentionally scatters keys across the cluster to balance load.

Production systems use hash-based partitioning extensively because it handles unpredictable or skewed key distributions gracefully. Amazon DynamoDB hashes the partition key to place items across physical partitions, each of which historically provided around 10 GB of storage, approximately 1,000 write capacity units (WCU), and roughly 3,000 read capacity units (RCU). Apache Kafka routes messages to partitions by hashing the message key, preserving per-key ordering while spreading different keys across the cluster.

This approach works well when your workload consists primarily of point lookups, per-key operations, or write-heavy traffic where range scans are rare. Systems that handle millions of small, independent entities, such as user sessions, shopping carts, or event streams, benefit from the natural load balancing that hashing provides.

The trade-off is that hash-based partitioning destroys global key ordering. If you hash user_id values, user 1000 and user 1001 will likely land on completely different partitions, making range queries expensive: range scans require scatter-gather operations across all partitions, amplifying network traffic and latency. For analytics or time-series workloads where you frequently query contiguous ranges, range partitioning or time-based sharding is typically superior. Choose hash-based partitioning when you need high throughput on independent per-key operations and can tolerate expensive range queries.
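The core routine is small enough to sketch. Below is a minimal, illustrative Python example (the 8-partition count, key names, and choice of MD5 are assumptions for the sketch, not any particular system's implementation) showing how hash(key) mod N assigns keys to partitions and why adjacent keys scatter.

```python
import hashlib

NUM_PARTITIONS = 8  # assumed partition count for this sketch


def partition_for(key: str) -> int:
    """Deterministically map a key to a partition: hash(key) mod N."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


# Adjacent keys scatter across partitions, so a point lookup touches one
# partition while a range scan over user IDs must fan out to all of them.
for key in ("user_1000", "user_1001", "user_1002"):
    print(key, "-> partition", partition_for(key))
```

Running it shows consecutive user IDs landing on unrelated partitions, which is exactly why point lookups stay cheap while range scans must fan out.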
💡 Key Takeaways
Uniform load distribution: Hash functions spread diverse keys evenly across partitions, minimizing hotspots. A DynamoDB table provisioned at 50,000 WCU needs at least 50 partitions and only sustains that throughput when keys are distributed uniformly across them.
Per-key locality: All records for a given key co-locate on the same partition, enabling efficient point reads, updates, and per-key ordering, as used in Apache Kafka message routing.
Destroys global ordering: Adjacent keys like user_1000 and user_1001 scatter to different partitions, so range queries require expensive scatter-gather across all partitions with amplified network overhead.
Capacity planning from partition limits: If each partition supports 1,000 writes per second and you need 50,000 writes per second of total throughput, you must provision at least 50 partitions and keep the key distribution uniform across them.
Simple modulo has a dangerous rehash cost: Changing the partition count N in hash(key) mod N remaps most keys, forcing near-total data movement. Production systems use consistent hashing or a fixed bucket space instead (see the sketch after this list).
Real-world adoption is universal: Amazon Dynamo for shopping carts, DynamoDB for key-value storage, Kafka for message routing, Facebook memcache clusters, and Google Maglev load balancers all rely on hash-based partitioning.
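The rehash-cost point deserves a concrete illustration. The sketch below is a toy model, not the implementation any of the systems above use: node names, the virtual-node count, and the key volume are all made up. It compares how many keys move when a cluster grows from 8 to 9 nodes under plain modulo placement versus a simple consistent-hash ring.

```python
import bisect
import hashlib


def h(value: str) -> int:
    """Deterministic 64-bit hash for this sketch."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def modulo_place(key: str, n_nodes: int) -> int:
    return h(key) % n_nodes


class HashRing:
    """Minimal consistent-hash ring with virtual nodes for smoother balance."""

    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def place(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[idx][1]


keys = [f"user_{i}" for i in range(10_000)]

# Plain modulo: grow from 8 to 9 partitions.
moved_mod = sum(modulo_place(k, 8) != modulo_place(k, 9) for k in keys)

# Consistent hashing: add one node to an 8-node ring.
old_ring = HashRing([f"node{i}" for i in range(8)])
new_ring = HashRing([f"node{i}" for i in range(9)])
moved_ring = sum(old_ring.place(k) != new_ring.place(k) for k in keys)

print(f"modulo remapped     {moved_mod / len(keys):.0%} of keys")
print(f"consistent hashing  {moved_ring / len(keys):.0%} of keys")
```

With these assumptions, modulo remaps roughly 8 in 9 keys while the ring moves only about 1 in 9, which is why resharding under plain modulo is so disruptive.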
📌 Examples
Amazon DynamoDB hashes the partition key to distribute items. A table with 50,000 WCU provisioned needs at least 50 partitions, since each partition handles roughly 1,000 WCU. A single hot key can saturate one partition and cause throttling even when aggregate table capacity is available.
Apache Kafka at LinkedIn hashes message keys to route them to partitions, preserving per-key ordering. A hot key forces all of its traffic onto one partition, inflating end-to-end p99 latency and increasing consumer lag despite hundreds of partitions being available (see the sketch after these examples).
Amazon S3 previously required randomized prefixes (effectively hashing object keys) to avoid hotspots, publishing safe limits of 3,500 PUT and 5,500 GET requests per second per prefix. Before adaptive partitioning was introduced, customers spread load by adding hash prefixes to their keys.
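To make the hot-key failure mode concrete, here is a small simulation with invented traffic numbers (the key names, request counts, and 100-partition layout are assumptions, not measurements from any of the systems above). A single hot key concentrates a large share of requests on one partition even though 100 partitions are available, which mirrors the DynamoDB throttling and Kafka consumer-lag scenarios described above.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 100  # assumed partition count for this simulation


def partition_for(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big") % NUM_PARTITIONS


traffic = Counter()

# 100,000 requests spread across 10,000 ordinary keys...
for i in range(100_000):
    traffic[partition_for(f"user_{i % 10_000}")] += 1

# ...plus 50,000 requests that all hit one hot key.
for _ in range(50_000):
    traffic[partition_for("hot_celebrity_account")] += 1

hottest, load = traffic.most_common(1)[0]
print(f"hottest partition {hottest} serves {load / sum(traffic.values()):.0%} of all traffic")
```

In this toy run the hot key's partition absorbs roughly a third of all traffic while the other 99 partitions stay near the average, so aggregate capacity looks healthy even as one partition throttles.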