Partitioning & Sharding • Hash-based Partitioning
What Is Hash-Based Partitioning and Why Use It?
Definition
Hash-based partitioning distributes data across nodes by applying a hash function to a partition key and mapping the result to a shard. The goal is uniform distribution so each partition gets roughly equal load.
partition = hash(key) mod N // N = number of partitions
⚠️ Common Pitfall: Simple modulo hashing (key % N) causes massive data movement when N changes. Adding one node to 100 nodes remaps nearly all keys. Use consistent hashing instead.
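A quick sketch of the pitfall above, counting how many keys land on a different partition when the node count grows from 100 to 101 under plain modulo hashing. The key format and partition count are illustrative choices, not from any particular system:

```python
import hashlib

def partition_for(key: str, n: int) -> int:
    """Map a key to one of n partitions with a stable hash, mod scheme."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

keys = [f"user_{i}" for i in range(10_000)]
before = {k: partition_for(k, 100) for k in keys}
after = {k: partition_for(k, 101) for k in keys}  # one node added

moved = sum(1 for k in keys if before[k] != after[k])
# Nearly all keys remap: the expected surviving fraction is only 1/101.
print(f"{moved / len(keys):.0%} of keys remapped")
```

Note that Python's built-in `hash()` is randomized per process, so a stable hash like MD5 (or any fixed hash) is needed for partitioning.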
💡 Key Takeaways
✓ Uniform load distribution: Hash functions spread keys evenly, minimizing hotspots when keys are diverse. Amazon DynamoDB sustains 50,000 WCU by spreading writes across at least 50 partitions with uniform key distribution.
✓ Per-key locality: All records for a given key colocate on the same partition, enabling efficient point reads, updates, and per-key ordering, as used in Apache Kafka message routing.
✓ Destroys global ordering: Adjacent keys like user_1000 and user_1001 scatter to different partitions, so range queries require expensive scatter-gather across all partitions with amplified network overhead.
✓ Capacity planning from partition limits: If each partition supports 1,000 writes per second and you need 50,000 writes per second of total throughput, you must provision at least 50 partitions with uniformly distributed keys.
✓ Simple modulo has a dangerous rehash cost: Changing the partition count N in hash(key) mod N remaps most keys, forcing near-total data movement. Production systems use consistent hashing or a fixed bucket space instead.
✓ Real-world adoption is universal: Amazon Dynamo for shopping carts, DynamoDB for key-value storage, Kafka for message routing, Facebook memcache clusters, and Google Maglev load balancers all rely on hash-based partitioning.
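The consistent-hashing alternative mentioned above can be sketched as a minimal ring with virtual nodes. This is an illustrative toy, not any production library's API; the node names, vnode count, and key format are all assumptions:

```python
import bisect
import hashlib

def _h(s: str) -> int:
    """Stable 64-bit hash for ring positions and keys."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Minimal consistent-hash ring: each node owns many virtual points."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted((_h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash (wrapping around).
        i = bisect.bisect(self._points, _h(key)) % len(self._ring)
        return self._ring[i][1]

keys = [f"user_{i}" for i in range(10_000)]
ring_10 = ConsistentHashRing([f"node{i}" for i in range(10)])
ring_11 = ConsistentHashRing([f"node{i}" for i in range(11)])  # one node added

moved = sum(1 for k in keys if ring_10.node_for(k) != ring_11.node_for(k))
# Only roughly 1/11 of keys move, versus ~99% under plain modulo.
```

Adding a node only steals ring segments for the new node's virtual points, so every relocated key moves to the new node and everything else stays put.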
📌 Interview Tips
1. Amazon DynamoDB hashes the partition key to distribute items. A table provisioned with 50,000 WCU needs at least 50 partitions, since each partition handles roughly 1,000 WCU. A single hot key can saturate one partition and cause throttling even when aggregate table capacity remains available.
2. Apache Kafka at LinkedIn hashes message keys to route them to partitions, preserving per-key ordering. A hot key forces all of its traffic onto one partition, inflating end-to-end p99 latency and increasing consumer lag despite hundreds of partitions being available.
3. Amazon S3 previously required random prefixes (hashed object keys) to avoid hotspots, publishing safe limits of 3,500 PUT and 5,500 GET requests per second per prefix. Customers spread load by adding hash prefixes to keys before adaptive partitioning was introduced.
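The capacity-planning arithmetic from the DynamoDB example reduces to a one-line calculation. The function name and the 1,000-writes-per-partition default are illustrative, echoing the numbers above:

```python
import math

def min_partitions(target_wps: int, per_partition_wps: int = 1_000) -> int:
    """Minimum partitions to absorb target write throughput, assuming
    keys are uniform so no single partition exceeds per_partition_wps."""
    return math.ceil(target_wps / per_partition_wps)

print(min_partitions(50_000))  # 50, matching the DynamoDB example
```

Remember the caveat from the tips above: this math only holds with uniform keys; one hot key throttles at the single-partition limit regardless of how many partitions you provision.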