
What is Hash-Based Partitioning and Why Use It?

Hash-based partitioning distributes data records across multiple nodes by applying a deterministic hash function to a partition key (such as user_id or account_id) and mapping the result to a specific shard or physical server. The fundamental pattern is simple: compute partition = hash(key) mod N, where N is the number of partitions. The core goal is uniform distribution, so that each partition receives roughly the same share of keys and traffic, eliminating hotspots and enabling near-linear horizontal scalability. Unlike range partitioning, where adjacent keys stay together, hash partitioning intentionally scatters keys across the cluster to balance load.

Production systems use hash-based partitioning extensively because it handles unpredictable or skewed key distributions gracefully. Amazon DynamoDB hashes the partition key to place items across physical partitions, each of which historically provided around 10 GB of storage, approximately 1,000 write capacity units (WCU), and roughly 3,000 read capacity units (RCU). Apache Kafka routes messages to partitions by hashing the message key, preserving per-key ordering while spreading different keys across the cluster.

This approach works well when your workload consists primarily of point lookups, per-key operations, or write-heavy traffic where range scans are rare. Systems that handle millions of small, independent entities, such as user sessions, shopping carts, or event streams, benefit from the natural load balancing that hashing provides.

The trade-off is that hash-based partitioning destroys global key ordering. If you hash user_id values, user 1000 and user 1001 will likely land on completely different partitions, making range queries expensive: range scans require scatter-gather operations across all partitions, amplifying network traffic and latency. For analytics or time-series workloads where you frequently query contiguous ranges, range partitioning or time-based sharding is typically superior. Choose hash-based partitioning when you need high throughput on independent per-key operations and can tolerate expensive range queries.
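The core routine is small enough to sketch. Below is a minimal, illustrative Python example (the 8-partition count, key names, and choice of MD5 are assumptions for the sketch, not any particular system's implementation) showing how hash(key) mod N assigns keys to partitions and why adjacent keys scatter.

```python
import hashlib

NUM_PARTITIONS = 8  # assumed partition count for this sketch


def partition_for(key: str) -> int:
    """Deterministically map a key to a partition: hash(key) mod N."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


# Adjacent keys scatter across partitions, so a point lookup touches one
# partition while a range scan over user IDs must fan out to all of them.
for key in ("user_1000", "user_1001", "user_1002"):
    print(key, "-> partition", partition_for(key))
```

Running it shows consecutive user IDs landing on unrelated partitions, which is exactly why point lookups stay cheap while range scans must fan out.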
💡 Key Takeaways
Uniform load distribution: Hash functions spread diverse keys evenly across partitions, minimizing hotspots. A DynamoDB table provisioned at 50,000 WCU needs at least 50 partitions and only sustains that throughput when keys are distributed uniformly across them.
Per-key locality: All records for a given key co-locate on the same partition, enabling efficient point reads, updates, and per-key ordering, as used in Apache Kafka message routing.
Destroys global ordering: Adjacent keys like user_1000 and user_1001 scatter to different partitions, so range queries require expensive scatter-gather across all partitions with amplified network overhead.
Capacity planning from partition limits: If each partition supports 1,000 writes per second and you need 50,000 writes per second of total throughput, you must provision at least 50 partitions and keep the key distribution uniform across them.
Simple modulo has a dangerous rehash cost: Changing the partition count N in hash(key) mod N remaps most keys, forcing near-total data movement. Production systems use consistent hashing or a fixed bucket space instead (see the sketch after this list).
Real-world adoption is universal: Amazon Dynamo for shopping carts, DynamoDB for key-value storage, Kafka for message routing, Facebook memcache clusters, and Google Maglev load balancers all rely on hash-based partitioning.
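The rehash-cost point deserves a concrete illustration. The sketch below is a toy model, not the implementation any of the systems above use: node names, the virtual-node count, and the key volume are all made up. It compares how many keys move when a cluster grows from 8 to 9 nodes under plain modulo placement versus a simple consistent-hash ring.

```python
import bisect
import hashlib


def h(value: str) -> int:
    """Deterministic 64-bit hash for this sketch."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def modulo_place(key: str, n_nodes: int) -> int:
    return h(key) % n_nodes


class HashRing:
    """Minimal consistent-hash ring with virtual nodes for smoother balance."""

    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def place(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        idx = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[idx][1]


keys = [f"user_{i}" for i in range(10_000)]

# Plain modulo: grow from 8 to 9 partitions.
moved_mod = sum(modulo_place(k, 8) != modulo_place(k, 9) for k in keys)

# Consistent hashing: add one node to an 8-node ring.
old_ring = HashRing([f"node{i}" for i in range(8)])
new_ring = HashRing([f"node{i}" for i in range(9)])
moved_ring = sum(old_ring.place(k) != new_ring.place(k) for k in keys)

print(f"modulo remapped     {moved_mod / len(keys):.0%} of keys")
print(f"consistent hashing  {moved_ring / len(keys):.0%} of keys")
```

With these assumptions, modulo remaps roughly 8 in 9 keys while the ring moves only about 1 in 9, which is why resharding under plain modulo is so disruptive.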
📌 Examples
Amazon DynamoDB hashes the partition key to distribute items. A table with 50,000 WCU provisioned needs at least 50 partitions, since each partition handles roughly 1,000 WCU. A single hot key can saturate one partition and cause throttling even when aggregate table capacity is available.
Apache Kafka at LinkedIn hashes message keys to route them to partitions, preserving per-key ordering. A hot key forces all of its traffic onto one partition, inflating end-to-end p99 latency and increasing consumer lag despite hundreds of partitions being available (see the sketch after these examples).
Amazon S3 previously required randomized prefixes (effectively hashing object keys) to avoid hotspots, publishing safe limits of 3,500 PUT and 5,500 GET requests per second per prefix. Before adaptive partitioning was introduced, customers spread load by adding hash prefixes to their keys.
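To make the hot-key failure mode concrete, here is a small simulation with invented traffic numbers (the key names, request counts, and 100-partition layout are assumptions, not measurements from any of the systems above). A single hot key concentrates a large share of requests on one partition even though 100 partitions are available, which mirrors the DynamoDB throttling and Kafka consumer-lag scenarios described above.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 100  # assumed partition count for this simulation


def partition_for(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big") % NUM_PARTITIONS


traffic = Counter()

# 100,000 requests spread across 10,000 ordinary keys...
for i in range(100_000):
    traffic[partition_for(f"user_{i % 10_000}")] += 1

# ...plus 50,000 requests that all hit one hot key.
for _ in range(50_000):
    traffic[partition_for("hot_celebrity_account")] += 1

hottest, load = traffic.most_common(1)[0]
print(f"hottest partition {hottest} serves {load / sum(traffic.values()):.0%} of all traffic")
```

In this toy run the hot key's partition absorbs roughly a third of all traffic while the other 99 partitions stay near the average, so aggregate capacity looks healthy even as one partition throttles.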