Hot Key Problem and Shard Rebalancing Strategies in Hash-Based Systems
Hash-based partitioning has a fundamental weakness: identical keys always route to the same partition, so if a small number of keys dominate traffic (following Zipfian or power-law distributions), those partitions become bottlenecks regardless of overall cluster capacity. A Zipfian distribution means that a small percentage of items accounts for a disproportionately large share of access: typically the top 1% of keys generate 50 to 80% of all requests in multi-tenant systems. In Amazon DynamoDB, a single partition provides roughly 1,000 WCU for 1 KB items. If one customer identifier generates 5,000 writes per second, that partition throttles even if the table is provisioned for 100,000 WCU across 100 partitions. The other 99 partitions sit idle while the hot partition rejects requests with ProvisionedThroughputExceededException. Similarly, in Apache Kafka, a hot message key collapses throughput to a single partition leader's disk and network limits, typically 50 to 100 MB per second, causing consumer lag to spike and end-to-end p99 latency to degrade from milliseconds to seconds.
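A minimal Python sketch of why this happens, assuming a hypothetical 8-partition cluster and a synthetic workload in which one key dominates; the md5-based routing here merely stands in for whatever hash function the datastore actually uses:

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical cluster size

def partition_for(key: str) -> int:
    """Stable hash routing: the same key always lands on the same partition."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# Simulated skewed workload: one hot key dominates, the rest are roughly uniform.
requests = ["user_12345"] * 8000 + [f"user_{i}" for i in range(2000)]

load = Counter(partition_for(k) for k in requests)
print(load.most_common())
# The partition owning user_12345 absorbs ~80% of all requests; raising
# NUM_PARTITIONS does not help, because hashing never splits a single key.
```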
Key salting spreads a single logical key across multiple physical keys by appending a suffix. For example, user_12345 becomes user_12345_0, user_12345_1, and so on up to user_12345_(K-1), where K is chosen based on observed load (often 2 to 16 sub-keys). Each write is routed to one of the K sub-keys (chosen randomly or round-robin), and reads must aggregate results from all K. This multiplies read cost by K but successfully distributes write load. The real challenge comes when adding new shards to handle growth or rebalancing existing shards to address hotspots. Shard rebalancing involves three phases. First, provision new shard nodes and update the routing metadata (directory service or consistent hashing ring). Second, begin dual-writing to both old and new shards for the affected key ranges while background jobs copy existing data from old shards to new shards. Third, once data migration completes and the new shards are fully synchronized, switch reads to the new shards and decommission the old shards. During rebalancing, the system operates in a mixed mode where some requests hit old shards and some hit new shards based on migration progress, which requires careful coordination to avoid data inconsistency.
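The salted write path and the K-way read fan-out can be sketched in a few lines of Python; this is illustrative only, with a plain dict standing in for the partitioned store and a random pick standing in for whatever sub-key selection policy the application uses:

```python
import random
from collections import defaultdict

K = 4  # number of sub-keys for the hot logical key (tuned from observed load)
store = defaultdict(list)  # stand-in for the partitioned key-value store

def salted_key(logical_key: str) -> str:
    """Pick one of K physical sub-keys so writes spread across partitions."""
    return f"{logical_key}_{random.randrange(K)}"

def write(logical_key: str, value) -> None:
    store[salted_key(logical_key)].append(value)

def read(logical_key: str) -> list:
    """Reads pay a K-way fan-out: fetch every sub-key and merge the results."""
    results = []
    for i in range(K):
        results.extend(store.get(f"{logical_key}_{i}", []))
    return results

for n in range(1000):
    write("user_12345", n)       # load spreads over user_12345_0 .. user_12345_3
print(len(read("user_12345")))   # 1000 -- all values recovered via the fan-out read
```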
To minimize downtime during shard rebalancing, systems use techniques like shadow traffic and gradual cutover. Shadow traffic means sending read requests to both old and new shards, comparing results for correctness, but serving responses only from the old shards until confidence is high. Once validation passes (typically a 99.9% match rate over several hours), traffic shifts gradually: route 1% of reads to the new shards, then 10%, then 50%, then 100%, monitoring error rates at each step. If errors spike, immediately roll back to the previous percentage. Write traffic requires more care: use dual writes during migration, where every write goes to both old and new shards, with the old shard as the source of truth until cutover. This ensures no data loss even if the new shards fail during migration. Google Spanner and Amazon DynamoDB use this pattern extensively. For example, when DynamoDB detects a hot partition consuming over 1,000 WCU, it automatically splits the partition into two, copying half the key range to a new partition over 10 to 30 minutes while dual-writing to both partitions. Client SDKs automatically refresh routing metadata every 5 seconds, so they discover new partitions quickly. The entire process is transparent to applications, with no downtime and minimal latency impact (typically a 5 to 10 millisecond p99 increase during active migration).
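A condensed sketch of the dual-write plus gradual-cutover router described above; everything here (DictShard, the percentage knob, the match counters) is hypothetical scaffolding, and a production system would drive read_cutover_pct from a config service and alert on the match rate rather than just counting:

```python
import random

class DictShard:
    """In-memory stand-in for a real shard."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

class MigratingRouter:
    """Dual writes with the old shard as source of truth, plus
    percentage-based read cutover backed by shadow-read comparison."""

    def __init__(self, old_shard, new_shard):
        self.old = old_shard
        self.new = new_shard
        self.read_cutover_pct = 0      # raise 1 -> 10 -> 50 -> 100 as validation passes
        self.matches = 0
        self.mismatches = 0

    def write(self, key, value):
        self.old.put(key, value)       # authoritative copy
        try:
            self.new.put(key, value)   # shadow copy; a failure here is repaired by backfill
        except Exception:
            pass                       # log and alert in a real system

    def read(self, key):
        if random.uniform(0, 100) < self.read_cutover_pct:
            return self.new.get(key)   # cut-over portion of traffic
        old_val = self.old.get(key)    # shadow read: serve old, compare against new
        if self.new.get(key) == old_val:
            self.matches += 1
        else:
            self.mismatches += 1
        return old_val

    def match_rate(self):
        total = self.matches + self.mismatches
        return 1.0 if total == 0 else self.matches / total

router = MigratingRouter(DictShard(), DictShard())
for i in range(100):
    router.write(f"k{i}", i)
for i in range(100):
    router.read(f"k{i}")
print(router.match_rate())             # 1.0 -> safe to raise read_cutover_pct
```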
💡 Key Takeaways
• Zipfian distribution drives hot key problems: The top 1% of keys generate 50 to 80% of traffic in real systems. This skewed access pattern means uniform hash partitioning fails because a few partitions handle most of the load while others sit idle.
• Shard rebalancing has three phases to maintain availability: First, provision new shards and update the directory metadata. Second, dual-write to old and new shards while background jobs copy existing data. Third, cut reads over to the new shards after validation and decommission the old shards. A typical migration takes 10 to 30 minutes.
• Dual writes during migration prevent data loss: Every write goes to both the old shard (source of truth) and the new shard (shadow copy) until cutover completes. If the new shard fails during migration, the old shard still holds all the data, so nothing is lost.
• Gradual traffic shifting minimizes risk: Route 1% of reads to the new shards, monitor error rates, then increase to 10%, 50%, and 100% over hours or days. Immediate rollback capability at each step ensures safety. The shadow traffic technique validates new shards by comparing responses before cutover.
• Client routing metadata refresh is critical: Clients must poll the directory service or receive push notifications every 1 to 5 seconds to discover new shards quickly. Stale routing metadata causes requests to hit the wrong shards, triggering redirects or errors.
• A single hot key can saturate a partition despite aggregate capacity: A DynamoDB partition is limited to roughly 1,000 WCU. One key generating 5,000 writes per second throttles its partition even if the table has 100,000 WCU provisioned across 100 partitions.
📌 Examples
DynamoDB automatic partition split: A table has 50 partitions, each handling 1,000 WCU. A celebrity user generates 3,000 writes per second, throttling their partition. DynamoDB detects sustained throttling over 5 minutes, provisions a new partition, splits the key range user_0 to user_5000 into two ranges (user_0 to user_2500 and user_2501 to user_5000), copies data to the new partition over 20 minutes with dual writes, updates the directory atomically, then routes new requests to the appropriate partitions. The celebrity user's traffic now splits across two partitions at 1,500 writes per second each, eliminating the throttling.
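A simplified sketch of the directory update behind such a split; the partition ids, zero-padded key format, and single-boundary split are illustrative assumptions (real DynamoDB splits operate on ranges of the hashed key space inside its request routers, not on raw key strings in client code):

```python
import bisect

class RangeDirectory:
    """Illustrative routing directory: sorted lower-bound keys -> partition ids.
    Keys are zero-padded (user_0000) so string order matches numeric order."""

    def __init__(self):
        self.bounds = ["user_0000"]   # one partition initially owns the whole range
        self.parts = ["p-old"]

    def route(self, key: str) -> str:
        return self.parts[bisect.bisect_right(self.bounds, key) - 1]

    def split(self, at_key: str, new_partition: str) -> None:
        """Single metadata update: keys >= at_key now route to the new partition.
        Applied only after the background copy and dual writes have caught up."""
        idx = bisect.bisect_right(self.bounds, at_key)
        self.bounds.insert(idx, at_key)
        self.parts.insert(idx, new_partition)

d = RangeDirectory()
print(d.route("user_3000"))                        # p-old before the split
d.split("user_2501", "p-new")
print(d.route("user_1000"), d.route("user_3000"))  # p-old p-new after the split
```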
Kafka rebalancing during hot key: An e-commerce platform routes order events by product_id. A flash sale on a popular product generates 10,000 messages per second to partition 3. The operations team sees consumer lag spike from 0 to 30 seconds. The solution is an application change: append a warehouse_id suffix to product_id (product_12345 becomes product_12345_warehouse_A). Deploy the change behind a feature flag at 10% of traffic and validate that messages spread across partitions 3, 7, and 12 based on the new composite key. Increase to 50%, then 100% over 2 hours. Consumer lag drops back to under 1 second as load distributes.
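The key-construction side of that change can be sketched as follows; the partition count, rollout percentage, and the md5 stand-in for the producer's key hashing are assumptions for illustration (Kafka's default partitioner uses murmur2, but any stable hash shows the spreading effect):

```python
import hashlib
import random

NUM_PARTITIONS = 16      # hypothetical topic partition count
SALT_ROLLOUT_PCT = 10    # feature flag: 10% -> 50% -> 100% over the rollout window

def partition_of(key: str) -> int:
    # Stand-in for the producer's key hashing.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def message_key(product_id: str, warehouse_id: str) -> str:
    """Composite key behind a feature flag: flagged traffic gets the salted key
    so a hot product spreads across partitions; the rest keeps the original key."""
    if random.uniform(0, 100) < SALT_ROLLOUT_PCT:
        return f"{product_id}_{warehouse_id}"   # e.g. product_12345_warehouse_A
    return product_id

warehouses = ["warehouse_A", "warehouse_B", "warehouse_C"]
keys = {message_key("product_12345", random.choice(warehouses)) for _ in range(1000)}
print({k: partition_of(k) for k in keys})   # hot product now maps to several partitions
```

One tradeoff worth noting: once warehouse_id is part of the key, events for the same product are no longer totally ordered across warehouses, since Kafka only guarantees ordering per partition key.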
Key salting with gradual rollout: A social media app partitions by user_id. Influencer user_99999 generates 4,000 writes per second. Engineering implements salting: map user_99999 to 4 sub-keys (user_99999_0 through user_99999_3) in an application metadata store. Deploy the write path change: split writes across the 4 sub-keys, each receiving 1,000 writes per second. Deploy the read path change: aggregate and merge results from the 4 sub-keys. Use a feature flag to enable salting for user_99999 only, monitor for 24 hours, then enable it for the top 100 high-traffic users. Read latency increases from 5 ms to 12 ms (4x fan-out overhead), but write throttling is eliminated.
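A sketch of the per-user gating described here, with a small in-application map standing in for the metadata store that records which users are salted and into how many sub-keys:

```python
import random

SALTED_USERS = {"user_99999": 4}   # metadata-store stand-in; later: top 100 hot users

def write_key(user_id: str) -> str:
    k = SALTED_USERS.get(user_id)
    if k is None:
        return user_id                         # normal users keep their original key
    return f"{user_id}_{random.randrange(k)}"  # hot users spread writes over k sub-keys

def read_keys(user_id: str) -> list:
    k = SALTED_USERS.get(user_id)
    if k is None:
        return [user_id]
    return [f"{user_id}_{i}" for i in range(k)]  # 4-way fan-out explains the ~4x read cost

print(write_key("user_99999"), read_keys("user_99999"))
print(write_key("user_12345"), read_keys("user_12345"))
```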
Directory-based split with shadow traffic: A Google Spanner tablet (shard) for key range A to M exceeds a 1,000 QPS threshold. The split coordinator provisions a new tablet for range G to M and updates the directory with a split marker indicating migration in progress. All writes now go to both the old tablet (A to M) and the new tablet (G to M), with the old tablet as authoritative. A background splitter copies existing data in the G to M range over 15 minutes. Once the copy completes, shadow reads start: 10% of reads for keys G to M go to the new tablet and results are compared with the old tablet, reaching a 99.95% match rate after 1 hour. Increase shadow reads to 100% and validate for 2 hours. Perform the cutover: atomically update the directory so keys G to M route only to the new tablet. The old tablet now handles only A to F. Monitor for 24 hours, then garbage-collect the G to M data from the old tablet.
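A compact sketch of the shadow-read phase of that split, with plain dicts standing in for the old and new tablets and uppercase-initial keys so the G to M range check stays a simple string comparison; sampling rates and match-rate thresholds would be driven by the split coordinator in a real system:

```python
import random

class ShadowReadValidator:
    """Serve reads from the old tablet while comparing a sampled fraction of
    G-M reads against the new tablet; the match rate gates the final cutover."""

    def __init__(self, old_tablet, new_tablet, sample_pct=10, split_start="G"):
        self.old = old_tablet           # authoritative until cutover
        self.new = new_tablet
        self.sample_pct = sample_pct    # 10% -> 100% as confidence grows
        self.split_start = split_start  # keys >= "G" are migrating to the new tablet
        self.matches = 0
        self.total = 0

    def get(self, key):
        value = self.old.get(key)
        if key >= self.split_start and random.uniform(0, 100) < self.sample_pct:
            self.total += 1
            if self.new.get(key) == value:
                self.matches += 1
        return value

    def match_rate(self):
        return 1.0 if self.total == 0 else self.matches / self.total

old_tablet = {"Apple": 1, "Grape": 2, "Mango": 3}   # key range A to M
new_tablet = {"Grape": 2, "Mango": 3}               # copied G to M range
v = ShadowReadValidator(old_tablet, new_tablet, sample_pct=100)
for k in ("Apple", "Grape", "Mango"):
    v.get(k)
print(v.match_rate())   # 1.0 -> raise sampling, validate, then cut G to M over
```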