Partitioning & Sharding • Hotspot Detection & HandlingEasy⏱️ ~2 min
What Are Hotspots and Why Do They Matter?
Definition
Hotspots are disproportionate concentrations of load on a small subset of resources. A single partition receiving 80% of traffic while others sit idle. A celebrity's post triggering millions of fan-out writes. The critical issue: one resource saturates while aggregate system capacity remains underutilized.
✓ In Practice: Track per-partition metrics (CPU, IOPS, queue depth), not just aggregates. A system with 10% average utilization can still have hotspots if one partition is at 100%.
💡 Key Takeaways
✓Hotspots cause local saturation (CPU, IOPS, network, queue depth) even when aggregate system capacity is ample, leading to tail latency spikes from milliseconds to seconds and throttling errors
✓Amazon DynamoDB partitions provide roughly 3,000 RCU or 1,000 WCU each; a hot key on one partition throttles at those limits regardless of unused capacity elsewhere in the table
✓Amazon Kinesis shards support 1 MB/s or 1,000 records/s ingress and 2 MB/s egress; hot partition keys concentrate load on single shards while other shards remain idle
✓Detection requires per key, per partition, and per node telemetry rather than aggregate metrics, since averages hide the skew that causes hotspots
✓Effective handling combines proactive design (well distributed partition keys), online detection (streaming heavy hitter algorithms), dynamic mitigation (caching, rebalancing, splitting), and load shedding with fair isolation
📌 Interview Tips
1Twitter's celebrity problem: accounts with millions of followers cause write storms during fan out on write, requiring hybrid strategies where celebrities use fan out on read to avoid concentrating writes
2Google Cloud Bigtable with sequential timestamp keys sends most writes to the hottest tablet/region, saturating a single node at roughly 10,000 operations/s even when the cluster has spare capacity; solution is to hash or reverse timestamp components
3Amazon S3 historically had request rate limits per key prefix (roughly 3,500 PUT/POST/DELETE and 5,500 GET per second per prefix), illustrating how clustered keys create hot partitions before automatic partitioning improvements