
Production Implementation: Capacity Planning and Rebalancing Operations

Effective capacity planning for hash-based partitioned systems starts with understanding per-partition limits and working backward to cluster size. Each storage or compute partition has physical constraints: disk throughput (typically 100 to 500 MB per second for SSDs), network bandwidth (1 to 10 Gbps), CPU for request processing, and memory for caching. For DynamoDB-style systems, each partition handles roughly 1,000 write capacity units (WCU) and 3,000 read capacity units (RCU), where one WCU is one 1 KB write per second. If your application needs 50,000 WCU of sustained throughput, you must provision at least 50 partitions and ensure your key distribution actually uses all of them uniformly. Add 20 to 50 percent headroom for growth and traffic spikes. Storage capacity also drives partition count: if partitions are capped at 10 GB each and you store 5 TB, you need at least 500 partitions even if throughput requirements are lower.

Rebalancing operations must be carefully throttled to avoid impacting foreground traffic. When adding nodes or rebalancing data, background data movement competes with live requests for disk IO, network bandwidth, and CPU cycles. Production systems implement admission control and rate limiting on rebalance traffic, typically capping it at 10 to 30 percent of available resources. Amazon DynamoDB limits partition split operations and background data movement to maintain single-digit-millisecond p99 latencies for customer requests. Operators monitor p99 and p99.9 latency continuously during rebalancing and pause or slow data movement if latency budgets are exceeded. Prioritize hot partitions first: rebalance the top 10 percent of partitions by traffic before touching cold data, as this provides immediate load relief. Track rebalancing progress with metrics such as total keys moved, MB transferred, estimated time remaining, and impact on tail latency.

Observability is critical for detecting imbalance and hotspots before they cause outages. Instrument per-partition metrics including queries per second (QPS), bytes per second, queue depth, CPU utilization, and p99 latency. Calculate the coefficient of variation across partitions (standard deviation divided by mean) for each metric; values above 0.3 indicate significant imbalance requiring investigation. Track the top N keys by request rate using reservoir sampling or a count-min sketch to identify hot keys without storing full key distributions. In Kafka deployments at LinkedIn scale, per-partition lag metrics are monitored continuously; lag spikes on individual partitions signal hot keys or consumer failures. Set alerts for partition-level throttling events, imbalanced partition counts per node (more than 10 percent variance), and tail latency regressions during topology changes. Dashboards should expose real-time heat maps of per-partition request rates and storage, enabling operators to quickly identify and remediate hotspots.
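To make the arithmetic concrete, here is a minimal capacity-planning sketch in Python. The per-partition limits and the 30 percent headroom are the illustrative figures from this section rather than fixed service limits, and the `required_partitions` helper is hypothetical.

```python
import math

def required_partitions(target_wcu: float,
                        total_storage_gb: float,
                        wcu_per_partition: float = 1_000,  # illustrative DynamoDB-style limit
                        gb_per_partition: float = 10,      # illustrative per-partition storage cap
                        headroom: float = 0.30) -> int:
    """Plan for the larger of the throughput-driven and storage-driven counts.

    Headroom is applied to the throughput requirement to absorb spikes and growth.
    """
    throughput_driven = math.ceil(target_wcu * (1 + headroom) / wcu_per_partition)
    storage_driven = math.ceil(total_storage_gb / gb_per_partition)
    return max(throughput_driven, storage_driven)

# 50,000 WCU with 30% headroom needs 65 partitions; 8 TB at 10 GB/partition needs 800.
# The storage constraint dominates, so plan for 800 partitions.
print(required_partitions(target_wcu=50_000, total_storage_gb=8_000))  # -> 800
```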
💡 Key Takeaways
Derive partition count from per-partition limits and workload: DynamoDB-style systems support roughly 1,000 WCU per partition. For 50,000 WCU of throughput, provision at least 50 partitions plus 20 to 50 percent headroom for spikes and growth.
Storage capacity drives minimum partition count independently: If partitions are capped at 10 GB and you store 5 TB total, you need at least 500 partitions regardless of throughput requirements. Plan for the larger of the throughput-driven or storage-driven partition count.
Throttle rebalancing to 10 to 30 percent of resources: Background data movement competes with foreground traffic. Cap rebalance bandwidth and IO to maintain single-digit-millisecond p99 latencies, and pause rebalancing if tail latency exceeds the budget.
Prioritize hot partitions during rebalancing: Rebalance the top 10 percent of partitions by traffic first to relieve load immediately. Cold data can migrate gradually over days without impacting user experience.
Monitor coefficient of variation for imbalance detection: Calculate standard deviation divided by mean across partitions for QPS, bytes, and CPU. Values above 0.3 indicate significant skew requiring investigation and potential key redistribution; a short detection sketch follows this list.
Track top N keys with reservoir sampling: Identify hot keys without storing full distributions. Kafka deployments monitor per-partition lag; a spike on a single partition reveals a hot key pinning throughput to that partition leader's limit.
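As a sketch of the imbalance check described above, the following computes the coefficient of variation over per-partition QPS and surfaces the hottest partitions for drill-down. The 0.3 threshold comes from this section; the function names and the synthetic traffic numbers are illustrative.

```python
import statistics

def coefficient_of_variation(samples: list[float]) -> float:
    """CoV = population standard deviation divided by mean."""
    return statistics.pstdev(samples) / statistics.fmean(samples)

def find_imbalance(per_partition_qps: dict[str, float],
                   cov_threshold: float = 0.3,
                   top_n: int = 5):
    """Return (imbalanced?, CoV, hottest partitions) for alerting and drill-down."""
    cov = coefficient_of_variation(list(per_partition_qps.values()))
    hottest = sorted(per_partition_qps.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return cov > cov_threshold, cov, hottest

# Synthetic traffic: 95 partitions near 450 QPS plus 5 hot partitions at 1,200 QPS.
qps = {f"partition-{i}": 450.0 for i in range(95)}
qps.update({f"partition-{i}": 1200.0 for i in range(95, 100)})

imbalanced, cov, hottest = find_imbalance(qps)
print(imbalanced, round(cov, 2))      # True 0.34 -> above the 0.3 threshold
print([name for name, _ in hottest])  # the five hot partitions to drill into
```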
📌 Examples
DynamoDB capacity planning: Application needs 50,000 sustained writes per second of 1 KB items. Each partition supports 1,000 WCU, so minimum 50 partitions required. Add 30 percent headroom for traffic spikes: 65 partitions provisioned. Also store 8 TB data at 10 GB per partition: need 800 partitions. Final: 800 partitions to satisfy storage constraint.
Kafka partition sizing at LinkedIn: Each partition leader handles approximately 50 MB per second sustained throughput. Topic receives 2 GB per second total. Minimum 40 partitions needed. To allow for hot keys and uneven distribution, provision 60 partitions (50 percent headroom). Monitor per partition lag and rebalance if any partition consistently lags.
Rebalancing throttling in Cassandra: Operator adds 3 new nodes to a 30-node cluster. Nodetool repair and streaming are configured to use a maximum of 100 Mbps per stream. With 3 concurrent streams, that is at most 300 Mbps out of a 10 Gbps NIC (3 percent of bandwidth). Monitor p99 latency; if it exceeds 20 ms from a 5 ms baseline, reduce the stream count to 1. A generic rate-limiting sketch follows these examples.
Imbalance detection with metrics: Cluster has 100 partitions. Mean partition QPS is 500. Standard deviation is 200. Coefficient of variation is 200 divided by 500 equals 0.4, indicating significant imbalance. Drill into top 5 partitions: they handle 1,200 QPS each. Identify keys on those partitions and apply salting to top 3 keys.
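The throttling behavior in the Cassandra example can be expressed generically as a latency-guarded rate limiter: cap background transfer bandwidth, sample foreground p99 latency, and back off when the budget is exceeded. This is a simplified sketch of that control loop, not Cassandra's streaming implementation; the class name, thresholds, and `sample_p99_ms` callback are assumptions for illustration.

```python
import time
from typing import Callable

class RebalanceThrottle:
    """Latency-guarded rate limiter for background data movement.

    Caps transfer rate at `max_mbps`, halves it whenever foreground p99
    latency exceeds the budget, and pauses entirely below a minimum rate.
    """

    def __init__(self, max_mbps: float, p99_budget_ms: float,
                 sample_p99_ms: Callable[[], float], min_mbps: float = 25.0):
        self.max_mbps = max_mbps
        self.current_mbps = max_mbps
        self.min_mbps = min_mbps
        self.p99_budget_ms = p99_budget_ms
        self.sample_p99_ms = sample_p99_ms  # e.g. reads p99 from a latency histogram

    def adjust(self) -> None:
        """Call periodically (e.g. every few seconds) while rebalancing."""
        p99 = self.sample_p99_ms()
        if p99 > self.p99_budget_ms:
            # Back off: halve the rebalance rate, or pause if already at the floor.
            self.current_mbps = 0.0 if self.current_mbps <= self.min_mbps else self.current_mbps / 2
        elif self.current_mbps < self.max_mbps:
            # Latency is healthy again: recover gradually toward the configured cap.
            self.current_mbps = min(self.max_mbps, max(self.min_mbps, self.current_mbps * 1.25))

    def send_chunk(self, chunk_bytes: int) -> None:
        """Pace a chunk transfer so sustained throughput stays under current_mbps."""
        if self.current_mbps == 0.0:
            time.sleep(1.0)  # paused; caller retries after a short wait
            return
        seconds = (chunk_bytes * 8) / (self.current_mbps * 1_000_000)
        time.sleep(seconds)  # stand-in for the actual network send plus pacing

# Example: cap rebalance traffic at 300 Mbps and back off if p99 exceeds 20 ms.
# throttle = RebalanceThrottle(max_mbps=300, p99_budget_ms=20, sample_p99_ms=read_p99_from_metrics)
```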