
Consistent Hashing, Replication, and Rebalancing in Distributed Caches

Distributed caches partition keys across multiple nodes to scale memory and throughput horizontally. Consistent hashing is the standard technique: each cache node is assigned multiple positions (virtual nodes) on a logical hash ring, and each key is hashed to a position on the ring and assigned to the first node found clockwise. Using virtual nodes (typically 100 to 500 per physical node) evens out the distribution and minimizes data movement when nodes join or leave. When a node is added, it takes over roughly 1/N of the keys from existing nodes, where N is the total node count; when a node fails, its keys are redistributed to the next nodes on the ring. This limits rebalancing impact compared to naive modulo sharding, where adding one node to a 10-node cluster would remap roughly 90 percent of keys, causing massive cache misses and backend load spikes.

Replication within the cache provides availability and read fan-out. Each partition has a primary node that handles writes and one or two replica nodes that receive updates asynchronously. Reads can be served from any replica, allowing clients to prefer same-availability-zone replicas to minimize latency and cross-zone network costs. Netflix EVCache replicates data across three availability zones, so a zone failure loses only one replica and reads automatically shift to the remaining two zones. Write replication is typically asynchronous to keep write latency low (sub-millisecond), accepting brief replication lag of milliseconds. For critical data requiring stronger durability, some systems replicate synchronously to at least one replica before acknowledging writes, trading slightly higher write latency (low single-digit milliseconds) for reduced data-loss risk.

Rebalancing when nodes join or leave is a major operational challenge. Adding a new node causes widespread cache misses as keys rehash to the new node, spiking backend load until the cache warms up. Mitigations include gradual traffic shifting, where the new node is marked as not ready and a background process pre-fills it with data from existing nodes before it starts serving traffic; sticky client maps that delay hash-ring membership updates for seconds to minutes, allowing time for warm-up; and using many virtual nodes (300+ per physical node) to spread the rebalancing load across all nodes rather than concentrating it on ring neighbors. Meta's mcrouter and Netflix's EVCache both employ these techniques to add cache capacity without causing user-facing latency spikes. At scale, a poorly managed rebalance that suddenly shifts 10 percent of keys to cold nodes can drop the hit rate by 10 points and increase database QPS by 100,000 or more, so automated pre-warming and gradual rollout are essential production practices.
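To make the ring mechanics concrete, here is a minimal Python sketch of a consistent hash ring with virtual nodes; the `ConsistentHashRing` class, the node names, and the use of MD5 are illustrative assumptions, not any particular cache's implementation. The final loop measures how many keys remap when an eleventh node joins a ten-node ring.

```python
import hashlib
from bisect import bisect

def _hash(value: str) -> int:
    # Any uniformly distributed hash works; MD5 is used here only because it
    # is in the standard library and stable across processes.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, vnodes_per_node: int = 200):
        self.vnodes_per_node = vnodes_per_node
        self.positions = []   # sorted virtual-node positions on the ring
        self.owners = {}      # ring position -> physical node name

    def add_node(self, node: str) -> None:
        # Place each physical node at many ring positions (virtual nodes).
        for i in range(self.vnodes_per_node):
            pos = _hash(f"{node}#vnode{i}")
            self.owners[pos] = node
            self.positions.append(pos)
        self.positions.sort()

    def lookup(self, key: str) -> str:
        # A key belongs to the first virtual node found clockwise from its hash.
        idx = bisect(self.positions, _hash(key)) % len(self.positions)
        return self.owners[self.positions[idx]]

ring = ConsistentHashRing(vnodes_per_node=200)
for n in range(10):
    ring.add_node(f"cache-{n}")

keys = [f"user:{i}" for i in range(100_000)]
before = {k: ring.lookup(k) for k in keys}

ring.add_node("cache-10")                         # scale out by one node
moved = sum(1 for k in keys if ring.lookup(k) != before[k])
print(f"keys remapped: {moved / len(keys):.1%}")  # roughly 1/11, about 9%
```

Running the same experiment with naive modulo sharding (node = hash(key) % 11 instead of % 10) remaps about 90 percent of the keys, which is exactly the miss storm the ring avoids.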
💡 Key Takeaways
Consistent hashing with virtual nodes (100 to 500 per physical node) minimizes data movement on rebalancing. Adding one node to a 10-node cluster moves only about 10 percent of keys instead of roughly 90 percent with naive modulo sharding, limiting cache-miss storms.
Replication provides availability and read scalability. Each partition has one primary and one to two replicas with asynchronous replication lag of milliseconds. Reads can fan out to any replica, enabling same-availability-zone preference to keep latency under 1 ms and avoid cross-zone transfer costs.
Netflix EVCache replicates across three availability zones, so zone failures lose only one replica and traffic automatically shifts to the remaining zones. Clients prefer same-zone reads to maintain sub-millisecond p50 and 1 to 2 ms p99 latency.
Rebalancing storms occur when adding or removing nodes causes widespread cache misses and backend load spikes. A sudden shift of 10 percent of keys can drop the hit rate by 10 points and increase database QPS by 100,000 in large systems.
Warm-up strategies mitigate rebalancing impact: mark new nodes as not ready, pre-fill them with data from existing nodes using background transfers, then gradually shift traffic over minutes. Sticky client maps delay hash-ring updates to allow warm-up time (see the sketch after this list).
Using 300+ virtual nodes per physical node spreads rebalancing load across all existing nodes instead of concentrating it on ring neighbors, smoothing the backend query spike and reducing per-node impact during scaling operations.
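A warm-up flow like the one described above might be sketched as follows. The `fetch` and `put` helpers, the `hot_keys` sample, and the delay value are assumptions for illustration (real systems such as mcrouter or EVCache have their own warming tooling); the ring objects reuse the `ConsistentHashRing` sketch shown earlier.

```python
import time

def warm_up_and_admit(new_node, current_ring, next_ring, hot_keys,
                      fetch, put, sticky_delay_s=300):
    """Hypothetical warm-up flow before a new cache node takes traffic.

    current_ring: the ring clients are using today (new_node absent).
    next_ring:    the ring with new_node added, not yet published to clients.
    hot_keys:     a sample of frequently read keys (e.g. from access logs).
    fetch/put:    assumed helpers; fetch(node, key) reads from an existing
                  node or the origin, put(node, key, value) fills new_node.
    """
    # 1. Pre-fill: copy every hot key that new_node will own after the flip,
    #    while the current owners are still warm and serving traffic.
    for key in hot_keys:
        if next_ring.lookup(key) == new_node:
            value = fetch(current_ring.lookup(key), key)
            if value is not None:
                put(new_node, key, value)

    # 2. Sticky membership: hold the old client map for a while so the
    #    transfer (and any in-flight writes) settle before clients move.
    time.sleep(sticky_delay_s)

    # 3. Publish the new ring; misses on new_node should now be rare.
    return next_ring
```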
📌 Examples
Meta's mcrouter uses consistent hashing with virtual nodes for memcache. When adding 10 new cache servers to a 100-server pool, each new server takes approximately 1 percent of the keys from existing servers, spread evenly via virtual nodes. Background warming pre-fills keys before the new servers enter rotation, preventing a roughly 10-point drop in hit rate.
Netflix EVCache three-zone replication: a cache cluster in us-east-1 has nodes in zones 1a, 1b, and 1c. Each key is replicated to one node in each zone. Clients in zone 1a prefer reading from their local replica (under 1 ms latency) but can fail over to the 1b or 1c replicas (2 to 3 ms cross-zone latency) during node failures.
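A zone-aware read path like the one in this example could look like the sketch below. The replica map, node names, and the `cache_get` helper are hypothetical; EVCache's actual client API differs.

```python
# Hypothetical replica map: one replica of the key's partition in each zone.
REPLICAS_BY_ZONE = {
    "us-east-1a": "evcache-1a-node-7",
    "us-east-1b": "evcache-1b-node-3",
    "us-east-1c": "evcache-1c-node-9",
}

def zone_aware_get(key, client_zone, cache_get):
    """Prefer the same-zone replica, then fail over to the other zones.

    cache_get(node, key) is an assumed helper that returns None on a miss
    or when the node is unreachable.
    """
    local = REPLICAS_BY_ZONE.get(client_zone)
    remote = [n for z, n in REPLICAS_BY_ZONE.items() if z != client_zone]
    for node in ([local] if local else []) + remote:
        value = cache_get(node, key)     # <1 ms in-zone, a few ms cross-zone
        if value is not None:
            return value
    return None   # missed in every zone: caller falls back to the origin store
```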
A distributed cache using 200 virtual nodes per physical server has 10 servers (2,000 total positions on the ring). Adding one server creates 200 new positions, each taking approximately 0.05 percent of the keyspace from a nearby node, about 9 percent in total. Over 10 minutes, background warming fetches these keys from the origin before the new server starts serving production traffic.
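The fractions in this example can be checked with a few lines of arithmetic (illustrative numbers only):

```python
servers, vnodes = 10, 200
positions_after = (servers + 1) * vnodes        # 2,200 ring positions after the join

new_server_share = vnodes / positions_after     # ~0.091  -> about 9% of all keys
per_position_share = 1 / positions_after        # ~0.00045 -> about 0.05% each

print(f"new server owns ~{new_server_share:.1%} of keys")        # 9.1%
print(f"each new virtual node takes ~{per_position_share:.2%}")  # 0.05%
```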