Consistent Hashing, Replication, and Rebalancing in Distributed Caches
Consistent Hashing
Each cache node is assigned multiple positions (virtual nodes, typically 100-500 per physical node) on a logical hash ring. Each key hashes to a position on the ring and routes to the first node encountered clockwise. Virtual nodes even out the key distribution and minimize data movement during scaling: growing a cluster from N to N+1 nodes moves only about 1/(N+1) of the keys, drawn evenly from all existing nodes. Contrast this with naive modulo sharding (node = hash(key) % N), where adding one node to a 10-node cluster remaps roughly 90% of keys, causing massive cache misses; consistent hashing moves only ~9%.
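A minimal sketch of a hash ring with virtual nodes, using MD5 purely as an illustrative hash (class and node names are hypothetical). It measures how many keys move when an 11th node joins a 10-node cluster:

```python
import bisect
import hashlib

def h(s):
    # map a string to a 64-bit position on the ring
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**64)

class HashRing:
    def __init__(self, nodes, vnodes=150):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (position, node)
        for n in nodes:
            self.add(n)

    def add(self, node):
        # place `vnodes` virtual positions for this physical node
        for i in range(self.vnodes):
            bisect.insort(self.ring, (h(f"{node}#{i}"), node))

    def lookup(self, key):
        # first virtual node clockwise from the key's position
        i = bisect.bisect(self.ring, (h(key),)) % len(self.ring)
        return self.ring[i][1]

keys = [f"key:{i}" for i in range(10000)]
ring = HashRing([f"node{i}" for i in range(10)])
before = {k: ring.lookup(k) for k in keys}
ring.add("node10")
moved = sum(1 for k in keys if ring.lookup(k) != before[k])
# moved/len(keys) lands near 1/11 (~9%), vs ~91% under modulo sharding
print(moved / len(keys))
```

Because virtual positions are scattered around the ring, the ~9% of keys that move are taken a little from every existing node rather than entirely from one ring neighbor.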
Replication
Each partition has a primary node that handles writes and one or two replica nodes that receive asynchronous updates. Reads can be served from any replica, so clients can prefer a replica in their own availability zone for sub-millisecond latency and lower cross-zone transfer costs. With replicas spread across three availability zones, a zone failure loses only one replica, and traffic automatically shifts to the remaining zones. Asynchronous replication keeps write latency low (sub-millisecond) at the cost of milliseconds of replication lag, so replicas may briefly serve stale data.
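The routing rule above can be sketched as a small lookup, assuming a hypothetical cluster map of partitions to (node, zone, role) entries; all names here are invented for illustration:

```python
import random

# Hypothetical cluster map: partition -> list of (node, zone, role)
REPLICAS = {
    "partition-0": [
        ("cache-a1", "us-east-1a", "primary"),
        ("cache-b1", "us-east-1b", "replica"),
        ("cache-c1", "us-east-1c", "replica"),
    ],
}

def route_write(partition):
    # writes always go to the partition's primary
    return next(n for n, _, role in REPLICAS[partition] if role == "primary")

def route_read(partition, client_zone):
    # prefer a replica in the caller's zone to avoid cross-zone hops;
    # fall back to any replica if no same-zone copy exists
    nodes = REPLICAS[partition]
    local = [n for n, zone, _ in nodes if zone == client_zone]
    return local[0] if local else random.choice([n for n, _, _ in nodes])

print(route_write("partition-0"))               # cache-a1
print(route_read("partition-0", "us-east-1b"))  # cache-b1
```

If a zone fails, removing its entry from the map is enough for reads to fall through to the surviving zones.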
Rebalancing Challenges
Adding nodes causes widespread cache misses as keys rehash to new, cold nodes, spiking backend load until the cache warms. At scale, a sudden shift of 10% of keys can drop the hit rate by 10 percentage points and push an extra 100,000+ queries per second onto the database.
Rebalancing Mitigations
Gradual traffic shifting: a new node is marked not-ready while a background process pre-fills it with keys from existing nodes; only once warm does it begin serving traffic. Sticky client maps delay hash-ring membership updates for seconds to minutes, giving new nodes time to warm up. Using 300+ virtual nodes per physical node spreads rebalancing load across all existing nodes rather than concentrating it on ring neighbors.
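The sticky-membership idea can be sketched as a client-side ring view that holds announced nodes in a pending state until a warm-up delay elapses. This is an illustrative sketch, not a real client API; the class name and delay are assumptions:

```python
import time

class StickyRingView:
    """Client-side view of hash-ring membership that delays adding
    newly announced nodes, giving them time to warm before they
    receive traffic. (Illustrative sketch; names are hypothetical.)"""

    def __init__(self, members, delay_s=60.0):
        self.active = set(members)
        self.pending = {}          # node -> time it was announced
        self.delay_s = delay_s

    def announce(self, node, now=None):
        # a new node is announced but not yet routable
        self.pending[node] = now if now is not None else time.time()

    def members(self, now=None):
        now = now if now is not None else time.time()
        # promote nodes whose warm-up delay has elapsed
        for node, t in list(self.pending.items()):
            if now - t >= self.delay_s:
                self.active.add(node)
                del self.pending[node]
        return sorted(self.active)

view = StickyRingView(["node0", "node1"], delay_s=60.0)
view.announce("node2", now=0.0)
print(view.members(now=30.0))   # ['node0', 'node1']  -- node2 still warming
print(view.members(now=90.0))   # ['node0', 'node1', 'node2']
```

During the pending window, a background pre-fill process would copy hot keys onto node2 so it starts with a warm cache when clients finally route to it.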