
Implementation Patterns: Placement, Replication, and Bounded Load

Ring-based placement maintains a sorted array of (hash position, node ID, vnode index) tuples. Key lookup hashes the key into the same space and binary searches for the first position greater than or equal to the key hash, wrapping around if needed. With 100 nodes and 256 vnodes each, that is 25,600 entries; binary search depth is log2(25,600) ≈ 15 comparisons. Each comparison may incur a cache miss on large tables, so expect 1 to 2 microseconds total lookup time on modern CPUs, negligible compared to network or disk input/output (I/O).

Replication on a ring picks the next R minus 1 distinct physical nodes clockwise, skipping vnodes belonging to already-selected nodes and enforcing failure-domain constraints. For replication factor (RF) 3 across 3 availability zones (AZs), you pick the first node encountered clockwise in each AZ. This successor-based approach is simple but can concentrate replicas if the ring is poorly balanced; rendezvous hashing naturally picks the top K nodes by score without adjacency concerns.

Bounded-load selection prevents hotspots by capping any node at (1 + epsilon) times the mean load. Epsilon of 0.1 allows 10% overload; epsilon of 0.25 allows 25%. On lookup, if the selected node is above the cap, choose the next candidate (next clockwise on the ring, or next-highest score in rendezvous). Track current load per node in shared state updated on every request assignment. This requires coordination but prevents chronic overload. Google's Consistent Hashing with Bounded Loads paper demonstrates that epsilon of 0.25 keeps peak-to-mean under 1.05 even with adversarial key distributions.

Membership management distributes versioned node lists to clients via a control plane (etcd, Consul, ZooKeeper) or gossip protocols. Use monotonic epoch numbers; clients reject older epochs. During topology changes, run both old and new assignments in parallel for a grace period (e.g., 30 seconds), allowing clients to converge without split-brain. Meta's mcrouter uses this pattern, with pool configs versioned and distributed via a configuration service. Sketches of each pattern follow below.
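A minimal sketch of ring construction and lookup, assuming an MD5-derived 64-bit position space and a `node#index` vnode-labeling scheme (both illustrative choices, not any particular library's API):

```python
import bisect
import hashlib

def hash64(data: str) -> int:
    # Map a string to a 64-bit position on the ring (hash choice is illustrative).
    return int.from_bytes(hashlib.md5(data.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes: list[str], vnodes: int = 256):
        # One (position, node) entry per vnode, kept sorted for binary search.
        self.entries = sorted(
            (hash64(f"{node}#{v}"), node) for node in nodes for v in range(vnodes)
        )
        self.positions = [pos for pos, _ in self.entries]

    def lookup(self, key: str) -> str:
        # First vnode position >= the key hash, wrapping past the last entry to index 0.
        i = bisect.bisect_left(self.positions, hash64(key))
        return self.entries[i % len(self.entries)][1]
```

With 100 nodes at 256 vnodes each, `lookup` binary searches 25,600 positions in roughly 15 comparisons, matching the figures above.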
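Successor-based replica selection is the same clockwise walk, extended to skip duplicate nodes and duplicate failure domains. This sketch reuses `Ring` and `hash64` from above; the `node_az` map (node to availability zone) is an assumed input:

```python
def replicas(ring: Ring, key: str, rf: int, node_az: dict[str, str]) -> list[str]:
    # Walk clockwise from the key's position, collecting rf distinct physical
    # nodes in distinct AZs; vnodes of already-chosen nodes or AZs are skipped.
    start = bisect.bisect_left(ring.positions, hash64(key))
    chosen: list[str] = []
    used_azs: set[str] = set()
    for step in range(len(ring.entries)):
        _, node = ring.entries[(start + step) % len(ring.entries)]
        if node in chosen or node_az[node] in used_azs:
            continue
        chosen.append(node)
        used_azs.add(node_az[node])
        if len(chosen) == rf:
            break
    return chosen
```

With RF=3 and three AZs, this yields one replica per AZ, the placement described above.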
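Bounded-load selection layers a cap check onto the same walk. For brevity this sketch tracks load in an in-process dict; a real deployment needs the shared, continuously updated load state the text describes:

```python
def bounded_lookup(ring: Ring, key: str, load: dict[str, int],
                   epsilon: float = 0.25) -> str:
    # Cap every node at (1 + epsilon) * mean load; skip overloaded candidates.
    mean = sum(load.values()) / max(len(load), 1)
    cap = (1 + epsilon) * mean
    start = bisect.bisect_left(ring.positions, hash64(key))
    for step in range(len(ring.entries)):
        _, node = ring.entries[(start + step) % len(ring.entries)]
        if load.get(node, 0) <= cap:
            load[node] = load.get(node, 0) + 1  # record the assignment
            return node
    # Transiently possible if every node sits at the cap: fall back to the primary.
    node = ring.entries[start % len(ring.entries)][1]
    load[node] = load.get(node, 0) + 1
    return node
```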
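On the client, epoch handling reduces to rejecting stale configs and swapping the ring in one step. A sketch with hypothetical names (`MembershipClient`, `apply_config`), leaving the config-fetch mechanism (etcd/Consul watch, gossip, or an mcrouter-style config service) abstract:

```python
class MembershipClient:
    def __init__(self) -> None:
        self.epoch = -1
        self.ring: Ring | None = None

    def apply_config(self, epoch: int, nodes: list[str]) -> bool:
        # Monotonic epochs: anything not strictly newer is rejected as stale.
        if epoch <= self.epoch:
            return False
        new_ring = Ring(nodes)  # build the new table off to the side
        # Single reference swap so concurrent lookups see old or new, never a mix.
        self.ring, self.epoch = new_ring, epoch
        return True
```

During the grace period, servers accept keys under both the old and new epoch, so clients refreshing at different times converge without a miss storm.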
💡 Key Takeaways
Ring lookup performance: Binary search through 25,600 vnodes (100 nodes × 256 vnodes) takes approximately 15 comparisons; a cache miss per comparison yields 1 to 2 microseconds total, negligible vs millisecond-scale network or disk I/O
Successor-based replication: Pick the next RF minus 1 distinct physical nodes clockwise, skipping same-node vnodes and enforcing Availability Zone (AZ) diversity; Cassandra with RF=3 places replicas in 3 different AZs
Bounded load with epsilon 0.1: Cap any node at 1.1x mean load; if the selected node is overloaded, pick the next candidate; requires per-node load tracking but prevents chronic hotspots
Versioned membership with grace period: Use monotonic epoch numbers; during changes, accept keys for both old and new assignments for 30 to 60 seconds, allowing clients to converge without miss storms
Practical vnode allocation: Weighted capacity via proportional vnodes; a node with 2x CPU gets 2x vnodes (e.g., 512 vs 256), naturally receiving 2x keys without manual intervention
📌 Examples
Cassandra token assignment: Each node claims 256 random tokens (vnodes) on startup; the token-to-node map is stored in the system.peers table and gossiped; lookups binary-search the sorted token list
Amazon Dynamo replication: RF=3 with successor-based placement; the write coordinator forwards to the next 2 clockwise nodes in different AZs; a quorum write (W=2) returns after 2 of 3 acknowledge
Google bounded loads: epsilon=0.25 allows peak node load of 1.25x mean, with next-candidate selection on overload; measured peak-to-mean ratio of 1.05 even with 20% of keys concentrated on 5% of the keyspace
Meta mcrouter pool config: JSON config with node list and weights distributed via config service; version number incremented on change; clients refresh every 10 seconds and apply new config atomically