Data Partitioning and Replication in Key-Value Stores
Why Partition Data
A single server has finite memory, CPU, and network capacity. To scale beyond one machine, distribute data across multiple nodes. Each node handles a subset of keys, allowing the cluster to grow by adding nodes. The challenge is deciding which keys go where and keeping data available when nodes fail.
Hash Partitioning
The most common approach: hash the key and assign it to a partition based on the hash value. Simple modulo sharding (hash(key) % num_nodes) works, but it causes massive data movement whenever the cluster size changes: adding one node to a 100-node cluster remaps roughly 99% of keys.
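The remapping cost of modulo sharding can be measured directly. The sketch below is illustrative (the `shard` helper and `user:N` key names are assumptions, not a specific system's API); it hashes 10,000 keys into a 100-node cluster, then into a 101-node cluster, and counts how many keys land on a different node.

```python
import hashlib

def shard(key: str, num_nodes: int) -> int:
    # Hash the key to a stable integer, then bucket it with modulo.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"user:{i}" for i in range(10_000)]
before = {k: shard(k, 100) for k in keys}
after = {k: shard(k, 101) for k in keys}  # cluster grows by one node
moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys remapped")
```

A key keeps its partition only when its hash happens to agree modulo both 100 and 101, so in practice almost every key moves.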
Consistent Hashing
Consistent hashing minimizes data movement during cluster changes. Conceptually, keys and nodes map to positions on a circular ring, and each key routes to the first node clockwise from its position. Adding a node moves only the keys that fall between the new node and its predecessor on the ring, on average 1/N of the total rather than nearly all of them. Virtual nodes (multiple ring positions per physical node, typically 100-256) spread load more evenly.
Replication for Availability
Each partition is replicated to multiple nodes for fault tolerance. A common configuration keeps 3 copies: one primary handles writes and reads, while two replicas stand by for failover. When the primary fails, a replica is promoted automatically.
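On a consistent-hash ring, a common way to pick the 3 replica holders is a preference list: the primary is the first node clockwise of the key, and the replicas are the next distinct nodes around the ring. The helper below is a simplified sketch (one position per node, no virtual nodes; `preference_list` is an assumed name).

```python
import hashlib

def _pos(value: str) -> int:
    # Stable ring position for a node name or a key.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def preference_list(key, nodes, n_replicas=3):
    # Walk clockwise from the key's position, collecting distinct nodes.
    ring = sorted(nodes, key=_pos)
    key_pos = _pos(key)
    start = next((i for i, node in enumerate(ring) if _pos(node) >= key_pos),
                 0)  # wrap to the ring's start if the key hashes past the end
    return [ring[(start + j) % len(ring)] for j in range(n_replicas)]

nodes = [f"node-{i}" for i in range(5)]
print(preference_list("user:42", nodes))  # primary first, then two replicas
```

The first entry acts as the primary; on failure, the next node in the list is a natural promotion candidate because it already holds a copy of the partition.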
Replication Modes
Synchronous replication waits for all replicas to acknowledge before confirming a write. This guarantees durability (the data survives any single node failure) but adds latency, typically 10-50 ms per write. Asynchronous replication acknowledges as soon as the primary has written locally, risking data loss if the primary fails before the write reaches a replica. Most systems let you choose per operation based on durability requirements.
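The per-operation trade-off can be sketched with a toy write path. This is a simulation under loud assumptions: `replicate` stands in for a replication RPC, and a 10 ms sleep stands in for network latency; nothing here reflects a specific system's API.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

POOL = ThreadPoolExecutor(max_workers=4)
store = {}  # the primary's local key-value store

def replicate(replica_id, key, value):
    # Stand-in for a replication RPC; the sleep simulates network latency.
    time.sleep(0.01)
    return replica_id

def write(key, value, replicas=(1, 2), sync=True):
    store[key] = value  # the primary applies the write first
    futures = [POOL.submit(replicate, r, key, value) for r in replicas]
    if sync:
        wait(futures)  # synchronous: block until every replica acknowledges
    # asynchronous: return immediately; the write is lost if the primary
    # dies before the in-flight replications complete
    return "ok"

t0 = time.perf_counter()
write("a", 1, sync=True)
sync_ms = (time.perf_counter() - t0) * 1000
t0 = time.perf_counter()
write("b", 2, sync=False)
async_ms = (time.perf_counter() - t0) * 1000
print(f"sync: {sync_ms:.1f} ms, async: {async_ms:.1f} ms")
```

The synchronous write pays at least one full replication round-trip before acknowledging; the asynchronous write returns as soon as the primary's local write and the RPC submissions complete.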