Power of Two Choices: Least Requests Algorithm
The Core Problem
The power of two choices algorithm solves a fundamental problem: how do you approximate global least-load decisions with only local state and constant-time overhead? A naive approach would query all backends for their current load, but with 1,000 backends this adds latency and creates a coordination bottleneck. The insight is that comparing just two random choices provides nearly optimal distribution with zero coordination.
The Algorithm
For each incoming request: randomly sample two backend servers, check their in-flight request counts, and route to whichever has fewer active requests. Ties break by configured weight. The load balancer maintains per-backend counters that increment on dispatch and decrement on completion. This requires no centralized coordination or distributed state propagation. Each load balancer instance operates independently using only its local view of in-flight requests.
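The bookkeeping described above can be sketched in a few lines of Python. This is an illustrative sketch, not any particular proxy's implementation; the class name and weight-based tie-breaking scheme are assumptions:

```python
import random
from collections import defaultdict

class P2CBalancer:
    """Power-of-two-choices over purely local in-flight counters (sketch)."""

    def __init__(self, backends, weights=None):
        self.backends = list(backends)
        # Hypothetical configured weights used only to break ties.
        self.weights = weights or {b: 1 for b in self.backends}
        self.in_flight = defaultdict(int)  # local view; no coordination

    def pick(self):
        # Sample two distinct backends and compare in-flight counts.
        a, b = random.sample(self.backends, 2)
        if self.in_flight[a] != self.in_flight[b]:
            chosen = a if self.in_flight[a] < self.in_flight[b] else b
        else:
            # Tie: prefer the backend with the higher configured weight.
            chosen = a if self.weights[a] >= self.weights[b] else b
        self.in_flight[chosen] += 1  # increment on dispatch
        return chosen

    def complete(self, backend):
        self.in_flight[backend] -= 1  # decrement on completion
```

A caller would invoke `pick()` before dispatching a request and `complete(backend)` when the response arrives; correctness of the counters depends only on pairing those two calls.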
The Mathematics
The mathematics is remarkable. In the classic balls-into-bins problem, throwing n balls uniformly at random into n bins yields a maximum load of roughly log(n) / log(log(n)) with high probability. Adding just one extra random choice (pick two bins, choose the less loaded) drops the maximum load to roughly log(log(n)). For 1,000 servers the absolute numbers are modest — in simulation, the busiest bin typically holds around 6-8 balls with one choice versus about 4 with two — but the important property is the asymptotic gap: the single-choice maximum keeps growing with n while the two-choice maximum stays essentially flat. In production with bursty traffic and heavy-tailed service times, avoiding those overloaded outlier backends translates directly to dramatically lower p99 latency.
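The balls-into-bins gap is easy to check empirically. A small simulation, with the bin counts and seed chosen arbitrarily for illustration:

```python
import random

def max_load(n_balls, n_bins, choices, rng):
    """Throw n_balls into n_bins; each ball samples `choices` bins
    and lands in the least-loaded one. Returns the maximum bin load."""
    bins = [0] * n_bins
    for _ in range(n_balls):
        if choices > 1:
            candidates = rng.sample(range(n_bins), choices)
        else:
            candidates = [rng.randrange(n_bins)]
        best = min(candidates, key=lambda i: bins[i])
        bins[best] += 1
    return max(bins)

rng = random.Random(0)
one_choice = max_load(1000, 1000, 1, rng)   # typically around 6-8
two_choice = max_load(1000, 1000, 2, rng)   # typically around 3-4
```

Running this repeatedly shows the two-choice maximum clustering tightly near log(log(n)) while the single-choice maximum is both higher and more variable.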
Scaling Properties
Aggregate routing capacity scales linearly with load balancer count, since each proxy makes independent local decisions with no coordination overhead. Adding more load balancers does not increase contention or require state synchronization. You can tune the choice count (typically 2 or 3) and incorporate weights for heterogeneous capacity. Choosing 3 instead of 2 provides only marginal improvement while increasing overhead; 2 choices capture most of the benefit.
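Both tuning knobs generalize naturally. A sketch of d-choice selection, plus one common approach to heterogeneous capacity (comparing in-flight counts normalized by weight rather than raw counts — an assumption here, not the only scheme in use):

```python
import random

def pick_least_of_d(backends, in_flight, d=2, rng=random):
    """Generalized d-choice selection; d=2 captures most of the benefit."""
    candidates = rng.sample(backends, min(d, len(backends)))
    return min(candidates, key=lambda b: in_flight[b])

def pick_weighted(backends, in_flight, weight, d=2, rng=random):
    """For heterogeneous capacity, compare load per unit of capacity
    so a backend with weight 2 is treated as half as loaded."""
    candidates = rng.sample(backends, min(d, len(backends)))
    return min(candidates, key=lambda b: in_flight[b] / weight[b])
```

With `in_flight = {'a': 10, 'b': 10}` and `weight = {'a': 2, 'b': 1}`, the weighted variant routes to `'a'` (effective load 5 versus 10).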
Failure Mode: Metric Staleness
The failure mode to watch is metric staleness with very short requests. If request completion signals lag by 10-20ms while requests complete in under 5ms, multiple concurrent routing decisions see the same stale counters and pile onto the same backend. The mitigation is to keep the window between completion and counter decrement as short as possible. Modern proxies achieve this through efficient event-driven architectures that update counters synchronously in the dispatch thread, keeping staleness under 1ms.
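The pile-on effect is visible in a toy simulation. Assume a burst of 100 routing decisions over five backends where one backend initially looks least loaded; the backend counts, burst size, and seed are invented for illustration:

```python
import random

def route_batch(n_requests, counts, rng, fresh):
    """Route a burst of concurrent requests with two-choice selection.
    fresh=True updates counters synchronously at dispatch;
    fresh=False means every decision sees the same stale snapshot."""
    picks = [0] * len(counts)
    live = list(counts)
    for _ in range(n_requests):
        a, b = rng.sample(range(len(counts)), 2)
        view = live if fresh else counts  # stale mode never sees updates
        chosen = a if view[a] <= view[b] else b
        picks[chosen] += 1
        if fresh:
            live[chosen] += 1  # synchronous counter update
    return picks

counts = [5, 5, 5, 1, 5]  # backend 3 looks least loaded in the snapshot
stale = route_batch(100, counts, random.Random(7), fresh=False)
fresh = route_batch(100, counts, random.Random(7), fresh=True)
```

With stale counters, every sampled pair that contains backend 3 routes to it, so it absorbs roughly 40% of the burst; with synchronous updates the picks spread out almost evenly.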