
Weighted Load Balancing and Slow Start for Heterogeneous Fleets

Production fleets are rarely homogeneous. Servers run different CPU generations (Intel Skylake vs Ice Lake), see different cache hit rates (newly added instances start with cold caches), and suffer noisy-neighbor effects in cloud environments. Weighted load balancing encodes these differences by assigning each backend a weight proportional to its capacity and distributing traffic accordingly: a server with weight 200 receives twice the traffic of one with weight 100.

The challenge is smooth distribution without microbursts. Naive weighted round robin honors the weights but serves each backend's share back to back: with weights [100, 200, 100], it routes 1 request to server A, then 2 consecutive requests to server B, then 1 to server C, and repeats. The more skewed the weights, the longer these consecutive runs become, and the larger the resulting spikes in queue depth. Smooth weighted round robin algorithms interleave selections to approximate the weights over short windows, avoiding bursts while maintaining long-term proportions.

Slow start addresses cold-start problems. When a new instance is added or a failed one recovers, its caches are empty and just-in-time (JIT) compilation has not warmed up. Sending it full traffic immediately causes tail latency spikes. Slow start begins at 10 to 20% of nominal weight and ramps linearly or exponentially to full weight over 30 to 300 seconds. AWS Application Load Balancer supports a configurable slow start duration per target group (30 to 900 seconds); it defaults to 0 (disabled), and 60 to 120 seconds is a common setting for cache-heavy applications.

Auto-tuning weights from live signals is powerful but risky. You can adjust weights based on CPU utilization, error rates, or Service Level Objective (SLO) attainment, but mis-set weights cause chronic imbalance. For example, a backend experiencing transient garbage collection (GC) pauses sees its weight reduced. If the adjustment is too aggressive or too slow to revert, the instance never recovers its share and the cluster runs under capacity. Safe implementations use bounded ranges (never below 20% or above 150% of baseline), slow adjustment rates (5 to 10% per minute), and circuit breakers that disable auto-tuning if cluster-wide error rates exceed thresholds.
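The interleaving idea behind smooth weighted round robin can be sketched in a few lines. The version below follows the credit-based scheme used by nginx's upstream balancer; the function name `smooth_wrr` and the backend names are hypothetical, and a production balancer would also need to handle health checks and weight changes, which this sketch ignores.

```python
def smooth_wrr(weights, n):
    """Return n picks using smooth weighted round robin.

    weights: dict mapping backend name -> positive integer weight.
    """
    total = sum(weights.values())
    current = {name: 0 for name in weights}  # running credit per backend
    picks = []
    for _ in range(n):
        # Every backend earns credit equal to its weight on each round.
        for name, weight in weights.items():
            current[name] += weight
        # The backend with the most credit wins (ties go to the first listed).
        chosen = max(current, key=current.get)
        # The winner pays back the total weight, which pushes it behind the others.
        current[chosen] -= total
        picks.append(chosen)
    return picks


picks = smooth_wrr({"A": 100, "B": 200, "C": 100}, 8)
print(picks)  # ['B', 'A', 'C', 'B', 'B', 'A', 'C', 'B']

skewed = smooth_wrr({"A": 5, "B": 1, "C": 1}, 7)
print(skewed)  # ['A', 'A', 'B', 'A', 'C', 'A', 'A']
```

Within every full cycle the pick counts match the weights exactly. The benefit is clearest with skewed weights: for 5:1:1 a naive ordering sends five requests in a row to A, while this scheme breaks that run up with B and C.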
💡 Key Takeaways
Weighted round robin distributes traffic in proportion to capacity (weight 200 gets 2x the traffic of weight 100), which is critical for heterogeneous fleets with different CPU generations or cache states
Smooth weighted algorithms interleave selections to avoid microbursts. Naive approaches route consecutive requests to high-weight servers, causing queue depth spikes
Slow start ramps new or recovered instances from 10 to 20% of nominal weight up to 100% over 30 to 300 seconds. AWS ALB defaults to 0 (disabled); 60 to 120 seconds is a common setting for cache-heavy apps
Auto-tuning weights from CPU or error rate signals is powerful but dangerous. Miscalibrated adjustments cause chronic imbalance where struggling instances never recover their share
Safe auto-tuning requires bounded ranges (20 to 150% of baseline), slow rates (5 to 10% adjustment per minute), and circuit breakers that disable tuning if cluster error rates exceed thresholds
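The three safety rules for auto-tuning (bounded range, limited step size, circuit breaker) fit in one small update function. This is a minimal sketch, assuming the function is called once per minute; the name `next_weight`, the 5% error threshold, and the choice to freeze the current weight when the breaker trips are all illustrative assumptions rather than anything a specific load balancer prescribes.

```python
def next_weight(current, target, baseline, cluster_error_rate,
                min_frac=0.20, max_frac=1.50, max_step_frac=0.10,
                error_threshold=0.05):
    """Move `current` toward `target` under the safety bounds.

    Assumed to run once per minute, so max_step_frac is a per-minute rate.
    """
    # Circuit breaker: stop tuning while the whole cluster looks unhealthy.
    # (Assumption: we hold the current weight rather than reset to baseline.)
    if cluster_error_rate > error_threshold:
        return current

    # Rate limit: change by at most max_step_frac of baseline per call.
    max_step = baseline * max_step_frac
    delta = max(-max_step, min(max_step, target - current))

    # Bounded range: never below min_frac or above max_frac of baseline.
    low, high = baseline * min_frac, baseline * max_frac
    return max(low, min(high, current + delta))


# A GC-paused backend whose signal suggests weight 40 loses only 10 per minute.
print(next_weight(100, 40, 100, cluster_error_rate=0.01))  # 90.0
# The floor holds even if the signal keeps pushing the weight down.
print(next_weight(25, 0, 100, cluster_error_rate=0.01))    # 20.0
```

Because the same step limit applies in both directions, a backend that was penalized for a transient pause climbs back at the same bounded rate once its signal recovers.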
📌 Examples
AWS ALB weighted target groups (ALB assigns weights per target group, not per instance): Three target groups with weights [100, 200, 150]. Total weight 450. Group B receives 200 / 450 ≈ 44% of traffic. With slow start duration set to 120 seconds, a newly registered target ramps from 22 RPS to 220 RPS over 2 minutes
Google production: Fleet mixing Intel Skylake (weight 100) and Ice Lake (weight 140) instances. Weighted distribution prevents Skylake overload, keeping p99 under 200ms vs 500ms with equal distribution
Azure auto scaling: New virtual machine (VM) added as the 10th instance of a pool handling 10,000 RPS. Without slow start, the new VM immediately gets 1,000 RPS with a cold cache, causing p99 to spike to 2 seconds. With a 60s slow start, p99 stays under 300ms
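The arithmetic in these examples is easy to check in code. The sketch below computes traffic shares from weights and models a linear slow start ramp; the helper names are hypothetical, the 10% starting fraction is inferred from the 22 to 220 RPS figures above, and a real load balancer may shape its ramp differently.

```python
def traffic_shares(weights):
    """Fraction of traffic each backend receives under weighted balancing."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def slow_start_rps(steady_rps, elapsed_s, duration_s, start_frac=0.10):
    """Request rate allowed during a linear slow start ramp.

    start_frac is an assumed starting fraction of the steady-state rate.
    A duration of 0 means slow start is disabled.
    """
    if duration_s <= 0 or elapsed_s >= duration_s:
        return steady_rps
    progress = max(0.0, elapsed_s) / duration_s
    return steady_rps * (start_frac + (1.0 - start_frac) * progress)


shares = traffic_shares({"A": 100, "B": 200, "C": 150})
print(round(shares["B"] * 100, 1))  # 44.4

for t in (0, 60, 120):
    print(t, round(slow_start_rps(220, t, 120)))
# 0 22
# 60 121
# 120 220
```

An exponential ramp would replace the linear `progress` term with a curve that starts flatter and finishes steeper, which is gentler on caches that take a while to become useful.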