
Weighted Load Balancing and Slow Start for Heterogeneous Fleets

Why Weighted Distribution Matters

Real production fleets are never homogeneous. Servers run different CPU generations, with 20-40% performance differences between them. Newly added instances have cold caches and 2-5x higher latency until warmed up. Cloud environments suffer noisy-neighbor effects, where co-located workloads compete for shared resources. Weighted load balancing encodes these capacity differences by assigning each backend a weight proportional to its capacity and distributing traffic accordingly: a server with weight 200 receives twice the traffic of one with weight 100.
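The proportionality rule above can be sketched with weighted random selection. This is a minimal illustration (the backend names and weights are hypothetical), not a production balancer:

```python
import random

# Hypothetical backends: "b" has weight 200, so it should receive
# roughly twice the traffic of "a" or "c" (each weight 100).
backends = {"a": 100, "b": 200, "c": 100}

def pick_backend(backends: dict[str, int]) -> str:
    """Pick a backend with probability proportional to its weight."""
    names = list(backends)
    return random.choices(names, weights=[backends[n] for n in names], k=1)[0]

# Over many requests, "b" converges toward 50% of traffic (200 / 400).
counts = {n: 0 for n in backends}
for _ in range(10_000):
    counts[pick_backend(backends)] += 1
```

Random selection matches the weights only in aggregate; the next section shows why short-window smoothness also matters.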

Smooth Distribution Without Bursts

The challenge is smooth distribution without micro-bursts. Naive weighted round robin sends traffic in proportion to weights but creates bursts: with weights [100, 200, 100], you might route 1 request to server A, then 2 consecutive to server B, then 1 to server C, repeating. This causes queue depth spikes on server B every cycle. Smooth weighted round robin algorithms interleave selections to approximate weights over short windows. For weights [100, 200, 100], the sequence might be A, B, C, B, A, B, C, B... spreading B requests evenly rather than clustering them.

Slow Start for Cold Instances

Slow start addresses the cold-start problem. When you add a new instance or one recovers from failure, its caches are empty and JIT (Just-In-Time) compilation has not warmed up. Immediately sending full traffic causes tail latency spikes as every request misses cache and triggers compilation. Slow start begins at 10-20% of nominal weight and ramps linearly or exponentially over 30-300 seconds. For cache-heavy applications, 60-120 seconds is typically sufficient to warm the working set. Without slow start, deploying new instances during traffic spikes can make latency worse, not better.
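A linear ramp can be expressed as a simple function of instance age. The specific values here (10% floor, 60-second warmup) are illustrative defaults drawn from the ranges above, not a prescribed configuration:

```python
def effective_weight(nominal: int, age_s: float,
                     warmup_s: float = 60.0, floor: float = 0.1) -> float:
    """Linear slow-start ramp: start at `floor` (10%) of nominal weight
    and grow to 100% as the instance ages toward `warmup_s` seconds."""
    if age_s >= warmup_s:
        return float(nominal)
    frac = floor + (1.0 - floor) * (age_s / warmup_s)
    return nominal * frac

# A fresh instance earns weight as its caches warm:
# effective_weight(100, 0)  -> 10% of nominal (≈ 10.0)
# effective_weight(100, 30) -> halfway up the ramp (≈ 55.0)
# effective_weight(100, 60) -> full weight (100.0)
```

The balancer would feed this effective weight into its normal weighted selection, so a cold instance simply looks like a small server until it warms up.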

Auto-Tuning Weights: Power and Risk

Auto-tuning weights from live signals is powerful but risky. You can adjust weights based on CPU utilization, error rates, or SLO attainment. However, mis-set weights cause chronic imbalance. Example: a backend experiencing transient garbage collection pauses sees its weight reduced. If the adjustment is too aggressive or too slow to revert, the instance never recovers its share and the cluster runs under capacity. The backend sits at low utilization while others are overloaded, unable to help because its weight keeps it starved of traffic.

Safe Auto-Tuning Implementation

Safe implementations use multiple guardrails: bounded ranges (never below 20% or above 150% of baseline weight), slow adjustment rates (5-10% per minute maximum), and circuit breakers that disable auto-tuning if cluster-wide error rates exceed thresholds. The weight adjustment algorithm should be symmetric: if it takes 5 minutes to reduce weight by 50%, it should take 5 minutes to restore, not longer. Asymmetric recovery causes capacity to bleed away over time as transient issues accumulate weight reductions.
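These guardrails can be combined in a single rate-limited, clamped adjustment step. This is a sketch of the bounded-range and symmetric-rate ideas above (the function name and the one-call-per-minute cadence are assumptions for illustration):

```python
def adjust_weight(current: float, target: float, baseline: float,
                  max_step_frac: float = 0.10) -> float:
    """Move `current` toward `target`, at most `max_step_frac` of baseline
    per call (e.g. one call per minute => max 10%/minute), clamped to
    20-150% of baseline. The step limit is symmetric: weight recovers
    exactly as fast as it was reduced."""
    step = max_step_frac * baseline
    # Limit how far we move this round, in either direction.
    delta = max(-step, min(step, target - current))
    new = current + delta
    # Never starve a backend below 20% or inflate it above 150% of baseline.
    return max(0.20 * baseline, min(1.50 * baseline, new))

# Dropping from 100 toward 40 takes the same number of steps
# as recovering from 40 back toward 100.
```

A cluster-wide circuit breaker would sit outside this function, freezing all auto-tuning (pinning weights at baseline) when error rates cross the threshold.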

Key Insight: Weighted load balancing is essential for heterogeneous fleets, but the devil is in implementation details. Smooth distribution prevents bursts, slow start prevents cold-instance latency spikes, and safe auto-tuning guardrails prevent chronic capacity starvation.
💡 Key Takeaways
Heterogeneous fleets: 20-40% CPU generation differences, 2-5x cold cache latency; weights encode these capacity differences
Smooth weighted round robin interleaves selections to avoid bursts; naive approach clusters consecutive requests to high-weight servers
Slow start ramps new instances from 10-20% to 100% weight over 30-300 seconds; 60-120s typical for cache-heavy apps
Safe auto-tuning: bounded 20-150% range, 5-10%/minute adjustment rate, circuit breaker on cluster error rate
📌 Interview Tips
1. Explain the burst problem: weights [100, 200, 100] with naive round robin send consecutive requests to server B, spiking its queue
2. Describe the slow start scenario: a new instance with a cold cache receives full traffic, every request misses cache, latency spikes
3. Walk through the auto-tuning failure: a GC pause reduces weight, the instance never recovers its share, and it sits idle while others are overloaded