Lag Aware Load Balancing and Health Based Routing

Why Naive Load Balancing Fails
Naive round-robin load balancing across replicas fails in production because replicas are not uniformly healthy. Replication lag varies due to network jitter, write burst absorption, long transaction apply delays, and hardware differences. A replica lagging 5 seconds behind serves stale data and violates user expectations.
A replica at 95% CPU saturation adds hundreds of milliseconds of queuing delay. Round-robin sends equal traffic to the struggling replica, making latency worse for all requests routed there. Intelligent load balancing must consider both freshness and capacity.
Lag-Aware Load Balancing
Lag-aware load balancing continuously measures replica freshness and saturation, routing reads to healthy replicas only. Each replica exposes its current lag (seconds behind primary) and load metrics. The router maintains this state, updated every few seconds or via streaming health checks.
Routing decisions then incorporate lag thresholds. If your application tolerates 1 second staleness, exclude replicas with lag greater than 1 second from the pool. Among eligible replicas, use weighted round-robin based on current load—send more traffic to less loaded replicas.
Adaptive Routing Algorithms
Least-connections routing sends new requests to the replica with fewest active queries. This naturally adapts to varying query costs: a replica executing slow queries accumulates connections and receives less traffic. Fast replicas drain connections quickly and receive more traffic.
Latency-based routing measures actual response times and routes to fastest replicas. This accounts for all factors: replication lag, query processing, and network latency. P99 latency (the response time at the 99th percentile—meaning 99% of requests are faster than this) is often more useful than average for detecting struggling replicas.
Health Check Design
Effective health checks go beyond simple connectivity. A replica can accept connections while being 30 seconds behind on replication. Health checks should query current lag, connection count, CPU utilization, and run a simple test query to verify query execution works.
Health check frequency trades accuracy against overhead. Checking every 100ms gives fresh data but adds load. Checking every 5 seconds misses fast-developing problems. Many systems use tiered checks: lightweight ping every second, detailed metrics every 5 seconds.

💡 Key Takeaways

✓Replication lag varies dramatically in production. Steady state 10 to 100 milliseconds can spike to seconds or minutes during write bursts, long transactions, DDL operations, or replica resource saturation

✓Health scoring combines multiple signals: replication lag from SHOW SLAVE STATUS or pg_stat_replication, CPU and IOPS saturation, query queue depth, and recent error rates to compute per replica scores every 1 to 5 seconds

✓Weighted load balancing routes reads to replicas proportional to health scores, automatically shifting traffic away from degraded instances while keeping them in rotation for non critical queries

✓Circuit breakers remove replicas from rotation when lag exceeds thresholds (commonly 500ms intra region, 2 seconds cross region) or error rates spike, with automatic fallback to primary only reads if all replicas fail

✓Production deployments at Microsoft and Meta commonly run 5 to 10 replicas per shard to tolerate 1 to 2 degraded replicas at any time while maintaining p50 lag under 50ms and p99 under 500ms SLOs

📌 Interview Tips

1Health scoring: Replica A shows 30ms lag, 40% CPU, 0 errors → score 90. Replica B shows 200ms lag, 80% CPU, 3% error rate → score 0, removed from pool. Replica C shows 60ms lag, 55% CPU → score 39, receives low weight.

2Lag spike handling: During a flash sale, write QPS jumps from 2,000 to 15,000. Replicas fall behind: lag goes from 50ms to 3 seconds within 30 seconds. Router detects breach, circuits open, redirects all reads to primary for 2 minutes until replicas catch up.

3Multi signal overload protection: A replica at 50ms lag but query queue depth of 150 and p99 latency of 800ms gets temporarily removed despite acceptable lag, preventing localized saturation from degrading user experience.

← Back to Read Replicas & Query Routing Overview