Database Design • Read Replicas & Query Routing
Lag-Aware Load Balancing and Health-Based Routing
Naive round-robin load balancing across replicas fails in production because replicas are not uniformly healthy. Replication lag varies due to network jitter, write-burst absorption, long-running transaction apply delays, and hardware differences. A replica lagging 5 seconds behind the primary will serve stale data and violate user expectations. A replica at 95 percent Central Processing Unit (CPU) saturation will add hundreds of milliseconds of queuing delay. Lag-aware load balancing continuously measures replica freshness and saturation, routing reads to the healthiest options.
The core metric is replication lag: the time or log-position difference between primary and replica. Most databases expose this through replication status commands (MySQL's SHOW SLAVE STATUS reports Seconds_Behind_Master; PostgreSQL's pg_stat_replication shows replay_lag). Routers poll these metrics every 1 to 5 seconds and maintain a per-replica health score. A common scheme: score = max(0, 100 - lag_ms / 10 - cpu_percent). A replica at 50ms lag and 70 percent CPU scores 100 - 5 - 70 = 25. A replica at 10ms lag and 30 percent CPU scores 100 - 1 - 30 = 69. Reads route to replicas with scores above a threshold (say 50) using weighted random selection.
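A minimal sketch of this scoring-and-selection scheme in Python. The replica names, the poll data, and the 50-point threshold are illustrative, not tied to any particular router:

```python
import random

SCORE_THRESHOLD = 50  # replicas scoring at or below this are skipped

def health_score(lag_ms: float, cpu_percent: float) -> float:
    # score = 100 - lag_ms/10 - cpu_percent, clamped to zero
    return max(0.0, 100.0 - lag_ms / 10.0 - cpu_percent)

def pick_replica(replicas: dict[str, tuple[float, float]]) -> str | None:
    """Weighted random selection among replicas scoring above the threshold.

    `replicas` maps name -> (lag_ms, cpu_percent). Returns None when no
    replica is healthy enough, signaling a fallback to the primary.
    """
    scores = {name: health_score(lag, cpu) for name, (lag, cpu) in replicas.items()}
    healthy = {name: s for name, s in scores.items() if s > SCORE_THRESHOLD}
    if not healthy:
        return None  # caller falls back to primary-only reads
    names = list(healthy)
    return random.choices(names, weights=[healthy[n] for n in names], k=1)[0]

# The two replicas from the text: 10ms/30% scores 69 and is eligible;
# 50ms/70% scores 25 and is excluded by the threshold of 50.
print(pick_replica({"replica-a": (10, 30), "replica-b": (50, 70)}))
```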
Lag spikes are inevitable. Heavy write bursts generate log volume faster than replicas can apply it. Long-running transactions on the primary block replication apply threads waiting for locks. Large Data Definition Language (DDL) operations like adding indexes replay slowly. Misconfigured replicas with inadequate CPU or disk Input/Output Operations Per Second (IOPS) fall permanently behind. In production, p99 lag spikes to seconds or minutes are common during peak traffic or maintenance windows. Routers must detect these spikes quickly and circuit-break: temporarily remove the lagging replica from rotation, log alerts, and optionally fall back to primary-only reads if all replicas are unhealthy.
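A hedged sketch of that circuit-breaking step, assuming the 500ms intra-region lag threshold from the takeaways below and a fixed two-minute cooldown; a production router would typically re-probe the replica rather than rely solely on a timer:

```python
import time

LAG_BREAK_MS = 500  # open the circuit above this lag (intra-region threshold)
COOLDOWN_S = 120    # keep the replica out of rotation while it catches up

class LagCircuitBreaker:
    """Per-replica breaker: trips on a lag spike, heals after a cooldown."""

    def __init__(self) -> None:
        self.open_until: dict[str, float] = {}  # replica -> monotonic deadline

    def record_lag(self, replica: str, lag_ms: float) -> None:
        if lag_ms > LAG_BREAK_MS:
            # Trip: remove the replica from rotation and log an alert.
            self.open_until[replica] = time.monotonic() + COOLDOWN_S
            print(f"ALERT: {replica} lagging {lag_ms:.0f}ms, circuit opened")

    def routable(self, replicas: list[str]) -> list[str]:
        now = time.monotonic()
        return [r for r in replicas if self.open_until.get(r, 0.0) <= now]

# If routable() returns an empty list, the router falls back to
# primary-only reads until at least one replica's circuit closes.
```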
Microsoft and Meta run large fleets where per-shard deployments include 5 to 10 replicas specifically to tolerate 1 to 2 replicas being degraded at any time. They set Service Level Objectives (SLOs) like p50 lag under 50ms and p99 under 500ms, with automated paging when breached. They also implement overload protection: if a replica's query queue depth exceeds 100, the router stops sending new queries there even if lag is acceptable, preventing localized saturation from cascading. This multi-signal health model (lag, CPU, IOPS, queue depth, error rate) is necessary at scale to maintain both consistency and availability.
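A sketch of such a multi-signal gate. The lag and queue-depth cutoffs come from figures in this section; the CPU, IOPS, and error-rate cutoffs are assumptions for illustration, not a published Microsoft or Meta configuration:

```python
from dataclasses import dataclass

@dataclass
class ReplicaSignals:
    lag_ms: float
    cpu_percent: float
    iops_percent: float   # disk IOPS utilization
    queue_depth: int      # queries queued on the replica
    error_rate: float     # fraction of recent queries that errored

def is_routable(s: ReplicaSignals) -> bool:
    """Every signal must pass: a saturated queue or an error spike
    disqualifies a replica even when its lag alone looks acceptable."""
    return (
        s.lag_ms <= 500           # p99 lag SLO from this section
        and s.cpu_percent < 90    # assumed saturation cutoff
        and s.iops_percent < 90   # assumed saturation cutoff
        and s.queue_depth <= 100  # overload-protection threshold from the text
        and s.error_rate < 0.01   # assumed cutoff
    )

# Example: 50ms lag but queue depth 150 -> not routable despite healthy lag.
print(is_routable(ReplicaSignals(50, 60, 40, 150, 0.0)))  # False
```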
💡 Key Takeaways
•Replication lag varies dramatically in production. Steady-state lag of 10 to 100 milliseconds can spike to seconds or minutes during write bursts, long transactions, DDL operations, or replica resource saturation
•Health scoring combines multiple signals: replication lag from SHOW SLAVE STATUS or pg_stat_replication, CPU and IOPS saturation, query queue depth, and recent error rates to compute per-replica scores every 1 to 5 seconds
•Weighted load balancing routes reads to replicas in proportion to health scores, automatically shifting traffic away from degraded instances while keeping them in rotation for non-critical queries
•Circuit breakers remove replicas from rotation when lag exceeds thresholds (commonly 500ms intra-region, 2 seconds cross-region) or error rates spike, with automatic fallback to primary-only reads if all replicas fail
•Production deployments at Microsoft and Meta commonly run 5 to 10 replicas per shard to tolerate 1 to 2 degraded replicas at any time while maintaining SLOs of p50 lag under 50ms and p99 under 500ms
📌 Examples
Health scoring: Replica A shows 30ms lag, 40% CPU, 0 errors → score 100 - 3 - 40 = 57, healthy. Replica B shows 200ms lag, 80% CPU, 3% error rate → score 0, removed from pool. Replica C shows 60ms lag, 55% CPU → score 39, kept in rotation at low weight for non-critical queries.
Lag spike handling: During a flash sale, write QPS jumps from 2,000 to 15,000. Replicas fall behind: lag climbs from 50ms to 3 seconds within 30 seconds. The router detects the breach, opens the circuits, and redirects all reads to the primary for 2 minutes until the replicas catch up.
Multi-signal overload protection: A replica at 50ms lag but with a query queue depth of 150 and p99 latency of 800ms is temporarily removed despite acceptable lag, preventing localized saturation from degrading user experience.