
Deployment and Failover Strategies: Managing State Through Change

The Operational Cost of Stickiness

Deployments and failures are where sticky sessions impose the highest operational cost. In a stateless system, you can instantly cut over traffic or fail over to healthy instances because every backend is interchangeable. With sticky sessions, each instance holds unique user state that will be lost unless you carefully orchestrate the transition. This transforms routine operations into complex, time-sensitive procedures.

Blue-Green Deployment Strategy

For blue-green deployments, reduce the affinity TTL before cutover and drain the old fleet. If your normal TTL is 30 minutes, reduce it to 5 minutes one hour before deployment, then stop sending new sessions to blue instances while allowing existing sessions to complete. The "affinity tail" (percentage of traffic still hitting the old fleet) decays exponentially but can take 2-3x the TTL to drop below 1-5%. During this window, you run both fleets at partial utilization, increasing cost, and any session format incompatibility between versions surfaces as errors.
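The decay math above can be sketched directly. Assuming session lifetimes are roughly memoryless, the fraction of traffic still bound to the old fleet after t minutes is e^(-t/TTL), so the drain time to reach a target fraction is -TTL·ln(target). This is a simplification (real session lifetimes are not perfectly exponential), but it reproduces the 2-3x TTL rule of thumb:

```python
import math

def drain_time(ttl_minutes: float, target_fraction: float) -> float:
    """Minutes until the affinity tail decays below target_fraction,
    assuming memoryless session lifetimes: remaining fraction after
    t minutes is exp(-t / ttl), so t = -ttl * ln(target)."""
    return -ttl_minutes * math.log(target_fraction)

# With the TTL already reduced to 5 minutes before cutover:
print(round(drain_time(5, 0.05), 1))  # 15.0 min to drop below 5% (3x TTL)
print(round(drain_time(5, 0.01), 1))  # 23.0 min to drop below 1% (~4.6x TTL)
```

This is why reducing the TTL from 30 to 5 minutes before cutover matters: the same percentage target is reached six times faster, shrinking the dual-fleet cost window.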

Rolling Restarts with Long-Lived Connections

Rolling restarts with WebSockets or gRPC present harder problems. If connections average 30-60 minutes and you restart 10% of your fleet every 5 minutes, the last instance will not complete its drain for an hour. Production systems enforce maximum connection age (force disconnect after 30-60 minutes) and stagger restarts to cap concurrent disconnects at 5-10% of total connections. Users see brief reconnection interruptions, but the alternative is hour-long deployment windows.
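The restart arithmetic can be made explicit. With 10% batches every 5 minutes, the last batch begins draining at 45 minutes; adding a forced-drain cap per batch gives the total rollout time. The 15-minute drain cap below is an illustrative assumption, not a value from the text:

```python
def rollout_duration(batches: int, interval_min: float, max_drain_min: float) -> float:
    """Total rollout time: the last batch starts at (batches - 1) * interval
    and then drains for up to max_drain_min (the enforced max connection age
    bounds how long that drain can take)."""
    return (batches - 1) * interval_min + max_drain_min

# 10 batches of 10% each, 5 minutes apart, 15-minute forced drain per batch:
print(rollout_duration(10, 5, 15))  # 60 minutes end to end
```

Without a max connection age, the drain term is unbounded by anything except the longest-lived connection, which is how deployments stretch past an hour.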

Failover and Session Loss

Failover must be immediate, which means accepting session loss. The load balancer detects unhealthy backends via failed health checks (typically 3 consecutive failures over 6-15 seconds) and stops routing new requests. Existing sessions on that instance are lost unless the application has implemented active replication or write-through checkpointing. There is no graceful drain when a server crashes; sessions simply disappear.
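The detection window quoted above falls straight out of the health-check parameters: the probe interval times the consecutive-failure threshold. A minimal sketch, with interval values chosen to match the 6-15 second range in the text:

```python
def detection_window(interval_s: float, unhealthy_threshold: int) -> float:
    """Worst-case seconds before the load balancer marks a backend
    unhealthy: one probe interval per required consecutive failure."""
    return interval_s * unhealthy_threshold

print(detection_window(2, 3))   # 6 seconds at a 2s probe interval
print(detection_window(5, 3))   # 15 seconds at a 5s probe interval
```

Every sticky session on the failed instance keeps receiving errors for this entire window before rebinding, which is why the window is tuned aggressively short despite the false-positive risk.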

Hybrid State Management

The practical solution is hybrid state management: critical state (checkout flow position, payment tokens, authentication state) writes to a centralized cache or database within 1-2ms of changes, while ephemeral hot data (recommendation scores, recently viewed items) remains local and is reconstructed on rebind. The cold-cache penalty on rebind is 10-50ms for the first few requests as local caches warm. This hybrid approach preserves sticky session latency benefits while protecting critical state from failure.
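The split described above can be sketched as a session wrapper that write-throughs critical keys and lazily rebuilds ephemeral ones. The class, key names, and the plain dict standing in for a centralized cache (e.g. Redis) are all illustrative assumptions:

```python
class HybridSession:
    """Sketch of hybrid state management: critical keys write through to a
    shared store that survives instance loss; ephemeral keys stay in-process
    and are rebuilt on the cold-cache path after a rebind."""

    CRITICAL = {"checkout_step", "payment_token", "auth_state"}

    def __init__(self, session_id, shared_store):
        self.session_id = session_id
        self.shared = shared_store   # stands in for Redis / a database
        self.local = {}              # lost on failover, rebuilt lazily

    def put(self, key, value):
        if key in self.CRITICAL:
            # Write-through: critical state hits the shared store immediately.
            self.shared[(self.session_id, key)] = value
        else:
            self.local[key] = value

    def get(self, key, rebuild=lambda k: None):
        if key in self.CRITICAL:
            return self.shared.get((self.session_id, key))
        if key not in self.local:
            # Cold-cache penalty lives here: recompute ephemeral data.
            self.local[key] = rebuild(key)
        return self.local[key]

store = {}
s = HybridSession("u1", store)
s.put("payment_token", "tok_123")
s.put("recently_viewed", ["sku-9"])

# Simulate failover: a fresh instance sees only the shared store.
s2 = HybridSession("u1", store)
print(s2.get("payment_token"))                          # tok_123
print(s2.get("recently_viewed", rebuild=lambda k: []))  # [] (rebuilt, not restored)
```

The design choice is deliberate: only writes to `CRITICAL` keys pay the shared-store round trip, so the common request path keeps the local-memory latency that motivated stickiness in the first place.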

Key Insight: Session format changes between versions create incompatibility errors during the affinity tail period when both old and new versions serve traffic. Test session format backwards compatibility before deployment, or plan for longer drain windows to avoid cross-version traffic.
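One common way to tolerate cross-version traffic during the affinity tail is a versioned session decoder on the new fleet that upgrades old-format blobs on read. The field names and version shapes below are purely illustrative:

```python
def load_session(blob: dict) -> dict:
    """Sketch of version-tolerant session decoding for the window when
    old and new instances serve traffic side by side."""
    version = blob.get("version", 1)
    if version == 1:
        # Hypothetical v1 shape stored a flat cart list; upgrade to v2.
        return {"version": 2, "cart": {"items": blob.get("cart", [])}}
    if version == 2:
        return blob
    raise ValueError(f"unknown session version {version}")

print(load_session({"version": 1, "cart": ["sku-1"]}))
```

Readers must tolerate at least one version back; writers should emit the old format until the drain completes if old instances might still read the session.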
💡 Key Takeaways
Blue-green requires reducing TTL pre-cutover and draining for 2-3x TTL (10-90 minutes) to drop affinity tail below 1-5%
Rolling restarts with WebSockets require max connection age (30-60 min) and staggered restarts capping disconnects at 5-10%
Failover on health failure (6-15s for 3 failures) drops all sessions on unhealthy instance; no graceful drain possible
Hybrid approach: checkpoint critical state to shared store within 1-2ms; accept 10-50ms cold-cache penalty on rebind
📌 Interview Tips
1. Walk through blue-green: reduce 30-min TTL to 5-min, drain for 10-15 minutes, affinity tail decays exponentially
2. Explain WebSocket restart math: 30-60 min connections, 10% restart every 5 min = 1 hour total drain time
3. Describe hybrid state: payment tokens to Redis in 1-2ms (critical), recommendation scores stay local (ephemeral)