Load Balancing • Sticky Sessions
Deployment and Failover Strategies: Managing State Through Change
Deployments and failures are where sticky sessions impose the highest operational cost. In a stateless system, you can instantly cut over traffic or fail over to healthy instances because every backend is interchangeable. With sticky sessions, each instance holds unique user state that will be lost unless you carefully orchestrate the transition.
For blue-green deployments, the standard approach is to reduce the affinity TTL before cutover and begin draining the old (blue) fleet. If your normal TTL is 30 minutes, you might reduce it to 5 minutes an hour before deployment, then stop sending new sessions to blue instances while allowing existing sessions to complete. The "affinity tail" (the percentage of traffic still hitting the old fleet) decays exponentially but can take 2 to 3 times the TTL to drop below 1 to 5 percent. During this window you are running both the blue and green fleets at partial utilization, which increases cost, and any session format incompatibility between versions will surface as errors.
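A rough way to see where the "2 to 3 times the TTL" figure comes from is to model the tail as exponential decay. The sketch below assumes session affinity expires with a mean lifetime equal to the TTL, which is a simplification, and the function names are illustrative rather than taken from any particular load balancer.

```python
import math

def affinity_tail(minutes_since_cutover: float, ttl_minutes: float) -> float:
    """Fraction of traffic still pinned to the old (blue) fleet.

    Simplified model: affinity expires roughly exponentially with a
    mean lifetime equal to the affinity TTL.
    """
    return math.exp(-minutes_since_cutover / ttl_minutes)

def drain_time(ttl_minutes: float, target_tail: float) -> float:
    """Minutes after cutover until the tail drops below target_tail."""
    return -ttl_minutes * math.log(target_tail)

# Using the reduced 5-minute TTL from the example above:
print(f"{affinity_tail(10, 5):.1%} of traffic still on blue after 10 min")  # ~13.5%
print(f"{drain_time(5, 0.05):.0f} min to drop below 5%")                    # ~15 min (3x TTL)
print(f"{drain_time(5, 0.01):.0f} min to drop below 1%")                    # ~23 min
```

Under this model, dropping below 5 percent takes about 3 TTLs and dropping below 1 percent closer to 4 to 5 TTLs, which is why the drain window (and the double-fleet cost) stretches well past the TTL itself.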
Rolling restarts with long-lived connections such as WebSockets present an even harder problem. If connections average 30 to 60 minutes and you restart 10 percent of your fleet every 5 minutes, the last instance won't complete its drain for an hour. Production systems enforce a maximum connection age (force-disconnecting after 30 to 60 minutes) and stagger restarts to cap concurrent disconnects at 5 to 10 percent of total connections, accepting that users will see brief reconnection interruptions.
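A staggered restart loop might look roughly like the sketch below. The helpers `get_instances`, `connection_count`, and `drain_and_restart` are hypothetical stand-ins for whatever orchestration and metrics APIs you actually use; the point is the batching logic that keeps each wave under the disconnect budget.

```python
import time

MAX_DISCONNECT_FRACTION = 0.10   # cap each wave at ~10% of live connections
WINDOW_SECONDS = 5 * 60          # one restart batch every 5 minutes
MAX_CONNECTION_AGE = 45 * 60     # force-disconnect ceiling, enforced by the socket
                                 # layer rather than this loop (shown for context)

def rolling_restart(get_instances, connection_count, drain_and_restart):
    """Restart every instance without disconnecting more than the budget per window."""
    instances = sorted(get_instances(), key=connection_count)   # lightest-loaded first
    total = sum(connection_count(i) for i in instances)
    budget = MAX_DISCONNECT_FRACTION * total

    batch, batch_load = [], 0
    for inst in instances:
        load = connection_count(inst)
        if batch and batch_load + load > budget:
            for i in batch:
                drain_and_restart(i)        # graceful close, then restart
            time.sleep(WINDOW_SECONDS)      # let disconnected users rebind first
            batch, batch_load = [], 0
        batch.append(inst)
        batch_load += load
    for i in batch:                          # final partial batch
        drain_and_restart(i)
```

A single very hot instance can still exceed the budget on its own; in practice the max-connection-age policy keeps per-instance load bounded enough that this rarely matters.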
Failover to handle instance or availability-zone failures must be immediate, which means accepting session loss. The load balancer detects unhealthy backends via failed health checks (typically three consecutive failures over 6 to 15 seconds) and stops routing new requests to them. Existing sessions on that instance are lost unless the application has implemented active replication or write-through checkpointing to a shared store. Netflix and Amazon retail address this with hybrid approaches: critical state (checkout flow position, payment tokens) is written to DynamoDB or a cache within 1 to 2 milliseconds, while ephemeral hot data (product recommendation scores, recently viewed items) stays local and is reconstructed on rebind, with a cold cache penalty of 10 to 50 milliseconds for the first few requests.
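One way to sketch the hybrid pattern is a per-session object that writes critical fields through to a shared store on every update and lazily rebuilds a local hot cache after rebind. `shared_store` and `recompute_hot_data` are placeholders standing in for something like DynamoDB/Redis and a recommendation service, not Amazon's or Netflix's actual interfaces.

```python
class SessionState:
    """Hybrid session state: critical fields write-through, hot data local."""

    def __init__(self, session_id, shared_store, recompute_hot_data):
        self.session_id = session_id
        self.shared_store = shared_store              # stand-in for DynamoDB/Redis (put/get)
        self.recompute_hot_data = recompute_hot_data  # stand-in for recommendation logic
        self.hot_cache = {}                           # local only: scores, recently viewed

    def update_critical(self, key, value):
        """Checkout position, payment tokens: persist before acknowledging (~1-2 ms budget)."""
        self.shared_store.put(f"{self.session_id}:{key}", value)

    def get_hot(self, key):
        """Ephemeral hot data: cheap when cached locally, recomputed on a cold cache."""
        if key not in self.hot_cache:                 # cold after rebinding to a new instance
            self.hot_cache[key] = self.recompute_hot_data(self.session_id, key)  # 10-50 ms hit
        return self.hot_cache[key]

    def restore_after_failover(self, keys):
        """New instance pulls critical state from the shared store; hot cache starts empty."""
        return {k: self.shared_store.get(f"{self.session_id}:{k}") for k in keys}
```

The design choice is the split itself: anything whose loss is user-visible and unrecoverable goes through `update_critical`, while anything that can be recomputed tolerates the cold-cache penalty after failover.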
💡 Key Takeaways
• Blue-green deployments require reducing the affinity TTL pre-cutover and draining the old fleet for 2 to 3 times the TTL (10 to 90 minutes) to drop the affinity tail below 1 to 5 percent
• Rolling restarts with WebSockets require an enforced maximum connection age (30 to 60 minutes) and staggered restarts capping concurrent disconnects at 5 to 10 percent to avoid spikes in user impact
• Failover on health check failure (6 to 15 seconds for three consecutive failures) drops all sessions on the unhealthy instance unless they are replicated or checkpointed
• Hybrid state management checkpoints critical state to a shared store within 1 to 2 milliseconds while keeping hot caches local, accepting a 10 to 50 millisecond cold cache penalty on rebind
• Session format changes between versions create incompatibility errors during the affinity tail period, when both old and new versions serve traffic for the same users
📌 Examples
Amazon retail checkout flow writes cart and shipping address to DynamoDB on every update (under 2 ms p99), allowing failover to any instance with only a product-recommendation cache cold-start penalty
Netflix API gateway enforces a 45-minute maximum streaming session age, staggering rolling restarts so only 8 percent of users reconnect in any 5-minute window
Microsoft Azure Application Gateway with a 5-minute drain timeout: instances marked unhealthy stop receiving new sessions but continue serving existing connections until the timeout expires or they complete naturally