
API Gateway Failure Modes and Resilience Patterns

API Gateway failures can halt all traffic to a system within seconds, which makes resilience patterns critical. The most severe failure mode is a single-point-of-failure outage: one gateway cluster going down stops all client requests. Mitigation requires multi-availability-zone deployment with a stateless data plane, health-aware DNS that fails over to alternate regions in under 60 seconds, and validation that control plane outages (configuration management) do not degrade the actively running data plane.

Retry storms and amplification turn partial outages into total failures. When backends slow down, clients retry and the gateway may also retry, creating 2x to 10x load multiplication; a single gateway retry plus two client retries means one user action generates four backend requests. Enforce retry budgets that allow at most one retry for idempotent operations (GET, PUT, DELETE) with jittered exponential backoff, and zero retries for non-idempotent operations (POST). Circuit breakers per upstream should open after a 50 percent error rate over a 20-request window, blocking further attempts for a 30 to 60 second cool-down period. A minimal sketch of this retry budget appears after this overview.

Cache stampedes occur when a hot key expires and 10,000 requests simultaneously hit the backend, causing 30-second latency spikes and potential database overload. Request coalescing collapses concurrent identical requests into one backend call. Stale-while-revalidate serves expired cache entries for up to 60 seconds while refreshing them asynchronously in the background. Jittered TTLs (15 minutes plus or minus a random 2 minutes) prevent synchronized expiration.

Protocol translation quirks, such as HTTP/2 to HTTP/1.1, can break streaming semantics and flow control. Long-lived streams require gateways that do not buffer entire responses or time out idle connections after as little as 30 seconds.

Certificate expiry at the edge causes global outages within minutes, since every client sees TLS handshake failures. Automate certificate issuance and renewal with alerts 30 days before expiration, and run monthly chaos drills that shut down gateway clusters to validate failover.
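A minimal sketch of that retry budget in Go, assuming a client-side helper wrapped around the upstream call; the function names, the 200 ms base backoff, and the example URL are illustrative, not part of any particular gateway's API.

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"
)

// isIdempotent reports whether a request may safely be retried.
// Non-idempotent methods such as POST get zero retries.
func isIdempotent(method string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodPut, http.MethodDelete:
		return true
	}
	return false
}

// doWithRetry enforces a budget of at most one retry for idempotent
// requests, with jittered exponential backoff before the retry.
// Requests with a body would also need req.GetBody to replay it;
// that is omitted in this sketch.
func doWithRetry(client *http.Client, req *http.Request) (*http.Response, error) {
	maxRetries := 0
	if isIdempotent(req.Method) {
		maxRetries = 1 // budget: one retry, idempotent methods only
	}
	base := 200 * time.Millisecond
	for attempt := 0; ; attempt++ {
		resp, err := client.Do(req)
		retryable := err != nil || resp.StatusCode >= 500
		if !retryable || attempt >= maxRetries {
			return resp, err
		}
		if resp != nil {
			resp.Body.Close() // discard the failed attempt before retrying
		}
		// Full jitter: sleep a random duration in [0, base * 2^attempt)
		// so synchronized clients do not retry in lockstep.
		time.Sleep(time.Duration(rand.Int63n(int64(base << attempt))))
	}
}

func main() {
	// Hypothetical upstream; any idempotent GET behaves the same way.
	req, err := http.NewRequest(http.MethodGet, "https://api.example.com/v1/items", nil)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := doWithRetry(http.DefaultClient, req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("status:", resp.Status)
}
```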
💡 Key Takeaways
Single point of failure requires multi-availability-zone, stateless deployment with health-aware DNS that fails over to alternate regions in under 60 seconds
Retry amplification (client retries times gateway retries) multiplies load 2x to 10x during outages; enforce a maximum of one retry for idempotent requests, with jittered backoff
A cache stampede when a hot key expires causes 10,000 concurrent backend requests and 30-second latency spikes; use request coalescing and stale-while-revalidate for up to 60 seconds
Circuit breakers per upstream open after a 50 percent error rate over a 20-request window, with a 30 to 60 second cool-down, preventing cascading failures (a minimal breaker sketch follows this list)
Protocol translation from HTTP/2 to HTTP/1.1 can break streaming and flow control; avoid gateways that buffer entire responses or time out idle connections in under 5 minutes
TLS certificate expiry causes global outages within minutes since every client sees handshake failures; automate renewal with 30-day alerts and run monthly chaos drills that shut down gateway clusters
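The breaker parameters above map onto a small amount of state per upstream. A minimal sketch in Go, assuming a rolling window of the last 20 outcomes and a fixed cool-down; the Breaker type and its methods are illustrative rather than a specific gateway's API, and a production breaker would usually add a half-open probe state after the cool-down.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrOpen is returned while the breaker is refusing calls to an upstream.
var ErrOpen = errors.New("circuit breaker open")

// Breaker tracks the most recent outcomes for one upstream and opens when
// the error rate over the window reaches 50%, blocking calls for a cool-down.
type Breaker struct {
	mu        sync.Mutex
	window    int           // e.g. the last 20 requests
	cooldown  time.Duration // e.g. 30 to 60 seconds
	outcomes  []bool        // true = failed request
	openUntil time.Time
}

func NewBreaker(window int, cooldown time.Duration) *Breaker {
	return &Breaker{window: window, cooldown: cooldown}
}

// allow reports whether a request to the upstream may proceed.
func (b *Breaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().After(b.openUntil)
}

// record notes one outcome and opens the breaker at a 50% failure rate.
func (b *Breaker) record(failed bool) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.outcomes = append(b.outcomes, failed)
	if len(b.outcomes) > b.window {
		b.outcomes = b.outcomes[1:] // keep only the most recent window
	}
	if len(b.outcomes) < b.window {
		return // not enough samples yet
	}
	failures := 0
	for _, f := range b.outcomes {
		if f {
			failures++
		}
	}
	if failures*2 >= len(b.outcomes) { // 50% or more failed
		b.openUntil = time.Now().Add(b.cooldown)
		b.outcomes = b.outcomes[:0] // start a fresh window after the cool-down
	}
}

// Call wraps one upstream request with the breaker.
func (b *Breaker) Call(fn func() error) error {
	if !b.allow() {
		return ErrOpen // fail fast instead of piling onto a sick upstream
	}
	err := fn()
	b.record(err != nil)
	return err
}

func main() {
	breaker := NewBreaker(20, 45*time.Second)
	err := breaker.Call(func() error {
		// Hypothetical upstream request goes here.
		return nil
	})
	fmt.Println(err)
}
```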
📌 Examples
E-commerce site cache stampede: a popular product key expires, 50K requests hit the inventory service simultaneously, the database connection pool is exhausted, and 503 errors last 45 seconds until the circuit breaker opens (the coalescing sketch after these examples targets exactly this failure)
Payment gateway retry storm: backend latency increases to 2 seconds, clients retry after 1 second, the gateway also retries, and effective load becomes 8x, causing a total outage until retries are disabled
Video streaming service certificate expiry: automated renewal failed silently, the cert expired at 2am, 100% of users were unable to connect globally, and the outage lasted 45 minutes until a manual cert deployment
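The e-commerce stampede above is exactly what request coalescing and stale-while-revalidate are designed to absorb. A minimal Go sketch, assuming golang.org/x/sync/singleflight for coalescing and a simple in-memory cache; the Cache type, the fetch callback, and the product key are illustrative, not a particular gateway's cache API.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

// entry is one cached value plus its (jittered) expiry time.
type entry struct {
	value     string
	expiresAt time.Time
}

// Cache coalesces concurrent misses per key and serves stale entries for a
// bounded grace period while refreshing them in the background.
type Cache struct {
	mu    sync.RWMutex
	data  map[string]entry
	group singleflight.Group // collapses concurrent fetches for the same key
	ttl   time.Duration      // e.g. 15 minutes
	stale time.Duration      // e.g. serve expired entries for up to 60 seconds
}

func NewCache(ttl, stale time.Duration) *Cache {
	return &Cache{data: map[string]entry{}, ttl: ttl, stale: stale}
}

// jitteredTTL spreads expirations by +/- 2 minutes around the base TTL so
// hot keys do not all expire at the same instant.
func (c *Cache) jitteredTTL() time.Duration {
	jitter := time.Duration(rand.Int63n(int64(4*time.Minute))) - 2*time.Minute
	return c.ttl + jitter
}

// Get serves fresh hits directly, serves stale hits while revalidating in the
// background, and coalesces concurrent cold misses into one backend call.
func (c *Cache) Get(key string, fetch func() (string, error)) (string, error) {
	c.mu.RLock()
	e, ok := c.data[key]
	c.mu.RUnlock()

	now := time.Now()
	if ok && now.Before(e.expiresAt) {
		return e.value, nil // fresh hit
	}
	if ok && now.Before(e.expiresAt.Add(c.stale)) {
		// Stale-while-revalidate: answer immediately with the expired value
		// and refresh asynchronously (still coalesced per key).
		go c.refresh(key, fetch)
		return e.value, nil
	}
	// Cold miss: every concurrent caller for this key shares one fetch.
	v, err, _ := c.group.Do(key, func() (interface{}, error) {
		val, err := fetch()
		if err != nil {
			return "", err
		}
		c.set(key, val)
		return val, nil
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}

func (c *Cache) refresh(key string, fetch func() (string, error)) {
	c.group.Do(key, func() (interface{}, error) {
		val, err := fetch()
		if err == nil {
			c.set(key, val)
		}
		return val, err
	})
}

func (c *Cache) set(key, val string) {
	c.mu.Lock()
	c.data[key] = entry{value: val, expiresAt: time.Now().Add(c.jitteredTTL())}
	c.mu.Unlock()
}

func main() {
	c := NewCache(15*time.Minute, 60*time.Second)
	v, err := c.Get("product:12345", func() (string, error) {
		// Hypothetical inventory-service call.
		return "in_stock", nil
	})
	fmt.Println(v, err)
}
```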