API Gateway Failure Modes and Resilience Patterns
Single Point of Failure
The gateway is the only path to all services. If it fails, everything fails. Mitigation requires multiple gateway instances behind a load balancer, with health checks removing unhealthy instances. Deploy across availability zones so a zone failure does not take down all gateways. Target 99.99% gateway availability, which allows about 52.6 minutes of downtime per year.
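The availability arithmetic above can be checked with a short sketch. The function names are hypothetical, and the redundancy formula assumes instance failures are fully independent, which real deployments only approximate because instances share dependencies such as the load balancer, the network, and configuration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability: float) -> float:
    """Return the downtime allowed per year, in minutes, for a target availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def combined_availability(instance_availability: float, instances: int) -> float:
    """Availability of N redundant instances.

    Assumption: failures are independent, so the group is down only
    when every instance is down at the same time.
    """
    if instances < 1:
        raise ValueError("need at least one instance")
    return 1.0 - (1.0 - instance_availability) ** instances


if __name__ == "__main__":
    print(f"99.99% budget: {downtime_budget_minutes(0.9999):.2f} min/year")
    print(f"two 99% instances: {combined_availability(0.99, 2):.4%}")
```

Under the independence assumption, two instances at 99% each reach the 99.99% target on paper. Correlated failures make the real figure lower, which is one reason to spread instances across availability zones.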
Gateway as Bottleneck
All traffic flows through the gateway, making it a potential throughput bottleneck. A gateway handling 10,000 RPS with 10ms processing time per request has about 100 requests in flight at any moment (10,000 × 0.010 s) and needs significant CPU. Horizontal scaling adds instances, but state like rate limit counters must be shared, typically via Redis. Complex aggregation logic increases per-request processing time and reduces throughput.
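To illustrate why rate limit counters must be shared, here is a minimal fixed-window limiter sketch. The class names and the key format are hypothetical. InMemoryStore is a stand-in for a shared store such as Redis, where the same step would typically be an atomic INCR with an expiry on the key. Fixed windows are only one possible algorithm; the text does not say which one the gateway uses.

```python
import time
from typing import Callable, Dict


class InMemoryStore:
    """Stand-in for a shared counter store (assumption: Redis in production)."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = {}

    def incr(self, key: str) -> int:
        # Increment and return the new value, like an atomic counter would.
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key]


class FixedWindowLimiter:
    """Allow at most `limit` requests per client in each time window."""

    def __init__(
        self,
        store: InMemoryStore,
        limit: int,
        window_seconds: int,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._store = store
        self._limit = limit
        self._window = window_seconds
        self._clock = clock

    def allow(self, client_id: str) -> bool:
        # Every gateway instance derives the same key for the same window,
        # so all instances count against one shared budget.
        window_index = int(self._clock() // self._window)
        key = f"ratelimit:{client_id}:{window_index}"
        return self._store.incr(key) <= self._limit


if __name__ == "__main__":
    shared = InMemoryStore()
    gateway_a = FixedWindowLimiter(shared, limit=3, window_seconds=60)
    gateway_b = FixedWindowLimiter(shared, limit=3, window_seconds=60)
    for gw in (gateway_a, gateway_b, gateway_a, gateway_b):
        print(gw.allow("client-1"))
```

If each gateway instance kept its own counter, a client could send up to the limit through every instance, so the effective limit would grow with the number of instances.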
Backend Failures
When backend services fail, the gateway must respond gracefully. Circuit breakers stop sending requests to failing services. Timeouts prevent gateway threads from blocking on slow backends. Fallback responses return cached data or error messages without waiting. Without these patterns, a single slow service exhausts gateway connection pools and degrades all traffic.
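The circuit breaker and fallback behavior described above can be sketched as follows. This is a simplified, single-threaded illustration, not a production implementation: the class name, thresholds, and the half-open policy (one trial call after the reset timeout) are assumptions, and a real gateway would add locking and per-backend instances.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Fail fast to a fallback after repeated backend failures."""

    def __init__(
        self,
        failure_threshold: int = 5,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                # Open: skip the backend entirely and answer from the fallback.
                return fallback()
            # Reset timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each backend request, for example `breaker.call(fetch_orders, lambda: cached_orders)`, where `fetch_orders` and `cached_orders` are hypothetical names. The backend call itself should still carry its own timeout so a slow response counts as a failure instead of blocking a gateway thread.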
5-15ms overhead. Balance protection against performance requirements.
Configuration Failures
Bad routing configuration can send all traffic to the wrong service or create routing loops. Use configuration validation before deployment, canary configuration rollouts, and automatic rollback on error rate spikes. Store configuration in version control with code review requirements.
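A pre-deployment validation step might look like the sketch below. The configuration model is hypothetical: routes map a route name to a target, and a target is either a known backend service or another route (a forward). Real gateways use richer formats, so treat this only as an illustration of catching unknown targets and routing loops before rollout.

```python
from typing import Dict, List, Set


def validate_routes(routes: Dict[str, str], services: Set[str]) -> List[str]:
    """Return a list of human-readable errors; an empty list means the config passed."""
    errors: List[str] = []
    for start in routes:
        seen = [start]
        current = start
        while True:
            target = routes[current]
            if target in services:
                break  # chain ends at a real backend service
            if target not in routes:
                errors.append(f"{start}: unknown target {target!r}")
                break
            if target in seen:
                path = " -> ".join(seen + [target])
                errors.append(f"{start}: routing loop via {path}")
                break
            seen.append(target)
            current = target
    return errors


if __name__ == "__main__":
    routes = {
        "/orders": "orders-svc",
        "/a": "/b",
        "/b": "/a",
        "/x": "missing-svc",
    }
    for problem in validate_routes(routes, services={"orders-svc"}):
        print(problem)
```

Running a check like this in the review pipeline rejects a bad configuration before any traffic sees it; canary rollouts and automatic rollback then cover the errors that static validation cannot detect.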
Cascading Timeouts
The client timeout must exceed the gateway timeout, which must exceed the backend timeout. If the client times out at 10s, the gateway at 15s, and the backend at 20s, the client gives up while the gateway and backend continue wasting resources on abandoned requests. Proper ordering: backend 5s, gateway 8s, client 10s.
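The ordering rule can be expressed as a small check that could run alongside configuration validation. The function name and the list-of-pairs input are assumptions made for illustration; the timeout values are the ones from the paragraph above.

```python
from typing import List, Tuple


def check_timeout_chain(timeouts: List[Tuple[str, float]]) -> List[str]:
    """Check that each hop's timeout exceeds the next hop's.

    `timeouts` is ordered from the outermost caller (client) to the
    innermost callee (backend), with values in seconds.
    """
    problems: List[str] = []
    for (outer_name, outer), (inner_name, inner) in zip(timeouts, timeouts[1:]):
        if outer <= inner:
            problems.append(
                f"{outer_name} timeout ({outer}s) must exceed "
                f"{inner_name} timeout ({inner}s)"
            )
    return problems


if __name__ == "__main__":
    bad = [("client", 10.0), ("gateway", 15.0), ("backend", 20.0)]
    good = [("client", 10.0), ("gateway", 8.0), ("backend", 5.0)]
    print(check_timeout_chain(bad))   # two violations
    print(check_timeout_chain(good))  # no violations
```

With the correct ordering, the innermost call fails first, so each layer still has time to return an error or fallback before its own caller gives up.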