Testing and Validating Degradation Behavior
Chaos Engineering for Degradation
The only reliable way to verify that degradation works is to test it against real failures. Chaos engineering, the practice of deliberately injecting failures into a running system, validates degradation paths: kill service instances, introduce network latency, exhaust connection pools. Start with non-critical services in staging, then expand to production during low-traffic periods.
Failure Injection Techniques
Service termination: kill processes to test failover. Network partition: block traffic between services, for example with iptables rules. Latency injection: add a fixed delay, such as 500 ms, to calls. Error injection: return HTTP 500 errors for a percentage of requests. Resource exhaustion: consume memory or CPU until the service is starved. Each technique reveals different degradation behaviors.
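As a rough illustration of latency and error injection at the application level, the sketch below wraps a dependency call so that every call is delayed and a fraction of calls fail. The names inject_faults, InjectedFault, and get_recommendations are hypothetical and exist only for this example. A real setup would more likely inject faults at the network or proxy layer.

```python
import random
import time


class InjectedFault(Exception):
    """Stands in for an HTTP 500 returned by the dependency (hypothetical)."""


def inject_faults(func, error_rate=0.0, latency_s=0.0,
                  rng=random.random, sleep=time.sleep):
    """Wrap func so each call is delayed and a fraction of calls fail.

    error_rate: probability in [0, 1] that a call raises InjectedFault.
    latency_s:  fixed delay added before every call, e.g. 0.5 for 500 ms.
    rng, sleep: injectable so tests can run deterministically and instantly.
    """
    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)              # latency injection
        if rng() < error_rate:
            raise InjectedFault("injected 500")  # error injection
        return func(*args, **kwargs)
    return wrapper


def get_recommendations(user_id):
    """Hypothetical dependency call used only for this example."""
    return ["item-1", "item-2"]
```

Wrapping the call as inject_faults(get_recommendations, error_rate=0.1, latency_s=0.5) would fail roughly 10% of requests and add 500 ms to each one, which is usually enough to show whether timeouts and fallbacks trigger as intended.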
Degradation Test Scenarios
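Define explicit test scenarios for each degradation path. For recommendation degradation, for example: inject a failure into the recommendation service, verify that the circuit breaker trips within 30 seconds, verify that the fallback returns popular products, verify that checkout is unaffected, and verify that normal service resumes when the dependency is restored.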
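The recommendation scenario above can be sketched as an automated check. The example below is a minimal sketch under stated assumptions: CircuitBreaker is a simplified, hypothetical breaker (not a particular library's API), FakeClock simulates time so the test runs instantly, and the threshold of 3 failures and the 60-second reset timeout are illustrative values.

```python
import time


class CircuitBreaker:
    """Simplified, hypothetical breaker: opens after N consecutive failures."""

    def __init__(self, failure_threshold=3, reset_timeout_s=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    @property
    def is_open(self):
        return self.opened_at is not None

    def call(self, primary, fallback):
        if self.is_open and self.clock() - self.opened_at < self.reset_timeout_s:
            return fallback()            # open: skip the primary entirely
        try:
            result = primary()           # closed, or a half-open trial call
        except Exception:
            self.failures += 1
            if self.is_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.opened_at = None            # success closes the breaker
        return result


class FakeClock:
    """Simulated clock so the scenario does not wait in real time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


POPULAR = ["popular-1", "popular-2"]     # assumed fallback content


def run_scenario():
    clock = FakeClock()
    breaker = CircuitBreaker(clock=clock)
    service = {"up": False}              # step 1: inject the failure

    def primary():
        if not service["up"]:
            raise ConnectionError("recommendation service down")
        return ["personalized-1"]

    def fallback():
        return POPULAR

    def checkout():
        return "order-confirmed"         # assumed independent of recommendations

    results = {}
    start = clock()
    while not breaker.is_open and clock() - start < 30:
        breaker.call(primary, fallback)  # one request per simulated second
        clock.advance(1.0)
    results["tripped_within_30s"] = breaker.is_open and clock() - start <= 30
    results["fallback_popular"] = breaker.call(primary, fallback) == POPULAR
    results["checkout_unaffected"] = checkout() == "order-confirmed"

    service["up"] = True                 # restore the dependency
    clock.advance(60.0)                  # wait out the reset timeout
    recovered = breaker.call(primary, fallback)
    results["recovered"] = recovered == ["personalized-1"] and not breaker.is_open
    return results
```

Calling run_scenario() returns one boolean per verification step, so a failing step points directly at the part of the degradation path that broke. In a real test the checkout check would exercise the actual checkout flow while the fault is active.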
Game Days
Game days are planned exercises in which teams practice incident response. Schedule them monthly or quarterly. Create realistic scenarios, such as "Database primary fails during peak traffic." Time each phase of the response against your SLOs (Service Level Objectives). These exercises reveal operational gaps such as missing runbooks and unclear escalation paths.
Monitoring Degradation State
Use traffic light indicators per feature: green (normal), yellow (degraded), red (critical). Track the share of requests served by the fallback versus the primary path. Alert when a feature stays degraded for more than 5 minutes (warning) or 15 minutes (page).
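One possible shape for this tracking is sketched below. FeatureHealth and its methods are hypothetical names, and two choices are assumptions rather than anything stated above: a feature turns red once it passes the 15-minute paging threshold, and a single successful primary call counts as recovery. Production monitoring would normally use a windowed fallback ratio to avoid flapping.

```python
from dataclasses import dataclass
from typing import Optional

WARN_AFTER_S = 5 * 60     # warn after 5 minutes of continuous degradation
PAGE_AFTER_S = 15 * 60    # page after 15 minutes


@dataclass
class FeatureHealth:
    """Per-feature degradation tracker (hypothetical, for illustration)."""
    name: str
    primary_count: int = 0
    fallback_count: int = 0
    degraded_since: Optional[float] = None  # start of the current degraded run

    def record(self, used_fallback: bool, now: float) -> None:
        """Record one request; now is a timestamp in seconds."""
        if used_fallback:
            self.fallback_count += 1
            if self.degraded_since is None:
                self.degraded_since = now
        else:
            self.primary_count += 1
            self.degraded_since = None      # assumption: one success = recovered

    def fallback_ratio(self) -> float:
        """Share of all recorded requests that were served by the fallback."""
        total = self.primary_count + self.fallback_count
        return self.fallback_count / total if total else 0.0

    def degraded_for(self, now: float) -> float:
        """Seconds spent in the current degraded run, or 0.0 if healthy."""
        if self.degraded_since is None:
            return 0.0
        return now - self.degraded_since

    def alert_level(self, now: float) -> Optional[str]:
        """Return "page", "warn", or None based on degradation duration."""
        duration = self.degraded_for(now)
        if duration > PAGE_AFTER_S:
            return "page"
        if duration > WARN_AFTER_S:
            return "warn"
        return None

    def light(self, now: float) -> str:
        """Map state to a traffic light; the red rule is an assumption."""
        if self.degraded_since is None:
            return "green"
        return "red" if self.alert_level(now) == "page" else "yellow"
```

A dashboard could poll light() for each feature, while an alerting job polls alert_level() and routes "warn" and "page" to different channels.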