
Testing and Validating Degradation Behavior

Chaos Engineering for Degradation

The only way to verify that degradation works is to test it with real failures. Chaos engineering (deliberately injecting failures into a running system) validates degradation paths: kill service instances, introduce network latency, exhaust connection pools. Start with non-critical services in staging, then expand to production during low-traffic periods.

Failure Injection Techniques

Service termination: kill processes to test failover. Network partition: block traffic between services using iptables. Latency injection: add a fixed delay, such as 500ms, to each call. Error injection: return 500 errors for a percentage of requests. Resource exhaustion: consume memory or CPU. Each technique reveals different behaviors, so test degradation paths against more than one.
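Latency and error injection can be sketched at the application level as a wrapper around a dependency call. This is a minimal illustration, not a production fault injector: `with_faults`, `InjectedFault`, and `fetch_recommendations` are hypothetical names, and the exception stands in for an HTTP 500 from the dependency.

```python
import random
import time


class InjectedFault(Exception):
    """Hypothetical error type standing in for an HTTP 500 from a dependency."""


def with_faults(func, error_rate=0.0, latency_s=0.0, rng=None, sleep=time.sleep):
    """Wrap func so every call is delayed and a fraction of calls fail.

    error_rate is a probability in [0, 1]; latency_s is a fixed delay in seconds.
    rng and sleep are injectable so tests can be deterministic and fast.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # latency injection
        if rng.random() < error_rate:
            raise InjectedFault("injected failure")  # error injection
        return func(*args, **kwargs)

    return wrapper


def fetch_recommendations(user_id):
    """Placeholder for a real call to a recommendation service."""
    return ["item-1", "item-2"]


# Example: 500ms of added latency and roughly 10% injected errors.
flaky_fetch = with_faults(fetch_recommendations, error_rate=0.10, latency_s=0.5)
```

Process termination, network partitions, and resource exhaustion happen outside the application, so they are not covered by a wrapper like this.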

⚠️ Key Trade-off: Production chaos testing risks real user impact. Mitigate with percentage controls (1% of users), time bounds (5-minute experiments), and automatic rollback triggers.
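The three mitigations can be combined in a small guard object that decides whether a given request should receive an injected fault. This is a sketch under assumptions: the `ChaosExperiment` class, its parameter names, and the 5% error-rate abort threshold are illustrative and not taken from any particular chaos tool.

```python
import hashlib
import time


class ChaosExperiment:
    """Guards fault injection with a user percentage, a time bound, and an abort trigger."""

    def __init__(self, name, user_percent=1.0, max_duration_s=300,
                 max_error_rate=0.05, clock=time.monotonic):
        self.name = name
        self.user_percent = user_percent      # e.g. 1.0 means 1% of users
        self.max_duration_s = max_duration_s  # e.g. 300 means a 5-minute experiment
        self.max_error_rate = max_error_rate  # assumed rollback threshold
        self._clock = clock
        self._started_at = clock()
        self.aborted = False

    def _bucket(self, user_id):
        # Stable hash so the same user is always in or out of the experiment.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return (int(digest[:8], 16) % 10000) / 100.0  # 0.00 to 99.99

    def is_active(self):
        elapsed = self._clock() - self._started_at
        return not self.aborted and elapsed < self.max_duration_s

    def targets(self, user_id):
        """True if this user's requests should receive injected faults."""
        return self.is_active() and self._bucket(user_id) < self.user_percent

    def report_error_rate(self, observed_error_rate):
        """Automatic rollback: stop injecting once user-facing errors exceed the limit."""
        if observed_error_rate > self.max_error_rate:
            self.aborted = True
```

Hashing the user ID rather than sampling per request keeps the affected population fixed for the duration of the experiment, which makes impact easier to reason about.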

Degradation Test Scenarios

Define an explicit test scenario for each degradation path. For recommendation degradation: inject a failure, verify the circuit breaker trips within 30 seconds, verify the fallback returns popular products, verify checkout is unaffected, and verify recovery once the service is restored.
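The recommendation scenario might be written as an automated test along these lines. Everything here is an assumption for illustration: `CircuitBreaker` is a deliberately simple count-based breaker, the simulated clock advances one second per request, and `POPULAR_PRODUCTS` is a placeholder fallback. The checkout check is omitted because it depends on how the real system is wired.

```python
import time

POPULAR_PRODUCTS = ["bestseller-1", "bestseller-2"]  # placeholder fallback data


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    @property
    def is_open(self):
        return self._opened_at is not None

    def call(self, primary, fallback):
        if self.is_open and self._clock() - self._opened_at < self.reset_timeout_s:
            return fallback()  # short-circuit: do not touch the failing service
        try:
            result = primary()  # normal call, or a trial call after the timeout
        except Exception:
            self._failures += 1
            if self.is_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0
        self._opened_at = None
        return result


def run_recommendation_scenario():
    now = [0.0]        # simulated clock, in seconds
    healthy = [False]  # the injected failure: service starts out down
    breaker = CircuitBreaker(clock=lambda: now[0])

    def recommendations():
        if not healthy[0]:
            raise ConnectionError("recommendation service down")
        return ["personalized-1"]

    def popular_products():
        return list(POPULAR_PRODUCTS)

    # Inject failure and send one request per simulated second.
    served = []
    while not breaker.is_open and now[0] < 30.0:
        served.append(breaker.call(recommendations, popular_products))
        now[0] += 1.0

    assert breaker.is_open and now[0] <= 30.0               # trips within 30 seconds
    assert all(page == POPULAR_PRODUCTS for page in served)  # fallback was served

    # Restore the service, wait out the reset timeout, and verify recovery.
    healthy[0] = True
    now[0] += 31.0
    assert breaker.call(recommendations, popular_products) == ["personalized-1"]
    assert not breaker.is_open
    return len(served)
```

Injecting the clock lets the test cover a 30-second trip window and a reset timeout without sleeping.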

Game Days

Game days are planned exercises in which teams practice incident response. Schedule them monthly or quarterly. Create realistic scenarios, such as "Database primary fails during peak traffic." Time each phase of the response against SLOs (Service Level Objectives). These exercises reveal operational gaps such as missing runbooks and unclear escalation paths.

Monitoring Degradation State

Use traffic light indicators per feature: green (normal), yellow (degraded), red (critical). Track the share of requests served by the fallback versus the primary path. Alert when degradation lasts longer than 5 minutes (warn) or 15 minutes (page).
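One way to derive the indicator and the alert level from request counters is sketched below. The 5-minute and 15-minute limits come from the text above; the mapping from fallback ratio to color (`yellow_at`, `red_at`) is an assumption chosen for illustration, and both function names are hypothetical.

```python
def feature_status(primary_count, fallback_count, yellow_at=0.05, red_at=0.50):
    """Map the share of fallback-served requests to a traffic light color.

    yellow_at and red_at are assumed thresholds; tune them per feature.
    """
    total = primary_count + fallback_count
    if total == 0:
        return "green"  # no traffic, nothing to report
    fallback_ratio = fallback_count / total
    if fallback_ratio >= red_at:
        return "red"
    if fallback_ratio >= yellow_at:
        return "yellow"
    return "green"


def alert_level(degraded_seconds, warn_after_s=5 * 60, page_after_s=15 * 60):
    """Return 'page', 'warn', or None depending on how long degradation has lasted."""
    if degraded_seconds > page_after_s:
        return "page"
    if degraded_seconds > warn_after_s:
        return "warn"
    return None


# Example: 10% of recommendation requests on fallback for the last 6 minutes.
status = feature_status(primary_count=900, fallback_count=100)
level = alert_level(degraded_seconds=6 * 60)
```

Alerting on duration rather than on the first fallback response avoids paging for brief blips that the degradation path handled as designed.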

💡 Key Takeaways
Chaos engineering validates degradation with real failures - kill services, inject latency, exhaust resources
Game days practice incident response with planned failure scenarios
Monitor degradation state with traffic lights per feature and alert on extended degradation
📌 Interview Tips
1. Mention chaos engineering by name - shows knowledge of modern resilience practices
2. Describe specific injection techniques: network partition, latency injection, error injection
3. Game days demonstrate operational maturity beyond just technical design