Testing and Validating Degradation Behavior

Chaos Engineering for Degradation
A degradation path that has never been exercised against a real failure cannot be trusted. Chaos engineering, the practice of deliberately injecting failures into a running system, validates those paths: kill service instances, introduce network latency, exhaust connection pools. Start with non-critical services in staging, then expand to production during low-traffic periods.
Failure Injection Techniques
Service termination: kill processes to test failover. Network partition: block traffic between services, for example with iptables rules. Latency injection: add a fixed delay such as 500 ms to calls. Error injection: return HTTP 500 errors for a percentage of requests. Resource exhaustion: consume memory or CPU. Each technique reveals different behaviors.
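Latency and error injection can be prototyped in-process before reaching for network-level tools. The following is a minimal sketch, not a production fault injector: `with_faults`, `InjectedFault`, and the parameter names are hypothetical, and the injected exception merely stands in for an HTTP 500 from a real dependency.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised instead of calling the dependency (stands in for an HTTP 500)."""


def with_faults(
    call: Callable[[], T],
    *,
    latency_s: float = 0.0,
    error_rate: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[], T]:
    """Wrap a dependency call with latency injection and error injection.

    latency_s  -- delay added before every call (0.5 gives the 500 ms case)
    error_rate -- fraction of calls, between 0.0 and 1.0, that fail
    rng, sleep -- injectable so that tests stay fast and deterministic
    """
    rng = rng or random.Random()

    def wrapped() -> T:
        if latency_s > 0:
            sleep(latency_s)
        # random() returns a value in [0.0, 1.0), so 1.0 always fails
        # and 0.0 never fails.
        if rng.random() < error_rate:
            raise InjectedFault("injected failure")
        return call()

    return wrapped
```

Wrapping the client call for a single dependency, for instance `with_faults(fetch_recommendations, latency_s=0.5, error_rate=0.1)` with a hypothetical `fetch_recommendations`, keeps the experiment scoped to that dependency and leaves the rest of the request path untouched.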
⚠️ Key Trade-off: Production chaos testing risks real user impact. Mitigate with percentage controls (for example, 1% of users), time bounds (5-minute experiments), and automatic rollback triggers.
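The three mitigations can be combined into one guard that decides, per request, whether a fault should be injected at all. This is a sketch under stated assumptions: the `Experiment` class, its field names, and the 5% abort threshold are hypothetical, and hashing the user ID is just one common way to pick a stable slice of users.

```python
import hashlib
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Experiment:
    name: str
    percent_users: float = 1.0       # percentage control: 1% of users
    max_duration_s: float = 300.0    # time bound: 5 minute experiment
    abort_error_rate: float = 0.05   # hypothetical rollback threshold
    clock: Callable[[], float] = time.monotonic
    started_at: float = field(default=0.0, init=False)
    aborted: bool = field(default=False, init=False)

    def __post_init__(self) -> None:
        self.started_at = self.clock()

    def report_error_rate(self, observed: float) -> None:
        """Automatic rollback trigger: stop injecting once errors are too high."""
        if observed > self.abort_error_rate:
            self.aborted = True

    def applies_to(self, user_id: str) -> bool:
        """Return True if this user's requests should receive the fault."""
        expired = self.clock() - self.started_at >= self.max_duration_s
        if self.aborted or expired:
            return False
        # Hash the user ID so the same user stays in or out for the whole
        # experiment, and different experiments pick different users.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 10_000
        return bucket < self.percent_users * 100
```

Because the guard fails closed (an aborted or expired experiment injects nothing), a forgotten experiment stops affecting users on its own after the time bound.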
Degradation Test Scenarios
Define explicit scenarios for each degradation path. For recommendation degradation: inject a failure, verify the circuit breaker trips within 30 seconds, verify the fallback returns popular products, verify checkout is unaffected, and verify recovery once the service is restored.
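The recommendation scenario can be written as an automated check. The sketch below is illustrative only: `CircuitBreaker` is a deliberately minimal stand-in (three consecutive failures, 10 second cooldown, both hypothetical values), the services are stubs, and a fake clock lets the scenario run instantly. A real test would drive the actual services and the breaker library in use.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    @property
    def is_open(self) -> bool:
        return self.opened_at is not None

    def call(self, primary: Callable[[], List[str]],
             fallback: Callable[[], List[str]]) -> List[str]:
        # While open and still cooling down, skip the primary entirely.
        if self.opened_at is not None and self.clock() - self.opened_at < self.reset_after_s:
            return fallback()
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.is_open or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or restart the cooldown
            return fallback()
        self.failures = 0
        self.opened_at = None  # a success closes the breaker
        return result


def run_recommendation_scenario() -> dict:
    now = [0.0]      # fake clock, in seconds
    outage = [True]  # the injected failure
    breaker = CircuitBreaker(clock=lambda: now[0])

    def recommendations() -> List[str]:
        if outage[0]:
            raise ConnectionError("injected recommendation outage")
        return ["personalized-1", "personalized-2"]

    def popular_products() -> List[str]:
        return ["popular-1", "popular-2"]

    def checkout() -> str:
        return "order-confirmed"  # stub with no recommendation dependency

    # Inject the failure with one request per second and time the trip.
    tripped_after_s = None
    fallback_served = True
    for second in range(30):
        now[0] = float(second)
        served = breaker.call(recommendations, popular_products)
        fallback_served = fallback_served and served == popular_products()
        if breaker.is_open:
            tripped_after_s = now[0]
            break

    checkout_ok = checkout() == "order-confirmed"

    # Restore the service, wait out the cooldown, and check recovery.
    outage[0] = False
    now[0] += breaker.reset_after_s
    recovered = breaker.call(recommendations, popular_products) == recommendations()

    return {"tripped_after_s": tripped_after_s, "fallback_served": fallback_served,
            "checkout_ok": checkout_ok, "recovered": recovered,
            "closed_after_recovery": not breaker.is_open}
```

Each key in the returned dictionary maps to one verification step in the scenario, so a failed assertion points directly at the step that regressed.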
Game Days
Game days are planned exercises where teams practice incident response. Schedule them monthly or quarterly. Create realistic scenarios, such as "Database primary fails during peak traffic." Time each phase of the response against SLOs (Service Level Objectives). These exercises reveal operational gaps such as missing runbooks and unclear escalation paths.
Monitoring Degradation State
Use traffic light indicators per feature: green (normal), yellow (degraded), red (critical). Track the share of requests served by fallback versus primary. Alert when degradation lasts longer than 5 minutes (warn) or 15 minutes (page).
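The indicator and alert rules can be expressed as two small functions. This is a sketch, assuming the light is derived from the fallback share of traffic: the 5% and 50% cut-offs and the names `light_for` and `alert_for` are hypothetical, while the 5 minute and 15 minute alert thresholds come from the text above.

```python
from enum import Enum
from typing import Optional


class Light(Enum):
    GREEN = "normal"
    YELLOW = "degraded"
    RED = "critical"


def light_for(fallback_requests: int, total_requests: int,
              yellow_at: float = 0.05, red_at: float = 0.50) -> Light:
    """Map the share of requests served by fallback to a traffic light.

    The yellow_at and red_at cut-offs are illustrative assumptions.
    """
    if total_requests == 0:
        return Light.GREEN  # no traffic, nothing degraded to report
    share = fallback_requests / total_requests
    if share >= red_at:
        return Light.RED
    if share >= yellow_at:
        return Light.YELLOW
    return Light.GREEN


def alert_for(degraded_for_s: float) -> Optional[str]:
    """Return the alert level for a feature degraded for this many seconds."""
    if degraded_for_s > 15 * 60:
        return "page"
    if degraded_for_s > 5 * 60:
        return "warn"
    return None
```

Keeping the light (how bad is it now) separate from the alert (how long has it lasted) means a brief spike turns a feature yellow on the dashboard without notifying anyone.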

💡 Key Takeaways

✓ Chaos engineering validates degradation with real failures: kill services, inject latency, exhaust resources

✓ Game days practice incident response with planned failure scenarios

✓ Monitor degradation state with traffic lights per feature and alert on extended degradation

📌 Interview Tips

1. Mention chaos engineering by name; it shows knowledge of modern resilience practices

2. Describe specific injection techniques: network partition, latency injection, error injection

3. Game days demonstrate operational maturity beyond just technical design
