Circuit Breaker Failure Modes: Flapping, Stampedes, and Retry Amplification
Circuit breakers can themselves cause outages when they are misconfigured or interact badly with other resilience patterns. Understanding these failure modes is critical for production reliability.
Flapping and oscillation happen when breakers open and close rapidly under normal traffic variance. Symptom: spiky availability, with p95 latency alternating between 50ms (closed, healthy) and 200ms (open, returning fallbacks). The root cause is usually a minimum-call threshold that is too small (fewer than 10 requests) or a window so short that it magnifies natural randomness. At 100 requests per second (RPS) with a 5% base error rate, a 1-second window will see anywhere from 0 to 10 errors purely by chance, causing constant tripping. Fix: raise the minimum to 20+ calls, use windows of 10+ seconds, and add exponential backoff on the open duration so repeated failures extend the wait (5s, then 10s, then 20s).
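To make that fix concrete, here is a minimal sketch of a flap-resistant breaker in Python: it refuses to evaluate its failure rate until the 10-second window holds at least 20 samples, and it doubles the open duration on each repeated trip. The class names, fields, and defaults are illustrative assumptions, not a specific library's configuration.

```python
import time
from collections import deque
from dataclasses import dataclass

# Minimal sketch of anti-flapping breaker settings. All names and defaults
# here are illustrative assumptions, not any library's real API.
@dataclass
class BreakerConfig:
    minimum_calls: int = 20          # never evaluate the window below this count
    window_seconds: float = 10.0     # wide enough to smooth a 5% baseline error rate
    failure_rate_threshold: float = 0.5
    base_open_seconds: float = 5.0   # first open duration
    max_open_seconds: float = 300.0  # cap for persistent outages

class FlapResistantBreaker:
    def __init__(self, config: BreakerConfig):
        self.config = config
        self.calls = deque()          # (timestamp, succeeded) samples in the window
        self.consecutive_opens = 0    # reset this after a successful half-open probe
        self.open_until = 0.0

    def allow_request(self) -> bool:
        return time.monotonic() >= self.open_until

    def record(self, succeeded: bool) -> None:
        now = time.monotonic()
        self.calls.append((now, succeeded))
        # Drop samples that have aged out of the sliding window.
        while self.calls and self.calls[0][0] < now - self.config.window_seconds:
            self.calls.popleft()
        # Too few samples: the failure rate is dominated by noise, so stay closed.
        if len(self.calls) < self.config.minimum_calls:
            return
        failures = sum(1 for _, ok in self.calls if not ok)
        if failures / len(self.calls) >= self.config.failure_rate_threshold:
            # Exponential backoff on open duration: 5s, 10s, 20s, ... capped at 300s.
            duration = min(
                self.config.base_open_seconds * 2 ** self.consecutive_opens,
                self.config.max_open_seconds,
            )
            self.consecutive_opens += 1
            self.open_until = now + duration
            self.calls.clear()
```

With these settings, a 100 RPS service accumulates well over the 20-call minimum every second, so the breaker only reacts to a sustained elevated failure rate rather than to one unlucky second of noise.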
Half-open stampede occurs when too many probes overwhelm a recovering dependency, causing it to fail again and preventing recovery. If 100 service instances all transition to half-open simultaneously and each sends 10 probes, that's 1,000 sudden requests to a service that just crashed and is still warming up. The database or API immediately falls over, the breaker re-opens, and you're stuck in a loop. Amazon and Envoy guidance: limit probes to 1 to 5 total per instance, serialize probe attempts with jittered delays (add a random 0 to 5 seconds before probing), and consider token-based coordination where only one instance probes at a time.
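One way to implement the per-instance side of that guidance is sketched below, assuming a simple in-process lock; the class name, the 3-probe cap, and the 5-second jitter range are illustrative choices, not values taken from Amazon's or Envoy's documentation.

```python
import random
import threading
import time

# Minimal sketch of a per-instance half-open probe gate: at most max_probes
# probes, sent one at a time, after a random jitter so a fleet of instances
# does not hit the recovering dependency in lockstep. Names are illustrative.
class HalfOpenProbeGate:
    def __init__(self, max_probes: int = 3, jitter_seconds: float = 5.0):
        self.max_probes = max_probes
        self.probes_sent = 0
        self.probe_in_flight = False
        self.lock = threading.Lock()
        # Wait a random 0-5 seconds before the first probe is allowed.
        self.ready_at = time.monotonic() + random.uniform(0, jitter_seconds)

    def try_acquire(self) -> bool:
        """Return True if the caller may use this request as a probe."""
        if time.monotonic() < self.ready_at:
            return False  # still inside the jitter delay
        with self.lock:
            if self.probe_in_flight or self.probes_sent >= self.max_probes:
                return False
            self.probe_in_flight = True
            self.probes_sent += 1
            return True

    def release(self) -> None:
        # Caller closes the breaker if its probes succeeded, re-opens it if not.
        with self.lock:
            self.probe_in_flight = False
```

The jitter spreads the fleet's first probes across several seconds, and the in-flight flag serializes probes within each instance, so the recovering dependency sees at most a trickle of traffic instead of 1,000 simultaneous requests.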
Retry amplification is deadly: if your client, your service mesh, and your SDK each make up to two attempts, one user request becomes up to 8 backend calls (2 × 2 × 2). When the circuit breaker opens after detecting failures, those retries don't stop; they just get rejected faster, and if you have fallback paths or alternate routes, the retries can overwhelm those instead. Real-world example: a team saw their Redis fallback cache receive 10x normal traffic when the primary database's breaker opened, because every layer was retrying and all the retries hit the cache. Fix: implement global retry budgets (max 1 retry per request across all layers), disable retries when the breaker is open, and ensure idempotency so retries are safe.
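A minimal sketch of that fix, assuming the budget can be shared between layers; in practice the used-retry count would ride along with the request (for example in a header set by the client and honored by the mesh and SDK), and all names here are hypothetical.

```python
from typing import Callable

# Minimal sketch of a per-request retry budget shared by all layers.
class RetryBudget:
    def __init__(self, max_retries: int = 1):
        self.max_retries = max_retries  # total retries allowed across ALL layers
        self.used = 0

    def try_consume(self) -> bool:
        if self.used >= self.max_retries:
            return False
        self.used += 1
        return True

def call_with_budget(
    request_fn: Callable[[], object],
    budget: RetryBudget,
    breaker_is_open: Callable[[], bool],
) -> object:
    while True:
        try:
            return request_fn()
        except Exception:
            # Stop retrying if the downstream breaker is open (retries would only
            # pile onto the fallback path) or the shared budget is spent. This is
            # only safe when request_fn is idempotent.
            if breaker_is_open() or not budget.try_consume():
                raise
```

Because every layer consumes from the same budget, the worst case stays at 2 backend attempts per user request no matter how many retrying layers sit in the path.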
💡 Key Takeaways
• Flapping occurs with windows under 5 seconds or minimum-call counts under 10: natural 5% error variance causes constant open/close cycles and spiky p95 latency jumping between 50ms and 200ms
• Exponential backoff on open duration prevents flapping: start at 5 seconds, double on each repeated open (10s, 20s, 40s), up to a 300-second maximum for persistent outages
• Half-open stampede happens when 100 instances probe simultaneously with 10 requests each, creating 1,000 sudden requests that re-crash the recovering dependency
• Token-based probing or strict serialization: only 1 to 5 total probes per instance with 0 to 5 seconds of random jitter prevents synchronized spikes across the fleet
• Retry amplification multiplies load: 2 attempts at the client, 2 at the mesh, and 2 at the SDK turn 1 request into 8 backend calls, overwhelming fallback systems when the primary's breaker opens
• Global retry budget and coordination: max 1 retry per request ID across all layers, disable retries when the breaker is open, and coordinate retry policies between client and infrastructure
📌 Examples
Production flapping incident: A 100 RPS service with a 2-second window and a 5-call minimum saw its breaker open and close every 3 seconds due to the natural 5% error rate; fixed by moving to a 10-second window and a 20-call minimum
AWS App Mesh guidance: Limit concurrent probes and cap the maximum percentage of hosts ejected at 50% during partial outages, to avoid ejecting the entire cluster and causing a total blackout
Retry storm at scale: A service saw its Redis cache overloaded at 10x normal QPS when the primary database's breaker opened, because all 3 retry layers (client, mesh, SDK) were still active and every retry hit the fallback cache