Circuit Breaker Failure Modes: Flapping, Stampedes, and Retry Amplification
Flapping
Flapping occurs when a breaker rapidly cycles between open and closed states. It typically happens when the downstream service is partially degraded: it passes half-open test requests but fails under normal load. Each close restores full traffic, which immediately causes failures and triggers another open. The fix is hysteresis: require multiple consecutive successful test periods before fully closing, or gradually increase traffic during recovery instead of jumping to full load.
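The hysteresis idea can be sketched as a small state machine. This is a minimal illustration, not a production breaker: the class name `HysteresisBreaker`, the `required_successes` parameter, and the `begin_probing` hook are all hypothetical names chosen for this example, and cooldown timing is left out.

```python
class HysteresisBreaker:
    """Toy breaker that needs several consecutive good probes before closing."""

    def __init__(self, required_successes=3):
        # Hypothetical parameter: how many good probes in a row are required.
        self.required_successes = required_successes
        self.state = "open"
        self.consecutive_successes = 0

    def begin_probing(self):
        # Called when the cooldown expires: start letting test requests through.
        self.state = "half_open"
        self.consecutive_successes = 0

    def record_probe(self, success):
        # Probe results only matter while the breaker is half-open.
        if self.state != "half_open":
            return self.state
        if success:
            self.consecutive_successes += 1
            if self.consecutive_successes >= self.required_successes:
                self.state = "closed"
        else:
            # One failed probe discards all progress and reopens the breaker.
            self.consecutive_successes = 0
            self.state = "open"
        return self.state
```

With `required_successes=3`, two good probes followed by one bad probe send the breaker back to open, so a service that only intermittently passes its test requests never receives full traffic.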
Thundering Herd
A thundering herd occurs when many breakers close simultaneously and all send traffic to a recovering service at once. If 100 service instances have open breakers and they all close at the same moment, the downstream service goes from almost no traffic to the combined load of all 100 instances in a single step. The fix is jittered cooldowns: add random variation to cooldown periods (for example, 30s ± 5s) so breakers close at different times and recovery load is spread out.
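A jittered cooldown can be as small as one function. The sketch below assumes uniform jitter around the base value; the function name `jittered_cooldown` and its parameters are illustrative, and the defaults simply mirror the 30s ± 5s figure above.

```python
import random


def jittered_cooldown(base_seconds=30.0, jitter_seconds=5.0, rng=random):
    """Return a cooldown drawn uniformly from base ± jitter (in seconds)."""
    # Each breaker instance draws its own value, so instances that opened
    # together are unlikely to close at the same instant.
    return base_seconds + rng.uniform(-jitter_seconds, jitter_seconds)
```

Each instance would call this once per open period and wait the returned number of seconds before moving to half-open. Passing a seeded `random.Random` as `rng` makes the behavior reproducible in tests.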
Retry Amplification
Retry amplification occurs when circuit breakers and retries interact badly. Suppose Service A makes up to 3 attempts per call to Service B, and B makes up to 3 attempts per call to C. While C is failing, a single request into A can produce 9 requests to C (3 × 3). Add another layer with the same policy and you get 27. Circuit breakers should trip before retries are exhausted. If your retry budget is 3 attempts with a 1s timeout each, a failing call takes up to 3s to give up, so set the circuit breaker window shorter than 3s so that it opens before full retry amplification occurs.
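The arithmetic is easy to check in a few lines. The helper names below are made up for this example; the first multiplies the per-layer attempt counts, and the second gives the worst-case time one caller spends retrying, which is the upper bound the breaker window should stay under.

```python
import math


def worst_case_requests(attempts_per_layer):
    """Requests reaching the last service for one original request."""
    # Every attempt at one layer triggers a full set of attempts at the next.
    return math.prod(attempts_per_layer)


def worst_case_retry_seconds(attempts, timeout_seconds):
    """Longest time a single caller spends before its retries run out."""
    # Ignores backoff delays between attempts, which would only add time.
    return attempts * timeout_seconds
```

For the example above, `worst_case_requests([3, 3])` gives 9, `worst_case_requests([3, 3, 3])` gives 27, and `worst_case_retry_seconds(3, 1.0)` gives 3.0 seconds.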
Partial Failures
Partial failures occur when some endpoints or methods on a service fail while others succeed. A single service-wide breaker that trips on any failure will block healthy endpoints too. The solution is per-endpoint or per-method breakers, at the cost of added configuration complexity. An alternative is to use adaptive thresholds that account for baseline error rates.
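One way to picture per-endpoint breakers is a registry keyed by endpoint name. This is a deliberately simplified sketch: `EndpointBreakers` and `failure_threshold` are hypothetical names, the trip condition is a plain count of consecutive failures, and recovery (cooldown and half-open probing) is omitted.

```python
from collections import defaultdict


class EndpointBreakers:
    """Tracks consecutive failures separately for each endpoint."""

    def __init__(self, failure_threshold=5):
        # Hypothetical threshold: consecutive failures needed to trip.
        self.failure_threshold = failure_threshold
        self.consecutive_failures = defaultdict(int)

    def record(self, endpoint, success):
        # A success resets only that endpoint's counter.
        if success:
            self.consecutive_failures[endpoint] = 0
        else:
            self.consecutive_failures[endpoint] += 1

    def is_open(self, endpoint):
        # Endpoints that have never failed are treated as closed.
        return self.consecutive_failures.get(endpoint, 0) >= self.failure_threshold
```

If `/search` fails five times in a row, `is_open("/search")` becomes true while `is_open("/health")` stays false, so healthy endpoints keep serving traffic. The configuration cost mentioned above shows up as soon as different endpoints need different thresholds.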