Circuit Breaker Failure Modes: Flapping, Stampedes, and Retry Amplification
Circuit breakers can themselves cause outages when they are misconfigured or interact badly with other resilience patterns. Understanding these failure modes is critical for production reliability.
Flapping and oscillation happen when breakers open and close rapidly under normal traffic variance. Symptom: spiky availability, with p95 latency alternating between 50ms (closed, healthy) and 200ms (open, returning fallbacks). The root cause is usually a minimum-call threshold that is too small (fewer than 10 requests) or a window so short that it magnifies natural randomness. At 100 requests per second (RPS) with a 5% base error rate, a 1-second window will see anywhere from 0 to 10 errors purely by chance, causing constant tripping. Fix: raise the minimum to 20+ calls, use windows of 10+ seconds, and add exponential backoff on the open duration so repeated failures extend the wait (5s, then 10s, then 20s).
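To make that fix concrete, here is a minimal sketch of a flap-resistant breaker in Python: it refuses to evaluate its failure rate until the 10-second window holds at least 20 samples, and it doubles the open duration on each repeated trip. The class names, fields, and defaults are illustrative assumptions, not a specific library's configuration.

```python
import time
from collections import deque
from dataclasses import dataclass

# Minimal sketch of anti-flapping breaker settings. All names and defaults
# here are illustrative assumptions, not any library's real API.
@dataclass
class BreakerConfig:
    minimum_calls: int = 20          # never evaluate the window below this count
    window_seconds: float = 10.0     # wide enough to smooth a 5% baseline error rate
    failure_rate_threshold: float = 0.5
    base_open_seconds: float = 5.0   # first open duration
    max_open_seconds: float = 300.0  # cap for persistent outages

class FlapResistantBreaker:
    def __init__(self, config: BreakerConfig):
        self.config = config
        self.calls = deque()          # (timestamp, succeeded) samples in the window
        self.consecutive_opens = 0    # reset this after a successful half-open probe
        self.open_until = 0.0

    def allow_request(self) -> bool:
        return time.monotonic() >= self.open_until

    def record(self, succeeded: bool) -> None:
        now = time.monotonic()
        self.calls.append((now, succeeded))
        # Drop samples that have aged out of the sliding window.
        while self.calls and self.calls[0][0] < now - self.config.window_seconds:
            self.calls.popleft()
        # Too few samples: the failure rate is dominated by noise, so stay closed.
        if len(self.calls) < self.config.minimum_calls:
            return
        failures = sum(1 for _, ok in self.calls if not ok)
        if failures / len(self.calls) >= self.config.failure_rate_threshold:
            # Exponential backoff on open duration: 5s, 10s, 20s, ... capped at 300s.
            duration = min(
                self.config.base_open_seconds * 2 ** self.consecutive_opens,
                self.config.max_open_seconds,
            )
            self.consecutive_opens += 1
            self.open_until = now + duration
            self.calls.clear()
```

With these settings, a 100 RPS service accumulates well over the 20-call minimum every second, so the breaker only reacts to a sustained elevated failure rate rather than to one unlucky second of noise.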
Half-open stampede occurs when too many probes overwhelm a recovering dependency, causing it to fail again and preventing recovery. If 100 service instances all transition to half-open simultaneously and each sends 10 probes, that's 1,000 sudden requests to a service that just crashed and is still warming up. The database or API immediately falls over, the breaker re-opens, and you're stuck in a loop. Amazon and Envoy guidance: limit probes to 1 to 5 total per instance, serialize probe attempts with jittered delays (add a random 0 to 5 seconds before probing), and consider token-based coordination where only one instance probes at a time.
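One way to implement the per-instance side of that guidance is sketched below, assuming a simple in-process lock; the class name, the 3-probe cap, and the 5-second jitter range are illustrative choices, not values taken from Amazon's or Envoy's documentation.

```python
import random
import threading
import time

# Minimal sketch of a per-instance half-open probe gate: at most max_probes
# probes, sent one at a time, after a random jitter so a fleet of instances
# does not hit the recovering dependency in lockstep. Names are illustrative.
class HalfOpenProbeGate:
    def __init__(self, max_probes: int = 3, jitter_seconds: float = 5.0):
        self.max_probes = max_probes
        self.probes_sent = 0
        self.probe_in_flight = False
        self.lock = threading.Lock()
        # Wait a random 0-5 seconds before the first probe is allowed.
        self.ready_at = time.monotonic() + random.uniform(0, jitter_seconds)

    def try_acquire(self) -> bool:
        """Return True if the caller may use this request as a probe."""
        if time.monotonic() < self.ready_at:
            return False  # still inside the jitter delay
        with self.lock:
            if self.probe_in_flight or self.probes_sent >= self.max_probes:
                return False
            self.probe_in_flight = True
            self.probes_sent += 1
            return True

    def release(self) -> None:
        # Caller closes the breaker if its probes succeeded, re-opens it if not.
        with self.lock:
            self.probe_in_flight = False
```

The jitter spreads the fleet's first probes across several seconds, and the in-flight flag serializes probes within each instance, so the recovering dependency sees at most a trickle of traffic instead of 1,000 simultaneous requests.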
Retry amplification is deadly: if your client, your service mesh, and your SDK each make up to two attempts, one user request becomes up to 8 backend calls (2 × 2 × 2). When the circuit breaker opens after detecting failures, those retries don't stop; they just get rejected faster, and if you have fallback paths or alternate routes, the retries can overwhelm those instead. Real-world example: a team saw their Redis fallback cache receive 10x normal traffic when the primary database's breaker opened, because every layer was retrying and all the retries hit the cache. Fix: implement global retry budgets (max 1 retry per request across all layers), disable retries when the breaker is open, and ensure idempotency so retries are safe.
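A minimal sketch of that fix, assuming the budget can be shared between layers; in practice the used-retry count would ride along with the request (for example in a header set by the client and honored by the mesh and SDK), and all names here are hypothetical.

```python
from typing import Callable

# Minimal sketch of a per-request retry budget shared by all layers.
class RetryBudget:
    def __init__(self, max_retries: int = 1):
        self.max_retries = max_retries  # total retries allowed across ALL layers
        self.used = 0

    def try_consume(self) -> bool:
        if self.used >= self.max_retries:
            return False
        self.used += 1
        return True

def call_with_budget(
    request_fn: Callable[[], object],
    budget: RetryBudget,
    breaker_is_open: Callable[[], bool],
) -> object:
    while True:
        try:
            return request_fn()
        except Exception:
            # Stop retrying if the downstream breaker is open (retries would only
            # pile onto the fallback path) or the shared budget is spent. This is
            # only safe when request_fn is idempotent.
            if breaker_is_open() or not budget.try_consume():
                raise
```

Because every layer consumes from the same budget, the worst case stays at 2 backend attempts per user request no matter how many retrying layers sit in the path.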
💡 Key Takeaways
• Flapping occurs with windows under 5 seconds or minimum-call counts under 10: natural 5% error variance causes constant open/close cycles and spiky p95 latency jumping between 50ms and 200ms
• Exponential backoff on open duration prevents flapping: start at 5 seconds, double on each repeated open (10s, 20s, 40s), up to a 300-second maximum for persistent outages
• Half-open stampede happens when 100 instances probe simultaneously with 10 requests each, creating 1,000 sudden requests that re-crash the recovering dependency
• Token-based probing or strict serialization: only 1 to 5 total probes per instance with 0 to 5 seconds of random jitter prevents synchronized spikes across the fleet
• Retry amplification multiplies load: 2 attempts at the client, 2 at the mesh, and 2 at the SDK turn 1 request into 8 backend calls, overwhelming fallback systems when the primary's breaker opens
• Global retry budget and coordination: max 1 retry per request ID across all layers, disable retries when the breaker is open, and coordinate retry policies between client and infrastructure
📌 Examples
Production flapping incident: A 100 RPS service with a 2-second window and a 5-call minimum saw its breaker open and close every 3 seconds due to the natural 5% error rate; fixed by moving to a 10-second window and a 20-call minimum
AWS App Mesh guidance: Limit concurrent probes and cap the maximum percentage of hosts ejected at 50% during partial outages, to avoid ejecting the entire cluster and causing a total blackout
Retry storm at scale: A service saw its Redis cache overloaded at 10x normal QPS when the primary database's breaker opened, because all 3 retry layers (client, mesh, SDK) were still active and every retry hit the fallback cache