Circuit Breaker Tuning Knobs: Window Size, Thresholds, and Cooldowns
Effective circuit breakers require careful tuning of multiple interdependent parameters. Getting these wrong causes either excessive false positives that unnecessarily degrade service or breakers that never trip, failing to protect you when needed.
The sliding window configuration is foundational. Time-based windows (typically 10 seconds split into ten 1-second buckets) smooth out burstiness better than count-based windows under variable traffic. The minimum number of calls (often 20) prevents tripping on statistical noise during low-traffic periods. Set it too low and natural variance at, say, 5 requests per second will cause flapping; set it too high and the breaker never gathers enough samples to protect low-QPS (queries per second) endpoints during genuine outages.
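As an illustration, these window knobs map roughly onto Resilience4j's CircuitBreakerConfig builder (one possible library, not prescribed by the text; the values mirror the paragraph above):

```java
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;

class WindowTuning {
    // Window knobs only; thresholds and recovery appear in the later sketches.
    static final CircuitBreakerConfig WINDOW_CONFIG = CircuitBreakerConfig.custom()
            // Time-based window: outcomes aggregated over the last 10 seconds
            // (bucketed per second internally), smoothing bursty traffic.
            .slidingWindowType(SlidingWindowType.TIME_BASED)
            .slidingWindowSize(10)      // window length in seconds for TIME_BASED
            // Require 20 recorded calls before any threshold is evaluated,
            // so low-traffic noise cannot trip the breaker.
            .minimumNumberOfCalls(20)
            .build();
}
```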
Error rate and slow-call thresholds (commonly both set to 50%) define what "unhealthy" means. Crucially, you must classify which errors count: 5xx server errors and timeouts do, 4xx client errors usually don't (unless they indicate overload, like 429 rate limits). The slow-call threshold is powerful but underused: if your Service Level Objective (SLO) is p95 under 200ms, set the slow-call threshold at 200ms so the breaker trips before tail latency destroys the user experience. This prevents tail amplification, where one slow dependency cascades latency upstream.
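The classification and threshold knobs might look like this in the same assumed Resilience4j sketch; which exceptions represent 5xx versus 4xx depends on your HTTP client adapter, so the recorded exception types here are only illustrative:

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;

import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

class ThresholdTuning {
    static final CircuitBreakerConfig THRESHOLD_CONFIG = CircuitBreakerConfig.custom()
            // Trip when at least 50% of recorded calls in the window failed...
            .failureRateThreshold(50)
            // ...or when at least 50% of calls were slower than the latency SLO.
            .slowCallRateThreshold(50)
            .slowCallDurationThreshold(Duration.ofMillis(200))  // align with p95 SLO
            // Count infrastructure failures; surface 5xx/429 responses as exceptions
            // in your client adapter and let plain 4xx responses return normally so
            // they never count against the breaker.
            .recordExceptions(IOException.class, TimeoutException.class)
            .build();
}
```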
Open-state duration and probe policy control recovery speed versus stability. Short cooldowns (5 seconds) recover quickly but risk flapping if the issue persists. Longer waits (30+ seconds) with exponential backoff on repeated failures provide stability but reduce availability. The half-open probe count (1 to 5 concurrent requests) is critical: too many probes create a thundering herd that knocks a recovering service back down. AWS and Envoy best practices limit probes strictly and stagger them with jitter across instances, so synchronized probe spikes don't hit the struggling dependency like a fresh wave of load.
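And the recovery-side knobs, again as a hedged Resilience4j sketch (exponential backoff across repeated opens and cross-instance jitter would be layered on top and are only noted in comments):

```java
import java.time.Duration;

import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

class RecoveryTuning {
    static final CircuitBreakerConfig RECOVERY_CONFIG = CircuitBreakerConfig.custom()
            // Stay open for 10 seconds before probing; raise toward 30+ seconds (or
            // back off exponentially on repeated opens) when outages tend to persist.
            .waitDurationInOpenState(Duration.ofSeconds(10))
            // Transition to half-open on a timer rather than waiting for the next call.
            .automaticTransitionFromOpenToHalfOpenEnabled(true)
            // Allow only 3 probe calls in half-open so a recovering dependency is not
            // hit by a thundering herd; add per-instance jitter upstream so probes
            // from a whole fleet do not arrive as one synchronized spike.
            .permittedNumberOfCallsInHalfOpenState(3)
            .build();
}
```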
💡 Key Takeaways
•Time-based windows of 10 seconds with bucketization smooth burstiness better than count-based windows; use 10 buckets of 1 second each to avoid lock contention on a single hot stats counter
•A minimum call threshold around 20 prevents false positives but must scale with traffic: low-QPS endpoints may need lower minimums or longer windows to gather enough samples
•Classify failures explicitly: count 5xx and timeouts, exclude 4xx client errors unless they signal overload (429s), and treat calls exceeding the latency SLO as slow failures
•Open duration of 5 to 30 seconds balances recovery speed against flapping; apply exponential backoff (double on each repeated open) to stabilize under persistent outages
•Limit half-open probes to 1 to 5 concurrent requests and stagger them with jitter across instances to prevent probe stampedes that overwhelm recovering dependencies
•Scope breakers per endpoint and optionally per tenant so one noisy caller or endpoint can't open the breaker globally and affect all traffic (see the registry sketch after this list)
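A minimal sketch of the per-endpoint scoping from the last takeaway, still assuming Resilience4j; the endpoint names are hypothetical:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

class PerEndpointBreakers {
    // One shared default config, but a distinct breaker instance per endpoint, so a
    // failing /reports endpoint cannot open the breaker for /orders traffic too.
    static final CircuitBreakerRegistry REGISTRY =
            CircuitBreakerRegistry.of(CircuitBreakerConfig.ofDefaults());

    static CircuitBreaker forEndpoint(String endpointName) {
        // The registry caches breakers by name, so failure stats stay per endpoint
        // (a per-tenant scope would simply add the tenant id to the name).
        return REGISTRY.circuitBreaker(endpointName);  // e.g. "GET /orders" (hypothetical)
    }
}
```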
📌 Examples
Production tuning for an interactive API: 10-second window, 20 minimum calls, 50% error threshold, 200ms slow-call threshold (matches the p95 SLO), 10-second cooldown, 3 concurrent probes in half-open (assembled into a single config in the sketch after these examples)
Envoy outlier detection config: 5-second interval, eject after 5 consecutive 5xx errors, 30-second base ejection time, maximum 50% of the cluster ejected so detection can never black out the whole cluster
Low-QPS background job breaker: 60-second window, 10 minimum calls (traffic may be only 1 QPS), 70% error threshold, 60-second cooldown to avoid tripping on occasional failures
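For reference, the first example's numbers assembled into a single configuration, again as an assumed Resilience4j sketch (the breaker name is hypothetical):

```java
import java.io.IOException;
import java.time.Duration;
import java.util.concurrent.TimeoutException;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;

class InteractiveApiBreaker {
    // Values taken directly from the interactive-API example above.
    static final CircuitBreakerConfig CONFIG = CircuitBreakerConfig.custom()
            .slidingWindowType(SlidingWindowType.TIME_BASED)
            .slidingWindowSize(10)                              // 10-second window
            .minimumNumberOfCalls(20)
            .failureRateThreshold(50)
            .slowCallDurationThreshold(Duration.ofMillis(200))  // matches p95 SLO
            .slowCallRateThreshold(50)
            .waitDurationInOpenState(Duration.ofSeconds(10))    // cooldown
            .permittedNumberOfCallsInHalfOpenState(3)           // half-open probes
            .recordExceptions(IOException.class, TimeoutException.class)
            .build();

    static final CircuitBreaker BREAKER = CircuitBreaker.of("interactive-api", CONFIG);
}
```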