Circuit Breaker Tuning Knobs: Window Size, Thresholds, and Cooldowns
Failure Rate Window
The window defines how much history the breaker considers when computing the failure rate. With a 10-second sliding window, only requests from the last 10 seconds count. Short windows react quickly but can trip on brief network hiccups. Longer windows are more stable but slower to detect real outages. Count-based windows (for example, the last 100 requests) suit variable traffic because the sample size stays fixed; time-based windows suit steady traffic.
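The two window styles can be sketched as follows. This is a minimal illustration, not a reference implementation: the class names CountWindow and TimeWindow are hypothetical, and the time-based version takes the current time as an explicit argument so its behavior is easy to follow.

```python
from collections import deque


class CountWindow:
    """Count-based window: remembers the outcome of the last `size` requests."""

    def __init__(self, size=100):
        # deque(maxlen=...) silently drops the oldest outcome when full.
        self.outcomes = deque(maxlen=size)

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)


class TimeWindow:
    """Time-based window: remembers outcomes from the last `seconds` seconds."""

    def __init__(self, seconds=10.0):
        self.seconds = seconds
        self.events = deque()  # (timestamp, failed) pairs, oldest first

    def _evict(self, now):
        # Drop every event that has aged out of the window.
        while self.events and self.events[0][0] <= now - self.seconds:
            self.events.popleft()

    def record(self, failed, now):
        self.events.append((now, bool(failed)))
        self._evict(now)

    def failure_rate(self, now):
        self._evict(now)
        if not self.events:
            return 0.0
        return sum(failed for _, failed in self.events) / len(self.events)
```

In real use the `now` argument would typically come from a monotonic clock such as time.monotonic(), so that wall-clock adjustments do not distort the window.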
Failure Threshold
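The failure percentage that triggers the open state. A 50% threshold opens the breaker when half of the requests in the window fail. Critical services may use a tighter threshold such as 25%; services with variable latency may need a looser one such as 70%. A minimum request volume (for example, 20 requests) prevents the breaker from opening on statistical noise during low traffic.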
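The threshold check and the minimum-volume guard combine into a single decision. The sketch below assumes a hypothetical helper named should_open that receives failure and total counts for the current window; the defaults mirror the 50% and 20-request figures above.

```python
def should_open(failures, total, threshold=0.50, min_requests=20):
    """Return True if the breaker should trip for the current window.

    failures / total is the observed failure rate. The breaker never
    trips while fewer than `min_requests` requests have been seen.
    """
    if total < min_requests:
        return False  # too little data: treat the rate as noise
    return failures / total >= threshold
```

For example, 3 failures out of 4 requests is a 75% failure rate, but the guard keeps the breaker closed because 4 is below the minimum volume. With 10 failures out of 20, the same call trips the breaker.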
Cooldown Period
How long the breaker stays open before testing recovery. A 30-second cooldown gives downstream services time to recover. Too short, and you hammer services that are still recovering. With exponential backoff, the cooldown starts at 30 seconds, doubles each time a recovery test fails, and caps at 5 minutes.
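The backoff schedule described here is a doubling sequence with a ceiling. A minimal sketch, assuming a hypothetical function cooldown_seconds whose argument counts how many recovery tests have failed in a row:

```python
def cooldown_seconds(failed_probes, base=30.0, cap=300.0):
    """Cooldown to wait before the next recovery test.

    failed_probes = 0 gives the base cooldown; each additional failed
    recovery test doubles it, up to `cap` (300 s = 5 minutes).
    """
    # Clamp the exponent so a very large counter cannot overflow.
    exponent = min(max(failed_probes, 0), 32)
    return min(cap, base * 2 ** exponent)
```

With the defaults, the sequence runs 30, 60, 120, 240, then stays at 300 seconds, since the next doubling (480) exceeds the cap.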
Half-Open Behavior
When the cooldown expires, the breaker enters the half-open state and allows 3 to 5 test requests through. If all succeed, it closes and resumes normal traffic. If any fail, it reopens and restarts the cooldown. Advanced implementations increase traffic gradually: 10%, 25%, 50%, then full.
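The half-open rules amount to a small state machine. The sketch below is one possible shape, with hypothetical names (HalfOpenProbe, ramp_fraction) and an assumed default of 3 required successes; it covers only the probe phase, not the full breaker.

```python
class HalfOpenProbe:
    """Tracks test requests while the breaker is half-open."""

    def __init__(self, required_successes=3):
        self.required = required_successes
        self.successes = 0
        self.state = "half_open"

    def record(self, ok):
        """Record one test request and return the resulting state."""
        if self.state != "half_open":
            return self.state  # decision already made; ignore late results
        if not ok:
            self.state = "open"  # any failure reopens the breaker
        else:
            self.successes += 1
            if self.successes >= self.required:
                self.state = "closed"  # all test requests succeeded
        return self.state


# Gradual ramp used by some implementations: share of traffic per stage.
RAMP = (0.10, 0.25, 0.50, 1.00)


def ramp_fraction(healthy_stages):
    """Fraction of traffic to admit after `healthy_stages` healthy stages."""
    return RAMP[min(max(healthy_stages, 0), len(RAMP) - 1)]
```

A caller would create a fresh HalfOpenProbe each time the cooldown expires, then restart the cooldown when record returns "open".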
Tuning by Service Type
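Payment: tight settings (25% threshold, 10-second window) because errors are costly. Search: higher tolerance (60% threshold, 30-second window) because degraded results are acceptable. Start with defaults of a 50% threshold, a 10-second window, and a 30-second cooldown, then tune from production data.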
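These profiles could be captured as named presets. The sketch below assumes the shorthand reads as threshold/window/cooldown, so that 25%/10s means a 25% threshold over a 10-second window. The BreakerConfig type and its field names are hypothetical, and fields the text does not specify for a profile fall back to the defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BreakerConfig:
    failure_threshold: float = 0.50  # fraction of failed requests that trips
    window_seconds: float = 10.0     # length of the failure rate window
    cooldown_seconds: float = 30.0   # time spent open before testing recovery
    min_requests: int = 20           # minimum volume before tripping


DEFAULT = BreakerConfig()
# Payment: errors are costly, so trip early.
PAYMENT = BreakerConfig(failure_threshold=0.25, window_seconds=10.0)
# Search: degraded results are acceptable, so tolerate more failures.
SEARCH = BreakerConfig(failure_threshold=0.60, window_seconds=30.0)
```

Keeping the presets immutable (frozen=True) makes it harder to change a shared profile by accident; tuned values from production data would replace these starting points.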