
Circuit Breaker Pattern: Fail Fast to Preserve System Health

The circuit breaker pattern protects your system from wasting resources on failing dependencies: it detects problems and fails fast instead of waiting for timeouts. Think of it like an electrical circuit breaker: when it senses dangerous conditions, it trips open to prevent damage. The pattern continuously monitors calls to a dependency over a sliding window (typically 10 seconds), tracking failures such as timeouts, connection errors, and server errors. When the failure rate crosses a threshold (commonly 50%) and there are enough samples (often 20+ requests), the breaker "opens" and immediately rejects new calls without even attempting them. This preserves threads and CPU cycles and stops failures from cascading. After a cooldown period (5 to 30 seconds), the breaker enters a "half-open" state and probes for recovery with a few trial requests.

The value lies in what you're protecting. If a dependency that normally responds in 100ms starts taking 5 seconds, then without a circuit breaker every caller blocks for 5 seconds, exhausting thread pools and propagating the slowness upstream. With a circuit breaker, once the problem is detected you fail in under 1ms and can return cached data or a gracefully degraded response.

Netflix famously popularized this pattern with Hystrix, using small thread pools (10 to 20 threads per dependency) and aggressive timeouts (hundreds of milliseconds to 1 second) to isolate failures across hundreds of microservices. The key insight: it's often better to serve degraded functionality immediately than to wait on a sick dependency that might never respond. You're trading perfect responses for predictable latency and system stability.
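As a concrete illustration, here is a minimal, single-threaded sketch of that closed → open → half-open state machine in Python. The class, names, and single-probe half-open behavior are simplifications of the description above, not any particular library's API; production breakers such as Hystrix or Resilience4j allow a few concurrent probes and also classify slow calls.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=0.5, min_samples=20,
                 window_seconds=10.0, cooldown_seconds=5.0):
        self.failure_threshold = failure_threshold  # trip at >= 50% failures
        self.min_samples = min_samples              # avoid tripping on low traffic
        self.window_seconds = window_seconds        # sliding-window length
        self.cooldown_seconds = cooldown_seconds    # time spent open before probing
        self.state = "closed"
        self.opened_at = 0.0
        self.samples = deque()                      # (timestamp, succeeded) pairs

    def call(self, func, *args, **kwargs):
        now = time.monotonic()
        if self.state == "open":
            if now - self.opened_at < self.cooldown_seconds:
                raise CircuitOpenError("failing fast: dependency marked unhealthy")
            self.state = "half_open"                # cooldown elapsed: allow a probe
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record(ok=False)
            raise
        self._record(ok=True)
        return result

    def _record(self, ok):
        now = time.monotonic()
        if self.state == "half_open":
            # The probe's outcome decides: success closes, failure re-opens.
            if ok:
                self.state = "closed"
                self.samples.clear()
            else:
                self.state = "open"
                self.opened_at = now
            return
        self.samples.append((now, ok))
        while self.samples and now - self.samples[0][0] > self.window_seconds:
            self.samples.popleft()                  # drop samples outside the window
        failures = sum(1 for _, succeeded in self.samples if not succeeded)
        if (len(self.samples) >= self.min_samples
                and failures / len(self.samples) >= self.failure_threshold):
            self.state = "open"                     # trip: reject calls immediately
            self.opened_at = now
```

A caller would wrap each dependency call, for example `breaker.call(fetch_profile, user_id)` (a hypothetical function), and catch `CircuitOpenError` to return cached data or a degraded response instead of blocking on the sick dependency.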
💡 Key Takeaways
Monitors calls in a sliding window (typically 10 seconds) and tracks failure rate plus slow call rate to detect unhealthy dependencies
Opens when error rate exceeds threshold (commonly 50%) with minimum sample size (often 20 requests) to avoid false positives from low traffic
Preserves resources by failing in under 1ms when open instead of waiting for 5+ second timeouts that exhaust thread pools
Half-open state allows controlled probes (1 to 5 concurrent requests) after a cooldown (5 to 30 seconds) to test recovery without overwhelming the dependency
Netflix's Hystrix used small per-dependency thread pools (10 to 20 threads each), keeping the blast radius contained so one failing service couldn't starve the others
Treats slow successes as failures: if calls exceed a latency Service Level Objective (SLO), such as 200ms at p95, trip the breaker to preserve tail latency (sketched below)
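A hedged sketch of that last rule, assuming a breaker that exposes `record_success` and `record_failure` hooks feeding its sliding window (illustrative names, not a specific library's API):

```python
import time


def timed_call(func, record_success, record_failure, slow_call_slo_seconds=0.2):
    """Invoke func and classify the outcome for the breaker's sliding window."""
    start = time.monotonic()
    try:
        result = func()
    except Exception:
        record_failure()    # hard failure: timeout, connection error, 5xx
        raise
    elapsed = time.monotonic() - start
    if elapsed > slow_call_slo_seconds:
        record_failure()    # breached the latency SLO: count as a failure
    else:
        record_success()
    return result
```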
📌 Examples
Netflix Hystrix production defaults: 10 second rolling window, 20 minimum requests, 50% error threshold, 5 second sleep before half-open, hundreds of milliseconds to 1 second timeouts per dependency
Envoy service mesh at Lyft/Shopify: Ejects instances after consecutive 5xx errors over 5 to 10 second intervals, base ejection time 30 to 300 seconds, caps maximum 10% to 50% of hosts ejected to preserve capacity
Alibaba Singles Day with Sentinel: Handles hundreds of thousands of orders per second using slow call ratio triggers to prevent cascading failures during hot partitions