Bulkhead Monitoring: Detecting Exhaustion Before Cascade Failures
Key Metrics to Track
Pool utilization: the percentage of threads or connections in use. Rejection rate: the share of requests rejected because the pool is full. Queue depth: the number of requests waiting in the queue. Wait time: how long a request waits before execution. Together, these metrics reveal bulkhead health before failures cascade to users.
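As a rough illustration, the metrics above can be derived from a handful of raw counters. The Python sketch below assumes a hypothetical PoolSnapshot record; the field names are illustrative and do not come from any particular bulkhead library.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    """Hypothetical point-in-time counters for one bulkhead pool."""
    name: str
    max_size: int         # configured pool capacity
    active: int           # threads or connections currently in use
    queued: int           # requests waiting in the queue (queue depth)
    accepted: int         # requests admitted during the sampling window
    rejected: int         # requests turned away because the pool was full
    total_wait_ms: float  # summed queue wait of the accepted requests

    @property
    def utilization(self) -> float:
        """Fraction of the pool in use (0.0 to 1.0)."""
        return self.active / self.max_size if self.max_size else 0.0

    @property
    def rejection_rate(self) -> float:
        """Share of all requests in the window that were rejected."""
        total = self.accepted + self.rejected
        return self.rejected / total if total else 0.0

    @property
    def mean_wait_ms(self) -> float:
        """Average time an accepted request waited before execution."""
        return self.total_wait_ms / self.accepted if self.accepted else 0.0


snap = PoolSnapshot("payments", max_size=20, active=15, queued=4,
                    accepted=990, rejected=10, total_wait_ms=19800.0)
print(f"{snap.name}: util={snap.utilization:.0%} "
      f"rejections={snap.rejection_rate:.1%} "
      f"queue={snap.queued} wait={snap.mean_wait_ms:.0f}ms")
```

A mean wait hides tail behavior; a real collector would more likely export a histogram or percentiles, which this sketch leaves out for brevity.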
Alert Thresholds
Alert before exhaustion, not after. Treat 70% utilization as a warning and 90% as critical. A rejection rate above 1% warrants investigation. A queue depth that grows continuously indicates insufficient capacity. Wait time that exceeds the SLA threshold (for example, 100 ms) signals degradation.
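These thresholds translate directly into a small rule check. The following Python sketch is one possible encoding, not a reference implementation: the function names are hypothetical, and "growing continuously" is interpreted here, as an assumption, to mean that at least three consecutive samples each exceed the previous one.

```python
from typing import List, Sequence


def utilization_level(utilization: float) -> str:
    """Map utilization (0.0 to 1.0) to the warning/critical levels."""
    if utilization >= 0.90:
        return "critical"
    if utilization >= 0.70:
        return "warning"
    return "ok"


def queue_is_growing(depths: Sequence[int], min_samples: int = 3) -> bool:
    """True when every sample is strictly larger than the one before it.

    min_samples is an assumed guard against reacting to a single reading.
    """
    if len(depths) < min_samples:
        return False
    return all(later > earlier for earlier, later in zip(depths, depths[1:]))


def evaluate(utilization: float, rejection_rate: float,
             queue_depths: Sequence[int], wait_ms: float,
             sla_wait_ms: float = 100.0) -> List[str]:
    """Return the alerts that fire for one pool, in a fixed order."""
    alerts = []
    level = utilization_level(utilization)
    if level != "ok":
        alerts.append(f"utilization {level}")
    if rejection_rate > 0.01:
        alerts.append("rejection rate above 1%")
    if queue_is_growing(queue_depths):
        alerts.append("queue depth growing")
    if wait_ms > sla_wait_ms:
        alerts.append("wait time over SLA")
    return alerts


print(evaluate(0.75, 0.02, [1, 3, 6], 120.0))
```

The 100 ms default mirrors the example SLA in the text; in practice it would come from each service's own SLA.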
Dashboard Design
Show all bulkheads on one dashboard, with pool name, current utilization, rejection count, and queue depth for each. Color code by health: green under 70%, yellow from 70% up to 90%, red at 90% and above. During an incident, this dashboard shows at a glance which bulkhead is under stress and containing the failure.
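To make the layout concrete, here is a minimal Python sketch that renders such a dashboard as plain text. The Row type and render function are hypothetical. The colors reuse the alert thresholds (70% warning, 90% critical), so exactly 90% is treated as red here, and sorting the most utilized pool to the top is an added design choice rather than a requirement.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    """One dashboard line per bulkhead (hypothetical shape)."""
    pool: str
    utilization: float  # 0.0 to 1.0
    rejections: int
    queue_depth: int


def health_color(utilization: float) -> str:
    """Color bands aligned with the 70% and 90% alert thresholds."""
    if utilization >= 0.90:
        return "red"
    if utilization >= 0.70:
        return "yellow"
    return "green"


def render(rows: List[Row]) -> List[str]:
    """Return a header plus one line per pool, most utilized pool first."""
    lines = [f"{'POOL':<12}{'UTIL':>6}{'REJECTED':>10}{'QUEUE':>7}  HEALTH"]
    for row in sorted(rows, key=lambda r: r.utilization, reverse=True):
        lines.append(
            f"{row.pool:<12}{row.utilization:>6.0%}"
            f"{row.rejections:>10}{row.queue_depth:>7}"
            f"  {health_color(row.utilization)}"
        )
    return lines


lines = render([
    Row("search", 0.42, 0, 0),
    Row("payments", 0.95, 37, 12),
    Row("inventory", 0.75, 2, 3),
])
print("\n".join(lines))
```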
Capacity Planning
Track peak utilization over time. If Pool A consistently hits 80% during daily peaks, increase its size. If Pool B never exceeds 30%, it is probably oversized and its resources may be better used elsewhere. Review bulkhead sizing quarterly against actual usage patterns. As traffic grows, bulkhead sizes generally need to grow with it.
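The sizing rules above can be sketched as a simple check over recorded daily peaks. This Python example is illustrative only: the function names are hypothetical, "consistently" is taken to mean every day in the review window, and the 70% target used to suggest a new size is an assumption borrowed from the warning threshold. It also assumes that utilization scales inversely with pool size, which is a simplification.

```python
import math
from typing import Sequence


def sizing_advice(daily_peaks: Sequence[float],
                  high: float = 0.80, low: float = 0.30) -> str:
    """Classify a pool from its daily peak utilization (0.0 to 1.0)."""
    if not daily_peaks:
        return "keep"
    if all(peak >= high for peak in daily_peaks):
        return "grow"    # hits the high mark every day in the window
    if max(daily_peaks) <= low:
        return "shrink"  # never exceeds the low mark: a candidate only
    return "keep"


def suggested_size(current_size: int, peak_utilization: float,
                   target: float = 0.70) -> int:
    """Pool size that would bring the observed peak down to the target."""
    return math.ceil(current_size * peak_utilization / target)


print(sizing_advice([0.82, 0.85, 0.81]))  # pool A style pattern
print(sizing_advice([0.20, 0.28, 0.25]))  # pool B style pattern
print(suggested_size(20, 0.80))
```

A "shrink" result is best read as a prompt for review, since a pool that is idle most of the quarter may still need headroom for rare bursts.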
Correlating with Downstream Health
Bulkhead stress often indicates a downstream problem. When Pool A utilization spikes, check Service A latency and error rates. The stressed bulkhead is usually the symptom; the downstream service is usually the cause. Dashboards should link bulkhead metrics to downstream service health so the root cause can be identified quickly.
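A first-pass triage along these lines might look like the Python sketch below. Everything in it is an assumption made for illustration: the DownstreamHealth record, the choice of p99 latency, and the rule that twice the baseline latency or one extra percentage point of errors counts as degraded.

```python
from dataclasses import dataclass


@dataclass
class DownstreamHealth:
    """Hypothetical health sample for the service behind a pool."""
    p99_latency_ms: float
    error_rate: float  # 0.0 to 1.0


def triage(pool_utilization: float,
           current: DownstreamHealth,
           baseline: DownstreamHealth,
           latency_factor: float = 2.0,
           error_margin: float = 0.01) -> str:
    """Suggest where to look first when a bulkhead is under stress."""
    if pool_utilization < 0.70:
        return "pool healthy"
    latency_degraded = (
        current.p99_latency_ms >= latency_factor * baseline.p99_latency_ms
    )
    errors_degraded = current.error_rate >= baseline.error_rate + error_margin
    if latency_degraded or errors_degraded:
        return "downstream degraded"
    # Downstream looks normal, so the pressure is more likely local:
    # a traffic increase or an undersized pool.
    return "check load and pool size"


baseline = DownstreamHealth(p99_latency_ms=200.0, error_rate=0.001)
print(triage(0.95, DownstreamHealth(900.0, 0.001), baseline))
print(triage(0.95, DownstreamHealth(210.0, 0.001), baseline))
```

The second outcome covers the case where the bulkhead is stressed but the downstream service looks normal, which points toward increased load or an undersized pool instead.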