
Common Pitfalls and Anti-Patterns

Anti-Pattern: Shedding Too Late

A common mistake is shedding after the expensive work is already done. A request authenticates (5 ms), validates (2 ms), and queries the database (30 ms), and only then is rejected. That wastes 37 ms of processing time plus database capacity on a request that was never going to be served. At a shed rate of 10,000 requests per second, that is 10,000 × 37 ms = 370 seconds of work thrown away every second. Shedding checks should take less than 1 ms and run before any significant work.
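One way to keep the check cheap is a concurrency limit evaluated at the very top of the handler. The sketch below is a minimal illustration, not a prescribed implementation: `EarlyShedder`, `handle`, and `expensive_pipeline` are hypothetical names, and an in-flight request count is assumed as the overload signal.

```python
import threading


class EarlyShedder:
    """Rejects requests once the in-flight count reaches a fixed limit."""

    def __init__(self, max_in_flight: int) -> None:
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # A counter comparison under a lock: far cheaper than auth or a query.
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1


def handle(shedder, request, expensive_pipeline):
    """Run the shed check first; auth, validation and queries happen only after it."""
    if not shedder.try_acquire():
        return 503, "overloaded"
    try:
        return 200, expensive_pipeline(request)
    finally:
        shedder.release()


def wasted_seconds_per_second(shed_rps: int, wasted_ms_per_request: int) -> float:
    """Work thrown away each second when shedding happens after the expensive steps."""
    return shed_rps * wasted_ms_per_request / 1000
```

With the figures from the text, `wasted_seconds_per_second(10_000, 37)` evaluates to 370.0, which is the cost the early check avoids.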

Anti-Pattern: Unbounded Queues

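Unbounded queues mask overload until memory is exhausted. A queue growing from 100 to 1,000,000 entries looks fine until the process crashes with an OOM (Out of Memory) error. Bound every queue to 2-5 seconds of burst capacity: capacity = expected_rps × acceptable_wait. For example, 500 RPS with a 2-second acceptable wait gives a bound of 1,000 entries.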
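The sizing rule can be sketched with Python's standard `queue.Queue`, which accepts a `maxsize` and raises `queue.Full` from `put_nowait` when the bound is reached. The helper names `queue_capacity` and `try_enqueue` are illustrative assumptions.

```python
import queue


def queue_capacity(expected_rps: float, acceptable_wait_s: float) -> int:
    """Bound = expected_rps × acceptable_wait, with a minimum of one slot."""
    return max(1, int(expected_rps * acceptable_wait_s))


def try_enqueue(q: queue.Queue, item) -> bool:
    """Enqueue without blocking; a full queue means the caller should shed."""
    try:
        q.put_nowait(item)
        return True
    except queue.Full:
        return False


# Example sizing: 500 RPS with a 2-second acceptable wait.
work_queue = queue.Queue(maxsize=queue_capacity(500, 2))
```

A `False` return from `try_enqueue` is the signal to reject the request immediately instead of letting it wait.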

⚠️ Key Trade-off: Queue sizing balances burst absorption against memory and latency. Start small and increase based on observed patterns.

Anti-Pattern: Retry Amplification

The system sheds 1,000 requests and every client retries immediately. The next interval brings those 1,000 retries on top of 1,000 new requests, so 2,000 arrive, and the loop worsens with each iteration. The fix is exponential backoff with jitter on clients, plus Retry-After headers from servers.
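A client-side delay calculation might look like the sketch below. It uses the "full jitter" variant (a uniform random delay up to an exponentially growing ceiling), which is one common choice and not the only one. The function name, the default base and cap, and the decision to add jitter on top of a server-supplied Retry-After value are all assumptions made for illustration.

```python
import random


def backoff_delay(attempt, base_s=0.1, cap_s=10.0, retry_after_s=None, rng=random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    # The ceiling doubles with each attempt, up to a fixed cap.
    ceiling = min(cap_s, base_s * (2 ** attempt))
    # Full jitter: a uniform draw spreads clients out so they do not retry in lockstep.
    jitter = rng.uniform(0.0, ceiling)
    if retry_after_s is not None:
        # Wait at least as long as the server asked, plus jitter to stay desynchronized.
        return retry_after_s + jitter
    return jitter
```

Passing a seeded `random.Random` instance as `rng` makes the delays reproducible in tests.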

Anti-Pattern: Priority Inversion

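A critical payment request calls a fraud check service that is classified as optional. When the fraud service is shed, payments fail. If A hard-depends on B and A is critical, B is effectively critical too and cannot be shed as low priority. Either raise B's priority to match A, or make the dependency truly optional by giving A a fallback for when B is unavailable.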
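The fallback option can be sketched as follows. Every name here (`DependencyShed`, `process_payment`, the `fraud_check` and `charge` callables) is hypothetical, and approving a payment with unknown risk while flagging it for later review is an assumed business policy, not something the pattern itself dictates.

```python
class DependencyShed(Exception):
    """Raised by a client when the downstream service shed the call."""


def process_payment(payment, fraud_check, charge):
    """Charge a payment even if the optional fraud check was shed."""
    try:
        risk = fraud_check(payment)
    except DependencyShed:
        # Fallback: the critical path continues, flagged for asynchronous review.
        risk = "unknown"

    if risk == "high":
        return {"status": "declined", "review": False}

    receipt = charge(payment)
    return {"status": "approved", "receipt": receipt, "review": risk == "unknown"}
```

With this shape, shedding the fraud service degrades the payment path instead of failing it.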

Anti-Pattern: Single Metric Decisions

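CPU-only shedding ignores memory pressure. Latency-only shedding cannot tell whether the slowness is internal or comes from a downstream dependency. Use a composite health score instead: health = 0.4×cpu + 0.3×memory + 0.3×latency, with each component normalized to 0-1 so that 1 means fully healthy (for example, cpu = 1 − CPU utilization). Shed when health drops below a threshold.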
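A small sketch of the composite score, using the 0.4/0.3/0.3 weights from the text. The normalization is an assumption: each input is converted to a health value between 0 and 1, where 1 means fully healthy, so that a lower score means a sicker system. The latency budget and the 0.3 shed threshold are illustrative values only.

```python
def clamp01(x: float) -> float:
    """Clamp a value into the range [0, 1]."""
    return max(0.0, min(1.0, x))


def composite_health(cpu_util, mem_util, p99_latency_ms, latency_budget_ms):
    """Weighted health score in [0, 1]; 1.0 is fully healthy."""
    cpu_health = 1.0 - clamp01(cpu_util)
    mem_health = 1.0 - clamp01(mem_util)
    # Latency at or beyond the budget counts as zero health.
    latency_health = 1.0 - clamp01(p99_latency_ms / latency_budget_ms)
    return 0.4 * cpu_health + 0.3 * mem_health + 0.3 * latency_health


def should_shed(health: float, threshold: float = 0.3) -> bool:
    """Shed when the composite score drops below the threshold."""
    return health < threshold
```

For example, 95% CPU, 90% memory, and a p99 of 180 ms against a 200 ms budget scores about 0.08, which is well under the assumed threshold.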

Anti-Pattern: No Graceful Recovery

Aggressive shedding with a single on/off threshold creates oscillation: load drops as soon as shedding starts, shedding stops, and load spikes again. Use hysteresis: start shedding at 80% CPU and stop only when CPU falls below 60%. Then restore acceptance gradually using the AIMD (Additive Increase, Multiplicative Decrease) pattern: increase the acceptance rate by a small constant on success, and cut it in half on overload.
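Both mechanisms fit in a few lines. In the sketch below the 80% and 60% thresholds and the halving on overload come from the text, while the class names, the 0.05 additive step, and the 0.05 floor on the acceptance rate are assumptions chosen for illustration.

```python
class HysteresisShedder:
    """Starts shedding at `high`, stops only once load falls below `low`."""

    def __init__(self, high: float = 0.80, low: float = 0.60) -> None:
        self.high = high
        self.low = low
        self.shedding = False

    def update(self, cpu: float) -> bool:
        if cpu >= self.high:
            self.shedding = True
        elif cpu < self.low:
            self.shedding = False
        # Between low and high the previous state is kept, which damps oscillation.
        return self.shedding


class AimdAcceptance:
    """Fraction of requests to accept, adjusted with AIMD."""

    def __init__(self, step: float = 0.05, floor: float = 0.05) -> None:
        self.step = step
        self.floor = floor
        self.rate = 1.0

    def on_success(self) -> None:
        # Additive increase: recover slowly.
        self.rate = min(1.0, self.rate + self.step)

    def on_overload(self) -> None:
        # Multiplicative decrease: back off fast, but never to zero.
        self.rate = max(self.floor, self.rate * 0.5)
```

A caller would typically admit a request when `random.random() < acceptance.rate`, which is one simple way to apply the fraction.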

💡 Key Takeaways
Shed before expensive work - rejection after database query wastes 30ms+ of resources
Bound all queues to 2-5 seconds of burst capacity; unbounded queues cause sudden OOM failure
Use composite health scores (CPU + memory + latency), not single metrics for shedding decisions
📌 Interview Tips
1. Priority inversion is a sophisticated failure mode to mention - show you think about dependency graphs
2. Retry amplification shows understanding of feedback loops in distributed systems
3. Mention hysteresis to prevent oscillation: shed at 80%, recover at 60%