Failure Modes: Tail Latency Amplification, Queuing Collapse, and Retry Storms
Tail Latency Amplification
In distributed systems, a single request often fans out to multiple services. If any downstream call is slow, the entire request is slow. This is tail latency amplification: the more services you call, the higher your effective tail latency.
If each of 5 services has a p99 latency of 100ms and you call all 5 in parallel, your p99 is not 100ms. With independent services, the probability that at least one call exceeds its own p99 is 1 - (0.99)^5 ≈ 4.9%, so each service's p99 becomes roughly your p95. With 100 parallel calls, 1 - (0.99)^100 ≈ 63%: a majority of requests hit at least one slow service.
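The arithmetic above can be sketched as a small helper (the function name and the 1% slow-call probability are illustrative, not from any particular library):

```python
def p_any_slow(n_calls: int, p_slow: float = 0.01) -> float:
    """Probability that at least one of n independent parallel calls
    exceeds its own p99 latency (p_slow = 0.01 means "slower than p99")."""
    return 1 - (1 - p_slow) ** n_calls

# Fan-out amplifies the chance of hitting at least one slow call:
print(f"{p_any_slow(1):.1%}")    # 1.0%   — single call
print(f"{p_any_slow(5):.3%}")    # 4.901% — 5 parallel calls
print(f"{p_any_slow(100):.1%}")  # 63.4%  — 100 parallel calls
```

Note this assumes independent latencies; correlated slowness (a shared database, a noisy host) makes the amplification worse.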
Queuing Collapse
Queuing collapse happens when a temporary load spike makes queues grow so deep that the system never recovers. The spike passes, but requests now time out before workers reach them. Dead requests keep occupying queue space, and new requests join a queue of mostly dead work.
The fix requires active queue management. Set a queue length limit and reject requests when it is full (load shedding). Attach a deadline to each request so workers can skip work the client has already given up on. Without these protections, a 30-second spike can cause hours of degraded service.
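A minimal sketch of both protections using a bounded in-process queue. The bound, the deadline value, and the `handle` function are hypothetical placeholders, not a specific framework's API:

```python
import queue
import time

MAX_QUEUE = 100          # hypothetical bound; tune per service
REQUEST_DEADLINE = 1.0   # seconds the client is assumed to wait

work_queue = queue.Queue(maxsize=MAX_QUEUE)

def submit(request) -> bool:
    """Load shedding: reject immediately when the queue is full,
    instead of letting the backlog grow without bound."""
    try:
        work_queue.put_nowait((time.monotonic() + REQUEST_DEADLINE, request))
        return True
    except queue.Full:
        return False  # caller gets a fast rejection, not a slow timeout

def worker():
    while True:
        deadline, request = work_queue.get()
        if time.monotonic() > deadline:
            continue      # client already gave up: skip the dead work
        handle(request)   # hypothetical request handler
```

Rejecting at `submit` keeps the queue short enough that accepted requests are still alive when a worker reaches them.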
Retry Storms
When a service is slow, clients time out and retry. Each retry adds more load. The service, already struggling, receives even more requests, causing more timeouts and more retries. This positive feedback loop can turn a minor slowdown into a complete outage.
A service running at 90% capacity experiences a GC pause. Response times spike, 10% of requests time out and retry, and offered load jumps to 99% of capacity. More requests time out; now 30% retry and offered load exceeds capacity. Queues explode, 80% of requests time out, and the retry rate overwhelms everything.
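The standard mitigation on the client side is capped exponential backoff with jitter, so retries spread out instead of arriving as a synchronized wave. A sketch under assumed parameters (attempt count, delays, and the `do_request` callable are all illustrative):

```python
import random
import time

def call_with_retries(do_request, max_attempts=3, base_delay=0.1, cap=2.0):
    """Capped exponential backoff with full jitter: each retry waits a
    random amount up to min(cap, base_delay * 2**attempt), de-correlating
    clients so a struggling service is not hammered in lockstep."""
    for attempt in range(max_attempts):
        try:
            return do_request()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure
            time.sleep(random.uniform(0, min(cap, base_delay * 2 ** attempt)))
```

Bounding `max_attempts` matters as much as the jitter: it caps the load multiplier a retrying client can impose during an incident.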
Coordinated Omission
Most load testing tools share a measurement flaw called coordinated omission: they wait for a response before sending the next request. If a request takes 5 seconds, the tool pauses for 5 seconds, sending nothing. This hides the true impact of slow responses.
In reality, users do not stop sending requests when your service is slow; real load keeps arriving. A proper load test sends requests at a fixed rate regardless of response time and measures each latency from the scheduled send time. The difference is dramatic: a tool with coordinated omission might report p99 = 50ms when the true p99 is 5000ms.
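A sketch of such an open-loop generator, assuming one thread per in-flight request and a caller-supplied `send_request` (both assumptions for illustration; real tools use async I/O):

```python
import threading
import time

def open_loop_load(send_request, rate_hz=100.0, duration_s=1.0):
    """Open-loop load generation: requests fire on a fixed schedule, and
    latency is measured from the *intended* send time, so a slow response
    cannot suppress subsequent load (no coordinated omission)."""
    interval = 1.0 / rate_hz
    latencies, threads = [], []
    start = time.monotonic()
    for i in range(int(rate_hz * duration_s)):
        scheduled = start + i * interval
        now = time.monotonic()
        if scheduled > now:
            time.sleep(scheduled - now)  # wait for the next slot, never longer

        def fire(sched=scheduled):
            send_request()
            # measured from the schedule, so queueing delay is included
            latencies.append(time.monotonic() - sched)

        t = threading.Thread(target=fire)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return sorted(latencies)
```

Because latency is measured from `scheduled` rather than from the actual send, any backlog the service builds up shows in the numbers, which is exactly what a closed-loop tool hides.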