Event Loop Failure Modes: Stalls, Backpressure, and Thundering Herds
Event Loop Stalls
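Event loops must not block. A single thread services every connection on the loop, so one blocking call stalls all of them. A handler that blocks for 100 milliseconds delays every other connection on that loop, whether there are 100 or 100,000, for the same 100 milliseconds. When several slow handlers queue behind one another the delays add up, and P99 latency can spike to seconds.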
Common stall causes: synchronous file I/O, blocking DNS lookups, CPU-intensive computation in handlers, and accidental calls to blocking APIs. Monitor event loop lag, the delay between when a callback was scheduled to run and when it actually runs. On a healthy loop this should be microseconds. Millisecond lags indicate problems. Second lags are emergencies.
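One way to measure loop lag is to schedule a short sleep and record how late it wakes up. The sketch below assumes Python's asyncio. The names measure_loop_lag and blocking_handler are illustrative, and the 100 ms time.sleep stands in for any accidental blocking call.

```python
import asyncio
import time


async def measure_loop_lag(interval: float, samples: int) -> list[float]:
    """Return how late each `interval`-second sleep woke up, in seconds."""
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        # Anything beyond `interval` is time the loop spent busy elsewhere.
        lags.append(max(0.0, time.perf_counter() - start - interval))
    return lags


async def blocking_handler() -> None:
    # Deliberately wrong: time.sleep blocks the whole loop, not just this task.
    time.sleep(0.1)


async def demo() -> float:
    monitor = asyncio.create_task(measure_loop_lag(interval=0.01, samples=5))
    await asyncio.sleep(0)  # yield once so the monitor starts its first sleep
    await blocking_handler()
    return max(await monitor)


worst_lag = asyncio.run(demo())
print(f"worst observed lag: {worst_lag * 1000:.1f} ms")
```

In a real service the monitor would run for the lifetime of the loop and export its samples to a metrics system. The alerting thresholds would follow the microsecond, millisecond, and second bands described above.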
Backpressure Failures
When producers outpace consumers, buffers grow. Without backpressure, memory is eventually exhausted. The event loop reads data from sockets faster than handlers process it, and every connection queues the difference. 10,000 connections with 10 KB buffered each is 100 MB. Growth continues until the process runs out of memory (OOM).
Implement backpressure by pausing reads when buffers fill. Once the application stops reading, the kernel socket buffer fills and TCP flow control slows the sender. Stop accepting new connections when at capacity. Return 503 (Service Unavailable) rather than accepting requests you cannot handle. The goal is graceful degradation: serve some requests well rather than all requests poorly.
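Pausing reads when a buffer fills can be sketched with an asyncio protocol. The pause_reading and resume_reading calls are real asyncio transport methods. The watermark values, the BackpressureProtocol class, and its consume method are assumptions made for illustration.

```python
import asyncio

HIGH_WATER = 64 * 1024  # assumed threshold: pause reads at this many buffered bytes
LOW_WATER = 16 * 1024   # assumed threshold: resume reads once drained to this level


class BackpressureProtocol(asyncio.Protocol):
    """Buffers incoming bytes and pauses the transport when the buffer is full."""

    def __init__(self) -> None:
        self.transport = None
        self.buffered = bytearray()
        self.paused = False

    def connection_made(self, transport) -> None:
        self.transport = transport

    def data_received(self, data: bytes) -> None:
        self.buffered.extend(data)
        if not self.paused and len(self.buffered) >= HIGH_WATER:
            # Stop reading from the socket so the sender is slowed
            # instead of this process buffering without bound.
            self.transport.pause_reading()
            self.paused = True

    def consume(self, nbytes: int) -> bytes:
        """Hypothetical hook: the handler calls this as it processes data."""
        chunk = bytes(self.buffered[:nbytes])
        del self.buffered[:nbytes]
        if self.paused and len(self.buffered) <= LOW_WATER:
            self.transport.resume_reading()
            self.paused = False
        return chunk
```

Two watermarks rather than one keep the connection from flapping between paused and resumed on every small read. The same pattern applies to the write side, where asyncio calls pause_writing and resume_writing on the protocol.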
Thundering Herd
A thundering herd occurs when multiple threads or processes waiting on the same event all wake simultaneously. If all 64 workers call epoll_wait on the same listen socket, one incoming connection wakes all 64. Only one accept succeeds. The other 63 waste CPU time and context switches.
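Solutions: EPOLLEXCLUSIVE (Linux 4.5 and later) asks the kernel to wake one or a few waiters instead of all of them. SO_REUSEPORT lets each worker bind its own listening socket to the same port, giving each worker a separate accept queue. A load balancer can distribute connections before they reach the server. Modern kernels and libraries handle much of this, but misconfiguration still causes thundering herds.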
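The SO_REUSEPORT approach can be sketched in a few lines, assuming a platform that defines socket.SO_REUSEPORT (Linux does; Windows does not). The helper name make_reuseport_listener and the backlog of 128 are illustrative choices. In a real server each worker process or thread would create its own listener rather than one function creating two.

```python
import socket


def make_reuseport_listener(host: str, port: int) -> socket.socket:
    """Create a listening socket that shares its port with other listeners."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind() on every socket that shares the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


def demo() -> tuple[int, int]:
    # Port 0 lets the kernel pick a free port for the first listener.
    first = make_reuseport_listener("127.0.0.1", 0)
    port = first.getsockname()[1]
    # A second listener on the same port gets its own accept queue.
    second = make_reuseport_listener("127.0.0.1", port)
    try:
        return port, second.getsockname()[1]
    finally:
        first.close()
        second.close()


first_port, second_port = demo()
print(f"two listeners bound to port {first_port}")
```

Because each listener has its own queue, a new connection wakes only the worker that owns the queue it lands in. For workers that must share a single listen socket, Python exposes select.EPOLLEXCLUSIVE on Linux, which can be combined with select.EPOLLIN when registering the socket with an epoll object.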