Event Loop Failure Modes: Stalls, Backpressure, and Thundering Herds

Event Loop Stalls

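Event loops must not block. A single-threaded loop runs one handler at a time, so one blocking call stalls every connection on that loop. A handler that blocks for 100 milliseconds delays all 100,000 other connections for that duration. When several slow handlers queue up behind each other, the delays add, and P99 latency spikes to seconds.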
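A minimal sketch of the stall, assuming Python's asyncio as the event loop (the text names no specific runtime). The names `other_connection`, `handler_blocking`, `handler_offloaded`, and `latency_seen_by_others` are hypothetical. It measures how long an unrelated 1 ms task takes while a 100 ms handler runs, first as a blocking call and then offloaded to a worker thread.

```python
import asyncio
import time


async def other_connection() -> float:
    """Stand-in for any other connection: work that should finish in about 1 ms."""
    start = time.perf_counter()
    await asyncio.sleep(0.001)
    return time.perf_counter() - start


async def handler_blocking() -> None:
    time.sleep(0.1)                            # blocking call: the loop can run nothing else


async def handler_offloaded() -> None:
    await asyncio.to_thread(time.sleep, 0.1)   # same work on a worker thread; the loop stays free


async def serve(handler) -> float:
    waiter = asyncio.create_task(other_connection())
    await asyncio.sleep(0)                     # let the other connection start its 1 ms wait
    await handler()
    return await waiter


def latency_seen_by_others(handler) -> float:
    """Seconds the unrelated connection took while `handler` was running."""
    return asyncio.run(serve(handler))
```

Calling `latency_seen_by_others(handler_blocking)` should report roughly the full 100 ms, while the offloaded variant should stay near 1 ms. `asyncio.to_thread` requires Python 3.9 or later.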

Common stall causes: synchronous file I/O, blocking DNS lookups, CPU-intensive computation in handlers, and accidental calls to blocking APIs. Monitor event loop lag: the time between loop iterations should stay in the microseconds. Millisecond lags indicate problems. Second lags are emergencies.
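One common way to approximate event loop lag is to schedule a short timer and record how late it fires. The sketch below assumes asyncio. `monitor_lag` and `classify` are hypothetical helpers, and the thresholds in `classify` simply restate the microsecond, millisecond, and second bands from the text.

```python
import asyncio
import time


async def monitor_lag(interval: float, samples: int) -> list[float]:
    """Sleep `interval` seconds repeatedly; lag is how late each wakeup arrives."""
    lags = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        lags.append(max(0.0, time.perf_counter() - start - interval))
    return lags


def classify(lag_seconds: float) -> str:
    """Map a lag sample to the severity bands described in the text."""
    if lag_seconds >= 1.0:
        return "emergency"
    if lag_seconds >= 0.001:
        return "problem"
    return "ok"
```

Timer granularity adds its own offset to each sample, so in practice the thresholds are best calibrated against the lag measured on an idle loop rather than taken as absolute.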

Backpressure Failures

When producers outpace consumers, buffers grow. Without backpressure, that growth is unbounded and memory runs out. The event loop accumulates data faster than handlers process it, and each connection queues its own backlog. 10,000 connections with 10 KB buffered each is 100 MB. Growth continues until the process runs out of memory (OOM).
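The arithmetic can be checked directly, and extended to estimate how long unbounded growth takes to exhaust memory. In this sketch the 4 GB budget and the 12 MB/s and 10 MB/s rates are made-up illustrative numbers, not figures from the text, and both function names are hypothetical.

```python
KB = 1_000          # decimal units, matching the text's 10,000 times 10 KB = 100 MB
MB = 1_000 * KB
GB = 1_000 * MB


def total_buffered(connections: int, bytes_per_connection: int) -> int:
    """Memory held in per-connection buffers, in bytes."""
    return connections * bytes_per_connection


def seconds_until_exhausted(budget: int, inflow: int, outflow: int) -> float:
    """Time until a backlog growing at (inflow - outflow) bytes/s fills `budget` bytes."""
    growth = inflow - outflow
    if growth <= 0:
        return float("inf")     # consumers keep up, so the backlog never fills the budget
    return budget / growth


print(total_buffered(10_000, 10 * KB) / MB)               # 100.0
print(seconds_until_exhausted(4 * GB, 12 * MB, 10 * MB))  # 2000.0 (assumed rates)
```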

Implement backpressure by pausing reads when buffers fill. Stop accepting new connections when at capacity. Return 503 rather than accepting requests you cannot handle. The goal is graceful degradation: serve some requests well rather than all requests poorly.
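Both tactics can be sketched with a bounded queue between the code that reads from connections and the handlers that process requests. This assumes asyncio. `BoundedIntake` and its methods are hypothetical names, and the 202 status code for accepted work is an illustrative choice.

```python
import asyncio


class BoundedIntake:
    """Hypothetical intake stage: a bounded queue between readers and handlers."""

    def __init__(self, capacity: int) -> None:
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=capacity)

    async def read_with_backpressure(self, item: bytes) -> None:
        # `put` suspends this reader while the queue is full, so nothing more is
        # read from that connection until a handler frees a slot.
        await self.queue.put(item)

    def admit_or_shed(self, request: bytes) -> int:
        # Load shedding: reject immediately rather than queue without bound.
        try:
            self.queue.put_nowait(request)
            return 202      # accepted for processing
        except asyncio.QueueFull:
            return 503      # Service Unavailable
```

The first method trades latency for safety by making the sender wait. The second fails fast, so clients learn about overload immediately. Create the `BoundedIntake` inside the running event loop, since older Python versions bind the queue to a loop at construction time.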

Thundering Herd

Multiple threads or processes waiting on the same event all wake at once. If all 64 workers call epoll_wait on the same listen socket, one incoming connection wakes all 64. Only one accept succeeds. The other 63 wake, fail to accept, and go back to waiting, wasting CPU time and context switches.
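The effect can be illustrated without sockets. The sketch below is an analogy, not real epoll: a `threading.Condition` stands in for the shared listen socket, `notify_all` for the wake-everyone behavior, and `notify(1)` for an exclusive wakeup. `simulate_one_connection` is a hypothetical name.

```python
import threading
import time


def simulate_one_connection(workers: int, wake_all: bool) -> tuple[int, int]:
    """Return (wakeups, accepts) after a single connection arrives."""
    cond = threading.Condition()
    pending = []                                   # stand-in for the accept queue
    stats = {"waiting": 0, "wakeups": 0, "accepts": 0, "stop": False}

    def worker() -> None:
        with cond:
            stats["waiting"] += 1
            cond.wait()                            # stand-in for epoll_wait()
            if stats["stop"]:
                return                             # shutdown wakeup, not counted
            stats["wakeups"] += 1
            if pending:                            # stand-in for accept(): one winner
                pending.pop()
                stats["accepts"] += 1

    def wait_until(predicate) -> None:
        while True:
            with cond:
                if predicate():
                    return
            time.sleep(0.001)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    wait_until(lambda: stats["waiting"] == workers)  # every worker is parked
    with cond:
        pending.append("connection")
        if wake_all:
            cond.notify_all()                      # herd: every waiter wakes
        else:
            cond.notify(1)                         # exclusive-style: one waiter wakes
    if not wake_all:
        wait_until(lambda: stats["accepts"] == 1)  # the single woken worker is done
        with cond:
            stats["stop"] = True
            cond.notify_all()                      # release the still-parked workers
    for t in threads:
        t.join()
    return stats["wakeups"], stats["accepts"]
```

With 64 workers, `wake_all=True` should count 64 wakeups for one accept, while `wake_all=False` should count a single wakeup on current CPython.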

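Solutions: EPOLLEXCLUSIVE (Linux 4.5 and later) asks the kernel to wake only one or a few waiters instead of all of them. SO_REUSEPORT lets each thread or process bind its own listen socket to the same port, giving each its own accept queue. A load balancer can also distribute connections before they reach the server. Modern kernels and libraries handle much of this, but misconfiguration still causes thundering herds.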
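A minimal sketch of the SO_REUSEPORT approach using Python's standard `socket` module, assuming a platform that defines the option (Linux 3.9 and later, for example). `reuseport_listeners` is a hypothetical helper, and the loopback address and backlog of 128 are illustrative choices.

```python
import socket


def reuseport_listeners(count: int, host: str = "127.0.0.1") -> list[socket.socket]:
    """Open `count` listening sockets on one port, each with its own accept queue."""
    if not hasattr(socket, "SO_REUSEPORT"):
        raise OSError("SO_REUSEPORT is not available on this platform")
    listeners = []
    port = 0                                    # 0 lets the kernel pick a port for the first socket
    for _ in range(count):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # The option must be set on every socket before bind().
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        sock.listen(128)
        port = sock.getsockname()[1]            # later sockets bind to the same port
        listeners.append(sock)
    return listeners
```

Each worker thread or process would then own one of these sockets and accept only from it, so a new connection wakes one worker. For workers that must share a single listen socket, Python exposes the epoll flag as `select.EPOLLEXCLUSIVE` on Linux.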

🎯 When To Use: Monitor event loop lag continuously. Implement backpressure before memory exhaustion. Use EPOLLEXCLUSIVE or SO_REUSEPORT for multi-threaded accept. Test under load to find these issues before production.
💡 Key Takeaways
Event loop stall: one blocking handler blocks all connections for its duration
Monitor event loop lag: should be microseconds; milliseconds indicate problems
Backpressure: pause reads and reject requests when buffers fill to prevent OOM
Thundering herd: many waiters wake for one event; EPOLLEXCLUSIVE or SO_REUSEPORT solve it
Graceful degradation: serve some requests well rather than all poorly
📌 Interview Tips
1. Explain why one slow DNS lookup can spike P99 for all connections: it blocks the entire event loop
2. When discussing memory growth, mention backpressure: pause reads when buffers fill, return 503 when overloaded
3. For multi-threaded servers, recommend EPOLLEXCLUSIVE or SO_REUSEPORT to avoid thundering herd on accept