OS & Systems Fundamentals › I/O Models (Blocking, Non-blocking, Async) · Hard · ⏱️ ~3 min

Event Loop Failure Modes: Stalls, Backpressure, and Thundering Herds

Event-driven, non-blocking architectures introduce specific failure modes that can cripple performance if not carefully managed.

Event loop stalls occur when CPU-intensive work, a synchronous blocking operation (such as a DNS lookup or traditional file I/O), or a long garbage collection pause runs on the event loop thread itself. A single 100 millisecond stall can spike p99.9 latency from 10 milliseconds to over 100 milliseconds, because thousands of ready sockets queue up waiting for the loop to resume.

Backpressure failures happen when unbounded concurrency overwhelms system resources. Creating an async task for every incoming message without limits can exhaust memory or overload downstream services. Symptoms include rising in-process queue depths, full kernel socket buffers, increased TCP retransmits, and cascading timeouts. One production incident saw queue depths grow from a typical 100 items to over 50,000 during a traffic spike, causing out-of-memory errors and service crashes.

Thundering herd problems manifest when many event handlers wake simultaneously for the same readiness event, or when level-triggered readiness mechanisms repeatedly wake handlers for partially drained sockets, creating hot loops that consume CPU without making progress. Edge-triggered readiness avoids repeated wakes but requires fully draining buffers until EAGAIN or EWOULDBLOCK; failing to drain completely starves future events, causing mysterious connection hangs under partial-read scenarios.

Mitigation requires discipline:
- Keep I/O handling on the event loop and offload CPU-intensive work to bounded worker pools.
- Monitor event loop lag (the delta between when a timer was scheduled and when it actually executed).
- Implement bounded queues with explicit backpressure propagation.
- Always read and write until EAGAIN to avoid edge-trigger starvation.
- Enforce per-operation timeouts with jittered exponential backoff to prevent retry storms.
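The first mitigation, keeping CPU work off the event loop thread, can be sketched in Python's asyncio (an illustrative choice; the source names no runtime). The `checksum` function and pool size here are hypothetical stand-ins:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical CPU-bound function that would stall the event loop if
# called directly from a coroutine.
def checksum(data: bytes) -> int:
    return sum(data) % 65521

async def handle_request(data: bytes, pool: ThreadPoolExecutor) -> int:
    loop = asyncio.get_running_loop()
    # Offload the CPU-intensive work to a bounded worker pool so the
    # event loop thread stays free to service ready sockets. For pure-
    # Python CPU work a ProcessPoolExecutor sidesteps the GIL; the
    # offloading pattern is identical.
    return await loop.run_in_executor(pool, checksum, data)

async def main() -> int:
    with ThreadPoolExecutor(max_workers=4) as pool:  # bounded, not unbounded
        return await handle_request(b"hello", pool)

if __name__ == "__main__":
    print(asyncio.run(main()))
```

The key property is that the pool is bounded: if all four workers are busy, new work queues rather than spawning unlimited threads.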
💡 Key Takeaways
Event loop stalls from CPU work or synchronous blocking operations (DNS, file I/O, garbage collection) spike p99.9 latency from typical 10 milliseconds to over 100 milliseconds as thousands of ready sockets queue up waiting for the loop to resume processing.
Backpressure failures occur with unbounded concurrency. One incident saw queue depths grow from 100 to over 50,000 items during traffic spikes, exhausting memory and causing service crashes. Mitigation requires bounded queues and explicit backpressure propagation.
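A bounded queue with backpressure propagation can be sketched with asyncio (illustrative; queue size and item counts are made up for the demo). When the queue is full, `put()` suspends the producer instead of letting the backlog grow without bound:

```python
import asyncio

async def producer(q: asyncio.Queue, n: int) -> None:
    for i in range(n):
        # put() suspends when the queue is full, propagating backpressure
        # upstream instead of accumulating an unbounded backlog.
        await q.put(i)
    await q.put(None)  # sentinel: no more items

async def consumer(q: asyncio.Queue) -> int:
    processed = 0
    while (item := await q.get()) is not None:
        processed += 1  # stand-in for real work on `item`
    return processed

async def main() -> int:
    q: asyncio.Queue = asyncio.Queue(maxsize=100)  # bounded, not unbounded
    prod = asyncio.create_task(producer(q, 1000))
    done = await consumer(q)
    await prod
    return done

print(asyncio.run(main()))
```

In a real ingestion path the producer would be a socket reader; suspending it leaves data in kernel buffers, which in turn slows the sender via TCP flow control.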
Thundering herd with level triggered readiness repeatedly wakes handlers for partially drained sockets, creating hot loops. Edge triggered readiness avoids this but requires fully draining buffers until EAGAIN; failing to do so starves future events.
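The drain-until-EAGAIN rule can be illustrated with a non-blocking Python socket, where EAGAIN surfaces as `BlockingIOError` (a sketch; a real server would get the fd from epoll rather than a socketpair):

```python
import socket

def drain(sock: socket.socket) -> bytes:
    """Read until the kernel buffer is empty (EAGAIN/EWOULDBLOCK).

    With edge-triggered readiness you get one wakeup per transition to
    readable; stopping early silently strands the remaining bytes until
    the peer happens to send more.
    """
    chunks = []
    while True:
        try:
            data = sock.recv(4096)
        except BlockingIOError:   # errno EAGAIN / EWOULDBLOCK
            break                 # buffer fully drained
        if not data:              # peer closed the connection
            break
        chunks.append(data)
    return b"".join(chunks)

# Demo with a non-blocking socketpair standing in for an epoll-managed fd.
a, b = socket.socketpair()
b.setblocking(False)
a.sendall(b"x" * 10000)
print(len(drain(b)))
```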
Partial I/O handling is critical. Non-blocking writes may accept only a subset of bytes; reads may split messages across boundaries. Incorrect assumptions about complete reads or writes corrupt protocols and cause subtle bugs.
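The partial-write case above can be handled with a resume loop; a minimal Python sketch (the demo socketpair is illustrative):

```python
import socket

def write_fully(sock: socket.socket, data: bytes) -> None:
    """send() may accept only part of the buffer; resume where it stopped."""
    view = memoryview(data)
    while view:
        n = sock.send(view)  # may write fewer bytes than requested
        view = view[n:]      # advance past the bytes the kernel accepted

a, b = socket.socketpair()
write_fully(a, b"hello world")
print(b.recv(1024))
```

On a non-blocking socket the same loop must additionally wait for writability before retrying; assuming `send()` took the whole buffer is exactly the protocol-corrupting mistake described above.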
Resource limits surface quickly at scale: file descriptor limits, ephemeral port exhaustion, per process memory caps. At 100,000 concurrent connections, even small per connection buffers (10 KB each) consume 1 GB of memory.
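The buffer arithmetic is worth making explicit; small per-connection costs multiply at scale:

```python
connections = 100_000
buffer_per_conn = 10 * 1024               # 10 KB of per-connection buffers
total = connections * buffer_per_conn
print(f"{total / 1e9:.2f} GB")            # roughly 1 GB just for buffers
```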
📌 Examples
Netflix API: Monitors event loop lag metrics closely. Any sustained lag over 50 milliseconds triggers alerts because it indicates CPU starvation or blocking operations leaking into the event loop, degrading tail latencies for streaming requests.
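A lag monitor of the kind described can be sketched in asyncio: arm a timer, then measure how late it actually fires. This is a generic pattern, not Netflix's implementation; the stall below is simulated with `time.sleep`:

```python
import asyncio
import time

async def monitor_loop_lag(lags: list, interval: float = 0.05) -> None:
    """Record event loop lag: the delta between when a timer was scheduled
    to fire and when it actually ran."""
    while True:
        start = time.monotonic()
        await asyncio.sleep(interval)
        # Positive lag means the loop was busy (CPU work, a blocking call,
        # a GC pause) and serviced the timer late. Alert on sustained lag.
        lags.append(time.monotonic() - start - interval)

async def main() -> float:
    lags: list = []
    task = asyncio.create_task(monitor_loop_lag(lags))
    await asyncio.sleep(0)  # let the monitor arm its first timer
    time.sleep(0.2)         # simulate a synchronous stall on the loop thread
    await asyncio.sleep(0.06)
    task.cancel()
    return max(lags)

print(f"worst observed lag: {asyncio.run(main()) * 1000:.0f} ms")
```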
Uber dispatch: Implements strict backpressure with bounded queues between location update ingestion and routing computation. When queues fill, new updates are sampled or dropped to prevent memory exhaustion during rider surge events.
Google services: Use exponential backoff with jitter for all retry logic. Without jitter, synchronized retries after a failure cause thundering herds that overwhelm recovering services, extending outage duration from seconds to minutes.
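Full-jitter exponential backoff, the scheme referenced above, can be sketched as follows (the base, cap, and attempt count are arbitrary illustration values):

```python
import random

def backoff_delays(base: float = 0.1, cap: float = 10.0, attempts: int = 6) -> list:
    """Full-jitter backoff: each retry waits a random amount in
    [0, min(cap, base * 2**attempt)], so clients that failed at the same
    moment do not all retry at the same moment."""
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(attempts)]

print(backoff_delays())
```

Without the `random.uniform` term, every client computes the identical delay schedule, and the synchronized retry wave re-creates the thundering herd on each attempt.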