I/O Models (Blocking, Non-blocking, Async)
Blocking vs Non-Blocking I/O: Memory and Threading Trade-offs
Blocking Input/Output (I/O) ties a thread to each connection or request. While the thread waits for data from the network or disk, the kernel puts it to sleep: it consumes no Central Processing Unit (CPU) cycles, but its memory stays reserved. Each thread typically reserves approximately 1 MB for its stack, so 100,000 concurrent connections consume roughly 100 GB of memory for thread stacks alone, making this approach impractical at scale. Context switching among thousands of threads also adds measurable scheduler overhead.
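To make the thread-per-connection model concrete, here is a minimal blocking echo server sketched in Python; the language, port, and buffer size are illustrative choices, not taken from the text above:

```python
import socket
import threading

def handle(conn: socket.socket) -> None:
    # recv() blocks this entire thread until the client sends data;
    # while blocked, the thread sleeps but its ~1 MB stack stays reserved.
    with conn:
        while data := conn.recv(4096):
            conn.sendall(data)

def serve(port: int = 9000) -> None:  # port is an illustrative assumption
    with socket.create_server(("0.0.0.0", port)) as srv:
        while True:
            conn, _addr = srv.accept()  # accept() also blocks
            # One OS thread per connection: 100k connections -> 100k stacks.
            threading.Thread(target=handle, args=(conn,), daemon=True).start()

if __name__ == "__main__":
    serve()
```

Every accepted connection costs a full OS thread, which is exactly the per-connection stack cost described above.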
Non-blocking I/O fundamentally changes this equation. A small, fixed number of threads (often one per CPU core) multiplexes thousands or hundreds of thousands of I/O operations. Instead of sleeping, the application registers interest in sockets and is notified when they become ready for reading or writing. A single event loop can track more than 100,000 file descriptors, with per-iteration cost that scales with the number of ready events rather than the total number of connections.
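For contrast, here is a sketch of the same echo service on a single-threaded event loop, using Python's standard selectors module (which wraps epoll or kqueue where available); the port is again an illustrative assumption:

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on BSD/macOS

def accept(srv: socket.socket) -> None:
    conn, _addr = srv.accept()
    conn.setblocking(False)          # never let a read/write park the thread
    sel.register(conn, selectors.EVENT_READ, echo)  # register interest

def echo(conn: socket.socket) -> None:
    data = conn.recv(4096)           # socket is ready, so this returns at once
    if data:
        conn.send(data)              # a real server would buffer partial writes
                                     # and register EVENT_WRITE until drained
    else:                            # peer closed the connection
        sel.unregister(conn)
        conn.close()

srv = socket.create_server(("0.0.0.0", 9000))  # port is an assumption
srv.setblocking(False)
sel.register(srv, selectors.EVENT_READ, accept)

while True:
    # One thread multiplexes every connection; work per iteration scales
    # with ready events, not with the total number of registered fds.
    for key, _mask in sel.select():
        key.data(key.fileobj)        # dispatch to accept() or echo()
```

Because the loop only touches sockets that are actually ready, one thread per core can service tens of thousands of mostly idle connections.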
The performance difference is dramatic in practice. Nginx routinely handles approximately 20,000 requests per second on commodity hardware using non-blocking event loops. Vert.x-based services report peaks exceeding 50,000 requests per second with just a handful of event-loop threads. One reported optimization converted blocking network calls to non-blocking and shrank a deployment from 75 service instances handling 260,000 requests per minute to just 5 instances handling 250,000 requests per minute, with burst capacity exceeding 400,000 requests per minute. That is roughly a 15x improvement in per-instance throughput, driven entirely by eliminating idle blocked threads.
💡 Key Takeaways
• Blocking I/O reserves approximately 1 MB of stack per thread. At 100,000 concurrent connections this means roughly 100 GB for thread stacks alone, creating memory pressure that makes scaling infeasible on single nodes.
• Non-blocking event loops use a small, fixed thread count (typically one per CPU core) to multiplex thousands of connections. With modern readiness mechanisms such as epoll or kqueue, checking 10,000 file descriptors takes approximately 0.66 ms, versus 900 to 990 ms with older approaches such as select or poll (see the timing sketch after this list).
• Production gains are substantial. One system went from 75 instances at 260,000 requests per minute to 5 instances at 250,000 requests per minute, roughly a 15x per-instance improvement achieved by eliminating idle blocked threads.
• Context-switching overhead compounds at scale. Scheduling thousands of threads on a limited number of cores means measurable CPU time goes to the kernel scheduler rather than to application work.
• Choose blocking I/O when concurrency is low to moderate (tens to hundreds of connections) and code simplicity matters; choose non-blocking when handling thousands of concurrent connections, or when requests are I/O-bound rather than CPU-bound.
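The readiness-mechanism gap can be felt with a rough micro-benchmark like the sketch below. It is illustrative only and does not reproduce the figures above: the socket count and pass count are arbitrary assumptions, and select() is capped near 1,024 descriptors on many platforms, so the sketch registers only 500 idle sockets.

```python
import selectors
import socket
import time

N = 500        # arbitrary; select() is limited to ~1024 fds on many systems
PASSES = 1000  # arbitrary number of polling passes to average over

def time_selector(sel_cls) -> float:
    sel = sel_cls()
    pairs = [socket.socketpair() for _ in range(N)]
    for a, _b in pairs:
        sel.register(a, selectors.EVENT_READ)  # idle: never becomes ready
    start = time.perf_counter()
    for _ in range(PASSES):
        sel.select(timeout=0)  # non-blocking readiness check
    elapsed = (time.perf_counter() - start) / PASSES
    for a, b in pairs:
        a.close()
        b.close()
    sel.close()
    return elapsed

# SelectSelector scans every registered fd on each call; DefaultSelector
# (epoll/kqueue where available) pays only for fds that are ready.
print(f"select:       {time_selector(selectors.SelectSelector) * 1e3:.3f} ms/pass")
print(f"epoll/kqueue: {time_selector(selectors.DefaultSelector) * 1e3:.3f} ms/pass")
```

Absolute numbers will vary by machine; the point is the scaling behavior, since the select pass grows with the total number of registered descriptors while the epoll/kqueue pass grows only with the number of ready ones.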
📌 Examples
Nginx web server: Handles approximately 20,000 requests per second on commodity hardware using non-blocking event loops in a small, fixed set of worker processes.
Netflix API Gateway: Uses non-blocking asynchronous I/O to handle massive fan-in from millions of streaming clients with minimal instance counts, avoiding the memory cost of thread-per-connection models.
Amazon services: Internal microservice meshes increasingly use asynchronous Remote Procedure Call (RPC) clients to avoid tying threads to slow network paths, which is especially important when services make multiple downstream calls per request.