
Failure Modes: Thread Explosion and IPC Backpressure

Thread Explosion

Thread explosion occurs when a system creates more threads than its memory and scheduler can sustain. Each thread reserves roughly 1-8MB of stack memory. A server that dedicates one thread per connection needs 10-80GB of stack alone at 10,000 connections, exhausting memory.

With 1000 threads on 8 cores, context switching overhead reaches 1-10% of CPU time. With 10,000 threads, switching dominates actual work. The fix: bounded thread pools with queues that reject work when full.
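As a minimal sketch of that fix, the bounded pool below uses a fixed-size queue and rejects new work instead of blocking once the queue fills. The class and its names are illustrative, not a standard-library API.

```python
import queue
import threading

class BoundedPool:
    """Illustrative bounded worker pool: fixed threads, fixed queue."""

    def __init__(self, workers=4, queue_size=100):
        self.tasks = queue.Queue(maxsize=queue_size)  # bounded: caps memory
        for _ in range(workers):                      # fixed thread count
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, args = self.tasks.get()
            try:
                fn(*args)
            finally:
                self.tasks.task_done()

    def submit(self, fn, *args):
        try:
            self.tasks.put_nowait((fn, args))  # reject instead of blocking
            return True
        except queue.Full:
            return False  # caller must shed load, retry, or push back
```

Returning `False` on a full queue is the backpressure signal: the caller decides whether to retry, degrade, or drop, rather than the pool silently growing without bound.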

IPC Backpressure Failures

IPC channels have limited capacity: a pipe typically holds 64KB, and socket buffers range from roughly 128KB to 1MB (exact sizes vary by OS and configuration). When producers outpace consumers, buffers fill.

Blocking IPC: a full buffer makes the producer wait, and the stall can cascade upstream. Non-blocking IPC: a full buffer returns an error, forcing a choice between retrying, buffering locally, or dropping the message.
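The non-blocking case can be demonstrated directly on a Unix pipe: once the kernel buffer fills, the write fails fast with `BlockingIOError` (EAGAIN) instead of stalling the producer. A minimal sketch:

```python
import os

# Create a pipe and make the write end non-blocking.
r, w = os.pipe()
os.set_blocking(w, False)

chunk = b"x" * 4096
written = 0
try:
    while True:
        written += os.write(w, chunk)  # fills the kernel pipe buffer
except BlockingIOError:
    # Buffer is full: the producer must now retry later,
    # buffer locally, or drop the message.
    print(f"backpressure after {written} bytes")
finally:
    os.close(r)
    os.close(w)
```

On a typical Linux box the error fires after about 64KB, matching the default pipe capacity mentioned above.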

Deadlocks and Memory Leaks

Circular IPC dependencies cause deadlocks: Process A waits for B while B waits for A. Fix with async messaging or strict call ordering.
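The "strict call ordering" fix has an in-process analogue with locks that is easy to sketch: if every thread acquires shared resources in the same global order, a circular wait (A holds a, wants b; B holds b, wants a) can never form. This is an illustration of the principle, not the IPC mechanism itself:

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def transfer(amount, log):
    # Every code path takes lock_a first, then lock_b.
    # With a single global order, no cycle of waits is possible.
    with lock_a:
        with lock_b:
            log.append(amount)

log = []
threads = [threading.Thread(target=transfer, args=(i, log)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

For processes communicating over IPC, the equivalent rule is a fixed call direction (e.g. A may call B, but B replies to A only asynchronously), which breaks the cycle the same way.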

Memory leaks in threads affect the entire process. A 1KB leak per request at 1000 RPS exhausts 1GB in 17 minutes. Process isolation lets you restart individual workers without affecting others.
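The leak arithmetic above is worth being able to reproduce on demand; a back-of-envelope check:

```python
# 1 KB leaked per request at 1000 requests/second against a 1 GiB budget.
LEAK_PER_REQUEST = 1024   # bytes
RPS = 1000                # requests per second
BUDGET = 1024 ** 3        # 1 GiB

seconds = BUDGET / (LEAK_PER_REQUEST * RPS)
print(round(seconds / 60, 1))  # ~17.5 minutes to exhaustion
```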

💡 Key Insight: Thread explosion and IPC backpressure both stem from unbounded resources. The fix: set limits, queue excess work, apply backpressure upstream.
💡 Key Takeaways
- Thread explosion: 10,000 threads with 8MB stacks consume 80GB of memory and cause severe context switch overhead
- Context switching 1000 threads on 8 cores loses 1-10% of CPU time; with 10,000 threads switching overhead dominates
- IPC buffers are small (64KB pipes, 128KB-1MB sockets); full buffers cause blocking cascades or data loss
- Circular IPC dependencies cause deadlocks; fix with async messaging or strict call direction ordering
- Memory leaks in threads affect the entire process; process isolation allows individual worker restarts
📌 Interview Tips
1. When debugging slow servers, check thread count: over 100 threads per core indicates potential thread explosion
2. For producer-consumer systems, size queues based on acceptable latency: a 1000-item queue drained at 100 items per second adds up to 10 seconds of latency
3. Recommend worker recycling for long-running services: restart workers every N requests or M megabytes to contain memory leaks
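The queue-sizing rule in tip 2 is just capacity divided by drain rate; a small helper (illustrative, not a library function) makes the relationship explicit:

```python
def max_queue_latency(capacity, drain_rate_per_sec):
    """Worst-case seconds a just-enqueued item waits behind a full queue."""
    return capacity / drain_rate_per_sec

# 1000-item queue drained at 100 items/second -> 10 seconds of added latency.
print(max_queue_latency(1000, 100))
```

Working backwards, an acceptable latency budget and a known drain rate give you the queue size directly: capacity = latency x drain rate.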