OS & Systems Fundamentals • Processes vs Threads
Failure Modes: Thread Explosion and IPC Backpressure
Thread explosion occurs when thread counts far exceed core counts, triggering scheduler pathologies. With default 1 MB stacks, 10,000 threads reserve roughly 10 GB of virtual address space. Even if that memory is never fully committed, the kernel must still track every stack and its guard page. More critically, oversubscribing CPU-bound threads by 10x causes context-switch storms: observed context switches per second can jump into the millions, and the OS spends more time switching than doing useful work. This inflates 99th-percentile latency by 5 to 10x and introduces jitter that violates Service Level Objectives (SLOs).
False sharing is a subtle threading bug that destroys performance. When two hot fields sit on the same 64-byte cache line but are updated by different threads, the line bounces between CPU cores: each update invalidates the copy in other cores' caches, generating expensive cache-coherency traffic. This can slow tight loops by 10 to 100x. Production symptoms include mysteriously high CPU usage and tail latencies that spike under load. The fix is to pad hot structures to cache-line boundaries and separate read-mostly data from write-hot data.
IPC backpressure collapse happens when producers flood consumers faster than they can process messages and there is no flow control. Unbounded queues grow until the receiving process runs out of memory and crashes. PostgreSQL's process-per-connection model can hit this: if application code opens 5,000 connections simultaneously without a connection pooler, the database spawns 5,000 backend processes consuming 25 to 50 GB of RAM. The system thrashes, queries slow to a crawl, and eventually the Out-Of-Memory (OOM) killer terminates the database.
💡 Key Takeaways
• Thread explosion with 10x oversubscription causes context-switch storms reaching millions of switches per second. The kernel spends more cycles switching than executing application code, inflating tail latencies by 5 to 10x.
• False sharing causes 10 to 100x slowdowns when hot fields on the same 64-byte cache line are updated by different threads. Cache-line bouncing creates expensive coherency traffic across CPU cores.
• IPC backpressure collapse occurs without flow control: producers flood unbounded queues until consumers run out of memory. PostgreSQL spawning 5,000 backend processes simultaneously can consume 40+ GB of RAM and trigger the OOM killer.
• fork() in a multi-threaded process is dangerous: only the calling thread is duplicated, but locks held by other parent threads are copied in the locked state. The child process can deadlock immediately when it tries to acquire them.
• NUMA memory-access penalties double tail latencies when threads running on one NUMA node access memory allocated on another. A 100-nanosecond local access becomes 200 nanoseconds remote, and the cost cascades through the call stack.
📌 Examples
Production incident: a service scaled to 5,000 threads on a 32-core machine. Context switches jumped from 50K/sec to 3M/sec, and p99 latency rose from 8 ms to 80 ms. Fix: cap the thread pool at 128 threads.
False sharing bug: two counters on the same cache line were updated by different threads. CPU usage sat at 80% but throughput was only 50K ops/sec. After padding the counters onto separate cache lines: 20% CPU and 500K ops/sec.
PostgreSQL without a pooler: the app spawned 3,000 connections during a traffic spike, and the database spawned 3,000 processes × 8 MB = 24 GB. The system thrashed and the OOM killer terminated postgres. Fix: PgBouncer caps backends at 200 connections.
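The PgBouncer fix above can be sketched as a minimal pgbouncer.ini. The database name, hosts, and ports are illustrative assumptions; only the 200-backend cap and the ~3,000 client connections come from the example. A real deployment also needs authentication settings omitted here.

```ini
; Minimal PgBouncer sketch -- appdb, addresses, and ports are assumed
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
; release the server connection back to the pool at transaction end
pool_mode = transaction
; the app may open thousands of cheap client sockets to PgBouncer...
max_client_conn = 3000
; ...but only this many real backend processes are created in Postgres
default_pool_size = 200
```

With transaction pooling, 3,000 application connections multiplex over 200 backends, keeping PostgreSQL's per-process memory bounded during spikes.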