OS & Systems Fundamentals • Processes vs Threads
Failure Modes: Thread Explosion and IPC Backpressure
Thread explosion occurs when thread counts far exceed core counts, triggering scheduler pathologies. With default 1 MB stacks, 10,000 threads reserve roughly 10 GB of virtual address space. Even if that memory is never fully committed, the kernel must still track every stack and its guard page. More critically, oversubscribing CPU-bound threads by 10x causes context-switch storms: observed context switches per second can jump into the millions, and the OS spends more time switching than doing useful work. This inflates 99th-percentile latency by 5 to 10x and introduces jitter that violates Service Level Objectives (SLOs).
False sharing is a subtle threading bug that destroys performance. When two hot fields sit on the same 64-byte cache line but are updated by different threads, the line bounces between CPU cores: each update invalidates the copy in other cores' caches, generating expensive cache-coherency traffic. This can slow tight loops by 10 to 100x. Production symptoms include mysteriously high CPU usage and tail latencies that spike under load. The fix is to pad hot structures to cache-line boundaries and separate read-mostly data from write-hot data.
IPC backpressure collapse happens when producers flood consumers faster than they can process messages and there is no flow control. Unbounded queues grow until the receiving process runs out of memory and crashes. PostgreSQL's process-per-connection model can hit this: if application code opens 5,000 connections simultaneously without a connection pooler, the database spawns 5,000 backend processes consuming 25 to 50 GB of RAM. The system thrashes, queries slow to a crawl, and eventually the Out-Of-Memory (OOM) killer terminates the database.
💡 Key Takeaways
• Thread explosion with 10x oversubscription causes context-switch storms reaching millions of switches per second. The kernel spends more cycles switching than executing application code, inflating tail latencies by 5 to 10x.
• False sharing causes 10 to 100x slowdowns when hot fields on the same 64-byte cache line are updated by different threads. Cache-line bouncing creates expensive coherency traffic across CPU cores.
• IPC backpressure collapse occurs without flow control: producers flood unbounded queues until consumers run out of memory. PostgreSQL spawning 5,000 backend processes simultaneously can consume 40+ GB of RAM and trigger the OOM killer.
• fork() in a multi-threaded process is dangerous: only the calling thread is duplicated, but locks held by other parent threads are copied in the locked state. The child process can deadlock immediately when it tries to acquire them.
• NUMA memory-access penalties double tail latencies when threads running on one NUMA node access memory allocated on another. A 100-nanosecond local access becomes 200 nanoseconds remote, and the cost cascades through the call stack.
📌 Examples
Production incident: a service scaled to 5,000 threads on a 32-core machine. Context switches jumped from 50K/sec to 3M/sec, and p99 latency rose from 8 ms to 80 ms. Fix: cap the thread pool at 128 threads.
False sharing bug: two counters on the same cache line were updated by different threads. CPU usage sat at 80% but throughput was only 50K ops/sec. After padding the counters onto separate cache lines: 20% CPU and 500K ops/sec.
PostgreSQL without a pooler: the app spawned 3,000 connections during a traffic spike, and the database spawned 3,000 processes × 8 MB = 24 GB. The system thrashed and the OOM killer terminated postgres. Fix: PgBouncer caps backends at 200 connections.
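The PgBouncer fix above can be sketched as a minimal pgbouncer.ini. The database name, hosts, and ports are illustrative assumptions; only the 200-backend cap and the ~3,000 client connections come from the example. A real deployment also needs authentication settings omitted here.

```ini
; Minimal PgBouncer sketch -- appdb, addresses, and ports are assumed
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
; release the server connection back to the pool at transaction end
pool_mode = transaction
; the app may open thousands of cheap client sockets to PgBouncer...
max_client_conn = 3000
; ...but only this many real backend processes are created in Postgres
default_pool_size = 200
```

With transaction pooling, 3,000 application connections multiplex over 200 backends, keeping PostgreSQL's per-process memory bounded during spikes.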