
Resource Budgeting and Failure Modes at Scale

Production systems fail in predictable ways when concurrency or parallelism exceeds safe limits. Too many OS threads exhaust the address space: at 0.5 to 2 MB of stack reservation per thread, 100,000 threads consume 50 to 200 GB before any heap usage. Context-switching overhead becomes measurable: at 1 microsecond per switch and 100,000 switches per second, the scheduler alone burns 10% of a core. Event loops block catastrophically if any handler performs synchronous I/O, stalling thousands of connections on a single slow disk read or database query.

Cache stampedes occur when a popular cache entry expires and thousands of requests simultaneously hit the backend database. A typical scenario: a cache key serving 10,000 Requests Per Second (RPS) expires, every request misses, and the database receives a sudden 10,000-query spike instead of the normal trickle of cache-refresh traffic. Database p99 latency jumps from 5 milliseconds to 500 milliseconds, and the stampede continues until the cache repopulates. The fix is stale-while-revalidate: serve slightly stale data while exactly one request refreshes the cache in the background.

Hot partitions demonstrate Amdahl's Law in practice. When a celebrity user generates 80% of traffic, that single shard becomes the serial bottleneck. If normal shards handle 5,000 RPS at 10 millisecond p99 latency, the hot shard might degrade to 2,000 RPS at 500 millisecond p99, capping overall system throughput and creating a terrible user experience. Mitigation requires application-level techniques such as caching celebrity data separately, splitting hot keys across sub-shards, or rate limiting individual users. Pure infrastructure scaling cannot solve a fundamental data-distribution problem.
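The stale-while-revalidate pattern can be sketched in a few lines. The sketch below assumes an in-process cache guarded by a lock; the class and parameter names (StaleWhileRevalidateCache, ttl_seconds, stale_window_seconds) are illustrative, not any particular library's API.

```python
import threading
import time

class StaleWhileRevalidateCache:
    """Minimal sketch: serve stale data while exactly one background refresh runs."""

    def __init__(self, loader, ttl_seconds=60, stale_window_seconds=120):
        self._loader = loader                       # expensive backend fetch, e.g. a DB query
        self._ttl = ttl_seconds                     # freshness lifetime of an entry
        self._stale_window = stale_window_seconds   # how long stale data may still be served
        self._entries = {}                          # key -> (value, fetched_at)
        self._refreshing = set()                    # keys with an in-flight background refresh
        self._lock = threading.Lock()

    def get(self, key):
        now = time.monotonic()
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                value, fetched_at = entry
                age = now - fetched_at
                if age <= self._ttl:
                    return value                    # fresh: serve directly
                if age <= self._ttl + self._stale_window:
                    # Stale but usable: exactly one caller triggers the refresh,
                    # everyone else keeps getting the stale value, so the backend
                    # sees one query instead of a 10,000-query spike.
                    if key not in self._refreshing:
                        self._refreshing.add(key)
                        threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
                    return value
        # Cold miss, or entry past the stale window: load synchronously.
        # (A production version would also deduplicate concurrent cold misses.)
        return self._refresh(key)

    def _refresh(self, key):
        try:
            value = self._loader(key)               # backend call happens without the lock held
            with self._lock:
                self._entries[key] = (value, time.monotonic())
            return value
        finally:
            with self._lock:
                self._refreshing.discard(key)
```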
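Splitting a hot key across sub-shards can be sketched similarly. The example below assumes a counter-like or append-heavy key whose writes can be scattered across sub-keys and whose reads can be gathered and merged; HOT_KEYS, SUB_SHARDS, and the helper functions are hypothetical names, not a real framework.

```python
import hashlib
import random

# Keys identified as hot from metrics; in practice this set would be driven by
# monitoring or configuration rather than hard-coded.
HOT_KEYS = {"celebrity:12345:timeline"}
SUB_SHARDS = 16

def shard_for(key: str, num_shards: int) -> int:
    """Default partitioning: stable hash of the logical key."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

def write_key(key: str) -> str:
    """Writes to a hot key are scattered over SUB_SHARDS physical keys,
    so no single partition absorbs the full write load."""
    if key in HOT_KEYS:
        return f"{key}#{random.randrange(SUB_SHARDS)}"
    return key

def read_keys(key: str) -> list[str]:
    """Reads of a hot key must gather every sub-key and merge the results,
    trading a wider fan-out read for an even load distribution."""
    if key in HOT_KEYS:
        return [f"{key}#{i}" for i in range(SUB_SHARDS)]
    return [key]
```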
💡 Key Takeaways
Thread overhead at scale: 100,000 OS threads reserve 50 to 200 GB address space at 0.5 to 2 MB per stack. Context switching at 1 microsecond per switch and 100,000 switches per second consumes 10% of a core before any application work happens.
Cache stampedes occur when popular entries expire. A key serving 10,000 Requests Per Second (RPS) expires, all requests miss, database receives 10,000 query spike, and p99 latency jumps from 5 milliseconds to 500 milliseconds. Stale-while-revalidate serves slightly old data while one request refreshes asynchronously.
Hot partitions violate uniform load assumptions. When a celebrity user generates 80% of traffic to one shard, that shard becomes the serial bottleneck. Normal shards handle 5,000 RPS at 10 millisecond p99, hot shard degrades to 2,000 RPS at 500 millisecond p99, limiting overall throughput.
Split brain in distributed primaries: a network partition causes two nodes to both accept writes, producing conflicting data on reconciliation. Thundering herd: 1,000 servers restart simultaneously after a deploy and overwhelm downstream dependencies within 60 seconds.
Read replica lag creates user-visible inconsistency. Async replication falls 10 seconds behind during high write load, so users see outdated data immediately after their own writes. Mitigations: synchronous replication (doubles write latency), sticky sessions (route each user to the same replica), or causal consistency tokens, as sketched after this list.
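A minimal sketch of the causal consistency token approach, assuming each replica can report the last log position (LSN) it has applied; the Primary, Replica, and causal_read names are illustrative, not any specific database's API.

```python
import random
import time

class Primary:
    def __init__(self):
        self.lsn = 0
        self.data = {}

    def write(self, key, value):
        self.lsn += 1
        self.data[key] = value
        return self.lsn                     # the client keeps this LSN as its causal token

    def read(self, key):
        return self.data.get(key)

class Replica:
    def __init__(self):
        self.applied_lsn = 0                # advanced by an async replication stream (not shown)
        self.data = {}

    def read(self, key):
        return self.data.get(key)

def causal_read(key, token, replicas, primary, max_wait=0.2):
    """Serve the read from any replica that has caught up to the client's token;
    otherwise fall back to the primary instead of returning stale data."""
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        replica = random.choice(replicas)
        if replica.applied_lsn >= token:
            return replica.read(key)        # replica has already applied the client's own write
        time.sleep(0.01)                    # give replication a moment to catch up
    return primary.read(key)
```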
📌 Examples
A production cache stampede on a social media feed endpoint serving 50,000 RPS caused database connection pool exhaustion and 30 second p99 latency spikes. Implementing stale-while-revalidate with a 120 second stale window eliminated the issue.
During a viral tweet, Twitter's hot partition problem became visible: the single shard handling that user's timeline degraded from 10 millisecond to 800 millisecond latency. Mitigation required caching the celebrity timeline separately and rate limiting follower fan-out writes.