Combined Strategies: Production Fan-Out Patterns
Real production systems combine concurrency and parallelism with strict isolation and tail-tolerant patterns. Consider a typical microservice request flow: the gateway handles 100,000 concurrent connections using event-driven I/O and a small thread pool (concurrency), each request fans out 20 to 50 parallel Remote Procedure Calls (RPCs) to backend shards (parallelism), and each backend itself manages thousands of concurrent in-flight operations while spreading work across available cores.
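The fan-out step can be sketched in Go. The snippet below is illustrative only: queryShard is a hypothetical stand-in for a real backend RPC, and the shard count and in-flight cap of 16 are assumed values. It uses the golang.org/x/sync/errgroup package (an external module) so shard calls run in parallel while the number in flight stays bounded and the request deadline propagates through the context.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

type Result struct {
	Shard int
	Data  string
}

// queryShard stands in for a real RPC to one backend shard.
func queryShard(ctx context.Context, shard int) (Result, error) {
	select {
	case <-time.After(5 * time.Millisecond): // simulated backend latency
		return Result{Shard: shard, Data: fmt.Sprintf("rows-from-shard-%d", shard)}, nil
	case <-ctx.Done():
		return Result{}, ctx.Err()
	}
}

// fanOut issues one RPC per shard, but never more than `limit` at once.
func fanOut(ctx context.Context, shards, limit int) ([]Result, error) {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(limit) // bound in-flight RPCs so one request cannot flood backends

	results := make([]Result, shards)
	for i := 0; i < shards; i++ {
		i := i // capture the loop variable (needed before Go 1.22)
		g.Go(func() error {
			r, err := queryShard(ctx, i)
			if err != nil {
				return err // the first error cancels the shared context
			}
			results[i] = r
			return nil
		})
	}
	return results, g.Wait()
}

func main() {
	// Whole-request deadline: every shard call sees the remaining budget via ctx.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	res, err := fanOut(ctx, 50, 16)
	fmt.Println(len(res), err)
}
```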
The key is layered control. At the client, cap total parallelism per dependency using semaphores to prevent overwhelming any single backend. Propagate deadlines end to end so backends know their remaining budget and can cancel work that will arrive too late. Use hedged requests selectively: after waiting the p95 latency, send a duplicate to another replica and take the first response. This cuts tail latency without doubling load, since only the slow minority triggers hedges. Apply circuit breakers to stop request flow when failure rates exceed thresholds, allowing degraded services to recover instead of being hammered by retries.
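A minimal hedging sketch under assumed names: callReplica is a hypothetical idempotent read, and the 10 ms hedge delay stands in for a measured p95. The primary request is sent immediately; a duplicate goes to a second replica only if the budget elapses (or the primary fails first), and the first successful reply wins while the shared context cancels the loser.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// callReplica stands in for an idempotent read against one replica.
func callReplica(ctx context.Context, replica int) (string, error) {
	latency := time.Duration(rand.Intn(30)) * time.Millisecond // simulated variance
	select {
	case <-time.After(latency):
		return fmt.Sprintf("reply-from-replica-%d", replica), nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// hedgedCall sends the primary request, hedges to a second replica once the
// p95 budget elapses, and returns the first successful reply.
func hedgedCall(ctx context.Context, hedgeAfter time.Duration) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancel whichever attempt is still in flight once we return

	type outcome struct {
		resp string
		err  error
	}
	ch := make(chan outcome, 2) // buffered so the losing goroutine never blocks

	send := func(replica int) {
		resp, err := callReplica(ctx, replica)
		ch <- outcome{resp, err}
	}

	go send(0) // primary request goes out immediately
	sent, received := 1, 0

	hedge := time.NewTimer(hedgeAfter)
	defer hedge.Stop()

	for {
		select {
		case <-hedge.C:
			if sent == 1 {
				go send(1) // p95 budget elapsed: hedge to a second replica
				sent = 2
			}
		case out := <-ch:
			received++
			if out.err == nil {
				return out.resp, nil // first successful reply wins
			}
			if sent == 1 {
				go send(1) // primary failed before the hedge fired: try the replica now
				sent = 2
			} else if received == sent {
				return "", out.err // both attempts failed
			}
		case <-ctx.Done():
			return "", ctx.Err()
		}
	}
}

func main() {
	resp, err := hedgedCall(context.Background(), 10*time.Millisecond)
	fmt.Println(resp, err)
}
```

Because the duplicate shares the parent context and is cancelled as soon as a winner is chosen, the extra load stays limited to the slow minority of requests.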
Uber's architecture exemplifies this. Latency-sensitive microservices cap parallelism per handler and per downstream, enforce deadlines at the RPC library layer, and use early cancel on quorum or first success for geo-sharded fan-outs. Netflix combines bulkhead isolation for different downstream dependencies, rate limiting to shed load gracefully under pressure, and async batch processing to parallelize CPU-heavy tasks without blocking latency-critical paths. Google's approach adds hedged requests and partial result strategies: if complete fan-in exceeds the client deadline, return what is available rather than failing entirely. All these patterns share a common theme: explicit limits and fast failure paths prevent cascading disasters.
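The partial-result idea can be sketched as deadline-bounded fan-in. Again this is an assumption-laden sketch rather than any specific company's API: queryShard is hypothetical, and the caller simply serves whatever replies have arrived when the context deadline fires, reporting how many shards were missing.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// queryShard stands in for an RPC to one backend shard.
func queryShard(ctx context.Context, shard int) (string, error) {
	latency := time.Duration(rand.Intn(40)) * time.Millisecond // simulated variance
	select {
	case <-time.After(latency):
		return fmt.Sprintf("rows-from-shard-%d", shard), nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// gatherPartial fans out to every shard and returns whatever replies arrive
// before ctx's deadline, plus a count of shards that did not make it.
func gatherPartial(ctx context.Context, shards int) (results []string, missing int) {
	ch := make(chan string, shards) // buffered so late goroutines never block
	for i := 0; i < shards; i++ {
		go func(shard int) {
			if r, err := queryShard(ctx, shard); err == nil {
				ch <- r
			}
		}(i)
	}

	for received := 0; received < shards; received++ {
		select {
		case r := <-ch:
			results = append(results, r)
		case <-ctx.Done():
			return results, shards - received // deadline hit: serve what we have
		}
	}
	return results, 0 // everything arrived in time
}

func main() {
	// A 25 ms budget for a 100-shard fan-out: expect a partial result, not a failure.
	ctx, cancel := context.WithTimeout(context.Background(), 25*time.Millisecond)
	defer cancel()

	got, missing := gatherPartial(ctx, 100)
	fmt.Printf("served %d shards, %d missing\n", len(got), missing)
}
```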
💡 Key Takeaways
• Layered control is essential. Cap total parallelism per dependency at the client (bulkhead isolation), propagate deadlines end to end so backends know their remaining budget, and apply circuit breakers to stop traffic when failure rates exceed a threshold such as 50% over 10 seconds (see the circuit-breaker sketch after this list).
• Hedged requests fire after the p95 latency (for example, 10 milliseconds). Only the slow minority triggers duplicates, adding less than 5% extra load while bringing p99 latency back toward p50. Hedging requires idempotency and duplicate suppression to avoid side effects.
• Uber caps parallelism per handler and per downstream, enforces deadlines in its RPC libraries, and uses early cancel on quorum reads to cut tail impact when fanning out to geo-sharded services across multiple regions.
• Netflix combines bulkhead isolation (separate concurrency pools per dependency), rate limiting for graceful load shedding, and async batch processing to parallelize CPU-heavy tasks without blocking user-facing latency paths.
• Google returns partial results when complete fan-in would exceed the client deadline. If 95 of 100 shards respond within budget, serve that rather than failing entirely. This maintains availability under degradation at the cost of slightly reduced completeness.
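As referenced in the first takeaway, a minimal circuit-breaker sketch in that threshold style might look like the following. The rolling 10-second window, 50% failure threshold, 20-call minimum sample, and 30-second cool-down are illustrative values, and the half-open probing state that production breakers use is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var ErrCircuitOpen = errors.New("circuit open: request rejected")

type Breaker struct {
	mu          sync.Mutex
	windowStart time.Time
	total       int
	failures    int
	openUntil   time.Time

	window    time.Duration // how long one counting window lasts
	threshold float64       // failure rate that trips the breaker
	coolDown  time.Duration // how long to reject traffic once tripped
}

func NewBreaker() *Breaker {
	return &Breaker{window: 10 * time.Second, threshold: 0.5, coolDown: 30 * time.Second}
}

// Call runs fn unless the breaker is open, and records the outcome.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	now := time.Now()
	if now.Before(b.openUntil) {
		b.mu.Unlock()
		return ErrCircuitOpen // fail fast instead of hammering a sick backend
	}
	if now.Sub(b.windowStart) > b.window {
		b.windowStart, b.total, b.failures = now, 0, 0 // start a new counting window
	}
	b.mu.Unlock()

	err := fn()

	b.mu.Lock()
	defer b.mu.Unlock()
	b.total++
	if err != nil {
		b.failures++
	}
	// Trip once enough calls have been seen and the failure rate crosses the threshold.
	if b.total >= 20 && float64(b.failures)/float64(b.total) > b.threshold {
		b.openUntil = time.Now().Add(b.coolDown)
	}
	return err
}

func main() {
	b := NewBreaker()
	for i := 0; i < 25; i++ {
		_ = b.Call(func() error { return errors.New("backend timeout") })
	}
	fmt.Println(b.Call(func() error { return nil })) // rejected while the breaker is open
}
```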
📌 Examples
A microservice gateway handling 100,000 concurrent connections fans out each request to 50 backend shards in parallel. Each backend manages 5,000 concurrent operations across 32 cores, combining all three layers: gateway concurrency, request parallelism, and backend concurrency plus parallelism.
During a Netflix backend slowdown, circuit breakers prevented a cascading failure by stopping requests after detecting elevated error rates. The impacted service recovered within 30 seconds instead of being overwhelmed by continuous retry storms.