
Parallelism Limits: Amdahl's Law and Coordination Costs

Parallelism does not scale infinitely. Amdahl's Law quantifies the theoretical speedup limit: if S is the serial fraction of the work, the maximum speedup on N processors is 1 / (S + (1 - S)/N). With just 5% serial work, speedup caps at 20× even with infinite processors; with 10% serial work, the limit drops to 10×. This explains why embarrassingly parallel workloads like video encoding achieve 50 to 60× speedup on 64 cores (a serial fraction of only a few tenths of a percent), while many real systems plateau far earlier.

Coordination costs erode theoretical gains further. Lock contention serializes execution: a hot lock limits throughput to single-threaded performance regardless of core count. Cache coherency protocols add latency as cores synchronize shared cache lines. Non-Uniform Memory Access (NUMA) architectures penalize cross-socket memory access by 1.5 to 2× versus local memory. False sharing occurs when independent variables share a cache line (typically 64 bytes), causing unnecessary coherency traffic that can collapse parallel speedup.

Production systems hit these limits constantly. Google's parallel MapReduce-style jobs carefully partition data to minimize cross-partition synchronization. Meta's cache systems pin workers and allocate memory local to each socket to avoid NUMA penalties. When adding cores stops improving throughput, or even degrades tail latency due to contention, the answer is better partitioning and lock-free data structures, not more parallelism.
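The arithmetic is easy to check. Here is a minimal Go sketch (the `amdahl` function name and the sample serial fractions are illustrative, not from the source) that evaluates 1 / (S + (1 - S)/N) for a 64-core machine and for the infinite-processor limit:

```go
package main

import "fmt"

// amdahl returns the theoretical speedup for a workload whose
// serial fraction is s (between 0 and 1) when run on n processors.
func amdahl(s, n float64) float64 {
	return 1 / (s + (1-s)/n)
}

func main() {
	for _, s := range []float64{0.005, 0.05, 0.10} {
		// 1/s is the limit as n goes to infinity.
		fmt.Printf("serial %4.1f%%: %5.1fx on 64 cores, %5.1fx limit\n",
			s*100, amdahl(s, 64), 1/s)
	}
}

// Output:
// serial  0.5%:  48.7x on 64 cores, 200.0x limit
// serial  5.0%:  15.4x on 64 cores,  20.0x limit
// serial 10.0%:   8.8x on 64 cores,  10.0x limit
```

Note how far the 64-core column falls below the infinite-processor ceiling: reaching 50× on 64 cores already requires the serial fraction to be around half a percent.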
💡 Key Takeaways
Amdahl's Law: with 5% serial work, maximum speedup caps at 20× regardless of processor count. With 10% serial, the limit drops to 10×. Every synchronization point or global coordination adds to the serial fraction.
Video encoding achieves 50 to 60× speedup on 64 cores because the serial fraction is only a few tenths of a percent. Most business logic workloads have 10 to 25% serial work from database transactions, aggregations, and coordination, limiting practical speedup to 4 to 10×.
Cache coherency and Non-Uniform Memory Access (NUMA) add hidden costs. Cross-socket memory access increases latency by 1.5 to 2× on modern servers. False sharing on 64-byte cache lines causes unnecessary coherency traffic that can halve parallel throughput (a padded-struct sketch follows this list).
Lock contention serializes execution. A hot lock accessed by all threads limits the system to single-threaded performance. Profiling often reveals 1 to 2 critical locks responsible for most contention; sharding or lock-free structures can restore scalability (a sharded-lock sketch follows this list).
Bounding parallelism near core count avoids oversubscription. On a 64-core machine, thread pools sized at 64 to 96 prevent run-queue contention. Beyond that, context switching and cache thrash commonly worsen p99 latency without throughput gains (a bounded worker-pool sketch follows this list).
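To make the false-sharing point concrete, here is a minimal Go sketch (the struct names, the 20-million iteration count, and the assumption of 64-byte cache lines are illustrative): two goroutines each increment their own independent counter, but when both counters sit on one cache line, every write invalidates the other core's copy.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// packed: the two 8-byte counters are adjacent, so they almost
// certainly share one 64-byte cache line and ping-pong between cores.
type packed struct {
	a, b atomic.Int64
}

// padded: 56 bytes of padding push b onto its own cache line,
// eliminating false sharing between a and b.
type padded struct {
	a atomic.Int64
	_ [56]byte
	b atomic.Int64
}

// run has two goroutines hammer the two counters and reports elapsed time.
func run(label string, incA, incB func()) {
	const iters = 20_000_000
	start := time.Now()
	var wg sync.WaitGroup
	wg.Add(2)
	go func() {
		defer wg.Done()
		for i := 0; i < iters; i++ {
			incA()
		}
	}()
	go func() {
		defer wg.Done()
		for i := 0; i < iters; i++ {
			incB()
		}
	}()
	wg.Wait()
	fmt.Printf("%s: %v\n", label, time.Since(start))
}

func main() {
	var p packed
	var q padded
	run("same cache line", func() { p.a.Add(1) }, func() { p.b.Add(1) })
	run("padded lines   ", func() { q.a.Add(1) }, func() { q.b.Add(1) })
}
```

On typical hardware the padded version runs noticeably faster; the exact ratio is machine-dependent, which is why false sharing is usually found with profilers rather than predicted up front.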
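Likewise, a minimal sketch of the lock-sharding idea mentioned above (the 16-shard count, the `shardedCounter` type, and the FNV hash choice are assumptions, not from the source): hash each key to one of several independently locked shards so unrelated keys stop contending on a single global mutex.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

const numShards = 16

// shard pairs a mutex with the portion of the map it guards.
type shard struct {
	mu sync.Mutex
	m  map[string]int
}

// shardedCounter replaces one hot global lock with 16 independent ones.
type shardedCounter struct {
	shards [numShards]shard
}

func newShardedCounter() *shardedCounter {
	c := &shardedCounter{}
	for i := range c.shards {
		c.shards[i].m = make(map[string]int)
	}
	return c
}

// shardFor hashes the key so the same key always lands on the same shard.
func (c *shardedCounter) shardFor(key string) *shard {
	h := fnv.New32a()
	h.Write([]byte(key))
	return &c.shards[h.Sum32()%numShards]
}

func (c *shardedCounter) Inc(key string) {
	s := c.shardFor(key)
	s.mu.Lock()
	s.m[key]++
	s.mu.Unlock()
}

func (c *shardedCounter) Get(key string) int {
	s := c.shardFor(key)
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.m[key]
}

func main() {
	c := newShardedCounter()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			key := fmt.Sprintf("worker-%d", id)
			for n := 0; n < 1000; n++ {
				c.Inc(key)
			}
		}(i)
	}
	wg.Wait()
	fmt.Println(c.Get("worker-0")) // 1000
}
```

A hot key still serializes its own shard, which mirrors the skew problem in the Google example below.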
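Finally, a minimal sketch of bounding parallelism near core count (the semaphore-channel pattern is one common Go idiom; the task body is a placeholder): a buffered channel caps in-flight goroutines at `runtime.NumCPU()`, so extra work queues instead of oversubscribing the run queue.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// A buffered channel acts as a counting semaphore: at most
	// `limit` tasks run concurrently, regardless of how many are submitted.
	limit := runtime.NumCPU()
	sem := make(chan struct{}, limit)

	var wg sync.WaitGroup
	for task := 0; task < 1000; task++ {
		wg.Add(1)
		sem <- struct{}{} // blocks once `limit` tasks are in flight
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done
			_ = id * id              // placeholder for real CPU-bound work
		}(task)
	}
	wg.Wait()
	fmt.Println("done with at most", limit, "tasks in flight")
}
```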
📌 Examples
Google's large-scale data processing pipelines partition by key ranges to minimize cross-partition synchronization. When a hot key creates skew, that partition becomes the serial bottleneck limiting overall job completion time.
Meta's clustered cache systems pin memory and workers to NUMA nodes. This avoids the 1.5 to 2× latency penalty of cross-socket access, which compounds when serving multi-million Queries Per Second (QPS) with sub-millisecond targets.