Scaling Failures: Memory Fragmentation, Stragglers, and Graceful Degradation
Memory Fragmentation
GPU memory fragments over time as different-sized tensors are allocated and freed. After hours of serving variable-length requests, you might have 8 GB of free memory yet be unable to allocate a contiguous 4 GB tensor, because that free memory is scattered across small chunks. The symptom: out-of-memory errors even though memory monitoring shows available capacity.
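One way to catch this early is to track how much memory the allocator holds without actually backing live tensors. Here is a minimal sketch assuming PyTorch's caching CUDA allocator; the function name and the 30% alert threshold are illustrative choices, and the reserved-minus-allocated gap is only a heuristic for fragmentation, not an exact measurement.

```python
import torch

def fragmentation_report(device: int = 0) -> dict:
    """Rough fragmentation heuristic for PyTorch's caching CUDA allocator.

    A large gap between reserved and allocated memory means the allocator
    is holding cached blocks it may not be able to reuse for a new,
    larger contiguous allocation.
    """
    stats = torch.cuda.memory_stats(device)
    allocated = stats["allocated_bytes.all.current"]     # bytes backing live tensors
    reserved = stats["reserved_bytes.all.current"]       # bytes held by the allocator
    free_driver, total = torch.cuda.mem_get_info(device) # free/total seen by the driver

    return {
        "allocated_gb": allocated / 1e9,
        "reserved_gb": reserved / 1e9,
        "driver_free_gb": free_driver / 1e9,
        # Fraction of reserved memory that is cached but not backing any tensor.
        "fragmentation": (reserved - allocated) / reserved if reserved else 0.0,
    }

if __name__ == "__main__" and torch.cuda.is_available():
    report = fragmentation_report()
    if report["fragmentation"] > 0.3:  # assumed alert threshold
        print(f"High fragmentation: {report}")
```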
The fixes are periodic restarts or memory defragmentation. Some serving frameworks implement memory pools that pre-allocate fixed-size chunks, avoiding fragmentation at the cost of some wasted space. For long-running services, schedule restarts during low-traffic periods, before fragmentation causes problems.
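To make the memory-pool idea concrete, here is a framework-agnostic sketch in plain Python. The class and its parameters are hypothetical; real serving frameworks manage device memory, not Python bytearrays, but the invariant is the same: every block has one size, so a freed block can always satisfy the next request and fragmentation cannot accumulate.

```python
from collections import deque

class FixedBlockPool:
    """Pre-allocates same-sized buffers once, then recycles them.

    The trade-off: a request that needs less than one block still
    consumes a whole block (wasted space instead of fragmentation).
    """

    def __init__(self, block_bytes: int, num_blocks: int):
        self.block_bytes = block_bytes
        self._free = deque(bytearray(block_bytes) for _ in range(num_blocks))

    def acquire(self) -> bytearray:
        if not self._free:
            raise MemoryError("pool exhausted; shed load or grow the pool")
        return self._free.popleft()

    def release(self, block: bytearray) -> None:
        self._free.append(block)

# Usage: a pool of 256 blocks of 2 MB each, sized for the workload's
# typical allocation (numbers are illustrative).
pool = FixedBlockPool(block_bytes=2 * 1024 * 1024, num_blocks=256)
buf = pool.acquire()
# ... fill buf ...
pool.release(buf)
```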
Stragglers and Tail Latency
In distributed inference, the slowest component determines overall latency. If you use 4-GPU tensor parallelism and one GPU runs 50% slower due to thermal throttling, every request is 50% slower. At scale, stragglers become statistically common. With 100 replicas, at any moment several are likely experiencing degraded performance.
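A back-of-envelope calculation shows why. Assuming each replica independently has a 2% chance of being degraded at any instant (an assumed figure, not a measurement), the expected number of degraded replicas and the chance that at least one is degraded work out as follows:

```python
# With n replicas and per-replica degradation probability p:
p, n = 0.02, 100
expected_degraded = n * p             # 2 replicas degraded on average
prob_at_least_one = 1 - (1 - p) ** n  # ~0.87: almost always at least one
print(expected_degraded, round(prob_at_least_one, 2))
```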
Mitigation strategies include hedged requests (send the same request to multiple replicas and take the first response), aggressive health checks that remove slow replicas from the pool, and keeping replicas on homogeneous hardware to minimize variance.
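Here is a minimal asyncio sketch of a hedged request. The function names are hypothetical, and `call(replica, payload)` is assumed to be an async client function supplied by the caller; the key detail is that the duplicate is sent only after a hedge delay (for example, the observed p95 latency), so only the slowest few percent of requests cost extra capacity.

```python
import asyncio

async def hedged_request(replicas, payload, call, hedge_delay_s: float = 0.2):
    """Send to one replica; if it hasn't answered within hedge_delay_s,
    duplicate the request to a second replica and return whichever
    finishes first."""
    primary = asyncio.create_task(call(replicas[0], payload))
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        return primary.result()

    # Primary is slow: hedge to a second replica and race the two.
    backup = asyncio.create_task(call(replicas[1], payload))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:   # cancel the loser so it stops consuming a slot
        task.cancel()
    return done.pop().result()
```

The design choice to watch is the hedge delay: set it too low and you double the load, set it too high and it no longer trims the tail.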
Graceful Degradation
When load exceeds capacity, systems must degrade gracefully rather than collapse. Options include request shedding (reject requests above capacity with a clear error rather than letting them time out), quality reduction (switch to a smaller model or skip optional processing), and priority queuing (serve premium users first, delay others). Design these mechanisms before you need them; implementing graceful degradation during an outage is too late.
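As a sketch of how shedding and priority queuing combine, the admission controller below gives premium traffic a deeper queue, so standard requests are shed first when the service saturates. The class name, tier names, and queue depths are assumed values for illustration, not a prescribed configuration.

```python
import asyncio

class AdmissionController:
    """Load shedding with per-tier queue limits: anything beyond the
    limit fails fast with an explicit error instead of timing out."""

    def __init__(self, max_concurrent: int = 32, queue_depth: dict | None = None):
        self._slots = asyncio.Semaphore(max_concurrent)        # serving capacity
        self._depth = queue_depth or {"premium": 64, "standard": 16}
        self._waiting = {tier: 0 for tier in self._depth}

    async def run(self, tier: str, handler):
        if self._waiting[tier] >= self._depth[tier]:
            # Shed load: a clear, immediate signal the client can back off on.
            raise RuntimeError(f"overloaded: shedding {tier} request")
        self._waiting[tier] += 1
        try:
            async with self._slots:   # wait for a free serving slot
                return await handler()
        finally:
            self._waiting[tier] -= 1
```

A caller would wrap each request as `await controller.run("standard", lambda: serve(request))`; the point is that the rejection path exists and is exercised long before an outage forces it into existence.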