Scaling Failures: Memory Fragmentation, Stragglers, and Graceful Degradation
Memory Fragmentation
GPU memory fragments over time as different-sized tensors are allocated and freed. After hours of serving variable-length requests, you might have 8 GB of free memory yet be unable to allocate a contiguous 4 GB tensor, because that free memory is scattered across small chunks. The symptom: out-of-memory errors even though memory monitoring shows available capacity.
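One way to catch this early is to track how much memory the allocator holds without actually backing live tensors. Here is a minimal sketch assuming PyTorch's caching CUDA allocator; the function name and the 30% alert threshold are illustrative choices, and the reserved-minus-allocated gap is only a heuristic for fragmentation, not an exact measurement.

```python
import torch

def fragmentation_report(device: int = 0) -> dict:
    """Rough fragmentation heuristic for PyTorch's caching CUDA allocator.

    A large gap between reserved and allocated memory means the allocator
    is holding cached blocks it may not be able to reuse for a new,
    larger contiguous allocation.
    """
    stats = torch.cuda.memory_stats(device)
    allocated = stats["allocated_bytes.all.current"]     # bytes backing live tensors
    reserved = stats["reserved_bytes.all.current"]       # bytes held by the allocator
    free_driver, total = torch.cuda.mem_get_info(device) # free/total seen by the driver

    return {
        "allocated_gb": allocated / 1e9,
        "reserved_gb": reserved / 1e9,
        "driver_free_gb": free_driver / 1e9,
        # Fraction of reserved memory that is cached but not backing any tensor.
        "fragmentation": (reserved - allocated) / reserved if reserved else 0.0,
    }

if __name__ == "__main__" and torch.cuda.is_available():
    report = fragmentation_report()
    if report["fragmentation"] > 0.3:  # assumed alert threshold
        print(f"High fragmentation: {report}")
```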
The fixes are periodic restarts or memory defragmentation. Some serving frameworks implement memory pools that pre-allocate fixed-size chunks, avoiding fragmentation at the cost of some wasted space. For long-running services, schedule restarts during low-traffic periods, before fragmentation causes problems.
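To make the memory-pool idea concrete, here is a framework-agnostic sketch in plain Python. The class and its parameters are hypothetical; real serving frameworks manage device memory, not Python bytearrays, but the invariant is the same: every block has one size, so a freed block can always satisfy the next request and fragmentation cannot accumulate.

```python
from collections import deque

class FixedBlockPool:
    """Pre-allocates same-sized buffers once, then recycles them.

    The trade-off: a request that needs less than one block still
    consumes a whole block (wasted space instead of fragmentation).
    """

    def __init__(self, block_bytes: int, num_blocks: int):
        self.block_bytes = block_bytes
        self._free = deque(bytearray(block_bytes) for _ in range(num_blocks))

    def acquire(self) -> bytearray:
        if not self._free:
            raise MemoryError("pool exhausted; shed load or grow the pool")
        return self._free.popleft()

    def release(self, block: bytearray) -> None:
        self._free.append(block)

# Usage: a pool of 256 blocks of 2 MB each, sized for the workload's
# typical allocation (numbers are illustrative).
pool = FixedBlockPool(block_bytes=2 * 1024 * 1024, num_blocks=256)
buf = pool.acquire()
# ... fill buf ...
pool.release(buf)
```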
Stragglers and Tail Latency
In distributed inference, the slowest component determines overall latency. If you use 4-GPU tensor parallelism and one GPU runs 50% slower due to thermal throttling, every request is 50% slower. At scale, stragglers become statistically common. With 100 replicas, at any moment several are likely experiencing degraded performance.
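A back-of-envelope calculation shows why. Assuming each replica independently has a 2% chance of being degraded at any instant (an assumed figure, not a measurement), the expected number of degraded replicas and the chance that at least one is degraded work out as follows:

```python
# With n replicas and per-replica degradation probability p:
p, n = 0.02, 100
expected_degraded = n * p             # 2 replicas degraded on average
prob_at_least_one = 1 - (1 - p) ** n  # ~0.87: almost always at least one
print(expected_degraded, round(prob_at_least_one, 2))
```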
Mitigation strategies include hedged requests (send the same request to multiple replicas and take the first response), aggressive health checks that remove slow replicas from the pool, and keeping replicas on homogeneous hardware to minimize variance.
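Here is a minimal asyncio sketch of a hedged request. The function names are hypothetical, and `call(replica, payload)` is assumed to be an async client function supplied by the caller; the key detail is that the duplicate is sent only after a hedge delay (for example, the observed p95 latency), so only the slowest few percent of requests cost extra capacity.

```python
import asyncio

async def hedged_request(replicas, payload, call, hedge_delay_s: float = 0.2):
    """Send to one replica; if it hasn't answered within hedge_delay_s,
    duplicate the request to a second replica and return whichever
    finishes first."""
    primary = asyncio.create_task(call(replicas[0], payload))
    done, _ = await asyncio.wait({primary}, timeout=hedge_delay_s)
    if done:
        return primary.result()

    # Primary is slow: hedge to a second replica and race the two.
    backup = asyncio.create_task(call(replicas[1], payload))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:   # cancel the loser so it stops consuming a slot
        task.cancel()
    return done.pop().result()
```

The design choice to watch is the hedge delay: set it too low and you double the load, set it too high and it no longer trims the tail.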
Graceful Degradation
When load exceeds capacity, systems must degrade gracefully rather than collapse. Options include request shedding (reject requests above capacity with a clear error rather than letting them time out), quality reduction (switch to a smaller model or skip optional processing), and priority queuing (serve premium users first, delay others). Design these mechanisms before you need them; implementing graceful degradation during an outage is too late.
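As a sketch of how shedding and priority queuing combine, the admission controller below gives premium traffic a deeper queue, so standard requests are shed first when the service saturates. The class name, tier names, and queue depths are assumed values for illustration, not a prescribed configuration.

```python
import asyncio

class AdmissionController:
    """Load shedding with per-tier queue limits: anything beyond the
    limit fails fast with an explicit error instead of timing out."""

    def __init__(self, max_concurrent: int = 32, queue_depth: dict | None = None):
        self._slots = asyncio.Semaphore(max_concurrent)        # serving capacity
        self._depth = queue_depth or {"premium": 64, "standard": 16}
        self._waiting = {tier: 0 for tier in self._depth}

    async def run(self, tier: str, handler):
        if self._waiting[tier] >= self._depth[tier]:
            # Shed load: a clear, immediate signal the client can back off on.
            raise RuntimeError(f"overloaded: shedding {tier} request")
        self._waiting[tier] += 1
        try:
            async with self._slots:   # wait for a free serving slot
                return await handler()
        finally:
            self._waiting[tier] -= 1
```

A caller would wrap each request as `await controller.run("standard", lambda: serve(request))`; the point is that the rejection path exists and is exercised long before an outage forces it into existence.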