What Are Common Failure Modes in Large-Scale Training and Inference?
Scaling machine learning systems to hundreds of GPUs or thousands of queries per second introduces failure modes that do not appear in small-scale experiments. Understanding these edge cases is critical for maintaining service-level objectives (SLOs) and training stability.
Communication stalls are the most common training bottleneck. All-reduce or reduce-scatter operations synchronize gradients across workers. If one worker lags due to thermal throttling, a slow network link, or unbalanced compute, the entire collective blocks. For a 140 GB gradient at 50 GB per second effective bandwidth, a single slow link can add anywhere from hundreds of milliseconds to several seconds per step. Across thousands of steps, this compounds into hours of wasted time and can derail learning rate schedules that assume consistent step timing.
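To make the compounding concrete, here is a back-of-envelope sketch in Python. It assumes the collective is bottlenecked by the slowest link and ignores ring-topology constants and overlap with compute; the 140 GB payload and 50 GB per second baseline come from the paragraph above, and the 25 GB per second degraded link mirrors the example at the end of this section.

```python
# Back-of-envelope estimate of how much a single slow link adds to each
# all-reduce step. All numbers are illustrative assumptions, not
# measurements from a specific cluster.

def allreduce_seconds(payload_gb: float, bandwidth_gbps: float) -> float:
    """Rough transfer time for a collective paced by its slowest link."""
    return payload_gb / bandwidth_gbps

GRADIENT_GB = 140      # e.g. a 70B-parameter model with fp16 gradients
HEALTHY_GBPS = 50      # effective per-link bandwidth when all workers are healthy
DEGRADED_GBPS = 25     # one worker's link running at half speed

baseline = allreduce_seconds(GRADIENT_GB, HEALTHY_GBPS)    # 2.8 s
degraded = allreduce_seconds(GRADIENT_GB, DEGRADED_GBPS)   # 5.6 s
penalty_per_step = degraded - baseline                     # 2.8 s

steps = 1000
print(f"extra time per step: {penalty_per_step:.1f} s")
print(f"extra time over {steps} steps: {penalty_per_step * steps / 60:.1f} min")  # ~46.7 min
```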
Pipeline bubbles occur when stages in pipeline parallelism have unequal compute times. If one stage takes twice as long due to a large attention layer or memory bottleneck, the rest of the pipeline starves waiting for that stage. Increasing the number of micro-batches reduces the bubble fraction but consumes more activation memory. Rebalancing layers across stages requires retracing the model and is disruptive mid-training. Google reports that careful stage balancing within 5 to 10 percent compute variance is necessary to keep pipeline efficiency above 80 percent.
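The bubble cost can be approximated with a simple GPipe-style accounting: with p stages and m micro-batches, useful work is m times the sum of stage times, while p devices are occupied for (m + p - 1) times the slowest stage time. The sketch below uses illustrative stage times and micro-batch counts; real schedules (1F1B, interleaved) and communication overlap shift the exact figures.

```python
# Simplified GPipe-style accounting of pipeline efficiency. Stage times and
# micro-batch count are illustrative assumptions.

def pipeline_efficiency(stage_times_ms, num_microbatches):
    """Useful compute divided by total GPU-time, assuming every stage is
    paced by the slowest stage and the pipeline fills/drains once per step."""
    p = len(stage_times_ms)
    t_max = max(stage_times_ms)
    useful = num_microbatches * sum(stage_times_ms)   # work actually done
    elapsed = (num_microbatches + p - 1) * t_max      # wall-clock per step
    return useful / (p * elapsed)

balanced = [200] * 8            # all stages take 200 ms
imbalanced = [200] * 7 + [400]  # one stage takes twice as long

for name, stages in [("balanced", balanced), ("imbalanced", imbalanced)]:
    eff = pipeline_efficiency(stages, num_microbatches=32)
    print(f"{name}: {eff:.0%}")  # balanced ≈ 82%; the imbalanced pipeline drops sharply
```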
In inference, KV cache exhaustion creates hard failures. Each token can consume anywhere from a few hundred kilobytes to 2 to 3 MB of KV memory in large models, depending on layer count, attention head configuration, and precision. Without paged allocation or careful capacity planning, a batch of long requests can exceed available memory, causing out-of-memory (OOM) errors or severe fragmentation. Length variance pathologies make this worse. Mixing very long and very short prompts in one batch creates head-of-line blocking where short requests finish decoding but cannot exit until the longest prompt completes, blowing up p99 latency from 200 milliseconds to several seconds.
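A quick capacity check before admitting a batch catches most of these failures. The sketch below computes KV cache demand from first principles; the layer count, head configuration, and fp16 precision are assumptions standing in for a 70B-class dense-attention model, and grouped-query attention would shrink the per-token figure several-fold.

```python
# Rough KV-cache sizing check before admitting a batch. The model shape is an
# assumption (70B-class decoder without grouped-query attention); plug in your
# own layer count, head configuration, and dtype.

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Keys and values are both cached, hence the factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def batch_kv_bytes(batch_size, seq_len, per_token_bytes):
    return batch_size * seq_len * per_token_bytes

per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)
print(f"per-token KV: {per_token / 2**20:.1f} MiB")   # ~2.5 MiB for this config

needed = batch_kv_bytes(batch_size=512, seq_len=4096, per_token_bytes=per_token)
capacity = 1 * 2**40                                   # assume 1 TiB reserved for KV
print(f"batch needs {needed / 2**40:.2f} TiB, capacity {capacity / 2**40:.0f} TiB")
if needed > capacity:
    print("would OOM: shrink the batch, cap context length, or use paged KV")
```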
Mixture of Experts (MoE) models introduce hot-spot failures. The router can send too many tokens to a few popular experts, causing those GPUs to OOM or throttle while others sit idle. Capacity factors and load-balancing losses mitigate this, but cold-start prompts or adversarial inputs can still cause spikes. Meta's research on MoE routing shows that without auxiliary losses, top experts can receive 10 times more load than the average, creating persistent imbalance.
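One standard mitigation is an auxiliary load-balancing loss of the kind introduced in Switch Transformer, which penalizes routers that concentrate tokens on a few experts. The PyTorch sketch below is a minimal version; the token count, expert count, and loss coefficient are illustrative assumptions.

```python
# Minimal sketch of a Switch Transformer-style auxiliary load-balancing loss.

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """router_logits: [num_tokens, num_experts] raw scores from the gating network."""
    probs = F.softmax(router_logits, dim=-1)      # routing probabilities
    top1 = probs.argmax(dim=-1)                   # expert chosen per token
    # f_i: fraction of tokens dispatched to expert i
    dispatch_frac = F.one_hot(top1, num_experts).float().mean(dim=0)
    # P_i: mean routing probability assigned to expert i
    prob_frac = probs.mean(dim=0)
    # Minimized when both are uniform (1 / num_experts each).
    return num_experts * torch.sum(dispatch_frac * prob_frac)

logits = torch.randn(4096, 8)                     # 4096 tokens, 8 experts (illustrative)
aux = load_balancing_loss(logits, num_experts=8)
print(f"aux loss: {aux.item():.3f}")              # ≈ 1.0 when routing is roughly uniform
# In training, this is added to the task loss with a small coefficient,
# e.g. total = task_loss + 0.01 * aux
```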
Numerical instability in very large batches or mixed-precision training can cause non-deterministic divergence. Optimizer states overflow, loss scaling fails, or inconsistent collective operations create NaNs that propagate through the model. A single NaN on one worker can corrupt the synchronized parameters across all workers. Recovery from these failures is difficult because the exact conditions are hard to reproduce.
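A common defensive pattern is to check every rank's gradients for non-finite values, agree on the result globally, and skip the optimizer step rather than letting one bad worker poison the synchronized parameters. The sketch below is a simplified version of what torch.cuda.amp.GradScaler does in production; the manual loss-scale halving and the CPU-side flag are assumptions for illustration.

```python
# Sketch: skip the optimizer step on any rank's NaN/Inf gradients instead of
# letting a single bad worker corrupt the synchronized parameters.

import torch
import torch.distributed as dist

def grads_are_finite(model: torch.nn.Module) -> bool:
    flag = torch.tensor(1.0)
    for p in model.parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            flag.zero_()
            break
    if dist.is_initialized():
        # If any rank saw a NaN/Inf, every rank must skip the step together.
        dist.all_reduce(flag, op=dist.ReduceOp.MIN)
    return bool(flag.item())

def step(optimizer, model, loss_scale):
    if grads_are_finite(model):
        optimizer.step()
        return loss_scale
    optimizer.zero_grad(set_to_none=True)  # discard the poisoned gradients
    return loss_scale / 2                  # back off the loss scale and retry next step
```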
💡 Key Takeaways
• Communication stalls from slow workers or links block all-reduce operations, adding hundreds of milliseconds or more per step and compounding into hours over thousands of training steps
• Pipeline bubbles from unbalanced stage compute create chronic idle time, dropping efficiency from 82 percent to 60 percent when one stage takes twice as long as the others
• KV cache exhaustion occurs when batch size times context length times per-token memory exceeds device capacity, causing OOM errors or 20 percent fragmentation overhead without paging
• MoE hot spots send 10 times more load to popular experts, causing those GPUs to OOM while others idle, requiring auxiliary load-balancing losses to distribute tokens evenly
• Length variance pathologies in inference create head-of-line blocking where short requests wait seconds for long requests to finish, blowing up p99 latency from 200 milliseconds to over 3 seconds
📌 Examples
A training job with 256 GPUs encounters a slow 25 GB per second link on one worker, adding 2.8 seconds per all-reduce step for a 140 GB gradient versus the 50 GB per second baseline, wasting 46 minutes over 1000 steps
A pipeline with 8 stages where stage 4 takes 400 milliseconds and the others take 200 milliseconds each creates a 200 millisecond bubble per step, reducing efficiency from 82 to 60 percent
An inference batch of 512 sequences at 4,000 tokens each requires 1.2 TB of KV memory, exceeding the 1 TB capacity and causing OOM errors or requiring sequence eviction mid-decode
An MoE router without load balancing sends 10 times more tokens to expert 7, causing GPU 7 to OOM and reducing total throughput by 50 percent until an auxiliary loss redistributes the load