Batching Failure Modes and Mitigation Strategies
Head-of-Line Blocking
One slow item delays every item batched with it: a batch completes only when its slowest item does, so batch latency is max(item_times), not the mean. A single slow request (10x normal) blocks the entire batch. Symptoms: p99 latency far above p50; latency spikes that correlate with specific input types. Mitigations: time out individual items within the batch, process the remaining items, and return partial results; or route requests to separate fast and slow paths based on input characteristics (sequence length, complexity score). Profile to identify which inputs cause the outliers.
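A minimal sketch of the per-item timeout mitigation, using Python's concurrent.futures. The names (process_item, process_batch) and the timeout value are illustrative, not from any specific serving framework; process_item stands in for real per-item inference work:

```python
import concurrent.futures as cf
import time

def process_item(x):
    # Stand-in for model inference; slow inputs sleep past the budget.
    time.sleep(x)
    return x * 2

def process_batch(items, timeout_s=0.3):
    """Run items concurrently under a shared deadline; return partial results.

    Slots for items that miss the deadline stay None, so the caller can
    retry just those requests or return them an error, instead of letting
    one straggler hold up the whole batch.
    """
    results = [None] * len(items)
    pool = cf.ThreadPoolExecutor(max_workers=len(items))
    futures = {pool.submit(process_item, item): i for i, item in enumerate(items)}
    done, not_done = cf.wait(futures, timeout=timeout_s)
    for f in done:
        results[futures[f]] = f.result()
    for f in not_done:
        f.cancel()  # best effort; an already-running item is not interrupted
    pool.shutdown(wait=False)  # do not block the response on stragglers
    return results
```

Note that cancelling a future already executing in a thread is a no-op; truly aborting in-flight work requires cooperation from the worker (e.g., checking a deadline between inference steps).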
Memory Spikes from Variable Batches
Peak memory occurs when the maximum batch size coincides with the maximum input size. If the batch cap is 64 and the sequence cap is 512 tokens, peak allocation scales with 64 × 512 = 32,768 token slots (inputs are padded to the longest sequence in the batch), even though the average batch is closer to 64 × 100. The system runs fine most of the time, then crashes on an unlucky combination. Prevention: cap the product batch_size × max_input_size; dynamically reduce batch size when inputs are long; use separate pools for different input-size ranges. Set memory limits per worker so a worker fails fast rather than taking down the host.
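The product cap can be enforced at admission time. A sketch, assuming a simple greedy admitter over a queue of sequence lengths; the budget and batch cap are hypothetical values that would be tuned from profiled peak memory:

```python
MAX_TOKENS_PER_BATCH = 8192  # assumption: tuned so peak memory fits the worker
MAX_BATCH = 64               # hard batch-size cap

def admit_batch(queue_lengths):
    """Greedily admit requests while batch_size * longest_seq stays in budget.

    Because batched inputs are padded to the longest sequence, the memory
    footprint scales with len(batch) * longest, not with the sum of lengths.
    One long request therefore shrinks how many others can ride along.
    """
    batch, longest = [], 0
    for req_len in queue_lengths:
        new_longest = max(longest, req_len)
        if len(batch) >= MAX_BATCH or (len(batch) + 1) * new_longest > MAX_TOKENS_PER_BATCH:
            break
        batch.append(req_len)
        longest = new_longest
    return batch
```

With this budget, 512-token requests batch 16 at a time while 100-token requests fill the full 64, which is exactly the "dynamically reduce batch size when inputs are long" behavior.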
Batch Formation Starvation
During low traffic, batches rarely fill before the timeout, so each request waits the full timeout for a batch that never forms: under low load, batching makes latency worse than no batching at all. Detection: monitor the timeout trigger rate; if more than 80% of batches are timeout-triggered, batching is hurting latency. Fix: shorten the timeout at low traffic, or bypass batching entirely when queue depth is below a threshold (e.g., process immediately if fewer than 4 requests are waiting). Tune this threshold to the break-even point where batching overhead is amortized.
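The bypass rule and the trigger-rate monitor can live in the same dispatcher. A sketch with illustrative names and thresholds (BatchDispatcher, bypass_depth, target_size are assumptions, not a real framework's API); a production version would block on a condition variable up to the timeout rather than flush synchronously:

```python
import collections

class BatchDispatcher:
    """Form batches by size target or timeout, bypassing batching at low depth.

    Counts what triggered each batch so the timeout-trigger rate can be
    monitored; a persistently high rate means batching is adding latency.
    """
    def __init__(self, target_size=16, bypass_depth=4):
        self.target_size = target_size
        self.bypass_depth = bypass_depth
        self.triggers = collections.Counter()

    def next_batch(self, queue):
        if not queue:
            return []
        if len(queue) < self.bypass_depth:
            # Low load: serve one request immediately, skip the batching delay.
            self.triggers["bypass"] += 1
            return [queue.pop(0)]
        if len(queue) >= self.target_size:
            self.triggers["size"] += 1
            return [queue.pop(0) for _ in range(self.target_size)]
        # In a real server we would wait up to the timeout here, then
        # flush whatever arrived; this sketch flushes immediately.
        self.triggers["timeout"] += 1
        return [queue.pop(0) for _ in range(len(queue))]

    def timeout_rate(self):
        total = sum(self.triggers.values())
        return self.triggers["timeout"] / total if total else 0.0
```

Alerting when timeout_rate() stays above the 80% threshold from the text is a cheap way to catch starvation before it shows up in latency dashboards.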