ML Model Optimization › Batch Size & Throughput Tuning (Hard, ~3 min)

Batching Failure Modes and Mitigation Strategies

Head-of-Line Blocking

One slow item in a batch delays all items: a batch completes only when its slowest item finishes, so batch latency is max(item_times), not the average. A single slow request (10x normal) blocks the entire batch. Symptoms: p99 latency is much higher than p50; latency spikes correlate with specific input types. Mitigation: time out individual items within the batch, process the remaining items, and return partial results. Alternatively, route requests to separate fast and slow paths based on input characteristics (sequence length, complexity score). Profile to identify which inputs cause outliers.
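A minimal sketch of the per-item timeout mitigation, assuming a hypothetical `process_item` inference function and using one thread per item as a stand-in for real model execution (a GPU-batched model would need a different cancellation mechanism):

```python
from concurrent.futures import ThreadPoolExecutor, wait

def process_batch(items, process_item, per_item_timeout=0.5):
    """Run each batch item with an individual deadline; return partial results."""
    pool = ThreadPoolExecutor(max_workers=len(items))
    futures = {pool.submit(process_item, item): i for i, item in enumerate(items)}
    # Wait up to the per-item deadline; whatever finished in time is kept.
    done, not_done = wait(futures, timeout=per_item_timeout)
    results = {futures[f]: f.result() for f in done}
    for f in not_done:
        f.cancel()             # slow items yield no result instead of blocking
    pool.shutdown(wait=False)  # don't block the batch on stragglers
    return results             # partial map: item index -> output
```

Callers see a dict keyed by item index, so missing keys identify the items that hit their deadline and can be retried or failed explicitly.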

Memory Spikes from Variable Batches

Peak memory occurs when the maximum batch size coincides with the maximum input size. If the max batch is 64 and the max sequence is 512 tokens, peak memory scales with 64×512 tokens even though the average workload is closer to 64×100. The system runs fine most of the time, then crashes on an unlucky combination. Prevention: cap the product batch_size × max_input_size; dynamically reduce the batch size when inputs are long; use separate pools for different input-size ranges. Set memory limits per worker so it fails fast rather than crashing the host.
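One way to cap the batch_size × max_input_size product is to plan batches under a token budget. This is a sketch with illustrative values (`max_batch`, `token_budget` are not tuned recommendations); it takes a list of sequence lengths and groups them so no batch exceeds the budget:

```python
def plan_batches(seq_lens, max_batch=64, token_budget=64 * 128):
    """Group sequence lengths so batch_size * longest_sequence <= token_budget."""
    batches, current, longest = [], [], 0
    for seq_len in sorted(seq_lens):   # sorting keeps similar lengths together
        new_longest = max(longest, seq_len)
        # Flush the current batch if adding this item would exceed either cap.
        # (A single request longer than the budget still gets its own batch.)
        if current and (len(current) >= max_batch
                        or (len(current) + 1) * new_longest > token_budget):
            batches.append(current)
            current, longest = [], 0
            new_longest = seq_len
        current.append(seq_len)
        longest = new_longest
    if current:
        batches.append(current)
    return batches
```

Sorting by length also reduces padding waste, since each batch pads to its own longest sequence rather than the global maximum.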

Batch Formation Starvation

During low traffic, batches rarely fill before the timeout expires, so each request waits the full timeout for a batch that never forms. Under low load, latency is therefore worse than with no batching at all. Detection: monitor the timeout trigger rate; if >80% of batches are timeout-triggered, batching is hurting latency. Fix: reduce the timeout at low traffic, or bypass batching entirely when queue depth is below a threshold (e.g., process immediately if fewer than 4 requests are waiting). Tune this threshold to the break-even point where batching overhead is amortized.
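The detection rule above can be sketched as a rolling monitor; the window size and the 0.8 alert ratio mirror the rule of thumb in the text, and both are illustrative:

```python
from collections import deque

class BatchTriggerMonitor:
    """Track what fraction of recent batches were timeout-triggered."""

    def __init__(self, window=200, alert_ratio=0.8):
        self.events = deque(maxlen=window)  # True = timeout-triggered batch
        self.alert_ratio = alert_ratio

    def record(self, timeout_triggered):
        self.events.append(bool(timeout_triggered))

    def batching_hurts_latency(self):
        if len(self.events) < self.events.maxlen:
            return False  # not enough data to decide yet
        return sum(self.events) / len(self.events) > self.alert_ratio
```

Record one event per formed batch (timeout- vs. size-triggered); when the monitor fires, shorten the timeout or enable the low-load bypass.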

💡 Adaptive Strategy: Adjust batching parameters based on real-time load. High traffic: longer timeout, larger batches. Low traffic: short timeout or no batching. Use queue depth or request rate as the control signal for smooth transitions.
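The adaptive strategy reduces to a small control function over queue depth. This sketch uses illustrative thresholds and timeouts; tune them against your own overhead-amortization break-even measurements:

```python
def choose_batching(queue_depth, bypass_threshold=4,
                    base_timeout_ms=10.0, max_timeout_ms=50.0):
    """Return (use_batching, formation_timeout_ms) for the current load."""
    if queue_depth < bypass_threshold:
        return False, 0.0  # low load: process immediately, skip formation delay
    # High load: scale the formation timeout with queue depth, capped,
    # so busy periods wait for fuller batches.
    timeout = min(max_timeout_ms,
                  base_timeout_ms * queue_depth / bypass_threshold)
    return True, timeout
```

Because the timeout varies continuously with queue depth, the system transitions smoothly between the bypass and batching regimes instead of oscillating at the threshold.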
💡 Key Takeaways
Head-of-line blocking: one slow item delays the entire batch; time out individual items and return partial results
Memory spikes: batch_size × max_input_size drives worst-case memory and causes crashes; cap the product or reduce batch size for long inputs
Starvation: >80% timeout triggers means batching hurts latency; bypass batching at low queue depth
Symptom of head-of-line: p99 latency much higher than p50, spikes correlate with input types
Adaptive: adjust batch params based on load; use queue depth as control signal
📌 Interview Tips
1. Describe head-of-line blocking with the p99/p50 symptom - shows latency analysis experience
2. Explain the memory spike from the batch×input product and the capping solution
3. Mention the 80% timeout-trigger threshold as the signal to disable batching at low load