Model Serving & Inference • Latency Optimization (Batching, Caching, Quantization) • Hard • ⏱️ ~3 min
How Do You Tune Inference Serving for Different Workload Patterns?
Workload characteristics dictate which optimization levers to pull and how aggressively to tune them. Production systems must adapt to distinct traffic patterns: interactive chat with strict p95 latency requirements, batch document processing that optimizes for throughput, and mixed workloads that require careful resource partitioning.
For latency-sensitive interactive workloads, the priority is minimizing p95 or p99 latency while maintaining acceptable throughput. Use short micro-batching windows of 10 to 20 milliseconds to limit queuing delay. Deploy continuous batching with small maximum batch sizes, perhaps 8 to 16 requests, so that long sequences do not block short ones. Enable KV caching and prefix reuse aggressively, since multi-turn conversations benefit enormously. Apply weight-only quantization to reduce memory footprint and enable more concurrent sessions, but validate that p99 latency remains stable. Monitor memory utilization closely and set admission control thresholds at 75% to 80% to preserve headroom for bursts. Response caching helps for repeated queries but requires careful Time To Live (TTL) settings and invalidation to avoid staleness.
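A minimal sketch of what such an interactive-pool configuration might look like is below. The parameter names and the admission-control helper are illustrative assumptions, not the API of any particular serving framework; real systems expose equivalent knobs under their own names.

```python
# Hypothetical config for a latency-sensitive interactive serving pool.
# Names and defaults are illustrative, not tied to a specific framework.
from dataclasses import dataclass


@dataclass
class InteractiveServingConfig:
    batching_window_ms: float = 15.0          # short micro-batching window (10-20 ms)
    max_batch_size: int = 12                  # small cap so long sequences don't block short ones
    enable_kv_cache: bool = True              # reuse attention KV across decode steps
    enable_prefix_reuse: bool = True          # share cached prefixes across multi-turn sessions
    weight_only_quant: str = "int8"           # weight-only quantization; validate p99 before enabling
    memory_admission_threshold: float = 0.78  # stop admitting above ~75-80% memory utilization
    response_cache_ttl_s: int = 300           # TTL for an exact-match response cache


def admit_request(current_mem_utilization: float, cfg: InteractiveServingConfig) -> bool:
    """Simple admission control: keep headroom for traffic bursts."""
    return current_mem_utilization < cfg.memory_admission_threshold


if __name__ == "__main__":
    cfg = InteractiveServingConfig()
    print(admit_request(0.72, cfg))  # True: below threshold, admit
    print(admit_request(0.83, cfg))  # False: shed or queue to protect p99
```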
Throughput-optimized batch workloads flip these priorities: maximize tokens per second per dollar of compute. Use longer batching windows of 50 to 100 milliseconds, or wait until 32 to 64 requests have accumulated, to fully saturate the accelerator. Increase the maximum batch size up to the memory limit, tolerating higher tail latency since users are not waiting interactively. Apply aggressive quantization, including weight-plus-activation quantization if quality permits, since even small speedups multiply across millions of tokens. Offloading to slower but cheaper memory tiers becomes viable when latency Service Level Objectives (SLOs) are measured in seconds rather than milliseconds. Disable prefix caching unless the batch contains many duplicated prefixes, as cache management overhead can outweigh the benefit for unique documents.
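The core of the throughput-oriented policy is the "count-or-timeout" batch accumulator. A minimal sketch, assuming a plain in-process queue (the 64-request and 100 ms defaults mirror the figures above):

```python
# Throughput-oriented batch accumulator: release a batch once either the request
# count is reached or the batching window elapses, whichever comes first.
import queue
import time
from typing import Any, List


def collect_batch(request_queue: "queue.Queue[Any]",
                  max_batch_size: int = 64,
                  window_ms: float = 100.0) -> List[Any]:
    """Block until max_batch_size requests arrive or window_ms elapses."""
    batch: List[Any] = []
    deadline = time.monotonic() + window_ms / 1000.0
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```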
Mixed workloads require explicit separation or sophisticated scheduling. Google and Meta use separate serving pools for interactive and batch traffic to prevent resource contention. Alternatively, priority queues with preemption allow high-priority interactive requests to interrupt lower-priority batch jobs, accepting some wasted work in exchange for better p99 latency. Elastic batch sizes that shrink during interactive traffic bursts and grow during quiet periods balance utilization and responsiveness. Cache-aware scheduling routes requests with high expected cache hit rates to instances with warm caches, while cold requests go to separate capacity.
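A simplified sketch of the priority-queue-with-preemption idea follows. The two-tier priority scheme and the "requeue the whole batch job" preemption model are illustrative assumptions; production schedulers typically preempt at finer granularity (e.g., evicting individual sequences from the running batch).

```python
# Two-tier priority scheduler sketch: interactive requests preempt in-flight batch work.
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional

INTERACTIVE, BATCH = 0, 1          # lower value = higher priority
_counter = itertools.count()       # FIFO tie-breaker within a priority level


@dataclass(order=True)
class Request:
    priority: int
    seq: int = field(default_factory=lambda: next(_counter))
    payload: str = field(default="", compare=False)


class PreemptiveScheduler:
    def __init__(self) -> None:
        self.ready: List[Request] = []
        self.running: Optional[Request] = None

    def submit(self, req: Request) -> None:
        heapq.heappush(self.ready, req)
        # Preempt a running batch job when an interactive request arrives;
        # its partial work is discarded (the "wasted work" cost of preemption).
        if (self.running is not None
                and self.running.priority == BATCH
                and req.priority == INTERACTIVE):
            heapq.heappush(self.ready, self.running)
            self.running = None

    def step(self) -> Optional[Request]:
        if self.running is None and self.ready:
            self.running = heapq.heappop(self.ready)
        return self.running
```

The design choice to tolerate wasted work on preempted batch jobs is exactly the trade described above: batch throughput is sacrificed at the margin to keep interactive p99 within its SLO.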
The critical insight is that optimizations interact nonlinearly. Aggressive batching plus aggressive quantization can push a previously memory-bound workload into compute-bound territory, where quantization suddenly provides a 2× to 3× speedup instead of 1.5×. Conversely, over-aggressive KV cache compression can negate the benefits of continuous batching if quality degradation forces retries. Tuning requires continuous measurement of arithmetic intensity, memory bandwidth utilization, and quality metrics across representative traffic to find the Pareto frontier.
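The memory-bound versus compute-bound distinction comes straight from the roofline model: compare the workload's arithmetic intensity (FLOPs per byte moved) with the accelerator's ratio of peak compute to memory bandwidth. A back-of-the-envelope sketch, with illustrative hardware numbers (roughly 300 TFLOP/s and 2 TB/s) and a simplified decode cost of 2 FLOPs per fp16 weight per request:

```python
# Roofline check: is the workload memory bound or compute bound?
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    return flops / bytes_moved


def bound_regime(intensity: float, peak_tflops: float, mem_bw_tbps: float) -> str:
    # Ridge point: FLOPs the chip can do per byte it can fetch.
    ridge_point = (peak_tflops * 1e12) / (mem_bw_tbps * 1e12)
    return "compute-bound" if intensity > ridge_point else "memory-bound"


# Decode reads each 2-byte fp16 weight once per step and does ~2 FLOPs per request
# sharing that weight, so intensity grows roughly linearly with batch size.
for batch in (1, 16, 256):
    intensity = arithmetic_intensity(flops=2.0 * batch, bytes_moved=2.0)
    print(f"batch={batch}: {intensity:.0f} FLOP/byte -> "
          f"{bound_regime(intensity, peak_tflops=300.0, mem_bw_tbps=2.0)}")
```

Under these assumptions, small interactive batches sit far below the ridge point (memory-bound, where weight-only quantization helps by shrinking bytes moved), while large batches cross it (compute-bound, where activation quantization and lower-precision math deliver the larger speedups noted above).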
💡 Key Takeaways
•Interactive workloads prioritize p95 or p99 latency with short 10 to 20 ms batching windows, continuous batching capped at 8 to 16 requests, and admission control at 75% to 80% memory utilization to preserve burst headroom
•Throughput-optimized batch workloads use 50 to 100 ms batching windows or wait for 32 to 64 requests, maximize batch size up to the memory limit, and apply aggressive weight-plus-activation quantization because speedups multiply across millions of tokens
•Mixed workloads require resource separation with dedicated serving pools, or priority queues with preemption that allow high-priority interactive requests to interrupt batch jobs at the cost of wasted computation
•Optimization interactions are nonlinear: aggressive batching plus quantization can shift a workload from memory-bound to compute-bound, changing the quantization speedup from 1.5× to 3×; continuous measurement is required to find the Pareto frontier
•Cache-aware scheduling routes requests with high expected cache hit rates to instances with warm prefix caches, while cold requests use separate capacity to avoid thrashing shared caches (see the routing sketch after this list)
•Elastic batch sizing dynamically adjusts the maximum batch size based on the current traffic mix, shrinking during interactive bursts to protect latency and growing during quiet periods to improve utilization
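As a sketch of the cache-aware routing idea referenced above: route each request to the replica whose warm prefix has the longest token-level match with the incoming prompt, and fall back to cold capacity when no match is long enough. The function name, the per-instance prefix map, and the 16-token threshold are hypothetical.

```python
# Cache-aware router sketch: prefer the replica with the longest warm-prefix match.
from typing import Dict, List, Optional


def route_request(prompt_tokens: List[int],
                  warm_prefixes: Dict[str, List[int]],
                  min_match: int = 16) -> Optional[str]:
    """Return the instance id with the longest matching cached prefix, or None for the cold pool."""
    best_instance, best_len = None, 0
    for instance_id, cached in warm_prefixes.items():
        match = 0
        for a, b in zip(prompt_tokens, cached):
            if a != b:
                break
            match += 1
        if match > best_len:
            best_instance, best_len = instance_id, match
    return best_instance if best_len >= min_match else None
```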
📌 Examples
Amazon separates interactive product recommendations (p95 target 100 ms) from batch email personalization (target 10 minutes) into different serving fleets, with the interactive fleet using small batches and the batch fleet using 64+ request batches to saturate Graphics Processing Units (GPUs)
Netflix tunes recommendation inference with 15 ms batching windows and batch size 12 for homepage (p99 150 ms Service Level Objective (SLO)), but uses 100 ms windows and batch size 64 for overnight email generation (no latency SLO)
Google production systems measure arithmetic intensity continuously and increase quantization aggressiveness when workload shifts toward compute bound during batch processing, then back off to weight only quantization during interactive peaks
Meta deploys priority queues where chat requests can preempt document summarization jobs, accepting up to 20% wasted work on preempted jobs to keep chat p99 under 200 ms during traffic bursts