How Do You Tune Inference Serving for Different Workload Patterns?
Latency-Sensitive Interactive Workloads
The priority is minimizing p95 or p99 latency while maintaining acceptable throughput. Use short micro-batching windows of 10 to 20 milliseconds to limit queuing delay. Deploy continuous batching with small maximum batch sizes, perhaps 8 to 16 requests, to prevent long sequences from blocking short ones. Enable KV caching and prefix reuse aggressively, since multi-turn conversations benefit enormously. Apply weight-only quantization to reduce memory footprint and enable more concurrent sessions, but validate that p99 latency remains stable. Monitor memory utilization closely and set admission control thresholds at 75% to 80% to preserve headroom for bursts.
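These knobs can be collected into a serving profile with an admission-control check. A minimal sketch; the `ServingProfile` and `admit` names are illustrative, not from any particular serving framework:

```python
from dataclasses import dataclass

@dataclass
class ServingProfile:
    """Illustrative tuning profile for a latency-sensitive pool."""
    batch_window_ms: int = 15          # 10-20 ms micro-batching window
    max_batch_size: int = 16           # small batches: 8-16 requests
    enable_prefix_cache: bool = True   # reuse KV cache across turns
    admission_threshold: float = 0.80  # shed load above 80% memory use

def admit(profile: ServingProfile, memory_utilization: float) -> bool:
    """Admission control: reject new sessions once utilization
    crosses the threshold, preserving headroom for bursts."""
    return memory_utilization < profile.admission_threshold

interactive = ServingProfile()
print(admit(interactive, 0.72))  # below threshold -> accept
print(admit(interactive, 0.85))  # above threshold -> shed load
```

Keeping the threshold below 100% is what absorbs traffic bursts without pushing p99 latency off a cliff.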
Throughput-Optimized Batch Workloads
The goal is maximizing tokens per second per dollar of compute. Use longer batching windows of 50 to 100 milliseconds, or wait until 32 to 64 requests have accumulated, to fully saturate the accelerator. Increase the maximum batch size up to the memory limit, tolerating higher tail latency since users are not waiting interactively. Apply aggressive quantization, including weight-plus-activation quantization if quality permits, since even small speedups multiply across millions of tokens. Offloading to slower but cheaper memory tiers becomes viable when latency SLOs are measured in seconds rather than milliseconds. Disable prefix caching unless the batch contains many duplicated prefixes.
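The longer batching window reduces to a simple flush rule: release a batch when it fills or when the window expires, whichever comes first. A minimal sketch, with the `BatchAccumulator` class being hypothetical:

```python
import time

class BatchAccumulator:
    """Flush when the batch reaches target size or the window expires."""

    def __init__(self, target_batch: int = 64, window_ms: int = 100):
        self.target_batch = target_batch
        self.window_s = window_ms / 1000.0
        self.pending: list = []
        self.window_start: float | None = None

    def add(self, request):
        """Enqueue a request; return a full batch if the rule fires, else None."""
        if not self.pending:
            self.window_start = time.monotonic()
        self.pending.append(request)
        return self._maybe_flush()

    def _maybe_flush(self):
        full = len(self.pending) >= self.target_batch
        expired = (self.window_start is not None
                   and time.monotonic() - self.window_start >= self.window_s)
        if full or expired:
            batch, self.pending = self.pending, []
            self.window_start = None
            return batch
        return None

acc = BatchAccumulator(target_batch=4, window_ms=100)
for i in range(3):
    assert acc.add(f"req{i}") is None  # still accumulating
batch = acc.add("req3")
print(len(batch))  # 4: flushed on reaching the target size
```

In a real server the expiry path would also be driven by a timer, so a half-full batch is not stranded when traffic stops; the size-triggered path shown here is the one that dominates under saturated batch traffic.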
Mixed Workloads
Mixed workloads require explicit separation or sophisticated scheduling. Use separate serving pools for interactive and batch traffic to prevent resource contention. Alternatively, priority queues with preemption allow high-priority interactive requests to interrupt lower-priority batch jobs, accepting some wasted work in exchange for better p99 latency. Elastic batch sizes that shrink during interactive traffic bursts and grow during quiet periods balance utilization and responsiveness. Cache-aware scheduling routes requests with high expected cache hit rates to instances with warm caches.
The Interaction Effect
Optimizations interact nonlinearly. Aggressive batching plus aggressive quantization can push a previously memory-bound workload into compute-bound territory, where quantization suddenly provides 2x to 3x speedup instead of 1.5x. Conversely, over-aggressive KV cache compression can negate the benefits of continuous batching if quality degradation forces retries. Tuning requires continuous measurement of arithmetic intensity, memory bandwidth utilization, and quality metrics across representative traffic to find the Pareto frontier.