What Are the Key Trade-offs in LLM Serving Optimizations?
LATENCY VS THROUGHPUT
The fundamental trade-off in LLM serving. Larger batch sizes improve throughput (tokens per second) but increase latency for individual requests (time to first token, time per token).
Batch size 1: Lowest latency (~10ms/token for 7B model). GPU underutilized. Throughput: ~100 tokens/second.
Batch size 32: Higher per-token latency (~30ms/token). GPU fully utilized. Aggregate throughput: ~1000 tokens/second (32 streams × ~33 tokens/second each).
Choose based on use case: interactive chat needs low latency (small batches). Bulk document processing needs throughput (large batches).
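The trade-off above can be sketched with a toy cost model. The numbers (10 ms base step time, 0.65 ms added per extra request in the batch) are illustrative assumptions chosen to roughly reproduce the ballpark figures in the text, not measurements:

```python
# Toy model of the latency/throughput trade-off (illustrative numbers,
# not measured on real hardware).
def serving_stats(batch_size, base_ms=10.0, per_request_ms=0.65):
    """Per-token step time grows slowly with batch size, while aggregate
    throughput (tokens/second across all requests) rises with batching."""
    latency_ms = base_ms + per_request_ms * (batch_size - 1)  # ms per token
    throughput = batch_size * 1000.0 / latency_ms             # tokens/second
    return latency_ms, throughput

for b in (1, 8, 32):
    lat, tput = serving_stats(b)
    print(f"batch={b:3d}  latency={lat:5.1f} ms/token  throughput={tput:7.0f} tok/s")
```

At batch 1 this gives ~10 ms/token and ~100 tok/s; at batch 32, ~30 ms/token and ~1000 tok/s, i.e. 3x worse latency for roughly 10x the throughput.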
PRECISION VS QUALITY
Lower precision (FP16, INT8, INT4) reduces memory and increases speed but may degrade output quality.
FP16: Standard for serving. Minimal quality loss. 2x memory savings vs FP32.
INT8: 2x further memory savings. Quality depends on quantization method. Good methods (GPTQ, AWQ) maintain quality for most tasks.
INT4: 2x further savings vs INT8 (4x vs FP16). Noticeable quality degradation on complex reasoning tasks. Acceptable for simpler generation.
Test quality on your specific use case before deploying quantized models in production.
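To make the memory/error trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 weight quantization. This is only the core rounding idea; production methods like GPTQ and AWQ are far more sophisticated (per-channel scales, calibration data, error compensation):

```python
import numpy as np

# Minimal symmetric per-tensor INT8 quantization sketch.
def quantize_int8(w):
    scale = np.abs(w).max() / 127.0          # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(w)
err = np.abs(dequantize(q, scale) - w).max()
print(f"memory: {w.nbytes // 2**20} MiB (FP32) -> {q.nbytes // 2**20} MiB (INT8), "
      f"max abs error {err:.2e}")
```

Storage drops 4x vs FP32 (2x vs FP16), and the worst-case rounding error is bounded by half the scale factor; whether that error matters is exactly what the production quality test above should decide.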
CONTEXT LENGTH VS COST
Longer context enables more capabilities but increases cost: attention compute grows quadratically with sequence length (the score matrix is O(N²)), while KV-cache memory grows linearly.
Trade-off: going from 8K to 32K context takes ~4x the KV-cache memory and ~16x the attention FLOPs. Do you need the full context? Often you can truncate or summarize to use shorter context at lower cost.
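A back-of-envelope calculation shows the two scaling regimes. The model dimensions below (32 layers, 32 KV heads, head dim 128, FP16 cache) are illustrative assumptions for a mid-size model, not a specific architecture:

```python
# Back-of-envelope context-length scaling (illustrative model dimensions).
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=32, head_dim=128, bytes_per=2):
    # One K and one V vector per token per layer: linear in context length.
    return 2 * n_tokens * n_layers * n_kv_heads * head_dim * bytes_per

def attention_score_flops(n_tokens, head_dim=128):
    # QK^T score matrix is N x N per head: quadratic in context length.
    return 2 * n_tokens**2 * head_dim

for n in (8_192, 32_768):
    print(f"{n:6d} tokens: KV cache {kv_cache_bytes(n) / 2**30:.1f} GiB, "
          f"score FLOPs per head {attention_score_flops(n):.2e}")
```

Quadrupling the context quadruples the KV cache (linear term) but multiplies the attention-score FLOPs by 16 (quadratic term), which is why long-context serving is dominated by attention cost.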
Sliding window attention and other techniques reduce this cost but may lose information from truncated context.
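The sliding-window idea can be shown as an attention mask: each token attends only to the previous `window` tokens, so cost grows linearly with sequence length, at the price of dropping information outside the window. A minimal sketch (toy sizes, hypothetical helper name):

```python
import numpy as np

# Sliding-window causal attention mask: True where query i may attend to key j.
def sliding_window_mask(n, window):
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    return (j <= i) & (j > i - window)  # causal, limited to `window` lookback

mask = sliding_window_mask(6, window=3)
print(mask.astype(int))
```

Each row has at most `window` ones, so the masked score matrix costs O(N x window) instead of O(N²); tokens more than `window` positions back are invisible unless the model propagates their information forward through intermediate layers.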