What Are the Key Trade-offs in LLM Serving Optimizations?
LLM serving optimizations involve fundamental trade-offs among latency, throughput, memory, accuracy, and operational complexity. Larger batches increase aggregate tokens per second but also increase per-request latency, because each decode step's kernels take longer and every sequence waits for the whole batch to advance. Continuous batching reduces idle time and improves throughput but can increase time to first token (TTFT) if prefill is chunked too aggressively or if the scheduler prioritizes throughput over immediacy. Tuning batch size and chunk size means balancing these competing goals against service level objectives (SLOs).
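To make the tension concrete, the sketch below uses a hypothetical latency model; the `step_overhead_ms` and `per_seq_ms` constants are illustrative assumptions, not measurements from any real system.

```python
# Back-of-envelope batching model. The timing constants are assumptions for
# illustration only; real decode-step costs depend on the model, hardware,
# and sequence lengths.

def decode_step_ms(batch_size: int,
                   step_overhead_ms: float = 4.0,
                   per_seq_ms: float = 0.25) -> float:
    """Hypothetical time for one decode step: a fixed cost to read the
    weights plus a per-sequence compute component."""
    return step_overhead_ms + per_seq_ms * batch_size

for batch_size in (1, 16, 64, 256):
    step = decode_step_ms(batch_size)
    throughput = batch_size / step * 1000.0   # tokens/sec across the whole batch
    print(f"batch={batch_size:3d}  step={step:6.1f} ms  "
          f"throughput={throughput:8.0f} tok/s  latency per generated token={step:6.1f} ms")
```

Under these assumed costs, throughput climbs with batch size while the time each request waits per generated token grows linearly, which is exactly the tension the scheduler must resolve against its SLOs.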
Memory versus recomputation is the central tension. KV caching saves massive amounts of compute by avoiding quadratic attention recomputation but consumes large amounts of GPU memory, which limits batch size and sequence length. Evicting KV pages reduces memory pressure but forces recomputation or context truncation. Sliding window attention caps memory growth but loses long-range dependencies that some models rely on for quality. Each approach makes sense in different scenarios: use full KV caching when memory allows, apply compression when concurrency demands it, and use sliding windows only when the model architecture supports it.
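A rough sizing helper makes the memory pressure visible. The model shape below (32 layers, 32 KV heads, head dimension 128, FP16 values) is an assumed 7B-class configuration, so the absolute numbers are illustrative rather than tied to any specific model.

```python
# Rough KV-cache sizing. The model dimensions are assumptions for a
# 7B-class dense model with full (non-grouped) KV heads stored in FP16.

def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 32,
                   num_kv_heads: int = 32,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    # Factor of 2 covers keys and values, stored per layer and per head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens

batch, seq_len, window = 64, 800, 256
full = kv_cache_bytes(batch * seq_len)
capped = kv_cache_bytes(batch * min(seq_len, window))
print(f"full cache:     {full / 2**30:5.1f} GiB")   # ~25 GiB, on the order of the example below
print(f"sliding window: {capped / 2**30:5.1f} GiB") # memory capped, but distant tokens are gone
```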
Accuracy versus memory compression introduces subtle quality risks. Quantizing the KV cache to FP8 or INT8 cuts memory 2x to 4x but can degrade quality on tasks with rare or long-range dependencies. Cache eviction based on attention scores removes tokens that might become important later, causing coherence issues in long conversations. Techniques like retaining sink tokens and recent windows improve robustness, but tuning is workload-specific and requires careful A/B testing. The failure mode is silent quality degradation that users notice over time.
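The sketch below shows the shape of one such policy, retaining a few initial "sink" tokens plus a recent window, in the spirit of sink-token retention schemes; the `num_sink` and `recent_window` values are hypothetical knobs, not recommendations.

```python
# Minimal sink-plus-recent-window eviction sketch. The retention sizes are
# hypothetical; the right values are workload-specific.

def positions_to_keep(seq_len: int,
                      num_sink: int = 4,
                      recent_window: int = 1020) -> list:
    """Return the token positions retained in the KV cache after eviction."""
    if seq_len <= num_sink + recent_window:
        return list(range(seq_len))                    # nothing evicted yet
    sinks = list(range(num_sink))                      # always keep the earliest tokens
    recent = list(range(seq_len - recent_window, seq_len))
    return sinks + recent                              # the middle of the context is dropped

kept = positions_to_keep(seq_len=5000)
print(f"{len(kept)} of 5000 positions kept")           # 1024 kept; the rest are evicted
```

Anything evicted from the middle is unrecoverable without recomputation, which is where silent coherence loss in long conversations comes from.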
Speculative decoding trades simplicity for potential speedup. Adding a draft model introduces another deployment artifact, extra control flow, verification kernels, and dual KV cache management. If acceptance rates drop below 30 to 40 percent due to domain shift or task mismatch, the overhead outweighs the benefits and can increase latency. Production systems must monitor acceptance rates continuously and fall back to standard decoding when speculation becomes counterproductive. Colocating the draft and target models on the same GPU avoids PCIe bottlenecks but requires careful memory budgeting.
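A production guardrail can be as simple as tracking a rolling acceptance rate and disabling speculation below a break-even point. The threshold and window below are illustrative assumptions; the real break-even depends on draft-model cost and verification overhead.

```python
# Sketch of acceptance-rate monitoring with a fallback to standard decoding.
# The threshold and window size are illustrative assumptions.
from collections import deque

class SpeculationController:
    def __init__(self, threshold: float = 0.35, window: int = 500):
        self.threshold = threshold
        self.accepted = deque(maxlen=window)   # rolling record of accepted/rejected draft tokens

    def record(self, num_accepted: int, num_proposed: int) -> None:
        self.accepted.extend([1] * num_accepted + [0] * (num_proposed - num_accepted))

    @property
    def acceptance_rate(self) -> float:
        return sum(self.accepted) / len(self.accepted) if self.accepted else 1.0

    def use_speculation(self) -> bool:
        # Fall back to plain autoregressive decoding when the rolling
        # acceptance rate drops below the assumed break-even threshold.
        return self.acceptance_rate >= self.threshold

ctrl = SpeculationController()
ctrl.record(num_accepted=1, num_proposed=4)    # e.g. heavy domain shift
print(ctrl.acceptance_rate, ctrl.use_speculation())
```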
Resource isolation versus utilization creates product-level tension. Mixing diverse workloads (short chat turns alongside long document analysis) increases GPU utilization but complicates fairness. Without admission control and scheduling policies, short requests can be delayed by long prefill jobs, violating latency SLOs. In multi-tenant clusters this becomes a customer satisfaction issue. The right balance depends on whether the service optimizes for cost efficiency through high utilization or user experience through strict latency guarantees.
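One way to protect short requests is a per-step prefill token budget with shortest-remaining-work-first admission, sketched below. The `Request` type, `PREFILL_TOKEN_BUDGET` value, and the scheduling policy are illustrative assumptions, not any particular engine's implementation.

```python
# Toy admission-control sketch: each scheduler step caps how many prefill
# tokens run alongside decode, so a long document cannot starve short chats.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: str
    remaining_prefill: int            # prompt tokens not yet prefilled

PREFILL_TOKEN_BUDGET = 512            # assumed cap on prefill tokens per scheduler step

def schedule_step(queue: deque) -> list:
    """Admit prefill chunks shortest-remaining-work first, within the budget."""
    budget, work = PREFILL_TOKEN_BUDGET, []
    pending = sorted(queue, key=lambda r: r.remaining_prefill)
    queue.clear()
    for req in pending:
        chunk = min(req.remaining_prefill, budget)
        if chunk:
            work.append((req.rid, chunk))
            req.remaining_prefill -= chunk
            budget -= chunk
        if req.remaining_prefill:
            queue.append(req)          # unfinished prefill re-enters the queue
    return work

q = deque([Request("long-doc", 10_000), Request("chat-1", 40), Request("chat-2", 60)])
print(schedule_step(q))                # chat turns admitted in full; long-doc gets the leftover budget
```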
💡 Key Takeaways
• Larger batches increase tokens-per-second throughput but also increase per-request latency as sequences wait longer; continuous batching trades TTFT for overall utilization
• KV caching eliminates quadratic recomputation but consumes large memory; eviction forces recomputation or truncation; sliding windows cap memory but lose long-range context
• KV quantization to FP8 or INT8 cuts memory 2x to 4x but risks quality loss on rare dependencies; cache eviction can harm coherence in long conversations, with silent degradation
• Speculative decoding adds deployment complexity with a draft model and verification; below a 30 to 40 percent acceptance rate, overhead outweighs the speedup and increases latency
• Mixing workloads improves utilization but complicates fairness; short requests delayed by long prefill violate SLOs without admission control and scheduling policies
• Cost versus quality: a smaller target model with speculation can match a larger model's throughput; a limited budget and strict SLAs may favor a smaller model with batching over a large model with speculation
📌 Examples
Batch size 64 achieves 10,000 tokens/second but 200ms per-request latency; batch size 16 yields 6,000 tokens/second but 80ms latency for latency-sensitive applications
Full KV cache for 64 sequences at 800 tokens uses 25.6 GB; FP8 quantization reduces this to 12.8 GB, doubling concurrency but causing a 2% quality drop on reasoning tasks
Speculative decoding with 70% acceptance gives 1.8x speedup; domain shift reduces acceptance to 25%, and verification overhead causes a 1.2x slowdown versus standard decode (see the speedup sketch after these examples)
Multi-tenant cluster: a long 10,000-token prefill takes 5 seconds, blocking 50 short requests and violating a 500ms TTFT SLA; chunking prefill to 512 tokens keeps TTFT under 200ms
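The speculative decoding figures above follow the shape of a standard expected-speedup estimate. The sketch below uses the usual geometric acceptance model with assumed values for draft length and relative draft cost, so it reproduces the qualitative crossover rather than the exact numbers.

```python
# Expected-speedup estimate for speculative decoding, assuming an independent
# per-token acceptance probability. Draft length and relative draft cost are
# assumed values for illustration.

def expected_speedup(acceptance: float,
                     draft_len: int = 4,
                     draft_cost_ratio: float = 0.1) -> float:
    """Expected tokens emitted per verification round, divided by the round's
    cost relative to one plain target-model decode step."""
    # Expected tokens per round, counting the bonus token from the target model.
    tokens_per_round = (1 - acceptance ** (draft_len + 1)) / (1 - acceptance)
    cost_per_round = draft_len * draft_cost_ratio + 1.0   # k draft steps + 1 target step
    return tokens_per_round / cost_per_round

for rate in (0.70, 0.40, 0.25):
    print(f"acceptance={rate:.2f}  estimated speedup={expected_speedup(rate):.2f}x")
```

Under these assumptions the estimate drops below 1.0x as acceptance approaches the low end; real systems also pay verification-kernel and scheduling overhead, which pushes the break-even point up toward the 30 to 40 percent range cited above.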