
What is Model Quantization and When Does It Actually Speed Up Inference?

Quantization reduces the numerical precision of model weights and activations from 32-bit or 16-bit floating point down to 8-bit or even 4-bit integers. This shrinks model size, reduces memory-bandwidth requirements, and can enable faster computation when specialized low-precision hardware instructions are available. The key insight is that large language model decoding is typically memory-bandwidth bound rather than compute bound: the bottleneck is reading billions of parameters from memory, not the arithmetic itself. Reducing precision from FP16 to INT8 cuts memory traffic in half, which directly improves latency when bandwidth is the constraint.

The theoretical compute advantage is substantial on modern accelerators. A Tensor Processing Unit (TPU) or Graphics Processing Unit (GPU) might deliver approximately 378 teraflops (TFLOPS) at FP32 precision, 756 TFLOPS at FP16, and roughly 1,513 tera-operations per second (TOPS) at INT8, nearly 4× the FP32 rate. However, realizing this speedup depends entirely on whether your workload is compute bound or memory bound. During autoregressive decoding, where each token generation reads the full model weights but performs relatively little math, weight-only quantization helps primarily by reducing bytes transferred, not by exploiting the higher INT8 compute rate. You might see a 1.5× to 2× speedup rather than 4×.

Weight-only quantization is the first step and the safest approach: quantize the model parameters to INT8 or INT4 while keeping activations at higher precision. This immediately cuts model size and memory bandwidth with minimal accuracy loss. Weight-plus-activation quantization requires more care because activation distributions can contain outliers that cause large quantization errors, especially in attention layers. Per-channel or per-group scaling helps by using different quantization parameters for each output channel or small group of weights, adapting to local statistics. Calibration on a representative dataset is essential for choosing good scaling factors.

KV cache quantization applies the same principles to the attention cache, storing keys and values at FP8 or INT8 instead of FP16. For a 7B model where each token uses 0.5 MB of KV cache at FP16, quantizing to INT8 cuts this to 0.25 MB per token, doubling the number of concurrent sessions that fit in memory. The risk is accumulated error over long sequences: quantization noise can compound as the cache grows, degrading quality especially in the later tokens of a 4,000 or 8,000 token context. Production systems at Google and Meta validate late-token quality carefully and sometimes keep particularly sensitive layers at higher precision.
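To make the weight-only path concrete, here is a minimal sketch of per-channel symmetric INT8 weight quantization in NumPy. The function names, the 4096×4096 layer shape, and the symmetric [-127, 127] scheme are illustrative assumptions rather than any particular serving stack's API; real deployments typically rely on a framework's built-in quantizer plus calibration data.

```python
# Minimal sketch: per-channel, weight-only symmetric INT8 quantization.
# Names and shapes are illustrative, not from a specific library.
import numpy as np

def quantize_weights_int8(w_fp16: np.ndarray):
    """Quantize an [out_features, in_features] weight matrix per output channel."""
    # One scale per output channel (row), adapted to that channel's range.
    max_abs = np.abs(w_fp16).max(axis=1, keepdims=True)       # [out, 1]
    scale = max_abs / 127.0                                    # symmetric range [-127, 127]
    scale = np.where(scale == 0, 1.0, scale)                   # avoid divide-by-zero
    w_int8 = np.clip(np.round(w_fp16 / scale), -127, 127).astype(np.int8)
    return w_int8, scale.astype(np.float16)

def dequantize_matmul(x_fp16: np.ndarray, w_int8: np.ndarray, scale: np.ndarray):
    """Weight-only scheme: dequantize weights on the fly, keep activations in FP16."""
    w_fp16 = w_int8.astype(np.float16) * scale
    return x_fp16 @ w_fp16.T

# Storage drops from 2 bytes/weight (FP16) to 1 byte/weight (INT8) plus a small
# per-channel scale, roughly halving memory traffic during decoding.
w = np.random.randn(4096, 4096).astype(np.float16)
w_q, s = quantize_weights_int8(w)
err = np.abs(w - w_q.astype(np.float16) * s).max()
print(f"max abs quantization error: {err:.4f}")
```

The per-channel scales are what absorb outlier rows: a single channel with a large dynamic range gets its own scale instead of stretching the quantization grid for the whole matrix.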
💡 Key Takeaways
Quantization reduces precision from FP32 or FP16 to INT8 or INT4, cutting model size and memory bandwidth; effectiveness depends on whether the workload is memory-bound or compute-bound
Theoretical INT8 throughput is roughly 4× the FP32 rate (about 1,513 TOPS vs 378 TFLOPS), but memory-bound decoding sees only 1.5× to 2× speedup because the bottleneck is data transfer, not arithmetic
Weight-only quantization is the safest first step, cutting model size by 50% at INT8 with minimal accuracy loss; weight-plus-activation quantization requires calibration and per-channel scaling to handle outliers
KV cache quantization from FP16 to INT8 reduces per-token memory from 0.5 MB to 0.25 MB for 7B models, doubling concurrent sessions but risking accumulated error in long contexts (a sizing sketch follows this list)
Activation outliers in attention and feedforward layers cause large quantization errors; per-channel or per-group scaling adapts quantization parameters to local statistics and improves quality
Production systems at Google and Meta validate late-token quality after KV quantization and selectively keep sensitive layers at higher precision to avoid degradation at 4k to 8k token contexts
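As referenced in the KV cache takeaway above, the 0.5 MB-per-token figure can be sanity-checked with simple arithmetic. The sketch below assumes a typical 7B-class architecture (32 layers, 32 KV heads, head dimension 128, no grouped-query attention); those architecture numbers are illustrative assumptions, not values stated in this article.

```python
# Worked check of the "0.5 MB of KV cache per token at FP16" figure,
# under assumed 7B-class architecture parameters.
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    # 2x for storing both keys and values at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

fp16 = kv_cache_bytes_per_token(32, 32, 128, bytes_per_value=2)
int8 = kv_cache_bytes_per_token(32, 32, 128, bytes_per_value=1)
print(f"FP16: {fp16 / 1e6:.2f} MB/token, INT8: {int8 / 1e6:.2f} MB/token")
# FP16: ~0.52 MB/token, INT8: ~0.26 MB/token -> halving the cache roughly
# doubles how many concurrent sessions fit in a fixed memory budget.
```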
📌 Examples
Amazon serving stacks use INT8 weight quantization to fit larger models on the same hardware, achieving a 1.8× speedup in memory-bound decoding with less than 1% accuracy drop after calibration (a rough bandwidth-only latency model follows these examples)
A 7B model with FP16 weights (14 GB) quantized to INT8 (7 GB) fits two model replicas on a single 24 GB GPU, doubling serving capacity without additional hardware cost
Netflix experimentation with KV cache quantization showed that 0.25 MB per token at INT8 enables 8 concurrent 2k-token sessions on a 24 GB GPU versus 4 sessions at FP16, but required a fallback to FP16 for sequences exceeding 6k tokens due to quality drift
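To see why weight-only INT8 yields roughly 1.5× to 2× rather than 4× in cases like the first example above, a bandwidth-only latency model is enough. The 7B parameter count and the ~1 TB/s memory bandwidth below are illustrative assumptions, not measured figures; the point is that halving the bytes read per token at best halves memory-bound decode time.

```python
# Rough, bandwidth-only model of autoregressive decode latency: when decoding is
# memory-bound, time per token ~ (bytes read per token) / (memory bandwidth).
def decode_ms_per_token(num_params: float, bytes_per_weight: float,
                        mem_bandwidth_gb_s: float) -> float:
    bytes_read = num_params * bytes_per_weight            # full weights read each step
    return bytes_read / (mem_bandwidth_gb_s * 1e9) * 1e3  # seconds -> milliseconds

fp16 = decode_ms_per_token(7e9, 2.0, 1000)   # ~14 ms/token at FP16 (assumed 1 TB/s)
int8 = decode_ms_per_token(7e9, 1.0, 1000)   # ~7 ms/token with INT8 weights
print(f"FP16: {fp16:.1f} ms, INT8: {int8:.1f} ms, ideal speedup: {fp16 / int8:.1f}x")
# Real systems land closer to 1.5-2x because dequantization overhead, attention,
# and KV-cache reads are not halved by weight-only quantization.
```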