
What is Model Quantization and When Does It Actually Speed Up Inference?

Quantization reduces the numerical precision of model weights and activations from 32-bit or 16-bit floating point down to 8-bit or even 4-bit integers. This shrinks model size, reduces memory-bandwidth requirements, and can enable faster computation when specialized low-precision hardware instructions are available. The key insight is that large language model decoding is typically memory-bandwidth bound rather than compute bound: the bottleneck is reading billions of parameters from memory, not the arithmetic itself. Reducing precision from FP16 to INT8 cuts memory traffic in half, which directly improves latency when bandwidth is the constraint.

The theoretical compute advantage is substantial on modern accelerators. A Tensor Processing Unit (TPU) or Graphics Processing Unit (GPU) might deliver approximately 378 teraflops (TFLOPS) at FP32 precision, 756 TFLOPS at FP16, and roughly 1,513 tera-operations per second (TOPS) at INT8, nearly 4× the FP32 rate. However, realizing this speedup depends entirely on whether your workload is compute bound or memory bound. During autoregressive decoding, where each token generation reads the full model weights but performs relatively little math, weight-only quantization helps primarily by reducing bytes transferred, not by exploiting the higher INT8 compute rate. You might see a 1.5× to 2× speedup rather than 4×.

Weight-only quantization is the first step and the safest approach: quantize the model parameters to INT8 or INT4 while keeping activations at higher precision. This immediately cuts model size and memory bandwidth with minimal accuracy loss. Weight-plus-activation quantization requires more care because activation distributions can contain outliers that cause large quantization errors, especially in attention layers. Per-channel or per-group scaling helps by using different quantization parameters for each output channel or small group of weights, adapting to local statistics. Calibration on a representative dataset is essential for choosing good scaling factors.

KV cache quantization applies the same principles to the attention cache, storing keys and values at FP8 or INT8 instead of FP16. For a 7B model where each token uses 0.5 MB of KV cache at FP16, quantizing to INT8 cuts this to 0.25 MB per token, doubling the number of concurrent sessions that fit in memory. The risk is accumulated error over long sequences: quantization noise can compound as the cache grows, degrading quality especially in the later tokens of a 4,000 or 8,000 token context. Production systems at Google and Meta validate late-token quality carefully and sometimes keep particularly sensitive layers at higher precision.
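To make the weight-only path concrete, here is a minimal sketch of per-channel symmetric INT8 weight quantization in NumPy. The function names, the 4096×4096 layer shape, and the symmetric [-127, 127] scheme are illustrative assumptions rather than any particular serving stack's API; real deployments typically rely on a framework's built-in quantizer plus calibration data.

```python
# Minimal sketch: per-channel, weight-only symmetric INT8 quantization.
# Names and shapes are illustrative, not from a specific library.
import numpy as np

def quantize_weights_int8(w_fp16: np.ndarray):
    """Quantize an [out_features, in_features] weight matrix per output channel."""
    # One scale per output channel (row), adapted to that channel's range.
    max_abs = np.abs(w_fp16).max(axis=1, keepdims=True)       # [out, 1]
    scale = max_abs / 127.0                                    # symmetric range [-127, 127]
    scale = np.where(scale == 0, 1.0, scale)                   # avoid divide-by-zero
    w_int8 = np.clip(np.round(w_fp16 / scale), -127, 127).astype(np.int8)
    return w_int8, scale.astype(np.float16)

def dequantize_matmul(x_fp16: np.ndarray, w_int8: np.ndarray, scale: np.ndarray):
    """Weight-only scheme: dequantize weights on the fly, keep activations in FP16."""
    w_fp16 = w_int8.astype(np.float16) * scale
    return x_fp16 @ w_fp16.T

# Storage drops from 2 bytes/weight (FP16) to 1 byte/weight (INT8) plus a small
# per-channel scale, roughly halving memory traffic during decoding.
w = np.random.randn(4096, 4096).astype(np.float16)
w_q, s = quantize_weights_int8(w)
err = np.abs(w - w_q.astype(np.float16) * s).max()
print(f"max abs quantization error: {err:.4f}")
```

The per-channel scales are what absorb outlier rows: a single channel with a large dynamic range gets its own scale instead of stretching the quantization grid for the whole matrix.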
💡 Key Takeaways
Quantization reduces precision from FP32 or FP16 to INT8 or INT4, cutting model size and memory bandwidth; effectiveness depends on whether the workload is memory-bound or compute-bound
Theoretical INT8 throughput is roughly 4× the FP32 rate (about 1,513 TOPS vs 378 TFLOPS), but memory-bound decoding sees only 1.5× to 2× speedup because the bottleneck is data transfer, not arithmetic
Weight-only quantization is the safest first step, cutting model size by 50% at INT8 with minimal accuracy loss; weight-plus-activation quantization requires calibration and per-channel scaling to handle outliers
KV cache quantization from FP16 to INT8 reduces per-token memory from 0.5 MB to 0.25 MB for 7B models, doubling concurrent sessions but risking accumulated error in long contexts (a sizing sketch follows this list)
Activation outliers in attention and feedforward layers cause large quantization errors; per-channel or per-group scaling adapts quantization parameters to local statistics and improves quality
Production systems at Google and Meta validate late-token quality after KV quantization and selectively keep sensitive layers at higher precision to avoid degradation at 4k to 8k token contexts
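As referenced in the KV cache takeaway above, the 0.5 MB-per-token figure can be sanity-checked with simple arithmetic. The sketch below assumes a typical 7B-class architecture (32 layers, 32 KV heads, head dimension 128, no grouped-query attention); those architecture numbers are illustrative assumptions, not values stated in this article.

```python
# Worked check of the "0.5 MB of KV cache per token at FP16" figure,
# under assumed 7B-class architecture parameters.
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    # 2x for storing both keys and values at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

fp16 = kv_cache_bytes_per_token(32, 32, 128, bytes_per_value=2)
int8 = kv_cache_bytes_per_token(32, 32, 128, bytes_per_value=1)
print(f"FP16: {fp16 / 1e6:.2f} MB/token, INT8: {int8 / 1e6:.2f} MB/token")
# FP16: ~0.52 MB/token, INT8: ~0.26 MB/token -> halving the cache roughly
# doubles how many concurrent sessions fit in a fixed memory budget.
```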
📌 Examples
Amazon serving stacks use INT8 weight quantization to fit larger models on the same hardware, achieving a 1.8× speedup in memory-bound decoding with less than 1% accuracy drop after calibration (a rough bandwidth-only latency model follows these examples)
A 7B model with FP16 weights (14 GB) quantized to INT8 (7 GB) fits two model replicas on a single 24 GB GPU, doubling serving capacity without additional hardware cost
Netflix experimentation with KV cache quantization showed that 0.25 MB per token at INT8 enables 8 concurrent 2k-token sessions on a 24 GB GPU versus 4 sessions at FP16, but required a fallback to FP16 for sequences exceeding 6k tokens due to quality drift
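To see why weight-only INT8 yields roughly 1.5× to 2× rather than 4× in cases like the first example above, a bandwidth-only latency model is enough. The 7B parameter count and the ~1 TB/s memory bandwidth below are illustrative assumptions, not measured figures; the point is that halving the bytes read per token at best halves memory-bound decode time.

```python
# Rough, bandwidth-only model of autoregressive decode latency: when decoding is
# memory-bound, time per token ~ (bytes read per token) / (memory bandwidth).
def decode_ms_per_token(num_params: float, bytes_per_weight: float,
                        mem_bandwidth_gb_s: float) -> float:
    bytes_read = num_params * bytes_per_weight            # full weights read each step
    return bytes_read / (mem_bandwidth_gb_s * 1e9) * 1e3  # seconds -> milliseconds

fp16 = decode_ms_per_token(7e9, 2.0, 1000)   # ~14 ms/token at FP16 (assumed 1 TB/s)
int8 = decode_ms_per_token(7e9, 1.0, 1000)   # ~7 ms/token with INT8 weights
print(f"FP16: {fp16:.1f} ms, INT8: {int8:.1f} ms, ideal speedup: {fp16 / int8:.1f}x")
# Real systems land closer to 1.5-2x because dequantization overhead, attention,
# and KV-cache reads are not halved by weight-only quantization.
```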