What is Model Quantization and When Does It Actually Speed Up Inference?
What Quantization Does
Quantization reduces the numerical precision of model weights and activations from 32-bit or 16-bit floating point down to 8-bit or even 4-bit integers. This shrinks model size, reduces memory bandwidth requirements, and can enable faster computation when specialized low-precision hardware instructions are available. The key insight is that LLM decoding is typically memory-bandwidth bound rather than compute bound: the bottleneck is reading billions of parameters from memory, not the arithmetic operations themselves. Reducing precision from FP16 to INT8 halves the bytes moved per decode step, which directly improves latency.
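To make the mechanics concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy. The function names are illustrative, not a real library API:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor INT8 quantization: map the largest |w| to 127."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the integer codes."""
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)

print(w.nbytes, q.nbytes)  # 16384 4096 -- INT8 storage is 4x smaller than FP32
```

Rounding to the nearest integer code bounds the per-weight reconstruction error at half the scale, which is why moderate bit-width reduction costs so little accuracy.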
Hardware Acceleration
The theoretical compute advantage is substantial on modern accelerators. A TPU or GPU might deliver approximately 378 TFLOPS at FP32, 756 TFLOPS at FP16, and 1,513 TFLOPS at INT8, nearly 4x the FP32 rate. However, realizing this speedup depends entirely on whether your workload is compute bound or memory bound. During autoregressive decoding, where generating each token reads the full model weights but performs relatively little arithmetic, weight-only quantization helps primarily by reducing bytes transferred, not by exploiting the higher INT8 compute rate. In practice you might see a 1.5x to 2x speedup rather than 4x.
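A back-of-the-envelope roofline model shows why decoding is memory bound and why the speedup lands near 2x. The sketch below uses the FP16 and INT8 peak rates quoted above; the 1.5 TB/s memory bandwidth and the 7B parameter count are assumed for illustration:

```python
def decode_token_time(params, bytes_per_param, peak_flops, mem_bw):
    """Roofline estimate for one decode step at batch size 1:
    the step takes the longer of its compute time and its memory time."""
    flops = 2 * params                         # one multiply-add per weight
    compute_s = flops / peak_flops
    memory_s = params * bytes_per_param / mem_bw
    return max(compute_s, memory_s), compute_s, memory_s

MEM_BW = 1.5e12                                # assumed 1.5 TB/s HBM bandwidth
t_fp16, c, m = decode_token_time(7e9, 2, 756e12, MEM_BW)   # FP16 weights
t_int8, _, _ = decode_token_time(7e9, 1, 1513e12, MEM_BW)  # INT8 weights

print(f"FP16: {t_fp16*1e3:.2f} ms/token, memory time is {m/c:.0f}x compute time")
print(f"INT8 weights: {t_int8*1e3:.2f} ms/token -> {t_fp16/t_int8:.1f}x speedup")
```

Memory time dominates compute time by several hundredfold here, so halving bytes per parameter halves latency, and the 4x INT8 compute rate is never the limiting factor.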
Quantization Strategies
Weight-only quantization is the safest first step: quantize the model parameters to INT8 or INT4 while keeping activations at higher precision. This immediately cuts model size and memory bandwidth with minimal accuracy loss. Quantizing activations as well requires more care, because activation distributions can contain outliers that cause large quantization errors, especially in attention layers. Per-channel or per-group scaling helps by assigning separate quantization parameters to each output channel or to small groups of weights.
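To see why finer-grained scales help with outliers, this sketch compares per-tensor and per-channel symmetric INT8 quantization on a weight matrix with one high-magnitude channel. All names are illustrative:

```python
import numpy as np

def quantize_per_tensor(w: np.ndarray):
    """One scale shared across the whole tensor."""
    scale = np.abs(w).max() / 127.0
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8), scale

def quantize_per_channel(w: np.ndarray):
    """One scale per output channel (row), so outliers stay isolated."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    return np.clip(np.round(w / scales), -127, 127).astype(np.int8), scales

w = np.random.randn(8, 512).astype(np.float32)
w[0] *= 50.0  # one outlier channel inflates the shared per-tensor scale

qt, st = quantize_per_tensor(w)
qc, sc = quantize_per_channel(w)
err_tensor = np.abs(w - qt.astype(np.float32) * st).mean()
err_channel = np.abs(w - qc.astype(np.float32) * sc).mean()
print(err_channel < err_tensor)  # -> True: the outlier no longer hurts other rows
```

With a shared scale, the outlier channel forces a coarse step size on every row; per-channel scales confine that cost to the outlier row alone.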
KV Cache Quantization
Storing keys and values at FP8 or INT8 instead of FP16 halves KV cache memory. For a 7B model where each token uses 0.5 MB of KV cache at FP16, quantizing to INT8 cuts this to 0.25 MB per token, doubling the number of concurrent sessions that fit in memory. The risk is accumulated error over long sequences: quantization noise can compound as the cache grows, degrading quality especially in the later tokens of a 4,000 or 8,000 token context.
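The arithmetic behind these figures can be sketched as follows, assuming a 7B-class shape of 32 layers and 32 KV heads of dimension 128 with no grouped-query attention (the memory budget is also assumed for illustration):

```python
def kv_cache_bytes_per_token(layers, n_kv_heads, head_dim, bytes_per_elem):
    """Keys and values (the factor of 2) stored at every layer, per token."""
    return 2 * layers * n_kv_heads * head_dim * bytes_per_elem

fp16 = kv_cache_bytes_per_token(32, 32, 128, 2)
int8 = kv_cache_bytes_per_token(32, 32, 128, 1)
print(fp16 / 2**20)  # 0.5  MB per token at FP16
print(int8 / 2**20)  # 0.25 MB per token at INT8

# Concurrent 4,096-token sessions fitting in an assumed 40 GB KV budget:
budget = 40 * 2**30
print(budget // (4096 * fp16), budget // (4096 * int8))  # 20 40
```

The per-token cost scales linearly with bytes per element, so any halving of KV precision translates directly into twice the resident context, whether that is longer sequences or more concurrent sessions.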