Model Quantization (INT8, FP16, Mixed Precision)

What is Model Quantization?

Model quantization reduces the precision of neural network weights and activations from higher precision formats like FP32 (32 bit floating point) to lower precision formats like INT8 (8 bit integer) or FP16 (16 bit floating point). The technique uses a scale factor and zero point to map continuous floating point values into discrete integer bins, allowing integer arithmetic to approximate the original operations.

The core formula is straightforward. Given a tensor x with minimum and maximum values, compute the scale S = (x_max − x_min) / integer_range, choose a zero point Z to align integer zero with a floating point value, then quantize with x_q = round(x/S + Z). To dequantize, compute x = S × (x_q − Z). This mapping lets hardware perform cheaper integer operations while preserving most of the model's representational capacity.

Quantization delivers concrete efficiency gains. INT8 reduces model size by 4 times compared to FP32 and typically provides 2 to 4 times faster inference on hardware with native 8 bit support. FP16 cuts memory by 2 times and can increase arithmetic throughput by up to 16 times on NVIDIA A100 GPUs with Tensor Cores. These improvements translate directly to lower serving costs and faster response times in production systems.
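A minimal NumPy sketch of this round trip, assuming signed INT8 with a per-tensor scale and zero point (the helper names and the random test tensor are illustrative, not tied to any particular framework):

```python
# Minimal sketch of asymmetric INT8 quantization with NumPy (illustrative only;
# production frameworks such as PyTorch or TensorRT handle this internally).
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map a float32 tensor onto INT8 using a scale S and zero point Z."""
    qmin, qmax = -128, 127                          # signed 8 bit integer range
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin)         # S = (x_max - x_min) / integer_range
    zero_point = int(round(qmin - x_min / scale))   # Z aligns float 0.0 with an integer code
    x_q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.int8)
    return x_q, scale, zero_point

def dequantize_int8(x_q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximate float tensor: x ≈ S × (x_q − Z)."""
    return scale * (x_q.astype(np.float32) - zero_point)

# Round-trip a random weight tensor and inspect the quantization error.
w = np.random.randn(4, 4).astype(np.float32)
w_q, s, z = quantize_int8(w)
w_hat = dequantize_int8(w_q, s, z)
print("max abs error:", np.abs(w - w_hat).max())
```

The error printed at the end is the rounding noise introduced by mapping continuous values into 256 integer bins; it shrinks as the tensor's dynamic range shrinks or the bit width grows.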
💡 Key Takeaways
Quantization maps high precision tensors to lower bit formats using scale and zero point parameters, enabling integer arithmetic to approximate floating point operations
INT8 quantization achieves 4 times memory reduction and 2 to 4 times inference speedup on hardware with 8 bit support like CPUs and edge Neural Processing Units (NPUs)
FP16 provides 2 times memory savings and up to 16 times compute throughput on NVIDIA A100 Tensor Cores compared to FP32
Symmetric quantization fixes the zero point at zero for simplicity, while asymmetric quantization with a nonzero zero point better handles skewed distributions such as post-ReLU activations (see the sketch after this list)
Quantization error increases with lower precision: INT8 typically maintains accuracy within 1 to 2 percent of FP32, while INT4 requires careful calibration to avoid larger drops
📌 Examples
NVIDIA A100: FP16 mixed precision training delivers 1.5 to 3 times end to end speedup with unchanged convergence when using FP32 accumulation and loss scaling (see the training loop sketch after these examples)
Google TPUs: BF16 (Brain Float 16) training doubles throughput and halves memory versus FP32 while preserving FP32 exponent range for numerical stability
QLoRA fine tuning: a 65 billion parameter LLM quantized to 4 bit weights (NF4 format) cuts weight memory roughly 8 times versus FP32, shrinking about 130GB of FP16 weights to roughly 33GB and enabling fine tuning on a single 48GB GPU
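A minimal PyTorch sketch of the mixed precision training pattern referenced in the A100 example, using the standard torch.cuda.amp autocast and GradScaler utilities; the model, data, and hyperparameters are placeholders rather than any specific training recipe:

```python
# Sketch of FP16 mixed precision training with dynamic loss scaling in PyTorch.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()            # dynamic loss scaling
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(64, 512, device="cuda")     # placeholder batch
    y = torch.randint(0, 10, (64,), device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():             # eligible ops run in FP16,
        loss = loss_fn(model(x), y)             # accumulations stay in FP32
    scaler.scale(loss).backward()               # scale loss so FP16 grads don't underflow
    scaler.step(optimizer)                      # unscale grads; skip step on inf/nan
    scaler.update()                             # adjust the scale factor dynamically
```

The scaler multiplies the loss before backpropagation so small gradients stay representable in FP16, then unscales them before the optimizer step, which is the loss scaling mentioned above.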