What is Model Quantization?
Quantization is the process of representing a model's weights (and sometimes activations) in lower-precision number formats, trading a small amount of accuracy for large savings in memory and compute.
Why Quantization Matters
Neural network weights are typically stored as 32-bit floating point numbers. A 1-billion-parameter model needs 4 GB for weights alone; quantizing to INT8 cuts this to 1 GB. Smaller models load faster, fit on cheaper hardware, and process inputs more quickly.
Memory bandwidth is often the bottleneck: GPUs can compute faster than memory can feed them. Quantized models transfer less data between memory and compute units, so the hardware stays busy doing useful work instead of waiting for data.
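The memory figures above are simple arithmetic: parameter count times bytes per value. A minimal sketch (the function name is illustrative):

```python
# Back-of-envelope weight memory for a 1-billion-parameter model
# at different precisions. Weights only; activations, optimizer
# state, and KV caches add more on top.

def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Return weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

params = 1_000_000_000
for name, size in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{name}: {weight_memory_gb(params, size):.1f} GB")
# FP32: 4.0 GB, FP16: 2.0 GB, INT8: 1.0 GB
```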
Precision Levels
FP32 (32-bit float): Full precision training default. 4 bytes per value. Maximum accuracy but slowest and largest.
FP16/BF16 (16-bit float): Half precision. 2 bytes per value. 2x memory reduction, 2x+ speedup on modern GPUs. Minimal accuracy loss for most models. (BF16 keeps FP32's exponent range, trading mantissa precision; FP16 keeps more mantissa bits but a narrower range.)
INT8 (8-bit integer): Quarter precision. 1 byte per value. 4x memory reduction, 4x+ speedup potential. Requires careful calibration to maintain accuracy.
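The sizes and representable ranges of these formats can be inspected directly with NumPy (note that NumPy has no native BF16 type, so only FP32, FP16, and INT8 are shown):

```python
import numpy as np

# Bytes per value and representable range for each precision level.
for dtype in (np.float32, np.float16, np.int8):
    info = (np.finfo(dtype) if np.issubdtype(dtype, np.floating)
            else np.iinfo(dtype))
    size = np.dtype(dtype).itemsize
    print(f"{np.dtype(dtype).name}: {size} byte(s), "
          f"range {info.min} to {info.max}")
# int8 spans only -128 to 127 -- the 256 values discussed below.
```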
The Accuracy Trade-off
Lower precision means fewer distinct values can be represented. FP32 can encode roughly 4 billion distinct values; INT8 has only 256. The quantization process maps continuous weights to this limited set. Done well, accuracy typically drops 0.5-2%. Done poorly, accuracy can collapse entirely.
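The mapping from continuous weights to 256 integer levels can be sketched with symmetric per-tensor quantization, one common scheme: a single scale factor maps the largest weight magnitude to 127, and every weight is rounded to the nearest integer step. This is a minimal illustration; the function names are hypothetical, and production systems often use per-channel scales and calibration data.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: floats -> int8 via one scale."""
    scale = np.abs(weights).max() / 127.0   # largest magnitude maps to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1000).astype(np.float32)
q, scale = quantize_int8(w)
error = np.abs(w - dequantize(q, scale)).max()
# Round-trip error is bounded by half a quantization step (scale / 2).
print(f"max round-trip error: {error:.4f} (step size {scale:.4f})")
```

Rounding error is at most half a step, which is why a well-chosen scale keeps accuracy loss small; a single outlier weight, by contrast, inflates the scale and wastes most of the 256 levels, which is one way quantization "done poorly" collapses accuracy.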