Model Quantization (INT8, FP16, Mixed Precision)

What is Model Quantization?

Model quantization reduces the precision of neural network weights and activations from higher precision formats like FP32 (32 bit floating point) to lower precision formats like INT8 (8 bit integer) or FP16 (16 bit floating point). The technique uses a scale factor and zero point to map continuous floating point values into discrete integer bins, allowing integer arithmetic to approximate the original operations.

The core formula is straightforward. Given a tensor x with minimum and maximum values, compute the scale S = (x_max − x_min) / integer_range, choose a zero point Z to align integer zero with a floating point value, then quantize with x_q = round(x/S + Z). To dequantize, compute x = S × (x_q − Z). This mapping lets hardware perform cheaper integer operations while preserving most of the model's representational capacity.

Quantization delivers concrete efficiency gains. INT8 reduces model size by 4 times compared to FP32 and typically provides 2 to 4 times faster inference on hardware with native 8 bit support. FP16 cuts memory by 2 times and can increase arithmetic throughput by up to 16 times on NVIDIA A100 GPUs with Tensor Cores. These improvements translate directly to lower serving costs and faster response times in production systems.
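A minimal NumPy sketch of this round trip, assuming signed INT8 with a per-tensor scale and zero point (the helper names and the random test tensor are illustrative, not tied to any particular framework):

```python
# Minimal sketch of asymmetric INT8 quantization with NumPy (illustrative only;
# production frameworks such as PyTorch or TensorRT handle this internally).
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map a float32 tensor onto INT8 using a scale S and zero point Z."""
    qmin, qmax = -128, 127                          # signed 8 bit integer range
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin)         # S = (x_max - x_min) / integer_range
    zero_point = int(round(qmin - x_min / scale))   # Z aligns float 0.0 with an integer code
    x_q = np.clip(np.round(x / scale + zero_point), qmin, qmax).astype(np.int8)
    return x_q, scale, zero_point

def dequantize_int8(x_q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximate float tensor: x ≈ S × (x_q − Z)."""
    return scale * (x_q.astype(np.float32) - zero_point)

# Round-trip a random weight tensor and inspect the quantization error.
w = np.random.randn(4, 4).astype(np.float32)
w_q, s, z = quantize_int8(w)
w_hat = dequantize_int8(w_q, s, z)
print("max abs error:", np.abs(w - w_hat).max())
```

The error printed at the end is the rounding noise introduced by mapping continuous values into 256 integer bins; it shrinks as the tensor's dynamic range shrinks or the bit width grows.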
💡 Key Takeaways
Quantization maps high precision tensors to lower bit formats using scale and zero point parameters, enabling integer arithmetic to approximate floating point operations
INT8 quantization achieves 4 times memory reduction and 2 to 4 times inference speedup on hardware with 8 bit support like CPUs and edge Neural Processing Units (NPUs)
FP16 provides 2 times memory savings and up to 16 times compute throughput on NVIDIA A100 Tensor Cores compared to FP32
Symmetric quantization fixes the zero point at zero for simplicity, while asymmetric quantization with a nonzero zero point better handles skewed distributions such as post-ReLU activations (see the sketch after this list)
Quantization error increases with lower precision: INT8 typically maintains accuracy within 1 to 2 percent of FP32, while INT4 requires careful calibration to avoid larger drops
📌 Examples
NVIDIA A100: FP16 mixed precision training delivers 1.5 to 3 times end to end speedup with unchanged convergence when using FP32 accumulation and loss scaling (see the training loop sketch after these examples)
Google TPUs: BF16 (Brain Float 16) training doubles throughput and halves memory versus FP32 while preserving FP32 exponent range for numerical stability
QLoRA fine tuning: a 65 billion parameter LLM quantized to 4 bit weights (NF4 format) cuts weight memory roughly 8 times versus FP32, shrinking about 130GB of FP16 weights to roughly 33GB and enabling fine tuning on a single 48GB GPU
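A minimal PyTorch sketch of the mixed precision training pattern referenced in the A100 example, using the standard torch.cuda.amp autocast and GradScaler utilities; the model, data, and hyperparameters are placeholders rather than any specific training recipe:

```python
# Sketch of FP16 mixed precision training with dynamic loss scaling in PyTorch.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()            # dynamic loss scaling
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    x = torch.randn(64, 512, device="cuda")     # placeholder batch
    y = torch.randint(0, 10, (64,), device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():             # eligible ops run in FP16,
        loss = loss_fn(model(x), y)             # accumulations stay in FP32
    scaler.scale(loss).backward()               # scale loss so FP16 grads don't underflow
    scaler.step(optimizer)                      # unscale grads; skip step on inf/nan
    scaler.update()                             # adjust the scale factor dynamically
```

The scaler multiplies the loss before backpropagation so small gradients stay representable in FP16, then unscales them before the optimizer step, which is the loss scaling mentioned above.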