Model Quantization (INT8, FP16, Mixed Precision)

Quantization Failure Modes and Mitigation Strategies

Quantization introduces several failure modes that can silently degrade accuracy or eliminate performance gains. Poor calibration is the most common issue. Using min-max ranges across the full calibration dataset makes scales sensitive to outliers: a single anomalous activation at 1000x the typical magnitude forces all normal values into a tiny fraction of the integer range, losing effective precision. Clipping outliers at the 99.9th percentile or using exponential moving averages for ranges reduces quantization error by 30 to 50 percent.

Transformers exhibit heavy-tailed activation distributions, especially in attention and residual connections, where a few channels have values 100 to 1000 times larger than the median. Quantizing these activations to INT8 causes large errors because the scale must accommodate the outliers. The solution is weight-only quantization with FP16 or BF16 activations, or selectively keeping outlier-prone layers like layer normalization and softmax in higher precision. Production LLM serving systems commonly use INT4 or INT8 weights with FP16 activations to avoid this failure mode.

Hardware fallback is a subtle trap. On devices without efficient low-bit kernels, quantized models may dequantize to FP32 at runtime or fall back to scalar code, eliminating speedups. Always profile end-to-end latency after quantization; if INT8 inference shows no improvement on your target hardware, the runtime is likely missing optimized kernels.

Distribution shift between calibration and production can cause activation saturation, where production inputs exceed calibrated ranges and clip to the integer limits. Monitor activation statistics in production and refresh calibration when data drift is detected.
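The effect of outlier-sensitive calibration can be shown in a few lines. Below is a minimal NumPy sketch (the function names are illustrative, not from any quantization library) comparing a symmetric INT8 scale derived from the full min-max range against one derived from a 99.9th-percentile clipped range, on a calibration batch containing a single 1000x outlier.

```python
import numpy as np

def int8_scale_minmax(activations: np.ndarray) -> float:
    """Naive symmetric scale from the full min/max range."""
    return np.abs(activations).max() / 127.0

def int8_scale_percentile(activations: np.ndarray, pct: float = 99.9) -> float:
    """Symmetric scale from a clipped range that ignores extreme outliers."""
    return np.percentile(np.abs(activations), pct) / 127.0

def quantize_dequantize(x: np.ndarray, scale: float) -> np.ndarray:
    """Round-trip through INT8 to measure the error a given scale introduces."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale

# Calibration batch: mostly small activations plus one anomalous 1000x outlier.
rng = np.random.default_rng(0)
acts = rng.normal(0.0, 3.0, size=100_000)
acts[0] = 1000.0

for name, scale in [("min-max", int8_scale_minmax(acts)),
                    ("99.9th percentile", int8_scale_percentile(acts))]:
    err = np.abs(quantize_dequantize(acts, scale) - acts).mean()
    print(f"{name:>18}: scale={scale:.3f}  mean abs error={err:.4f}")
```

The min-max scale is dictated by the single outlier, so every typical activation lands in only a handful of integer codes; the percentile scale accepts a large error on that one value in exchange for far lower average error.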
💡 Key Takeaways
Min-max calibration with outliers wastes 90 percent of the integer codes: if one activation reaches 1000 but most are under 10, the scale is 1000/255 = 3.9, losing precision for typical values, which quantize to only 0 to 2
Transformers show 6 to 8 standard deviation outliers in 0.1 percent of activation channels, making full INT8 quantization increase perplexity by 15 to 20 percent versus 2 percent with weight-only quantization
Hardware without native INT8 support can show a 1.2x slowdown after quantization due to dequantization overhead and scalar fallback; verify kernel availability before deployment
FP16 gradient underflow during mixed precision training causes NaNs within 1000 to 5000 steps if the loss scale is too low, requiring dynamic scaling that starts at 2^10 to 2^15
BF16 eliminates most underflow but can still cause instability in poorly conditioned problems; the fix is keeping layer norm and loss computation in FP32 while using BF16 for matmuls
Per-channel quantization for weights reduces error by 20 to 40 percent versus per-tensor in convolutional and transformer layers where output channel magnitudes vary by 10 to 100 times (see the sketch after this list)
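The per-channel takeaway can be illustrated with another small NumPy sketch (names are illustrative). It quantizes a weight matrix whose output channels span roughly two orders of magnitude, once with a single per-tensor scale and once with one scale per output channel.

```python
import numpy as np

def quantize_weights(w: np.ndarray, per_channel: bool) -> np.ndarray:
    """Symmetric INT8 quantize-dequantize of a weight matrix (out_channels x in_features)."""
    if per_channel:
        # One scale per output channel (row), so small-magnitude rows keep their precision.
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    else:
        # A single scale for the whole tensor, dominated by the largest channel.
        scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

# Weight matrix whose output channels differ in magnitude by ~100x.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 256)) * np.logspace(-1, 1, 8)[:, None]

for mode in (False, True):
    err = np.abs(quantize_weights(w, per_channel=mode) - w).mean()
    print(f"per_channel={mode}: mean abs error={err:.5f}")
```

With a per-tensor scale, the small-magnitude channels are quantized with a step size sized for the largest channel and lose most of their precision; per-channel scales remove that coupling.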
📌 Examples
GPT-3 INT8 PTQ: Naive per-tensor activation quantization causes an 18 percent perplexity increase on WikiText. Switching to weight-only INT8 with FP16 activations reduces degradation to 2 percent while maintaining 3.5x memory savings
BERT fine-tuning with FP16: Training diverges at step 2400 with a static loss scale of 128. Enabling dynamic loss scaling starting at 1024 and doubling every 2000 stable steps completes training with an F1 score identical to FP32 (see the sketch after these examples)
ResNet-50 on mobile NPU: INT8 PTQ with percentile clipping (99.95th) achieves 74.2 percent top-1 accuracy versus 76.1 percent for FP32, but min-max calibration drops to 71.8 percent due to outlier sensitivity
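For the BERT example above, dynamic loss scaling maps directly onto PyTorch's mixed precision utilities. The sketch below assumes a CUDA device and uses a toy linear model and random data as stand-ins for the actual fine-tuning setup; the point is the GradScaler configuration and the scale/step/update pattern, not the model.

```python
import torch

# Toy model and data stand in for the BERT fine-tuning setup described above.
model = torch.nn.Linear(128, 2).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# Dynamic loss scaling: start at 2^10 and double the scale after every 2000
# consecutive steps without inf/NaN gradients; halve it whenever they appear.
scaler = torch.cuda.amp.GradScaler(init_scale=2**10,
                                   growth_factor=2.0,
                                   backoff_factor=0.5,
                                   growth_interval=2000)

for step in range(10):
    x = torch.randn(32, 128, device="cuda")
    y = torch.randint(0, 2, (32,), device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():       # forward pass and matmuls run in FP16
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()         # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)                # unscales gradients; skips the step on inf/NaN
    scaler.update()                       # grows or backs off the scale dynamically
```

Aside from `init_scale`, the growth and backoff settings shown are the library defaults; they match the "double every 2000 stable steps" behavior described in the example.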