Model Quantization (INT8, FP16, Mixed Precision)

Post-Training Quantization vs. Quantization-Aware Training

Post-Training Quantization (PTQ) converts a fully trained model to lower precision without retraining. You select a small calibration dataset, run inference to collect activation statistics such as min and max values, compute scale and zero-point parameters per tensor or per channel (sketched below), then convert the weights and set up activation quantization. PTQ is fast to deploy, requiring only minutes to hours for calibration, and works well for models with robust features, such as CNNs on vision tasks.

Quantization-Aware Training (QAT) simulates quantization effects during training by inserting fake-quantization nodes into the forward pass. These nodes quantize and immediately dequantize values, injecting quantization noise while allowing gradients to flow via straight-through estimators. The model learns to compensate for the reduced precision, typically recovering the 1 to 2 percent accuracy that PTQ might lose. QAT requires full or partial retraining, adding days to training time, but it becomes essential for sensitive architectures such as transformers, where PTQ degrades quality.

The choice depends on your accuracy budget and timeline. For a ResNet50 image classifier, PTQ to INT8 might lose only 0.5 percent top-1 accuracy after 30 minutes of calibration. For a BERT language model, PTQ could drop F1 score by 3 to 5 percent, while QAT with 2 to 3 days of fine-tuning recovers that loss. Production systems often start with PTQ for speed, then invest in QAT if accuracy gaps appear.
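To make the PTQ calibration step concrete, here is a minimal, framework-free sketch of per-tensor affine INT8 quantization: the scale and zero point are derived from the min/max observed on a calibration sample, then reused to quantize and dequantize activations. The function names and the simple min/max calibration scheme are illustrative assumptions, not a specific library's API.

```python
import numpy as np

def calibrate_affine_uint8(calib_activations):
    # Derive a per-tensor scale and zero point from calibration min/max.
    x_min = min(float(calib_activations.min()), 0.0)  # range must include zero
    x_max = max(float(calib_activations.max()), 0.0)
    scale = (x_max - x_min) / 255.0 or 1.0            # guard against a zero range
    zero_point = int(round(-x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point):
    # Real-valued tensor -> unsigned 8-bit integers.
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    # Back to float; the difference from the original is the quantization error.
    return (q.astype(np.float32) - zero_point) * scale

# "Calibration pass": collect statistics on a small representative sample.
calib = np.random.randn(1000).astype(np.float32)
scale, zp = calibrate_affine_uint8(calib)

x = np.random.randn(8).astype(np.float32)
x_hat = dequantize(quantize(x, scale, zp), scale, zp)
print("max quantization error:", np.abs(x - x_hat).max())
```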
💡 Key Takeaways
PTQ requires only calibration data (typically 500 to 1000 samples) and completes in minutes to hours, making it ideal for rapid deployment without retraining infrastructure
QAT adds 10 to 30 percent to training time but recovers 1 to 2 percent accuracy loss, critical for transformers where PTQ can degrade BERT F1 scores by 3 to 5 percent
Dynamic PTQ computes activation scales per batch at runtime, adding 5 to 10 percent latency overhead but improving accuracy when calibration data is unrepresentative (see the dynamic quantization sketch after this list)
Static PTQ precomputes all scales offline for zero runtime overhead, suitable when calibration captures production distribution well, common in vision models like MobileNet
QAT uses straight-through estimators to backpropagate through the discrete quantization step, allowing the optimizer to find weight values that minimize loss under quantization noise (see the straight-through estimator sketch after this list)
Per-channel quantization of weights reduces error versus per-tensor quantization by 20 to 40 percent in layers with large weight variance across output channels (compared numerically in the last sketch after this list)
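The dynamic-versus-static distinction can be shown with PyTorch's built-in dynamic quantization: the snippet below converts the Linear layers of a toy model with torch.ao.quantization.quantize_dynamic, so weights become INT8 ahead of time while activation scales are computed on the fly for each batch. The toy model itself is an illustrative stand-in.

```python
import torch
from torch import nn

# Illustrative float model; any module containing nn.Linear layers works the same way.
model = nn.Sequential(
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
).eval()

# Dynamic PTQ: Linear weights are quantized to INT8 now; activation scales
# are computed per batch at inference time, so no calibration set is needed.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(4, 512)
print(quantized(x).shape)  # torch.Size([4, 10])
```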
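The straight-through estimator can be written as a small autograd function: the forward pass quantizes and immediately dequantizes (the "fake quantization" noise), while the backward pass forwards the incoming gradient as if the rounding were the identity. This is a minimal sketch, not a production QAT observer.

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Fake INT8 quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, x, scale, zero_point):
        # Quantize then immediately dequantize: the output stays float but
        # carries INT8-style rounding noise into the rest of the network.
        q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
        return (q - zero_point) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through: treat the rounding as identity, so the gradient
        # flows to x unchanged; scale and zero_point get no gradient here.
        return grad_output, None, None

# During QAT, the model applies this to weights/activations in forward().
x = torch.randn(4, requires_grad=True)
y = FakeQuantSTE.apply(x, 0.1, 0)
y.sum().backward()
print(x.grad)  # all ones: the gradient passed straight through the rounding
```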
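Finally, the per-channel versus per-tensor takeaway can be checked numerically: when output channels have very different weight magnitudes, one scale per channel preserves small-magnitude channels that a single tensor-wide scale would squeeze into only a few quantization levels. The synthetic weight matrix below is an illustrative construction.

```python
import numpy as np

def quantize_symmetric(w, scale):
    # Symmetric INT8: round to the nearest level, clip, then dequantize.
    return np.clip(np.round(w / scale), -127, 127) * scale

rng = np.random.default_rng(0)
# 64 output channels whose weight magnitudes span two orders of magnitude.
w = rng.standard_normal((64, 128)) * np.logspace(-2, 0, 64)[:, None]

scale_tensor = np.abs(w).max() / 127.0                        # one scale overall
scale_channel = np.abs(w).max(axis=1, keepdims=True) / 127.0  # one scale per row

err_tensor = np.abs(w - quantize_symmetric(w, scale_tensor)).mean()
err_channel = np.abs(w - quantize_symmetric(w, scale_channel)).mean()
print(f"per-tensor MAE:  {err_tensor:.6f}")
print(f"per-channel MAE: {err_channel:.6f}")  # noticeably smaller
```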
📌 Examples
YOLO object detection: PTQ to FP16 on a V100 GPU reduces latency from 15 ms to 9 ms with a 0.3 mAP (mean Average Precision) drop, while QAT fine-tuning for 5 epochs recovers the full mAP
GPT-style LLMs: PTQ to INT8 activations causes a perplexity increase of 15 to 20 percent, so production systems use weight-only INT4 quantization with FP16 activations to maintain quality (a group-wise sketch follows these examples)
MobileNetV3 on edge devices: Static PTQ to INT8 achieves a 2.5 times speedup on ARM NPUs with 0.5 percent accuracy loss, and no QAT is needed thanks to the robust architecture
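As a rough illustration of the weight-only approach mentioned in the LLM example, the sketch below quantizes a weight matrix to 4-bit integers with one FP16 scale per group of 128 values while leaving activations in floating point. The group size and symmetric scheme are illustrative assumptions, not a description of any specific production recipe.

```python
import numpy as np

GROUP_SIZE = 128  # illustrative; real deployments tune this per model

def quantize_weights_int4(w):
    # Symmetric group-wise INT4: values in [-8, 7], one FP16 scale per group.
    groups = w.reshape(-1, GROUP_SIZE)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)  # packed to 4 bits in practice
    return q, scale.astype(np.float16)

def dequantize_int4(q, scale, shape):
    # At inference, the dequantized weights multiply FP16 activations.
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(shape)

w = np.random.randn(1024, 1024).astype(np.float32)
q, s = quantize_weights_int4(w)
w_hat = dequantize_int4(q, s, w.shape)
print("mean abs weight error:", np.abs(w - w_hat).mean())
```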