ML Model Optimization: Model Quantization (INT8, FP16, Mixed Precision)

Post-Training Quantization vs. Quantization-Aware Training

Post-Training Quantization

PTQ converts a trained FP32 model to lower precision without retraining. The model is already complete; quantization is applied as a post-processing step. This is the fastest path to deployment but may sacrifice accuracy.

How it works: Analyze model weights to find their range (minimum and maximum values). Map this range to 256 integer values (for INT8). For activations, run calibration data through the model to measure typical activation ranges. Create scaling factors that map floating point values to integers.
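The mapping above can be sketched in a few lines of NumPy. This is a simplified illustration of asymmetric INT8 quantization, not any particular framework's implementation; the function names and the choice of per-tensor (rather than per-channel) scaling are assumptions for clarity.

```python
import numpy as np

def quantize_int8(x, x_min=None, x_max=None):
    # Map the range [x_min, x_max] onto the 256 INT8 levels [-128, 127].
    # For weights the range comes from the tensor itself; for activations,
    # x_min/x_max would be measured on calibration data instead.
    if x_min is None:
        x_min = float(x.min())
    if x_max is None:
        x_max = float(x.max())
    scale = (x_max - x_min) / 255.0                # real-valued step size
    zero_point = int(round(-128 - x_min / scale))  # integer representing 0.0
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    # Recover an approximation of the original floats.
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(64).astype(np.float32)  # stand-in for a weight tensor
q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)
max_err = float(np.abs(w - w_hat).max())    # rounding error, roughly scale/2
```

The maximum reconstruction error is bounded by about half a quantization step, which is why tensors with a wide value range (large scale) lose more precision.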

Accuracy impact: Simple models lose 0.5-2% accuracy. Complex models with wide value ranges may lose 5-10% or fail entirely. The quantization process cannot fix a model that fundamentally needs high precision in certain layers.

Quantization-Aware Training

QAT simulates quantization during training. The model learns to work with reduced precision from the start, adapting its weights to maintain accuracy despite quantization errors.

How it works: Insert fake quantization operations in the forward pass. Weights are quantized and then immediately dequantized, so the model sees quantization noise during training. Gradients flow through the fake quantization via the straight-through estimator, which treats the (non-differentiable) rounding step as the identity function during backpropagation.
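A minimal sketch of fake quantization with a straight-through estimator, again in plain NumPy rather than any specific framework's QAT API (function names are illustrative):

```python
import numpy as np

def fake_quantize(x, scale):
    # Forward pass: quantize then immediately dequantize, so downstream
    # layers see the quantization noise they will face at inference time.
    q = np.clip(np.round(x / scale), -128, 127)
    return (q * scale).astype(np.float32)

def fake_quantize_grad(upstream_grad):
    # Backward pass (straight-through estimator): round() has zero gradient
    # almost everywhere, so we pretend it is the identity and pass the
    # upstream gradient through unchanged.
    return upstream_grad

w = np.array([0.013, -0.021, 0.004], dtype=np.float32)
scale = 0.01
w_fq = fake_quantize(w, scale)            # snapped to multiples of scale
g = fake_quantize_grad(np.ones_like(w))   # gradient passes through as-is
```

Because the forward pass injects rounding noise while the backward pass still delivers useful gradients, the optimizer gradually moves the weights toward values that survive quantization.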

Accuracy impact: Typically matches FP32 accuracy within 1%. More robust than PTQ for challenging models. The model has learned to be quantization-tolerant.

When to Choose Each

Use PTQ when: Time is critical (PTQ takes hours, QAT takes days). Model already quantizes well (validate on test set). Accuracy loss is acceptable.

Use QAT when: PTQ causes unacceptable accuracy loss. Model architecture is complex. You are deploying to accuracy-sensitive applications.

💡 Key Takeaways
PTQ converts trained models to lower precision without retraining; it is the fastest deployment path.
PTQ loses 0.5-2% accuracy on simple models, and 5-10% on complex models with wide value ranges.
QAT simulates quantization during training, so the model learns to tolerate reduced precision.
QAT typically matches FP32 accuracy within 1% but requires days of additional training.
📌 Interview Tips
1. Start with PTQ for speed; only invest in QAT if PTQ accuracy is unacceptable.
2. Mention calibration data for PTQ: representative samples determine activation ranges.