Post-Training Quantization vs. Quantization-Aware Training
Post-Training Quantization
PTQ converts a trained FP32 model to lower precision without retraining. The model is already complete; quantization is applied as a post-processing step. This is the fastest path to deployment but may sacrifice accuracy.
How it works: Analyze the trained weights to find their range (minimum and maximum values), then map that range onto the 256 integer values INT8 can represent. For activations, run calibration data through the model to measure typical activation ranges. From these ranges, compute scaling factors (and zero points) that map floating-point values to integers.
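The range-to-integer mapping above can be sketched in a few lines. This is a minimal NumPy illustration of affine INT8 quantization, not any particular framework's implementation; the function names and the choice of a symmetric-looking [-128, 127] target range are assumptions for the example.

```python
import numpy as np

def compute_qparams(values, num_bits=8):
    # Map the observed float range [vmin, vmax] onto the signed
    # integer range [-128, 127]. The range is widened to include
    # zero so that 0.0 is exactly representable.
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    vmin = min(float(values.min()), 0.0)
    vmax = max(float(values.max()), 0.0)
    scale = (vmax - vmin) / (qmax - qmin)
    zero_point = int(round(qmin - vmin / scale))
    return scale, zero_point

def quantize(values, scale, zero_point, num_bits=8):
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    q = np.clip(np.round(values / scale) + zero_point, qmin, qmax)
    return q.astype(np.int8)

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# "Calibration": measure the range on sample activations, then
# round-trip them through INT8 and inspect the error.
acts = np.random.default_rng(0).normal(0.0, 1.0, 1024).astype(np.float32)
scale, zp = compute_qparams(acts)
deq = dequantize(quantize(acts, scale, zp), scale, zp)
max_err = float(np.abs(acts - deq).max())  # on the order of one scale step
```

In a real PTQ pipeline these parameters are computed per tensor (or per channel) for every layer, and the calibration data stands in for typical inference inputs.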
Accuracy impact: Simple models lose 0.5-2% accuracy. Complex models with wide value ranges may lose 5-10% or fail entirely. The quantization process cannot fix a model that fundamentally needs high precision in certain layers.
Quantization-Aware Training
QAT simulates quantization during training. The model learns to work with reduced precision from the start, adapting its weights to maintain accuracy despite quantization errors.
How it works: Insert fake quantization operations into the forward pass. Weights (and activations) are quantized and then immediately dequantized, so the model sees quantization noise during training. Gradients flow through the fake quantization using the straight-through estimator, which treats the quantization step as the identity function during backpropagation (since rounding has zero gradient almost everywhere).
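The fake-quantize op and its straight-through gradient can be sketched explicitly. This is a hand-written NumPy forward/backward pair for illustration, under assumed INT8 bounds of [-128, 127]; real frameworks implement this as a custom autograd function.

```python
import numpy as np

def fake_quant_forward(x, scale, zero_point, qmin=-128, qmax=127):
    # Quantize then immediately dequantize: the output is float,
    # but only takes values on the INT8 grid, so the model trains
    # against quantization noise.
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return ((q - zero_point) * scale).astype(np.float32)

def fake_quant_backward(x, grad_out, scale, zero_point,
                        qmin=-128, qmax=127):
    # Straight-through estimator: round() has zero gradient almost
    # everywhere, so pretend the op is the identity and pass the
    # upstream gradient through, zeroing it only where the input
    # fell outside the representable range and was clipped.
    q = x / scale + zero_point
    inside = (q >= qmin) & (q <= qmax)
    return grad_out * inside

x = np.array([-10.0, -0.3, 0.0, 0.4, 10.0], dtype=np.float32)
scale, zp = 0.05, 0
y = fake_quant_forward(x, scale, zp)          # floats on the INT8 grid
g = fake_quant_backward(x, np.ones_like(x), scale, zp)
# Gradient passes through for in-range inputs; the two values that
# were clipped to the edge of the range get zero gradient.
```

The zero gradient in the clipped region is what lets the model learn to keep its values inside the representable range, while in-range weights still receive ordinary gradient updates.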
Accuracy impact: Typically matches FP32 accuracy within 1%. More robust than PTQ for challenging models. The model has learned to be quantization-tolerant.
When to Choose Each
Use PTQ when: Time is critical (PTQ takes hours, QAT takes days). The model already quantizes well (validate on a held-out test set). The accuracy loss is acceptable for your application.
Use QAT when: PTQ causes unacceptable accuracy loss. Model architecture is complex. You are deploying to accuracy-sensitive applications.