Precision Tradeoffs: FP32 vs FP16 vs INT8
Understanding Precision Levels
FP32 (32-bit float): Full precision, baseline accuracy, no conversion needed.
FP16 (16-bit float): Half the memory, ~2x faster on modern GPUs with tensor cores, typically <0.5% accuracy loss.
INT8 (8-bit integer): A quarter of the memory of FP32, 2-4x faster than FP16, requires calibration, 0.5-2% accuracy loss typical.
These speedups compound with compilation optimizations.
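A minimal NumPy sketch of the memory ratios; the weight tensor, its shape, and the variable names are all illustrative, not taken from any particular framework:

```python
import numpy as np

# Stand-in for a model's weight tensor; shape and values are illustrative.
w_fp32 = np.random.randn(1024, 1024).astype(np.float32)

# FP16: a plain cast halves the memory.
w_fp16 = w_fp32.astype(np.float16)

# INT8: symmetric quantization maps the observed range onto [-127, 127];
# the scale factor is what calibration (see the next section) estimates.
scale = np.abs(w_fp32).max() / 127.0
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)

print(w_fp32.nbytes // w_fp16.nbytes)  # 2: half the memory
print(w_fp32.nbytes // w_int8.nbytes)  # 4: a quarter of the memory
```

The casts change storage only; the 2-4x speedups additionally require hardware and kernels that execute in the reduced precision.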
INT8 Calibration
INT8 requires mapping the FP32 value range to 256 discrete levels. Calibration runs the model on representative data, collects activation ranges per layer, and computes optimal scaling factors. Poor calibration (unrepresentative data, too few samples) causes accuracy to collapse. Best practice: use 1000+ samples spanning input diversity. Some layers (final classifier, attention) are sensitive; keep them in FP16 while quantizing others to INT8 (mixed precision).
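The calibration loop above can be sketched with per-tensor symmetric quantization; the helper names and the synthetic Gaussian "calibration set" are assumptions for illustration, not a production calibrator:

```python
import numpy as np

def calibrate_scale(activations):
    """Per-tensor symmetric calibration: map the observed range onto [-127, 127]."""
    max_abs = max(np.abs(a).max() for a in activations)
    return max_abs / 127.0

def quantize(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# 1000 synthetic "activation samples" standing in for representative inputs.
calib = [rng.normal(0.0, 1.0, size=256).astype(np.float32) for _ in range(1000)]
scale = calibrate_scale(calib)

# Round-trip a sample that lies inside the calibrated range: the
# reconstruction error is bounded by half a quantization step (scale / 2).
x = calib[0]
x_hat = dequantize(quantize(x, scale), scale)
print(np.abs(x - x_hat).max())
```

Inputs outside the calibrated range get clipped, which is exactly why unrepresentative calibration data makes accuracy collapse: the clipping error is unbounded, unlike the rounding error.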
When Each Precision Makes Sense
FP32: Training, debugging, and accuracy-critical production where latency isn't an issue.
FP16: Default for GPU inference; a nearly free speedup with minimal risk.
INT8: High-throughput serving, edge deployment, cost-sensitive inference.
Mixed precision: Best accuracy-speed tradeoff for complex models; sensitive layers stay in FP16 while the bulk of the computation runs in INT8.
Accuracy Validation Protocol
Compare against FP32 baseline on held-out data. Acceptable thresholds: FP16 should be within 0.1% of FP32; INT8 within 0.5-1%. Test edge cases and low-confidence predictions specifically; quantization errors concentrate where activations are small or near decision boundaries.
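The protocol can be sketched as a regression check against the FP32 baseline. Here the two models' outputs are simulated with synthetic logits (an assumption for illustration); in practice you would run both models on the same held-out set and compare:

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_classes = 2000, 10
labels = rng.integers(0, num_classes, size=n)

# Stand-ins for held-out model outputs: an FP32 baseline, and a quantized
# model simulated here as the baseline plus a small perturbation.
fp32_logits = rng.normal(size=(n, num_classes)).astype(np.float32)
fp32_logits[np.arange(n), labels] += 2.0  # make the baseline mostly correct
int8_logits = fp32_logits + rng.normal(0.0, 0.01, size=(n, num_classes)).astype(np.float32)

def accuracy(logits, y):
    return float((logits.argmax(axis=1) == y).mean())

drop = accuracy(fp32_logits, labels) - accuracy(int8_logits, labels)
print(f"accuracy drop vs FP32 baseline: {drop:.4f}")
assert abs(drop) <= 0.01, "outside the INT8 budget: recalibrate or fall back to FP16"
```

Aggregate accuracy alone can hide the failure mode described above, so slice the same comparison over edge cases and low-confidence predictions rather than reporting a single number.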