Precision Tradeoffs: FP32 vs FP16 vs INT8
Reduced-precision arithmetic is one of the most impactful optimizations in model compilation. Modern GPUs have specialized hardware for lower-precision formats: tensor cores accelerate FP16 and INT8 matrix multiplications, delivering 2x to 16x higher throughput than FP32. The challenge is balancing speed gains against accuracy loss. FP16 is almost always a free win for inference, typically staying within measurement noise for standard vision and language models. INT8 offers larger gains but requires calibration and can degrade accuracy if not tuned correctly.
FP16 halves memory traffic and leverages tensor core acceleration on NVIDIA GPUs. For ResNet-class models, FP16 typically achieves 2x to 3x higher throughput with top-1 accuracy changes under 0.1 percentage points, often within the noise of different random seeds. The conversion is straightforward: cast weights and activations to half precision and let the hardware handle the rest. The main edge case is numerical instability in some layer types, such as batch normalization or layer normalization; mixed precision handles this by keeping accumulators in FP32 while computing in FP16.
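As a concrete sketch, the snippet below builds an FP16 engine from an ONNX file using TensorRT's Python API (TensorRT 8.x style). The file names are placeholders, and exact builder details vary across TensorRT versions.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_fp16_engine(onnx_path):
    """Parse an ONNX model and build a serialized TensorRT engine with FP16
    enabled. TensorRT picks FP16 kernels where profitable and falls back to
    FP32 for layers that need the extra numeric range."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            raise RuntimeError(f"ONNX parse failed: {parser.get_error(0)}")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 kernels; FP32 stays available as fallback
    return builder.build_serialized_network(network, config)

# Placeholder usage: serialize the plan to disk for deployment.
with open("resnet50_fp16.plan", "wb") as f:
    f.write(build_fp16_engine("resnet50.onnx"))
```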
INT8 quantization maps floating-point values to 8-bit integers, reducing the memory footprint by 4x and enabling the use of integer arithmetic units. Calibration is the critical step: you run representative data through the model, collect per-layer activation statistics, and compute scale factors that minimize quantization error. Well-tuned INT8 for ResNet50 stays within 0.5 to 1.0 percentage points of FP32 top-1 accuracy. Poorly chosen calibration data, however, can cause 5 to 10 point drops or produce zero-valued outputs in detection heads.
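That calibration loop can be expressed as a TensorRT entropy calibrator. The sketch below assumes calibration batches arrive as contiguous float32 NumPy arrays matching the network input shape and that pycuda is installed for device memory management; the cache file name is a placeholder.

```python
import numpy as np
import pycuda.autoinit  # noqa: F401  creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    """Feeds representative batches to TensorRT so it can derive activation
    scale factors; results are cached so recalibration can be skipped."""

    def __init__(self, calib_batches, cache_file="calibration.cache"):
        super().__init__()
        self.batches = iter(calib_batches)          # NCHW float32 arrays
        self.cache_file = cache_file
        self.batch_size = calib_batches[0].shape[0]
        self.device_input = cuda.mem_alloc(calib_batches[0].nbytes)

    def get_batch_size(self):
        return self.batch_size

    def get_batch(self, names):
        try:
            batch = np.ascontiguousarray(next(self.batches), dtype=np.float32)
        except StopIteration:
            return None                             # calibration finished
        cuda.memcpy_htod(self.device_input, batch)
        return [int(self.device_input)]             # one device pointer per network input

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)

# Attaching it when building the engine (continuing the FP16 sketch above):
#   config.set_flag(trt.BuilderFlag.INT8)
#   config.int8_calibrator = EntropyCalibrator(calib_batches)
```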
The performance gains are substantial. On an RTX A4000, TensorRT INT8 delivers roughly 12x higher throughput than FP32 baselines for CNNs, and roughly 3x higher than FP16. For large language models on A100 GPUs, INT8 inference can double tokens per second while staying within acceptable perplexity bounds. The catch is engineering complexity: you must maintain calibration datasets, validate accuracy for each model variant, and implement fallback paths when quantization fails. Production teams typically build FP32, FP16, and INT8 artifacts, A/B test them, and choose based on accuracy tolerance and cost constraints.
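The sketch below shows the shape of such an offline A/B comparison. `run_engine` and the accuracy-drop threshold are hypothetical stand-ins for whatever inference and evaluation code a team already has; only the structure of the comparison is the point.

```python
import time
import numpy as np

def compare_artifacts(run_engine, eval_batches, labels, max_accuracy_drop=0.005):
    """Compare FP32/FP16/INT8 engines on held-out, production-like data.
    `run_engine(precision, batch)` is a hypothetical helper that executes the
    corresponding serialized engine and returns class probabilities."""
    results = {}
    for precision in ("fp32", "fp16", "int8"):
        correct, total, latencies = 0, 0, []
        for batch, y in zip(eval_batches, labels):
            start = time.perf_counter()
            probs = run_engine(precision, batch)
            latencies.append(time.perf_counter() - start)
            correct += int(np.sum(np.argmax(probs, axis=1) == y))
            total += len(y)
        results[precision] = {
            "top1": correct / total,
            "p99_ms": 1000 * float(np.percentile(latencies, 99)),
        }

    # Flag reduced-precision candidates that exceed the allowed accuracy drop.
    baseline = results["fp32"]["top1"]
    for precision in ("fp16", "int8"):
        drop = baseline - results[precision]["top1"]
        results[precision]["acceptable"] = drop <= max_accuracy_drop
    return results
```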
💡 Key Takeaways
• FP16 is almost always a free win: 2x to 3x throughput gain with accuracy loss under 0.1 percentage points for standard vision and language models, leveraging tensor cores on modern GPUs
• INT8 delivers 4x memory reduction and up to 12x throughput on RTX A4000 for CNNs, but requires careful calibration with representative data
• Well-tuned INT8 for ResNet50 stays within 0.5 to 1.0 percentage points of FP32 top-1 accuracy; poor calibration can cause 5 to 10 point drops or zero-valued outputs
• Large language model acceleration: INT8 inference doubles tokens per second on A100 GPUs while maintaining acceptable perplexity, enabling cost-effective serving
• Production workflow: build FP32, FP16, and INT8 artifacts, A/B test accuracy and latency, select based on tolerance and cost constraints
• Failure mode: calibration data mismatch, where clean lab data does not match production distributions, causes accuracy degradation that offline tests miss
📌 Examples
Meta compiles content moderation models to TensorRT INT8 on T4 GPUs, achieving sub-10ms p99 latency while maintaining F1 score within 1% of FP32 baselines
NVIDIA Jetson deployments use INT8 for real-time object detection, fitting models in limited memory and hitting 30 FPS targets within 10-watt power budgets
Google TPU inference uses INT8 quantization for production serving of large-scale ranking models, reducing serving cost by 4x with minimal impact on relevance metrics