Model Quantization (INT8, FP16, Mixed Precision)
Choosing a Quantization Strategy: A Decision Framework
Selecting the right quantization approach depends on your bottleneck, hardware, and accuracy budget. For memory-bound workloads, where model size exceeds GPU RAM or network bandwidth dominates latency, aggressive quantization such as INT4 or INT8 delivers outsized gains. A 175B-parameter LLM at FP32 requires 700GB, impossible for single-GPU inference, but INT8 reduces it to 175GB and INT4 to about 87GB, enabling deployment. For compute-bound workloads on hardware with strong FP16 or BF16 support, mixed-precision training often yields larger speedups than INT8 inference.
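To make the memory arithmetic concrete, here is a back-of-the-envelope sketch in plain Python; the `weight_memory_gb` helper and the bytes-per-parameter table are illustrative, and the estimate covers weights only (activations, KV cache, and optimizer state add more):

```python
# Back-of-the-envelope weight-memory estimate for a dense model.
# Ignores activations, KV cache, and optimizer state.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

if __name__ == "__main__":
    params = 175e9  # 175B-parameter LLM
    for dtype in ("fp32", "fp16", "int8", "int4"):
        print(f"{dtype}: {weight_memory_gb(params, dtype):.0f} GB")
    # fp32: 700 GB, fp16: 350 GB, int8: 175 GB, int4: 88 GB
```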
Hardware capabilities dictate which formats are feasible. NVIDIA Tensor Cores accelerate FP16 and BF16 with FP32 accumulation, providing up to 16x throughput gains. Google TPUs natively support BF16 with a programming model identical to FP32. Mobile and embedded Neural Processing Units (NPUs) are optimized for INT8, making it the natural choice for edge deployment. If your target hardware lacks efficient low-bit kernels, quantization may provide memory savings without any speed improvement.
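As a sketch of the mixed-precision path on NVIDIA GPUs, the snippet below shows one PyTorch AMP training step; the toy model, batch, and hyperparameters are placeholders introduced for illustration and assume a CUDA device is available:

```python
import torch
import torch.nn as nn

# Minimal automatic-mixed-precision (AMP) training step (assumes a CUDA GPU).
# Matmuls run in FP16 on Tensor Cores; master weights stay in FP32 and the
# GradScaler keeps small gradients from underflowing.
device = "cuda"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 512, device=device)          # toy batch
y = torch.randint(0, 10, (64,), device=device)   # toy labels

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = loss_fn(model(x), y)                  # forward pass in FP16 where safe
scaler.scale(loss).backward()                    # backward on the scaled loss
scaler.step(optimizer)                           # unscales grads, then steps in FP32
scaler.update()                                  # adapts the loss scale
```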
Start with the fastest path to production. For CNNs in vision tasks, try PTQ to INT8 first; if accuracy loss is under 1 percent, deploy immediately. For transformers and LLMs, begin with weight-only INT8 or INT4 quantization, keeping activations in FP16. If accuracy degrades beyond acceptable thresholds, invest in QAT for a few epochs to recover quality. Always profile end-to-end latency and memory on the target hardware before committing to a quantization strategy.
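For the PTQ starting point, here is a minimal sketch using PyTorch's built-in dynamic quantization, which stores Linear weights in INT8 and quantizes activations on the fly; the stand-in model is hypothetical, and convolutional vision models would normally go through static PTQ with a calibration set instead:

```python
import torch
import torch.nn as nn

# Post-training dynamic quantization sketch: Linear weights are stored
# in INT8, activations are quantized at runtime. Works out of the box
# for Linear/LSTM layers; conv-heavy models usually need static PTQ
# with a calibration pass instead.
float_model = nn.Sequential(
    nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768)
).eval()

quantized_model = torch.ao.quantization.quantize_dynamic(
    float_model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
with torch.no_grad():
    fp32_out = float_model(x)
    int8_out = quantized_model(x)

# Sanity-check the accuracy budget before deploying: compare outputs here,
# or a real validation metric in practice, between FP32 and INT8.
print("max abs diff:", (fp32_out - int8_out).abs().max().item())
```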
💡 Key Takeaways
• Memory-bound systems gain the most from quantization: INT8 reduces a 175B LLM from 700GB to 175GB, enabling single-node inference that is simply impossible without quantization
• Compute-bound training on NVIDIA A100 benefits more from FP16 mixed precision (16x matmul throughput) than from INT8 inference (2 to 4x); prioritize mixed precision for training
• Edge NPUs deliver 3 to 5 times speedup with INT8 versus FP32 but often lack FP16 support, making INT8 PTQ or QAT the only practical path for mobile deployment
• Vision CNNs tolerate PTQ well: MobileNetV3 and EfficientNet show under 1 percent accuracy loss with INT8 PTQ, while transformers often need QAT to stay within 2 percent
• Cost trade-off example: INT4 LLM serving needs one-eighth the memory bandwidth of FP32, saving $40K per month on inference at 1000 queries per second scale and justifying the QAT investment
• Always validate on target hardware: INT8 on a CPU without VNNI (Vector Neural Network Instructions) can be slower than FP32 due to software emulation, so check kernel support before deployment (see the sketch after this list)
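A rough sketch of that pre-deployment check, assuming a Linux host with PyTorch installed; the specific CPU flags and backend names that matter depend on your hardware and build:

```python
import platform
import torch

# Quick pre-deployment sanity check for INT8 kernel support.
def check_int8_support() -> None:
    # Quantized CPU backends compiled into this PyTorch build
    # (typically 'fbgemm'/'x86' on x86, 'qnnpack' on ARM).
    print("quantized engines:", torch.backends.quantized.supported_engines)

    # On Linux x86, VNNI in the CPU flags indicates fast INT8 dot-product
    # instructions; without it, INT8 may fall back to slower paths.
    if platform.system() == "Linux":
        with open("/proc/cpuinfo") as f:
            flags = f.read()
        print("avx512_vnni:", "avx512_vnni" in flags)

    # On NVIDIA GPUs, Turing (compute capability 7.5) and later
    # add INT8 Tensor Core support.
    if torch.cuda.is_available():
        print("gpu compute capability:", torch.cuda.get_device_capability())

if __name__ == "__main__":
    check_int8_support()
```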
📌 Examples
Apple Neural Engine: iOS CoreML uses INT8 by default for on-device inference; MobileNet and ResNet models achieve 4 to 5 times speedup versus FP32 on the A15 NPU with under 1 percent accuracy loss
Meta LLaMA serving: INT8 weight-only quantization on A100 reduces memory from 520GB to 130GB for the 65B model, enabling 4-GPU inference instead of 16 GPUs and cutting hardware cost by 75 percent
NVIDIA mixed precision: Training BERT-Large with FP16 Tensor Cores completes in 3.2 days versus 7.1 days in FP32 on 8x V100; the 2.2x speedup justifies maintaining FP32 master weights for stability