Mixed Precision Training: FP16, BF16, and FP32 Accumulation
Mixed precision training stores master weights in FP32 but casts them to FP16 or BF16 (bfloat16, "Brain Float 16") for the forward and backward passes, then performs gradient accumulation and the optimizer update in FP32. This hybrid approach delivers a 1.5 to 3 times training speedup on modern accelerators while preserving convergence. NVIDIA Tensor Cores provide up to 16 times higher FP16 throughput than FP32 on A100 GPUs, and the memory savings enable 1.5 to 2 times larger batch sizes.
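As a concrete illustration, here is a minimal sketch of this pattern using PyTorch AMP: the optimizer holds FP32 master weights, autocast runs the forward pass in FP16, and a gradient scaler guards the backward pass. The model, data, and hyperparameters below are placeholders, not values from the text.

```python
# Minimal mixed precision training loop sketch (PyTorch AMP): FP32 master
# weights in the optimizer, FP16 compute under autocast, scaled backward pass.
import torch
from torch import nn

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # holds FP32 master weights
scaler = torch.cuda.amp.GradScaler()                         # dynamic loss scaling
loss_fn = nn.CrossEntropyLoss()

# Dummy data standing in for a real DataLoader.
loader = [(torch.randn(32, 1024), torch.randint(0, 10, (32,))) for _ in range(10)]

for inputs, targets in loader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)

    # Forward pass and loss run in FP16 where autocast deems it safe;
    # the parameters themselves stay in FP32.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(inputs)
        loss = loss_fn(logits, targets)

    scaler.scale(loss).backward()   # scale loss so small FP16 gradients do not underflow
    scaler.step(optimizer)          # unscale grads, skip the step on Inf/NaN, else update
    scaler.update()                 # grow or shrink the loss scale for the next iteration
```

Note that `scaler.step` silently skips the optimizer update whenever an overflow is detected in the scaled gradients, so a handful of skipped steps early in training is normal while the scale settles.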
FP16 uses 1 sign bit, 5 exponent bits, and 10 mantissa bits, giving a representable range of roughly 6e-8 (smallest subnormal) to 65504. Small gradients can underflow to zero, causing training to stall. Dynamic loss scaling multiplies the loss by a factor such as 1024 before backpropagation, shifting gradients into FP16's representable range, then unscales the gradients before the optimizer step. BF16 uses 1 sign bit, 8 exponent bits (the same as FP32), and 7 mantissa bits. It covers roughly the same range as FP32, about 1e-38 to 3e38, but at lower precision, which eliminates most underflow issues without loss scaling.
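For illustration, the scaling schedule can be sketched by hand as below: multiply the loss by the current scale, skip the step and back off on overflow, and grow the scale after a run of clean steps. The constants and helper names are illustrative choices, not prescribed values; in practice a framework utility such as PyTorch's GradScaler performs this bookkeeping.

```python
# Hand-rolled dynamic loss scaling sketch; constants are illustrative.
import torch

scale, good_steps = 1024.0, 0
GROWTH_FACTOR, BACKOFF_FACTOR, GROWTH_INTERVAL = 2.0, 0.5, 2000

def grads_are_finite(model):
    return all(torch.isfinite(p.grad).all()
               for p in model.parameters() if p.grad is not None)

def scaled_step(loss, model, optimizer):
    """One optimizer step with dynamic loss scaling."""
    global scale, good_steps
    optimizer.zero_grad(set_to_none=True)
    (loss * scale).backward()                  # shift small gradients into FP16 range
    if grads_are_finite(model):
        for p in model.parameters():           # unscale before the optimizer sees them
            if p.grad is not None:
                p.grad.div_(scale)
        optimizer.step()
        good_steps += 1
        if good_steps % GROWTH_INTERVAL == 0:  # long run without overflow: try a larger scale
            scale *= GROWTH_FACTOR
    else:
        scale *= BACKOFF_FACTOR                # overflow: halve the scale and skip this update
        good_steps = 0
```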
Google TPUs natively support BF16, making it the default for large model training. The wider exponent range improves stability in transformer training where activation magnitudes vary widely across layers. In practice, BF16 training often requires no hyperparameter changes from FP32, while FP16 may need careful loss scale tuning and gradient clipping.
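For comparison, a BF16 version of the earlier loop needs no gradient scaler at all; only the autocast dtype changes. This sketch assumes a GPU with native BF16 support (Ampere or newer); shapes and hyperparameters are again placeholders.

```python
# BF16 sketch: same autocast pattern, no loss scaling needed, because BF16
# shares FP32's exponent range. Shapes and hyperparameters are placeholders.
import torch
from torch import nn

model = nn.Linear(1024, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(32, 1024, device="cuda")
y = torch.randint(0, 10, (32,), device="cuda")

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = nn.functional.cross_entropy(model(x), y)

loss.backward()      # gradients rarely underflow in BF16, so no GradScaler is required
optimizer.step()
```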
💡 Key Takeaways
• FP16 mixed precision on an NVIDIA A100 delivers up to 16 times the compute throughput of FP32 for matrix multiplications, with a 1.5 to 3 times end-to-end speedup once memory traffic and overhead are included
• FP32 accumulation inside matrix multiplies and FP32 master weights prevent accuracy loss from accumulated rounding errors, which is critical for training stability over thousands of iterations
• Dynamic loss scaling starts with a scale such as 1024, doubles it every N steps if no overflow occurs, and halves it immediately on NaN or Inf detection, adapting to changes in gradient magnitude
• BF16 preserves the FP32 exponent range, eliminating underflow for most models, but reduces mantissa precision from 23 bits to 7 bits, which is sufficient given the gradient noise tolerance of large batch training
• Memory savings enable larger batches: FP16 halves activation memory, allowing a batch size increase from 32 to 64, which can improve throughput by 1.3 to 1.5 times when the larger batch raises hardware utilization
• Layer normalization, softmax, and loss computation are often kept in FP32 even in mixed precision to avoid numerical instability from reduced precision in sensitive operations (see the sketch after this list)
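As a sketch of that last point, an explicit upcast around a numerically sensitive op might look like the module below. Framework autocast implementations already keep many such ops (softmax, normalization, losses) in FP32 automatically, so the explicit `.float()` cast mainly matters for custom or hand-written ops; the module name and shapes here are illustrative.

```python
# Illustrative module that keeps a sensitive op (softmax) in FP32 while the
# surrounding matmuls run in the autocast dtype (FP16 or BF16).
import torch
from torch import nn

class StableHead(nn.Module):
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.proj(self.norm(x))              # matmul-heavy part: reduced precision is fine
        return torch.softmax(h.float(), dim=-1)  # upcast before softmax to avoid FP16 overflow
```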
📌 Examples
NVIDIA A100: Training a GPT-3-style model with BF16 mixed precision achieves 2.2 times the throughput of FP32 with identical validation loss curves and no loss scaling required
Google TPU v4: Native BF16 support trains BERT-Large in 3 hours versus 7 hours in FP32, with a 2 times larger batch size (512 vs 256) fitting in HBM (High Bandwidth Memory)
Meta's OPT-175B: Mixed precision training with FP16 Tensor Cores and FP32 accumulation, with dynamic loss scaling initialized at 2^15, reduced training cost by 40 percent versus an FP32 baseline