Mixed Precision Training: FP16, BF16, and FP32 Accumulation
Mixed Precision Training
Mixed precision training runs most computations in a 16-bit floating-point format (FP16 or BF16) while keeping numerically sensitive operations in FP32. This captures most of the speed and memory benefits of 16-bit arithmetic while avoiding the accuracy loss that its precision limitations would otherwise cause.
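As a concrete illustration, here is a minimal PyTorch sketch using torch.autocast, which runs eligible ops in a 16-bit dtype while leaving the FP32 parameters and optimizer untouched. The model, shapes, and learning rate are arbitrary placeholders; BF16 is used so the example also runs on CPU.

```python
import torch

# Tiny placeholder model and data; parameters are created in FP32 as usual.
model = torch.nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 16)
y = torch.randn(8, 4)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Inside autocast, matmuls and similar ops run in BF16 where it is safe;
# the FP32 weights themselves are not converted in place.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), y)

loss.backward()   # gradients for FP32 leaves come back in FP32
opt.step()
opt.zero_grad()
```

With FP16 on CUDA one would additionally use a gradient scaler (torch.cuda.amp.GradScaler) for loss scaling; BF16 typically needs no scaler, which is one reason it is the easier default where hardware supports it.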
FP16 vs BF16
FP16 (Half Precision): 1 sign bit, 5 exponent bits, 10 mantissa bits. The range is limited: values above 65,504 overflow, and values below roughly 6e-8 underflow to zero. It works well for most deep learning but requires loss scaling to keep small gradients from underflowing.
BF16 (Brain Float): 1 sign bit, 8 exponent bits, 7 mantissa bits. Same dynamic range as FP32 but much less precision (roughly 3 decimal digits). No loss scaling is needed. It has native support on newer hardware (A100 and H100 GPUs).
Trade-off: FP16 has more precision but limited range. BF16 has less precision but FP32-compatible range. For training, BF16 is often easier to use because gradient values naturally fit without scaling tricks.
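This trade-off can be demonstrated with only the standard library, since struct supports IEEE half precision (the 'e' format). The BF16 helper below is a simplification that truncates a float32 to its top 16 bits; real hardware typically rounds rather than truncates, but the range behavior is the same.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision (FP16)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def to_bf16(x: float) -> float:
    """Approximate BF16 by truncating a float32 to its top 16 bits."""
    bits32 = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits32 & 0xFFFF0000))[0]

print(to_fp16(65504.0))       # 65504.0 -- the largest finite FP16 value
try:
    to_fp16(70000.0)          # beyond FP16 range
except OverflowError:
    print("70000 overflows FP16")

print(to_fp16(1e-8))          # 0.0 -- underflows; this is why loss scaling exists

print(to_bf16(70000.0))       # fits easily in BF16's FP32-like range
print(to_bf16(1.001))         # 1.0 -- but only ~3 decimal digits of precision
```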
FP32 Accumulation
Matrix multiplications sum long series of products, and in FP16 those running sums both lose low-order bits and can exceed the 65,504 range. The solution: compute the individual products in FP16, but keep the running sum in FP32. This is called FP32 accumulation; it is standard in mixed precision training and is what GPU tensor cores do by default.
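A toy demonstration of why the accumulator's precision matters, simulating an FP16 accumulator by rounding back to half precision after every addition (the term value and count are arbitrary):

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision (FP16)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

n = 20000
term = to_fp16(0.001)   # each "product", already representable in FP16

# FP16 accumulator: every partial sum is rounded back to half precision.
# Once the sum reaches 4.0, adding 0.001 falls below half a ULP and the
# sum stops growing entirely.
acc16 = 0.0
for _ in range(n):
    acc16 = to_fp16(acc16 + term)

# Higher-precision accumulator: only the inputs are low precision.
acc32 = sum(term for _ in range(n))

print(acc16)   # stalls at 4.0
print(acc32)   # ~20.008, the correct total
```

The FP16-accumulated sum is off by a factor of five here; an FP32 accumulator with FP16 inputs recovers essentially the right answer.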
What stays in FP32: the master copy of the weights, loss computation, optimizer state (momentum, adaptive learning rates), and batch normalization statistics. These operations are numerically sensitive and constitute less than 5% of total compute.
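The FP32 master copy of the weights matters because a typical per-step update can be smaller than half a ULP of the FP16 weight itself, in which case updating the weight directly in FP16 silently discards it. A sketch with made-up numbers:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision (FP16)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

lr, grad = 1e-3, 0.1      # per-step update lr * grad = 1e-4 (arbitrary values)
w16 = to_fp16(1.0)        # weight updated directly in FP16
master = 1.0              # FP32 master copy (simulated with a Python float)

for _ in range(100):
    # Near 1.0 the FP16 ULP is ~0.00098, so adding 1e-4 rounds straight
    # back to 1.0: the update is lost every single step.
    w16 = to_fp16(w16 + lr * grad)
    master = master + lr * grad   # survives in higher precision

print(w16)      # 1.0 -- all 100 updates vanished
print(master)   # ~1.01 -- the updates accumulated correctly
```

In real mixed precision training the optimizer applies updates to the FP32 master weights, and a fresh FP16/BF16 copy is cast from them for the next forward pass.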
Performance Gains
Mixed precision training achieves a 1.5-3x speedup on modern GPUs, and activation and gradient memory drops by nearly half (the FP32 master weights and optimizer state do not shrink). These gains come from faster 16-bit compute (at least 2x the FLOPS of FP32 on tensor cores) and reduced memory bandwidth, since half as many bytes move for everything stored in 16-bit.
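For intuition on the memory side, a back-of-envelope calculation for the weights alone (the 7B parameter count is an arbitrary example; activations, gradients, and optimizer state are ignored for simplicity):

```python
# Bytes per parameter: FP32 uses 4, FP16/BF16 use 2.
params = 7_000_000_000
fp32_gib = params * 4 / 1024**3
fp16_gib = params * 2 / 1024**3

print(f"weights in FP32: {fp32_gib:.1f} GiB")   # ~26.1 GiB
print(f"weights in FP16: {fp16_gib:.1f} GiB")   # ~13.0 GiB
```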