
Mixed Precision Training: FP16, BF16, and FP32 Accumulation

Mixed Precision Training

Mixed precision training uses FP16 (16-bit floating point) for most computations while keeping critical operations in FP32. This provides most of the speed benefits of FP16 while avoiding accuracy loss from precision limitations.

FP16 vs BF16

FP16 (Half Precision): 5 bits for exponent, 10 bits for mantissa. Range is limited: values above 65,504 overflow. Works well for most deep learning but requires loss scaling to prevent underflow.
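The underflow that loss scaling guards against is easy to see directly with NumPy. This is a minimal sketch: the scale factor of 1024 is illustrative, real frameworks adjust it dynamically.

```python
import numpy as np

grad = 1e-8                        # a typical small gradient value
print(np.float16(grad))            # 0.0 -- underflows below FP16's subnormal range

scale = 1024.0                     # illustrative loss scale; AMP tunes this dynamically
scaled = np.float16(grad * scale)  # scaled value is representable in FP16
unscaled = float(scaled) / scale   # unscale in higher precision before the update
# unscaled is ~1e-8 again instead of 0.0
```

Scaling the loss (and hence all gradients) by a constant shifts small gradients up into FP16's representable range; dividing by the same constant in higher precision before the optimizer step recovers the true magnitude.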

BF16 (Brain Float): 8 bits for exponent, 7 bits for mantissa. Same range as FP32 but less precision. No loss scaling needed. Native support on newer hardware (A100, H100 GPUs).

Trade-off: FP16 has more precision but limited range. BF16 has less precision but FP32-compatible range. For training, BF16 is often easier to use because gradient values naturally fit without scaling tricks.
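The range difference follows directly from the bit layouts above. A small helper (an illustrative function, not a library API) computes the largest finite value for each format from its exponent and mantissa widths:

```python
# Largest finite value of an IEEE-style float format, derived from
# its exponent and mantissa bit counts (illustrative helper).
def fp_max(exp_bits: int, mantissa_bits: int) -> float:
    bias = 2 ** (exp_bits - 1) - 1           # IEEE exponent bias
    return (2 - 2 ** -mantissa_bits) * 2.0 ** bias

print(fp_max(5, 10))   # FP16: 65504.0
print(fp_max(8, 7))    # BF16: ~3.39e38
print(fp_max(8, 23))   # FP32: ~3.40e38
```

Because BF16 shares FP32's 8 exponent bits, its maximum is within about 0.4% of FP32's, which is why gradients fit without scaling.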

FP32 Accumulation

Matrix multiplications produce long sums of products. Accumulating those sums in FP16 risks overflow, and once the running total grows large, small contributions are rounded away entirely. The solution: compute the individual multiplications in FP16, but accumulate the results in FP32. This is called FP32 accumulation; it is standard in mixed precision training, and tensor cores perform it in hardware.
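The rounding-away effect can be demonstrated with a plain NumPy loop, a simplified stand-in for the inner sum of a matmul:

```python
import numpy as np

# Accumulate 10,000 ones. In FP16 the running sum stalls at 2048:
# above 2048 the spacing between representable FP16 values is 2,
# so adding 1.0 rounds back to the same value.
acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for _ in range(10_000):
    acc16 = np.float16(acc16 + np.float16(1.0))  # FP16 accumulator
    acc32 += np.float32(np.float16(1.0))         # FP16 terms, FP32 accumulator
# acc16 == 2048.0 (stalled), acc32 == 10000.0 (exact)
```

The FP16 accumulator silently loses every addition past 2048, while the FP32 accumulator of the same FP16 inputs is exact.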

What stays in FP32: Loss computation, optimizer state (momentum, adaptive learning rates), batch normalization statistics. These operations are numerically sensitive and constitute less than 5% of total compute.
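In PyTorch, this FP16/FP32 split is handled automatically by autocast and GradScaler. The sketch below shows one training step; the model, data, and scale settings are illustrative placeholders:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# FP16 autocast is the usual choice on CUDA; CPU autocast uses BF16.
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16

model = torch.nn.Linear(16, 4).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # optimizer state stays in FP32
# GradScaler applies loss scaling; it is only needed for FP16 on CUDA.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

x = torch.randn(32, 16, device=device)
target = torch.randn(32, 4, device=device)

opt.zero_grad()
with torch.autocast(device_type=device, dtype=amp_dtype):
    # Matmuls inside this block run in the low-precision dtype;
    # numerically sensitive ops like the loss reduction stay in FP32.
    loss = torch.nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()  # backward on the scaled loss
scaler.step(opt)               # unscales gradients, then runs opt.step()
scaler.update()                # adjusts the loss scale for the next step
```

Note that the user never casts tensors by hand: autocast picks the precision per operation, and the scaler is a no-op when disabled, so the same loop runs unchanged in full precision.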

Performance Gains

Mixed precision training achieves 1.5-3x speedup on modern GPUs. Memory usage drops by nearly half. These gains come from faster FP16 compute (2x FLOPS) and reduced memory bandwidth (half the data transferred).

💡 Key Takeaways
FP16 has limited range (max 65,504) requiring loss scaling; BF16 has FP32-compatible range but less precision
FP32 accumulation prevents overflow - multiply in FP16, sum in FP32
Critical operations stay in FP32: loss computation, optimizer state, batch norm statistics (<5% of compute)
Mixed precision achieves 1.5-3x speedup and ~50% memory reduction on modern GPUs
📌 Interview Tips
1. Explain BF16 as the easier choice: same range as FP32 means no loss scaling gymnastics
2. Mention that FP32 accumulation is automatic in modern frameworks: just enable mixed precision