Training Batch Size: Memory, Convergence, and Throughput Trade-offs
Memory Constraints
Training memory has four components: model parameters, gradients (same size as the parameters), optimizer state (2x the parameters for Adam, which keeps two moment estimates per parameter), and activations (proportional to batch size and model depth). In fp32 that is 16 bytes per parameter, so a 1B-parameter model needs ~16GB just for parameters, gradients, and optimizer state; activations for batch size 32 can add another ~20GB on top of that. Maximum batch size is therefore constrained by GPU memory. If you run out, the options are: gradient checkpointing (recompute activations during the backward pass, trading compute for memory), gradient accumulation (accumulate gradients over several small micro-batches to simulate a larger batch), or model parallelism (split the model across devices).
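The arithmetic above can be packaged as a quick back-of-the-envelope estimator. A minimal sketch in Python; the per-sample activation constant is a rough assumption chosen to match the ~20GB-at-batch-32 figure, not a measured value:

```python
def training_memory_bytes(n_params, batch_size,
                          bytes_per_param=4,              # fp32
                          optimizer_multiplier=2,         # Adam: two moments per parameter
                          activation_bytes_per_sample=650e6):  # rough assumption
    """Estimate peak training memory: params + grads + optimizer state + activations."""
    params = n_params * bytes_per_param
    grads = n_params * bytes_per_param                    # same size as parameters
    optimizer = n_params * bytes_per_param * optimizer_multiplier
    activations = batch_size * activation_bytes_per_sample
    return params + grads + optimizer + activations

# 1B parameters in fp32: 4 GB params + 4 GB grads + 8 GB Adam state = 16 GB,
# before any activations are counted.
static_only = training_memory_bytes(1_000_000_000, batch_size=0)
```

In practice the activation term dominates at large batch sizes, which is why gradient checkpointing (which shrinks only that term) is often the first lever to pull.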
Convergence Effects
Batch size controls gradient noise. Small batches (8-32) give noisy gradient estimates that can help escape poor local minima but make convergence less smooth. Large batches (512-4096) give stable gradients but tend to converge to sharp minima that generalize poorly. The learning rate must scale with batch size: the linear scaling rule says that if you double the batch size, you double the learning rate. However, this breaks down above a threshold (roughly 2048-8192, depending on the model), so large-batch training also requires learning rate warmup and careful hyperparameter tuning.
Throughput Optimization
Larger batches improve GPU utilization: more parallel compute, better memory bandwidth utilization, fewer gradient synchronization steps in distributed training. A batch of 256 might train 10x faster per epoch than a batch of 16, but converge to worse accuracy. The goal: find the largest batch size that still converges well, then tune learning rate and warmup accordingly. Typical approach: start with published batch sizes, increase gradually while monitoring validation loss.
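The "increase gradually while monitoring validation loss" procedure can be sketched as a simple sweep. Everything here is hypothetical scaffolding: `train_and_eval` stands in for whatever short training-plus-validation run your setup provides:

```python
def find_largest_good_batch(train_and_eval, candidates, tolerance=0.01):
    """Sweep batch sizes from small to large, keeping the largest one whose
    validation loss stays within `tolerance` of the best loss seen so far.

    `train_and_eval(batch_size) -> val_loss` is a hypothetical callback that
    trains (with the learning rate rescaled for that batch size) and returns
    validation loss; it should raise MemoryError when the batch does not fit.
    """
    best_loss = float("inf")
    best_batch = None
    for batch in sorted(candidates):
        try:
            loss = train_and_eval(batch)
        except MemoryError:
            break  # can't go larger on this GPU
        if loss <= best_loss + tolerance:
            best_batch = batch
            best_loss = min(best_loss, loss)
        else:
            break  # convergence degraded; stop growing
    return best_batch

# Example with a stubbed loss curve (hypothetical numbers):
losses = {16: 0.50, 32: 0.50, 64: 0.505, 128: 0.52, 256: 0.60}
find_largest_good_batch(losses.get, [16, 32, 64, 128, 256])  # -> 64
```

Stopping at the first clear degradation is a deliberate simplification; in practice you would rerun the borderline sizes with retuned warmup before ruling them out.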