Communication Efficiency and Compression
The Communication Bottleneck
Communication, not computation, is the bottleneck in federated learning. A modern neural network has millions of parameters. Sending 10 million 32-bit floats requires 40 MB per client per round. With 10,000 clients participating in each round, the server receives 400 GB of updates. Over mobile networks with 1-5 Mbps upload speeds, transmitting 40 MB takes 64-320 seconds per client. This makes naive federated learning impractical for any model larger than a few megabytes.
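The arithmetic behind these figures can be checked directly (the constants below are the hypothetical values from the text, not measurements):

```python
# Back-of-envelope communication cost for one federated round.
PARAMS = 10_000_000          # model parameters
BYTES_PER_PARAM = 4          # 32-bit float
CLIENTS_PER_ROUND = 10_000

update_bytes = PARAMS * BYTES_PER_PARAM          # per-client upload
round_bytes = update_bytes * CLIENTS_PER_ROUND   # total received by the server

def upload_seconds(mbps):
    """Time to send one update over a link of `mbps` megabits/second."""
    return update_bytes * 8 / (mbps * 1_000_000)

print(update_bytes / 1e6, "MB per client")   # 40.0 MB
print(round_bytes / 1e9, "GB per round")     # 400.0 GB
print(upload_seconds(5), "s at 5 Mbps")      # 64.0 s
print(upload_seconds(1), "s at 1 Mbps")      # 320.0 s
```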
Gradient Compression Techniques
Quantization: Instead of sending 32-bit floats, quantize to 8-bit or even 1-bit values. 1-bit SGD sends only the sign of each gradient component, reducing communication by 32x with surprisingly small accuracy loss (typically 1-3%).

Sparsification: Send only the largest gradient values and set the rest to zero. Top-k sparsification keeps only the k largest-magnitude gradients, often achieving 99% sparsity (roughly 100x compression) while maintaining convergence.

Error feedback: Accumulate the gradient components you did not send and add them to the next round's gradient. This prevents small updates from being lost permanently and is essential for sparsification to converge.
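A minimal sketch of top-k sparsification with error feedback, plus scaled 1-bit quantization. The class and function names are illustrative, not from any particular library:

```python
import numpy as np

def one_bit_quantize(grad):
    """1-bit SGD sketch: send only signs, scaled by the mean magnitude."""
    return np.abs(grad).mean() * np.sign(grad)

def top_k_sparsify(grad, k):
    """Keep the k largest-magnitude entries; zero out the rest."""
    sent = np.zeros_like(grad)
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    sent[idx] = grad[idx]
    return sent

class ErrorFeedbackCompressor:
    """Accumulates the unsent residual and adds it back next round."""
    def __init__(self, dim, k):
        self.residual = np.zeros(dim)
        self.k = k

    def compress(self, grad):
        corrected = grad + self.residual        # re-add what was dropped before
        sent = top_k_sparsify(corrected, self.k)
        self.residual = corrected - sent        # remember what we drop this time
        return sent

rng = np.random.default_rng(0)
comp = ErrorFeedbackCompressor(dim=1000, k=10)  # 99% sparsity
g = rng.normal(size=1000)
sent = comp.compress(g)
print(np.count_nonzero(sent))                   # 10 values instead of 1000
```

Note that `sent + comp.residual` exactly equals the corrected gradient: with error feedback, nothing is lost, only deferred to a later round.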
Local Computation Trade-off
Another approach: do more computation locally before communicating. Instead of 1 local epoch per round, run 10 local epochs. This reduces the number of communication rounds by 10x but introduces client drift: after many local updates, each client's model diverges from the others, making aggregation less effective. The optimal balance depends on network conditions and data heterogeneity. Typical production systems use 5-20 local epochs, with more heterogeneous data requiring fewer local steps.
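The trade-off can be sketched on a toy 1-D problem where each client minimizes (w - target)^2. All names and the setup are illustrative; note that on a convex toy like this, averaging still recovers the global optimum, so it shows the communication savings rather than the harm of drift, which appears with non-convex models and heterogeneous data:

```python
import numpy as np

def local_update(w, target, local_steps, lr=0.1):
    """Run several local gradient steps on (w - target)^2 before communicating."""
    for _ in range(local_steps):
        w = w - lr * 2 * (w - target)
    return w

def fed_avg(targets, rounds, local_steps):
    """FedAvg-style loop: clients train locally, the server averages."""
    w_global = 0.0
    for _ in range(rounds):
        client_models = [local_update(w_global, t, local_steps) for t in targets]
        w_global = float(np.mean(client_models))
    return w_global

targets = [-1.0, 3.0]   # heterogeneous client optima; the global optimum is 1.0
# 50 rounds of 1 local step vs. 5 rounds of 10 local steps:
# both approach 1.0, but the second uses 10x fewer communication rounds.
print(fed_avg(targets, rounds=50, local_steps=1))
print(fed_avg(targets, rounds=5, local_steps=10))
```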