Communication Efficiency and Compression
Communication is the bottleneck in federated learning, especially in cross-device deployments over mobile networks with asymmetric uplink speeds of 1 to 10 Mbps. A full model for mobile keyboard prediction is a few megabytes, and having thousands of clients upload it every round would consume gigabytes of bandwidth per round. Practical systems therefore target uplink payloads of 100 KB to 1 MB per client through aggressive compression of model updates.
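To make the scale concrete, a quick back-of-the-envelope calculation (the 5 MB model, 2,000 clients per round, and 2 Mbps uplink below are illustrative assumptions, not figures from any specific deployment):

# Rough arithmetic for per-round uplink cost in cross-device FL.
# All numbers are illustrative assumptions.
model_mb = 5                 # uncompressed update size per client
clients_per_round = 2_000    # clients selected in one round
uplink_mbps = 2              # a slow but realistic mobile uplink

aggregate_gb = model_mb * clients_per_round / 1_000
upload_seconds = model_mb * 8 / uplink_mbps   # megabytes -> megabits
print(f"~{aggregate_gb:.0f} GB of aggregate uplink per round, "
      f"~{upload_seconds:.0f} s per client upload")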
Quantization reduces precision from 32-bit floating point to 8-bit integers, cutting payload size by 4x with minimal accuracy loss (under 1 percent). Sparsification sends only the top k percent of gradient coordinates by magnitude, typically 1 to 10 percent, reducing size by 10x to 100x. Structured updates send only selected layers or parameter groups, for example updating only the final classification head while freezing the feature extractor. Google reports practical payloads of 0.1 to 2 MB per client for Gboard after combining quantization and sparsification.
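As a sketch of what these two compressors look like in code (a minimal NumPy illustration, not Gboard's actual pipeline; the function names and the uniform min-max quantization scheme are assumptions made for clarity):

import numpy as np

def quantize_8bit(update):
    # Uniform min-max quantization: float32 -> uint8 plus the (offset, scale)
    # the server needs to dequantize. One of several possible 8-bit schemes.
    lo, hi = float(update.min()), float(update.max())
    scale = (hi - lo) / 255.0 or 1.0   # guard against constant updates
    q = np.round((update - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize_8bit(q, lo, scale):
    return q.astype(np.float32) * scale + lo

def top_k_sparsify(update, fraction=0.1):
    # Keep only the largest-magnitude `fraction` of coordinates;
    # the client uploads the surviving values plus their indices.
    k = max(1, int(fraction * update.size))
    idx = np.argpartition(np.abs(update), -k)[-k:]
    return idx, update[idx]

# A 5M-parameter update is 20 MB at float32; after top-10% sparsification
# and 8-bit quantization only ~0.5 MB of values leave the device
# (index overhead adds to this and is often compressed separately).
update = np.random.randn(5_000_000).astype(np.float32)
idx, vals = top_k_sparsify(update, fraction=0.10)
q, lo, scale = quantize_8bit(vals)
print(f"{q.nbytes / 1e6:.1f} MB of quantized values to upload")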
The tradeoff is convergence speed versus bandwidth. Aggressive sparsification to 1 percent of coordinates can increase rounds to convergence by 50 to 100 percent because important gradient information is discarded. Error feedback accumulates the residual from quantization or sparsification and adds it to the next round's update, recovering most of the lost convergence speed at the cost of client-side memory for the error accumulator. Gradient compression schemes such as TernGrad and QSGD formalize these techniques with provable convergence bounds under certain smoothness assumptions.
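A minimal sketch of error feedback combined with top-k sparsification (the class and parameter names are illustrative, not any library's API; it builds on the top_k_sparsify idea above):

import numpy as np

class ErrorFeedbackCompressor:
    # Keeps the residual that compression discarded and re-injects it into
    # the next round's update, at the cost of one float32 buffer per client.
    def __init__(self, num_params, fraction=0.01):
        self.residual = np.zeros(num_params, dtype=np.float32)
        self.fraction = fraction

    def compress(self, update):
        corrected = update + self.residual             # add back what was dropped last round
        k = max(1, int(self.fraction * corrected.size))
        idx = np.argpartition(np.abs(corrected), -k)[-k:]
        sent = np.zeros_like(corrected)
        sent[idx] = corrected[idx]                     # the sparse update actually transmitted
        self.residual = corrected - sent               # remember everything that was dropped
        return idx, corrected[idx]

Because the residual buffer eventually transmits every coordinate's accumulated contribution, convergence largely recovers even under 1 percent sparsification.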
Cross-silo FL has less bandwidth pressure thanks to 1 to 10 Gbps datacenter links, so compression is optional. It still helps, however, when the model reaches hundreds of megabytes or when training spans continents with higher latency. Systems often prefer partial synchronization: send full updates every N rounds and compressed updates in between, balancing communication cost against convergence quality.
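One possible shape of such a schedule, reusing the quantization and sparsification helpers sketched above (the 10-round interval and compression settings are illustrative assumptions, not a prescription):

def payload_for_round(round_idx, update, full_sync_every=10):
    # Partial synchronization: full-precision update every N rounds,
    # sparsified + quantized update in between.
    if round_idx % full_sync_every == 0:
        return "full", update                          # e.g. tens of MB uncompressed
    idx, vals = top_k_sparsify(update, fraction=0.10)  # helpers defined in the sketch above
    q, lo, scale = quantize_8bit(vals)
    return "compressed", (idx, q, lo, scale)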
💡 Key Takeaways
• Cross-device FL targets uplink payloads of 100 KB to 1 MB on mobile networks with 1 to 10 Mbps uplink, compressing multi-megabyte models through quantization and sparsification
• Quantization to 8-bit reduces payload by 4x with under 1 percent accuracy loss, while sparsification to the top 1 to 10 percent of coordinates reduces it by 10x to 100x
• Aggressive sparsification to 1 percent of coordinates can increase rounds to convergence by 50 to 100 percent because important gradient information is discarded, slowing optimization
• Error feedback accumulates quantization residuals and adds them to the next update, recovering convergence speed at the cost of client-side memory for the error accumulator
• Google Gboard achieves 0.1 to 2 MB payloads per client through combined quantization and sparsification, enabling practical cross-device training over cellular networks
• Cross-silo FL with 1 to 10 Gbps links has less bandwidth pressure, but compression still helps for models in the hundreds of megabytes or for intercontinental training with high latency
📌 Examples
A keyboard model with 5 million parameters at 32-bit float is 20 MB uncompressed. After 8-bit quantization (4x) and top-10-percent sparsification (another 10x), the transmitted values shrink to roughly 500 KB, bringing the update down to the 100 KB to 1 MB uplink target
TernGrad quantizes gradients to ternary values {−1, 0, +1} with scaling factors, achieving 16x compression with provable convergence guarantees under smooth loss functions
Cross-silo federated learning for medical imaging sends full 50 MB model updates every 10 rounds and 5 MB compressed updates in between, balancing 1 Gbps link utilization with convergence quality