Handling Non-IID Data and Client Selection
The Non-IID Challenge
IID (Independent and Identically Distributed) data means each sample comes from the same underlying distribution. Traditional ML assumes this. In federated learning, it is almost never true. A keyboard app user who only types in Spanish has completely different data than one who types technical English. A hospital in a wealthy urban area sees different diseases than a rural clinic. When you average model updates from clients with wildly different data distributions, the result often diverges or converges to a model that works poorly for everyone. Studies show non-IID data can degrade model accuracy by 20-50% compared to centralized training on the same data.
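The averaging step described above is what goes wrong under non-IID data: each client's update pulls toward its own distribution, and the weighted mean can serve nobody well. A minimal sketch of sample-weighted federated averaging (FedAvg-style), using illustrative names (`fed_avg`, `client_updates`) rather than any real library API:

```python
# Minimal sketch of sample-weighted federated averaging.
# Each client contributes (flat parameter list, sample count).

def fed_avg(client_updates):
    """client_updates: list of (params, n_samples) tuples."""
    total = sum(n for _, n in client_updates)
    dim = len(client_updates[0][0])
    avg = [0.0] * dim
    for params, n in client_updates:
        weight = n / total
        for i, p in enumerate(params):
            avg[i] += weight * p
    return avg

# Two clients whose data pulls the parameters in opposite directions:
updates = [([1.0, 0.0], 100), ([0.0, 1.0], 100)]
print(fed_avg(updates))  # [0.5, 0.5] -- midway, optimal for neither client
```

With IID data the two clients' updates would roughly agree and the average would be close to both optima; here the average lands halfway between two incompatible solutions, which is the divergence problem in miniature.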
Types of Non-IID Distribution
Label skew: Different clients have different label distributions. One phone user mostly types work emails; another sends casual messages, so their word frequencies differ dramatically.
Feature skew: Same labels but different feature patterns. Two users both type greetings, but one uses formal language and the other uses slang.
Quantity skew: Some clients have 10,000 samples; others have 50. Heavy users dominate training if updates are weighted by sample count.
Temporal skew: User behavior changes over time. Weekend typing patterns differ from weekday ones.
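Label skew is often simulated in benchmarks by splitting a dataset's labels across clients with Dirichlet proportions, where a small concentration parameter alpha produces heavily skewed clients. A sketch of that technique, with illustrative function names (`dirichlet`, `partition_by_label`) rather than a real library API:

```python
import random

# Sketch of a Dirichlet label-skew partition for simulating non-IID
# clients; smaller alpha = more skewed label mixtures per client.

def dirichlet(alpha, k, rng):
    # Dirichlet proportions from k independent Gamma(alpha, 1) draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def partition_by_label(labels, n_clients, alpha, seed=0):
    # Give each client a Dirichlet-weighted slice of every label's
    # samples, so each client ends up with a different label mixture.
    rng = random.Random(seed)
    by_label = {}
    for idx, y in enumerate(labels):
        by_label.setdefault(y, []).append(idx)
    clients = [[] for _ in range(n_clients)]
    for idxs in by_label.values():
        rng.shuffle(idxs)
        props = dirichlet(alpha, n_clients, rng)
        cuts, acc = [0], 0.0
        for p in props[:-1]:
            acc += p
            cuts.append(int(acc * len(idxs)))
        cuts.append(len(idxs))
        for c in range(n_clients):
            clients[c].extend(idxs[cuts[c]:cuts[c + 1]])
    return clients

# 100 samples, two labels, four clients, strong skew:
parts = partition_by_label([0] * 50 + [1] * 50, n_clients=4, alpha=0.1)
```

Quantity skew falls out of the same construction for free: with small alpha, some clients receive most of a label's samples while others receive almost none.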
Client Selection Strategies
Not all clients should participate in every round. Random selection ensures fairness but ignores data quality.
Stratified selection: Sample clients proportionally from different groups to ensure diverse data coverage. If 30% of users speak Spanish, roughly 30% of each round's participants should be Spanish speakers.
Importance sampling: Weight client selection by data characteristics. Clients with rare but important data (unusual medical conditions, minority languages) are selected more often.
Active learning: Select clients whose updates would most improve the model, measured by gradient magnitude or prediction uncertainty. This can improve convergence by 2-3x but risks overfitting to edge cases.
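Stratified selection is the easiest of these strategies to make concrete. A sketch under the assumption that clients are pre-tagged with a group such as language; the names (`stratified_select`, the `pool` dictionary) are hypothetical, not from any federated learning framework:

```python
import random

# Sketch of stratified client selection: per round, sample from each
# group in proportion to its share of the client population.

def stratified_select(clients_by_group, round_size, seed=0):
    rng = random.Random(seed)
    total = sum(len(members) for members in clients_by_group.values())
    selected = []
    for members in clients_by_group.values():
        # Proportional quota, but at least one client per group so
        # small groups are never starved out of a round.
        quota = max(1, round(round_size * len(members) / total))
        selected.extend(rng.sample(members, min(quota, len(members))))
    return selected

# 30% Spanish-speaking users, 70% English-speaking:
pool = {"es": [f"es-{i}" for i in range(30)],
        "en": [f"en-{i}" for i in range(70)]}
chosen = stratified_select(pool, round_size=10)
# Picks 3 of the 30 Spanish clients and 7 of the 70 English clients.
```

The `max(1, ...)` floor is a design choice: it trades strict proportionality for guaranteed coverage, which matters for exactly the rare groups that importance sampling also tries to protect.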