
Handling Non-IID Data and Client Selection

Data across federated clients is highly non-IID (not independent and identically distributed), which fundamentally slows convergence compared to centralized training. Heavy mobile users generate 100x more training samples than light users, regional cohorts use different vocabulary or slang, and hospital imaging datasets vary by equipment manufacturer and patient demographics. When the global model aggregates updates from such skewed distributions, it can oscillate or converge to poor local minima.

Proximal regularization adds a term to the local objective that penalizes deviation from the global model parameters, effectively constraining each client update. FedProx applies this idea to federated averaging with a proximal coefficient of roughly 0.01 to 0.1, which stabilizes training on heterogeneous data. Server-side momentum or adaptive optimizers such as Adam applied to the aggregated updates smooth out noisy gradients, and limiting per-client learning rates to 0.001 to 0.01 prevents any single client from pulling the model too far in one round.

Client selection strategies directly shape which distributions influence the model. Random sampling per round can still over-represent heavy users; stratified sampling ensures balanced representation across time zones, app versions, or usage tiers, and Google reports rotating sampling across cohorts to avoid bias. For cross-device FL, the scheduler selects eligible clients by policy: on Wi-Fi, charging, idle, running a recent app version, and opted in. Targeting nightly windows per time zone, systems invite 5,000 to 10,000 clients and expect 500 to 2,000 completions after dropout.

Robust aggregation defends against outliers and adversarial updates. Coordinate-wise median or trimmed mean discards extreme values before averaging, while Krum selects the most central client updates based on pairwise distances, rejecting outliers. These methods reduce the accuracy impact of poisoned clients but add computation cost. The tradeoff is convergence speed versus robustness: aggressive filtering can discard valid updates from rare but legitimate distributions, harming recall on minority classes by 5 to 15 percent.
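A coordinate-wise trimmed mean like the one described above can be sketched in a few lines. This is a minimal NumPy sketch; the function name, trim fraction, and toy data are illustrative assumptions, not from any specific system:

```python
import numpy as np

def trimmed_mean(updates, trim_frac=0.1):
    """Coordinate-wise trimmed mean over stacked client updates.

    updates: array of shape (n_clients, n_params). For each coordinate,
    the lowest and highest trim_frac of values are dropped before
    averaging, bounding the influence of poisoned or outlier clients.
    """
    updates = np.asarray(updates)
    n = updates.shape[0]
    k = int(n * trim_frac)            # clients to trim at each end
    sorted_updates = np.sort(updates, axis=0)
    if k > 0:
        sorted_updates = sorted_updates[k:n - k]
    return sorted_updates.mean(axis=0)

# Toy round: 10 honest clients near 1.0 plus one poisoned client at 100.0
updates = np.ones((10, 4)) + 0.01 * np.arange(10)[:, None]
updates = np.vstack([updates, np.full((1, 4), 100.0)])
agg = trimmed_mean(updates, trim_frac=0.1)   # poisoned row is trimmed away
```

Note that trimming discards one honest client along with the poisoned one, which is exactly the robustness-versus-recall tradeoff described above.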
💡 Key Takeaways
Non-IID data slows convergence because client distributions vary widely: heavy users generate 100x more samples, regional cohorts have different vocabulary, hospital equipment affects imaging quality
Proximal regularization with a coefficient of 0.01 to 0.1 constrains local updates to stay near the global parameters, stabilizing training and reducing oscillation on heterogeneous data
Client selection policies for cross-device FL include Wi-Fi only, charging, idle, and nightly windows per time zone, with 5,000 to 10,000 invites targeting 500 to 2,000 completions
Robust aggregation methods like coordinate-wise median or Krum defend against poisoned updates but can harm recall on minority classes by 5 to 15 percent if filtering is too aggressive
Stratified sampling ensures balanced representation across cohorts, preventing heavy users or dominant regions from skewing the global model toward their distribution
Server-side momentum or adaptive learning rates like Adam smooth noisy aggregated gradients, improving convergence on non-IID data by 10 to 30 percent compared to vanilla averaging
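The proximal-regularization takeaway can be sketched as a single local SGD step. This is a minimal NumPy sketch; the function name and toy values are illustrative assumptions, with mu and lr drawn from the ranges cited above:

```python
import numpy as np

def fedprox_local_step(w_local, w_global, grad, lr=0.01, mu=0.1):
    """One local SGD step with a FedProx-style proximal term.

    The effective gradient is grad + mu * (w_local - w_global), so the
    proximal term pulls each client's parameters back toward the global
    model, limiting client drift on non-IID data. mu in [0.01, 0.1] and
    lr in [0.001, 0.01] match the ranges mentioned in the text.
    """
    proximal_grad = grad + mu * (w_local - w_global)
    return w_local - lr * proximal_grad

# Toy usage: a client that has already drifted from the global model
w_global = np.zeros(3)
w_local = np.array([1.0, -1.0, 0.5])
grad = np.array([2.0, 2.0, 2.0])
w_next = fedprox_local_step(w_local, w_global, grad, lr=0.01, mu=0.1)
```

Compared with a plain SGD step on the same gradient, the proximal step leaves the client parameters strictly closer to the global model, which is the stabilizing effect the takeaway describes.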
📌 Examples
Google Gboard uses proximal regularization with a coefficient of 0.01 and rotates sampling across time zones to avoid over-representing any single region, improving rare-word recall by 2 to 4 percent
Cross-silo FL for fraud detection across 20 banks applies a coordinate-wise trimmed mean to discard outliers, since one bank with unusual transaction patterns could skew the global model
A keyboard model without stratified sampling over-represents users typing 1,000 words per day versus 50, causing poor predictions for light users until balanced sampling is enforced
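The stratified-sampling fix in the last example can be sketched as follows. Cohort names, sizes, and the per-cohort quota are made up for illustration:

```python
import random

def stratified_sample(clients_by_cohort, per_cohort):
    """Sample an equal number of clients from each cohort (e.g. time
    zone or usage tier), so heavy-user cohorts cannot dominate a round.

    clients_by_cohort: dict mapping cohort name -> list of client ids.
    """
    selected = []
    for cohort, clients in clients_by_cohort.items():
        k = min(per_cohort, len(clients))      # cohort may be small
        selected.extend(random.sample(clients, k))
    return selected

# Toy population: heavy typists would swamp random sampling, since each
# contributes ~100x more samples; stratification caps their head count.
cohorts = {
    "heavy_users": [f"h{i}" for i in range(1000)],
    "light_users": [f"l{i}" for i in range(1000)],
}
round_clients = stratified_sample(cohorts, per_cohort=50)
```

Each round now draws exactly 50 clients per cohort regardless of cohort size, so light users get equal representation in the aggregate.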