
Handling Non-IID Data and Client Selection

Data across federated clients is highly non-IID (not independent and identically distributed), which fundamentally slows convergence compared to centralized training. Heavy mobile users generate 100x more training samples than light users, regional cohorts use different vocabulary or slang, and hospital imaging datasets vary by equipment manufacturer and patient demographics. When the global model aggregates updates from such skewed distributions, it can oscillate or converge to poor local minima.

Proximal regularization adds a term to the local objective that penalizes deviation from the global model parameters, effectively constraining each client update. FedProx applies this idea to federated averaging with a proximal coefficient of roughly 0.01 to 0.1, which stabilizes training on heterogeneous data. Server-side momentum or adaptive optimizers such as Adam applied to the aggregated updates smooth out noisy gradients, and limiting per-client learning rates to 0.001 to 0.01 prevents any single client from pulling the model too far in one round.

Client selection strategies directly shape which distributions influence the model. Random sampling per round can still over-represent heavy users; stratified sampling ensures balanced representation across time zones, app versions, or usage tiers, and Google reports rotating sampling across cohorts to avoid bias. For cross-device FL, the scheduler selects eligible clients by policy: on Wi-Fi, charging, idle, running a recent app version, and opted in. Targeting nightly windows per time zone, systems invite 5,000 to 10,000 clients and expect 500 to 2,000 completions after dropout.

Robust aggregation defends against outliers and adversarial updates. Coordinate-wise median or trimmed mean discards extreme values before averaging, while Krum selects the most central client updates based on pairwise distances, rejecting outliers. These methods reduce the accuracy impact of poisoned clients but add computation cost. The tradeoff is convergence speed versus robustness: aggressive filtering can discard valid updates from rare but legitimate distributions, harming recall on minority classes by 5 to 15 percent.
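A coordinate-wise trimmed mean like the one described above can be sketched in a few lines. This is a minimal NumPy sketch; the function name, trim fraction, and toy data are illustrative assumptions, not from any specific system:

```python
import numpy as np

def trimmed_mean(updates, trim_frac=0.1):
    """Coordinate-wise trimmed mean over stacked client updates.

    updates: array of shape (n_clients, n_params). For each coordinate,
    the lowest and highest trim_frac of values are dropped before
    averaging, bounding the influence of poisoned or outlier clients.
    """
    updates = np.asarray(updates)
    n = updates.shape[0]
    k = int(n * trim_frac)            # clients to trim at each end
    sorted_updates = np.sort(updates, axis=0)
    if k > 0:
        sorted_updates = sorted_updates[k:n - k]
    return sorted_updates.mean(axis=0)

# Toy round: 10 honest clients near 1.0 plus one poisoned client at 100.0
updates = np.ones((10, 4)) + 0.01 * np.arange(10)[:, None]
updates = np.vstack([updates, np.full((1, 4), 100.0)])
agg = trimmed_mean(updates, trim_frac=0.1)   # poisoned row is trimmed away
```

Note that trimming discards one honest client along with the poisoned one, which is exactly the robustness-versus-recall tradeoff described above.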
💡 Key Takeaways
Non-IID data slows convergence because client distributions vary widely: heavy users generate 100x more samples, regional cohorts have different vocabulary, hospital equipment affects imaging quality
Proximal regularization with a coefficient of 0.01 to 0.1 constrains local updates to stay near the global parameters, stabilizing training and reducing oscillation on heterogeneous data
Client selection policies for cross-device FL include Wi-Fi only, charging, idle, and nightly windows per time zone, with 5,000 to 10,000 invites targeting 500 to 2,000 completions
Robust aggregation methods like coordinate-wise median or Krum defend against poisoned updates but can harm recall on minority classes by 5 to 15 percent if filtering is too aggressive
Stratified sampling ensures balanced representation across cohorts, preventing heavy users or dominant regions from skewing the global model toward their distribution
Server-side momentum or adaptive learning rates like Adam smooth noisy aggregated gradients, improving convergence on non-IID data by 10 to 30 percent compared to vanilla averaging
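The proximal-regularization takeaway can be sketched as a single local SGD step. This is a minimal NumPy sketch; the function name and toy values are illustrative assumptions, with mu and lr drawn from the ranges cited above:

```python
import numpy as np

def fedprox_local_step(w_local, w_global, grad, lr=0.01, mu=0.1):
    """One local SGD step with a FedProx-style proximal term.

    The effective gradient is grad + mu * (w_local - w_global), so the
    proximal term pulls each client's parameters back toward the global
    model, limiting client drift on non-IID data. mu in [0.01, 0.1] and
    lr in [0.001, 0.01] match the ranges mentioned in the text.
    """
    proximal_grad = grad + mu * (w_local - w_global)
    return w_local - lr * proximal_grad

# Toy usage: a client that has already drifted from the global model
w_global = np.zeros(3)
w_local = np.array([1.0, -1.0, 0.5])
grad = np.array([2.0, 2.0, 2.0])
w_next = fedprox_local_step(w_local, w_global, grad, lr=0.01, mu=0.1)
```

Compared with a plain SGD step on the same gradient, the proximal step leaves the client parameters strictly closer to the global model, which is the stabilizing effect the takeaway describes.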
📌 Examples
Google Gboard uses proximal regularization with a coefficient of 0.01 and rotates sampling across time zones to avoid over-representing any single region, improving rare-word recall by 2 to 4 percent
Cross-silo FL for fraud detection across 20 banks applies a coordinate-wise trimmed mean to discard outliers, since one bank with unusual transaction patterns could skew the global model
A keyboard model without stratified sampling over-represents users typing 1,000 words per day versus 50, causing poor predictions for light users until balanced sampling is enforced
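The stratified-sampling fix in the last example can be sketched as follows. Cohort names, sizes, and the per-cohort quota are made up for illustration:

```python
import random

def stratified_sample(clients_by_cohort, per_cohort):
    """Sample an equal number of clients from each cohort (e.g. time
    zone or usage tier), so heavy-user cohorts cannot dominate a round.

    clients_by_cohort: dict mapping cohort name -> list of client ids.
    """
    selected = []
    for cohort, clients in clients_by_cohort.items():
        k = min(per_cohort, len(clients))      # cohort may be small
        selected.extend(random.sample(clients, k))
    return selected

# Toy population: heavy typists would swamp random sampling, since each
# contributes ~100x more samples; stratification caps their head count.
cohorts = {
    "heavy_users": [f"h{i}" for i in range(1000)],
    "light_users": [f"l{i}" for i in range(1000)],
}
round_clients = stratified_sample(cohorts, per_cohort=50)
```

Each round now draws exactly 50 clients per cohort regardless of cohort size, so light users get equal representation in the aggregate.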