Production Deployment and Failure Modes
Client Reliability Challenges
Production federated learning must handle unreliable clients. Mobile devices go offline, lose network connectivity, run out of battery, or simply close the app. In a typical round with 10,000 selected clients, expect 10-30% to fail before sending their updates. This is not an edge case; it is the normal operating condition. If the secure aggregation protocol requires every selected client to participate, a single dropout breaks the entire round, which is why practical protocols are designed to tolerate a threshold of dropouts. Systems must be designed assuming constant partial failures.
Handling Stragglers and Dropouts
Deadline-based aggregation: Do not wait for all clients. Set a deadline (typically 5-10 minutes) and aggregate whatever updates have arrived. This sacrifices some training data but keeps rounds completing on schedule.

Over-selection: If a round needs 1,000 clients to be valid, select 1,500 initially, expecting dropouts. The first 1,000 to respond are included; late responders are discarded.

Asynchronous updates: Instead of synchronized rounds, accept updates continuously. The server maintains a running aggregate that incorporates updates as they arrive. This eliminates waiting but introduces staleness: each update is computed against an older model version, and staleness beyond 5-10 rounds typically hurts convergence.
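The over-selection and deadline patterns can be combined in one selection loop. The sketch below is illustrative, not a real FL framework API: the function name, the 20% dropout rate, the simulated finish times, and the scalar stand-in for a model update are all assumptions made for brevity.

```python
import random

def select_and_aggregate(population, target=1000, oversample=1.5, deadline_s=300.0):
    """Over-select clients, then aggregate whichever updates arrive on time.

    A real system would receive model deltas over the network; here each
    client's 'update' is a single float derived from its id, purely as a
    stand-in. Dropout rate and finish times are simulated assumptions.
    """
    selected = random.sample(population, int(target * oversample))

    arrivals = []
    for client_id in selected:
        if random.random() < 0.20:           # assumed dropout: never reports back
            continue
        finish = random.uniform(0, 2 * deadline_s)
        if finish <= deadline_s:             # only on-time updates count
            arrivals.append((finish, client_id))

    arrivals.sort()                          # earliest responders first
    contributors = [cid for _, cid in arrivals[:target]]

    # Aggregate: plain mean of the (stand-in) scalar updates.
    updates = [(cid % 100) / 100.0 for cid in contributors]
    aggregate = sum(updates) / len(updates) if updates else None
    return aggregate, len(contributors)
```

Note that the round succeeds with however many clients beat the deadline, up to the target; the extra 50% selected up front is the buffer against the dropouts and stragglers described above.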
Security Failure Modes
Byzantine clients: Malicious clients send poisoned updates to corrupt the model. Even a single client sending gradients that point toward misclassification can shift the aggregate. Defenses include robust aggregation (coordinate-wise median or trimmed mean instead of a plain average) and anomaly detection on update statistics.

Model inversion: Even aggregated models leak information. Given enough queries against the model, an attacker can reconstruct training examples. Mitigation requires combining differential privacy with output perturbation.

Free-riding: Clients that receive model updates without contributing genuine training work. Detection relies on statistical validation of update quality.
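The coordinate-wise trimmed mean mentioned above can be sketched as follows; the function name and the 10% trim fraction are illustrative choices, not a specific library's API.

```python
import numpy as np

def trimmed_mean(updates, trim_frac=0.1):
    """Coordinate-wise trimmed mean over client updates.

    `updates` is an (n_clients, n_params) array. For each parameter,
    the values across clients are sorted and the top and bottom
    `trim_frac` fraction are discarded before averaging, so a small
    number of extreme (Byzantine) values cannot drag the aggregate.
    """
    updates = np.asarray(updates, dtype=float)
    n = updates.shape[0]
    k = int(n * trim_frac)                 # clients trimmed from each end
    sorted_u = np.sort(updates, axis=0)    # sort each coordinate independently
    kept = sorted_u[k : n - k] if k > 0 else sorted_u
    return kept.mean(axis=0)
```

For example, with nine honest clients each sending updates near 1.0 and one Byzantine client sending 1000.0, a plain mean is pulled to roughly 100.9 per coordinate, while the trimmed mean (which discards the single largest and smallest value per coordinate) stays at 1.0.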