Production Deployment and Failure Modes
Production federated learning systems face operational challenges absent in centralized training. Device dropout is the dominant failure mode: 80 to 90 percent of invited cross-device clients fail to complete a round due to network changes, battery drain, app backgrounding, or user interaction. Secure aggregation requires a minimum number of completed clients, say 200, before the aggregated update can be revealed. If too many clients drop out, the round aborts, wasting privacy budget and coordinator resources. Over-sampling invitations by 5x to 10x mitigates this: for example, invite 5,000 clients targeting 500 completions against a 200-client threshold.
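A minimal sketch of the sizing arithmetic, assuming the coordinator knows a target completion count, an expected completion rate, and the secure-aggregation threshold; the function names and the 1.2x safety factor are illustrative assumptions, not any production system's API:

```python
import math

def invitations_needed(target_completions: int,
                       expected_completion_rate: float,
                       safety_factor: float = 1.2) -> int:
    """Hypothetical sizing rule: invite enough clients that the expected
    number of completions comfortably exceeds the target."""
    return math.ceil(target_completions / expected_completion_rate * safety_factor)

def round_can_reveal(completed: int, secure_agg_threshold: int) -> bool:
    """Secure aggregation only reveals the summed update once at least
    `secure_agg_threshold` clients have contributed; otherwise the round aborts."""
    return completed >= secure_agg_threshold

# With ~10% completion (90% dropout), reaching 500 completions for a
# 200-client threshold means inviting on the order of thousands of devices.
print(invitations_needed(target_completions=500, expected_completion_rate=0.10))  # 6000
```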
Version skew corrupts aggregation when clients run different model architectures or optimizer configurations due to staggered app rollout. If 30 percent of clients submit updates for a 10-layer model while 70 percent submit updates for a 12-layer model, the result is garbage. Enforce strict version matching per round and reject mismatched clients at the coordinator. Canary rounds with small cohorts detect bad configurations before full rollout; Google and Apple use phased regional rollouts over several days to limit blast radius.
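A hedged sketch of coordinator-side version gating, assuming each client check-in carries a model-version string and an optimizer-config hash; the dataclass and field names are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ClientCheckin:
    client_id: str
    model_version: str        # e.g. architecture tag + checkpoint hash
    optimizer_config_hash: str

def admit_clients(checkins, round_model_version, round_optimizer_hash):
    """Only clients whose model architecture and optimizer config exactly
    match this round's spec are admitted; everyone else is rejected
    before training starts."""
    admitted, rejected = [], []
    for c in checkins:
        if (c.model_version == round_model_version
                and c.optimizer_config_hash == round_optimizer_hash):
            admitted.append(c)
        else:
            rejected.append(c)
    return admitted, rejected
```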
Monitoring is blind because raw data never reaches the server: you cannot inspect training samples or compute detailed per-example metrics. Federated analytics with secure aggregation computes aggregate statistics such as the loss distribution, per-cohort accuracy, and slice metrics while preserving privacy. Shadow rounds run with a small opt-in cohort that reports detailed telemetry for debugging. Server-side holdout clients, maintained through privacy-preserving data collection, allow offline evaluation.
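One way a loss distribution can be surfaced without raw data, sketched below under the assumption that each client reports only a local histogram and the server sees just the element-wise sum (which is what secure aggregation would reveal); the bin edges and helper names are illustrative:

```python
import numpy as np

BIN_EDGES = np.linspace(0.0, 10.0, 21)   # 20 loss buckets, assumed range

def client_loss_histogram(local_losses: np.ndarray) -> np.ndarray:
    """Each client bins its own per-example losses locally."""
    counts, _ = np.histogram(local_losses, bins=BIN_EDGES)
    return counts

def aggregate(histograms: list[np.ndarray]) -> np.ndarray:
    """Stand-in for the secure-aggregation output: only the sum across
    clients is visible, never any individual client's histogram."""
    return np.sum(histograms, axis=0)

# Illustrative round: 300 clients, 128 examples each.
clients = [np.random.exponential(scale=2.0, size=128) for _ in range(300)]
loss_distribution = aggregate([client_loss_histogram(c) for c in clients])
```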
Poisoning attacks inject malicious updates that degrade model quality or insert backdoors. Under secure aggregation the server cannot see individual updates, which makes detection harder. Robust aggregation with coordinate-wise clipping or median aggregation reduces the impact but cannot eliminate it. Auditor cohorts forgo secure aggregation for canary purposes, detecting anomalies before they spread. Client attestation, per-device rate limits, and reputation scoring provide additional defenses; Microsoft and Meta report using device trust signals and rate limits to block repeated poisoning attempts.
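A rough sketch of norm clipping plus coordinate-wise median aggregation. Note that per-update operations like these need visibility into individual updates (for example in an auditor cohort, or applied on-device before submission); the sizes and values below are purely illustrative:

```python
import numpy as np

def clip_update(update: np.ndarray, max_norm: float) -> np.ndarray:
    """Norm-clip a client update so a single device cannot dominate the round."""
    norm = np.linalg.norm(update)
    return update * min(1.0, max_norm / (norm + 1e-12))

def coordinate_wise_median(updates: list[np.ndarray]) -> np.ndarray:
    """Robust alternative to the mean: the median of each coordinate bounds
    the influence of a minority of poisoned updates."""
    return np.median(np.stack(updates), axis=0)

# Illustrative round: 95 benign updates plus 5 poisoned ones pushing every coordinate.
benign = [np.random.normal(0.0, 0.01, size=1000) for _ in range(95)]
poisoned = [np.full(1000, 5.0) for _ in range(5)]
clipped = [clip_update(u, max_norm=1.0) for u in benign + poisoned]
aggregated = coordinate_wise_median(clipped)
```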
💡 Key Takeaways
•Device dropout of 80 to 90 percent is typical in cross-device FL, requiring 5x to 10x over-sampling to meet secure aggregation thresholds of 50 to 200 clients
•Version skew from staggered app rollout corrupts aggregation if clients train incompatible model architectures; enforce strict version matching and reject mismatched clients
•Monitoring is blind without raw data access; use federated analytics with secure aggregation for aggregate metrics, shadow rounds for debugging, and holdout clients for offline evaluation
•Poisoning attacks inject malicious updates under secure aggregation; mitigate with robust aggregation like coordinate-wise median, auditor cohorts, and device attestation with rate limits
•Canary rounds with small cohorts detect bad configurations before full rollout; Google and Apple use phased regional rollout over days to limit blast radius
•Training/serving skew occurs when on-device feature computation differs from training-time computation, causing a 10 to 20 percent accuracy drop; enforce feature parity and version alignment (see the sketch after this list)
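A small sketch of feature-parity enforcement for the training/serving-skew point above: fingerprint the feature specification on both sides and reject devices whose on-device featurization differs from what training assumed. The spec fields and function names are hypothetical:

```python
import hashlib
import json

def feature_spec_fingerprint(spec: dict) -> str:
    """Hash a canonical serialization of the feature computation spec
    (feature names, normalizer version, tokenizer version, etc.) so the
    training pipeline and on-device pipeline can be compared exactly."""
    canonical = json.dumps(spec, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

# Assumed training-side spec for illustration only.
TRAINING_SPEC = {"features": ["prev_token_ids", "time_of_day_bucket"],
                 "normalizer_version": "v7",
                 "tokenizer": "sp-32k-2024-01"}

def client_may_train(device_spec: dict) -> bool:
    """Reject devices whose featurization fingerprint differs from training."""
    return feature_spec_fingerprint(device_spec) == feature_spec_fingerprint(TRAINING_SPEC)
```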
📌 Examples
Google Gboard invites 10,000 clients per round expecting 1,000 completions with a 300-client secure aggregation threshold, handling 90 percent dropout from network changes and battery constraints
A federated keyboard model rollout detects version skew when 40 percent of clients send 10-layer updates and 60 percent send 12-layer updates; the coordinator rejects mismatched clients and retries with version-locked cohorts
Facebook content moderation uses auditor cohorts that forgo secure aggregation to detect poisoned updates attempting to bypass hate speech filters, blocking devices with repeated anomalies