Failure Modes: When Adversarial Defenses Break in Production
Overfitting to Training Attacks
Adversarial training defends against the specific attack types used during training. Attackers who discover your training methodology can craft novel attacks outside your robustness envelope. A model robust to gradient-based perturbations may be vulnerable to decision-based or transfer attacks. Defense requires diversity—train against multiple attack types.
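The diversity principle can be sketched with a toy logistic-regression model trained on a mix of clean, gradient-based (FGSM-style), and gradient-free perturbations. All data, bounds, and hyperparameters here are illustrative, not a production recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

def fgsm(x, y, w, eps):
    """One-step gradient-sign perturbation for logistic regression."""
    p = 1.0 / (1.0 + np.exp(-x @ w))        # sigmoid prediction
    grad_x = (p - y)[:, None] * w[None, :]  # d(loss)/d(input)
    return np.clip(x + eps * np.sign(grad_x), -1.0, 1.0)

def random_sign(x, eps):
    """Gradient-free perturbation; broadens the robustness envelope
    beyond the single gradient-based attack family."""
    return np.clip(x + eps * rng.choice([-1.0, 1.0], size=x.shape), -1.0, 1.0)

# Toy data: two Gaussian blobs, labels 0 and 1.
x = np.vstack([rng.normal(-0.5, 0.2, (100, 2)), rng.normal(0.5, 0.2, (100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])
w = np.zeros(2)

for _ in range(200):
    # Train on clean data plus BOTH perturbation families, not just one.
    for xb in (x, fgsm(x, y, w, eps=0.1), random_sign(x, eps=0.1)):
        p = 1.0 / (1.0 + np.exp(-xb @ w))
        w -= 0.1 * xb.T @ (p - y) / len(y)  # gradient step on logistic loss

clean_acc = np.mean((x @ w > 0) == y)
```

Swapping the inner tuple for a single attack family reproduces the overfitting failure described above: robustness concentrates on that one perturbation type.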
Warning: Never assume your adversarial training covers all possible attacks. Attackers are creative. Maintain red team exercises that try novel attack strategies against production systems.
Perturbation Bound Mismatch
If training perturbation bounds do not match real attacker capabilities, defenses fail. Bounds too tight: model is not robust to realistic attacks. Bounds too loose: model sacrifices too much accuracy defending against unrealistic attacks. Analyze actual attack data to calibrate perturbation bounds—do not guess.
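One way to calibrate rather than guess is to measure the L-infinity norms of perturbations actually observed in logged attack traffic and set the training bound at a high percentile. A minimal sketch, with synthetic stand-in data for the logged clean/attacked pairs:

```python
import numpy as np

def calibrate_epsilon(clean, attacked, percentile=95):
    """Pick a training perturbation bound from observed attacks:
    the L-inf magnitude that covers `percentile`% of real perturbations."""
    linf = np.abs(attacked - clean).max(axis=1)  # per-sample L-inf norm
    return float(np.percentile(linf, percentile))

# Stand-in for logged (clean input, attacked input) pairs; in practice
# these come from your incident and red-team data.
rng = np.random.default_rng(0)
clean = rng.uniform(0.0, 1.0, (500, 64))
attacked = clean + rng.uniform(-0.05, 0.05, (500, 64))

eps = calibrate_epsilon(clean, attacked)
```

The percentile is a policy knob: raising it trades clean accuracy for coverage of rarer, larger perturbations, which is exactly the tight/loose tradeoff described above.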
Gradient Masking
Some defenses make gradients useless for attack generation without actually improving robustness. The model appears robust because gradient-based attacks fail, but decision-based or transfer attacks still succeed. Test robustness with multiple attack methods, not just gradient-based ones.
Detection Strategy: Compare performance of gradient-based attacks (FGSM, PGD) versus decision-based attacks (boundary attack). If decision-based attacks succeed where gradient attacks fail, you have gradient masking, not true robustness.
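The detection strategy above reduces to comparing success rates across attack families on the same evaluation set. A minimal sketch; the rates and the 0.2 margin are illustrative assumptions, not calibrated thresholds:

```python
def masking_signal(grad_success_rate, decision_success_rate, margin=0.2):
    """Flag likely gradient masking: decision-based attacks (e.g. a
    boundary attack) succeed substantially more often than gradient-based
    attacks (e.g. FGSM/PGD) against the same model and inputs."""
    return decision_success_rate - grad_success_rate > margin

# Hypothetical red-team measurements over one shared evaluation set.
pgd_rate = 0.02       # gradient attacks fail -> model merely *looks* robust
boundary_rate = 0.61  # decision-based attacks still succeed

flagged = masking_signal(pgd_rate, boundary_rate)
```

If both families fail at similar rates, the gap is small and no masking is flagged; the asymmetry is the signal, not the absolute success rate.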
Computational Arms Race
Attackers can invest more compute than defenders. Ensemble defenses are expensive, and attackers can probe until they find transferable attacks that fool all ensemble members. Defense depth matters: do not rely on a single robust model; layer multiple independent detection mechanisms.