Failure Modes and Edge Cases in Production Pruning
Accuracy Collapse at High Sparsity
Accuracy degrades gracefully until a threshold, then collapses. For most networks, this cliff appears at 85-95% sparsity (unstructured) or 70-80% channel removal (structured). Symptoms: validation loss spikes during fine-tuning and never recovers; accuracy on hard examples drops to random while easy examples remain correct. Cause: the network loses representational capacity for fine-grained distinctions. Fix: back off sparsity by 10-15%, use longer fine-tuning, or accept the accuracy-size tradeoff.
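To make the sparsity knob concrete, here is a minimal numpy sketch of magnitude pruning at a sweep of sparsity levels, the usual way to locate the cliff empirically. The function name `magnitude_prune` and the toy Gaussian weight matrix are illustrative, not from any particular framework.

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256)).astype(np.float32)

# In practice you would fine-tune and measure validation accuracy at each
# level; here we just report how much of the weight norm survives.
for sparsity in (0.5, 0.8, 0.9, 0.95):
    pruned = magnitude_prune(W, sparsity)
    retained = np.linalg.norm(pruned) / np.linalg.norm(W)
    print(f"{sparsity:.0%} sparse -> {retained:.1%} of weight norm retained")
```

Running a sweep like this on the real model, with a short fine-tune per level, is how you find where the graceful regime ends for your network rather than trusting the generic 85-95% range.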
Importance Score Miscalibration
Magnitude-based importance assumes small weights are unimportant. This fails when: weights are small but sit on a critical path (bottleneck layers); batch normalization rescales weights, making raw magnitudes misleading; or ReLU-style activations mean some weights matter only for specific inputs. Symptoms: pruning removes weights that scored as unimportant, yet accuracy drops more than expected. Fix: use gradient-based importance (weight × gradient) instead of pure magnitude; it captures the weight's actual contribution to the output.
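The two criteria can be compared on a toy linear layer. This is an illustrative numpy sketch of the first-order (weight × gradient) score against pure magnitude; the layer, loss, and variable names are invented for the example.

```python
import numpy as np

# Toy linear layer y = W @ x with squared-error loss L = 0.5 * ||y - t||^2.
# First-order importance |w * dL/dw| estimates how much the loss changes
# if w is set to zero; pure magnitude ignores the gradient entirely.
rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(4, 8))
x = rng.normal(size=8)
t = rng.normal(size=4)

y = W @ x
grad = np.outer(y - t, x)          # dL/dW for the squared-error loss
magnitude_score = np.abs(W)
taylor_score = np.abs(W * grad)    # weight x gradient importance

# The criteria can rank weights differently: a small weight with a large
# gradient can matter more than a large weight whose gradient is near zero.
mag_lowest = np.argsort(magnitude_score.ravel())[0]
tay_lowest = np.argsort(taylor_score.ravel())[0]
print("criteria agree on least-important weight:", mag_lowest == tay_lowest)
```

In a real network the gradient would come from backpropagation over a calibration batch, typically averaged over many inputs rather than a single example.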
Inference Slowdown After Unstructured Pruning
The model runs slower after pruning, not faster. Root cause: sparse matrix formats carry storage overhead; conversion between sparse and dense formats adds latency; and framework sparse kernels aren't optimized for moderate sparsity. At 50% sparsity, a CSR matrix can actually use more memory than the original dense matrix, because each stored value also carries an index. Fix: only use unstructured pruning above roughly 80% sparsity, where sparse formats become efficient, or use structured pruning instead.
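The CSR overhead is easy to verify with arithmetic. This sketch assumes float32 values and int32 indices (common defaults, but framework-dependent) and the standard CSR layout of values + column indices + row pointers; the helper name `csr_bytes` is invented for the example.

```python
import numpy as np

def csr_bytes(shape, sparsity, value_dtype=np.float32, index_dtype=np.int32):
    """Approximate CSR storage: values + column indices + row pointers."""
    rows, cols = shape
    nnz = int(round((1.0 - sparsity) * rows * cols))
    values = nnz * np.dtype(value_dtype).itemsize
    col_idx = nnz * np.dtype(index_dtype).itemsize
    row_ptr = (rows + 1) * np.dtype(index_dtype).itemsize
    return values + col_idx + row_ptr

shape = (1024, 1024)
dense = shape[0] * shape[1] * np.dtype(np.float32).itemsize

for sparsity in (0.5, 0.8, 0.9, 0.95):
    ratio = csr_bytes(shape, sparsity) / dense
    print(f"{sparsity:.0%} sparse CSR uses {ratio:.2f}x dense storage")
```

With 4-byte values and 4-byte indices, each stored entry costs 8 bytes versus 4 in the dense array, so CSR only breaks even at about 50% sparsity and only wins well above it, which is the memory half of the "above 80%" rule of thumb.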