ML Model Optimization: Model Pruning (Structured vs Unstructured)

Failure Modes and Edge Cases in Production Pruning

Accuracy Collapse at High Sparsity

Accuracy degrades gracefully until a threshold, then collapses. For most networks, this cliff appears at 85-95% sparsity (unstructured) or 70-80% channel removal (structured). Symptoms: validation loss spikes during fine-tuning and never recovers; accuracy on hard examples drops to random while easy examples remain correct. Cause: the network loses representational capacity for fine-grained distinctions. Fix: back off sparsity by 10-15%, use longer fine-tuning, or accept the accuracy-size tradeoff.
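The practical way to locate the cliff is to sweep sparsity levels and evaluate at each one rather than trusting a fixed threshold. Below is a minimal numpy sketch of unstructured magnitude pruning (the function name `magnitude_prune` and the shapes are illustrative, not from any particular framework):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest magnitude becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    return weights * (np.abs(weights) > threshold)

# Sweep sparsity and (in a real setup) evaluate validation accuracy at
# each level; the cliff shows up as a sudden drop, not a gradual slope.
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
for s in (0.5, 0.8, 0.9, 0.95):
    pruned = magnitude_prune(w, s)
    actual = 1.0 - np.count_nonzero(pruned) / pruned.size
    print(f"target {s:.2f} -> actual sparsity {actual:.2f}")
```

In a real pipeline the loop body would run the pruned model on a held-out set; plotting accuracy against sparsity makes the cliff visible before deployment.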

Importance Score Miscalibration

Magnitude-based importance assumes small weights are unimportant. This fails when: weights are small but lie on a critical path (e.g., bottleneck layers); batch normalization rescales weights, making raw magnitudes misleading; or ReLU-style activations mean some weights matter only for specific inputs. Symptoms: pruning removes weights that looked unimportant, yet accuracy drops more than expected. Fix: use gradient-based importance (weight × gradient) instead of pure magnitude; it captures each weight's actual contribution to the output.
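The two criteria can disagree sharply. A small sketch with hand-picked numbers (the values and function names are illustrative) shows the failure case: a tiny weight with a large gradient ranks last by magnitude but first by the weight × gradient criterion.

```python
import numpy as np

def magnitude_importance(w):
    """Classic criterion: |w|."""
    return np.abs(w)

def taylor_importance(w, grad):
    """First-order Taylor criterion: |w * dL/dw| estimates how much
    the loss changes if each weight is zeroed."""
    return np.abs(w * grad)

# A tiny weight on a critical path: its gradient is large, so removing
# it would hurt, but pure magnitude ranks it as the first to prune.
w    = np.array([0.05, 0.9, -0.8, 0.7])
grad = np.array([10.0, 0.1,  0.2, 0.1])

print(np.argmin(magnitude_importance(w)))     # index 0: pruned first by magnitude
print(np.argmax(taylor_importance(w, grad)))  # index 0: most important by Taylor
```

The same index wins both rankings for opposite reasons, which is exactly why magnitude pruning can remove weights that matter.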

Inference Slowdown After Unstructured Pruning

The model runs slower after pruning, not faster. Root cause: sparse matrix formats carry storage overhead; converting between sparse and dense formats adds latency; and framework sparse operations aren't well optimized. At 50% sparsity, a CSR-format matrix actually uses more memory than the original dense matrix, because each nonzero needs a column index as well as a value. Fix: use unstructured pruning only above roughly 80% sparsity, where sparse formats become efficient, or use structured pruning instead.
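The memory crossover is easy to verify arithmetically. A sketch of the standard CSR accounting (assuming 4-byte float values and 4-byte indices; the function names are illustrative):

```python
def dense_bytes(rows, cols, dtype_bytes=4):
    """Dense storage: one value per element."""
    return rows * cols * dtype_bytes

def csr_bytes(rows, cols, sparsity, dtype_bytes=4, index_bytes=4):
    """CSR storage: a value and a column index per nonzero,
    plus one row pointer per row (+1)."""
    nnz = int(rows * cols * (1 - sparsity))
    return nnz * (dtype_bytes + index_bytes) + (rows + 1) * index_bytes

rows, cols = 1024, 1024
for s in (0.5, 0.8, 0.9, 0.95):
    ratio = csr_bytes(rows, cols, s) / dense_bytes(rows, cols)
    print(f"sparsity {s:.2f}: CSR/dense memory = {ratio:.2f}")
```

Note this only accounts for memory: CSR breaks even on size near 50% sparsity with equal-width values and indices, but sparse *kernels* typically need much higher sparsity (the ~80% figure above) before they beat well-tuned dense matmuls in wall-clock time.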

💡 Detection: Always benchmark inference time before and after pruning on target hardware. Paper speedups don't reflect real deployment.
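A minimal benchmarking harness for that check might look like the sketch below (the `benchmark` helper and the dense-matmul stand-in are illustrative). Comparing a dense weight matrix against the same matrix with 90% of entries zeroed shows why zeros alone buy nothing: both go through the same dense kernel.

```python
import time
import numpy as np

def benchmark(fn, x, warmup=3, iters=20):
    """Median wall-clock latency of fn(x); warmup runs discarded."""
    for _ in range(warmup):
        fn(x)
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(x)
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]

rng = np.random.default_rng(2)
w = rng.normal(size=(512, 512)).astype(np.float32)
# Zero the smallest 90% of weights -- still stored and multiplied densely.
w_pruned = np.where(np.abs(w) > np.quantile(np.abs(w), 0.9), w, 0.0)
x = rng.normal(size=(512,)).astype(np.float32)

t_dense = benchmark(lambda v: w @ v, x)
t_pruned = benchmark(lambda v: w_pruned @ v, x)
```

Run the same harness on the actual target device and batch size; only a measured gap there, not a parameter count, justifies shipping the pruned model.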
💡 Key Takeaways
- Accuracy cliff appears at 85-95% sparsity (unstructured) or 70-80% channel removal (structured)
- Symptoms of over-pruning: loss spikes during fine-tuning, hard examples drop to random accuracy
- Magnitude-based importance fails when batch norm rescales weights or weights are on critical paths
- Gradient-based importance (weight × gradient) is more reliable than pure magnitude
- Sparse formats have overhead; only efficient above 80% sparsity
📌 Interview Tips
1. Describe the accuracy cliff phenomenon with specific thresholds - shows you've hit these limits
2. Explain gradient-based importance as an alternative to magnitude when asked about pruning failures
3. Emphasize always benchmarking on target hardware - paper speedups often don't transfer