ML Model Optimization • Model Pruning (Structured vs Unstructured)
Critical Tradeoffs: When to Choose Each Pruning Style
The decision between structured and unstructured pruning hinges on your optimization objective, serving hardware, and business constraints. There is no universal winner; each pattern excels in specific scenarios.
Use structured pruning when latency or cost reduction is your primary goal, especially on general-purpose hardware. If you serve a ranking model on CPUs at batch size 1 to 4 and need to cut 99th-percentile (P99) latency from 25 milliseconds to under 15 milliseconds to meet Service Level Agreements (SLAs), structured channel pruning delivers predictable wins. Removing 40 percent of channels in convolutional layers directly reduces FLOPs and maps to optimized dense BLAS libraries without runtime changes. The tradeoff is lower compression at a given accuracy target compared to unstructured pruning: where unstructured pruning might achieve 85 percent sparsity with 1 percent accuracy loss, structured pruning might only reach 40 percent sparsity at the same accuracy threshold. Structured pruning also offers coarser control. Removing an entire channel affects all spatial locations, so if critical features are distributed across channels, you may hurt accuracy more than removing scattered individual weights.
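The sketch below illustrates structured channel pruning with PyTorch's built-in pruning utilities; the 40 percent ratio and the layer dimensions are arbitrary assumptions, not values from a real serving model.

```python
# Structured (channel-level) pruning sketch using torch.nn.utils.prune.
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)

# Zero out 40 percent of output channels, ranked by L2 norm
# (dim=0 is the output-channel dimension of a Conv2d weight).
prune.ln_structured(conv, name="weight", amount=0.4, n=2, dim=0)

# Fold the mask into the weight tensor permanently.
prune.remove(conv, "weight")
```

Note that PyTorch only masks the pruned channels; to realize the dense-BLAS latency win you still need a graph-level step that physically drops the zeroed channels (and the matching input channels of the following layer) before export.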
Use unstructured pruning when model size dominates your constraints: storage cost, download bandwidth for mobile apps, or memory capacity on edge devices. A 200 megabyte (MB) model that you can compress to 40 MB at 80 percent unstructured sparsity significantly improves user experience for mobile app downloads and reduces Content Delivery Network (CDN) egress costs. Google uses magnitude-based unstructured pruning to shrink TensorFlow Lite models by 4x to 10x for deployment to Android devices. The limitation is that this compression does not translate to faster inference on those devices unless the runtime has sparse kernel implementations for the specific operators. Most mobile runtimes, such as Apple Core ML and the Android Neural Networks API, are optimized for dense operations, so unstructured pruning helps download size but not execution latency.
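A minimal sketch of magnitude-based unstructured pruning with the TensorFlow Model Optimization Toolkit is shown below; the model choice, 80 percent sparsity target, and schedule steps are illustrative assumptions.

```python
# Magnitude-based unstructured pruning sketch targeting 80 percent sparsity.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base_model = tf.keras.applications.MobileNetV2(weights=None)

# Ramp sparsity from 0 to 80 percent over the fine-tuning run.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=10_000
)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    base_model, pruning_schedule=schedule
)
pruned_model.compile(optimizer="adam", loss="categorical_crossentropy")

# Fine-tune with the pruning callback on your own data, e.g.:
# pruned_model.fit(train_ds, epochs=2,
#                  callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before converting to TFLite; compressing the
# flatbuffer is what turns sparsity into a smaller download, not faster kernels.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```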
Consider N:M structured sparsity when you serve on NVIDIA Ampere or newer GPUs and can integrate sparse tensor core operations. A 2:4 pattern (two zeros in every group of four weights) can deliver 1.3x to 1.8x real speedup with modest accuracy loss, but it requires model architecture adjustments and careful profiling to ensure that eligible operators dominate your compute. The investment pays off when GPU costs are significant and you serve high-throughput workloads at batch size 8 to 32. The tradeoff is operational complexity: you need specialized export paths, runtime integration, and continuous profiling, because small model changes can disable sparse acceleration.
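To make the 2:4 pattern concrete, the sketch below imposes it directly on a weight tensor: in every contiguous group of four weights, the two smallest-magnitude values are zeroed. Real deployments would use NVIDIA's tooling plus fine-tuning rather than this hand-rolled mask; the tensor shapes here are arbitrary.

```python
# Hand-rolled 2:4 (N:M) sparsity mask: keep the 2 largest-magnitude
# weights in every group of 4, zero the other 2.
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    flat = weight.reshape(-1, 4)                      # groups of four weights
    _, drop_idx = flat.abs().topk(k=2, dim=1, largest=False)
    mask = torch.ones_like(flat)
    mask.scatter_(1, drop_idx, 0.0)                   # zero the two smallest
    return (flat * mask).reshape(weight.shape)

w = torch.randn(128, 256)      # assumes the group dimension is divisible by 4
w_sparse = apply_2_4_sparsity(w)
assert (w_sparse.reshape(-1, 4) != 0).sum(dim=1).max() <= 2
```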
Combine pruning styles strategically. Some teams use unstructured pruning as a preliminary step before knowledge distillation, removing redundant weights to create a sparse teacher and then distilling to a smaller dense student. Others layer structured pruning for speed, INT8 quantization for bandwidth, and unstructured pruning for final model size, achieving compound 5x to 10x efficiency gains. The risk with stacking techniques is compounding errors: each transformation shifts weight and activation distributions, and if you do not retrain or fine-tune carefully between steps, you can see nonlinear accuracy degradation.
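The toy sketch below stacks two of these steps on a small MLP: structured pruning for speed followed by dynamic INT8 quantization for size. The architecture and ratios are illustrative assumptions; in practice you would fine-tune between (and after) the steps to avoid the compounding accuracy loss described above.

```python
# Stacking sketch: structured pruning, then dynamic INT8 quantization.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Step 1: prune 30 percent of output neurons (rows) of the first layer.
prune.ln_structured(model[0], name="weight", amount=0.3, n=2, dim=0)
prune.remove(model[0], "weight")
# ... fine-tune here before the next transformation ...

# Step 2: dynamic INT8 quantization of the remaining dense Linear layers.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```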
💡 Key Takeaways
•Structured pruning prioritizes latency and cost reduction on CPUs and GPUs at batch size 1 to 8, delivering 1.4x to 2x speedup but achieving only 30 to 50 percent compression versus 80 to 90 percent for unstructured pruning at the same accuracy
•Unstructured pruning excels when model size constrains deployment (mobile download bandwidth, edge device memory), reducing storage by 4x to 10x but providing no latency benefit without sparse kernel support on target hardware
•Structured pruning's coarser granularity removes entire channels, affecting all spatial locations; this can degrade features more than fine-grained unstructured pruning if critical signals are spread across multiple channels
•NVIDIA 2:4 N:M sparsity delivers 1.3x to 1.8x real GPU speedup but requires compute to be dominated by eligible matrix multiply operations and adds operational complexity for model export and runtime integration
•Stacking pruning with quantization and distillation can yield compound 5x to 10x efficiency, but risks compounding errors; each step shifts distributions, and skipping retraining between steps causes 3 to 7 percent nonlinear accuracy loss
•Hardware mismatch is the top failure mode: 85 percent unstructured sparsity gives zero speedup on CPUs without sparse BLAS, while 40 percent structured pruning gives zero compression benefit if only storage matters
📌 Examples
Google TensorFlow Lite mobile deployment: Unstructured magnitude pruning to 80 percent sparsity reduces MobileNetV2 from 14 MB to 3 MB for faster app download, but inference latency unchanged on Android devices without sparse kernel support
Meta real time ranking on Intel Xeon CPUs: Structured channel pruning removes 35 percent of channels, cuts P95 latency from 18ms to 11ms at batch size 1, enabling higher throughput and $12K/month cost savings per serving cluster
NVIDIA recommendation model on A100 GPUs: 2:4 structured sparsity with sparse tensor cores achieves 1.7x throughput improvement at batch size 16, reducing GPU count from 40 to 24 for serving 50K queries per second, saving $80K/year
Apple on device image classification: Structured pruning of 45 percent of channels plus INT8 quantization reduces Core ML model from 25 MB to 7 MB and latency from 28ms to 14ms on iPhone Neural Engine