Critical Tradeoffs: When to Choose Each Pruning Style
When to Choose Structured Pruning
Use structured pruning when deploying to standard GPUs or CPUs without sparse accelerators, when latency is the primary concern, or when you need guaranteed speedups. Target 40-60% channel reduction for a 1.5-2x speedup with less than 1% accuracy loss. Beyond 70% channel removal, accuracy drops sharply (a 3-5 point loss is typical). Best for convolutional networks, where channel pruning is natural.
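The channel reduction above can be sketched as L1-norm filter pruning: score each output channel by the magnitude of its weights and physically drop the weakest half. This is a minimal illustration with made-up tensor shapes and a keep ratio chosen from the 40-60% band, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical conv weight: (out_channels, in_channels, kH, kW)
weights = rng.normal(size=(64, 32, 3, 3))

keep_ratio = 0.5                                   # 50% channel reduction
scores = np.abs(weights).sum(axis=(1, 2, 3))       # L1 norm per output channel
n_keep = int(weights.shape[0] * keep_ratio)
keep_idx = np.sort(np.argsort(scores)[-n_keep:])   # strongest filters, in order

pruned = weights[keep_idx]                         # physically smaller tensor
print(weights.shape, "->", pruned.shape)
```

Because the pruned tensor is genuinely smaller, the speedup needs no special hardware; this is why structured pruning's gains are guaranteed.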
When to Choose Unstructured Pruning
Use unstructured pruning when model size (memory, storage) matters more than inference speed, when deploying to sparse-aware hardware (newer GPUs with sparse tensor cores, specialized accelerators), or when extreme compression is needed. Sparsity of 90-95% is achievable with 1-2% accuracy loss. The catch: without sparse hardware, you save only memory, not compute time.
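The 90% sparsity figure corresponds to magnitude pruning: zero out every weight below the 90th-percentile absolute value. A minimal sketch (random weights standing in for a trained layer):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512))

sparsity = 0.9
threshold = np.quantile(np.abs(w), sparsity)   # magnitude cutoff
mask = np.abs(w) > threshold
w_sparse = w * mask

print(f"sparsity: {1 - mask.mean():.2f}")
# Note: w_sparse has the same shape as w. Dense hardware still performs the
# same number of multiplications, which is exactly the catch described above.
```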
Pruning vs Other Optimization
Pruning vs Quantization: Complementary techniques. Prune first to reduce the parameter count, then quantize the remaining weights. Combined, they achieve 10-20x compression with a 2-3x speedup.

Pruning vs Distillation: Distillation trains a smaller architecture from scratch; pruning shrinks an existing one. Distillation often achieves better accuracy-size tradeoffs but requires more training compute. Use pruning when you want to preserve specific weights or lack the resources for full retraining.
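The prune-then-quantize pipeline can be sketched end to end: magnitude-prune to 90% sparsity, then apply symmetric int8 quantization to the survivors. The shapes and sparsity level here are illustrative assumptions; the naive byte count below ignores the index overhead that real sparse formats must store, which is why practical compression lands nearer the 10-20x cited above than the raw ratio printed here.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)

# Step 1: unstructured magnitude pruning to 90% sparsity.
mask = np.abs(w) > np.quantile(np.abs(w), 0.9)
w = w * mask

# Step 2: symmetric int8 quantization of the surviving weights.
scale = np.abs(w).max() / 127.0
w_q = np.round(w / scale).astype(np.int8)

# Naive storage comparison: fp32 dense vs int8 nonzeros (indices ignored).
dense_bytes = w.size * 4
sparse_q_bytes = int(mask.sum()) * 1
print(f"compression: {dense_bytes / sparse_q_bytes:.0f}x")
```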
Architecture Compatibility
Transformers respond well to unstructured pruning (their large attention and feed-forward weight matrices tolerate high sparsity). CNNs prefer structured pruning (filter pruning maps efficiently to dense hardware). For mixed architectures, prune convolutions structurally and attention weights unstructurally, keeping the unstructured sparsity only if sparse hardware is available to exploit it.
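For a mixed architecture, the policy above amounts to a per-layer dispatch. A hypothetical sketch (the layer-kind labels and function name are illustrative, not from any real framework):

```python
# Hypothetical dispatch: choose a pruning style per layer family, falling back
# to structured pruning for attention when no sparse hardware can exploit
# unstructured sparsity.
def pruning_style(layer_kind: str, sparse_hw: bool) -> str:
    if layer_kind == "conv":
        return "structured"       # filter pruning runs fast on dense hardware
    if layer_kind == "attention":
        return "unstructured" if sparse_hw else "structured"
    return "none"                 # e.g. normalization layers: leave untouched

print(pruning_style("conv", sparse_hw=False))      # structured
print(pruning_style("attention", sparse_hw=True))  # unstructured
```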