ML Model Optimization: Model Pruning (Structured vs Unstructured) (Hard, ⏱️ ~3 min)

Critical Tradeoffs: When to Choose Each Pruning Style

When to Choose Structured Pruning

Use structured pruning when: you are deploying to standard GPUs or CPUs without sparse accelerators; latency is the primary concern; or you need guaranteed speedups. Target a 40-60% channel reduction for a 1.5-2x speedup with less than 1% accuracy loss. Beyond roughly 70% removal, accuracy drops sharply (3-5% is typical). It works best for convolutional networks, where channel pruning is natural.
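A minimal NumPy sketch of structured (filter) pruning: rank each output channel of a conv weight by L2 norm and drop the weakest ones. The function and shapes are illustrative, not from a specific library; the point is that the result is a smaller dense tensor, which is why the speedup holds on any hardware.

```python
import numpy as np

def prune_channels(weight: np.ndarray, amount: float) -> np.ndarray:
    """Structured pruning sketch: drop whole output channels (filters)
    with the smallest L2 norms from a conv weight shaped
    (out_channels, in_channels, kH, kW)."""
    norms = np.linalg.norm(weight.reshape(weight.shape[0], -1), axis=1)
    n_keep = weight.shape[0] - int(round(amount * weight.shape[0]))
    keep = np.sort(np.argsort(norms)[-n_keep:])  # surviving filter indices
    return weight[keep]  # smaller dense tensor -> real speedup everywhere

w = np.random.randn(64, 32, 3, 3)
pruned = prune_channels(w, amount=0.5)  # 50% channel reduction
print(pruned.shape)  # (32, 32, 3, 3)
```

In a real network the corresponding input channels of the next layer must be removed as well, which is why frameworks handle this with dependency graphs rather than per-layer slicing.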

When to Choose Unstructured Pruning

Use unstructured pruning when: model size (memory, storage) matters more than inference speed; you are deploying to sparse-aware hardware (newer GPUs with sparse tensor cores, specialized accelerators); or extreme compression is needed. 90-95% sparsity is achievable with only 1-2% accuracy loss. The catch: without sparse hardware, you save only memory, not compute time.
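The same idea at the level of individual weights can be sketched as magnitude pruning: zero every weight below a threshold chosen to hit a target sparsity. Note that the tensor's shape is unchanged, which is exactly why dense hardware gains nothing; only a sparse storage format or sparse kernels realize the savings. The helper below is illustrative, not a library API.

```python
import numpy as np

def magnitude_prune(weight: np.ndarray, sparsity: float):
    """Unstructured pruning sketch: zero the smallest-magnitude weights
    until the given sparsity is reached. The dense shape is unchanged,
    so compute cost is unchanged without sparse kernels."""
    k = int(round(sparsity * weight.size))
    threshold = np.partition(np.abs(weight).ravel(), k - 1)[k - 1]
    mask = np.abs(weight) > threshold
    return weight * mask, mask

w = np.random.randn(512, 512)
pruned, mask = magnitude_prune(w, sparsity=0.9)
print(1 - mask.mean())  # achieved sparsity, ~0.9
```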

Pruning vs Other Optimization

Pruning vs Quantization: Complementary techniques. Prune first to reduce the parameter count, then quantize the remaining weights. Combined, they can achieve 10-20x compression with a 2-3x speedup.

Pruning vs Distillation: Distillation trains a smaller architecture from scratch; pruning shrinks an existing one. Distillation often achieves better accuracy-size tradeoffs but requires more training compute. Use pruning when you want to preserve specific weights or lack the resources for full retraining.
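The prune-then-quantize pipeline can be sketched end to end: magnitude-prune to 90% sparsity, then linearly quantize the surviving weights to int8 and compare storage against the dense float32 baseline. The storage model (1 byte per value plus a 4-byte index) and the resulting ratio are illustrative assumptions, not a benchmark; real sparse formats (e.g. CSR) have different overheads.

```python
import numpy as np

def compress_ratio(weight: np.ndarray, sparsity: float = 0.9) -> float:
    """Prune-then-quantize sketch: magnitude-prune, then int8-quantize
    the survivors. Returns compression vs dense float32 storage."""
    k = int(round(sparsity * weight.size))
    thresh = np.partition(np.abs(weight).ravel(), k - 1)[k - 1]
    kept = weight[np.abs(weight) > thresh]
    scale = np.abs(kept).max() / 127.0
    q = np.round(kept / scale).astype(np.int8)   # 1 byte per survivor
    dense_bytes = weight.size * 4                # float32 baseline
    sparse_bytes = q.size * (1 + 4)              # int8 value + int32 index
    return dense_bytes / sparse_bytes

print(compress_ratio(np.random.randn(1024, 1024)))
```

With index overhead the ratio here lands below the headline 10-20x; the higher figures assume more compact sparse encodings or higher sparsity.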

⚠️ Decision Framework: If target hardware has sparse acceleration → unstructured. If not → structured. If unsure about hardware → structured (safer bet). Always combine with quantization for maximum benefit.
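The decision framework above is small enough to state directly in code; the function name and signature are illustrative. Treating unknown hardware the same as hardware without sparse support encodes the "safer bet" rule.

```python
from typing import Optional

def choose_pruning(has_sparse_hw: Optional[bool]) -> str:
    """Decision framework sketch: sparse acceleration favors
    unstructured pruning; otherwise, or if the target hardware is
    unknown (None), structured is the safer default."""
    return "unstructured" if has_sparse_hw else "structured"

print(choose_pruning(True))   # unstructured
print(choose_pruning(None))   # structured (unknown hardware)
```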

Architecture Compatibility

Transformers respond well to unstructured pruning (individual attention weights can be sparse); CNNs prefer structured pruning (filter pruning maps efficiently to hardware). For mixed architectures, prune convolutions with structured pruning and attention with unstructured pruning, then convert the unstructured portions to a sparse format only if sparse hardware is available.
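That mixed-architecture recipe can be summarized as a per-layer plan; layer-type names and return labels here are illustrative, assuming a model that mixes conv and attention blocks.

```python
def layer_plan(layer_type: str, sparse_hw: bool) -> str:
    """Mixed-architecture sketch: convolutions get structured (filter)
    pruning, attention gets unstructured pruning, and unstructured
    weights are stored in a sparse format only when the hardware can
    exploit it; otherwise they stay dense (memory-only savings)."""
    if layer_type == "conv":
        return "structured"
    if layer_type == "attention":
        return "unstructured (sparse format)" if sparse_hw else "unstructured (dense)"
    return "skip"

print(layer_plan("conv", sparse_hw=False))       # structured
print(layer_plan("attention", sparse_hw=True))   # unstructured (sparse format)
```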

💡 Key Takeaways
- Structured: 40-60% channel removal gives 1.5-2x speedup with <1% accuracy loss; beyond 70%, accuracy drops sharply
- Unstructured: 90-95% sparsity is achievable but saves only memory without sparse hardware
- Pruning + quantization together: 10-20x compression with 2-3x speedup
- Distillation achieves better accuracy-size tradeoffs but needs more training compute than pruning
- Transformers suit unstructured pruning; CNNs suit structured pruning
📌 Interview Tips
1. Give the decision framework: sparse hardware → unstructured, standard hardware → structured
2. Mention combining pruning with quantization for maximum benefit; this shows systems thinking
3. Discuss architecture-specific choices (transformers vs CNNs) to demonstrate breadth