ML Model Optimization • Model Pruning (Structured vs Unstructured)
Hardware Efficiency and Speedup Characteristics
The compression promised by pruning translates into real latency gains only when the pruning pattern matches your serving hardware. A mismatch between the two is behind many production disappointments, where a 90 percent sparse model runs no faster than the dense baseline.
On Central Processing Units (CPUs) with batch size 1, structured pruning delivers the most reliable wins. A channel pruned ResNet that removes 40 percent of channels cuts compute by roughly 50 percent in the heaviest blocks. For a real time ranking model serving 10,000 queries per second with a 20 millisecond 95th percentile (P95) target, this can drop model stage latency from 12 milliseconds to 7 to 8 milliseconds on Intel Xeon or AMD EPYC processors. The gain comes from performing fewer matrix multiplies with smaller dimensions, which dense Basic Linear Algebra Subprograms (BLAS) libraries handle efficiently. Unstructured sparsity on CPUs typically shows no speedup below 90 percent sparsity due to indexing overhead and poor cache locality from scattered nonzeros.
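A minimal NumPy/SciPy timing sketch of this contrast is below, with illustrative layer sizes: the structured variant is just a smaller dense matrix-vector product, while the 90 percent unstructured variant goes through a CSR sparse kernel and pays indexing overhead. Absolute numbers will vary with your BLAS build and CPU; this only shows how to make the comparison.

```python
# Illustrative CPU timing sketch (made-up sizes; not a rigorous benchmark).
# Compares: a dense layer, the same layer with 40% of output channels physically
# removed (structured pruning), and a 90% unstructured-sparse CSR version.
import time
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
x = rng.standard_normal(2048).astype(np.float32)            # batch size 1 input
w = rng.standard_normal((2048, 2048)).astype(np.float32)    # dense weight (out, in)

w_struct = w[: int(2048 * 0.6)]                              # drop 40% of output rows
keep = rng.random(w.shape) < 0.10                            # keep only 10% of weights
w_unstruct = sparse.csr_matrix(np.where(keep, w, 0).astype(np.float32))

def ms_per_call(fn, iters=300):
    fn()                                                     # warm-up
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters * 1e3

print(f"dense        : {ms_per_call(lambda: w @ x):.3f} ms")
print(f"structured   : {ms_per_call(lambda: w_struct @ x):.3f} ms")
print(f"unstructured : {ms_per_call(lambda: w_unstruct @ x):.3f} ms")
```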
Graphics Processing Units (GPUs) benefit strongly from structured pruning at moderate batch sizes. Reducing a transformer encoder from 12 attention heads to 8 heads per layer and pruning 30 percent of intermediate feedforward width drops FLOPs by 30 to 40 percent. On an NVIDIA A10 GPU with batch size 8, end to end latency might fall from 18 milliseconds to 12 milliseconds. NVIDIA Ampere and newer architectures expose 2:4 sparsity acceleration through sparse tensor cores, delivering up to 2x matrix multiply throughput. In practice, end to end speedups land in the 1.3x to 1.8x range because memory bandwidth, non compute operators, and kernel launch overhead dilute the gains. When you combine structured pruning with INT8 quantization on GPUs, some teams report an additional 1.2x to 1.5x speedup while staying within 1 percent of baseline accuracy.
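The 2:4 pattern itself is simple: within every contiguous group of four weights along the input dimension, at most two are nonzero. Below is a hedged PyTorch sketch that builds such a mask by magnitude; in practice you would use NVIDIA's own tooling (for example the ASP utilities or TensorRT) to generate the pattern and actually run it on sparse tensor cores, so the function name and sizes here are purely illustrative.

```python
# Sketch: enforce a 2:4 sparsity pattern by keeping the 2 largest-magnitude
# weights in every contiguous group of 4 along the input dimension.
import torch

def mask_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    out_f, in_f = weight.shape
    assert in_f % 4 == 0, "input dim must be a multiple of 4"
    groups = weight.abs().reshape(out_f, in_f // 4, 4)
    # Positions of the 2 largest-magnitude weights in each group of 4.
    topk = groups.topk(2, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, topk, 1.0)
    return mask.reshape(out_f, in_f)

w = torch.randn(8, 16)
w_sparse = w * mask_2_to_4(w)
print((w_sparse != 0).float().mean())   # ~0.5: exactly 2 of every 4 survive
```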
Mobile and edge devices see the biggest wins from structured pruning paired with quantization. To sustain 30 frames per second, you have about 33 milliseconds per frame and often only 10 to 15 milliseconds for the neural network inference. Channel pruning that removes 30 to 50 percent of channels from a Convolutional Neural Network (CNN) backbone reduces on device latency by 25 to 45 percent on Apple Neural Engine or Qualcomm Digital Signal Processor (DSP), holding accuracy loss under 1 to 2 percentage points after fine tuning. These accelerators are optimized for dense operations with fixed shapes, so structured pruning maps directly to their execution model.
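The key detail is that the pruned channels are physically removed, so the exported graph is a smaller dense network with fixed shapes rather than a masked one. A minimal two-layer PyTorch sketch of that rewiring follows; real pipelines (for example the torch-pruning library or vendor conversion toolchains) track these cross-layer dependencies automatically, and the 60 percent keep ratio here is just an example.

```python
# Sketch: physically remove conv channels so the result is a smaller dense
# network, which is what fixed-shape mobile accelerators expect.
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 64, 3, padding=1)
conv2 = nn.Conv2d(64, 128, 3, padding=1)

# Rank conv1's output channels by L1 norm of their filters, keep the top 60%.
keep_n = int(64 * 0.6)
scores = conv1.weight.detach().abs().sum(dim=(1, 2, 3))
keep = scores.topk(keep_n).indices.sort().values

# Rebuild both layers, copying only the surviving channels.
slim1 = nn.Conv2d(3, keep_n, 3, padding=1)
slim1.weight.data = conv1.weight.data[keep].clone()
slim1.bias.data = conv1.bias.data[keep].clone()

slim2 = nn.Conv2d(keep_n, 128, 3, padding=1)
slim2.weight.data = conv2.weight.data[:, keep].clone()   # slice input channels
slim2.bias.data = conv2.bias.data.clone()

x = torch.randn(1, 3, 224, 224)
y = slim2(torch.relu(slim1(x)))          # smaller dense ops, same output shape
print(y.shape)                            # torch.Size([1, 128, 224, 224])
```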
💡 Key Takeaways
•CPU speedups from structured pruning appear reliably at batch size 1 to 4, with 40 percent channel pruning yielding a 1.5x to 1.7x speedup on Intel Xeon and AMD EPYC processors
•Unstructured sparsity on CPUs requires over 90 percent sparsity to overcome indexing overhead and cache misses from scattered nonzero weights, making it impractical for most models
•NVIDIA Ampere 2:4 structured sparsity delivers 1.3x to 1.8x end to end speedup on GPUs, not the theoretical 2x, because memory bandwidth and non compute operations limit gains
•Combining structured pruning with INT8 quantization on GPUs yields compound 1.8x to 2.2x speedup, as both optimizations target different bottlenecks: compute FLOPs and memory bandwidth
•Mobile accelerators like Apple Neural Engine and Qualcomm DSP achieve 25 to 45 percent latency reduction from 30 to 50 percent channel pruning because these devices optimize for dense fixed shape operations
•Batch size strongly affects realized speedups; structured pruning gains peak at batch size 1 to 8, while very large batches shift bottlenecks to memory bandwidth where pruning helps less, as the timing sketch after this list illustrates
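A small PyTorch timing sketch of that batch size dependence, using a linear layer with 40 percent of its output features removed. The layer size and the exact shape of the curve are illustrative and will differ across CPUs, GPUs, and BLAS/cuBLAS versions.

```python
# Sketch: how the realized speedup from a structurally slimmed linear layer
# varies with batch size. Illustrative sizes; results are hardware dependent.
import time
import torch

dense = torch.nn.Linear(4096, 4096)
slim = torch.nn.Linear(4096, int(4096 * 0.6))    # 40% of output features pruned

@torch.no_grad()
def ms(layer, x, iters=50):
    layer(x)                                      # warm-up
    t0 = time.perf_counter()
    for _ in range(iters):
        layer(x)
    return (time.perf_counter() - t0) / iters * 1e3

for bs in (1, 8, 64, 512):
    x = torch.randn(bs, 4096)
    d, s = ms(dense, x), ms(slim, x)
    print(f"batch {bs:4d}: dense {d:.2f} ms, slim {s:.2f} ms, speedup {d / s:.2f}x")
```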
📌 Examples
Meta's mobile CNN for content ranking: 30 percent channel pruning + INT8 quantization reduces latency from 22ms to 12ms on iPhone 13 Neural Engine, staying within 1.2 percent accuracy of baseline
Google BERT serving on NVIDIA T4 GPUs: Pruning 4 of 12 attention heads per layer and 25 percent of feedforward width cuts P95 latency from 35ms to 24ms at batch size 4, saving $8K/month in GPU costs
Apple Core ML image classifier: Structured pruning of MobileNetV3 removes 40 percent of channels, reducing on device inference from 18ms to 11ms while maintaining 94.2 percent top 5 accuracy