ML Model Optimization › Model Pruning (Structured vs Unstructured) · Medium · ⏱️ ~3 min

Hardware Efficiency and Speedup Characteristics

Core Concept
Magnitude pruning is the simplest and most common technique: remove weights closest to zero. The intuition: small weights contribute little to outputs, so removing them minimally affects predictions.

How Magnitude Pruning Works

After training, rank all weights by absolute value. Set the smallest X% to zero permanently. For structured pruning, compute an importance score per channel (L1 norm of channel weights is common) and remove lowest-scoring channels. The process: train → rank → prune → fine-tune. Fine-tuning is essential; accuracy drops 5-15% immediately after pruning but recovers to within 1-2% of original after 10-20% of original training epochs.
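The ranking and pruning steps can be sketched in a few lines of NumPy. This is a minimal illustration of both variants; the function names and the `(out_channels, in_features)` weight layout are assumptions for the example, not any library's API:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Unstructured: zero out the smallest-|w| fraction of weights."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)          # number of weights to remove
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    return weights * (np.abs(weights) > threshold)

def channel_prune(weights, keep_fraction):
    """Structured: score each output channel by its L1 norm and keep
    only the strongest fraction, physically shrinking the matrix."""
    scores = np.abs(weights).sum(axis=1)           # L1 norm per channel
    n_keep = max(1, int(keep_fraction * weights.shape[0]))
    keep = np.sort(np.argsort(scores)[-n_keep:])   # highest-scoring channels
    return weights[keep]
```

Note the key difference: `magnitude_prune` returns a matrix of the same shape with zeros scattered through it, while `channel_prune` returns a genuinely smaller matrix.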

Why GPUs Don't Speed Up Sparse Matrices

GPUs achieve speed through parallelism: computing thousands of multiply-adds simultaneously. Sparse matrices break this pattern. To multiply a sparse matrix, the GPU must identify non-zero positions, gather those values, compute products, and scatter results back. This coordination overhead often exceeds the compute savings, so a 90% sparse matrix multiplication can run slower than its dense equivalent. Only specialized sparse tensor cores (e.g., the 2:4 structured-sparsity support in NVIDIA Ampere and later) handle sparsity efficiently.
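The gather step can be made concrete with a toy sparse matrix-vector product in CSR (compressed sparse row) form, written in plain Python. Real GPU kernels are far more elaborate, but the irregular, data-dependent indexing that defeats parallel hardware is the same:

```python
import numpy as np

def csr_matvec(data, indices, indptr, x):
    """y = A @ x where A is stored in CSR form: `data` holds the non-zero
    values, `indices` their column positions, and `indptr[r]:indptr[r+1]`
    slices out row r. Every non-zero needs an index lookup (a gather)
    before its multiply -- the memory-access pattern GPUs handle poorly."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]   # gather x at a runtime-known index
    return y

# A = [[0, 2, 0],
#      [1, 0, 3]] stored as CSR
data, indices, indptr = np.array([2., 1., 3.]), np.array([1, 0, 2]), np.array([0, 1, 3])
```

A dense kernel, by contrast, reads memory in fixed contiguous strides known at compile time, which is exactly what lets thousands of GPU threads run in lockstep.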

Structured Pruning Speedups

Removing channels physically reduces matrix dimensions. A 256-channel layer pruned to 128 channels uses weight matrices half the size. Standard GEMM operations simply run faster on smaller matrices; no special hardware is needed. Compute scales roughly linearly with channel count: removing 50% of channels cuts that layer's FLOPs by about half, and inference time shrinks accordingly.
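Treating the layer as a single GEMM makes the scaling explicit. A quick sanity check with illustrative numbers (the batch and feature sizes below are assumptions for the example):

```python
def gemm_flops(m, k, n):
    # Multiply-accumulate count for an (m x k) @ (k x n) matrix product.
    return m * k * n

batch, in_feats = 32, 256
full   = gemm_flops(batch, in_feats, 256)   # original: 256 output channels
pruned = gemm_flops(batch, in_feats, 128)   # structured pruning kept 128
print(pruned / full)   # compute cost tracks the channel count linearly
```

This is why structured pruning delivers real wall-clock speedups on commodity hardware while unstructured sparsity often does not: the smaller GEMM needs no index bookkeeping at all.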

💡 Key Takeaways
Magnitude pruning removes weights closest to zero; for structured, use L1 norm per channel
Process: train → rank → prune → fine-tune; accuracy drops 5-15% then recovers to within 1-2%
Fine-tuning typically needs 10-20% of original training epochs to recover accuracy
Sparse matrices don't speed up standard GPUs due to gather/scatter overhead exceeding compute savings
Structured pruning shrinks matrix dimensions directly; 50% channels removed ≈ 50% faster inference
📌 Interview Tips
1. Explain the train-prune-fine-tune pipeline with specific recovery expectations (1-2% accuracy gap)
2. Describe why sparse matrices don't help standard GPUs; this shows deep hardware understanding
3. Mention L1 norm for channel importance scoring, a practical detail that demonstrates hands-on experience