Hardware Efficiency and Speedup Characteristics
How Magnitude Pruning Works
After training, rank all weights by absolute value. Set the smallest X% to zero permanently. For structured pruning, compute an importance score per channel (L1 norm of channel weights is common) and remove lowest-scoring channels. The process: train → rank → prune → fine-tune. Fine-tuning is essential; accuracy drops 5-15% immediately after pruning but recovers to within 1-2% of original after 10-20% of original training epochs.
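The rank-and-zero step (and the L1-norm channel scoring used for structured pruning) can be sketched in a few lines of NumPy. This is a minimal illustration, not a production pruner; the function names and the 90% sparsity target are illustrative choices.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero the smallest `sparsity` fraction of weights by absolute value."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

def channel_importance(conv_weights):
    """L1 norm per output channel for a (out_ch, in_ch, kH, kW) tensor;
    the lowest-scoring channels are the structured-pruning candidates."""
    return np.abs(conv_weights).reshape(conv_weights.shape[0], -1).sum(axis=1)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
pruned = magnitude_prune(w, sparsity=0.9)
print(f"sparsity achieved: {np.mean(pruned == 0):.2f}")
```

In a real pipeline this masking would be followed by the fine-tuning phase described above, typically with the mask held fixed so pruned weights stay at zero.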
Why GPUs Don't Speed Up Sparse Matrices
GPUs achieve speed through massive parallelism, computing thousands of multiply-adds simultaneously over regular, contiguous data. Sparse matrices break this pattern. To multiply a sparse matrix, the GPU must identify the non-zero positions, gather those values through indirect memory accesses, compute the products, and scatter results back. This coordination overhead often exceeds the compute savings: a 90% sparse matrix multiplication can run slower than its dense equivalent. Only specialized sparse tensor cores (such as the 2:4 structured-sparsity units on NVIDIA Ampere and later GPUs) handle sparsity efficiently, and they require the zeros to follow a specific pattern.
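The gather/compute/scatter steps can be made explicit with a compressed sparse row (CSR) matrix-vector product, sketched below in plain NumPy. The point is not performance (this Python loop is slow) but visibility: every non-zero forces an indirect load through `col_idx`, which is exactly the irregular memory traffic that defeats GPU parallelism.

```python
import numpy as np

def sparse_matvec(values, col_idx, row_ptr, x):
    """CSR sparse matrix-vector product with the three steps spelled out."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        start, end = row_ptr[i], row_ptr[i + 1]
        gathered = x[col_idx[start:end]]      # gather: indirect loads
        y[i] = values[start:end] @ gathered   # compute: multiply-add
    return y                                  # scatter: store per row

# Build a ~90% sparse matrix and convert it to CSR form
rng = np.random.default_rng(1)
A = rng.normal(size=(64, 64))
A[rng.random(A.shape) < 0.9] = 0.0

values, col_idx, row_ptr = [], [], [0]
for row in A:
    nz = np.nonzero(row)[0]
    col_idx.extend(nz)
    values.extend(row[nz])
    row_ptr.append(len(values))
values, col_idx = np.array(values), np.array(col_idx)

x = rng.normal(size=64)
assert np.allclose(sparse_matvec(values, col_idx, row_ptr, x), A @ x)
```

The dense `A @ x` touches memory in a perfectly predictable order; the CSR version must chase `col_idx` for every non-zero, which is why unstructured sparsity rarely pays off on hardware built for dense GEMM.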
Structured Pruning Speedups
Removing channels physically shrinks the weight matrices. A 256-channel layer pruned to 128 channels uses matrices half the size, and standard dense GEMM kernels run proportionally faster on smaller matrices with no special hardware. Speedup scales roughly with the fraction of FLOPs removed: pruning 50% of a layer's channels roughly halves that layer's multiply-adds and inference time, and when consecutive layers are both pruned the savings compound.
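A short NumPy sketch makes the dimensional argument concrete: select the top channels by L1 norm and slice the weight matrix, leaving an ordinary smaller dense matrix. The layer sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(256, 512))  # 256 output channels, 512 inputs
x = rng.normal(size=512)

# Keep the 128 output channels with the largest L1 norm
scores = np.abs(W).sum(axis=1)
keep = np.sort(np.argsort(scores)[-128:])
W_pruned = W[keep]               # shape (128, 512): physically smaller

y = W_pruned @ x                 # standard dense GEMM, half the multiply-adds
print(W.shape, "->", W_pruned.shape)
```

Unlike the unstructured case, nothing here needs sparse formats or index arrays: the pruned layer is just a dense matrix with half the rows, so every downstream kernel benefits automatically.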