Weight Only Quantization for Large Language Models
Large language models (LLMs) with billions of parameters have a distinctive memory profile: at inference time, nearly all memory goes to weights, not activations. Weight only quantization compresses just the weights to INT4 or INT8 while keeping activations in higher precision (typically FP16). This lets models that would otherwise require multiple GPUs fit on a single GPU.
How Weight Only Works
Storage format: Weights are stored in compressed form (INT4 or INT8) along with the scale factors needed to recover them. At inference time, weights are dequantized to FP16 on the fly before each matrix multiplication. This adds compute overhead but dramatically reduces memory requirements.
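A minimal sketch of this flow in NumPy, using per-tensor symmetric INT8. Real kernels fuse the dequantization into the matmul and usually use finer-grained scales; this only illustrates the store-compressed, dequantize-on-the-fly pattern:

```python
import numpy as np

def quantize_int8(w):
    """Per-tensor symmetric INT8: store int8 weights plus one FP32 scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def weight_only_matmul(x, q, scale):
    """Dequantize weights to FP32 on the fly; activations stay full precision."""
    w_deq = q.astype(np.float32) * scale
    return x @ w_deq

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)
x = rng.normal(size=(8, 64)).astype(np.float32)

q, s = quantize_int8(w)        # stored: 1 byte/weight instead of 4
y_ref = x @ w
y_q = weight_only_matmul(x, q, s)
rel_err = np.linalg.norm(y_q - y_ref) / np.linalg.norm(y_ref)
print(f"relative output error: {rel_err:.4f}")
```

Note that only the weight path is quantized; the activation tensor `x` is never touched, which is what distinguishes weight only schemes from full integer quantization.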
The memory bottleneck: LLM inference is memory bandwidth limited, not compute limited. Loading a 70B parameter model from VRAM takes longer than the actual computation. Smaller weights mean faster loading, which dominates total inference time for large models.
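A back-of-envelope calculation makes the bandwidth argument concrete. Assuming round numbers (70B parameters, a hypothetical ~2 TB/s of HBM bandwidth, and one full pass over the weights per generated token):

```python
# Time to stream all weights from VRAM once per generated token.
# Assumed round numbers: 70e9 parameters, 2 TB/s memory bandwidth.
params = 70e9
bandwidth = 2e12  # bytes per second

results = {}
for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    ms_per_token = params * bytes_per_param / bandwidth * 1e3
    results[name] = (gb, ms_per_token)
    print(f"{name}: {gb:.0f} GB of weights, ~{ms_per_token:.1f} ms/token just to stream them")
```

Halving the weight bytes halves the streaming time per token, which is why weight only quantization speeds up decoding even though dequantization adds extra arithmetic.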
INT4 vs INT8 for LLMs
INT4 (4-bit): 8x compression versus FP32. A 70B model shrinks from 280GB to 35GB and fits on a single 40GB A100 GPU. Typical accuracy loss is 1-3% on most tasks, though reasoning-heavy tasks can degrade more.
INT8 (8-bit): 4x compression. A 70B model needs 70GB, requiring multiple GPUs or CPU offloading. Accuracy nearly matches FP16. Better for tasks where precision matters.
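The 8x figure for INT4 comes from packing two 4-bit values into each byte (per-group scale factors add a small overhead, ignored here). A minimal pack/unpack sketch in NumPy:

```python
import numpy as np

def pack_int4(q):
    """Pack signed 4-bit values in [-8, 7] two per byte (even length assumed)."""
    u = (q + 8).astype(np.uint8)               # shift to unsigned [0, 15]
    return u[0::2] | (u[1::2] << 4)            # low nibble, high nibble

def unpack_int4(packed):
    """Recover the signed 4-bit values from packed bytes."""
    lo = (packed & 0x0F).astype(np.int16) - 8
    hi = (packed >> 4).astype(np.int16) - 8
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2] = lo
    out[1::2] = hi
    return out

q = np.array([-8, -1, 0, 7, 3, -5], dtype=np.int8)
packed = pack_int4(q)
restored = unpack_int4(packed)
print(packed.nbytes, q.size)  # 2 bytes of storage per 4 weights
```

Six weights occupy three bytes instead of the 24 bytes FP32 would need, and the round trip is exact because packing only rearranges the already-quantized integers.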
When Weight Only Quantization Fails
Some weight distributions resist quantization. Outliers (weights 100x larger than the typical magnitude) force a large quantization scale, which wastes precision on the normal range where most weights live. Techniques like GPTQ and AWQ mitigate this by grouping weights and assigning each group its own scale.
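A toy demonstration of the grouping idea (not the full GPTQ or AWQ algorithms, which also reorder and rescale weights): with a per-tensor scale, one outlier inflates the scale for all 4096 weights; with per-group scales, the damage is confined to the outlier's group.

```python
import numpy as np

def int4_quant_error(w, group_size):
    """Mean abs error after symmetric INT4 quantization with per-group scales."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # INT4 range [-7, 7]
    q = np.clip(np.round(groups / scales), -7, 7)
    dequant = (q * scales).reshape(w.shape)
    return np.abs(dequant - w).mean()

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)
w[100] = 100.0  # a single outlier ~100x the typical weight magnitude

errs = {}
for gs in (4096, 128):  # per-tensor scale vs. per-group scales
    errs[gs] = int4_quant_error(w, gs)
    print(f"group size {gs}: mean abs error {errs[gs]:.4f}")
```

With a group size of 4096 (one scale for the whole tensor), the quantization step is about 100/7, so nearly every ordinary weight rounds to zero; with groups of 128, only the outlier's group pays that cost.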