
Weight-Only Quantization for Large Language Models

Large language models (LLMs) with billions of parameters face a distinctive challenge: most of their size comes from weights, not activations. Weight-only quantization compresses just the weights to INT4 or INT8 while keeping activations in higher precision (typically FP16). This lets models that would otherwise require multiple GPUs fit on a single GPU.

How Weight-Only Quantization Works

Storage format: Weights are stored in a compressed format (INT4 or INT8). At inference time, they are dequantized to FP16 on the fly before each matrix multiplication. This adds compute overhead but dramatically reduces memory requirements.
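The mechanics can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: per-tensor symmetric INT8 quantization of the weights, with dequantization happening just before the matmul while activations stay in full precision throughout.

```python
import numpy as np

def quantize_weights_int8(w):
    """Per-tensor symmetric quantization: FP32 weights -> INT8 values + one FP scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def linear_weight_only(x, q_w, scale):
    """Dequantize the stored INT8 weights on the fly, then run the matmul.
    The activations x are never quantized."""
    w = q_w.astype(np.float32) * scale  # decompress before the matrix multiply
    return x @ w.T

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 128)).astype(np.float32)  # a toy weight matrix
x = rng.normal(size=(4, 128)).astype(np.float32)   # a toy activation batch

q_w, scale = quantize_weights_int8(w)
out = linear_weight_only(x, q_w, scale)  # close to x @ w.T, at 1/4 the weight storage
```

In a real kernel the dequantize-and-multiply is fused so the FP16 copy of the weights never materializes in full; the sketch separates the steps for clarity.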

The memory bottleneck: LLM inference is memory-bandwidth-limited, not compute-limited. Streaming a 70B-parameter model's weights from VRAM takes longer than the arithmetic itself. Smaller weights mean faster loading, and loading dominates total inference time for large models.
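A back-of-the-envelope calculation makes the bandwidth argument concrete. During autoregressive decoding, every weight is read once per generated token, so the per-token floor is roughly total weight bytes divided by memory bandwidth. The ~2 TB/s figure below is an assumed A100-class HBM bandwidth, chosen only for illustration:

```python
def weight_load_time_ms(n_params, bytes_per_weight, bandwidth_gb_s=2000):
    """Lower bound on per-token latency from streaming all weights once
    through memory at the given bandwidth (assumed, A100-class HBM)."""
    total_gb = n_params * bytes_per_weight / 1e9
    return total_gb / bandwidth_gb_s * 1000  # milliseconds

# 70B parameters at different weight widths
for fmt, nbytes in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(fmt, round(weight_load_time_ms(70e9, nbytes), 1), "ms per token")
# FP16 -> 70.0 ms, INT8 -> 35.0 ms, INT4 -> 17.5 ms
```

Halving the weight width halves this floor directly, which is why compression speeds up decoding even though dequantization adds compute.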

INT4 vs INT8 for LLMs

INT4 (4-bit): 8x compression versus FP32. A 70B model shrinks from 280GB to 35GB and fits on a single 40GB A100 GPU. Typical accuracy loss is 1-3% on most tasks, though reasoning-heavy tasks can degrade more.

INT8 (8-bit): 4x compression. A 70B model still needs 70GB, requiring multiple GPUs or CPU offloading. Accuracy nearly matches FP16, making it the better choice when precision matters.
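The sizes quoted above fall out of simple arithmetic: parameters times bits per weight. A small helper reproduces the comparison (scales and other quantization metadata add a few percent on top in practice, ignored here):

```python
def model_size_gb(n_params, bits_per_weight):
    """Raw weight storage in GB, ignoring scale/zero-point metadata."""
    return n_params * bits_per_weight / 8 / 1e9

# A 70B-parameter model at each precision
for fmt, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{fmt}: {model_size_gb(70e9, bits):.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
```

Only the INT4 figure fits under a single 40GB A100; everything wider forces sharding or offloading.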

When Weight Only Quantization Fails

Some weight distributions resist quantization. Outliers (weights 100x larger than average) stretch the quantization scale, wasting precision on the normal range. Techniques like GPTQ and AWQ handle outliers by grouping weights and assigning each group its own scale.

💡 Key Takeaways
Weight-only quantization compresses weights to INT4/INT8 while keeping activations in higher precision
LLM inference is memory-bandwidth-limited - smaller weights load faster, dominating total inference time
INT4 provides 8x compression: 70B model shrinks from 280GB to 35GB, fits on single 40GB GPU
Outlier weights (100x larger than average) cause precision loss - GPTQ/AWQ use per-group scales to handle them
📌 Interview Tips
1. Explain weight-only quantization as a memory bandwidth optimization - the compute overhead of dequantization is offset by faster weight loading
2. Mention specific model sizes: INT4 fits a 70B model on one GPU; INT8 needs multiple GPUs or offloading