Weight-Only Quantization for Large Language Models
Weight-only quantization compresses model weights to INT8 or INT4 while keeping activations in FP16 or BF16. This asymmetric approach is critical for Large Language Models (LLMs), where activation outliers in attention layers can make full INT8 quantization degrade quality by 10 to 20 percent. By quantizing only the weights, you achieve a 4 to 8 times memory reduction with minimal accuracy loss, enabling deployment of 65B-parameter models on a single GPU where they would otherwise require multi-GPU setups.
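As a concrete illustration of the storage-versus-compute split, here is a minimal PyTorch sketch: weights are held as INT8 with one scale per output channel and dequantized on the fly against BF16 activations. The helper names (`quantize_weight_int8`, `weight_only_linear`) are illustrative, not a specific library API.

```python
import torch

def quantize_weight_int8(w):
    """Symmetric per-output-channel INT8 quantization of an [out, in] weight matrix."""
    max_abs = w.abs().amax(dim=1, keepdim=True)        # [out, 1], one range per output channel
    scale = (max_abs / 127.0).clamp(min=1e-8)          # per-channel quantization step
    w_int8 = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return w_int8, scale

def weight_only_linear(x, w_int8, scale):
    """FP16/BF16 activations times INT8 weights: dequantize on the fly, then matmul."""
    w_deq = w_int8.to(scale.dtype) * scale             # back to the activation dtype
    return x @ w_deq.t()

# One 4096x4096 projection: 16MB stored at INT8 versus 32MB at FP16 / 64MB at FP32.
w = torch.randn(4096, 4096, dtype=torch.bfloat16)
x = torch.randn(8, 4096, dtype=torch.bfloat16)
w_q, s = quantize_weight_int8(w)
y = weight_only_linear(x, w_q, s)
```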
Activation-aware weight quantization protects channels with high activation magnitudes by assigning them finer quantization scales. During calibration, run the model on representative data, compute per-channel activation magnitudes, then scale quantization ranges inversely to these magnitudes: channels that produce large activations get smaller quantization bins, preserving precision where it matters most. This technique reduces perplexity degradation from 15 percent with naive per-tensor quantization to 2 percent with activation-aware per-channel quantization.
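A minimal sketch of this calibrate-then-scale flow, in the spirit of AWQ: the exponent `alpha`, the function names, and the synthetic calibration data are assumptions for illustration, not a reference implementation. Salient input channels are stretched before quantization, giving them effectively finer bins, and the inverse scale is folded back in at dequantization so the math is unchanged.

```python
import torch

@torch.no_grad()
def calibrate_channel_magnitude(activations):
    """Mean absolute activation per input channel over a calibration batch [tokens, in]."""
    return activations.abs().mean(dim=0)                      # [in]

def awq_style_quantize(w, act_mag, alpha=0.5, n_bits=8):
    """Quantize [out, in] weights with per-input-channel scaling by act_mag**alpha."""
    s = act_mag.clamp(min=1e-5) ** alpha                      # larger scale for salient channels
    w_scaled = w * s                                          # salient columns stretched -> finer bins
    qmax = 2 ** (n_bits - 1) - 1
    step = (w_scaled.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    w_q = torch.round(w_scaled / step).clamp(-qmax, qmax).to(torch.int8)
    return w_q, step, s

def dequantize(w_q, step, s):
    """Undo both the quantization step and the activation-aware per-channel scale."""
    return (w_q.to(step.dtype) * step) / s

# Calibration pass: run representative prompts and record activations feeding this layer.
calib_acts = torch.randn(2048, 4096)                          # stand-in for real calibration data
w = torch.randn(11008, 4096)                                  # e.g. a feed-forward projection
act_mag = calibrate_channel_magnitude(calib_acts)
w_q, step, s = awq_style_quantize(w, act_mag)
print((w - dequantize(w_q, step, s)).abs().mean())            # reconstruction error
```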
QLoRA extends this by quantizing base model weights to the 4-bit NF4 (NormalFloat 4) format with double quantization of the scale factors themselves, then training small Low-Rank Adaptation (LoRA) modules in FP16. A 65B-parameter model compressed to 4 bits occupies roughly 33GB versus 260GB in FP32, fitting on a single A100 80GB GPU with room left for the gradients and optimizer states of the LoRA parameters. This enables fine-tuning at roughly one eighth of the memory cost with quality matching full-precision fine-tuning.
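One common way to set this up is the Hugging Face transformers/peft/bitsandbytes stack, sketched below. The checkpoint name and LoRA hyperparameters are placeholders, and running it requires a CUDA GPU with bitsandbytes installed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 bins matched to weight distributions
    bnb_4bit_use_double_quant=True,         # also quantize the quantization scales
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-65b",                 # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative values
    target_modules=["q_proj", "v_proj"],     # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)   # only the LoRA parameters are trainable
model.print_trainable_parameters()
```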
💡 Key Takeaways
•Weight-only quantization to INT8 achieves a 4 times memory reduction for LLMs while avoiding the 10 to 20 percent perplexity increase caused by quantizing activation outliers in transformer attention layers
•Activation-aware weight quantization assigns per-channel scales inversely proportional to activation magnitudes, reducing perplexity degradation from 15 percent with naive quantization to 2 percent
•QLoRA with 4-bit NF4 weights and double quantization enables fine-tuning 65B models on a single 80GB A100, fitting in 33GB versus 260GB for FP32, with quality matching full precision
•INT4 weight formats like NF4 use non-uniform binning optimized for the normal distributions typical of neural network weights, improving representation efficiency versus uniform INT4
•Runtime dequantization overhead for weight-only quantization is typically 10 to 15 percent of total latency, offset by a 2 to 3 times throughput gain from memory bandwidth savings on large models
•Per-channel quantization for weights is critical: transformer feed-forward layers show 30 to 50 percent variance in weight magnitudes across output channels, making per-tensor quantization wasteful (see the sketch after this list)
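The sketch below, assuming a synthetic weight matrix with uneven per-row magnitudes, illustrates why per-channel scales beat a single per-tensor scale in this setting; the helper name and data are illustrative only.

```python
import torch

def quant_error(w, per_channel):
    """Mean absolute INT8 round-trip error with per-channel or per-tensor scales."""
    qmax = 127
    if per_channel:
        max_abs = w.abs().amax(dim=1, keepdim=True)   # one scale per output channel
    else:
        max_abs = w.abs().amax()                      # one scale for the whole tensor
    scale = (max_abs / qmax).clamp(min=1e-8)
    w_q = torch.round(w / scale).clamp(-qmax, qmax)
    return (w - w_q * scale).abs().mean().item()

# Rows with very different magnitudes, as in transformer feed-forward weights.
w = torch.randn(1024, 4096) * torch.logspace(-2, 0, 1024).unsqueeze(1)
print("per-tensor error: ", quant_error(w, per_channel=False))
print("per-channel error:", quant_error(w, per_channel=True))
```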
📌 Examples
Meta LLaMA 70B: INT8 weight-only quantization reduces model size from 280GB (FP32) to 70GB, enabling inference on a single 8xA100 node instead of two nodes and cutting serving cost by 40 percent for a 1.5 percent perplexity increase
Google PaLM 540B: activation-aware INT4 weights with FP16 activations achieve 8x compression to a roughly 270GB model, maintaining 98 percent of baseline quality on few-shot tasks
QLoRA fine-tuning: the Guanaco 65B chatbot, fine-tuned with 4-bit base weights and 16-bit LoRA adapters in 48 hours on a single A100, achieves 99 percent of full-precision fine-tuned quality on the Vicuna benchmark