Weight-Only Quantization for Large Language Models
Weight-only quantization compresses model weights to INT8 or INT4 while keeping activations in FP16 or BF16. This asymmetric approach is critical for Large Language Models (LLMs), where activation outliers in attention layers can make full INT8 quantization degrade quality by 10 to 20 percent. By quantizing only the weights, you achieve a 4 to 8 times memory reduction with minimal accuracy loss, enabling deployment of 65B-parameter models on a single GPU where they would otherwise require multi-GPU setups.
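As a concrete illustration of the storage-versus-compute split, here is a minimal PyTorch sketch: weights are held as INT8 with one scale per output channel and dequantized on the fly against BF16 activations. The helper names (`quantize_weight_int8`, `weight_only_linear`) are illustrative, not a specific library API.

```python
import torch

def quantize_weight_int8(w):
    """Symmetric per-output-channel INT8 quantization of an [out, in] weight matrix."""
    max_abs = w.abs().amax(dim=1, keepdim=True)        # [out, 1], one range per output channel
    scale = (max_abs / 127.0).clamp(min=1e-8)          # per-channel quantization step
    w_int8 = torch.round(w / scale).clamp(-127, 127).to(torch.int8)
    return w_int8, scale

def weight_only_linear(x, w_int8, scale):
    """FP16/BF16 activations times INT8 weights: dequantize on the fly, then matmul."""
    w_deq = w_int8.to(scale.dtype) * scale             # back to the activation dtype
    return x @ w_deq.t()

# One 4096x4096 projection: 16MB stored at INT8 versus 32MB at FP16 / 64MB at FP32.
w = torch.randn(4096, 4096, dtype=torch.bfloat16)
x = torch.randn(8, 4096, dtype=torch.bfloat16)
w_q, s = quantize_weight_int8(w)
y = weight_only_linear(x, w_q, s)
```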
Activation-aware weight quantization protects channels with high activation magnitudes by assigning them finer quantization scales. During calibration, run the model on representative data, compute per-channel activation magnitudes, then scale quantization ranges inversely to these magnitudes: channels that produce large activations get smaller quantization bins, preserving precision where it matters most. This technique reduces perplexity degradation from 15 percent with naive per-tensor quantization to 2 percent with activation-aware per-channel quantization.
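A minimal sketch of this calibrate-then-scale flow, in the spirit of AWQ: the exponent `alpha`, the function names, and the synthetic calibration data are assumptions for illustration, not a reference implementation. Salient input channels are stretched before quantization, giving them effectively finer bins, and the inverse scale is folded back in at dequantization so the math is unchanged.

```python
import torch

@torch.no_grad()
def calibrate_channel_magnitude(activations):
    """Mean absolute activation per input channel over a calibration batch [tokens, in]."""
    return activations.abs().mean(dim=0)                      # [in]

def awq_style_quantize(w, act_mag, alpha=0.5, n_bits=8):
    """Quantize [out, in] weights with per-input-channel scaling by act_mag**alpha."""
    s = act_mag.clamp(min=1e-5) ** alpha                      # larger scale for salient channels
    w_scaled = w * s                                          # salient columns stretched -> finer bins
    qmax = 2 ** (n_bits - 1) - 1
    step = (w_scaled.abs().amax(dim=1, keepdim=True) / qmax).clamp(min=1e-8)
    w_q = torch.round(w_scaled / step).clamp(-qmax, qmax).to(torch.int8)
    return w_q, step, s

def dequantize(w_q, step, s):
    """Undo both the quantization step and the activation-aware per-channel scale."""
    return (w_q.to(step.dtype) * step) / s

# Calibration pass: run representative prompts and record activations feeding this layer.
calib_acts = torch.randn(2048, 4096)                          # stand-in for real calibration data
w = torch.randn(11008, 4096)                                  # e.g. a feed-forward projection
act_mag = calibrate_channel_magnitude(calib_acts)
w_q, step, s = awq_style_quantize(w, act_mag)
print((w - dequantize(w_q, step, s)).abs().mean())            # reconstruction error
```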
QLoRA extends this by quantizing base model weights to the 4-bit NF4 (NormalFloat 4) format with double quantization of the scale factors themselves, then training small Low-Rank Adaptation (LoRA) modules in FP16. A 65B-parameter model compressed to 4 bits occupies roughly 33GB versus 260GB in FP32, fitting on a single A100 80GB GPU with room left for the gradients and optimizer states of the LoRA parameters. This enables fine-tuning at roughly one eighth of the memory cost with quality matching full-precision fine-tuning.
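One common way to set this up is the Hugging Face transformers/peft/bitsandbytes stack, sketched below. The checkpoint name and LoRA hyperparameters are placeholders, and running it requires a CUDA GPU with bitsandbytes installed.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 bins matched to weight distributions
    bnb_4bit_use_double_quant=True,         # also quantize the quantization scales
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-65b",                 # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,  # illustrative values
    target_modules=["q_proj", "v_proj"],     # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)   # only the LoRA parameters are trainable
model.print_trainable_parameters()
```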
💡 Key Takeaways
•Weight-only quantization to INT8 achieves a 4 times memory reduction for LLMs while avoiding the 10 to 20 percent perplexity increase caused by quantizing activation outliers in transformer attention layers
•Activation-aware weight quantization assigns per-channel scales inversely proportional to activation magnitudes, reducing perplexity degradation from 15 percent with naive quantization to 2 percent
•QLoRA with 4-bit NF4 weights and double quantization enables fine-tuning 65B models on a single 80GB A100, fitting in 33GB versus 260GB for FP32, with quality matching full precision
•INT4 weight formats like NF4 use non-uniform binning optimized for the normal distributions typical of neural network weights, improving representation efficiency versus uniform INT4
•Runtime dequantization overhead for weight-only quantization is typically 10 to 15 percent of total latency, offset by a 2 to 3 times throughput gain from memory bandwidth savings on large models
•Per-channel quantization for weights is critical: transformer feed-forward layers show 30 to 50 percent variance in weight magnitudes across output channels, making per-tensor quantization wasteful (see the sketch after this list)
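The sketch below, assuming a synthetic weight matrix with uneven per-row magnitudes, illustrates why per-channel scales beat a single per-tensor scale in this setting; the helper name and data are illustrative only.

```python
import torch

def quant_error(w, per_channel):
    """Mean absolute INT8 round-trip error with per-channel or per-tensor scales."""
    qmax = 127
    if per_channel:
        max_abs = w.abs().amax(dim=1, keepdim=True)   # one scale per output channel
    else:
        max_abs = w.abs().amax()                      # one scale for the whole tensor
    scale = (max_abs / qmax).clamp(min=1e-8)
    w_q = torch.round(w / scale).clamp(-qmax, qmax)
    return (w - w_q * scale).abs().mean().item()

# Rows with very different magnitudes, as in transformer feed-forward weights.
w = torch.randn(1024, 4096) * torch.logspace(-2, 0, 1024).unsqueeze(1)
print("per-tensor error: ", quant_error(w, per_channel=False))
print("per-channel error:", quant_error(w, per_channel=True))
```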
📌 Examples
Meta LLaMA 70B: INT8 weight-only quantization reduces model size from 280GB (FP32) to 70GB, enabling inference on a single 8xA100 node instead of two nodes and cutting serving cost by 40 percent for a 1.5 percent perplexity increase
Google PaLM 540B: activation-aware INT4 weights with FP16 activations achieve 8x compression to a roughly 270GB model, maintaining 98 percent of baseline quality on few-shot tasks
QLoRA fine-tuning: the Guanaco 65B chatbot, fine-tuned with 4-bit base weights and 16-bit LoRA adapters in 48 hours on a single A100, achieves 99 percent of full-precision fine-tuned quality on the Vicuna benchmark