
How LoRA Works: Low-Rank Adaptation Mechanics

The Core Insight: Low-Rank Adaptation (LoRA) is based on a mathematical observation: when fine-tuning a pre-trained model, the optimal weight updates often lie in a low-rank subspace. Instead of learning a full dense update matrix ΔW with dimensions m × n (which could be millions of parameters), LoRA decomposes the update into two much smaller matrices.

The Mathematical Trick: For each weight matrix W in the model (typically the attention projections), LoRA introduces two new matrices A and B. Matrix A has dimensions m × r and matrix B has dimensions r × n, where the rank r is much smaller than both m and n. The update becomes ΔW = A × B.

Consider a concrete example: an attention weight matrix might be 4096 × 4096, containing roughly 16.8 million parameters. With LoRA at rank r = 8, you instead learn a matrix A of size 4096 × 8 and a matrix B of size 8 × 4096. That is only 32,768 + 32,768 = 65,536 parameters, a reduction from 16.8 million to about 65 thousand, or 256× fewer parameters.
[Diagram] Frozen base weight W (4096 × 4096, 16.8M params) with trainable low-rank factors A (4096 × 8) and B (8 × 4096). Output = W·x + A·(B·x); only ~65K trainable params.
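To make the arithmetic concrete, here is a small sketch in plain Python using the dimensions from the example above; it simply counts the parameters of a dense update versus the rank-8 factorization:

```python
# Parameter count for the 4096 x 4096 attention projection discussed above.
m, n, r = 4096, 4096, 8

dense_update = m * n           # 16,777,216 params (~16.8M) for a full dense delta-W
lora_update = m * r + r * n    # 32,768 + 32,768 = 65,536 params for A and B

print(dense_update, lora_update, dense_update // lora_update)
# 16777216 65536 256  -> the 256x reduction described above
```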
Training Mechanics: During training, the forward pass computes W·x + A·(B·x) for each adapted layer. The base weight W stays completely frozen, so gradients flow only through A and B. This means optimizer states (momentum and variance for Adam) are stored only for these small matrices: for a 7B-parameter model with LoRA adapters totaling 50M parameters, optimizer states might be roughly 200 MB instead of 84 GB.

Which Layers Get Adapted: Practitioners typically target specific linear projections in the Transformer architecture. Common choices are the query, key, value, and output projections in multi-head attention. Some implementations extend to the feed-forward layers as well: the up, gate, and down projections. Databricks experiments found that targeting all linear layers with rank 8 yielded better results than using rank 16 on attention projections only, suggesting that layer coverage matters more than rank size, up to a point.
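As an illustration of how this layer targeting is typically configured, the following sketch uses the Hugging Face peft library with a Llama-style 7B model; the checkpoint name and the module names such as q_proj and gate_proj are assumptions about the model being adapted, not something prescribed by the text above:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed example checkpoint; any Llama-style causal LM with these module names works.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                # rank of the A and B factors
    lora_alpha=16,      # scaling for the low-rank update (applied as alpha / r)
    lora_dropout=0.05,
    # "All linear layers": attention projections plus the feed-forward projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# get_peft_model freezes the base weights and injects trainable A/B pairs.
model = get_peft_model(model, config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts
```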
✓ In Practice: Matrix A is often initialized to zero or very small values, and matrix B to random values, so that the initial product A × B starts at or near zero. This ensures the adapted model begins with the same behavior as the frozen base, then gradually learns task-specific adjustments.
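A minimal, self-contained PyTorch sketch of one adapted layer shows both the frozen forward pass and the initialization convention described above; the class name, the default rank, and the 0.01 init scale are illustrative choices, not part of any particular library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear wrapped with a trainable low-rank update A · B."""

    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze W (and any bias)
            p.requires_grad_(False)
        out_dim, in_dim = base.weight.shape
        # Per the note above: A starts at zero, B at small random values,
        # so A · B is zero at step 0 and the wrapped layer matches the base.
        self.A = nn.Parameter(torch.zeros(out_dim, r))
        self.B = nn.Parameter(torch.randn(r, in_dim) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = W·x + A·(B·x); gradients flow only through A and B.
        return self.base(x) + (x @ self.B.T) @ self.A.T
```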
Deployment Options: At inference time, you have two choices. You can merge the adapters by computing W′ = W + A × B offline, saving a single merged checkpoint per task; inference then uses this merged weight with no extra computation. Alternatively, you can keep the adapters separate and apply them dynamically, computing W·x + A·(B·x) at runtime. The separate approach adds a small overhead from the extra matrix multiplies (typically a 3 to 7% latency increase), but it lets you load different adapters without duplicating the base model in GPU memory.
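The two deployment modes compute the same result, which a short sketch can verify directly; random values stand in for trained adapter weights here:

```python
import torch

m, n, r = 4096, 4096, 8
W = torch.randn(m, n)          # frozen base weight
A = torch.randn(m, r) * 0.01   # stand-ins for trained LoRA factors
B = torch.randn(r, n) * 0.01
x = torch.randn(n)

# Option 1: merge offline into a single checkpoint; no extra work at inference.
W_merged = W + A @ B
y_merged = W_merged @ x

# Option 2: keep adapters separate; two small extra matmuls per call,
# but one copy of W in GPU memory can serve many different adapters.
y_dynamic = W @ x + A @ (B @ x)

print(torch.allclose(y_merged, y_dynamic, atol=1e-3))  # True, up to float error
```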
💡 Key Takeaways
LoRA decomposes the weight update ΔW into two low-rank matrices A and B, reducing a 4096 × 4096 update (16.8M params) to 65K params at rank 8 (a 256× reduction)
Common ranks range from 4 to 64; experiments show that targeting more layers (all linear projections) with rank 8 often outperforms targeting fewer layers (attention only) with rank 16
During training the base weights W remain frozen, so optimizer states shrink dramatically: roughly 200 MB for adapters versus 84 GB for full fine-tuning of a 7B model
Two deployment modes: merged adapters (W + A × B computed offline) give the lowest latency; separate adapters (applied dynamically) enable multi-tenant serving at a 3 to 7% latency overhead
📌 Examples
1. A 7B model attention layer (4096-dim) with LoRA rank 8 trains 65K params per layer instead of 16.8M, a 99.6% reduction in trainable parameters
2. Targeting 32 attention layers plus 32 feed-forward layers in a 7B model with rank 8 yields roughly 50M trainable parameters in total (about 0.7% of the base model)
3. Inference with merged LoRA adapters: 50 to 150 ms p50 latency on an A100 for short responses, identical to the base quantized model's performance