# LoRA (Low-Rank Adaptation)
Parent: [[Low-Rank Decomposition & Matrix Factorisation]]
Freeze the base model. Inject a tiny trainable delta. Done.
The precise move: for a target weight matrix W₀ (shape d×k) in the base model, don't update W₀ directly during fine-tuning. Instead, learn a low-rank correction ΔW = BA, where B is d×r, A is r×k, and the rank r is typically 4 to 64, far smaller than d or k. At inference, you use W₀ + BA. The base model is untouched; all you store per task is the tiny A and B.
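A minimal numpy sketch of the forward pass (dimensions and init scale are illustrative; in practice A and B are trained while W₀ stays frozen):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                      # hidden size, adapter rank (r << d)

W0 = rng.standard_normal((d, d))        # frozen base weight, never updated
B = np.zeros((d, r))                    # B starts at zero, so BA = 0 at init
A = rng.standard_normal((r, d)) * 0.01  # small random init (trainable)

x = rng.standard_normal(d)

# Forward pass: base output plus the low-rank correction.
y = W0 @ x + B @ (A @ x)

# Because B is zero-initialised, the adapter is a no-op before training:
# fine-tuning starts exactly at the base model's behaviour.
assert np.allclose(y, W0 @ x)
```

The zero-init of B is the standard trick: training begins from the unmodified base model and the adapter grows away from it.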
This is the architectural move that made the "one base model, hundreds of specialists" pattern practical. A 7B base model takes tens of gigabytes. A single LoRA adapter at r=8 takes a few megabytes. You can store thousands of them on a laptop and swap between them at inference.
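A back-of-envelope check of the size claim, assuming hypothetical 7B-class dimensions (32 layers, hidden size 4096, four attention projections per layer — these numbers are illustrative, not from any specific model card):

```python
d, r = 4096, 8                # hidden size and adapter rank (assumed)
layers, mats = 32, 4          # assumed: q/k/v/o projections per layer

# Each adapted matrix stores an A (r x d) and a B (d x r).
params = layers * mats * (d * r + r * d)
bytes_fp16 = params * 2       # two bytes per parameter in fp16

print(f"{params / 1e6:.1f}M params, {bytes_fp16 / 1e6:.1f} MB in fp16")
# → 8.4M params, 16.8 MB in fp16
```

Tens of megabytes versus tens of gigabytes for the base weights: roughly a 1000× ratio, which is what makes the swap-per-request serving pattern viable.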
Why it works: the updates required to adapt a pre-trained model to a new task are intrinsically low-rank, even though the base weights are not. You are not rebuilding the model from scratch — you are steering it. Steering, it turns out, is a low-dimensional operation.
The knobs worth knowing: rank r (accuracy vs. size), target modules (usually the attention projections, sometimes the FFN), and scaling α (the adapter's output is conventionally scaled by α/r, controlling how hard it pushes against the base). QLoRA extends the idea by quantising the base model to 4-bit while keeping the LoRA updates in higher precision, letting you fine-tune a 65B model on a single 48 GB GPU.
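The α/r scaling also makes merging concrete: at deployment the scaled delta can be folded into W₀, so serving is a single matmul with no added latency. A numpy sketch (random weights stand in for trained ones):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 64, 8, 16       # illustrative: alpha = 2r is a common choice

W0 = rng.standard_normal((d, d))   # frozen base weight
B = rng.standard_normal((d, r))    # trained adapter factors (simulated here)
A = rng.standard_normal((r, d))

# Merge for inference: fold the scaled low-rank delta into the base weight.
W_merged = W0 + (alpha / r) * (B @ A)

# The merged matrix reproduces base-plus-adapter exactly.
x = rng.standard_normal(d)
assert np.allclose(W_merged @ x, W0 @ x + (alpha / r) * (B @ (A @ x)))
```

Merging trades away hot-swapping (the delta is baked in) for zero inference overhead; multi-adapter serving keeps B and A separate and pays the extra matmul instead.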
## Related
- [[Low-Rank Decomposition]]
- [[Parameter-Efficient Fine-Tuning (PEFT)]]
- [[Mixture of Experts (MoE)]]
- [[Multi-Tenant Model Serving]]
---
Tags: #ai #peft #lora #kp