# LoRA (Low-Rank Adaptation)

Parent: [[Low-Rank Decomposition & Matrix Factorisation]]

Freeze the base model. Inject a tiny trainable delta. Done.

The precise move: for a target weight matrix W₀ of shape d×k in the base model, don't update W₀ directly during fine-tuning. Instead, learn a low-rank correction ΔW = BA, where B is d×r, A is r×k, and r is typically 4 to 64. At inference, you use W₀ + BA (scaled by α/r — see below). The base model is untouched; all you store per task is the tiny A and B.

This is the architectural move that made the "one base model, hundreds of specialists" pattern practical. A 7B base model takes tens of gigabytes. A single LoRA adapter at r=8 takes a few megabytes. You can store thousands of them on a laptop and swap between them at inference.

Why it works: the updates required to adapt a pre-trained model to a new task are intrinsically low-rank, even though the base weights are not. You are not rebuilding the model from scratch — you are steering it. Steering, it turns out, is a low-dimensional operation.

The knobs worth knowing: rank r (accuracy vs. size), target modules (usually attention projections, sometimes FFN), and scaling α (the update is applied as (α/r)·BA, so α controls how hard the adapter pushes against the base). QLoRA extends the idea by quantising the base model to 4-bit while keeping LoRA updates in higher precision, letting you fine-tune 65B models on a single consumer GPU.

## Related

- [[Low-Rank Decomposition]]
- [[Parameter-Efficient Fine-Tuning (PEFT)]]
- [[Mixture of Experts (MoE)]]
- [[Multi-Tenant Model Serving]]

---
Tags: #ai #peft #lora #kp
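The mechanics above fit in a few lines. A minimal NumPy sketch (all shapes and values are illustrative; the B=0, small-random-A init is the usual convention so the adapter starts as a no-op):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r, alpha = 64, 64, 8, 16   # hypothetical layer size and LoRA knobs

W0 = rng.normal(size=(d, k))       # frozen base weight, never updated
B = np.zeros((d, r))               # B starts at zero, so ΔW = BA starts at zero
A = rng.normal(size=(r, k)) * 0.01 # small random init; A and B are the only trainables

def lora_forward(x):
    # base path plus low-rank correction, scaled by alpha / r
    return x @ W0.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(2, k))
# With B still zero, the adapter is a no-op: output equals the base model's.
assert np.allclose(lora_forward(x), x @ W0.T)

# Storage arithmetic: full delta has d*k entries, the adapter only r*(d+k).
full_params = d * k
adapter_params = r * (d + k)
print(adapter_params / full_params)   # 0.25 here; tiny at realistic d, k
```

At realistic sizes the ratio is far smaller: for a 4096×4096 projection at r=8, the adapter is 8·(4096+4096)/4096² ≈ 0.4% of the full matrix — which is why per-task adapters are megabytes, not gigabytes.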