# Parameter-Efficient Fine-Tuning (PEFT)

Parent: [[Mixture of Experts & Adapter Architectures]]

The umbrella term for methods that adapt a pre-trained model to a new task while updating only a tiny fraction of its parameters, usually less than 1%. The frozen base does the heavy lifting. A small trainable surface does the steering.

PEFT emerged from a simple pragmatic problem. A 7B model has 14GB of weights in bf16. Full fine-tuning also needs gradients and optimiser state alongside the weights (Adam keeps two extra tensors per parameter), plus activations: easily 100GB+. Most teams do not have that hardware. More importantly, most tasks do not need it. The information in a domain-specific dataset is small; the adaptation surface should be small too.

Four main families dominate. LoRA and its variants inject low-rank deltas into attention projections. Adapters insert small bottleneck MLPs between existing layers. Prefix and prompt tuning prepend trainable "soft prompts" to the input sequence. IA3 scales intermediate activations with learned vectors. All of them reduce the trainable parameter count by orders of magnitude while retaining most of the quality of full fine-tuning on narrow tasks; a minimal sketch at the end of this note shows the idea for the LoRA case.

The tradeoffs across the family are subtle. LoRA is the most general and the most widely deployed. Prefix tuning struggles on smaller base models. Adapters add inference latency because they introduce sequential bottlenecks. IA3 is extremely parameter-efficient but less expressive. For most production use cases, LoRA is the default and the others are optimisations at the margin.

## Related

- [[LoRA (Low-Rank Adaptation)]]
- [[Mixture of Experts (MoE)]]
- [[Multi-Tenant Model Serving]]

---
Tags: #ai #peft #kp
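
A minimal sketch of the LoRA-family idea described above, assuming a PyTorch-style module: wrap a frozen linear projection with a trainable low-rank delta, then count trainable parameters to see the sub-1% budget. The class name, rank, and scaling values here are illustrative choices, not the API of any particular library.

```python
# Hypothetical sketch: frozen linear layer + trainable low-rank delta (LoRA-style).
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """y = W x + (alpha / r) * B A x, with W frozen and only A, B trainable."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pre-trained projection
            p.requires_grad = False
        # A is small random init, B starts at zero so the delta begins as a no-op.
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling


# Toy-sized attention projection to show the parameter budget.
proj = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8)
trainable = sum(p.numel() for p in proj.parameters() if p.requires_grad)
total = sum(p.numel() for p in proj.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```

With rank 8 on a 4096-wide projection this trains roughly 65k of ~16.8M parameters in the layer, about 0.4%, which is where the "less than 1%" figure comes from when the same trick is applied across the attention projections.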