13.1 LoRA
LoRA: Low-Rank Adaptation of Large Language Models
Created Date: 2025-06-07
An important paradigm of natural language processing consists of large-scale pretraining on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example – deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive.
We propose Low-Rank Adaptation, or LoRA, which freezes the pretrained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on par with or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency.
We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.
13.1.1 Introduction
13.1.2 Problem Statement
13.1.3 Aren't Existing Solutions Good Enough?
13.1.4 Our Method
We describe the simple design of LoRA and its practical benefits. The principles outlined here apply to any dense layers in deep learning models, though we only focus on certain weights in Transformer language models in our experiments as the motivating use case.
13.1.4.1 Low-Rank-Parameterized Update Matrices
A neural network contains many dense layers which perform matrix multiplication. The weight matrices in these layers typically have full rank. When adapting to a specific task, Aghajanyan et al. (2020) show that pre-trained language models have a low "intrinsic dimension" and can still learn efficiently despite a random projection to a smaller subspace.
Inspired by this, we hypothesize that the updates to the weights also have a low "intrinsic rank" during adaptation. For a pre-trained weight matrix \(W_0 \in \mathbb{R}^{d \times k}\), we constrain its update by representing the latter with a low-rank decomposition \(W_0 + \Delta W = W_0 + BA\), where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\), and the rank \(r \ll \min(d, k)\).
During training, \(W_0\) is frozen and does not receive gradient updates, while \(A\) and \(B\) contain trainable parameters. Note both \(W_0\) and \(\Delta W = BA\) are multiplied with the same input, and their respective output vectors are summed coordinate-wise. For \(h = W_0 x\), our modified forward pass yields:
\(h = W_0 x + \Delta W x = W_0 x + B A x\)
We illustrate our reparametrization in Figure 1. We use a random Gaussian initialization for \(A\) and zero for \(B\), so \(\Delta W = BA\) is zero at the beginning of training. We then scale \(\Delta W x\) by \(\frac{\alpha}{r}\), where \(\alpha\) is a constant in \(r\). When optimizing with Adam, tuning \(\alpha\) is roughly the same as tuning the learning rate if we scale the initialization appropriately. As a result, we simply set \(\alpha\) to the first \(r\) we try and do not tune it. This scaling helps to reduce the need to retune hyperparameters when we vary \(r\) (Yang & Hu, 2021).
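To make the reparametrization concrete, here is a minimal PyTorch sketch of such a layer, assuming a plain linear layer without bias; the class name `LoRALinear`, the initialization constants, and the default hyperparameters are illustrative choices rather than the released `loralib` API.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA-augmented linear layer (illustrative sketch, no bias)."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 8.0):
        super().__init__()
        # Frozen pre-trained weight W_0; in practice this is copied from the
        # pre-trained model rather than randomly initialized.
        self.weight = nn.Parameter(torch.empty(d_out, d_in), requires_grad=False)
        nn.init.normal_(self.weight)
        # Trainable low-rank factors: A is Gaussian, B is zero, so BA = 0 at the start.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r  # scale the update by alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W_0 x + (alpha / r) * B A x
        base = x @ self.weight.T
        update = (x @ self.lora_A.T) @ self.lora_B.T
        return base + self.scaling * update
```

Because \(B\) is zero-initialized, the layer reproduces the pre-trained output exactly at the start of training.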
A Generalization of Full Fine-tuning
A more general form of fine-tuning allows the training of a subset of the pre-trained parameters. LoRA takes a step further and does not require the accumulated gradient update to weight matrices to have full-rank during adaptation.
This means that when applying LoRA to all weight matrices and training all biases, we roughly recover the expressiveness of full fine-tuning by setting the LoRA rank \(r\) to the rank of the pre-trained weight matrices. In other words, as we increase the number of trainable parameters, training LoRA roughly converges to training the original model, while adapter-based methods converge to an MLP and prefix-based methods to a model that cannot take long input sequences.
No Additional Inference Latency
When deployed in production, we can explicitly compute and store \(W = W_0 + BA\) and perform inference as usual. Note that both \(W_0\) and \(BA\) are in \(\mathbb{R}^{d \times k}\). When we need to switch to another downstream task, we can recover \(W_0\) by subtracting \(BA\) and then adding a different \(B'A'\), a quick operation with very little memory overhead. Critically, this guarantees that we do not introduce any additional latency during inference compared to a fine-tuned model by construction.
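A minimal sketch of the merge and task-switch arithmetic, assuming the shapes defined above (\(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\)) and the \(\alpha/r\) scaling; the helper names are hypothetical.

```python
import torch

@torch.no_grad()
def merge_lora(W0: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               alpha: float, r: int) -> torch.Tensor:
    """Fold the low-rank update into the dense weight: W = W_0 + (alpha/r) * B A."""
    return W0 + (alpha / r) * (B @ A)

@torch.no_grad()
def switch_task(W: torch.Tensor, A_old, B_old, A_new, B_new,
                alpha: float, r: int) -> torch.Tensor:
    """Recover W_0 by subtracting the old update, then add the new task's update."""
    W0 = W - (alpha / r) * (B_old @ A_old)
    return W0 + (alpha / r) * (B_new @ A_new)
```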
13.1.4.2 Applying LoRA to Transformer
In principle, we can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable parameters. In the Transformer architecture, there are four weight matrices in the self-attention module \((W_q, W_k, W_v, W_o)\) and two in the MLP module. We treat \(W_q\) (or \(W_k\), \(W_v\)) as a single matrix of dimension \(d_{model} \times d_{model}\), even though the output dimension is usually sliced into attention heads.
We limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules (so they are not trained in downstream tasks), both for simplicity and parameter-efficiency. We further study the effect of adapting different types of attention weight matrices in a Transformer in Section 7.1. We leave the empirical investigation of adapting the MLP layers, LayerNorm layers, and biases to future work.
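As an illustration, the following sketch injects LoRA into only the query and value projections, reusing the `LoRALinear` class from the sketch above. It assumes the projections are `nn.Linear` modules named `q_proj` and `v_proj` (naming conventions vary across implementations) and, for brevity, ignores biases.

```python
import torch.nn as nn

def add_lora_to_attention(model: nn.Module, r: int = 8, alpha: float = 8.0) -> nn.Module:
    """Sketch: adapt only the query/value projections, freeze everything else."""
    # Freeze every pre-trained parameter; only the injected A/B factors will train.
    for p in model.parameters():
        p.requires_grad = False
    # Collect the target projections first, then swap them in place.
    targets = [
        (parent, name, child)
        for parent in model.modules()
        for name, child in parent.named_children()
        if name in ("q_proj", "v_proj") and isinstance(child, nn.Linear)
    ]
    for parent, name, child in targets:
        lora = LoRALinear(child.in_features, child.out_features, r=r, alpha=alpha)
        lora.weight.data.copy_(child.weight.data)  # carry over the pre-trained W_0
        setattr(parent, name, lora)
    return model
```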
Practical Benefits and Limitations
The most significant benefit comes from the reduction in memory and storage usage. For a large Transformer trained with Adam, we reduce the VRAM usage by up to 2/3 if \(r \ll d_{model}\), as we do not need to store the optimizer states for the frozen parameters.
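In PyTorch terms, only the parameters that still require gradients are handed to the optimizer, so Adam's first- and second-moment buffers are kept for \(A\) and \(B\) alone. A sketch, assuming `model` is the LoRA-injected model from the previous example:

```python
import torch

# Only the LoRA factors still require gradients, so Adam keeps its first- and
# second-moment state for A and B alone, not for the frozen W_0 matrices.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=2e-4)

num_trainable = sum(p.numel() for p in trainable)
num_total = sum(p.numel() for p in model.parameters())
print(f"trainable: {num_trainable:,} / {num_total:,} parameters")
```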
LoRA also has its limitations. For example, it is not straightforward to batch inputs to different tasks with different \(A\) and \(B\) in a single forward pass if one chooses to absorb \(A\) and \(B\) into \(W\) to eliminate additional inference latency. It is, however, possible to leave the weights unmerged and dynamically choose the LoRA modules to use for each sample in a batch in scenarios where latency is not critical.
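For that latency-tolerant case, here is a sketch of keeping the weights unmerged and routing each sample in a batch through its own adapter; the function, argument names, and shapes are illustrative assumptions.

```python
import torch

def lora_multitask_forward(x, task_ids, W0, adapters, alpha: float, r: int):
    """Route each sample through its own (A, B) pair without merging.

    x: (batch, k) inputs; task_ids: (batch,) integer labels;
    W0: (d, k) frozen weight; adapters[t] = (A_t, B_t) with A_t: (r, k), B_t: (d, r).
    """
    out = x @ W0.T  # shared frozen path, computed once for the whole batch
    for t in task_ids.unique().tolist():
        A, B = adapters[int(t)]
        mask = task_ids == t
        out[mask] = out[mask] + (alpha / r) * ((x[mask] @ A.T) @ B.T)
    return out
```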
13.1.5 Empirical Experiments
13.1.6 Related Works
Transformer Language Models
13.1.7 Understanding the Low-Rank Updates
13.1.8 Conclusion and Future Work
Fine-tuning enormous language models is prohibitively expensive in terms of the hardware required and the storage/switching cost for hosting independent instances for different tasks. We propose LoRA, an efficient adaptation strategy that neither introduces inference latency nor reduces input sequence length while retaining high model quality.
Importantly, it allows for quick task-switching when deployed as a service by sharing the vast majority of the model parameters. While we focused on Transformer language models, the proposed principles are generally applicable to any neural networks with dense layers.
There are many directions for future work:
LoRA can be combined with other efficient adaptation methods, potentially providing orthogonal improvement.
The mechanism behind fine-tuning or LoRA is far from clear – how are features learned during pre-training transformed to do well on downstream tasks? We believe that LoRA makes it more tractable to answer this than full finetuning.
We mostly depend on heuristics to select the weight matrices to apply LoRA to. Are there more principled ways to do it?
Finally, the rank-deficiency of \(\Delta W\) suggests that \(W\) could be rank-deficient as well, which can also be a source of inspiration for future work.