Verified · arXiv:2106.09685 · 18 min
Training · LLMs

LoRA: Low-Rank Adaptation of Large Language Models

The update was low-rank all along.

Fine-tuning learns a change to every weight. LoRA bets that change is skinny, and trains a thin detour instead of the whole model, for thousands of times fewer trainable parameters and no added inference cost.

Explaining the paper · LoRA: Low-Rank Adaptation of Large Language Models · Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen · ICLR 2022 · arXiv:2106.09685

What if adapting a 175-billion-parameter model only meant training a few million numbers?

Say you want a base model good at your support tickets, or your legal corpus, or your codebase. The textbook move is fine-tuning: keep training, let every weight drift. It works. It also costs a fortune. You are not just storing a 175-billion-parameter model. While you train it you also keep Adam's bookkeeping for every one of those parameters: a running average of the gradient and of its square, two more numbers each, on top of gradients and activations. The model fits on the GPU. The model plus its training state often does not. And you pay it all over again for the next task, because each one wants its own full copy of the giant.
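
The tally in that paragraph can be written out as arithmetic. This is a rough sketch, not a measurement: it assumes every number is stored in 4 bytes (fp32), which the article does not state, and it leaves activations out because they depend on batch size and sequence length. The function name is made up for illustration.

```python
# Rough training-state tally for full fine-tuning with Adam.
# Assumption (not from the article): every number takes 4 bytes (fp32).
# Activations are left out; they depend on batch size and sequence length.

def full_finetune_state_bytes(n_params: int, bytes_per_number: int = 4) -> int:
    weights = n_params            # the model itself
    gradients = n_params          # one gradient per weight
    adam_moments = 2 * n_params   # Adam's two running averages per weight
    return (weights + gradients + adam_moments) * bytes_per_number

n = 175_000_000_000
total = full_finetune_state_bytes(n)
print(f"weights alone: {n * 4 / 1e12:.1f} TB")
print(f"weights + gradients + Adam state: {total / 1e12:.1f} TB")
```

Whatever the byte width, the training state is about four numbers per weight before activations are counted, and that multiple is what LoRA avoids paying for the frozen weights.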

LoRA (Low-Rank Adaptation, from Microsoft) makes that cost almost disappear, and the idea fits in a sentence. The change fine-tuning needs to make is small. Not small in size. Small in rank: it lives in a low-dimensional corner of the space of possible weight changes. So you don't learn the whole change. You learn a thin stand-in for it, freeze the original weights, and add the stand-in back. Three ideas make that click, and we build them in order: what fine-tuning actually changes, what a matrix's rank measures, and why a skinny matrix can do the job of a fat one.

Fine-tuning is learning an update

Strip a layer down and it is a matrix multiply: a weight matrix $W_0 \in \mathbb{R}^{d\times k}$ turns an input $x$ into an output $h = W_0 x$. Fine-tuning nudges that matrix to $W_0 + \Delta W$, so the layer computes

$$h = (W_0 + \Delta W)\,x = W_0 x + \Delta W x, \qquad \Delta W \in \mathbb{R}^{d\times k} \tag{1}$$

The thing you actually learn is $\Delta W$, the update. And here is the expense in one line: $\Delta W$ has the same shape as $W_0$, all $d\times k$ of it. A model is a stack of these, so the update is another whole model's worth of numbers to learn, store, and keep optimizer state for. The question LoRA asks is simple: does $\Delta W$ really need all those degrees of freedom, or is it hiding a much smaller object?

Rank: what a matrix really holds

A matrix can be big and still carry little. Its rank counts the genuinely independent directions in it: how many of its rows (or columns) you really need, with the rest just combinations of those. A $d\times k$ grid has room for $\min(d,k)$ directions, but a rank-$r$ matrix uses only $r$ of them. That is what lets you write it as a skinny product,

$$\Delta W = B A, \qquad B \in \mathbb{R}^{d\times r},\ \ A \in \mathbb{R}^{r\times k},\ \ r \ll \min(d,k) \tag{2}$$

and storing $B$ and $A$ costs $r(d+k)$ numbers instead of $d\,k$. When $r$ is small that is a huge difference: a thin $d\times r$ column block and a short $r\times k$ row block, multiplied, reconstitute a full-size $d\times k$ update.
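
A small numerical check of the factorization, as a sketch: the sizes are illustrative and not taken from the paper, and the factors are random rather than learned. It shows that the product has the full $d\times k$ shape, that its rank cannot exceed $r$, and how the two storage costs compare.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4   # illustrative sizes, not from the paper

B = rng.standard_normal((d, r))   # thin d x r column block
A = rng.standard_normal((r, k))   # short r x k row block
delta_W = B @ A                   # full-size d x k update

full_params = d * k
factored_params = r * (d + k)

print(delta_W.shape)                      # (64, 48)
print(np.linalg.matrix_rank(delta_W))     # never more than r
print(full_params, factored_params)       # 3072 vs 448
```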

Even when a matrix isn't exactly low-rank, it is often close. Most of what it does lives in a few dominant directions, and the rest is a long tail of tiny corrections. (The best rank-$r$ stand-in is the truncated SVD. That is the Eckart–Young theorem.) Raise $r$ and watch a structured matrix come back from a handful of components:

Figure 1 · low-rank reconstruction
Left, a full matrix. Right, its rank-r reconstruction from r components. A few already rebuild almost all of it, at r(d+k) numbers instead of d·k. That gap is LoRA's whole savings. Example reading at r = 2: r(d+k) = 112 parameters vs d·k = 784 (14%, 7.0× fewer), energy kept 85.2%.
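
The reconstruction in the figure can be sketched with a truncated SVD. The test matrix below is a made-up smooth 28 × 28 pattern plus a little noise, not the figure's data, so the printed numbers will differ from the figure's. The helper names are hypothetical.

```python
import numpy as np

def truncated_svd(M: np.ndarray, r: int) -> np.ndarray:
    """Best rank-r approximation of M in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

def energy_kept(M: np.ndarray, r: int) -> float:
    """Fraction of the squared singular values carried by the top r."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(s[:r] ** 2) / np.sum(s ** 2))

# Hypothetical structured matrix: two smooth rank-1 patterns plus small noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 28)
M = (np.outer(np.sin(2 * np.pi * t), np.cos(2 * np.pi * t))
     + 0.5 * np.outer(t, t)
     + 0.01 * rng.standard_normal((28, 28)))

r = 2
M_r = truncated_svd(M, r)
rel_err = np.linalg.norm(M - M_r) / np.linalg.norm(M)
print(f"rank {r}: energy kept {energy_kept(M, r):.4f}, relative error {rel_err:.4f}")
print(f"params: {r * (28 + 28)} vs {28 * 28}")
```

The two printed quantities are tied together: the squared relative error equals one minus the energy kept, because the error of the truncated SVD is exactly the discarded singular values.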

The update is secretly low-rank

Now the bet, and it is an empirical one. The pretrained $W_0$ is typically full-rank. It uses all its capacity, and LoRA never touches it. The update $\Delta W$ that adapts the model to a task is the part that turns out cheap: its intrinsic rank is low, so the directions it needs to move in are few, even though the matrix it lives in is enormous. The bet was not a blind one. Earlier work showed that fine-tuning a language model can be squeezed into a surprisingly small number of dimensions with little loss. LoRA bakes that into the architecture: hold $\Delta W$ to rank $r$ by construction, and learn only $B$ and $A$.

Why is that a good trade? Look at the curve. As you raise $r$, the trainable parameter count climbs in a straight line, one step per rank. The error of the low-rank stand-in does not. It drops off a cliff, because the dominant directions carry most of the weight. So a small $r$ lands in the sweet spot: nearly all of the quality, a sliver of the parameters.

Figure 2 · the rank tradeoff
Trainable parameters rise linearly with r; reconstruction error falls steeply. The whole game is to live at the elbow, a small r that is already good enough. Example reading: 14% params, 38% err.
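
The shape of that tradeoff can be reproduced on a toy matrix. This sketch assumes a matrix built to have halving singular values (1, 1/2, 1/4, ...), which is an illustrative choice and not a claim about real fine-tuning updates. It sweeps the rank and records the parameter count next to the best achievable relative error.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 40, 30   # illustrative sizes

# Hypothetical matrix with a fast-decaying spectrum: singular values 1, 1/2, 1/4, ...
U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # d x k, orthonormal columns
V, _ = np.linalg.qr(rng.standard_normal((k, k)))   # k x k, orthogonal
M = (U * 0.5 ** np.arange(k)) @ V.T

sv = np.linalg.svd(M, compute_uv=False)
total = np.sum(sv ** 2)

params, errors = [], []
for r in range(1, 9):
    params.append(r * (d + k))                                    # grows linearly in r
    errors.append(float(np.sqrt(np.sum(sv[r:] ** 2) / total)))    # best rank-r relative error
    print(f"r={r}: params={params[-1]:4d}  relative error={errors[-1]:.4f}")
```

Each extra rank costs the same $d + k$ parameters, while on this spectrum each one halves the remaining error, which is the elbow the figure describes.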

LoRA in one line

Now assemble it. Freeze $W_0$. Add a parallel low-rank detour, $A$ projecting the input down to $r$ dimensions and $B$ projecting back up, and sum the two paths:

$$h = W_0 x + \frac{\alpha}{r}\,B A\,x, \qquad A\ \text{random},\ \ B = \mathbf{0}\ \ (\text{at init}) \tag{3}$$

Three details make it work. $B$ starts at zero, so the detour contributes nothing on the first forward pass: training begins exactly at the pretrained model, and the adapter only ever adds what it learns. $A$ starts random so gradients have somewhere to flow. The $\alpha/r$ in front is a fixed gain that separates the update's scale from your choice of $r$, so you can change the rank with less need to re-tune the learning rate. Only $A$ and $B$ ever get a gradient. The frozen $W_0$ carries no optimizer state at all, which is where most of the memory savings come from. The released code pins down the rest: $A$ is initialized Kaiming-uniform (the paper text says Gaussian), $B$ is zeroed, the gain is exactly $\alpha/r$, and a dropout sits on the detour's input.

Stripped to essentials, that is the whole layer (from the authors' loralib):

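# loralib/layers.py: the LoRA layer (simplified; constructor and merged flag are a minimal stand-in)
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Linear):
    def __init__(self, in_features, out_features, r, lora_alpha, lora_dropout=0.0):
        super().__init__(in_features, out_features)   # runs reset_parameters once, before A and B exist
        self.r, self.scaling, self.merged = r, lora_alpha / r, False   # fixed gain alpha / r
        self.lora_dropout = nn.Dropout(lora_dropout)  # dropout on the detour's input
        self.lora_A = nn.Parameter(torch.empty(r, in_features))    # down-projection A
        self.lora_B = nn.Parameter(torch.empty(out_features, r))   # up-projection B
        self.weight.requires_grad = False             # W0 stays frozen
        self.reset_parameters()

    def reset_parameters(self):
        super().reset_parameters()
        if hasattr(self, "lora_A"):
            # A: random (paper says Gaussian); B: zero
            nn.init.kaiming_uniform_(self.lora_A)
            nn.init.zeros_(self.lora_B)

    def forward(self, x):   # W0 stays frozen
        wx = F.linear(x, self.weight, self.bias)
        if self.merged:     # detour already folded into the weight
            return wx
        a  = self.lora_dropout(x) @ self.lora_A.T
        ba = a @ self.lora_B.T
        return wx + ba * self.scaling

    def merge(self):   # fold in at inference
        delta = self.lora_B @ self.lora_A
        self.weight.data += delta * self.scaling
        self.merged = True

Figure 3 · where LoRA plugs in
The pretrained W₀ stays frozen; a thin trainable detour (down to rank r through A, back up through B) runs in parallel and is added back. Same input x, same output h. Only A and B are learned.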
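
The initialization choices can be checked by hand on a tiny example. This is a sketch under assumptions that are not in the article: a squared-error loss against a made-up target, a Gaussian $A$ for simplicity, and no dropout. The gradients are written out analytically for $h = W_0 x + \frac{\alpha}{r} B A x$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 6, 5, 2, 4.0            # illustrative sizes and gain
scaling = alpha / r

W0 = rng.standard_normal((d, k))         # frozen pretrained weight
A = 0.1 * rng.standard_normal((r, k))    # random init (Gaussian here, for simplicity)
B = np.zeros((d, r))                     # zero init
x = rng.standard_normal(k)
y = rng.standard_normal(d)               # made-up regression target

def lora_grads(A, B):
    """Output and gradients of L = 0.5 * ||h - y||^2 for h = W0 x + scaling * B A x."""
    h = W0 @ x + scaling * (B @ (A @ x))
    err = h - y                                   # dL/dh
    grad_B = scaling * np.outer(err, A @ x)       # dL/dB, shape (d, r)
    grad_A = scaling * np.outer(B.T @ err, x)     # dL/dA, shape (r, k)
    return h, grad_A, grad_B

h, grad_A, grad_B = lora_grads(A, B)
print("output equals base model:", np.allclose(h, W0 @ x))
print("B receives a gradient:", bool(np.abs(grad_B).max() > 0))
print("A's gradient is zero while B is zero:", np.allclose(grad_A, 0))

# If A were also zero, B's gradient would vanish too and nothing could start moving.
_, _, grad_B_stuck = lora_grads(np.zeros((r, k)), B)
print("both zero -> B is stuck:", np.allclose(grad_B_stuck, 0))
```

In this setup only $B$ moves on the very first step; once $B$ is nonzero, $A$ starts receiving gradient as well. That is the sense in which a random $A$ gives the gradients somewhere to flow.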

The free lunch: no inference tax

The earlier parameter-efficient idea, adapters, inserted small extra modules in series inside each layer. They trained cheaply but they made the model permanently deeper, so every forward pass paid a latency tax forever. LoRA's detour is in parallel and linear, which means once training is done you can fold it back in:

$$W = W_0 + \frac{\alpha}{r} B A \quad\Longrightarrow\quad h = W x \tag{4}$$

Multiply $B$ and $A$ out, add the result to $W_0$ once, and you are left with an ordinary weight matrix of the original shape. The deployed model is structurally a normal model running a normal matrix multiply, with zero added latency and zero extra parameters at inference. And because the frozen base never changed, you can keep a whole wardrobe of tiny LoRA "skins" (a few megabytes each) for one shared giant and swap them per task, instead of shipping a full fine-tuned copy every time.
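
The merge is easy to verify numerically. In this sketch the adapters are random stand-ins with hypothetical task names, not trained weights. For each adapter it checks that the merged matrix keeps the shape of $W_0$ and gives the same output as the two-path form, while the shared base stays untouched between tasks.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 6, 2, 4.0   # illustrative sizes and gain
scaling = alpha / r

W0 = rng.standard_normal((d, k))   # shared frozen base
adapters = {                       # hypothetical per-task (B, A) pairs
    "support": (rng.standard_normal((d, r)), rng.standard_normal((r, k))),
    "legal": (rng.standard_normal((d, r)), rng.standard_normal((r, k))),
}
x = rng.standard_normal(k)

def forward_unmerged(W0, B, A, x):
    return W0 @ x + scaling * (B @ (A @ x))   # two paths, summed

def merge(W0, B, A):
    return W0 + scaling * (B @ A)             # one matrix, same shape as W0

for task, (B, A) in adapters.items():
    W = merge(W0, B, A)                       # W0 itself is never modified
    same = np.allclose(W @ x, forward_unmerged(W0, B, A, x))
    print(f"{task}: merged shape {W.shape}, outputs match: {same}")
```

Keeping `W0` untouched and building each merged `W` on demand is one way to picture the adapter wardrobe: the per-task state is only the pair `(B, A)`.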

What it costs in practice

Make it concrete. GPT-3's hidden width is $d = 12288$, so a single attention projection $W_q$ is $12288 \times 12288 \approx 1.5\times 10^{8}$ parameters. Full fine-tuning trains every one of those, in every projection, in every layer. A LoRA adapter on that matrix with $r = 8$ trains $8 \times (12288 + 12288) \approx 2\times 10^{5}$. That is $d/(2r) = 768\times$ fewer on one matrix, and the whole-model saving is larger still, because LoRA doesn't even adapt most matrices. The paper touches only a couple of the attention projections (adapting $W_q$ and $W_v$ works well) and leaves the rest frozen. The headline numbers for GPT-3 175B: 10,000× fewer trainable parameters and 3× less GPU memory than full fine-tuning with Adam. The memory drop comes straight from not keeping optimizer state for 175 billion frozen weights.
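
The per-matrix arithmetic, and a rough whole-model extension of it, can be scripted. The hidden width and ranks come from the text; the layer count of 96 is an assumption about GPT-3 175B that the text does not state, and the whole-model totals count only adapters on $W_q$ and $W_v$.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable numbers in one adapter: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

d = 12288                # GPT-3 hidden width (from the text)
n_layers = 96            # assumption: GPT-3 175B depth, not stated in the text
adapted_per_layer = 2    # W_q and W_v
total_model = 175_000_000_000

full_per_matrix = d * d
print(f"one full projection: {full_per_matrix:,}")

for r in (1, 2, 4, 8):
    per_matrix = lora_params(d, d, r)
    whole_model = n_layers * adapted_per_layer * per_matrix
    print(f"r={r}: {per_matrix:,} per matrix "
          f"({full_per_matrix // per_matrix}x fewer), "
          f"{whole_model:,} in total "
          f"(~{total_model / whole_model:,.0f}x fewer than 175B)")
```

Under these assumptions the whole-model ratio lands in the thousands to tens of thousands, the same order as the headline figure. The exact multiple depends on the rank and on which matrices are adapted.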

The knobs are few: which matrices to adapt, and the rank $r$. Ranks as small as 1 or 2 are competitive for the attention projections, and you rarely need more than 8; $\alpha$ you set once. That small a search space is part of why LoRA stuck.

So what does it actually do

It matches the expensive thing. Across RoBERTa, DeBERTa, GPT-2, and GPT-3, LoRA performs on par with full fine-tuning, and sometimes beats it, while training a tiny fraction of the parameters and adding nothing to inference. That a rank as low as 1 or 2 keeps up is the evidence for the hypothesis: if a one-dimensional update can carry a task, the change really was low-rank.

The limits are honest ones. You still choose which matrices to adapt and what rank to use, and a task very far from the base model's training may want a larger $r$. Merging is one-way per deployment: a merged model is fast but task-locked, so live multi-task serving keeps the adapters separate and pays a small cost to add the detour back. And LoRA only adapts an existing model; it leans on a strong pretrained $W_0$ already being there. None of that has dented it. Freezing the giant and training a sliver is now the default way to specialize large models, and the follow-ups (quantize the frozen base to 4-bit and train LoRA on top, plus the rest of the parameter-efficient family) all build on the same observation. The update was low-rank all along. We just stopped paying for the part that wasn't there.

Provenance · Verified against primary literature
LoRA (2021) · Hu et al.: the method, the α/r scaling, the B = 0 initialization, and the GPT-3 results.
loralib (code) · Official implementation, layers.py:100–120. Kaiming-uniform A, zero B, scaling = α/r, and the merge-on-eval that removes inference latency.
Intrinsic dimension (2020) · Aghajanyan et al.: fine-tuning updates live in a low-dimensional subspace.
Adapters (2019) · Houlsby et al.: the bottleneck-module precursor whose inference latency LoRA removes.
Eckart–Young–Mirsky · The best rank-r approximation of a matrix is its truncated SVD.
Correction · The paper text says A is initialized with a random Gaussian; the released code (loralib/layers.py) initializes A with Kaiming-uniform and B with zeros. We follow the code.

Questions you might still have

If W₀ is full-rank, how can a low-rank ΔW be enough?
You are not approximating W₀. It keeps all of its capacity, frozen. You are approximating the small change needed to adapt it, and that change empirically lives in a few directions.

Why initialize B to zero?
So ΔW = BA = 0 at the start: training begins exactly at the pretrained model and the detour only ever adds what it has learned, which keeps early steps stable. A is random so gradients can still flow.

Where does α/r come from, and how do I set it?
It rescales the update so changing r does not force you to re-tune the learning rate. In practice you fix α once and treat α/r as a constant gain, then sweep r on its own.

Footnotes & further reading

  1. The paper: Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen, LoRA: Low-Rank Adaptation of Large Language Models (Microsoft, ICLR 2022). Code: loralib.
  2. The low-intrinsic-dimension finding LoRA builds on: Aghajanyan, Zettlemoyer, Gupta, Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning.
  3. The adapter precursor: Houlsby et al., Parameter-Efficient Transfer Learning for NLP.
  4. That the truncated SVD is the best low-rank approximation: the Eckart–Young–Mirsky theorem.
  5. The most-used follow-up: Dettmers et al., QLoRA, LoRA on a 4-bit quantized frozen base.