
LoRA: Low-Rank Adaptation of Large Language Models

The update was low-rank all along.

Fine-tuning learns a change to every weight. LoRA is built on the assumption that the change is low-rank, and trains a low-rank path instead of the whole model, for thousands-fold fewer parameters and no inference cost.

Explaining the paper: LoRA: Low-Rank Adaptation of Large Language Models · Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen · ICLR 2022 · arXiv:2106.09685

Adapting a 175-billion-parameter model can come down to training a few million numbers.

Say you want a base model good at your support tickets, or your legal corpus, or your codebase. The textbook approach is fine-tuning: keep training, let every weight drift. It works. It also costs a fortune. You are not just storing a 175-billion-parameter model. While you train it you also keep Adam's bookkeeping for every one of those parameters, a running mean and variance, two more numbers each, on top of gradients and activations. The model fits on the GPU. The model plus its training state often does not. And you pay it all over again for the next task, because each task needs its own full copy of the large model.

LoRA (Low-Rank Adaptation, from Microsoft) makes that cost almost disappear, and the idea fits in a sentence. The change fine-tuning needs to make is small. Not small in size. Small in rank: it lives in a low-dimensional corner of the space of possible weight changes. So you don't learn the entire change. You learn a substitute for it, freeze the original weights, and add the substitute back. LoRA stands on three ideas: what fine-tuning actually changes, what a matrix's rank measures, and why a skinny matrix can do the job of a fat one.

Fine-tuning is learning an update

A layer, stripped down, is a matrix multiply: a weight matrix $W_0 \in \mathbb{R}^{d\times k}$ turns an input $x$ into an output $h = W_0 x$. Fine-tuning nudges that matrix to $W_0 + \Delta W$, so the layer computes

$$h = (W_0 + \Delta W)\,x = W_0 x + \Delta W x, \qquad \Delta W \in \mathbb{R}^{d\times k} \tag{1}$$

The thing you actually learn is $\Delta W$, the update. And here is the expense in one line: $\Delta W$ has the same shape as $W_0$, all $d\times k$ of it. A model is a stack of these, so the update is another full model's worth of numbers to learn, store, and keep optimizer state for. So: does $\Delta W$ really need all those degrees of freedom, or does a much smaller object suffice? Before answering, look at the bill itself: what each approach has to keep in memory while it trains, as the model grows.

Figure 1 · what fine-tuning keeps in memory (interactive; shown at model size 175B params)

Memory in multiples of the weight memory (activations not shown). Full fine-tuning holds the weights plus a same-sized gradient tensor plus Adam's two running moments, every parameter trained. LoRA holds the same weights frozen, read-only, no optimizer state, and trains only a slim adapter, drawn here larger than life so it shows at all. The ratio uses the paper's GPT-3 recipe, rank-4 adapters on $W_q$ and $W_v$.
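The figure's accounting can be sketched in a few lines. This is a back-of-the-envelope model, not a measurement: it assumes weights, gradients, and both Adam moments cost the same number of bytes per parameter, it ignores activations, and the function name `training_state_multiple` is made up for this example. The LoRA fraction of 1/10,000 is the headline GPT-3 ratio quoted later in the text.

```python
def training_state_multiple(trainable_fraction: float, adapter: bool) -> float:
    """Memory held while training, in multiples of the base weight memory.

    Simplifying assumptions (not from the paper): weights, gradients and both
    Adam moments cost the same bytes per parameter; activations are ignored.
    """
    base_weights = 1.0
    # With an adapter the trainable numbers are extra; otherwise they are the base weights.
    extra_weights = trainable_fraction if adapter else 0.0
    gradients = trainable_fraction            # one gradient per trainable parameter
    adam_moments = 2.0 * trainable_fraction   # running mean and running variance
    return base_weights + extra_weights + gradients + adam_moments


full_ft = training_state_multiple(1.0, adapter=False)        # every weight trains
lora = training_state_multiple(1.0 / 10_000, adapter=True)   # only the adapter trains

print(f"full fine-tuning: {full_ft:.4f}x the weight memory")
print(f"LoRA:             {lora:.4f}x the weight memory")
```

Under these assumptions full fine-tuning holds four copies' worth of state and LoRA barely more than one. Because the sketch leaves out activations and precision choices, it will not reproduce measured figures such as the paper's 3× GPU memory reduction.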

Rank: what a matrix really holds

A matrix can be big and still carry little. Its rank counts the genuinely independent directions in it: how many of its rows (or columns) you really need, with the rest just combinations of those. A $d\times k$ grid has room for $\min(d,k)$ directions, but a rank-$r$ matrix uses only $r$ of them. Because of that you can write it as a narrow product,

$$\Delta W = B A, \qquad B \in \mathbb{R}^{d\times r},\ \ A \in \mathbb{R}^{r\times k},\ \ r \ll \min(d,k) \tag{2}$$

and storing $B$ and $A$ costs $r(d+k)$ numbers instead of $d\,k$. When $r$ is small that is a huge difference: a thin $d\times r$ column block and a short $r\times k$ row block, multiplied, reconstitute a full-size $d\times k$ update.
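To make the factorization concrete, here is a small standard-library sketch. The matrices are toy values chosen for this example (so that $\Delta W_{ij} = 1 + i\,j$), and `matmul` and `rank` are helper names introduced here, not part of any library. It builds a $28\times 28$ update from two thin factors, counts the numbers stored, and confirms by exact elimination that the product has rank 2.

```python
from fractions import Fraction


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def rank(M):
    """Rank via Gaussian elimination over exact fractions (no rounding error)."""
    rows = [[Fraction(v) for v in row] for row in M]
    found = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(found, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for i in range(found + 1, len(rows)):
            factor = rows[i][col] / rows[found][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[found])]
        found += 1
    return found


d, k, r = 28, 28, 2
B = [[1, i] for i in range(d)]             # d x r: a column of ones and a ramp
A = [[1] * k, list(range(k))]              # r x k: a row of ones and a ramp
delta_W = matmul(B, A)                     # d x k, entry (i, j) equals 1 + i*j

params_factored = r * (d + k)              # numbers stored in B and A
params_full = d * k                        # numbers in the full update

print(f"shape of delta_W: {len(delta_W)} x {len(delta_W[0])}")
print(f"rank of delta_W:  {rank(delta_W)}")
print(f"stored numbers:   {params_factored} vs {params_full}")
```

The product fills every one of the 784 cells, yet only 112 numbers were stored, and no choice of $B$ and $A$ at this width can push the rank above $r$.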

Even when a matrix isn't exactly low-rank, it is often close. Most of what a structured matrix does lives in a few dominant directions, and the rest is a long tail of tiny corrections. (The best rank-$r$ stand-in is the truncated SVD, the singular value decomposition, which sorts a matrix into its strongest directions and keeps the top $r$. That is the Eckart–Young theorem.) A structured matrix comes back from a handful of components:

Figure 2 · low-rank reconstruction (interactive; shown at r = 2)

params: r(d+k) = 112 vs d·k = 784 (14%, 7.0× fewer) · energy kept 85.2%

Left, a full matrix. Right, its rank-r reconstruction from r components. A few already rebuild almost all of it, at r(d+k) numbers instead of d·k. That gap accounts for all of LoRA's parameter saving.
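Written out, the truncated-SVD statement reads as follows. The symbols $\sigma_i$, $u_i$, $v_i$ and $W_r$ are notation introduced here, and the last line is one common definition of "energy kept"; whether the figure uses exactly this definition is an assumption.

```latex
\begin{aligned}
W &= \sum_{i=1}^{\min(d,k)} \sigma_i\, u_i v_i^{\top},
  \qquad \sigma_1 \ge \sigma_2 \ge \dots \ge 0
  && \text{(SVD)} \\
W_r &= \sum_{i=1}^{r} \sigma_i\, u_i v_i^{\top}
  && \text{(keep the top } r \text{ directions)} \\
\min_{\operatorname{rank}(X) \le r} \lVert W - X \rVert_F^2
  &= \lVert W - W_r \rVert_F^2 = \sum_{i > r} \sigma_i^2
  && \text{(Eckart--Young)} \\
\text{energy kept} &= \frac{\sum_{i \le r} \sigma_i^2}{\sum_{i} \sigma_i^2}
\end{aligned}
```

The error left after truncation is exactly the discarded tail of squared singular values, so when those values decay quickly a small $r$ already captures most of the matrix.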

The update has low intrinsic rank

The empirical claim LoRA rests on is this. The pretrained $W_0$ is full-rank. It uses all its capacity, and LoRA never touches it. The update $\Delta W$ that adapts the model to a task is the cheap part: its intrinsic rank is low, so it moves in only a few directions, even though the matrix it lives in is enormous. Earlier work showed that fine-tuning a language model can be squeezed into a surprisingly small number of dimensions with little loss. LoRA bakes that into the architecture: hold $\Delta W$ to rank $r$ by construction, and learn only $B$ and $A$.

In hindsight there is an intuition for why the change can be small when the weights are not: pretraining had to build general-purpose machinery, and that uses the full width of the matrix. Adapting that machinery to one downstream task mostly means re-aiming a few existing directions rather than building new ones, a surgical tweak, and a tweak that touches only a few directions is by definition a low-rank matrix.

Why is that a good trade? As you raise $r$, the trainable parameter count climbs in a straight line, one step per rank. The error of the low-rank approximation does not. It falls steeply, because the dominant directions account for most of the reconstruction. So a small $r$ lands in the sweet spot: nearly all of the quality, a sliver of the parameters.

Figure 3 · the rank tradeoff (interactive; shown at 14% params, 38% error)

Trainable parameters rise linearly with r; reconstruction error falls steeply. You want to sit at the elbow, a small r that is already good enough.
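The shape of that tradeoff can be reproduced with a toy spectrum. The sketch below assumes singular values that decay geometrically (the `decay` value of 0.7 is invented for illustration and is not taken from the paper or from the figure), and uses the fact that the truncated SVD's squared error is the sum of the discarded squared singular values.

```python
import math

d = k = 28                                   # illustrative matrix size
decay = 0.7                                  # ASSUMED geometric decay of singular values
sigma = [decay ** i for i in range(min(d, k))]
total_energy = sum(s * s for s in sigma)


def param_fraction(r: int) -> float:
    """Share of the full d*k parameters used by rank-r factors B and A."""
    return r * (d + k) / (d * k)


def relative_error(r: int) -> float:
    """Relative Frobenius error of the best rank-r approximation."""
    tail_energy = sum(s * s for s in sigma[r:])
    return math.sqrt(tail_energy / total_energy)


for r in (1, 2, 4, 8):
    print(f"r={r}: {param_fraction(r):.1%} of params, {relative_error(r):.1%} error")
```

Each extra rank adds the same number of parameters, while under this assumed spectrum the error shrinks by a constant factor per rank. How quickly real fine-tuning updates decay is an empirical question the sketch does not answer.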

LoRA in one line

The pieces now assemble. Freeze $W_0$. Add a parallel low-rank detour, $A$ projecting the input down to $r$ dimensions and $B$ projecting back up, and sum the two paths:

$$h = W_0 x + \frac{\alpha}{r}\,B A\,x, \qquad A\ \text{random},\ \ B = \mathbf{0}\ \ (\text{at init}) \tag{3}$$

$B$ starts at zero, so the path does nothing on the first step. Training begins exactly at the pretrained model, and the adapter only ever adds what it learns. $A$ starts random so gradients have somewhere to flow. The $\alpha/r$ in front is a fixed gain that separates the update's scale from your choice of $r$, so you can change the rank with much less need to re-tune the learning rate. Only $A$ and $B$ ever get a gradient. The frozen $W_0$ carries no optimizer state at all, and that accounts for most of the memory savings. The released code pins down the rest: $A$ is initialized with Kaiming-uniform random values (a standard init that keeps gradient magnitudes well-scaled; the paper text says Gaussian), $B$ is zeroed, the gain is exactly $\alpha/r$, and a dropout sits on the path's input.
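Why one factor must start nonzero follows from the chain rule applied to Eq. (3). The shorthand $s = \alpha/r$, the loss $\mathcal{L}$, and $g = \partial\mathcal{L}/\partial h$ are notation introduced here, and dropout is left out for simplicity.

```latex
\begin{aligned}
h &= W_0 x + s\,B A x, \qquad s = \frac{\alpha}{r}, \qquad
     g = \frac{\partial \mathcal{L}}{\partial h} \\
\frac{\partial \mathcal{L}}{\partial B} &= s\, g\,(A x)^{\top}
  && \text{nonzero at init, because } A \ne 0 \\
\frac{\partial \mathcal{L}}{\partial A} &= s\, B^{\top} g\, x^{\top}
  && \text{zero at init, because } B = 0
\end{aligned}
```

On the first step only $B$ moves; once $B$ is nonzero, $A$ starts receiving gradient too. If both factors started at zero, both gradients would vanish and the path could never leave zero.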

The shapes make this cheap. $B$ is tall and narrow, $A$ is short and wide, and their product is a full $d \times k$ matrix (square, $d = k$, in the figure below). Every cell of $\Delta W$ comes from one row of $B$ and one column of $A$:

Figure 4 · two thin factors make a full matrix (interactive; shown at r = 2, 80 of 400 numbers)

Laid out like a multiplication table: A above, B to the left, their product ΔW = BA below. Every cell of ΔW is the dot product of one row of B with one column of A. At r = 1 the update is a single outer product and visibly striped; raising r adds structure, yet the trained numbers stay 2·d·r against the d² they produce. At GPT-3 width with r = 8 that gap is the page's 770×.

Stripped to essentials, that is the full layer (from the authors' loralib):

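# loralib/layers.py: the LoRA layer (simplified; the constructor is condensed
# for illustration and its default values are placeholders, not loralib's)
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Linear):
    def __init__(self, in_features, out_features, r=8, lora_alpha=16, lora_dropout=0.0):
        super().__init__(in_features, out_features, bias=False)
        self.scaling = lora_alpha / r                               # fixed gain alpha / r
        self.lora_dropout = nn.Dropout(lora_dropout)
        self.lora_A = nn.Parameter(torch.empty(r, in_features))    # r x k
        self.lora_B = nn.Parameter(torch.empty(out_features, r))   # d x r
        self.weight.requires_grad = False                           # W0 stays frozen
        self.reset_parameters()

    def reset_parameters(self):
        super().reset_parameters()
        if hasattr(self, "lora_A"):   # not yet defined when nn.Linear.__init__ calls this
            # A: random (paper says Gaussian); B: zero
            nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
            nn.init.zeros_(self.lora_B)

    def forward(self, x):   # frozen path plus scaled low-rank path
        wx = F.linear(x, self.weight)
        a  = self.lora_dropout(x) @ self.lora_A.T    # project down to r dimensions
        ba = a @ self.lora_B.T                        # project back up to d
        return wx + ba * self.scaling

    def merge(self):   # fold in once at inference; loralib then skips the extra path
        delta = self.lora_B @ self.lora_A
        self.weight.data += delta * self.scaling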

Figure 5 · where LoRA plugs in

The pretrained W₀ stays frozen; a trainable low-rank path (down to rank r through A, back up through B) runs in parallel and is added back. Same input x, same output h. Only A and B are learned.

Merge it back: no inference cost

The earlier parameter-efficient idea, adapters, inserted small extra modules in series inside each layer. They trained cheaply but they made the model permanently deeper, so every forward pass paid a latency tax forever. LoRA's low-rank path is in parallel and linear, which means once training is done you can fold it back in:

$$W = W_0 + \frac{\alpha}{r} B A \quad\Longrightarrow\quad h = W x \tag{4}$$

Multiply $B$ and $A$ out, add the result to $W_0$ once, and you are left with an ordinary weight matrix of the original shape. The deployed model has exactly the original architecture and runs a normal matrix multiply, with zero added latency and zero extra parameters at inference. And because the frozen base never changed, you can keep many tiny LoRA adapters (the paper's GPT-3 checkpoint shrinks from 350 GB to about 35 MB) for one shared base model and swap them per task, instead of shipping a full fine-tuned copy every time.
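The equivalence in Eq. (4) is easy to check by hand on a toy layer. The sketch below uses plain Python lists and small integer matrices invented for this example (`matmul` and `matvec` are helper names defined here), and compares the two-path output of Eq. (3) with the merged output of Eq. (4).

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


alpha, r = 4, 2                                        # gain alpha / r = 2
W0 = [[1, 0, 2], [0, 1, -1], [3, 1, 0], [2, -2, 1]]    # frozen weights, d=4, k=3
B = [[1, 0], [0, 1], [1, 1], [2, -1]]                  # d x r (as if already trained)
A = [[1, 2, 0], [0, -1, 1]]                            # r x k
x = [1, 2, 3]
scale = alpha / r

# Training-time form: frozen path plus scaled low-rank path, Eq. (3).
frozen_out = matvec(W0, x)
low_rank_out = matvec(B, matvec(A, x))
h_two_path = [f + scale * l for f, l in zip(frozen_out, low_rank_out)]

# Deployment form: fold the adapter into one matrix, Eq. (4).
BA = matmul(B, A)
W = [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W0, BA)]
h_merged = matvec(W, x)

print(h_two_path)
print(h_merged)
```

Both paths give the same output, and `W` has the same shape as `W0`, which is why the merged layer costs nothing extra at inference. With real floating-point weights the two results agree up to rounding rather than exactly.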

What it costs in practice

GPT-3's hidden width is $d = 12288$, so a single attention projection $W_q$ is $12288 \times 12288 \approx 1.5\times 10^{8}$ parameters. Full fine-tuning trains every one of those, in every projection, in every layer. A LoRA adapter on that matrix with $r = 8$ trains $8 \times (12288 + 12288) \approx 2\times 10^{5}$. That is about 770× fewer on one matrix, and the saving across the model is larger still, because LoRA doesn't even adapt most matrices. The paper touches only a couple of the attention projections (adapting $W_q$ and $W_v$ works well) and leaves the rest frozen. The headline numbers for GPT-3 175B: 10,000× fewer trainable parameters and 3× less GPU memory than full fine-tuning with Adam. The memory drop comes straight from not keeping optimizer state for 175 billion frozen weights.
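The arithmetic is worth running once. The per-matrix numbers below come straight from the text; the whole-model estimate additionally assumes GPT-3's 96 transformer layers (a figure not stated in this article) together with the rank-4, $W_q$-and-$W_v$ recipe mentioned in the Figure 1 caption.

```python
d = 12288                          # GPT-3 hidden width (from the text)

# One attention projection, rank 8.
r = 8
full_matrix = d * d                # parameters in W_q
lora_matrix = r * (d + d)          # parameters in B (d x r) plus A (r x d)
per_matrix_ratio = full_matrix / lora_matrix

# Whole model. ASSUMPTIONS: 96 layers, rank 4, adapters on W_q and W_v only.
n_layers = 96
adapted_per_layer = 2
r_paper = 4
lora_total = n_layers * adapted_per_layer * r_paper * (d + d)
total_params = 175_000_000_000
model_ratio = total_params / lora_total

print(f"one matrix:  {full_matrix:,} vs {lora_matrix:,} ({per_matrix_ratio:.0f}x fewer)")
print(f"whole model: {lora_total:,} trainable ({model_ratio:,.0f}x fewer)")
```

Under those assumptions the adapters total a little under 19 million numbers, roughly 9,300 times fewer than 175 billion, which is in line with the rounded 10,000× headline.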

The knobs are few: which matrices to adapt, and the rank $r$. Ranks as small as 1 or 2 are competitive for the attention projections, and you rarely need more than 8; $\alpha$ you set once. That small a search space is part of why LoRA stuck.

Matching full fine-tuning

Across a spread of language models (RoBERTa, DeBERTa, GPT-2, and GPT-3), LoRA performs on par with full fine-tuning, and sometimes beats it, while training a tiny fraction of the parameters and adding nothing to inference. That a rank as low as 1 or 2 keeps up is the evidence for the hypothesis: if a one-dimensional update can carry a task, the change really was low-rank.

What it does not remove are two choices and a dependency. You still pick which matrices to adapt and what rank to use, and a task very far from the base model's training may need a larger $r$. Merging is one-way per deployment: a merged model is fast but task-locked, so live multi-task serving keeps the adapters separate and pays a small cost to add the low-rank path back. And LoRA only adapts an existing model; it leans on a strong pretrained $W_0$ already being there.

None of that has dented it. Freezing the giant and training a sliver is now the default way to specialize large models, and the follow-ups (compress the frozen base to 4-bit numbers to shrink it in memory, then train LoRA on top, plus the rest of the parameter-efficient fine-tuning family) all build on the same observation. The update was low-rank all along.

Provenance · Verified against primary literature

LoRA (2021) · Hu et al.: the method, the α/r scaling, the B = 0 initialization, and the GPT-3 results.

loralib (code) · Official implementation, layers.py:100–120. Kaiming-uniform A, zero B, scaling = α/r, and the merge-on-eval that removes inference latency.

Intrinsic dimension (2020) · Aghajanyan et al.: fine-tuning updates live in a low-dimensional subspace.

Adapters (2019) · Houlsby et al.: the bottleneck-module precursor whose inference latency LoRA removes.

Eckart–Young–Mirsky · The best rank-r approximation of a matrix is its truncated SVD.

Correction · The paper text says A is initialized with a random Gaussian; the released code (loralib/layers.py) initializes A with Kaiming-uniform and B with zeros. We follow the code.

Questions you might still have

If W₀ is full-rank, how can a low-rank ΔW be enough?
You are not approximating W₀. It keeps all of its capacity, frozen. You are approximating the small change needed to adapt it, and that change empirically lives in a few directions.

Why initialize B to zero?
So ΔW = BA = 0 at the start: training begins exactly at the pretrained model and the low-rank path only ever adds what it has learned, which keeps early steps stable. A is random so gradients can still flow.

Where does α/r come from, and how do I set it?
It rescales the update so changing r does not force you to re-tune the learning rate. In practice you fix α once and treat α/r as a constant gain, then sweep r on its own.

Footnotes & further reading

The paper: Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen, LoRA: Low-Rank Adaptation of Large Language Models (Microsoft, ICLR 2022). Code: loralib, the authors' official implementation.
The low-intrinsic-dimension finding LoRA builds on: Aghajanyan, Zettlemoyer, Gupta, Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning.
The adapter precursor: Houlsby et al., Parameter-Efficient Transfer Learning for NLP.
That the truncated SVD is the best low-rank approximation: the Eckart–Young–Mirsky theorem.
The most-used follow-up: Dettmers et al., QLoRA, LoRA on a 4-bit quantized frozen base.
