Training Deep Nets with Sublinear Memory Cost

Trade one extra forward pass for square-root memory.

Most of what training keeps in GPU memory, it keeps only to use once, much later. Gradient checkpointing throws most of it away and recomputes it on demand, dropping the memory cost of an $n$-layer network from $O(n)$ to $O(\sqrt{n})$ for the price of a single extra forward pass.

Explaining the paper: Training Deep Nets with Sublinear Memory Cost. Chen, Xu, Zhang, Guestrin · U. Washington / MIT / Dato · 2016 · arXiv:1604.06174

A 1,000-layer network needs 48 GB of memory to train. Checkpointing fits it in 7 GB.

Backpropagation makes you remember everything, and that sets how big a model you can train. Every layer's output activation is held in memory from the moment the forward pass computes it until the backward pass comes back around, much later, to use it. Train a 100-layer model and you pay for 100 layers' worth of activations at once. This requirement comes from the chain rule. To get the gradient at a layer you need that layer's own forward values: a stored activation is the price of being able to compute its gradient later. So memory grows linearly with depth, and that wall, not arithmetic and not data, usually sets how deep a network fits on a GPU. The paper opens on exactly this pain: a 1,000-layer residual network that needs 48 GB it doesn't have.

Those stored activations are scratch work, though. The forward pass used each one to compute the next and then left it sitting idle for the rest of the run, alive in memory but untouched, waiting for one brief moment when the backward pass needs it. Keeping it around is convenient, not necessary. If an activation is cheap to rebuild, you could throw it away and recompute it the instant the backward pass needs it. The method the paper describes is exactly this trade, a little recomputation in exchange for a lot of memory: keep a few checkpoints, drop the rest, and rebuild the dropped ones on the way back. Done right it takes the memory bill from $n$ down to $\sqrt{n}$ , at the cost of running the forward pass roughly one extra time.

The technique now has a name, gradient checkpointing, and it is everywhere a large model is trained. Two ideas stack to get there. First, treat memory the way a compiler treats registers and reuse it aggressively, which is a constant-factor win. Second, drop activations and recompute them, which is the change that beats the linear trend. We'll take them in that order.

The memory wall

First, why is training so much hungrier than inference? At inference you run the forward pass and discard each activation as soon as the next layer has consumed it, so a network of any depth can run in nearly constant memory. Training cannot do that, because of the chain rule.

To update layer $\ell$ , backprop needs two things: the gradient flowing back from above, and the layer's own local derivative. That local derivative has to be evaluated at the actual values the forward pass produced. A matmul's weight gradient, for one, is the upstream gradient times the layer's input, so you need that input — and the output, already multiplied and summed, does not encode it. (A few layers get off easy: a ReLU's local derivative is just a one-bit mask of which inputs were positive, and a positive output had a positive input, so that mask does survive in the output. The dense layers' inputs do not.) Those inputs are not recoverable from the output alone, so the framework keeps every layer's forward activation alive. The consequence is the picture below: at the moment the backward pass begins, all $n$ activations are in memory at once. That simultaneous peak is the memory bottleneck, and it scales linearly with depth.
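To make the dependence on stored forward values concrete, here is a minimal pure-Python sketch of a dense layer's backward step. The helper names (`dense_forward`, `dense_backward`, `relu_backward_from_output`) are hypothetical and chosen for illustration; they are not from the paper or any framework.

```python
def dense_forward(W, x):
    """y = W x for a weight matrix W (list of rows) and an input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dense_backward(W, x, grad_y):
    """Return (grad_W, grad_x). Note that grad_W needs the stored input x."""
    grad_W = [[g * xi for xi in x] for g in grad_y]            # outer(grad_y, x)
    grad_x = [sum(W[i][j] * grad_y[i] for i in range(len(W)))  # W^T grad_y
              for j in range(len(x))]
    return grad_W, grad_x

def relu_backward_from_output(a, grad_a):
    """ReLU's mask can be read off its own output: a > 0 exactly when the input was > 0."""
    return [g if ai > 0 else 0.0 for ai, g in zip(a, grad_a)]

W = [[1.0, 2.0], [0.0, -1.0]]
x = [3.0, 4.0]
y = dense_forward(W, x)                    # pre-activation
a = [max(0.0, v) for v in y]               # ReLU output
grad_y = relu_backward_from_output(a, [1.0, 1.0])
grad_W, grad_x = dense_backward(W, x, grad_y)
```

The ReLU step runs from its output alone, while `dense_backward` cannot produce `grad_W` without `x`. That input is the activation plain backprop has to keep alive.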

Figure 1 · the memory wall

Plain backprop. The forward sweep stores every layer's activation, filling the naive meter to 100% and holding it there until the backward sweep drains it. In-place and sharing optimizations, described below, recycle a constant fraction, a 2–3× saving that is still linear in depth.

That linearity is the problem. It means the deepest network you can train is set by a division: GPU memory divided by the memory per layer. Buy a bigger GPU and you push the wall out by a constant; you do not move it.

Cheap wins: the computation graph

Before recomputing anything, the paper first collects the easy savings available from how memory is allocated. A network is better viewed not as code but as a computation graph: nodes are operations, edges are the values that flow between them, and the backward pass adds more nodes and edges to the same graph. Once it is a graph, allocating memory to it is the same problem a compiler solves when it assigns a finite set of registers to a program's variables.

The key notion borrowed from there is liveness: a value is live from the moment it is produced until its last use, and two values whose live ranges do not overlap can safely share one buffer. That unlocks two optimizations. An in-place operation writes its output directly over an input that is already dead, so an activation function costs no new memory. Sharing recycles a buffer the moment its value is no longer needed by anyone downstream. The textbook version of finding the best such assignment, graph coloring, is actually NP-complete, so the paper does not solve it exactly: it sweeps the graph once with a liveness counter, freeing each buffer when its count of pending uses hits zero, which is linear time and good enough in practice. (The same analysis lets MXNet, which the authors built on, allocate every buffer statically before a single op runs, so it can report the exact memory a plan will use.)
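The liveness sweep described above can be sketched in a few lines. This is an illustrative toy, not MXNet's allocator: the function name `plan_buffers` and the graph encoding (a list of `(output, inputs)` pairs in execution order) are assumptions made for the example, and it models sharing only, not in-place writes.

```python
def plan_buffers(ops):
    """Assign a buffer id to every op output with one liveness sweep.

    ops: list of (output_name, [input_names]) in execution order.
    Returns (assignment: name -> buffer id, total number of buffers).
    """
    # Count how many pending uses each value has.
    pending = {}
    for _, inputs in ops:
        for name in inputs:
            pending[name] = pending.get(name, 0) + 1

    free, assignment, n_buffers = [], {}, 0
    for out, inputs in ops:
        # Allocate the output first, so it never aliases a live input.
        if free:
            assignment[out] = free.pop()
        else:
            assignment[out] = n_buffers
            n_buffers += 1
        # Each input has now been used once; recycle it when no uses remain.
        for name in inputs:
            pending[name] -= 1
            if pending[name] == 0 and name in assignment:
                free.append(assignment[name])
    return assignment, n_buffers

# Inference-style chain: each activation is used once, by the next layer.
inference = [("a%d" % i, ["a%d" % (i - 1)]) for i in range(1, 7)]
_, inference_buffers = plan_buffers(inference)

# Training-style chain: each activation is used again by the backward pass.
training = [("a1", ["x"]), ("a2", ["a1"]), ("a3", ["a2"]),
            ("g3", ["a3"]), ("g2", ["g3", "a2"]), ("g1", ["g2", "a1"])]
_, training_buffers = plan_buffers(training)
```

In this toy, the six-layer inference chain fits in two buffers regardless of depth, while the three-layer training chain needs four, because the backward pass keeps the early activations live.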

The saving is real but bounded: in-place and sharing cut the feature-map memory by a factor of two to three. That is the lower bar in Figure 1, and the limit is visible in the picture: it is a constant factor. Both meters still climb in lockstep with depth, so a deep enough network hits the wall regardless of how cleverly you pack the buffers. To bend the line itself, rather than only lower it, you need an idea that changes the exponent.

Trade compute for memory

That idea is recomputation, and it goes like this. Cut the chain of layers into contiguous segments. On the forward pass, keep only the activation at each segment boundary and drop everything inside. On the backward pass, when you arrive at a segment, replay its short forward run from the saved boundary to rebuild its internal activations, run the gradient through them, then free them again before moving to the next segment. You never hold more than one segment's worth of internals at a time, plus the handful of saved boundaries. Figure 2 below steps through the full sequence: the forward pass keeps only the four checkpoints, and the backward pass walks right to left, rebuilding and freeing one segment at a time. In code it is short:

# checkpointed backprop over an n-layer chain, k segments  (Alg 1)
# segments: list of lists of layers; each layer has forward(v) and backward(grad, v_in)
def checkpointed_backprop(segments, x, target, loss_grad):
    checkpoints = []                 # one saved boundary per segment
    v = x
    for seg in segments:             # FORWARD: keep only the boundaries
        checkpoints.append(v)        # save this segment's input activation
        for layer in seg:            # run the segment, drop its internals
            v = layer.forward(v)
    grad = loss_grad(v, target)
    # BACKWARD: one segment at a time, restoring its saved boundary
    for seg, v in zip(reversed(segments), reversed(checkpoints)):
        cache = []
        for layer in seg:            # RE-COMPUTE the dropped activations
            cache.append(v)          # each layer's input
            v = layer.forward(v)
        for layer, a in zip(reversed(seg), reversed(cache)):
            grad = layer.backward(grad, a)   # backprop through the segment
        del cache                    # peak: one segment + k boundaries
    return grad

The figure below runs it. The forward pass keeps only the amber checkpoints and discards the rest. Then the backward pass walks the segments from right to left: each one lights up as it is recomputed from its checkpoint (teal), the gradient is propagated through it, and it is freed. The memory trace underneath shows the saving. Instead of the plain-backprop ramp up to $n$ , it is a low sawtooth, each tooth one segment's recompute, with a peak nowhere near the top line.
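The schedule can be checked end to end on a toy chain. The sketch below assumes scalar layers of the form $y = \tanh(wx)$ (the `TanhLayer` class is a stand-in invented for this example) and counts stored activations as a proxy for memory; it compares plain and checkpointed backprop on 16 layers with $k=4$.

```python
import math

class TanhLayer:
    """Scalar layer y = tanh(w * x); a toy stand-in for a real layer."""
    def __init__(self, w):
        self.w, self.grad_w = w, 0.0
    def forward(self, x):
        return math.tanh(self.w * x)
    def backward(self, grad_y, x):
        d = 1.0 - math.tanh(self.w * x) ** 2   # local derivative at the stored input
        self.grad_w = grad_y * d * x
        return grad_y * d * self.w

def plain_backprop(layers, x):
    acts, v = [], x
    for layer in layers:                        # store every layer's input
        acts.append(v)
        v = layer.forward(v)
    peak, grad = len(acts), 1.0                 # loss = final output, so dL/dy = 1
    for layer, a in zip(reversed(layers), reversed(acts)):
        grad = layer.backward(grad, a)
    return grad, peak

def checkpointed_backprop(layers, x, k):
    size = math.ceil(len(layers) / k)
    segments = [layers[i:i + size] for i in range(0, len(layers), size)]
    checkpoints, v = [], x
    for seg in segments:                        # forward: keep boundaries only
        checkpoints.append(v)
        for layer in seg:
            v = layer.forward(v)
    grad, peak = 1.0, 0
    for seg, v in zip(reversed(segments), reversed(checkpoints)):
        cache = []
        for layer in seg:                       # recompute this segment's inputs
            cache.append(v)
            v = layer.forward(v)
        peak = max(peak, len(checkpoints) + len(cache))
        for layer, a in zip(reversed(seg), reversed(cache)):
            grad = layer.backward(grad, a)
    return grad, peak

weights = [0.5 + 0.1 * i for i in range(16)]
plain_layers = [TanhLayer(w) for w in weights]
ckpt_layers = [TanhLayer(w) for w in weights]
g_plain, peak_plain = plain_backprop(plain_layers, 0.3)
g_ckpt, peak_ckpt = checkpointed_backprop(ckpt_layers, 0.3, k=4)
```

Under this counting the plain run peaks at 16 stored activations and the checkpointed run at $n/k + k = 8$. Both produce the same input gradient and the same per-layer weight gradients, since the recomputation repeats the original arithmetic.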

Figure 2 · checkpointed backprop

A 16-layer chain in $k=4$ segments. The forward pass stores only the 4 checkpoints; the backward pass recomputes each segment from its checkpoint, runs the gradient through it, then frees it. The memory trace is a sawtooth whose peak, $\approx n/k + k$, sits far below the plain-backprop peak of $n$.

Two things to take from the trace. The first is what it costs. Every activation that was dropped is recomputed exactly once, during its segment's backward step, so the total extra work is precisely one more forward pass over the network. That lands at about 30% more wall-clock time, not the 50% you'd guess if a backward pass cost the same as a forward. A backward pass already costs about twice a forward pass, because each layer computes two gradients, one with respect to its input and one with respect to its weights, and each takes about a forward's worth of arithmetic. An ordinary training step is therefore roughly one forward plus two forwards' worth of backward, three units in all. Adding one more forward makes four, about a third more. The paper measures 30% on real hardware, a hair under the unit model's 33% because the recomputed forward runs a little cheaper than the original.

The second is what it does not cost. The recomputed activations are bit-for-bit the same arithmetic as the originals, so the gradients are identical to what plain backprop would have produced. The paper states that the method "gives equivalent weight gradient": this is a pure memory-for-compute trade with no effect on the model you get out.
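The unit-cost arithmetic behind the compute overhead can be written down directly. This is a back-of-the-envelope model, not a measurement: the function name `checkpoint_overhead` and its default cost ratios are assumptions taken from the reasoning above (backward costs about two forwards).

```python
def checkpoint_overhead(forward=1.0, backward=2.0, extra_forwards=1):
    """Fractional extra time from recomputation, in forward-pass units."""
    baseline = forward + backward
    return extra_forwards * forward / baseline

realistic = checkpoint_overhead()                  # backward costs 2x forward
naive_guess = checkpoint_overhead(backward=1.0)    # if backward cost 1x forward
```

With the 1:2 ratio the model gives one third; the 50% guess only appears if the backward pass is assumed to cost the same as the forward.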

The √n sweet spot

The last free parameter is the number of segments. Cut the network into $k$ of them and the peak memory has two parts, which the paper writes as

$$\text{cost-total} = \max_{i=1,\ldots,k}\,\text{cost-of-segment}(i) + O(k) = O\!\left(\frac{n}{k}\right) + O(k) \tag{1}$$

The two terms come straight from the mechanism. The $n/k$ is the size of the single segment you have to hold and recompute at the worst moment, when one segment's internals are all live for its backward step. The $k$ is the pile of saved checkpoints, one per boundary, that you carry throughout. The two terms trade off against each other: few segments means each one is enormous to recompute ($n/k$ large), many segments means the checkpoints themselves become the cost ($k$ large). You want the $k$ that makes the total smallest, so set the derivative to zero:

$$\frac{d}{dk}\!\left(\frac{n}{k} + k\right) = -\frac{n}{k^2} + 1 = 0 \quad\Longrightarrow\quad k = \sqrt{n}, \qquad \text{cost} = 2\sqrt{n}$$

The balance lands exactly where the two terms are equal, $n/k = k$, which is $k=\sqrt{n}$. Split an $n$-layer network into $\sqrt{n}$ segments of $\sqrt{n}$ layers each and the peak is $2\sqrt{n}$. Drag the slider below and watch the two component curves cross right under the bottom of the U; both ends of the slider, one giant segment and a checkpoint on every layer, cost the full $n$.
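The minimum is easy to confirm by brute force. The sketch below sweeps every integer $k$ for $n=64$ under the simplified cost model $M(k) = n/k + k$, with constants dropped; the name `peak_memory` is just a label for this example.

```python
def peak_memory(n, k):
    """Simplified peak: one live segment (n/k) plus k stored checkpoints."""
    return n / k + k

n = 64
costs = {k: peak_memory(n, k) for k in range(1, n + 1)}
best_k = min(costs, key=costs.get)   # the k with the smallest peak
best_cost = costs[best_k]
one_segment = costs[1]               # a single giant segment
every_layer = costs[n]               # a checkpoint on every layer
```

For $n=64$ the sweep lands on $k=8$ with a peak of 16, and both extremes come out at 65, essentially the full $n$.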

Figure 3 · the √n minimum

Peak memory $M(k) = n/k + k$ against the number of segments $k$ (here $n=64$). The falling $n/k$ (one recomputed segment) and the rising $k$ (the checkpoints) cross at $k=\sqrt{n}=8$, the bottom of the teal U, where memory is $2\sqrt{n}=16$. Both extremes cost $\approx n$.

The result is memory $O(\sqrt{n})$ at the cost of one extra forward pass. The square root accounts for most of the saving: a network whose activations would have needed a million units of memory now needs about two thousand ($2\sqrt{n}$ with $n = 10^6$), and the bill for that is a third more time. The $\sqrt{n}$ covers only the intermediate feature maps, the activation tensors flowing through each layer, which dominate training memory for deep conv nets and unrolled RNNs. The parameters and the scratch space a convolution needs are a separate, unchanged line on the bill; checkpointing leaves them alone.

Pay even less: recursion

A segment is itself a chain of layers, so apply the very same trick inside it. Checkpoint within the segment, drop its sub-internals, recompute them recursively. Let $g(n)$ be the memory to do a forward-and-backward over $n$ layers when you store $k$ results and recurse on the pieces between them. Each level costs $k$ checkpoints and hands a chain of length $n/(k+1)$ to the next level down:

$$g(n) = k + g\!\left(\frac{n}{k+1}\right) \tag{2}$$

Unrolling that recursion stacks up one $k$ per level, and the number of levels is how many times you can divide $n$ by $k+1$ before reaching a single layer, which is $\log_{k+1} n$. So

$$g(n) = k\,\log_{k+1} n \tag{3}$$

Choosing where to place the checkpoints at every level of that recursion is exactly the revolve algorithm from the automatic-differentiation literature, the optimal schedule for trading recomputation against memory. At the extreme, with $k=1$ you keep a single checkpoint at each level, halving the remaining chain every time, so (3) gives $g(n) = 1\cdot\log_2 n = \log_2 n$ . Logarithmic memory. The price is one more forward pass per level, so $\log_2 n$ forward passes in total instead of one. This instantiates a general theorem from that same literature (Griewank and Walther): for any $c$ , you can train in $O(c\,n)$ compute and $O(n^{1/c})$ memory. Plain backprop is $c=1$ ; the $\sqrt{n}$ scheme is $c=2$ ; driving $c$ up to $\log n$ takes memory down to $\log n$ .
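Recurrence (2) and closed form (3) can be checked against each other numerically. The sketch below assumes $n$ is an exact power of $k+1$ and takes $g(1)=0$ as the base case, which is what (3) implies; the function names are illustrative only.

```python
import math

def g(n, k):
    """Recurrence (2): k checkpoints at this level, then recurse on n/(k+1)."""
    if n <= 1:
        return 0                      # a single layer needs no further checkpoints
    return k + g(n // (k + 1), k)

def g_closed(n, k):
    """Closed form (3): k * log_{k+1}(n)."""
    return k * math.log(n) / math.log(k + 1)

halving = g(1024, 1)                  # k=1: halve the chain at each level
quartering = g(4096, 3)               # k=3: split into four at each level
```

With $k=1$ and $n=1024$ the recursion runs ten levels deep, matching $\log_2 1024 = 10$; with $k=3$ and $n=4096$ it gives $3 \cdot \log_4 4096 = 18$.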

Drag the slider to spend forward passes and watch the memory collapse. The curve has a sharp knee at two passes. That first extra forward pass, the step from storing everything to the $\sqrt{n}$ scheme, buys the bulk of the saving. Everything past it, the slow walk from $\sqrt{n}$ down to $\log n$ , costs many more passes for a sliver more memory.
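A rough way to see the knee is to tabulate memory against the number of passes under the general trade-off quoted above. The sketch assumes memory of $c \cdot n^{1/c}$ after $c$ forward passes, with the constant factor set to $c$ so that $c=2$ reproduces $2\sqrt{n}$. That constant is an assumption for illustration and need not match the exact curve drawn in Figure 4.

```python
def frontier(n, passes):
    """Assumed model: memory ~ c * n**(1/c) after c forward passes."""
    return {c: c * n ** (1.0 / c) for c in passes}

mem = frontier(4096, [1, 2, 3, 4])
first_step = mem[1] / mem[2]     # saving bought by the first extra pass
second_step = mem[2] / mem[3]    # saving bought by the second extra pass
```

Under this model the first extra pass shrinks memory by a factor of 32 (4096 down to 128), while the second shrinks it by less than 3, which is the diminishing return the text describes.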

Figure 4 · the memory–compute frontier

Peak memory against the number of forward passes $p$, for $n=4096$ (log memory axis). One pass stores everything ($n$); two passes is the √n scheme ($2\sqrt{n}=128$); $\log_2 n = 12$ passes reach $\log n$ memory. The knee at $p=2$ is why $\sqrt{n}$ is the setting everyone actually uses.

Which is exactly why $\sqrt{n}$ is the default and $\log n$ is a curiosity. The paper says as much: the logarithmic version "may not be used commonly" because running the forward pass $\log n$ times is a steep price for the last factor of memory. It is a theorem about how far the trade can be pushed, not a recommendation.

What it costs, what it buys

The experiments quantify the result. On a 1,000-layer residual network, checkpointing takes the runtime memory from 48 GB down to 7 GB, a touch under a 7× reduction, while adding about 30% to the wall-clock time. The gap is decisive in practice: with even the best linear allocation plan, the largest ResNet the authors' GPU could hold was a couple hundred layers; with checkpointing, a thousand layers fit in 7 GB. The figure below makes that concrete. Drag depth and watch the linear plans cross the GPU's memory ceiling after a few hundred layers while the sublinear plan stays under it into the thousands.

Figure 5 · the headline experiment

Memory vs. depth (log–log). The linear plans grow with depth and cross the 12 GB GPU limit after a few hundred layers; the sublinear plan grows like $\sqrt{\text{depth}}$ and stays under it. At 1,000 layers: 48 GB linear, 7 GB sublinear. Toggle to the LSTM for the same divergence on sequence length.

The second toggle in that figure shows this is not a vision trick: the authors also run it on a four-layer LSTM with 1,024 hidden units unrolled over a long sequence. Unrolling lays the recurrence out as one chain link per timestep, so a long sequence is effectively a very deep network and memory grows with the number of timesteps rather than layers. There the sublinear plan gives more than a 4× reduction over the best plan that does not recompute. (On recurrent nets the in-place optimization matters more, because it lets the per-timestep weight gradients accumulate directly into one buffer instead of allocating fresh space at every step.)

A note on what the numbers are measuring, since two different quantities are in play. The clean $\sqrt{n}$ curve is the feature-map estimate from MXNet's static allocator. Because every buffer is planned before a single op runs, the allocator can report a plan's exact feature-map memory ahead of time, which is why this curve is a clean prediction rather than a noisy measurement. The 48 GB to 7 GB headline is the total runtime memory read off nvidia-smi, parameters and workspace and all, which is why it is a roughly 7× drop rather than a literal square root.

The idea sits among relatives, because it is one of three ways people now get memory-efficient training. Checkpointing recomputes dropped activations from saved checkpoints, and it is the general, drop-in one: it works on any network with no change to the architecture. Reversible networks instead reconstruct each layer's activations from the next layer's, reaching constant memory in depth, but only by constraining the architecture so the layers are invertible. Neural ODEs get constant memory a third way: instead of saving the forward trajectory, they re-derive the gradients by integrating a companion (adjoint) differential equation backward through time, so the activations are reconstructed rather than stored. (See the Neural ODE explainer.) FlashAttention also recomputes in its backward pass, which is why it is often filed next to checkpointing, but its central idea is something else: IO-aware tiling to keep the attention computation in fast on-chip memory. It saves only the small softmax statistics and recomputes the attention matrix from them block by block, rather than replaying whole segments of a network.

It is also the most widely adopted. Checkpointing now ships as a built-in option in the major frameworks, usually behind a single call or flag (PyTorch's torch.utils.checkpoint, the activation-checkpointing switches in DeepSpeed and Megatron), and it is a large part of how anyone fits a big Transformer onto the hardware they have. The title omits one thing: the technique was not invented here. The core idea, and the optimal recursive scheduling that gives the $\log n$ result, come from the automatic-differentiation literature of the 1990s. This paper's contribution was to bring it to general deep networks, automate the planning, and measure that it pays, which is why it is the one everyone cites when they reach for it.

An activation you can cheaply recompute does not justify the memory to store it. Keep $\sqrt{n}$ of them, recompute the rest, pay about a third more compute, and depth stops being a memory problem.

Provenance: verified against primary literature

Griewank & Walther (2000): The revolve checkpointing algorithm and the logarithmic-memory result, from the AD literature this paper builds on.

Griewank & Walther (2008): The general trade-off: O(c·n) compute buys O(n^(1/c)) memory.

Dragon Book (Aho et al.): Register allocation and liveness analysis, the compiler analogy behind in-place and sharing.

He et al. (2016): Identity-mapping ResNets, which made the 1000-layer benchmark trainable at all.

FlashAttention (2022): A later recompute-in-backward method, related but not the same technique.

Correction: Gradient checkpointing was not invented here. The technique, and the optimal 'revolve' schedule behind the O(log n) result, come from the automatic-differentiation literature (Griewank & Walther, 2000); this paper popularized it for deep learning, on a tip from David Warde-Farley acknowledged in the text. Also: the O(√n) is activation memory only, not parameters or workspace, and the 48 GB → 7 GB headline is total runtime memory.

Questions you might still have

If you recompute activations, isn’t that doing the work twice?
Only one extra forward pass, not double everything. A backward pass already costs about twice a forward, so a normal step is ~3 forward-units; one extra forward makes it ~4, about a third more time. You pay ~30% compute to take memory from n down to √n.

Why √n, and not something smaller?
Peak memory is n/k + k: one live segment to recompute plus k stored checkpoints. The sum is smallest when the two are equal, at k=√n. You can go below √n only by recomputing recursively, which the frontier shows costs many more forward passes for very little more memory.

Does checkpointing change the trained model?
No. Recomputed activations are the same arithmetic as the originals, so the gradients are bit-for-bit what plain backprop would compute. The paper calls it "equivalent weight gradient." It is a pure memory-for-compute trade, with no effect on the result.

Is this the same thing FlashAttention does?
Related, not the same. Both recompute in the backward pass, but FlashAttention’s core idea is IO-aware tiling to keep attention in fast on-chip memory; it saves only the small softmax statistics and recomputes the attention matrix from them. Checkpointing is the general version that recomputes whole segments of any network.

Footnotes & further reading

The paper: Chen, Xu, Zhang, Guestrin, Training Deep Nets with Sublinear Memory Cost (2016). The implementation was built on MXNet, whose static allocator lets the exact feature-map memory of a plan be read off before training runs.
The origin of checkpointing and the logarithmic-memory result: Griewank & Walther, Algorithm 799: revolve (ACM TOMS, 2000), and the general compute/memory theorem in their book Evaluating Derivatives (2008).
Register allocation and liveness analysis, the compiler analogy: Aho, Lam, Sethi & Ullman, Compilers: Principles, Techniques, and Tools (the "Dragon Book").
The 1,000-layer benchmark was made trainable by He et al., Identity Mappings in Deep Residual Networks (2016).
The recompute-vs-reconstruct-vs-adjoint cousins: FlashAttention's IO-aware recomputation (Dao et al., 2022), reversible residual networks (Gomez et al., 2017), and the adjoint method of Neural ODEs (Chen et al., 2018).
The modern incarnation: PyTorch's torch.utils.checkpoint, and the activation-checkpointing options in DeepSpeed and Megatron-LM.
