Deep Residual Learning for Image Recognition
Add the input back, and suddenly depth helps again.
For a while, making a network deeper past a point made it worse, even on the data it was training on. Residual learning fixed that with one change: each block adds its output to its input instead of replacing it. That single shortcut let 152-layer networks train, and win ImageNet.
What if the way to train a deeper network was to let most of it do nothing?
By 2015 the recipe for image recognition was clear: go deeper. Stacking more layers let a network build richer features, edges into textures into parts into objects, and every jump in the ImageNet competition had come from more depth. AlexNet had eight layers. VGG had nineteen. The obvious next move was to keep stacking. And it stopped working, in a way that did not make sense.
Take a network that trains well, bolt on more layers, train it the same way. You would expect, at worst, no harm: the extra layers could always learn to pass their input along untouched, leaving the accuracy where it was. Instead the deeper network did worse. Not on held-out test data, which would point to overfitting. It did worse on the training set. More layers, and the network fit the data it was being shown less well than before.
That is strange, because a deeper network contains the shallower one as a special case. There is a solution sitting right there, by construction, that should be at least as good. The optimizer just could not find it. The fix, when it came, was small: a one-line change to what each block computes. That change, residual learning, is in every ResNet, and the same skip connection is now in every Transformer block ever shipped.
To see why it works we build a small tower of ideas: what the degradation problem actually is, and why it is a failure of optimization rather than a limit of what the network can represent; how the residual reformulation makes doing nothing the default; why that makes the learning easier; how to keep the trick alive when the shapes change so you can go truly deep; and what 152 layers finally bought.
Deeper networks got worse
Start with the phenomenon, stated carefully. Take plain stacked convolutional networks, the straightforward kind with no shortcuts, and train them at a range of depths. Up to about twenty layers depth helps. Past that, accuracy saturates and then drops. The paper calls this the degradation problem, and the load-bearing detail is which error rises: the training error. A 56-layer plain net has higher training error than a 20-layer one, throughout training.
That single fact rules out the first explanation everyone reaches for. Overfitting means a model fits the training data too well and generalizes badly: low training error, high test error. Here the training error itself is higher. The deeper network is not memorizing too much. It is failing to fit at all.
The second explanation is vanishing or exploding gradients, the classic reason deep networks were once hard to train. This is the part most summaries get backwards, so it is worth being exact. By 2015 that problem was largely handled, by normalized initialization and by batch normalization, which rescales each layer's pre-activations using the current mini-batch's statistics. The ResNet authors trained their plain baselines with batch norm and checked directly: the forward signals keep non-zero variance, and the backward gradients keep healthy norms. Neither vanishes. So whatever is breaking the deep plain net, it is not that the signal is dying on the way through.
What is left is optimization itself. The deeper plain network is harder to optimize: the solver cannot drive its training error down the way it can for a shallower one. Note what kind of failure that is: not representation, since the deeper net can express the shallower one exactly (the next section builds that solution by hand), but search. Gradient descent fails to find, in any reasonable time, a solution that provably exists, as if the terrain it has to descend gets harder to navigate as depth grows. The paper does not claim to know the precise reason; it offers a conjecture, that deep plain nets may have "exponentially low convergence rates," and explicitly leaves the mechanism to future work. The contribution is not a theory of why plain nets fail; it is a reformulation that sidesteps the failure.
Below, the same comparison in motion. In plain mode the deeper net settles at higher error; flip to residual and the order reverses, the deeper net pulling ahead. That reversal is the result the rest of this explains.
A deep net should match a shallow one
The thought experiment that makes degradation paradoxical also points straight at the fix. Suppose you have a shallow network that works. Build a deeper one this way: copy the shallow network exactly, then append extra layers that each compute the identity, output equals input. The new layers change nothing, so the deeper network computes the identical function and gets the identical training error.
So a solution always exists: any deeper network can at least tie its shallower self, just by setting the extra layers to pass data through. If real training lands somewhere worse, that is the optimizer failing to find a solution we know is there. Degradation is not a ceiling on what the deeper network can represent. It is a search problem.
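The construction in this thought experiment can be written out directly. The sketch below is a toy, not anything from the paper: `shallow_net` is a hypothetical stand-in for a trained network, and `make_deeper` is an illustrative helper that appends identity layers to it.

```python
def shallow_net(x):
    # Hypothetical stand-in for any trained shallow network.
    return [max(0.0, 2.0 * v - 1.0) for v in x]


def identity_layer(x):
    # A layer that passes its input along untouched.
    return list(x)


def make_deeper(net, extra_layers):
    # Copy the shallow network, then append identity layers after it.
    def deeper(x):
        out = net(x)
        for _ in range(extra_layers):
            out = identity_layer(out)
        return out

    return deeper


deeper_net = make_deeper(shallow_net, extra_layers=30)
sample = [0.0, 0.5, 1.0, 3.0]

# The deeper network computes exactly the same function, so it would
# have exactly the same training error as the shallow one.
print(deeper_net(sample) == shallow_net(sample))
```

The deeper function exists by construction; the open question the article turns to is whether gradient descent can find it when the extra layers are ordinary trainable ones.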
And the thought experiment quietly names its own catch. The easy solution needs those extra layers to implement the identity mapping. It turns out that a stack of convolutions, batch norms, and ReLUs is oddly bad at learning to copy its input exactly. Asking several nonlinear layers to conspire so their composition comes out to exactly $x$ is an awkward target for gradient descent. The identity is simple to write down and apparently hard to learn. There is a reason it feels backwards: every part of those layers is built to transform, the convolutions to mix, the ReLUs to bend, and asking dozens of them to collectively reproduce their input untouched is fighting their design. Each layer must precisely undo whatever the others do.
That reframes the entire problem into something narrow and actionable. We do not need a fundamentally different network. We need to make the identity mapping easy to represent, so the "do nothing extra" solution is sitting right where the optimizer starts.
Learn the change, not the whole map
The reformulation is one move. Write the mapping you want a few stacked layers to compute as $H(x)$. Instead of asking the layers to produce $H(x)$ directly, ask them to produce only the difference from the input,
$$F(x) := H(x) - x,$$
and then add the input back. The block's output is the residual function plus a copy of the input:
$$y = F(x, \{W_i\}) + x. \tag{1}$$
Here $x$ is what flows into the block, $y$ what flows out, and $F$ is a couple of weight layers with their own parameters $\{W_i\}$. For the basic block that is two convolutions with a ReLU between them, batch-normed:
$$F = W_2\,\sigma(W_1 x),$$
with $\sigma$ the ReLU. The new part is the $+\,x$: a shortcut connection that skips the weight layers and adds the untouched input to their output. It has no parameters and costs almost nothing, just a copy and an addition. The ReLU is applied after the add.
Nothing was given up in expressiveness. Whatever the plain layers could have produced, the residual form can produce too, by letting $F$ learn $H(x) - x$. Same family of functions, rewritten. What changed is the default. To make a residual block do nothing, the identity, you no longer need the layers to reconstruct a copy of their input. You need $F(x) = 0$, which means driving the weights toward zero, about the easiest thing an optimizer does. The awkward identity target moved to the origin.
The entire idea fits in the gap between two near-identical functions. Same layers, same batch norms, same ReLUs. One line apart:
# conv1, conv2, bn1, bn2 and relu are the block's layers, assumed defined elsewhere
# plain block: the two conv layers ARE the output
def plain(x):
    out = relu(bn1(conv1(x)))  # 3x3 conv -> BN -> relu
    out = bn2(conv2(out))      # 3x3 conv -> BN
    return relu(out)           # this is H(x)
# residual block: the SAME body, plus one "+ x"
def residual(x):
    out = relu(bn1(conv1(x)))  # identical layers...
    out = bn2(conv2(out))      # ...computing F(x)
    return relu(out + x)       # F(x) + x. that "+ x" is the idea
Below is the block itself, plain versus residual. In plain mode the two weight layers alone must produce the whole mapping. In residual mode the identity shortcut bows up the side and adds the input back at the join, so the layers only have to supply the change.
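To run the plain-versus-residual comparison end to end, here is a minimal self-contained sketch. It makes simplifying assumptions that are not in the paper: small dense layers stand in for the 3x3 convolutions, batch norm is left out, and the names `linear`, `plain_block`, and `residual_block` are illustrative. With every weight set to zero, it shows what each block does by default.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(weights, v):
    # weights is a list of rows; returns the matrix-vector product.
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def plain_block(x, w1, w2):
    out = relu(linear(w1, x))
    out = linear(w2, out)
    return relu(out)  # the layers alone must produce H(x)


def residual_block(x, w1, w2):
    out = relu(linear(w1, x))
    out = linear(w2, out)  # this is F(x)
    return relu([f + a for f, a in zip(out, x)])  # F(x) + x, ReLU after the add


dim = 3
zeros = [[0.0] * dim for _ in range(dim)]
x = [0.5, 1.0, 2.0]  # non-negative, as activations coming out of a ReLU would be

plain_out = plain_block(x, zeros, zeros)
residual_out = residual_block(x, zeros, zeros)

print(plain_out)     # all zeros: the input is lost
print(residual_out)  # the input, passed through unchanged
```

Because the final ReLU sits after the addition, the zero-weight residual block is an exact identity only on non-negative inputs, which is what the preceding block's ReLU produces anyway.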
Why the residual is easier
The paper states the reason as a hypothesis, and it is worth quoting the shape of it. It is easier to optimize the residual mapping than the original, unreferenced one. To the extreme: if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity by a stack of nonlinear layers. The paper does not prove residual learning is easier in general; it argues it, and then shows it works across many networks.
The substance is preconditioning, reshaping the problem so the optimizer starts near a good answer. In reality the ideal mapping is rarely exactly the identity. But in a deep network most blocks are doing small refinements, nudging a representation rather than rebuilding it, so the ideal mapping tends to sit closer to the identity than to zero. When that is true, a residual block has the easier job: with the shortcut supplying the identity for free, it starts at "change nothing" (its layers near zero output) and only has to learn the small correction, descending into a nearby groove instead of climbing from a cold start. A plain block has to synthesize the entire mapping from scratch, every time, even when the right answer is "leave it almost alone."
There is a concrete way to see why "do nothing" is so much cheaper in the residual form, in terms of how far the weights have to travel from where they start. Weights are initialized near zero, so a freshly initialized layer already outputs roughly zero. A residual block that should do nothing therefore needs only $F(x) = 0$, which is already where it sits: the optimizer has to move essentially zero distance to land on the right answer. A plain block asked to do the same thing has to output its own input, which means building the identity map out of layers that currently output near zero. That is a move of size $|x_i|$ in every coordinate $i$ at once: each output dimension has to be coaxed from zero up to match the corresponding input dimension, and the weights have to conspire across the whole layer to do it. One target is the starting point itself, the other is a structured map the optimizer has to reach from far away.
The figure makes that difference concrete. A stack of layers starts near the zero function. Pick how far the ideal mapping sits from the identity. The plain block must build that whole curve from nothing; the residual block, already carrying the input for free, only has to learn the leftover. As the ideal approaches the identity, the residual's job shrinks toward nothing while the plain block's stays just as large.
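A one-weight toy makes the same point numerically. Everything in this sketch is an assumption chosen for illustration: the ideal mapping is taken to be $(1 + \epsilon)\,x$, the "block" is a single scalar weight starting at zero, and `steps_to_fit` is a hypothetical helper that counts plain gradient-descent steps until the mean squared error drops below a tolerance.

```python
def steps_to_fit(eps, residual, lr=0.1, tol=1e-6, max_steps=10_000):
    """Count gradient-descent steps needed to fit the target (1 + eps) * x.

    plain block:    prediction = w * x       (must learn w = 1 + eps)
    residual block: prediction = w * x + x   (must learn w = eps)
    """
    xs = [-2.0, -1.0, 0.5, 1.0, 2.0]
    targets = [(1.0 + eps) * x for x in xs]
    w = 0.0  # both blocks start from a zero weight
    for step in range(max_steps + 1):
        preds = [w * x + (x if residual else 0.0) for x in xs]
        loss = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(xs)
        if loss < tol:
            return step
        grad = sum(2.0 * (p - t) * x for p, t, x in zip(preds, targets, xs)) / len(xs)
        w -= lr * grad
    return max_steps


for eps in (0.0, 0.1, 0.5):
    print(eps, steps_to_fit(eps, residual=False), steps_to_fit(eps, residual=True))
```

When the ideal mapping is exactly the identity (`eps = 0.0`), the residual form is already at the answer and needs no steps, while the plain form still has to travel from 0 to 1. This is only a caricature of the argument, not evidence about real networks.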
The paper finds a fingerprint of exactly this. Measuring how much each layer changes its input, the standard deviation of its output, the residual layers' responses are smaller on average than a plain network's, and they shrink as the network gets deeper. That is what you would see if most layers were making small adjustments around the identity, which is what the residual form is built to make easy.
What about the gradients? Number the blocks so that $x_l$ is an early activation and $x_L$ a deeper one further along. If each block just adds its residual, with nothing applied after the addition (the setting of the follow-up paper, which drops the post-add ReLU), stacking them from $l$ to $L$ sums those residuals up:
$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i).$$
Differentiating the loss $\mathcal{E}$ with respect to the early activation, that sum hands the gradient a direct, weight-free path back:
$$\frac{\partial \mathcal{E}}{\partial x_l} = \frac{\partial \mathcal{E}}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\right).$$
The $1$ in the parentheses is a shortcut for the gradient too: it flows from any deep block back to any shallow one without passing through a single weight layer, so it cannot be choked off even if the paths through $F$ attenuate. That is the "gradient highway" you have probably heard about, and it is real. It is also not the original paper's argument: the derivation comes from the authors' 2016 follow-up, Identity Mappings in Deep Residual Networks, and the original paper expressly said its plain baselines did not suffer vanishing gradients, because batch norm already kept the signal alive. "ResNets solved the vanishing gradient problem" is a tidy story that this paper does not tell. The problem it solved was optimization difficulty; the gradient-flow benefit is a real, related bonus that the next paper made precise.
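The effect of that additive path can be checked on a scalar toy. This is a sketch under assumptions of my own choosing, not the follow-up paper's experiment: activations are single numbers, each block's weight layers are modeled as $F(x) = w \tanh(x)$, nothing is applied after the addition, and the depth and weights are arbitrary. The function tracks the derivative of the last activation with respect to the first using the chain rule.

```python
import math


def forward_with_grad(x0, weights, residual):
    """Run a stack of scalar blocks; return the final activation and its
    derivative with respect to the input x0."""
    x, dx = x0, 1.0  # dx accumulates d(x_current) / d(x0)
    for w in weights:
        t = math.tanh(x)
        f = w * t               # F(x) for this block
        df = w * (1.0 - t * t)  # dF/dx for this block
        if residual:
            x, dx = x + f, dx * (1.0 + df)  # the "1 +" is the shortcut
        else:
            x, dx = f, dx * df              # the gradient must pass through F
    return x, dx


weights = [0.1] * 20  # hypothetical small weights, 20 blocks deep

_, plain_grad = forward_with_grad(0.5, weights, residual=False)
_, residual_grad = forward_with_grad(0.5, weights, residual=True)

print(plain_grad)     # vanishingly small: a product of 20 small factors
print(residual_grad)  # stays above 1: every factor is 1 plus something positive
```

In this toy the plain stack is deliberately built without normalization so that its gradient does shrink. The original paper's batch-normed plain baselines did not have that problem, which is why the article treats this as a bonus and not as the cause of degradation.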
Matching shapes, and going deeper
There is a catch in equation (1) that you only hit when you build a real network. The addition only makes sense if $F(x)$ and $x$ have the same shape. But a convolutional network periodically changes shape: it doubles the number of channels and halves the spatial size to build up abstraction. At those transitions the input and the block's output no longer line up, and you cannot add them.
The fix is a projection on the shortcut, used only where the dimensions change:
$$y = F(x, \{W_i\}) + W_s x. \tag{2}$$
Here $W_s$ is a $1 \times 1$ convolution that reshapes $x$ to match $F(x)$. The paper tests three options: zero-pad the shortcut to keep it parameter-free (A), use a projection only at the dimension changes and plain identity everywhere else (B), or project on every shortcut (C). B edges out A, and C is marginally better than B but not worth the extra parameters. The takeaway is that the identity shortcut is doing the work; projections are a detail for matching shapes, not the source of the gains. So they keep the shortcut as a plain identity everywhere it can be one.
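The shape mismatch and the two cheap ways around it can be sketched on a single pixel's channel vector. This is a simplified illustration under stated assumptions: the spatial downsampling is ignored, a 1x1 convolution is modeled as one matrix applied to the channel vector, and the weights in `w_s` and the helper names are hypothetical.

```python
def add(f_out, shortcut):
    # Elementwise addition only works when the shapes agree.
    if len(f_out) != len(shortcut):
        raise ValueError(f"shape mismatch: {len(f_out)} vs {len(shortcut)}")
    return [f + s for f, s in zip(f_out, shortcut)]


def zero_pad_shortcut(x, out_channels):
    # Option A: keep x as is and pad the new channels with zeros (no parameters).
    return list(x) + [0.0] * (out_channels - len(x))


def projection_shortcut(x, w_s):
    # Option B: a learned matrix, the per-pixel view of a 1x1 convolution.
    return [sum(w * a for w, a in zip(row, x)) for row in w_s]


x = [0.5, 1.0]                # block input with 2 channels
f_out = [1.0, 1.0, 1.0, 1.0]  # block body output with 4 channels

try:
    add(f_out, x)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

w_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # hypothetical 4x2 weights

padded = add(f_out, zero_pad_shortcut(x, out_channels=4))
projected = add(f_out, projection_shortcut(x, w_s))

print(mismatch_raised, padded, projected)
```

Either shortcut makes the addition legal. The zero-padded one adds no parameters, while the projection adds one small matrix at each transition.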
That frugality is what makes truly deep networks affordable, through one more design, the bottleneck block. Two $3 \times 3$ convolutions at full width get expensive fast. So for the deep models the block becomes three layers: a $1 \times 1$ convolution that squeezes the channels down, a $3 \times 3$ that does its work at the cheaper narrow width, and a $1 \times 1$ that restores the width. Similar cost to the basic block, but it moves a much wider signal through. Read the channels as parallel pipes: the first $1 \times 1$ pinches the bundle so the expensive $3 \times 3$ works on a narrow stream, the last $1 \times 1$ fans it back out, and the block keeps its full-width interface to its neighbors while paying narrow-width prices inside. Here the parameter-free identity earns its keep again: putting a projection on the wide ends would roughly double the block's cost.
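The cost claims for the bottleneck can be checked with simple weight counts. The channel widths below (64 for the basic block, 256 squeezed to 64 for the bottleneck) are taken from the paper's block diagram and are not stated in the text above. Biases and batch-norm parameters are ignored, and `conv_params` is an illustrative helper.

```python
def conv_params(kernel, c_in, c_out):
    # Number of weights in a kernel x kernel convolution (no bias).
    return kernel * kernel * c_in * c_out


# Basic block at width 64: two 3x3 convolutions.
basic_narrow = 2 * conv_params(3, 64, 64)

# The same two 3x3 convolutions at width 256, for comparison.
basic_wide = 2 * conv_params(3, 256, 256)

# Bottleneck block with a 256-wide interface: 1x1 down, 3x3 narrow, 1x1 up.
bottleneck = (
    conv_params(1, 256, 64)
    + conv_params(3, 64, 64)
    + conv_params(1, 64, 256)
)

# A 1x1 projection on the shortcut between the two 256-wide ends.
projection = conv_params(1, 256, 256)

print(basic_narrow)  # 73728
print(basic_wide)    # 1179648
print(bottleneck)    # 69632
print(projection)    # 65536
print((bottleneck + projection) / bottleneck)  # close to 2
```

By this count the bottleneck carries a 256-channel signal for about the weight budget of a 64-channel basic block, and a projection shortcut across its wide ends would add nearly as many weights as the block body itself.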
The numbers on this are worth keeping. A 152-layer ResNet runs at 11.3 billion FLOPs (floating-point operations, the standard yardstick for compute), still lower than VGG-19's 19.6 billion despite being eight times deeper. That is compute, not parameters, and ResNet wins on parameters too. The deep network is not just trainable now; it is cheaper than the shallow one it replaced.
What the depth bought
The cleanest result is the reversal the whole paper was built to produce. On ImageNet, a plain 34-layer network has higher error than a plain 18-layer one (28.54% versus 27.94% top-1), the degradation problem in two data points. Swap in residual blocks and it flips: the 34-layer ResNet beats the 18-layer ResNet, and beats the plain 34-layer net by 3.5 percentage points (25.03% versus 28.54% top-1). At 18 layers the plain and residual nets nearly tie (27.94% versus 27.88%). When a network is not overly deep, the plain solver still finds a good solution; the residual shortcut earns its advantage as depth grows.
With degradation out of the way, depth paid off at a scale nobody had reached. The flagship was 152 layers deep, with a single-model top-5 error of 4.49% on ImageNet, and an ensemble at 3.57% on the test set that won the ILSVRC 2015 classification competition. The same representations carried over: the residual networks also took first place that year in ImageNet detection and localization and in COCO detection and segmentation. The depth was not a stunt for one benchmark. It transferred.
The CIFAR-10 experiments push the point further, and contain a trap worth defusing. A ResNet keeps improving as it deepens, 20 to 32 to 44 to 56 to 110 layers, error falling from 8.75% to 6.43%. Then a 1202-layer version does worse, 7.93%. That looks like degradation returning, but it is not. The 1202-layer net optimizes fine, its training error still drives down near zero. It has 19.4 million parameters aimed at a tiny dataset, so it overfits. That is the ordinary bias-variance story, the opposite failure from the plain-net degradation, where the deep net could not fit the training data at all. Residual learning removed the optimization wall; it did not repeal the need for enough data.
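The distinction between the two failure modes comes down to which error moves when depth is added. The rule of thumb below is a sketch of that reasoning only; the `diagnose` helper and the error values fed to it are made up for illustration and are not results from the paper.

```python
def diagnose(shallow, deep):
    """Compare (train_error, test_error) pairs, in percent, for a shallower
    and a deeper network trained the same way."""
    shallow_train, shallow_test = shallow
    deep_train, deep_test = deep
    if deep_train > shallow_train:
        # The deeper net fits the training set worse: an optimization failure.
        return "degradation"
    if deep_test > shallow_test:
        # The deeper net fits the training set fine but generalizes worse.
        return "overfitting"
    return "depth helped"


# Illustrative numbers only.
print(diagnose(shallow=(5.0, 9.0), deep=(8.0, 11.0)))  # degradation
print(diagnose(shallow=(1.0, 6.5), deep=(0.1, 8.0)))   # overfitting
print(diagnose(shallow=(3.0, 8.0), deep=(1.0, 6.5)))   # depth helped
```

The training error is checked first because it is the signal that separates the two cases: a higher test error alone cannot tell degradation from overfitting.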
The shortcut everyone kept
The residual connection outlived its original setting almost immediately. Within two years it was load-bearing in a very different architecture: every block of a Transformer wraps its attention and its feed-forward sublayer in the same move, output equals input plus what the sublayer computed, and for the reason this paper found, it lets you stack many layers and train them. The ResNet paper itself says nothing about Transformers, which came later; the inheritance runs only forward. But it does run: the skip connection in the language model you used today is a direct descendant of a 2015 vision paper.
The same little equation also kept attracting new readings. One sees a residual block,
$$x_{l+1} = x_l + F(x_l),$$
as a single Euler step of a differential equation, with depth playing the role of time; that view, sharpened by Lu and by Haber and Ruthotto in 2017 and popularized by Neural ODEs in 2018, is the bridge our DiffusionBlocks explainer walks across. Another reads a residual network as an implicit ensemble of many shallow paths (Veit and colleagues, 2016). A third shows that skip connections smooth the loss landscape, keeping it navigable at depths where a plain net's turns chaotic (Li and colleagues, 2018). Useful lenses, all three, and all three later work, none of it in the original paper: it supplied the mechanism, and the explanations came after.
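The Euler-step reading can be made concrete with a stack of identical scalar residual updates. This is a sketch of the later ODE view, not anything in the ResNet paper. It assumes an explicit step size `h` (which a real residual block would absorb into its weights) and uses the simple equation dx/dt = -x, whose exact solution at time 1 is known.

```python
import math


def residual_stack(x0, f, depth, total_time=1.0):
    # Each "block" is one Euler step: x <- x + h * f(x).
    h = total_time / depth
    x = x0
    for _ in range(depth):
        x = x + h * f(x)
    return x


def decay(x):
    # Right-hand side of dx/dt = -x.
    return -x


exact = math.exp(-1.0)  # true solution at t = 1 starting from x0 = 1


def error(depth):
    return abs(residual_stack(1.0, decay, depth) - exact)


print(error(10), error(1000))  # the deeper stack tracks the ODE more closely
```

Under this reading, adding layers while shrinking the step corresponds to integrating the same dynamics more finely.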
Step back and the argument is four steps long. Depth should help, because a deeper network can always copy a shallower one. So when a deep plain net does worse, that is the optimizer failing to find a solution we know exists, not a limit on what it could represent. Make that missing solution easy to reach by adding the input back, so a block that should do nothing only has to learn nothing. Then each layer learns a small correction instead of an entire mapping, and the optimizer can actually find it. One line of arithmetic, $y = F(x) + x$, and the depth that used to hurt started to help.
Questions you might still have
If degradation is not overfitting or vanishing gradients, what is it?
An optimization difficulty. The deeper plain net is harder for SGD to fit, even on the training set. The paper conjectures deep plain nets have "exponentially low convergence rates" and leaves the full reason as future work.
Does adding the input back change what the network can compute?
No. F(x) + x can represent any mapping H by setting F = H - x, so the function class is identical. What changes is which solutions are easy to reach: doing nothing now means F = 0, the easiest target there is.
Why does the 1202-layer CIFAR net do worse if depth helps?
That is overfitting, not degradation. It optimizes fine (training error near zero) but has 19.4M parameters for a tiny dataset. The plain-net degradation problem is the opposite: there the deeper net could not even fit the training set.
Did ResNet solve the vanishing gradient problem?
Not as the paper frames it. Its plain baselines already used batch norm, so signals did not vanish; it targeted the degradation problem. The gradient-highway reading (a "+1" in the backward pass) is the 2016 follow-up, Identity Mappings in Deep Residual Networks.
Footnotes & further reading
- The paper: He, Zhang, Ren, Sun, Deep Residual Learning for Image Recognition (Microsoft Research, CVPR 2016). Original code and models.
- The follow-up that supplies the clean gradient-flow argument and the pre-activation block: He, Zhang, Ren, Sun, Identity Mappings in Deep Residual Networks (2016). The "+1" backward path lives here, not in the original.
- Batch normalization, which keeps the plain baselines' signals alive (and is why degradation is not a vanishing-gradient problem): Ioffe & Szegedy, Batch Normalization (2015). Its "internal covariate shift" rationale was later contested by Santurkar et al. (2018), who argue its real effect is smoothing the optimization landscape.
- The very-deep baseline ResNet measures itself against: Simonyan & Zisserman, Very Deep Convolutional Networks (VGG) (2014).
- The concurrent gated-shortcut work whose gates can close, in contrast to the parameter-free identity here: Srivastava, Greff & Schmidhuber, Highway Networks (2015).
- Later readings of the residual block (all post-date the paper): the ODE / Euler-step view in Neural ODEs (Chen et al., 2018, building on Lu et al. and Haber & Ruthotto, 2017); the ensemble-of-shallow-paths view (Veit et al., 2016); and the loss-landscape view (Li et al., 2018).