Deep Residual Learning for Image Recognition

Add the input back, and suddenly depth helps again.

For a while, making a network deeper past a point made it worse, even on the data it was training on. Residual learning fixed that with one change: each block adds its output to its input instead of replacing it. That single shortcut let 152-layer networks train, and win ImageNet.

Explaining the paper: Deep Residual Learning for Image Recognition. He, Zhang, Ren, Sun · Microsoft Research · CVPR 2016 · arXiv:1512.03385

To train a deeper network, let most of it do nothing.

By 2015 the recipe for image recognition was clear: go deeper. Stacking more layers let a network build richer features, edges into textures into parts into objects, and every jump in the ImageNet competition had come from more depth. AlexNet had eight layers. VGG had nineteen. The obvious thing to try next was to keep stacking. And it stopped working.

Consider what happens when a network that trains well has more layers bolted on and is trained the same way. You would expect, at worst, no harm: the extra layers could always learn to pass their input along untouched, leaving the accuracy where it was. Instead the deeper network did worse. Not on held-out test data, which would point to overfitting. It did worse on the training set. More layers, and the network fit the data it was being shown less well than before.

That is strange, because a deeper network contains the shallower one as a special case. A solution exists by construction that should be at least as good, but the optimizer could not find it. The fix was a one-line change to what each block computes. That change, residual learning, is in every ResNet, and the same skip connection now sits in essentially every Transformer block.

Five ideas carry the explanation: what the degradation problem actually is, and why it is a failure of optimization rather than a limit of what the network can represent; how the residual reformulation makes doing nothing the default; why that makes the learning easier; how to keep the shortcut working when the shapes change so you can go truly deep; and what 152 layers finally bought.

Deeper networks got worse

Stated carefully, the phenomenon runs as follows. Plain stacked convolutional networks, the straightforward kind with no shortcuts, were trained at a range of depths. Up to about twenty layers, depth helps. Past that, accuracy saturates and then drops. The paper calls this the degradation problem, and the detail that matters is which error rises: the training error. On CIFAR-10, a 56-layer plain net has higher training error than a 20-layer one throughout training.

That single fact rules out the first explanation. Overfitting means a model fits the training data too well and generalizes badly: low training error, high test error. Here the training error itself is higher. The deeper network is not memorizing too much.

The second explanation is vanishing or exploding gradients, the classic reason deep networks were once hard to train. By 2015 that problem was largely handled, by normalized initialization and by batch normalization, which rescales each layer's pre-activations using the current mini-batch's statistics. The ResNet authors trained their plain baselines with batch norm and checked directly: the forward signals keep non-zero variance, and the backward gradients keep healthy norms. Neither vanishes. So whatever is breaking the deep plain net, it is not that the signal is dying on the way through.
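As a rough illustration of what "rescales each layer's pre-activations using the current mini-batch's statistics" means, here is a minimal NumPy sketch of the training-time batch-norm forward pass. The function name `batch_norm`, the `eps` value, and the fixed scale and shift (`gamma = 1`, `beta = 0`) are illustrative assumptions, and the running averages used at inference are left out.

```python
import numpy as np

def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize pre-activations z of shape (batch, features), one feature at a time."""
    mean = z.mean(axis=0)                      # per-feature mean over the mini-batch
    var = z.var(axis=0)                        # per-feature (biased) variance
    z_hat = (z - mean) / np.sqrt(var + eps)    # zero mean, roughly unit variance
    return gamma * z_hat + beta                # learned rescale and shift (fixed here)

rng = np.random.default_rng(0)
z = rng.normal(loc=5.0, scale=20.0, size=(128, 4))   # badly scaled pre-activations
out = batch_norm(z)
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))
```

Whatever scale the inputs arrive at, each feature leaves with mean close to zero and standard deviation close to one, which is why the forward signal in a batch-normed plain net does not die out with depth.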

That leaves optimization itself. The deeper plain network is harder to optimize: the solver cannot drive its training error down the way it can for a shallower one. The kind of failure matters here: not representation, since the deeper net can express the shallower one exactly (a construction below builds that solution by hand), but search. Gradient descent fails to find, in any reasonable time, a solution that provably exists, and this difficulty grows with depth. The paper does not claim to know the precise reason; it offers a conjecture, that deep plain nets may have "exponentially low convergence rates," and explicitly leaves the mechanism to future work. The contribution is not a theory of why plain nets fail; it is a reformulation that sidesteps the failure.

Below, the same comparison in motion. In plain mode the deeper net settles at higher error; flip to residual and the order reverses, the deeper net reaching lower error.

Figure 1 · the degradation problem

Training error over training. As a plain network goes from 18 to 34 layers, its training error gets worse: stacking layers hurt, and not from overfitting, since it is the training error that rose. Switch to residual blocks and depth helps again, the deeper net now reaching the lowest error. Curve shapes are illustrative; the ordering is the paper's result.

A deep net should match a shallow one

The thought experiment that makes degradation paradoxical also points straight at the fix. Suppose you have a shallow network that works. Build a deeper one this way: copy the shallow network exactly, then append extra layers that each compute the identity, output equals input. The new layers change nothing, so the deeper network computes the identical function and gets the identical training error.

So a solution always exists: any deeper network can at least tie its shallower self, by setting the extra layers to pass data through. If real training lands somewhere worse, that is the optimizer failing to find a solution we know is there. Degradation is not a ceiling on what the deeper network can represent.
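The construction can be checked in a few lines. The sketch below is a toy under stated assumptions: small fully connected layers stand in for convolutions, batch norm is omitted, and the widths and depths are arbitrary. It builds a shallow net, appends layers whose weights are set to the identity by hand, and confirms the deeper net computes the same function.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    return np.maximum(v, 0.0)

def forward(x, weights):
    """A plain fully connected net: every layer is a matrix multiply, then ReLU."""
    for W in weights:
        x = relu(W @ x)
    return x

width = 8
shallow = [rng.normal(size=(width, width)) for _ in range(3)]

# Deeper net: the same three layers, then five extra layers set to the identity.
# relu(I @ h) equals h here because h is already a ReLU output (non-negative).
deep = shallow + [np.eye(width) for _ in range(5)]

x = rng.normal(size=width)
y_shallow = forward(x, shallow)
y_deep = forward(x, deep)
print(np.allclose(y_shallow, y_deep))
```

The identity weights were written in by hand. The degradation problem is that gradient descent, starting from a random initialization, does not reliably find them.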

And the thought experiment names its own difficulty. The easy solution needs those extra layers to implement the identity mapping, and a stack of convolutions, batch norms, and ReLUs is oddly bad at learning to copy its input exactly. Asking several nonlinear layers to combine so their composition comes out to $\mathcal{H}(\mathbf{x}) = \mathbf{x}$ is an awkward target for gradient descent. The identity is simple to write down and apparently hard to learn. There is a reason it feels backwards: every part of those layers is built to transform, the convolutions to mix, the ReLUs to bend, and asking dozens of them to collectively reproduce their input untouched is fighting their design. Each layer must precisely undo whatever the others do.

That narrows the problem. We do not need a fundamentally different network. We need to make the identity mapping easy to represent, so the "do nothing extra" solution is sitting right where the optimizer starts.

Learn the change, not the whole map

The reformulation is one move. Write the mapping you want a few stacked layers to compute as $\mathcal{H}(\mathbf{x})$. Instead of asking the layers to produce $\mathcal{H}(\mathbf{x})$ directly, ask them to produce only the difference from the input,

$$\mathcal{F}(\mathbf{x}) := \mathcal{H}(\mathbf{x}) - \mathbf{x}$$

and then add the input back. The block's output is the residual function plus a copy of the input:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x} \tag{1}$$

Here $\mathbf{x}$ is what flows into the block, $\mathbf{y}$ what flows out, and $\mathcal{F}$ is a couple of weight layers with their own parameters. For the basic block that is two $3\times3$ convolutions with a ReLU between them, batch-normed:

$$\mathcal{F}(\mathbf{x}, \{W_i\}) = W_2\,\sigma(W_1\mathbf{x})$$

with $\sigma$ the ReLU (biases and batch norm are omitted to keep the notation light). The new part is the $+\,\mathbf{x}$: a shortcut connection that skips the weight layers and adds the untouched input to their output. It has no parameters and costs almost nothing, a copy and an addition. A second ReLU is applied after the addition.

Nothing was given up in expressiveness. Whatever $\mathcal{H}$ the plain layers could have produced, the residual form can produce too, by letting $\mathcal{F}$ learn $\mathcal{H} - \mathbf{x}$. Same family of functions, rewritten. What changed is the default. To make a residual block do nothing, the identity, you no longer need the layers to reconstruct a copy of their input. You need $\mathcal{F} = 0$, which means driving the weights toward zero, about the easiest thing an optimizer does. Now the easy target sits at the origin, close to where the weights start.

The entire idea fits in the gap between two near-identical functions. Same layers, same batch norms, same ReLUs. One line apart:

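```python
import torch.nn as nn
from torch.nn.functional import relu

# Assumed setup (PyTorch; 64 channels is an illustrative width)
conv1 = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)
bn1, bn2 = nn.BatchNorm2d(64), nn.BatchNorm2d(64)

# plain block: the two conv layers ARE the output
def plain(x):
    out = relu(bn1(conv1(x)))      # 3x3 conv -> BN -> relu
    out = bn2(conv2(out))          # 3x3 conv -> BN
    return relu(out)               # this is H(x)

# residual block: the SAME body, plus one "+ x"
def residual(x):
    out = relu(bn1(conv1(x)))      # identical layers...
    out = bn2(conv2(out))          # ...computing F(x)
    return relu(out + x)           # F(x) + x. that "+ x" is the idea
```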

Below is the block itself, plain versus residual. In plain mode the two weight layers alone must produce the mapping. In residual mode the identity shortcut routes around the weight layers and adds the input back at the join, so the layers only have to supply the change.

Figure 2 · a building block

The same two $3\times3$ conv layers in both. A plain block must build the whole desired mapping $\mathcal{H}(\mathbf{x})$ from those layers alone. A residual block adds a parameter-free identity shortcut, a straight copy of $\mathbf{x}$, so the layers only learn the residual $\mathcal{F}$ and the block outputs $\mathcal{F}(\mathbf{x}) + \mathbf{x}$.

Why the residual is easier

The paper states the reason as a hypothesis, and the shape of the claim matters. It is easier to optimize the residual mapping than the original, unreferenced one. To the extreme: if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity by a stack of nonlinear layers. The paper does not prove residual learning is easier in general; it argues it, and then shows it works across many networks.

Residual learning preconditions the problem, reshaping it so the optimizer starts near a good answer. In reality the ideal mapping is rarely exactly the identity. But in a deep network most blocks are doing small refinements, nudging a representation rather than rebuilding it, so the ideal mapping tends to sit closer to the identity than to zero. When that is true, a residual block has the easier job: with the shortcut supplying the identity at no parameter cost, it starts at "change nothing" (its layers near zero output) and only has to learn the small correction, making a small adjustment instead of building the mapping from scratch. A plain block has to synthesize the mapping from scratch, every time, even when the right answer is "leave it almost alone."

There is a concrete way to see why "do nothing" is so much cheaper in the residual form, in terms of how far the weights have to travel from where they start. Weights are initialized near zero, so a freshly initialized layer already outputs roughly zero. A residual block that should do nothing therefore needs only $\mathcal{F} = 0$, which is already where it sits: the optimizer has to move essentially zero distance to land on the right answer. A plain block asked to do the same thing has to output its own input, which means building the identity map $\mathcal{H}(\mathbf{x}) = \mathbf{x}$ out of layers that currently output near zero. That is a move of size $O(1)$ in every coordinate at once: each output dimension has to be coaxed from zero up to match the corresponding input dimension, and the weights across the layer must all change together to do it. One target is the starting point itself, the other is a structured map the optimizer has to reach from far away.
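The distance argument can be made concrete with a toy. The sketch below is an idealization under stated assumptions: fully connected layers instead of convolutions, no batch norm, and weights that start exactly at zero (real initializations are small, not zero). It compares what each block outputs at that starting point, then measures how far the plain block's weights must move to reach one particular identity solution, namely both matrices equal to the identity.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def plain_block(x, W1, W2):
    return relu(W2 @ relu(W1 @ x))          # H(x) built by the layers alone

def residual_block(x, W1, W2):
    return relu(W2 @ relu(W1 @ x) + x)      # F(x) + x

d = 16
rng = np.random.default_rng(0)
x = relu(rng.normal(size=d))                # block inputs follow a ReLU, so x >= 0

# Idealized starting point: all weights exactly zero.
Z = np.zeros((d, d))
res_out = residual_block(x, Z, Z)           # already the identity
plain_out = plain_block(x, Z, Z)            # all zeros

# One way for the plain block to copy x: set both matrices to the identity.
I = np.eye(d)
plain_identity = plain_block(x, I, I)
travel = np.sqrt(2 * np.sum((I - Z) ** 2))  # Frobenius distance from the zero start

print(np.allclose(res_out, x), np.allclose(plain_out, 0.0))
print(np.allclose(plain_identity, x), travel)
```

In this toy the residual block is at the "do nothing" solution before any training step, while the plain block has to move a distance of $\sqrt{2d}$ in weight space to reach the identity solution chosen here, and that distance grows with the layer width.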

The figure makes that difference concrete. A stack of layers starts near the zero function. The figure lets you set how far the ideal mapping sits from the identity. The plain block must build that whole curve from nothing; the residual block, already carrying the input through the shortcut, only has to learn the leftover. As the ideal approaches the identity, the residual's job shrinks toward nothing while the plain block's stays as large:

Figure 3 · the easier target

Both blocks start near the zero function. The plain block has to learn the mapping $\mathcal{H}$ (left, the amber gap). The residual block only learns the leftover $\mathcal{F} = \mathcal{H} - \mathbf{x}$ (right, the teal sliver), because the shortcut already supplies $\mathbf{x}$. Drag the ideal mapping toward the identity and the residual's work vanishes.

The paper finds a fingerprint of exactly this. It measures the standard deviation of each layer's output, a proxy for how much the layer changes its input, and finds that residual layers' responses are smaller on average than a plain network's, and that they shrink as the network gets deeper. That is what you would see if most layers were making small adjustments around the identity, which is what the residual form is built to make easy.

Figure 4 · layer responses

Standard deviation of each $3\times3$ layer's response (after BN, before the nonlinearity), in original layer order, redrawn after the paper's Figure 7. Flip the toggle: the residual traces sit visibly below their plain counterparts at 20 and 56 layers, and ResNet-110 sits lower still: each layer of a deeper ResNet modifies the signal less. The paper publishes the plot, not a table, so the curves are illustrative; the orderings are its result. Small responses are what layers learning small corrections look like.

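What about the gradients? Number the blocks so that $\mathbf{x}_l$ is an early activation and $\mathbf{x}_L$ a deeper one further along, and assume nothing is applied after each addition (the pre-activation block of the follow-up paper; with the original block's post-addition ReLU the sum below is no longer exact). Because each block adds its residual, stacking them from $l$ to $L$ sums those residuals up:

$$\mathbf{x}_L = \mathbf{x}_l + \sum_{i=l}^{L-1}\mathcal{F}_i(\mathbf{x}_i)$$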

Differentiating the loss $E$ with respect to the early activation, that sum gives the gradient a direct, weight-free path back:

$$\frac{\partial E}{\partial \mathbf{x}_{l}} = \frac{\partial E}{\partial \mathbf{x}_{L}}\left(1 + \frac{\partial}{\partial \mathbf{x}_{l}}\sum_{i=l}^{L-1}\mathcal{F}_i\right)$$

The $1$ in the parentheses is a shortcut for the gradient too: it flows from any deep block back to any shallow one without passing through a single weight layer, so it cannot be choked off even if the $\mathcal{F}$ paths attenuate. That is the "gradient highway" you have probably heard about, and it is real. It is also not the original paper's argument: the derivation comes from the authors' 2016 follow-up, Identity Mappings in Deep Residual Networks, and the original paper expressly said its plain baselines did not suffer vanishing gradients, because batch norm already kept the signal alive. The problem it solved was optimization difficulty; the gradient-flow benefit is a real, related bonus that the next paper made precise.
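A scalar toy makes the "$1 + \dots$" term visible. The sketch below assumes a made-up residual function $\mathcal{F}_i(x) = w_i \tanh(x)$ with small hypothetical weights, and applies nothing after the addition, as in the follow-up's pre-activation form. It computes $\partial x_L / \partial x_l$ analytically and checks it against a finite difference. The plain chain it is compared with has no batch norm, so its collapsing gradient is a property of this toy and not of the paper's batch-normed plain baselines.

```python
import numpy as np

def residual_chain(x, ws):
    """x_{i+1} = x_i + F_i(x_i), with the toy residual F_i(x) = w_i * tanh(x)."""
    grad = 1.0
    for w in ws:
        grad *= 1.0 + w * (1.0 - np.tanh(x) ** 2)   # d x_{i+1} / d x_i = 1 + F_i'(x_i)
        x = x + w * np.tanh(x)
    return x, grad

def plain_chain(x, ws):
    """x_{i+1} = F_i(x_i): the same layers with no shortcut."""
    grad = 1.0
    for w in ws:
        grad *= w * (1.0 - np.tanh(x) ** 2)         # d x_{i+1} / d x_i = F_i'(x_i)
        x = w * np.tanh(x)
    return x, grad

ws = np.full(50, 0.1)        # 50 blocks with small weights (hypothetical values)
x0 = 0.5

xL, g_res = residual_chain(x0, ws)
_, g_plain = plain_chain(x0, ws)

# Check the analytic gradient against a central finite difference.
h = 1e-6
g_fd = (residual_chain(x0 + h, ws)[0] - residual_chain(x0 - h, ws)[0]) / (2 * h)

print(f"residual d x_L / d x_l = {g_res:.4f} (finite difference {g_fd:.4f})")
print(f"plain    d x_L / d x_l = {g_plain:.3e}")
```

With non-negative weights every factor in the residual chain is at least 1, so the gradient cannot shrink below 1 however small the $\mathcal{F}$ terms are, while the product of the bare $\mathcal{F}_i'$ factors in the plain chain decays geometrically.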

Matching shapes, and going deeper

Equation (1) has a complication that you only hit when you build a real network. The addition $\mathcal{F}(\mathbf{x}) + \mathbf{x}$ only makes sense if $\mathcal{F}(\mathbf{x})$ and $\mathbf{x}$ have the same shape. But a convolutional network periodically changes shape: it doubles the number of channels and halves the spatial size to build up abstraction. At those transitions the input and the block's output no longer line up, and you cannot add them.

Where the dimensions change, the authors project the shortcut to match:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + W_s\,\mathbf{x} \tag{2}$$

Here $W_s$ is a $1\times1$ convolution that reshapes $\mathbf{x}$ to match $\mathcal{F}(\mathbf{x})$. The paper tests three options: zero-pad the shortcut to keep it parameter-free (A), use a projection only at the dimension changes and plain identity everywhere else (B), or project on every shortcut (C). B edges out A, and C is marginally better than B but not worth the extra parameters. The identity shortcut does the work; projections are a detail for matching shapes, not the source of the gains. So they keep the shortcut as a plain identity everywhere it can be one.
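To see what the two cheaper shortcut styles do to shapes, here is a minimal NumPy sketch of a zero-padding shortcut and a $1\times1$ projection at a transition that doubles the channels and halves the spatial size. The function names, the 64-to-128 channel change, the 56x56 input, and the random `W_s` are illustrative assumptions; the stride of 2 follows the paper's handling of shortcuts that cross feature-map sizes.

```python
import numpy as np

def shortcut_zero_pad(x, out_channels, stride=2):
    """Zero-padding shortcut: subsample spatially, then pad new channels with zeros."""
    x = x[:, ::stride, ::stride]                       # (C_in, H/stride, W/stride)
    pad = out_channels - x.shape[0]
    return np.pad(x, ((0, pad), (0, 0), (0, 0)))       # no parameters

def shortcut_projection(x, W_s, stride=2):
    """Projection shortcut: a strided 1x1 convolution, i.e. per-pixel channel mixing."""
    x = x[:, ::stride, ::stride]
    return np.einsum("oc,chw->ohw", W_s, x)            # W_s has shape (C_out, C_in)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 56, 56))                      # (channels, height, width)
W_s = rng.normal(size=(128, 64)) * 0.1                 # hypothetical projection weights

a = shortcut_zero_pad(x, out_channels=128)
b = shortcut_projection(x, W_s)
print(a.shape, b.shape)                                # both (128, 28, 28)
```

Either way the shortcut now has the shape of $\mathcal{F}(\mathbf{x})$ and the addition is defined. The zero-padded version adds no parameters; the projection adds one `C_out x C_in` matrix per transition.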

That frugality is what makes truly deep networks affordable, through one more design, the bottleneck block. Two $3\times3$ convolutions at full width get expensive fast. So for the deep models the block becomes three layers: a $1\times1$ convolution that squeezes the channels down, a $3\times3$ that does its work at the cheaper narrow width, and a $1\times1$ that restores the width. The cost is similar to the basic block's, but it moves a much wider signal through. The channels are processed in parallel: the first $1\times1$ pinches the bundle so the expensive $3\times3$ works on a narrow stream, the last $1\times1$ fans it back out, and the block keeps its full-width interface to its neighbors while paying narrow-width prices inside. The parameter-free identity matters here for a concrete reason: putting a projection on the wide ends would roughly double the block's cost.
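The cost claim is easy to check by counting weights. The sketch below assumes a basic block at 64 channels and a bottleneck that goes 256 to 64 to 256, and ignores biases and batch-norm parameters; `conv_params` is a helper introduced here for illustration. At a fixed spatial resolution, multiply-adds per position scale with the same counts.

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k convolution, ignoring biases and batch-norm parameters."""
    return k * k * c_in * c_out

# Basic block: two 3x3 convs at 64 channels.
basic = conv_params(3, 64, 64) + conv_params(3, 64, 64)

# Bottleneck block: 1x1 squeeze 256 -> 64, 3x3 at 64, 1x1 restore 64 -> 256.
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

# A 1x1 projection on the 256-channel shortcut, for comparison.
projection = conv_params(1, 256, 256)

print(basic, bottleneck, projection)   # 73728 69632 65536
```

The bottleneck comes in slightly under the basic block while carrying four times as many channels in and out, and a projection on its 256-channel shortcut would add almost as many weights as the whole block, which is the "roughly double" above.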

Figure 5 · the bottleneck block

The basic block is two $3\times3$ convs at full width (ResNet-18/34). The bottleneck block wraps one $3\times3$ between a $1\times1$ that squeezes 256 channels down to 64 and a $1\times1$ that restores them, so the channel count narrows in the middle. The two designs cost about the same (ResNet-34 and ResNet-50 come to 3.6B and 3.8B FLOPs), but the bottleneck carries a 4x wider signal, which is what makes 50, 101, and 152 layers practical.

The numbers settle it. A 152-layer ResNet runs at 11.3 billion FLOPs (floating-point operations, the standard yardstick for compute), still lower than VGG-19's 19.6 billion despite being eight times deeper. That is compute, not parameters, and ResNet wins on parameters too. The deep network is now trainable and cheaper than the shallow one it replaced.

What the depth bought

The cleanest result is the reversal this paper was built to produce. On ImageNet, a plain 34-layer network has higher error than a plain 18-layer one (28.54% versus 27.94% top-1 error, the fraction of images whose single highest-scoring class is wrong), the degradation problem in two data points. Swap in residual blocks and it flips: the 34-layer ResNet beats the 18-layer ResNet, and beats the plain 34-layer net by 3.5 percentage points (25.03 versus 28.54 top-1). At 18 layers the plain and residual nets nearly tie. When a network is not overly deep, the plain solver still finds a good solution; the residual shortcut improves on the plain net only as depth grows.

Figure 6 · depth versus error

ImageNet (10-crop top-1 error): the plain net worsens from 18 to 34 layers (27.94 to 28.54), while the ResNet improves (27.88 to 25.03). Toggle to CIFAR-10 to watch a ResNet keep improving from 20 to 110 layers (8.75 to 6.43), then tick back up at 1202 layers (7.93). All numbers are verbatim from the paper.

With degradation out of the way, depth paid off at a scale nobody had reached. The flagship was 152 layers deep, with a single-model top-5 error of 4.49% on the ImageNet validation set, and an ensemble at 3.57% on the test set that won the ILSVRC 2015 (the ImageNet Large Scale Visual Recognition Challenge) classification competition. The same representations carried over: the residual networks also took first place that year in ImageNet detection and localization and in COCO detection and segmentation.

The CIFAR-10 experiments push the point further. A ResNet keeps improving as it deepens, 20 to 32 to 44 to 56 to 110 layers, error falling from 8.75% to 6.43%. Then a 1202-layer version does worse, 7.93%. That looks like degradation returning, but it is not. The 1202-layer net optimizes fine: its training error still falls to near zero. It has 19.4 million parameters aimed at a tiny dataset, so, the paper argues, it overfits. That is the ordinary bias-variance story, the opposite failure from the plain-net degradation, where the deep net could not fit the training data as well as a shallower one.

The shortcut everyone kept

The residual connection outlived its original setting almost immediately. Within two years it was load-bearing in a very different architecture: every block of a Transformer wraps its attention and its feed-forward sublayer in the same move, output equals input plus what the sublayer computed, and for the reason this paper found, it lets you stack many layers and train them. The ResNet paper itself says nothing about Transformers, which came later; the inheritance runs only forward. But it does run: the skip connection in the language model you used today is a direct descendant of a 2015 vision paper.

The same little equation also kept attracting new readings. One sees a residual block,

$$\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l)$$

as a single Euler step of a differential equation, with depth playing the role of time; that view, sharpened by Lu and colleagues and by Haber and Ruthotto in 2017 and popularized by Neural ODEs in 2018, is the bridge our DiffusionBlocks explainer walks across. Another reads a residual network as an implicit ensemble of many shallow paths (Veit and colleagues, 2016). A third shows that skip connections smooth the loss landscape, keeping it navigable at depths where a plain net's would turn chaotic (Li and colleagues, 2018). Useful lenses, all three, and all three later work, none of it in the original paper: it supplied the mechanism, and the explanations came after.
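The Euler-step reading can be tried on a toy. The sketch below assumes the simple dynamics $dx/dt = -x$ and an explicit step size `h` (in a real residual block the step size is absorbed into $\mathcal{F}$). It stacks residual updates of the form $x \leftarrow x + h\,f(x)$ and compares the result with the exact solution $e^{-1}$ as the stack gets deeper. The function names and depths are illustrative.

```python
import math

def residual_stack(x, f, depth, total_time=1.0):
    """Apply `depth` residual updates x <- x + h * f(x), with h = total_time / depth."""
    h = total_time / depth
    for _ in range(depth):
        x = x + h * f(x)        # one residual block == one forward Euler step
    return x

def decay(x):
    return -x                   # toy dynamics dx/dt = -x, so x(1) = x(0) * exp(-1)

exact = math.exp(-1.0)

errors = {}
for depth in (4, 16, 64, 256):
    approx = residual_stack(1.0, decay, depth)
    errors[depth] = abs(approx - exact)
    print(depth, round(approx, 6), errors[depth])
```

Deeper stacks take smaller steps and track the continuous trajectory more closely, which is the sense in which depth plays the role of time in this reading.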

The chain of reasoning is short, and each link is forced by the one before it. Depth should help, because a deeper network can always copy a shallower one. So when a deep plain net does worse, the optimizer is failing to find a solution we know exists, not running into a limit on what it could represent. Adding the input back makes that missing solution easy to reach: a block that should do nothing now only has to learn nothing. Each layer is then free to learn a small correction instead of an entire mapping, and the optimizer can actually find it. One line of arithmetic, $+\,\mathbf{x}$, and the depth that used to hurt started to help.

Provenance: verified against primary literature

VGG (2014): The very-deep 3x3 baseline and the complexity yardstick (VGG-19 = 19.6B FLOPs).

Batch Norm (2015): Keeps signals healthy, which is why the plain-net failure is not vanishing gradients.

Highway Nets (2015): Concurrent gated shortcuts; the contrast with the parameter-free identity.

Identity Mappings (2016): The authors' own follow-up: the clean additive gradient path.

Correction: Popular accounts say ResNets fixed vanishing gradients. The original paper argues the opposite: with batch norm, signals do not vanish, and the degradation it targets is an optimization problem. The clean gradient-highway argument is the 2016 follow-up, not this paper.

Questions you might still have

If degradation is not overfitting or vanishing gradients, what is it?
An optimization difficulty. The deeper plain net is harder for SGD to fit, even on the training set. The paper conjectures deep plain nets have "exponentially low convergence rates" and leaves the full reason as future work.

Does adding the input back change what the network can compute?
No. F(x) + x can represent any mapping H by setting F = H - x, so the function class is identical. What changes is which solutions are easy to reach: doing nothing now means F = 0, the easiest target there is.

Why does the 1202-layer CIFAR net do worse if depth helps?
That is overfitting, not degradation. It optimizes fine (training error near zero) but has 19.4M parameters for a tiny dataset. The plain-net degradation problem is the opposite: there the deeper net could not even fit the training set.

Did ResNet solve the vanishing gradient problem?
Not as the paper frames it. Its plain baselines already used batch norm, so signals did not vanish; it targeted the degradation problem. The gradient-highway reading (a "+1" in the backward pass) is the 2016 follow-up, Identity Mappings in Deep Residual Networks.

Footnotes & further reading

The paper: He, Zhang, Ren, Sun, Deep Residual Learning for Image Recognition (Microsoft Research, CVPR 2016). Original code and models.
The follow-up that supplies the clean gradient-flow argument and the pre-activation block: He, Zhang, Ren, Sun, Identity Mappings in Deep Residual Networks (2016). The "$1+\dots$" backward path lives here, not in the original.
Batch normalization, which keeps the plain baselines' signals alive (and is why degradation is not a vanishing-gradient problem): Ioffe & Szegedy, Batch Normalization (2015). Its "internal covariate shift" rationale was later contested by Santurkar et al. (2018), who argue its real effect is smoothing the optimization landscape.
The very-deep baseline ResNet measures itself against: Simonyan & Zisserman, Very Deep Convolutional Networks (VGG) (2014).
The concurrent gated-shortcut work whose gates can close, in contrast to the parameter-free identity here: Srivastava, Greff & Schmidhuber, Highway Networks (2015).
Later readings of the residual block (all post-date the paper): the ODE / Euler-step view in Neural ODEs (Chen et al., 2018, building on Lu et al. and Haber & Ruthotto, 2017); the ensemble-of-shallow-paths view (Veit et al., 2016); and the loss-landscape view (Li et al., 2018).
