
Very Deep Convolutional Networks for Large-Scale Image Recognition

Make a network deeper with nothing but stacked 3×3 filters, and it gets more accurate.

Simonyan and Zisserman held every other choice fixed and raised depth from 11 to 19 layers of small 3×3 convolutions. The error dropped with each added layer; the result took second place at ImageNet 2014 and became a template the field reused for years.

Explaining the paper: Very Deep Convolutional Networks for Large-Scale Image Recognition. Karen Simonyan, Andrew Zisserman · Visual Geometry Group, Oxford · ICLR 2015 · arXiv:1409.1556

The network that taught computer vision to go deep did not win the contest it was built for. It came second, and the field copied it anyway.

In 2014 the proving ground for image recognition was ImageNet, a thousand-way classification benchmark with 1.3 million training photos. Convolutional networks had only just taken it over: AlexNet won in 2012 and set off a scramble to improve it. Most of that work fiddled with the front of the network, a smaller first filter here, a different stride there. Simonyan and Zisserman asked a narrower, cleaner question: if you hold every other choice fixed and change only the depth, how far does depth alone carry you?

Their answer was a family of networks that differ in exactly one thing, the number of layers. Same 3×3 filters everywhere, same stride, same pooling, the same classifier bolted on top; only the count of layers changes, from 11 up to 19. Add layers and the error falls, until it flattens out around 19. The two deepest, known today as VGG-16 and VGG-19, became the default vision backbone for years: the thing you reached for when you needed features that worked.

One myth to clear before anything else, because it shapes how people remember the paper. VGG is often described as the network that won ImageNet 2014. It did not. GoogLeNet won the classification track; VGG came second. (VGG did win the separate localisation track, a different competition on the same images.) What let VGG outlast the actual winner was not its score but its shape: a stack of identical small filters, plain enough to hold in your head and general enough to keep working when you pointed it at a new task.

The paper rests on a handful of plain ideas: what a 3×3 convolution computes and why it leaves the image size alone, why stacking small filters beats one big filter, how the spatial-shrinks-channels-grow shape works and where the weights actually sit, and what it took to train and test a network this deep in 2014. Each is simple on its own. Taken in order, they explain the design.

The experiment: change only the depth

The reason earlier papers could not say whether depth helped is that they changed several things at once. A new entry might use a smaller first-layer filter, a different stride, and a few more layers, then report a better score. Which change earned the gain? You could not tell. To attribute an improvement to depth, you have to freeze everything else and move depth alone.

So that is what they froze. Every network in the paper shares the same scaffolding:

Input is a fixed 224×224 RGB crop, with the only preprocessing being to subtract the training-set mean color from every pixel.
Every convolution is 3×3, with stride 1 and one pixel of zero padding, so it leaves the spatial size unchanged.
All spatial shrinking is done by five max-pooling layers, each a 2×2 window with stride 2 that halves height and width.
On top sits the same classifier in every net: two fully-connected layers of 4096 units, then a 1000-way layer and a softmax.
Every hidden layer is followed by a ReLU.

With all of that nailed down, the one remaining knob is how many convolutional layers you stack between the poolings, and the paper sweeps exactly that, from a network of 11 weight layers up to one of 19. (A "weight layer" means a layer with learnable parameters, so convs and the three fully-connected layers count; pooling, ReLU, and softmax do not.) Because nothing else moves, any change in accuracy is depth's doing, and what comes out is a clean monotone trend, error falling as layers are added. We will meet the full family, and its error at every depth, in the configuration figure further down.

A 3×3 conv that keeps its size

Start with the single layer the network is built from. A convolution slides a small filter across the image, and at each position it multiplies the filter's weights against the little patch of input underneath and sums them into one output number. Two properties make this both powerful and cheap. The filter only ever looks at a small local window, because the things that matter for a low-level pattern (an edge, a corner) are local. And the same filter weights are reused at every position: a detector that is useful in one corner of the image is useful everywhere, so you learn one small filter instead of separate weights per location. A 3×3 filter you can picture as a stencil dragged across a wall, the cut-out shape fixed and pressed down at every spot. The picture captures reuse and locality, but it leaves out one thing: a real filter spans the full depth of its input, looking at all of the incoming channels at once, not a flat image.

A layer is a bank of these filters. It takes $C_{\text{in}}$ input channels and produces $C_{\text{out}}$ output channels, one per filter, and each filter has shape $C_{\text{in}}\times 3\times 3$. Because the weights are shared across every position, the layer's parameter count does not depend on the image size at all: it is $C_{\text{out}}\,(C_{\text{in}}\cdot 3^2 + 1)$, filters times weights-per-filter plus a bias each. Hold on to that fact, because it is why the deep part of VGG is so light.
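The parameter formula above is small enough to check by hand. A minimal sketch, where `conv_params` is an illustrative helper name rather than anything from the paper:

```python
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Learnable parameters in one conv layer: c_out filters of shape
    c_in x k x k, plus one bias per filter."""
    return c_out * (c_in * k * k + 1)


# The count has no image-size term: the same layer costs the same
# whether it runs over a 224x224 map or a 14x14 one.
first_layer = conv_params(3, 64)       # RGB input -> 64 channels
widest_layer = conv_params(512, 512)   # 512 -> 512 channels
print(first_layer, widest_layer)
```

Even the widest 3×3 layer in the network stays in the low millions of weights, which is the sense in which the conv stack is light.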

The output grid's size follows a simple rule. For an input of width $W$, a filter of size $F$, padding $P$ on each side, and stride $S$:

$$o = \left\lfloor \frac{W - F + 2P}{S} \right\rfloor + 1 \tag{1}$$

Plug in VGG's choice of $F=3$, $S=1$, $P=1$ and the output is $(W - 3 + 2)/1 + 1 = W$: exactly the size it started at. That is not automatic. With no padding, a 3×3 conv trims two pixels off the width and two off the height, and a deep stack of them would eat the image away to nothing. VGG pads by one pixel precisely so the convolutions preserve resolution, which means all of the deliberate downsampling can be left to the five pooling layers. Convolutions detect features at full resolution; pooling does the downsampling. (The size-preserving padding rule for stride 1 is $P=(F-1)/2$, which only lands on a whole number when $F$ is odd. A 3×3 is the smallest filter that has a real center pixel with a neighbor in every direction, which is the paper's stated reason for choosing it: the smallest window that can tell left from right and up from down.)
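Equation (1) can be checked directly. The sketch below uses an illustrative helper, `conv_out_size`, and applies the same rule to the 2×2, stride-2 pooling windows, treating pooling as a window with $F=2$, $P=0$, $S=2$:

```python
def conv_out_size(w: int, f: int, p: int, s: int) -> int:
    """Output width from equation (1): floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1


same = conv_out_size(224, f=3, p=1, s=1)      # VGG's conv: size preserved
unpadded = conv_out_size(224, f=3, p=0, s=1)  # no padding: two pixels lost
pooled = conv_out_size(224, f=2, p=0, s=2)    # 2x2 max-pool, stride 2: halved

# Five poolings walk the side down; the padded convs in between change nothing.
sizes = [224]
for _ in range(5):
    sizes.append(conv_out_size(sizes[-1], f=2, p=0, s=2))

print(same, unpadded, pooled)
print(sizes)
```

The list `sizes` traces the whole spatial schedule of the network, which is why the convolutions themselves never need to appear in it.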

Stacking small filters

The paper turns a single observation into a design. Stack two 3×3 convolutions back to back and the second layer's neurons see a 5×5 patch of the original input. Stack three and they see 7×7. The reach of one neuron, the set of input pixels that can influence it, is called its receptive field, and it grows by one pixel on every side with each layer you add.

Walk it once so the growth is not magic. A neuron in the first conv layer sees a 3×3 window of the input. A neuron in the second layer sees a 3×3 window of first-layer neurons, and each of those already saw its own 3×3 of the input. Slide those windows over each other and they overlap heavily, so the total reach is 5×5, not 3 plus 3. Each further layer pushes the boundary out by one ring, giving a field of side $2n+1$ after $n$ layers:

$$\text{receptive field of } n \text{ stacked } 3\times3 \text{ convs} = (2n+1)\times(2n+1) \tag{2}$$

So two layers reach 5×5, three reach 7×7, four reach 9×9. Depth buys reach, and it buys it cheaply. Drag the slider below and watch the field bloom outward, one colored ring per layer; the ring at $n=3$ is the 7×7 the next section prices against a single big filter.

Figure 1 · receptive field of a stack (interactive; shown at n = 3 stacked 3×3 layers)

One output neuron (the white dot), $n$ 3×3 layers up, depends on a centered square of input pixels. Each added layer extends the field by one pixel on every side, so the side grows as $2n+1$: two layers reach 5×5, three reach 7×7. The rings are colored by which layer first reached them.
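The $2n+1$ rule is a special case of a more general walk through the layers. A short sketch, assuming the usual recurrence in which each layer of kernel size $k$ adds $(k-1)$ times the current spacing between neighbouring units; the function name and the `(kernel, stride)` encoding are illustrative choices:

```python
def receptive_field(layers):
    """Side length of the input patch seen by one output unit.

    `layers` is a list of (kernel, stride) pairs, front to back. `jump` is
    the distance, in input pixels, between neighbouring units of the current
    layer; a k-wide layer widens the field by (k - 1) * jump.
    """
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf


for n in range(1, 5):
    stack = [(3, 1)] * n                      # n stacked 3x3 convs, stride 1
    assert receptive_field(stack) == 2 * n + 1
    print(n, receptive_field(stack))

# A single big filter with the same reach as two or three stacked 3x3 layers.
print(receptive_field([(5, 1)]), receptive_field([(7, 1)]))
```

With every stride equal to 1 the spacing never changes, so each 3×3 layer adds exactly two to the side, which is equation (2).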

Why small-and-deep wins

If three stacked 3×3 layers reach the same 7×7 patch as one 7×7 layer, why prefer the stack? The paper gives two reasons, and both fall out of comparing them head to head with the same channel count $C$ going in and out.

The first reason is parameters. One 3×3 conv layer has $3^2 C^2 = 9C^2$ weights, so a stack of three has $27C^2$. A single 7×7 layer has $7^2 C^2 = 49C^2$. Same reach, very different cost:

$$\underbrace{3\cdot(3^2 C^2) = 27\,C^2}_{\text{three } 3\times3} \quad<\quad \underbrace{7^2 C^2 = 49\,C^2}_{\text{one } 7\times7} \tag{3}$$

Since $49/27 \approx 1.81$, the single 7×7 carries about 81% more weights than the three-layer stack. (Read the direction carefully: the paper says the 7×7 has 81% more, measured against the stack; the same gap is a 45% reduction the other way. The two percentages have different bases.) The deeper option is the cheaper one.

The second reason is non-linearity, and it is the one the "same receptive field" framing tends to hide. Between the three 3×3 convs sit two ReLUs, where the single 7×7 has none inside it; counting the ReLU that follows the last conv in either design, that is three non-linearities against one. Each ReLU bends the function the layer computes, and three bends make a more expressive map than one. This is the part to state precisely: the stack is not equivalent to a 7×7 conv. It matches only in reach. With the ReLUs in place it is a strictly larger family of functions, and if you removed them the three linear convolutions would collapse back into a single (rank-limited) 7×7 linear map. The paper frames this as a kind of built-in regularisation: you are forcing a 7×7 filter to factor through three 3×3 filters with a non-linearity wedged between each, which is both cheaper and harder to overfit. A loose picture: reaching across a gap with three short arm segments hinged together rather than one rigid pole. Same reach, but it can bend at each joint, and it uses less material. The picture breaks where it matters most, though, since arms do not compose a function; the hinges, the ReLUs, are exactly what the analogy cannot convey.
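The collapse is easy to see in one dimension, where a 3×3 filter becomes a 3-tap kernel and a 7×7 becomes a 7-tap one. The sketch below is a toy illustration with made-up numbers, single-channel and bias-free, not the paper's experiment; `conv1d_valid`, `compose`, and `relu` are illustrative helpers:

```python
def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal with no padding (cross-correlation)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def compose(k1, k2):
    """Single kernel equal to applying k1 and then k2 with nothing in between."""
    out = [0.0] * (len(k1) + len(k2) - 1)
    for i, a in enumerate(k1):
        for j, b in enumerate(k2):
            out[i + j] += a * b
    return out


def relu(xs):
    return [max(0.0, x) for x in xs]


x = [1.0, -2.0, 3.0, 0.5, -1.0, 2.0, 4.0, -3.0, 1.5]
k1, k2, k3 = [1.0, 0.0, -1.0], [0.5, 1.0, 0.5], [-1.0, 2.0, -1.0]

# Three linear 3-tap layers are exactly one 7-tap layer.
seven_tap = compose(compose(k1, k2), k3)
linear_stack = conv1d_valid(conv1d_valid(conv1d_valid(x, k1), k2), k3)
single_layer = conv1d_valid(x, seven_tap)

# Put the two ReLUs back between the layers and the equivalence is gone.
relu_stack = conv1d_valid(relu(conv1d_valid(relu(conv1d_valid(x, k1)), k2)), k3)

print(len(seven_tap), linear_stack, single_layer, relu_stack)
```

The composed kernel has seven taps built from only nine free numbers, the one-dimensional analogue of the "rank-limited" 7×7 map; the ReLU version computes something no single 7-tap kernel reproduces on this input.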

Drag the receptive field below and watch both effects at once. The amber bar (one big conv) pulls ahead of the teal bar (the 3×3 stack) in weight count, while the stack racks up more ReLUs. At 7×7 the gap is the paper's 81%.

Figure 2 · two ways to cover the same field (interactive; shown at a 7×7 receptive field)

For a fixed $(2n+1)\times(2n+1)$ receptive field, one big conv against a stack of $n$ 3×3 convs. Weights are in units of $C^2$. At 7×7 the single conv has $49C^2$ against the stack's $27C^2$, 81% more, while the stack has three ReLUs to the single conv's one.
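The comparison in Figure 2 generalises to any odd field size. A small sketch that tabulates it, ignoring biases as the text does and counting weights in units of $C^2$; the function names are illustrative:

```python
def single_conv_weights(n: int) -> int:
    """Weights (units of C^2) of one conv covering a (2n+1) x (2n+1) field."""
    return (2 * n + 1) ** 2


def stack_weights(n: int) -> int:
    """Weights (units of C^2) of n stacked 3x3 convs covering the same field."""
    return 9 * n


for n in range(1, 5):
    side = 2 * n + 1
    big, stack = single_conv_weights(n), stack_weights(n)
    heavier = 100 * (big - stack) / stack  # gap measured against the stack
    saving = 100 * (big - stack) / big     # same gap measured against the big conv
    print(f"{side}x{side}: single {big}, stack {stack}, "
          f"single is {heavier:.0f}% heavier, stack saves {saving:.0f}%, "
          f"ReLUs {n} vs 1")
```

The single conv grows quadratically with the field side while the stack grows linearly, so the gap widens with every ring added.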

The 1×1 convolution

One configuration in the paper, called C, throws in a third kind of layer: the 1×1 convolution. A 1×1 conv sounds like it does nothing, but the "1×1" only describes its spatial footprint. At each pixel it still applies a learned $C_{\text{out}}\times C_{\text{in}}$ matrix to the channel vector sitting there, mixing all the input channels into all the output channels, then passes the result through a ReLU:

$$\mathbf{y}_{ij} = \mathrm{ReLU}(W\,\mathbf{x}_{ij} + \mathbf{b}), \qquad W \in \mathbb{R}^{C_{\text{out}}\times C_{\text{in}}} \tag{4}$$

It recombines what each pixel already encodes across channels, recolors it by a fixed recipe, and never looks at a neighbor, so it leaves the receptive field untouched. (The idea comes from the "Network in Network" paper of Lin et al., 2014.) When the input and output channel counts match, the linear part is a same-size projection that could in principle be folded away; what it actually buys is the extra ReLU. So config C is a clean way to add non-linearity to the network without changing how far any neuron can see.
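Equation (4) written out on nested lists makes the "never looks at a neighbor" point concrete. This is a toy sketch with made-up numbers; `conv1x1_relu` is an illustrative name, and a real implementation would use a tensor library:

```python
def conv1x1_relu(x, weight, bias):
    """Apply y_ij = ReLU(W x_ij + b) independently at every pixel.

    x:      H x W grid of channel vectors, nested lists [H][W][C_in]
    weight: C_out x C_in matrix, nested lists
    bias:   list of C_out numbers
    """
    def mix(pixel):
        return [max(0.0, sum(w * v for w, v in zip(row, pixel)) + b)
                for row, b in zip(weight, bias)]
    return [[mix(pixel) for pixel in row] for row in x]


# A 2x2 image with 3 channels, mixed down to 2 channels.
x = [[[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]],
     [[2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]]
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, -1.0]
y = conv1x1_relu(x, W, b)

# Change one pixel: only that pixel's output moves, its neighbours do not.
x_changed = [[list(p) for p in row] for row in x]
x_changed[1][1] = [9.0, 9.0, 9.0]
y_changed = conv1x1_relu(x_changed, W, b)
print(y[0][0] == y_changed[0][0], y[1][1] == y_changed[1][1])
```

The spatial grid keeps its shape and only the channel count changes, which is why the receptive field is untouched.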

That makes the A-to-E family a controlled experiment within the controlled experiment. Going from config B to config C adds the 1×1 convs, and the error improves (top-5 single-scale error 9.9% down to 8.8%): the extra non-linearity genuinely helps. Going from C to the same-depth config D swaps those 1×1 convs for 3×3 convs, and the error improves again (8.8% down to 8.1%): the 3×3s capture spatial context that a per-pixel 1×1 cannot. The lesson lands in order. Added non-linearity helps; added spatial reach helps more. This is also why "VGG-16" means config D specifically. Config C is also 16 layers, but it spends three of them on 1×1 convs and ends up the worse of the two.

The shape, and where the weights are

Zoom out from a single layer to the silhouette of the network. As the image flows upward, the five poolings halve the spatial resolution each time, taking a 224×224 input down through 112, 56, 28, 14, to a final 7×7 grid. Going the other way, the channel width doubles after each pooling, 64, 128, 256, 512, and then caps: the last two stages both stay at 512 rather than doubling to 1024. The reasoning behind the trade is one of budget. As you throw away spatial resolution you can afford to describe each remaining location with a richer, higher-dimensional vector, and capping at 512 keeps the cost from exploding. The classifier finally reads a modest 7×7 grid of 512-dimensional descriptors.

Now the part that surprises people. The name says "very deep," and every diagram of VGG shows off the deep conv stack, yet that stack holds almost none of the network's weights. Flatten the final 7×7×512 feature map and it is a vector of 25,088 numbers; the first fully-connected layer maps that to 4096 units, which takes $7\cdot7\cdot512\cdot4096 \approx 102.76$ million weights in that one layer alone. The three fully-connected layers together hold about 89% of VGG-16's 138 million parameters. The thirteen convolutional layers, the entire "very deep" part, account for only about 10%, roughly 14.7 million. "Very deep" describes the layer count, not the weight count. Tap through the stages below to see the funnel shrink and the weight bar fill up almost entirely with that one fully-connected layer.

Figure 3 · the anatomy of VGG-16 (interactive; shown at stage FC1)

The funnel: box height is the feature-map size (224² down to 7²), brightness is the channel count (64 up to 512, capped). The bar splits all 138M weights by where they sit: the 13 conv layers are ~10%, while FC1 alone is ~74%. Tap a stage for its shape and parameter count.

The same network in code, with the per-stage repeats and shapes visible at a glance:

# VGG-16 (configuration D): a uniform 3x3 stack, 16 weight layers.
# convs(x, k, n): n stacked 3x3 convs (pad 1), width k, each + ReLU.
def vgg16(x):                        # x: [B, 3, 224, 224], mean-subtracted
    x = pool(convs(x, 64,  2))       # stage 1 -> 112x112x64
    x = pool(convs(x, 128, 2))       # stage 2 ->  56x56x128
    x = pool(convs(x, 256, 3))       # stage 3 ->  28x28x256
    x = pool(convs(x, 512, 3))       # stage 4 ->  14x14x512
    x = pool(convs(x, 512, 3))       # stage 5 ->   7x7x512
    x = flatten(x)                   # 7*7*512 = 25088
    x = dropout(relu(fc(x, 4096)))   # FC1: 25088 -> 4096   (102.76M weights)
    x = dropout(relu(fc(x, 4096)))   # FC2: 4096  -> 4096
    return fc(x, 1000)               # FC3 -> 1000 logits -> softmax

That fully-connected head is also why VGG is remembered as a heavy model, but the heaviness splits across two different axes. The parameters concentrate in the FC layers, as we just saw. The compute, on the other hand, concentrates in the conv layers, because they run their small filters over large spatial maps (a 3×3 conv on a 112×112 map does work at every one of 12,544 positions), while the FC layers act once on a single flattened vector. So VGG is expensive to store because of the FC head and expensive to run because of the convs. Later architectures noticed both problems and dropped the giant FC head for a global average pool, but that is a story for the networks that came after.
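Both halves of that split can be tallied from the layer shapes alone. The sketch below counts parameters (weights plus biases) and, as a rough proxy for compute, multiply-accumulates per forward pass on one 224×224 image; using MACs as the compute measure is an assumption of this sketch, and the names are illustrative:

```python
# Config D: (output channels, number of 3x3 convs) per stage; each stage ends in a 2x2 pool.
STAGES = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]
FC_SIZES = [4096, 4096, 1000]


def vgg16_budget(side=224, in_ch=3):
    """Return (conv_params, fc_params, conv_macs, fc_macs) for one forward pass."""
    conv_params = fc_params = conv_macs = fc_macs = 0
    c_in = in_ch
    for c_out, n in STAGES:
        for _ in range(n):
            weights = c_out * c_in * 9
            conv_params += weights + c_out      # one bias per filter
            conv_macs += weights * side * side  # filter applied at every position
            c_in = c_out
        side //= 2                              # max-pool halves the map
    fan_in = side * side * c_in                 # 7 * 7 * 512 = 25088
    for units in FC_SIZES:
        fc_params += fan_in * units + units
        fc_macs += fan_in * units               # applied once, to one vector
        fan_in = units
    return conv_params, fc_params, conv_macs, fc_macs


conv_p, fc_p, conv_m, fc_m = vgg16_budget()
total_p, total_m = conv_p + fc_p, conv_m + fc_m
print(f"params: {total_p:,} total, conv {100 * conv_p / total_p:.1f}%, FC {100 * fc_p / total_p:.1f}%")
print(f"MACs:   {total_m:,} total, conv {100 * conv_m / total_m:.1f}%, FC {100 * fc_m / total_m:.1f}%")
```

The two percentages come out nearly mirror images of each other: the FC head dominates storage, the conv stack dominates arithmetic.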

The configurations, A to E

With the building blocks in hand, the experiment is easy to read off. The paper lays out six networks. A has 11 weight layers; A-LRN is A plus one local response normalisation layer; B has 13; C has 16 (with the three 1×1 convs); D has 16 (3×3 throughout, this is VGG-16); E has 19 (VGG-19). They share the identical classifier head and differ only in how many conv layers fill each stage. Tap a column below to read its depth, parameter count, and best single-scale top-5 error.

Figure 4 · the configuration ladder (interactive; shown at config D)

Six nets A–E as stacks of their weight layers (3×3 conv teal, brighter for wider stages; 1×1 conv amber; FC violet). Tap a column. The error row falls with depth, 10.4 → 8.0, and saturates at 19; C (1×1 convs, 8.8) lands worse than the same-depth D (8.1).

Three things in that figure carry the paper. First, the error falls almost monotonically as depth grows, from 10.4% top-5 at config A to 8.0% at config E. Second, A-LRN is no better than A, so local response normalisation, a scheme carried over from AlexNet that rescales each activation by the activity of neighbouring channels, earns nothing here and is dropped from every deeper net. Third, the error stops moving at 19 layers: E barely improves on D, and the paper calls the gain saturated rather than claiming deeper is always better.
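The configuration family is compact enough to write down as data. The sketch below encodes each net as per-stage lists of kernel sizes, following the layout of the paper's configuration table as best it can be transcribed here, and recomputes the weight-layer and parameter counts. A-LRN is left out because the normalisation layer adds no learnable weights, so it matches A on both counts. `WIDTHS`, `CONFIGS`, and `describe` are illustrative names:

```python
# Stage widths; every stage ends in a 2x2 max-pool.
WIDTHS = [64, 128, 256, 512, 512]

# Kernel size of each conv, stage by stage (1 marks config C's 1x1 convs).
CONFIGS = {
    "A": [[3], [3], [3, 3], [3, 3], [3, 3]],
    "B": [[3, 3], [3, 3], [3, 3], [3, 3], [3, 3]],
    "C": [[3, 3], [3, 3], [3, 3, 1], [3, 3, 1], [3, 3, 1]],
    "D": [[3, 3], [3, 3], [3, 3, 3], [3, 3, 3], [3, 3, 3]],
    "E": [[3, 3], [3, 3], [3, 3, 3, 3], [3, 3, 3, 3], [3, 3, 3, 3]],
}

# The shared classifier head: 25088 -> 4096 -> 4096 -> 1000, with biases.
FC_PARAMS = (25088 * 4096 + 4096) + (4096 * 4096 + 4096) + (4096 * 1000 + 1000)


def describe(name):
    """Return (weight layers, total parameters) for one configuration."""
    c_in, convs, params = 3, 0, FC_PARAMS
    for width, kernels in zip(WIDTHS, CONFIGS[name]):
        for k in kernels:
            params += width * (c_in * k * k + 1)
            convs += 1
            c_in = width
    return convs + 3, params                  # plus the three FC layers


for name in CONFIGS:
    layers, params = describe(name)
    print(f"{name}: {layers} weight layers, {params / 1e6:.0f}M parameters")
```

Because the head is identical in every net, the whole spread from A to E is only about eleven million parameters; the family varies in depth far more than in size.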

One more control nails the central claim. They took config B and replaced each pair of 3×3 convs with a single 5×5 conv, which has the same receptive field by the $2n+1$ rule from Figure 1, so the only things that changed were the depth and the number of ReLUs. The shallow version measured about 7% higher top-1 error on a center crop. Same reach, fewer layers, worse result: deep-and-small beats shallow-and-big, measured directly.

The saturation at 19 layers is where this paper hands off to the next one. If error stops improving when you stack more plain layers, the obvious move is to ask why. The reason found later was that very deep plain networks become harder to optimise, not only prone to overfit. A year later ResNet named that the degradation problem and fixed it with identity shortcuts, which let networks go past a hundred layers. Keep the two findings separate: VGG reports depth saturating at 19, while ResNet reports plain depth actively degrading (its deeper plain nets had higher training error too). VGG hit a ceiling; ResNet showed the floor was dropping underneath it and built a new one.

Training before batch norm

It is easy to forget how recently the standard tools arrived. VGG was submitted in September 2014; batch normalization, the technique that later made deep nets shrug off bad initialisation, was still months away. So the paper had to solve the startup problem by hand, and a bad start was a real hazard: initialise a deep network carelessly and the gradients can stall before learning takes hold.

Their workaround was to grow into the depth. First they trained the shallow config A from scratch, with weights drawn from a small Gaussian (mean 0, variance 0.01, so a standard deviation of 0.1) and biases at zero. Then, when training a deeper net, they copied A's first four conv layers and its last three fully-connected layers into the new network as a warm start, leaving the middle layers random. The shallow net was trainable from noise; once trained, it gave the deeper nets a stable scaffold to start from. (A footnote in the paper adds that, after submission, they found the initialisation scheme of Glorot and Bengio removed the need for this pre-training entirely. The warm start was a workaround for its moment, not a lasting requirement.)

The rest of the recipe is ordinary supervised training, listed here because the exact numbers matter to anyone reproducing it: the multinomial logistic loss, optimised by mini-batch gradient descent with momentum 0.9 and batch size 256; L2 weight decay of $5\times10^{-4}$ and dropout of 0.5 on the first two FC layers; a learning rate that starts at $10^{-2}$ and is divided by ten whenever validation accuracy plateaus, three times in all, stopping after 370K iterations (74 epochs). Each crop is augmented with random horizontal flips and small random shifts in RGB color. On four NVIDIA Titan Black GPUs a single network took two to three weeks to train.
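For anyone wiring this into a training loop, the recipe reduces to a handful of constants plus a rule for when to cut the learning rate. The constants below are the ones quoted above; the plateau test is not specified at that level of detail, so `patience` and `min_gain` are hypothetical knobs invented for this sketch:

```python
TRAIN = {
    "batch_size": 256,
    "momentum": 0.9,
    "weight_decay": 5e-4,      # L2 penalty
    "dropout": 0.5,            # first two FC layers only
    "base_lr": 1e-2,
    "lr_factor": 0.1,          # divide by ten on a plateau
    "max_lr_drops": 3,
    "max_iterations": 370_000,
}


def next_lr(lr, val_acc_history, drops_so_far, patience=3, min_gain=1e-3):
    """Return (lr, drops) after one validation check.

    Hypothetical plateau rule: drop the rate when none of the latest
    `patience` checks beats the earlier best by at least `min_gain`.
    """
    if drops_so_far >= TRAIN["max_lr_drops"] or len(val_acc_history) <= patience:
        return lr, drops_so_far
    best_before = max(val_acc_history[:-patience])
    recent_best = max(val_acc_history[-patience:])
    if recent_best < best_before + min_gain:
        return lr * TRAIN["lr_factor"], drops_so_far + 1
    return lr, drops_so_far


# Accuracy has stalled at 0.60 for three checks, so the rate is cut once.
print(next_lr(TRAIN["base_lr"], [0.50, 0.60, 0.60, 0.60, 0.60], drops_so_far=0))
```

After the third cut the function leaves the rate alone, matching the "three times in all" in the recipe.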

One training choice has its own payoff and reappears at test time: scale. The 224×224 crop is cut from a training image first rescaled so its shorter side is some length $S$. Fix $S$ small and a crop spans most of the image; make it large and a crop is a close-up of one object. Rather than commit to one scale, they sampled $S$ randomly from $[256, 512]$ for each image, which the paper calls scale jittering: a single network trained to recognise objects across a wide range of sizes, the way you would learn a word by seeing it on flashcards held at different distances.
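The geometry of scale jittering can be sketched without touching any pixels. The function below only computes sizes and a crop window; drawing $S$ as an integer and placing the crop uniformly at random are assumptions of the sketch, and `jittered_crop_box` is an illustrative name:

```python
import random


def jittered_crop_box(width, height, s_min=256, s_max=512, crop=224, rng=random):
    """Pick a training scale S and a random crop window at that scale.

    Returns (S, (new_w, new_h), (left, top, right, bottom)). The image is
    rescaled, keeping its aspect ratio, so that its shorter side equals S;
    a crop x crop window is then placed at a random position.
    """
    s = rng.randint(s_min, s_max)                 # inclusive on both ends
    scale = s / min(width, height)
    new_w = max(s, round(width * scale))
    new_h = max(s, round(height * scale))
    left = rng.randint(0, new_w - crop)
    top = rng.randint(0, new_h - crop)
    return s, (new_w, new_h), (left, top, left + crop, top + crop)


# The same 500x375 photo, seen at three independently drawn scales.
rng = random.Random(0)
for _ in range(3):
    print(jittered_crop_box(500, 375, rng=rng))
```

At small $S$ the 224-pixel window covers most of the rescaled image; at large $S$ it covers less than a quarter of it, which is the close-up case.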

Running over any size image

The cleverest piece of the paper lives at test time, and it depends on the size-preservation from way back in the convolution section. A trained network expects a fixed 224×224 input, because the first fully-connected layer needs a fixed-length vector. But a fully-connected layer is the same operation as a convolution: an FC layer reading a 7×7×512 block computes exactly what a 7×7 conv would, and the two layers above it (the second 4096-unit layer and the 1000-way layer) are 1×1 convs. Rewrite the three FC layers that way and the entire network becomes one large stack of convolutions, with no fixed-size assumption left anywhere.

A fully-convolutional network can run over an image of any size. Feed it something larger than 224 and, instead of a single 1000-way prediction, it returns a small grid of them, one class-score vector per location, which the network then averages into a single answer. This is the dense evaluation borrowed from the OverFeat work of Sermanet et al. (2014). Drag the test scale below: at 224 the score map is a single cell, the ordinary fixed-size network, and as the image grows the map fills out into a grid of predictions to be averaged. The arithmetic is exact for VGG, where the conv stack divides the side by 32 and the 7×7 FC-as-conv trims six, leaving a map of side $Q/32 - 6$.

Figure 5 · dense evaluation (interactive; shown at test scale Q = 384)

The FC layers re-read as convolutions let the net run on a Q×Q image. The conv stack divides the side by 32 and the 7×7 FC-conv leaves a score map of side $Q/32-6$, each cell a 1000-way prediction; averaging gives one answer. At Q=224 the map is a single cell.
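The $Q/32 - 6$ rule follows from walking the side through the layers. A minimal sketch, assuming test sizes that are multiples of 32 so that rounding at the pooling layers never comes into play; `score_map_side` is an illustrative name:

```python
def score_map_side(q: int) -> int:
    """Side of the class-score map for a q x q test image."""
    side = q
    for _ in range(5):
        side //= 2        # each 2x2, stride-2 pool halves the side; 3x3 convs keep it
    side = side - 7 + 1   # FC1 read as a 7x7 conv with no padding trims six
    return side           # FC2 and FC3 read as 1x1 convs leave the side alone


for q in (224, 256, 384, 512):
    side = score_map_side(q)
    print(f"Q={q}: {side}x{side} score map, {side * side} predictions to average")
```

At 224 the map is a single cell, which recovers the ordinary fixed-size network; every extra 32 pixels of input adds one row and one column of predictions.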

The paper also keeps the older approach for comparison: cropping the image into 150 fixed windows (50 per scale across three scales) and averaging the network's answer over all of them. Dense and multi-crop evaluation turn out to be complementary rather than redundant, because they treat the image borders differently. A crop is padded with zeros at its edges; the dense map sees real neighbouring pixels there instead. Each captures something the other misses, so averaging the two beats either alone, taking config D from 7.5% top-5 with dense evaluation to 7.2% with both. No single evaluation method dominates here; the gains stack.

Where it landed

Pulled together, the numbers are the ones that fed the ImageNet 2014 entry. The best single VGG network (config E) scored 7.0% top-5 error on the test set. A two-model ensemble of D and E, assembled after the competition deadline, reached 6.8%. The actual seven-model submission scored 7.3%, good for second place in classification behind GoogLeNet's 6.7%. The same VGG networks won the localisation track outright. One detail cuts against the second-place finish: a single VGG net (7.0%) beat a single GoogLeNet (7.9%) by nearly a full point. VGG lost the competition on ensembling, not on the strength of one model.

What made the paper outlast its runner-up finish was generalisation. The authors released configs D and E publicly, and their features transferred well past ImageNet: take a VGG trained on ImageNet, lop off the classifier, and the activations underneath were strong enough to set records on other datasets even when fed to a plain linear classifier with no fine-tuning. A network built to win one benchmark doubled as a general-purpose vision feature extractor, and for years "use VGG features" was a sensible first move on almost any image task.

The paper's claim, stripped to one line, is modest: keep the filters tiny, keep the design uniform, and spend the complexity budget on depth. That uniform 3×3 stack became a template the field built on for years. The line runs straight from here to ResNet, which broke the depth ceiling VGG ran into; to backbones like U-Net that still stack VGG-style 3×3 blocks; and onward to the Vision Transformer, which finally replaced the convolution stack with attention. VGG was not the most accurate network of 2014, but it was the one everyone could build on.

Provenance: verified against primary literature

AlexNet (2012): The ConvNet baseline VGG keeps and deepens; its first layer was a large 11×11 stride-4 conv.

Network-in-Network (2014): The 1×1 convolution as a per-pixel channel mix, used in configuration C.

OverFeat (2014): Dense, fully-convolutional evaluation: turn the FC layers into convs and run over any size.

GoogLeNet (2014): Concurrent very-deep net (22 layers, 1×1/3×3/5×5); won the ILSVRC-2014 classification track.

ResNet (2015): The follow-up that broke the depth ceiling VGG hit, with identity shortcuts.

Correction: Popular accounts say VGG won ImageNet 2014. It did not: GoogLeNet won the classification track (6.7% top-5 test error) and VGG placed second (7.3%, a 7-net ensemble). VGG did win the separate localisation track. The detail that flatters VGG: its best single network (7.0%) beat a single GoogLeNet (7.9%).

Questions you might still have

Did VGG win ImageNet 2014?
No. GoogLeNet won the classification track with 6.7% top-5 test error; VGG placed second at 7.3%. VGG did win the separate localisation track. A single VGG net (7.0%) did beat a single GoogLeNet (7.9%), and a two-model VGG ensemble reached 6.8% after the deadline, but the competition entry came second.

Is a stack of three 3×3 convs the same as one 7×7?
Only its reach is the same. Both cover a 7×7 patch of the input, but the three 3×3 layers have two ReLUs between them, which makes the stack a strictly larger family of functions than any single 7×7 conv. Delete those ReLUs and the three linear convs collapse into one restricted 7×7 map. So "same receptive field" is about which pixels are seen, not about what can be computed.

If the conv layers hold so few weights, why is VGG heavy to run?
Parameters and compute live in different places. About 89% of the 138M weights sit in the three fully-connected layers (FC1 alone is 102.76M), but almost all of the arithmetic is in the conv layers, because they run over large spatial maps while the FC layers act on a single 7×7×512 vector. VGG is expensive on both axes for opposite reasons.

Why not just keep adding layers past 19?
VGG found the error stopped improving at 19 layers, and guessed deeper might help on larger datasets. Stacking many more plain layers actually makes them harder to optimise, which is the degradation problem that ResNet (He et al., 2015) named and fixed a year later with identity shortcuts.

Footnotes & further reading

The paper: Simonyan & Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition (Visual Geometry Group, Oxford; ICLR 2015). The released models D and E are the original VGG-16 and VGG-19.
The ConvNet baseline VGG deepens: Krizhevsky, Sutskever & Hinton, ImageNet Classification with Deep Convolutional Neural Networks (AlexNet, 2012).
The 1×1 convolution as a channel mixer: Lin, Chen & Yan, Network In Network (2014).
Dense, fully-convolutional evaluation: Sermanet et al., OverFeat (2014).
The concurrent very-deep net that won ILSVRC-2014 classification: Szegedy et al., Going Deeper with Convolutions (GoogLeNet, 2014).
The follow-up that broke the depth ceiling: He, Zhang, Ren & Sun, Deep Residual Learning for Image Recognition (ResNet, 2015).
The initialisation noted post-submission: Glorot & Bengio, Understanding the difficulty of training deep feedforward neural networks (AISTATS 2010).
