
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Re-center and re-scale every layer's inputs as the network trains, and it learns far faster.

Deep networks used to train slowly and need careful babysitting. Batch Normalization fixes the mean and variance of each layer's inputs on every mini-batch, which lets you use much higher learning rates and revives the saturating nonlinearities everyone had given up on.

Explaining the paper: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Ioffe, Szegedy · Google · ICML 2015 · arXiv:1502.03167

Nudge one early layer and you reshuffle the numbers every later layer sees. Stop the reshuffling and a network that used to crawl takes off.

Deep networks are trained by stochastic gradient descent: show the network a mini-batch of examples, measure how wrong it is, push every weight a little in the direction that lowers the loss, and repeat a few million times. It works, and in the years before 2015 it was also finicky. You set the learning rate low, you initialized the weights with care, and even then the deepest networks trained painfully slowly. Stacking many layers made the optimization fragile in a way nobody could fully pin down.

Ioffe and Szegedy put a name to one cause. Every layer's input is the previous layer's output, so the instant you update an early layer, the distribution of numbers arriving at every layer above it moves. The layers above are chasing a target that slides around underneath them, and they waste effort re-adapting to it. The paper calls this internal covariate shift: the change in the distribution of a layer's inputs caused by training the layers below. That name turned out to be the most disputed claim in the paper, while the operation it motivated stuck.

Batch Normalization makes normalization a step in the network itself: standardize each layer's inputs to mean zero and variance one over the current mini-batch, then let the layer learn to scale and shift them back however it likes. A few ideas carry the rest, and each is simple alone: why a sliding input distribution stalls training, why you cannot just normalize and walk away, the exact normalize-then-rescale operation, how gradients flow through it, what has to change at test time, and why faster training at learning rates that used to blow up follows.

Each layer's inputs keep shifting

Start with the concrete damage a sliding distribution does, using the nonlinearity the paper keeps returning to. A sigmoid, $g(x) = 1/(1+e^{-x})$, squashes any input into $(0,1)$. Its slope is $g'(x) = g(x)\,(1-g(x))$, which peaks at $0.25$ when $x=0$ and falls toward zero as $x$ moves either way. Out at $x = 5$ the slope is about $0.0066$. Since backpropagation multiplies by that local slope on the way down, an input sitting far out on the sigmoid's flat shoulder passes almost no gradient to the weights that produced it. Those weights stop learning. This is the saturation problem, and it is what the paper argues internal covariate shift triggers.
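The two slope values quoted above are easy to check by hand. A minimal sketch in plain Python; the function names `sigmoid` and `sigmoid_slope` are illustrative choices, not anything from the paper:

```python
import math

def sigmoid(x):
    """Logistic sigmoid g(x) = 1 / (1 + e^{-x})."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(x):
    """Derivative g'(x) = g(x) * (1 - g(x))."""
    g = sigmoid(x)
    return g * (1.0 - g)

slope_at_0 = sigmoid_slope(0.0)   # the peak of the slope
slope_at_5 = sigmoid_slope(5.0)   # far out on the flat shoulder

print(f"g'(0) = {slope_at_0:.4f}")
print(f"g'(5) = {slope_at_5:.4f}")
```

The slope at $x = 5$ is roughly 38 times smaller than at the center, and backpropagation multiplies the gradient by that factor.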

Walk the failure once. Early in training the pre-activations sit near zero, in the steep middle of the sigmoid, and gradients flow. A few thousand steps later the layers below have drifted the batch mean out to, say, $+4$, and most of the batch now lands on the flat part, where the slope is about $0.02$ or less. The gradient reaching those weights is close to a rounding error, and that layer quietly stalls. Deeper networks make it worse, because the drift compounds layer over layer. The usual escapes (ReLU instead of sigmoid, careful initialization, a smaller learning rate) treat the symptom. Batch Normalization goes after the cause by keeping the pre-activation distribution centered, so the batch stays where the slope lives.
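To see the stall in numbers, the sketch below averages the sigmoid slope over a small batch before and after its mean drifts to $+4$, then again after re-centering. The batch values are made up for illustration, and the re-centering is plain mean subtraction, a stand-in for the first half of what BN does:

```python
import math

def sigmoid_slope(x):
    """Derivative of the logistic sigmoid, g(x) * (1 - g(x))."""
    g = 1.0 / (1.0 + math.exp(-x))
    return g * (1.0 - g)

def mean_slope(batch):
    """Average local slope over a batch: a rough proxy for gradient signal."""
    return sum(sigmoid_slope(x) for x in batch) / len(batch)

# Hypothetical mini-batch of pre-activations, centered on zero.
centered = [-1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
# The layers below drift the batch mean out to +4.
drifted = [x + 4.0 for x in centered]
# Re-center by subtracting the batch mean.
mu = sum(drifted) / len(drifted)
recentered = [x - mu for x in drifted]

early = mean_slope(centered)
late = mean_slope(drifted)
fixed = mean_slope(recentered)

print(f"centered batch:    {early:.4f}")
print(f"drifted to +4:     {late:.4f}")
print(f"after re-centering: {fixed:.4f}")
```

With these numbers the drifted batch keeps only a small fraction of the original gradient signal, and subtracting the batch mean restores all of it.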

Drag the drift in the figure below and watch the batch climb the curve. As its mean pushes into the flat tail, the points pile up where the sigmoid is level and the gradient-signal readout falls toward zero. Then flip Batch Norm on: the batch snaps back to the steep middle no matter how far the layers below have pushed it.

Figure 1 · saturation and the moving distribution
A mini-batch of pre-activations sitting on a sigmoid. Drag the drift μ and the batch slides into the flat tail, where the slope $g'$ is near zero and the gradient signal collapses. Turn Batch Norm on and the batch is re-centered to the steep middle at every drift, so the gradient stays alive. The faint ghost shows where the layers below pushed the mean; the arrow, where BN puts it back.

So a moving input distribution is not merely untidy. When it wanders into saturation it starves whole layers of gradient, and the network grinds to a halt. Hold the distribution still and the saturating sigmoid, long written off as too hard to train deep, becomes usable again, a claim the results section gets to put a number on.

Why you can't just normalize

The obvious move is to standardize each layer's inputs from the data, before each step, as a preprocessing pass. The authors tried a stripped version of exactly that and the model blew up, with one parameter racing off to infinity. Seeing why is what forces normalization to live inside the network.

Suppose a layer adds a learned bias $b$, so its value is $x = u + b$, and you normalize by subtracting the mean: $\hat{x} = x - \mathrm{E}[x] = (u+b) - \mathrm{E}[u+b]$. Now take a gradient step on $b$ that ignores the fact that $\mathrm{E}[x]$ itself depends on $b$. The step nudges $b$ by some $\Delta b$ (proportional to $-\partial\ell/\partial\hat{x}$, the negative gradient of the loss $\ell$ with respect to the normalized value). Recompute the normalized output afterward:

$$(u + b + \Delta b) - \mathrm{E}[u + b + \Delta b] \;=\; (u + b) - \mathrm{E}[u + b] \tag{1}$$

The $\Delta b$ cancels itself. It shifts the mean by exactly the amount it shifts $x$, so the mean-subtracted output, and therefore the loss, are unchanged. With no change in loss, the gradient on $b$ stays the same as before, so the next step nudges $b$ the same direction again, and $b$ grows without bound until the arithmetic overflows. Because the normalization subtracts the mean, and the mean travels with $b$, any change to $b$ is erased by the normalization that follows it. The gradient step is pushing on a direction the normalization makes invisible.
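The runaway can be reproduced in a few lines. This is a toy sketch, not the paper's experiment: the inputs `u`, the `targets`, the squared-error loss, and the learning rate are all hypothetical, chosen only so the naive gradient on $b$ is nonzero.

```python
def centered(values):
    """Mean-only normalization: subtract the mean of the values."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def loss(x_hat, targets):
    """Hypothetical squared-error loss on the normalized values."""
    return 0.5 * sum((xh - t) ** 2 for xh, t in zip(x_hat, targets))

u = [0.2, -0.4, 1.0, 0.6]        # hypothetical layer inputs
targets = [0.5, 0.0, 1.5, 1.0]   # hypothetical targets for x_hat
b, lr = 0.0, 0.1

bias_trace, loss_trace = [], []
for step in range(50):
    x_hat = centered([ui + b for ui in u])
    bias_trace.append(b)
    loss_trace.append(loss(x_hat, targets))
    # Naive gradient: pretend E[x] does not depend on b, so d x_hat / d b = 1.
    naive_grad_b = sum(xh - t for xh, t in zip(x_hat, targets))
    b -= lr * naive_grad_b

def loss_at(bias):
    """Loss as a true function of b, with the mean recomputed each time."""
    return loss(centered([ui + bias for ui in u]), targets)

# Gradient that accounts for the normalization, by central differences.
h = 1e-4
aware_grad_b = (loss_at(b + h) - loss_at(b - h)) / (2 * h)

print(f"b after 50 naive steps: {bias_trace[-1]:.2f}")
print(f"loss first / last:      {loss_trace[0]:.6f} / {loss_trace[-1]:.6f}")
print(f"aware gradient on b:    {aware_grad_b:.2e}")
```

The bias climbs by the same amount every step while the loss never moves, and the gradient taken through the mean subtraction is zero up to rounding.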

Scrub the training steps in the figure below. In the naive mode $b$ climbs forever while the layer output and the loss sit dead flat, a useless drift that continues until the numbers explode. Switch the gradient to aware and $b$ stops moving, because once the gradient accounts for the normalization, changing $b$ provably cannot change the loss, so its gradient is exactly zero.

Figure 2 · the bias that runs away
Normalize outside the gradient and the bias $b$ climbs without bound while the layer output and the loss never move (top vs bottom panel). Every step takes the same useless nudge the normalization erases. Flip the gradient to aware and $b$ sits still: its true gradient through the normalization is zero.

So normalization cannot be a bookkeeping pass bolted on between gradient updates. It has to be a layer the network differentiates through, so backprop accounts for how the mean and variance depend on the parameters. That dependence is the term the naive version dropped. Once it is included, normalizing $Wu + b$ makes the additive bias $b$ redundant (its gradient through the mean subtraction is zero, and the learned shift $\beta$ covers what it used to do), so batch-normalized layers simply drop the bias.

The batch-normalizing transform

Now the actual operation, applied to a single feature (one neuron's pre-activation). Across a mini-batch of $m$ examples you have $m$ values of that feature. Take their mean and variance, standardize, then apply a learned scale and shift:

$$\begin{aligned}
\mu_{\mathcal B} &= \frac{1}{m}\sum_{i=1}^m x_i & &\text{batch mean} \\[4pt]
\sigma_{\mathcal B}^2 &= \frac{1}{m}\sum_{i=1}^m (x_i-\mu_{\mathcal B})^2 & &\text{batch variance} \\[4pt]
\hat{x}_i &= \frac{x_i-\mu_{\mathcal B}}{\sqrt{\sigma_{\mathcal B}^2+\epsilon}} & &\text{normalize} \\[4pt]
y_i &= \gamma\,\hat{x}_i + \beta & &\text{scale and shift}
\end{aligned} \tag{2}$$

Reading the four lines: $\mu_{\mathcal B}$ and $\sigma_{\mathcal B}^2$ are the mean and variance of that feature over the batch. The variance divides by $m$, the biased estimator, and that choice is deliberate; it returns in the inference section. The $\epsilon$ is a small constant added to the variance inside the square root so the division stays safe when a feature is nearly constant. The result $\hat{x}_i$ has mean zero and variance one across the batch.

That last fact looks like it should hurt the network. Force a sigmoid's inputs to mean zero and variance one and you have trapped it in its near-linear middle, throwing away its ability to saturate when saturating is the right answer. So the transform does not stop at standardizing. It adds two learned parameters per feature, a scale $\gamma$ and a shift $\beta$, and the layer's real output is $y_i = \gamma\,\hat{x}_i + \beta$. The network can learn back any mean and spread it wants. It can even learn the identity: set

$$\gamma = \sqrt{\mathrm{Var}[x]}, \quad \beta = \mathrm{E}[x] \quad\Longrightarrow\quad y_i = x_i \tag{3}$$

and you recover the original pre-normalization activation exactly (neglecting $\epsilon$). So normalization costs the layer nothing in representational power: at worst it learns to undo itself, and at best it starts from a better-conditioned place. Drag $\gamma$ and $\beta$ in the figure below, or press "set to identity," and watch the output distribution detach from the fixed normalized one.

Figure 3 · normalize, then scale and shift
One mini-batch through the transform. The raw batch $x$ has some arbitrary mean and spread; normalizing produces $\hat{x}$, always centered at 0 with unit spread; then $y = \gamma\hat{x} + \beta$ puts any mean ($\beta$) and spread ($\gamma$) back. Setting $\gamma=\sigma$, $\beta=\mu$ recovers the raw batch, so the layer keeps every option it had.
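The four lines of (2) and the identity setting in (3) translate directly into code. A minimal sketch in plain Python for a single feature; the batch values are made up, and `batch_norm_forward` is an illustrative name:

```python
import math

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Apply the transform in (2) to one feature over a mini-batch x."""
    m = len(x)
    mu = sum(x) / m                                   # batch mean
    var = sum((xi - mu) ** 2 for xi in x) / m         # biased variance, divide by m
    x_hat = [(xi - mu) / math.sqrt(var + eps) for xi in x]
    y = [gamma * xh + beta for xh in x_hat]           # learned scale and shift
    return y, x_hat, mu, var

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]         # hypothetical raw batch

# gamma = 1, beta = 0 leaves the standardized values untouched.
y, x_hat, mu, var = batch_norm_forward(x, gamma=1.0, beta=0.0)
x_hat_mean = sum(x_hat) / len(x_hat)
x_hat_var = sum(xh ** 2 for xh in x_hat) / len(x_hat)

# The identity setting from (3): gamma = sqrt(Var[x]), beta = E[x].
y_identity, _, _, _ = batch_norm_forward(x, gamma=math.sqrt(var), beta=mu)
identity_gap = max(abs(a - b) for a, b in zip(y_identity, x))

print(f"batch mean {mu}, batch variance {var}")
print(f"x_hat mean {x_hat_mean:.2e}, x_hat variance {x_hat_var:.6f}")
print(f"largest |y - x| with identity setting: {identity_gap:.2e}")
```

The identity is recovered only up to the tiny distortion $\epsilon$ introduces, which is what "neglecting $\epsilon$" refers to.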

Two choices in (2) are worth a sentence each, because the textbook version of "normalize" is more ambitious. The classical move is to whiten: decorrelate the features and give them unit covariance, which needs the full covariance matrix and its inverse square root over the data. That is expensive (a matrix inverse-root every step), not always differentiable, and singular whenever the batch is smaller than the number of features. Batch Normalization makes two cheap simplifications instead: normalize each feature on its own (a mean and a variance, no cross-feature terms), and estimate those statistics from the current mini-batch rather than the entire dataset. The mini-batch estimate is precisely what lets the statistics ride along in backpropagation, the non-negotiable property the runaway-bias example forced.

In a convnet, and where to insert it

Batch Normalization goes immediately before the nonlinearity, normalizing the pre-activation $x = Wu+b$, so $z = g(Wu+b)$ becomes $z = g(\mathrm{BN}(Wu))$ (the bias dropped, as above). It normalizes $Wu$, not the raw layer input $u$, on purpose: $u$ is the output of an earlier nonlinearity whose distribution shape keeps changing, whereas $Wu$ is a sum of many terms and tends to be smoother and more symmetric, the kind of distribution that fixing the first two moments actually helps.

For a convolutional layer you want the same normalization at every spatial position of a feature map (the convolutional property), so the statistics are pooled over the batch and all locations: a feature map of size $p\times q$ in a batch of $m$ gives an effective batch of $m\cdot p\cdot q$ values, with one $\gamma,\beta$ learned per feature map rather than per activation. That is the entire operating layer: one mean, one variance, a standardize, and a learned scale and shift, slotted in ahead of each nonlinearity.
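The pooling for the convolutional case can be sketched with nested lists. The layout `acts[n][c][i][j]` (example, feature map, row, column) and all the numbers below are assumptions made for the illustration; a real implementation would use an array library.

```python
import math

def conv_batch_norm(acts, gamma, beta, eps=1e-5):
    """Normalize acts[n][c][i][j] per feature map c, pooling over batch and space.

    gamma and beta hold one value per feature map, not per activation.
    """
    m, channels = len(acts), len(acts[0])
    out = [[None] * channels for _ in range(m)]
    for c in range(channels):
        # Effective batch: every value of map c across examples and positions.
        values = [v for n in range(m) for row in acts[n][c] for v in row]
        mu = sum(values) / len(values)
        var = sum((v - mu) ** 2 for v in values) / len(values)
        scale = gamma[c] / math.sqrt(var + eps)
        for n in range(m):
            out[n][c] = [[scale * (v - mu) + beta[c] for v in row]
                         for row in acts[n][c]]
    return out

# Hypothetical activations: m = 2 examples, 2 feature maps, each 2 x 2.
acts = [
    [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]],
    [[[5.0, 6.0], [7.0, 8.0]], [[50.0, 60.0], [70.0, 80.0]]],
]
out = conv_batch_norm(acts, gamma=[1.0, 2.0], beta=[0.0, 1.0])

# Pool the outputs of feature map 1 to check its statistics.
pooled = [v for n in range(2) for row in out[n][1] for v in row]
pooled_mean = sum(pooled) / len(pooled)
pooled_std = math.sqrt(sum((v - pooled_mean) ** 2 for v in pooled) / len(pooled))

print(f"effective batch size: {len(pooled)}")
print(f"map 1 mean {pooled_mean:.4f}, std {pooled_std:.4f}")
```

Here $m\cdot p\cdot q = 2\cdot 2\cdot 2 = 8$ values share one mean and one variance, and the output of the second map lands at the mean and spread set by its own $\beta$ and $\gamma$.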

Pushing gradients through the batch

For the transform to live inside the network, every step in (2) has to be differentiable, including the batch statistics. The unusual part is that the output for example $i$ depends on all the other examples in the batch, through the shared $\mu_{\mathcal B}$ and $\sigma_{\mathcal B}^2$, so the gradient for one example picks up paths routed through those shared quantities. The chain rule gives the full set (before any simplification):

$$\begin{aligned}
\frac{\partial \ell}{\partial \hat{x}_i} &= \frac{\partial \ell}{\partial y_i}\cdot \gamma \\[4pt]
\frac{\partial \ell}{\partial \sigma_{\mathcal B}^2} &= \sum_{i=1}^m \frac{\partial \ell}{\partial \hat{x}_i}\,(x_i-\mu_{\mathcal B})\cdot \tfrac{-1}{2}\,(\sigma_{\mathcal B}^2+\epsilon)^{-3/2} \\[4pt]
\frac{\partial \ell}{\partial \mu_{\mathcal B}} &= \Big(\sum_{i=1}^m \frac{\partial \ell}{\partial \hat{x}_i}\cdot \tfrac{-1}{\sqrt{\sigma_{\mathcal B}^2+\epsilon}}\Big) + \frac{\partial \ell}{\partial \sigma_{\mathcal B}^2}\cdot \frac{\sum_{i=1}^m -2(x_i-\mu_{\mathcal B})}{m} \\[4pt]
\frac{\partial \ell}{\partial x_i} &= \frac{\partial \ell}{\partial \hat{x}_i}\cdot \tfrac{1}{\sqrt{\sigma_{\mathcal B}^2+\epsilon}} + \frac{\partial \ell}{\partial \sigma_{\mathcal B}^2}\cdot \frac{2(x_i-\mu_{\mathcal B})}{m} + \frac{\partial \ell}{\partial \mu_{\mathcal B}}\cdot \frac{1}{m} \\[4pt]
\frac{\partial \ell}{\partial \gamma} &= \sum_{i=1}^m \frac{\partial \ell}{\partial y_i}\,\hat{x}_i, \qquad \frac{\partial \ell}{\partial \beta} = \sum_{i=1}^m \frac{\partial \ell}{\partial y_i}
\end{aligned} \tag{4}$$

Reading the lines: the incoming gradient is scaled by $\gamma$ to reach $\hat{x}_i$. The $\sigma_{\mathcal B}^2$ and $\mu_{\mathcal B}$ lines collect how every example's normalized value shifts when the shared variance and mean shift, and the gradient to the input $x_i$ sums three paths: the direct one through $\hat{x}_i$, plus the two indirect ones through the batch variance and mean. The learned $\gamma$ and $\beta$ get the obvious sums. It is the chain rule applied through a couple of averages, and it gives backprop a path through the normalization itself, the property the runaway-bias argument said was required. Because these derivatives are cheap, a BN layer trains with whatever optimizer you already use, SGD, momentum, or Adagrad, and the batch statistics are part of what backprop differentiates.
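One way to gain confidence in (4) is to code it line by line and compare against finite differences. The sketch below does that for a single feature in plain Python. The batch `x`, the upstream gradient `dy`, and the linear loss used for the check are hypothetical.

```python
import math

def bn_forward(x, gamma, beta, eps):
    """Forward pass of (2) for one feature; returns y and a cache for backward."""
    m = len(x)
    mu = sum(x) / m
    var = sum((xi - mu) ** 2 for xi in x) / m
    inv_std = 1.0 / math.sqrt(var + eps)
    x_hat = [(xi - mu) * inv_std for xi in x]
    y = [gamma * xh + beta for xh in x_hat]
    return y, (x, x_hat, mu, var, inv_std)

def bn_backward(dy, cache, gamma, eps):
    """Backward pass following the five lines of (4)."""
    x, x_hat, mu, var, inv_std = cache
    m = len(x)
    dx_hat = [g * gamma for g in dy]
    dvar = sum(d * (xi - mu) for d, xi in zip(dx_hat, x)) * -0.5 * (var + eps) ** -1.5
    dmu = sum(-d * inv_std for d in dx_hat) + dvar * sum(-2.0 * (xi - mu) for xi in x) / m
    dx = [d * inv_std + dvar * 2.0 * (xi - mu) / m + dmu / m
          for d, xi in zip(dx_hat, x)]
    dgamma = sum(g * xh for g, xh in zip(dy, x_hat))
    dbeta = sum(dy)
    return dx, dgamma, dbeta

x = [0.5, -1.2, 2.0, 0.3]          # hypothetical mini-batch
dy = [1.0, -2.0, 0.5, 3.0]         # hypothetical upstream gradient dl/dy
gamma, beta, eps = 1.5, 0.2, 1e-5

def loss(xs):
    """Linear loss sum(dy_i * y_i), whose gradient with respect to y is dy."""
    y, _ = bn_forward(xs, gamma, beta, eps)
    return sum(g * yi for g, yi in zip(dy, y))

_, cache = bn_forward(x, gamma, beta, eps)
dx, dgamma, dbeta = bn_backward(dy, cache, gamma, eps)

# Central differences on each input as an independent check.
h = 1e-6
numeric_dx = []
for k in range(len(x)):
    plus = x[:k] + [x[k] + h] + x[k + 1:]
    minus = x[:k] + [x[k] - h] + x[k + 1:]
    numeric_dx.append((loss(plus) - loss(minus)) / (2 * h))
max_err = max(abs(a - n) for a, n in zip(dx, numeric_dx))

print(f"largest analytic-vs-numeric gap: {max_err:.2e}")
print(f"sum of dx over the batch:        {sum(dx):.2e}")
```

The input gradients sum to zero over the batch: shifting every $x_i$ by the same amount cannot change the output, which is the runaway-bias cancellation seen from the gradient side.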

Train on batches, test on the world

Left alone, a BN layer's output for an example would still depend on the rest of its mini-batch at test time, through the batch mean and variance. During training that coupling is fine, and even useful. At inference you want a deterministic answer that depends only on the input in front of you, not on whichever examples happen to be batched with it. So once training finishes, BN swaps its per-batch statistics for fixed population ones, estimated over the training data (in practice with moving averages kept during training) and then frozen:

$$\hat{x} = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}, \qquad \mathrm{Var}[x] = \frac{m}{m-1}\,\mathrm{E}_{\mathcal B}\!\left[\sigma_{\mathcal B}^2\right] \tag{5}$$

Here $\mathrm{Var}[x]$ is the unbiased population variance: average the per-batch (biased, $\div m$) variances over training batches, then multiply by $m/(m-1)$ to remove the bias. This is the deliberate asymmetry flagged back in (2). Training normalizes by the biased $\div m$ variance because that is the statistic the gradient was computed through, so consistency demands it; inference wants the best estimate of the true variance, which is the unbiased one. With the statistics frozen, BN is a fixed linear map per feature, and you can fold the standardize together with $\gamma,\beta$ into a single affine transform:

$$y = \frac{\gamma}{\sqrt{\mathrm{Var}[x]+\epsilon}}\,x + \left(\beta - \frac{\gamma\,\mathrm{E}[x]}{\sqrt{\mathrm{Var}[x]+\epsilon}}\right) \tag{6}$$

so at deployment BN costs essentially nothing, merging into the neighboring linear layer. In code the only thing that changes between training and inference is which mean and variance you use:

# batch norm for one feature, over a mini-batch x (1-D NumPy array of m >= 2 values)
import numpy as np

def batch_norm(x, gamma, beta, eps, training, run_mean, run_var, mom):
    """Return (y, run_mean, run_var); running stats change only when training."""
    if training:
        mu  = x.mean()                 # batch mean, divide by m
        var = x.var()                  # batch variance, biased: divide by m (ddof=0)
        run_mean = (1 - mom) * run_mean + mom * mu              # track for test
        run_var  = (1 - mom) * run_var  + mom * x.var(ddof=1)   # unbiased, m/(m-1)
    else:
        mu, var = run_mean, run_var    # frozen population stats
    x_hat = (x - mu) / np.sqrt(var + eps)  # standardize to mean 0, var 1
    y = gamma * x_hat + beta               # learned scale + shift
    return y, run_mean, run_var            # hand the updated stats back to the caller

(Some frameworks take a small shortcut: they keep an exponential moving average of the biased batch variance and skip the $m/(m-1)$ correction at test time. A gap between paper and practice, not a mistake in either.) BN behaving differently in training and inference is by design, and that gap is the source of both its mild regularizing effect and the small-batch failure mode that drives Group and Layer Norm.
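The folding in (6) is a two-line computation once the population statistics are frozen. A minimal sketch in plain Python; every parameter value and the helper names `fold_batch_norm` and `frozen_bn` are hypothetical. The last few lines push the fold one step further into a preceding bias-free linear layer $x = w\cdot u$, which is what merging into the neighboring layer amounts to.

```python
import math

def frozen_bn(x, gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Inference-time BN: standardize with frozen stats, then scale and shift."""
    x_hat = (x - pop_mean) / math.sqrt(pop_var + eps)
    return gamma * x_hat + beta

def fold_batch_norm(gamma, beta, pop_mean, pop_var, eps=1e-5):
    """Return (scale, offset) so that scale * x + offset equals frozen BN, as in (6)."""
    scale = gamma / math.sqrt(pop_var + eps)
    offset = beta - scale * pop_mean
    return scale, offset

gamma, beta, pop_mean, pop_var = 1.5, -0.3, 2.0, 4.0   # hypothetical frozen values
scale, offset = fold_batch_norm(gamma, beta, pop_mean, pop_var)

inputs = [-1.0, 0.0, 2.0, 7.5]
max_gap = max(abs(frozen_bn(x, gamma, beta, pop_mean, pop_var) - (scale * x + offset))
              for x in inputs)

# Folding into a preceding linear layer x = w . u that has no bias of its own:
# the weights absorb the scale and the offset becomes the layer's bias.
w = [0.4, -1.2]
u = [1.0, 0.5]
folded_w = [scale * wj for wj in w]
folded_b = offset
x_pre = sum(wj * uj for wj, uj in zip(w, u))
merged = sum(wj * uj for wj, uj in zip(folded_w, u)) + folded_b
merged_gap = abs(frozen_bn(x_pre, gamma, beta, pop_mean, pop_var) - merged)

print(f"scale {scale:.6f}, offset {offset:.6f}")
print(f"largest gap, BN vs folded affine: {max_gap:.2e}")
print(f"gap after merging into the linear layer: {merged_gap:.2e}")
```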

Why it lets you train faster

The most useful thing BN buys is a much higher learning rate without divergence. In a plain network a too-large rate makes gradients explode or vanish and the weights blow up. BN damps that through a scale invariance: multiply a layer's weights by any constant $a > 0$ and the normalized output is unchanged, because standardizing divides the scale right back out.

$$\mathrm{BN}(Wu) = \mathrm{BN}\big((aW)u\big), \qquad \frac{\partial\,\mathrm{BN}((aW)u)}{\partial u} = \frac{\partial\,\mathrm{BN}(Wu)}{\partial u}, \qquad \frac{\partial\,\mathrm{BN}((aW)u)}{\partial (aW)} = \frac{1}{a}\cdot\frac{\partial\,\mathrm{BN}(Wu)}{\partial W} \tag{7}$$

The gradient to the layer below is untouched by the weight scale, and the gradient to the weights themselves shrinks by $1/a$. So if a step makes the weights larger, the next step's gradient on them is proportionally smaller, and weight growth becomes self-limiting. That is why a batch-normalized network tolerates a learning rate five, even thirty times higher than the un-normalized one without flying apart. This particular effect survives scrutiny: later work reads it as BN automatically tuning the effective learning rate as the weight norm grows (van Laarhoven 2017; Arora et al. 2018).
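Both halves of (7) that concern the weights can be checked numerically on a tiny layer. The sketch below uses a two-weight linear unit, a made-up batch, a made-up linear loss, and central differences for the gradient. It sets $\epsilon = 0$ so the invariance is exact, which is an assumption of the sketch; with $\epsilon > 0$ it holds only approximately.

```python
import math

def standardize(x):
    """Batch-normalize one feature with gamma = 1, beta = 0, eps = 0."""
    m = len(x)
    mu = sum(x) / m
    var = sum((xi - mu) ** 2 for xi in x) / m
    return [(xi - mu) / math.sqrt(var) for xi in x]

def bn_layer(w, inputs):
    """Compute x_i = w . u_i for each example, then standardize over the batch."""
    return standardize([sum(wj * uj for wj, uj in zip(w, u)) for u in inputs])

def loss(w, inputs, coeffs):
    """Hypothetical linear loss on the BN output, enough to define a gradient."""
    return sum(c * y for c, y in zip(coeffs, bn_layer(w, inputs)))

def grad_w(w, inputs, coeffs, h=1e-6):
    """Central-difference gradient of the loss with respect to each weight."""
    grads = []
    for j in range(len(w)):
        up = [wk + h if k == j else wk for k, wk in enumerate(w)]
        down = [wk - h if k == j else wk for k, wk in enumerate(w)]
        grads.append((loss(up, inputs, coeffs) - loss(down, inputs, coeffs)) / (2 * h))
    return grads

inputs = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]   # 4 examples, 2 features
coeffs = [1.0, -2.0, 0.5, 3.0]
w = [0.7, -0.4]
a = 10.0
scaled_w = [a * wj for wj in w]

# First identity in (7): the BN output ignores the weight scale.
output_gap = max(abs(p - q)
                 for p, q in zip(bn_layer(w, inputs), bn_layer(scaled_w, inputs)))

# Third identity in (7): the weight gradient shrinks by 1/a.
g = grad_w(w, inputs, coeffs)
g_scaled = grad_w(scaled_w, inputs, coeffs)
ratio_gap = max(abs(a * gs - gi) for gs, gi in zip(g_scaled, g))

print(f"output gap between W and aW: {output_gap:.2e}")
print(f"gradient at W:  {g}")
print(f"gradient at aW: {g_scaled}")
print(f"gap between a * grad(aW) and grad(W): {ratio_gap:.2e}")
```

Scaling the weights tenfold leaves the layer's output alone and cuts the weight gradient tenfold, the self-limiting behavior described above.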

The paper offers a second, more theoretical reason, and it is the right place to be careful. It conjectures (the authors' word) that BN might push each layer's input-output Jacobian toward singular values near one, which would keep gradient magnitudes steady through depth. The paper itself flags this as unproven and "an area of further study," and later analysis went the other way: Yang et al. (2019) showed that at initialization BN drives those singular values away from one and can make gradients explode with depth. So treat the Jacobian story as an early guess that did not hold, not as an established mechanism.

The third reason is the one Figure 1 already made concrete: by keeping pre-activations in the steep region, BN makes saturating nonlinearities trainable again. A sigmoid network that was hopeless on its own becomes competitive once BN holds the inputs near the slope, with the ImageNet numbers below pinning down how much.

For intuition on the high-learning-rate claim, the two landscapes below are descended with one shared rate. The left, without BN, is jagged and high-curvature; the right, with BN, is a smooth bowl. Gradient descent on a quadratic stays stable only while the step is below $2/\text{curvature}$, so the jagged side diverges at a rate the smooth side still rides down. Drag the rate up to see it, then push it to the very top, where both diverge, since BN raises the ceiling rather than removing it. That smooth-versus-jagged picture is the modern reading of why BN helps, picked up again once the ICS story is unwound.

Figure 4 · a higher rate the smooth landscape can take
Two one-parameter loss landscapes descended at the same learning rate. Without BN the landscape is jagged (high curvature) and the ball overshoots and diverges; with BN it is a smooth bowl and the ball converges. Raise the rate and the gap appears; push it to the top and both diverge. Smoothing, not distribution-fixing, is the modern explanation.
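The $2/\text{curvature}$ threshold is quick to verify on a one-parameter quadratic. In this sketch the curvatures 5 and 1 are hypothetical stand-ins for the jagged and smooth landscapes, and the learning rates are picked to sit on either side of each threshold:

```python
def descend(curvature, lr, steps=50, start=1.0):
    """Gradient descent on f(w) = 0.5 * curvature * w**2; returns the final |w|."""
    w = start
    for _ in range(steps):
        w -= lr * curvature * w     # the gradient of f is curvature * w
    return abs(w)

lr = 0.45
sharp = descend(curvature=5.0, lr=lr)        # threshold 2/5 = 0.4, below lr
smooth = descend(curvature=1.0, lr=lr)       # threshold 2/1 = 2.0, above lr
too_high = descend(curvature=1.0, lr=2.5)    # past the smooth threshold as well

print(f"high curvature, lr={lr}: |w| = {sharp:.3e}")
print(f"low curvature,  lr={lr}: |w| = {smooth:.3e}")
print(f"low curvature,  lr=2.5:  |w| = {too_high:.3e}")
```

Each step multiplies $w$ by $1 - \text{lr}\cdot\text{curvature}$, so the iterate shrinks only while that factor stays inside $(-1, 1)$. Lower curvature raises the largest safe rate but does not remove the limit.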

14× fewer steps, and past human raters

The testbed is a variant of the Inception network on ImageNet, the 1000-class image-classification benchmark, with 13.6 million parameters, trained by momentum SGD at batch size 32. Batch Normalization goes before every nonlinearity, and the rest of the architecture is held fixed so the comparison is clean.

Adding BN alone (call it BN-Baseline) reaches Inception's accuracy in under half the steps and tops out a touch higher, $72.7\%$ against $72.2\%$. Then the authors spent the headroom BN buys: they raised the learning rate $5\times$, removed Dropout, cut the L2 weight penalty by $5\times$, decayed the learning rate $6\times$ faster, and dropped local response normalization. That model, BN-x5, reaches $72.2\%$ in $2.1$ million steps against Inception's $31$ million, the abstract's 14× fewer steps, about 7% of the training. If each step cost the same, that would be the difference between a run that takes two weeks and one that takes a day. Push the learning rate to $30\times$ (BN-x30) and it trains a little slower at first but climbs higher, to $74.8\%$.

Drag the target-accuracy line in the figure below. At $72.2\%$ the horizontal gap between BN-x5 and Inception is that 14×; raise the target and BN-x30 pulls ahead as the others flatten against their ceilings.

Figure 5 · steps to a target accuracy
Validation accuracy vs training steps for Inception and three batch-normalized variants. Drag the dashed target line: where each curve crosses it is the steps needed to get there. At 72.2%, BN-x5 needs ~2.1M steps to Inception's ~31M, the 14×. Curves are smooth fits pinned to the paper's Table 2 anchors, not raw logs.

The saturating-sigmoid claim gets its number here too: BN-x5 with a sigmoid in place of ReLU reaches $69.8\%$, while Inception with a sigmoid and no BN never escapes chance, one correct guess in a thousand. That gap comes entirely from keeping the pre-activations off the flat shoulders.

Combine six BN networks and the ensemble reaches $4.9\%$ top-5 error on the ImageNet validation set ($4.82\%$ on the held-out test set), past the previous best results (GoogLeNet's ensemble at $6.67\%$, Deep Image at $5.98\%$, and a contemporaneous $4.94\%$). The paper says this exceeds the accuracy of human raters, and that phrase deserves precision. The human number it beats is one carefully trained annotator's estimate of about $5.1\%$ top-5 error (Russakovsky et al.); a second, less-practiced annotator scored around $12\%$, and the annotator who set $5.1\%$ stressed that human accuracy sits on a speed-versus-effort tradeoff. So "past human raters" means edging past one trained human's best effort, not a robust superhuman result. With that caveat, the headline holds: the same architecture and the same parameter count, only the training changed, went from competitive to state of the art while training an order of magnitude faster.

What batch norm actually does

Batch Normalization plainly works. Why it works is a separate question, and the paper's own answer, internal covariate shift, is the part that did not survive. Three years on, Santurkar and colleagues tested the ICS story directly, and it failed three ways.

First, they added noise after each BN layer, deliberately re-introducing a large and constantly changing distributional shift, worse instability than a network with no normalization at all. If BN helped by stabilizing distributions, this should have wrecked it. The noisy-BN network trained almost as well as ordinary BN. Second, under a precise gradient-based definition of internal covariate shift, BN networks did not show less of it, and sometimes showed more, while still training far better; the link between ICS and optimization is, in their words, tenuous at best. Third, what BN does change is the optimization landscape: it makes the loss and its gradients smoother, so a gradient stays predictive over a longer step and larger learning rates remain stable. And the mean-zero, variance-one detail is not special, since other normalizations (L1, L2, L-infinity) that do not produce unit-Gaussian activations help comparably, which points at smoothing rather than distribution-fixing. (Scope it honestly: the smoothness theorems are proved for specific settings, and the result where L1 beats BN is for deep linear networks, not general ones.)

Be careful what that does and does not say. It does not say BN is useless or its gains are imaginary; they are large and real. It does not say internal covariate shift is fictional. It says the label on the box names the wrong mechanism. The Figure 4 picture, a jagged landscape made smooth, is closer to what is happening than the moving-distribution story the title leads with.

There is a second, more practical asterisk. BN ties every example's normalization to the rest of its batch. That coupling is what gives BN its mild regularizing noise: each example is seen alongside different neighbors every epoch, a little like Dropout, which is why the paper could remove Dropout and lose nothing. The same coupling is its weak spot. With small or skewed batches the per-batch statistics get noisy and the train-versus-inference gap widens, and accuracy falls; on ResNet-50 the error climbs from about $24\%$ at batch 32 to about $35\%$ at batch 2, where batch-independent successors like Group Normalization stay flat near $24\%$. That single dependence on the batch is why later architectures, the Transformer above all, reach for Layer Normalization instead, normalizing across a layer's features within one example so there is no batch to depend on.
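The batch dependence is easy to see side by side. The sketch below normalizes the same example inside two different hypothetical batches, once across the batch (the BN direction) and once across the example's own features (the Layer Normalization direction). The learned scale and shift are left out of both to keep the comparison bare.

```python
import math

def standardize(values, eps=1e-5):
    """Standardize a list of numbers to mean 0 and (nearly) unit variance."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / n
    return [(v - mu) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each feature across the examples in the batch (column-wise)."""
    columns = [standardize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each example across its own features (row-wise)."""
    return [standardize(row) for row in batch]

example = [1.0, 2.0, 4.0]
batch_a = [example, [0.0, 0.0, 1.0], [3.0, 1.0, 2.0]]     # hypothetical neighbors
batch_b = [example, [9.0, -5.0, 0.5], [2.0, 8.0, -1.0]]   # different neighbors

# How much does the first example's output change when its neighbors change?
bn_gap = max(abs(p - q)
             for p, q in zip(batch_norm(batch_a)[0], batch_norm(batch_b)[0]))
ln_gap = max(abs(p - q)
             for p, q in zip(layer_norm(batch_a)[0], layer_norm(batch_b)[0]))

print(f"batch norm: output shifts by up to {bn_gap:.3f}")
print(f"layer norm: output shifts by up to {ln_gap:.3f}")
```

Under batch normalization the same input gets a different output depending on its neighbors; under layer normalization the neighbors never enter the computation.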

What survives is the operation, not the explanation. Standardize each feature over the batch, hand it back a learned scale and shift, differentiate through the batch statistics so backprop carries them, and freeze them at test time. That one layer let networks train at learning rates that used to be suicidal, revived the saturating nonlinearities everyone had abandoned, and turned a strong image model into a faster and better one without adding anything to the architecture beyond two numbers per feature. The mechanism named in the title is the part the field stopped believing. The layer it produced is in almost everything since.

Provenance: verified against primary literature
Santurkar et al. (2018): Internal covariate shift does not explain BN; the active effect is a smoother loss landscape.
Yang et al. (2019): Disproves, at initialization, the Sec 3.3 conjecture that BN keeps layer-Jacobian singular values near 1.
Wu & He (2018) · Ioffe (2017): BN degrades at small batches; Group Norm and Batch Renorm address the batch dependence.
Shimodaira (2000): Origin of “covariate shift,” which the paper extends to a network’s internal layers.
Glorot & Bengio (2010): Xavier initialization (the paper prints the author order reversed, as “Bengio & Glorot”).
Correction: The paper’s headline mechanism, internal covariate shift, was later shown not to be why batch norm works (Santurkar et al. 2018): you can re-inject heavy distributional shift after a BN layer and it still trains fine. We present ICS as the original, now-contested motivation and add the modern loss-smoothing account alongside it. We also keep the Section 3.3 “singular values near 1” claim marked as a conjecture the paper hedges and that Yang et al. (2019) disproved, and we keep training on the biased (÷m) variance while inference uses the unbiased (÷(m−1)) one, as the paper specifies.

Questions you might still have

If batch norm doesn’t reduce internal covariate shift, why does it work?
Later analysis (Santurkar et al. 2018) points at the optimization landscape: BN makes the loss and its gradients smoother, so a larger step still lands somewhere sensible. That lets you raise the learning rate, which is most of the speedup. The mean-0/variance-1 detail is not special; other normalizations help comparably.

Why does the layer need γ and β if normalization already standardizes everything?
Standardizing to mean 0, variance 1 would pin a sigmoid to its near-linear middle and throw away useful range. γ and β let the layer put any mean and spread back, and with γ = √Var[x], β = E[x] they recover the original activation exactly, so normalization never costs representational power.

Why use the biased variance in training but the unbiased one at test time?
During training you must normalize by the same statistic the gradient is computed through, which is the biased (÷m) batch variance. At inference you want the best estimate of the true population variance, so you use the unbiased (÷(m−1)) correction. Some frameworks instead keep a moving average of the biased variance and skip the correction; a practice gap, not an error.

Why does batch norm break with small batch sizes?
Each example’s output depends on the rest of its batch through the shared mean and variance. With few examples those statistics are noisy, and the gap between batch statistics (training) and frozen statistics (inference) widens, so accuracy drops. Batch-independent successors like Group and Layer Normalization avoid the batch entirely.

Footnotes & further reading

  1. The paper: Ioffe & Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (Google, ICML 2015).
  2. The mechanism reckoning: Santurkar, Tsipras, Ilyas & Madry, How Does Batch Normalization Help Optimization? (NeurIPS 2018), the noise-injection experiment and the loss-smoothing account.
  3. The Section 3.3 conjecture, disproved at initialization: Yang, Pennington, Rao, Sohl-Dickstein & Schoenholz, A Mean Field Theory of Batch Normalization (ICLR 2019).
  4. The small-batch failure and batch-free fixes: Wu & He, Group Normalization; Ioffe, Batch Renormalization; and Ba, Kiros & Hinton, Layer Normalization.
  5. The scale-invariance descendant, effective-learning-rate tuning: van Laarhoven, L2 Regularization versus Batch and Weight Normalization, and Arora, Li & Lyu, Theoretical Analysis of Auto Rate-Tuning by Batch Normalization.
  6. Foundations the paper builds on: Shimodaira (2000) for the term "covariate shift," Glorot & Bengio (2010) for Xavier initialization, and the human-rater estimate from Russakovsky et al. (2014) (the ImageNet challenge paper).