VerifiedarXiv:1312.611430 min
Generative models · Variational inference

Auto-Encoding Variational Bayes

How to backpropagate through a random sample.

A latent-variable model can dream up new data, but the likelihood you would train it on is an integral nobody can compute. The variational autoencoder gets around that with one trick: it rewrites a random sample so the gradient can flow straight through it, and a generative model you can train with plain backprop falls out.

Explaining the paperAuto-Encoding Variational BayesKingma, Welling · Universiteit van Amsterdam · ICLR 2014 · arXiv:1312.6114

What does it take to turn an autoencoder, a thing that just copies its input, into a model that invents new data you have never seen?

Here is the dream that organizes a lot of generative modeling. You would like a machine that can produce new examples of something. New handwritten digits, new faces, new molecules. The cleanest way to set that up is a latent-variable model. Pick a hidden code z\mathbf{z} from a simple distribution, push it through a network, and read out a datapoint x\mathbf{x}. The network learns to turn easy random codes into hard, structured data. To make a new sample you draw a fresh code and decode it. That is the whole generative story, and it is simple.

The trouble is training it. To fit the network you want to maximize the probability the model assigns to your real data. Writing that probability down is easy. Computing it is not, because it hides an integral over every possible hidden code, and that integral has no closed form and far too many codes to sum over. The exact same integral blocks the other thing you would reach for, which is figuring out which code produced a given datapoint. Both the headline likelihood and the bookkeeping you would use to get it are out of reach.

Auto-Encoding Variational Bayes, by Diederik Kingma and Max Welling, is the paper that made this work at scale. It does two things. It optimizes a tractable lower bound instead of the impossible likelihood, and it trains a small second network to guess the hidden code, so you never have to solve for it. When both networks are neural nets fit jointly, the result is the variational autoencoder. To see why it works you only need a short stack of ideas: the intractable integral, the lower bound that dodges it, the autoencoder that the bound turns into, and the rewrite that lets a random sampling step carry a gradient.

The data, and the wall

Start with the generative model itself. There is a hidden code z\mathbf{z}, drawn from a fixed, featureless prior, a standard Gaussian p(z)=N(0,I)p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I}) with no parameters to learn. Then a decoder network with weights θ\boldsymbol{\theta} turns that code into a distribution over data, pθ(xz)p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}). Sampling is a two-step recipe: draw z\mathbf{z}, then draw x\mathbf{x} from the decoder. A simple cloud goes in, structured data comes out.

Drag nothing, just watch: simple codes on the left stream through the decoder and land on the data.

Figure 1 · the generative model
A code z drawn from a plain Gaussian prior is pushed through the decoder and lands on the thin data manifold. Generation is exactly this: sample a simple cloud, decode it onto the structured sheet where real data lives. The intelligence is all in the map, not in the cloud.

Now the question that trains it. How probable is one real datapoint x\mathbf{x} under this model? You have to account for every code that could have produced it, weighting each by how likely that code was to begin with. That sum over codes is the marginal likelihood, also called the evidence:

pθ(x)=pθ(z)pθ(xz)dzp_{\boldsymbol{\theta}}(\mathbf{x}) = \int p_{\boldsymbol{\theta}}(\mathbf{z})\, p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})\, d\mathbf{z}

Read it as an average of the decoder over the whole prior. For any single pair (x,z)(\mathbf{x},\mathbf{z}) the thing inside the integral, the joint pθ(x,z)=pθ(z)pθ(xz)p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z}) = p_{\boldsymbol{\theta}}(\mathbf{z})\,p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}), is cheap to evaluate: run the decoder, multiply by the prior density. The integral is what kills you. The code z\mathbf{z} lives in a continuous, high-dimensional space, the decoder is a nonlinear network, and there is no antiderivative and no small grid of codes that covers it. So the evidence is intractable. You cannot evaluate it, and you certainly cannot differentiate it to do gradient ascent.

It gets worse in a useful way. By Bayes' rule, the posterior, the distribution over codes that could have produced a given x\mathbf{x}, is

pθ(zx)=pθ(xz)pθ(z)pθ(x)p_{\boldsymbol{\theta}}(\mathbf{z}\mid\mathbf{x}) = \frac{p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})\, p_{\boldsymbol{\theta}}(\mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{x})}

and the denominator is that same impossible integral. The numerator is the cheap joint; the normalizer is the wall. So the posterior is intractable too. This is why the classical tool for latent-variable models, the EM algorithm, stalls here: its E-step needs the exact posterior to fill in the hidden codes, and we just said we do not have it. (The usual workaround, mean-field variational inference, assumes a simple factorized posterior it can solve for in closed form, but that closed form needs a conjugate likelihood, which a neural network is not.) One integral, three dead ends.

Everything the paper does is a way around that one integral. The move is to stop demanding the exact posterior and instead learn a cheap stand-in for it, a second network, and to optimize a bound on the evidence that this stand-in makes computable. That stand-in has a name in the paper, the recognition model qϕ(zx)q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}), an encoder: feed it a datapoint and it returns a distribution over the codes that might have made it.

A bound you can actually optimize

We cannot touch logpθ(x)\log p_{\boldsymbol{\theta}}(\mathbf{x}). But there is an exact identity that splits it into two pieces, and one of those pieces is something we can compute. Introduce the encoder qϕ(zx)q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) and the log-evidence decomposes as

logpθ(x)=DKL ⁣(qϕ(zx)pθ(zx))+L(θ,ϕ;x)\log p_{\boldsymbol{\theta}}(\mathbf{x}) = D_{\mathrm{KL}}\!\big(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) \,\|\, p_{\boldsymbol{\theta}}(\mathbf{z}\mid\mathbf{x})\big) + \mathcal{L}(\boldsymbol{\theta},\boldsymbol{\phi};\mathbf{x})(1)

The first term is a KL divergence, a measure of how far one distribution is from another. It measures the gap between our encoder qϕq_{\boldsymbol{\phi}} and the true posterior pθ(zx)p_{\boldsymbol{\theta}}(\mathbf{z}\mid\mathbf{x}). A KL divergence is never negative, and it is zero only when the two distributions match exactly. That single fact is the whole lever. Because the left side logpθ(x)\log p_{\boldsymbol{\theta}}(\mathbf{x}) is a fixed number once θ\boldsymbol{\theta} is fixed, and the KL gap on the right cannot go below zero, the second term L\mathcal{L} can never exceed the evidence. It is a lower bound on logpθ(x)\log p_{\boldsymbol{\theta}}(\mathbf{x}), and the paper calls it the variational lower bound. Today it is usually called the evidence lower bound, or ELBO.

So we get a bound for free, and pushing the bound up is productive in two ways at once. Hold the decoder fixed and improve the encoder: the bound rises, and since the total is pinned, the only thing that can give is the KL gap, which shrinks. Better bound and a better posterior approximation, from the same gradient step. The bound touches the evidence exactly when the encoder equals the true posterior, the moment the gap hits zero. We never reach it with a limited encoder, but we get as close as the encoder can manage.

Drag the encoder toward the true posterior and watch the gap close. The full bar is the (fixed) evidence; the teal part is the bound you are lifting, the amber part is the slack you are squeezing out.

Figure 2 · the bound and the gap
28%
Top: the true posterior and your encoder q; raising the quality slides q onto it. Bottom: the full bar is the evidence logp(x)\log p(\mathbf{x}), split into the ELBO you maximize and the KL gap to the true posterior. Close the gap and the bound becomes tight. The gap is the price of using an approximate encoder.

What is this bound, concretely? Two equivalent ways to write it, and you will use both. The paper's first form keeps the joint together:

L(θ,ϕ;x)=Eqϕ(zx) ⁣[logqϕ(zx)+logpθ(x,z)]\mathcal{L}(\boldsymbol{\theta},\boldsymbol{\phi};\mathbf{x}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})}\!\big[\,{-\log q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})} + \log p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})\,\big](2)

If you have met Jensen's inequality, this is also where the bound comes from without any mention of a posterior: write logpθ(x)\log p_{\boldsymbol{\theta}}(\mathbf{x}) as logEq ⁣[pθ(x,z)/qϕ]\log \mathbb{E}_{q}\!\left[\,p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})/q_{\boldsymbol{\phi}}\,\right], and since log\log is concave you can pull it inside the average to get a smaller number, Eq ⁣[log(pθ(x,z)/qϕ)]\mathbb{E}_{q}\!\left[\log\big(p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})/q_{\boldsymbol{\phi}}\big)\right]. That is eq (2) again, and the size of Jensen's slack is exactly the KL gap from eq (1). Two doors into the same room.

The second form is the one that makes everything compute. Split the joint pθ(x,z)=pθ(xz)p(z)p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z}) = p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})\,p(\mathbf{z})and collect terms:

L(θ,ϕ;x)=Eqϕ(zx) ⁣[logpθ(xz)]reconstruction    DKL ⁣(qϕ(zx)p(z))regularizer\mathcal{L}(\boldsymbol{\theta},\boldsymbol{\phi};\mathbf{x}) = \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})}\!\big[\log p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})\big]}_{\text{reconstruction}} \; - \; \underbrace{D_{\mathrm{KL}}\!\big(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\big)}_{\text{regularizer}}(3)

A warning that has tripped up many readers: this KL is not the one from eq (1). Eq (1) measured distance to the true posterior pθ(zx)p_{\boldsymbol{\theta}}(\mathbf{z}\mid\mathbf{x}), the intractable thing. This one measures distance to the prior p(z)=N(0,I)p(\mathbf{z}) = \mathcal{N}(\mathbf{0},\mathbf{I}), which is fixed and known. Same symbol, different second argument, and only this one is computable. The first KL is the bound's invisible slack; this second KL is the term you actually evaluate.

One bookkeeping note you have to keep straight, because it is the single most common sign error in this whole subject. The paper maximizes L\mathcal{L} by gradient ascent, and writes the regularizer with a minus sign, as in eq (3). Modern code minimizes a loss, and the loss is just the negative ELBO, so the same KL flips to a positive penalty:

loss=L=DKL ⁣(qϕ(zx)p(z))Eqϕ ⁣[logpθ(xz)]\text{loss} = -\mathcal{L} = D_{\mathrm{KL}}\!\big(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\big) - \mathbb{E}_{q_{\boldsymbol{\phi}}}\!\big[\log p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})\big]

Maximize the bound or minimize its negative, it is the same training. But you have to know which one is on screen before you trust any sign. From here on, when a quantity has a sign, the text says whether we are maximizing the bound or minimizing the loss.

An autoencoder falls out

Read eq (3) as a machine rather than a formula. The encoder qϕq_{\boldsymbol{\phi}} takes a datapoint x\mathbf{x} and produces a distribution over codes. You draw a code z\mathbf{z} from it. The decoder pθp_{\boldsymbol{\theta}} takes that code and tries to rebuild x\mathbf{x}. The first term in eq (3) rewards a good rebuild, and the second keeps the codes tidy. That is an autoencoder, an encoder feeding a decoder through a bottleneck, with one twist: the bottleneck is a distribution you sample from, not a single point, and a regularizer keeps that distribution honest.

Figure 3 · the variational autoencoder
The whole machine. An input x is encoded to a Gaussian q(zx)q(\mathbf{z}\mid\mathbf{x}) (the dashed ellipse), a code z is sampled with fresh noise ε\varepsilon, and the decoder rebuilds it as . Two forces train it: the top bracket pulls x̂ toward x (reconstruction), the latent term pulls q toward the prior (the KL regularizer).

The two terms pull against each other, and that tension is the point. The reconstruction term wants codes that carry as much about x\mathbf{x} as possible, ideally a sharp, distinctive code per datapoint so the decoder can rebuild it perfectly. The KL term wants every code distribution to look like the same bland prior, which blurs the codes together and throws information away. A plain autoencoder, with only the reconstruction term, is free to scatter its codes anywhere it likes. It will, and the space between codes decodes to garbage, which is why you cannot sample from a plain autoencoder. The KL term is what makes the latent space filled in: every encoding is a little Gaussian parked near the origin, the encodings overlap and cover the prior, and so a fresh draw from N(0,I)\mathcal{N}(\mathbf{0},\mathbf{I}) lands somewhere the decoder actually knows. The regularizer is not a tax on the model. It is the thing that makes generation possible at all.

The encoder also solves the problem from the last section. We needed the posterior and could not compute it. Instead of solving for it per datapoint, we train one network that predicts an approximate posterior for any x\mathbf{x} in a single forward pass. That shortcut has a name, amortized inference: the cost of inference is paid once, in the encoder's weights ϕ\boldsymbol{\phi}, and then reused for every datapoint, instead of running a fresh optimization for each one the way classical variational inference does. The payoff lands at test time: encoding a new, unseen x\mathbf{x} is one forward pass through ϕ\boldsymbol{\phi}, not an optimization, which is what makes the encoder fast and reusable.

The KL, in closed form

One of the two terms is a gift. If the encoder is a diagonal Gaussian, the standard VAE choice, then for each datapoint it outputs a mean vector μ\boldsymbol{\mu} and a vector of standard deviations σ\boldsymbol{\sigma}:

logqϕ(zx)=logN ⁣(z;μ,σ2I)\log q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) = \log \mathcal{N}\!\big(\mathbf{z};\, \boldsymbol{\mu},\, \boldsymbol{\sigma}^{2}\mathbf{I}\big)(9)

Both the prior and this qq are Gaussian, so the KL between them has a closed form. You do not sample it, you do not approximate it, you just evaluate it. With the prior N(0,I)\mathcal{N}(\mathbf{0},\mathbf{I}), summing over the JJ latent dimensions, the negative KL (the form the paper adds to its maximized bound) is

DKL ⁣(qϕ(zx)p(z))=12j=1J(1+log(σj2)μj2σj2)-D_{\mathrm{KL}}\!\big(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\big) = \tfrac{1}{2}\sum_{j=1}^{J}\Big(1 + \log(\sigma_j^{2}) - \mu_j^{2} - \sigma_j^{2}\Big)App. B

Look at what it wants. This expression is at its largest, exactly zero, when every μj\mu_j is 00 and every σj\sigma_j is 11, which is to say when qq is the prior. Anywhere else it is negative, and the further qq drifts from the prior, the more negative it gets. A mean far from the origin is penalized through μj2-\mu_j^{2}; a variance that is too large or too small is penalized through log(σj2)σj2\log(\sigma_j^2) - \sigma_j^2. In a minimized loss every sign flips and you penalize the positive KL instead, the form you will meet in code:

DKL ⁣(qϕ(zx)p(z))=12j=1J(μj2+σj2log(σj2)1)D_{\mathrm{KL}}\!\big(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\big) = \tfrac{1}{2}\sum_{j=1}^{J}\big(\mu_j^{2} + \sigma_j^{2} - \log(\sigma_j^{2}) - 1\big)

It is a soft spring tethering each code distribution to the standard Gaussian, fought by the reconstruction term that wants the codes spread out and informative.

Move the encoder's output for one latent dimension and watch the spring. The teal bell is qq, the amber bell the fixed prior; the landscape below is the penalty, brightest where it vanishes.

Figure 4 · the KL regularizer
μ = -1.10
σ = 1.80
The regularizer for one latent coordinate. The encoder q = N(μ,σ2)\mathcal{N}(\mu,\sigma^2) is pulled toward the fixed prior N(0,1)\mathcal{N}(0,1). The heat map is DKL-D_{\mathrm{KL}} over the (μ,σ)(\mu,\sigma) plane, brightest at (0,1)(0,1) where it is zero and the bells coincide. Every other setting is penalized.

With the KL handled exactly, only one term in the bound still involves an expectation you cannot integrate: the reconstruction Eqϕ(zx) ⁣[logpθ(xz)]\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})}\!\big[\log p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})\big]. Drawing samples estimates its value well enough. The obstacle is its gradient, and that is where the real difficulty hides.

Backprop through a random sample

Write the reconstruction's integrand as f(z)=logpθ(xz)f(\mathbf{z}) = \log p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}). We need the gradient of Eqϕ(zx)[f(z)]\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})}[f(\mathbf{z})] with respect to the encoder's parameters ϕ\boldsymbol{\phi}. (The decoder θ\boldsymbol{\theta} is the easy half: it sits inside ff at a fixed code, so its gradient passes straight through the sample, the way any gradient does. The hard half is ϕ\boldsymbol{\phi}, which controls which codes get drawn at all.) And that is the subtlety: ϕ\boldsymbol{\phi} does not sit inside ff, it sits inside the distribution we are averaging over. Change ϕ\boldsymbol{\phi} and you change which codes z\mathbf{z} get sampled, not the value of ff at a fixed code. You cannot push the derivative through a sampling step the way you would through a fixed sum.

There is a textbook estimator that handles this, the score-function estimator (also called REINFORCE, or the likelihood-ratio estimator):

ϕEqϕ(z) ⁣[f(z)]=Eqϕ(z) ⁣[f(z)ϕlogqϕ(z)]\nabla_{\boldsymbol{\phi}}\, \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}\!\big[f(\mathbf{z})\big] = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}\!\big[\,f(\mathbf{z})\,\nabla_{\boldsymbol{\phi}} \log q_{\boldsymbol{\phi}}(\mathbf{z})\,\big]

It is correct. It is unbiased, and it even works for discrete codes. But it is famously noisy, and the paper dismisses it in one line for being too high-variance to use. The reason is in its shape. The cost enters multiplicatively: each estimate is f(z)f(\mathbf{z}) times a score, so its magnitude and its noise scale with how big ff is in absolute terms. Add a constant to ff and the true gradient does not move, yet the estimator gets noisier, because it never uses how ff changes with the code, only its raw height. It learns by correlation: codes that happened to score well should be a little more likely. With one sample per datapoint that correlation is mostly noise.

The fix is the paper's headline idea, the reparameterization trick. Instead of drawing z\mathbf{z} directly from qϕq_{\boldsymbol{\phi}}, write the same draw as a deterministic function of the encoder's outputs and a fixed, parameter-free noise source. For the Gaussian encoder that is just

z=μ+σε,εN(0,I)\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})(4)

where \odot is the elementwise product. The randomness now lives entirely in ε\boldsymbol{\varepsilon}, which does not depend on ϕ\boldsymbol{\phi} at all. The encoder's job is the deterministic part, the shift μ\boldsymbol{\mu} and the scale σ\boldsymbol{\sigma}. And because z\mathbf{z} is now a plain differentiable function of μ\boldsymbol{\mu} and σ\boldsymbol{\sigma}, the gradient of f(z)f(\mathbf{z}) flows straight back through it into the encoder, using ff's own slope at the drawn code. That is the difference that kills the variance: this estimate depends on how steeply ff leans near z\mathbf{z}, never on its raw height, so a smooth cost already points the right way from a single sample. Same distribution of z\mathbf{z}, completely different gradient.

Toggle the two wirings. The forward pass is identical; the backward pass is the whole story.

Figure 5 · moving the randomness out of the way
Sampling z directly puts a random node on the path, and you cannot push a gradient back through "draw a sample," so the encoder's μ,σ\mu,\sigma get nothing (the red break). Rewrite the draw as z=μ+σε\mathbf{z} = \mu + \sigma\,\varepsilon with the noise ε\varepsilon supplied from outside, and the gradient flows all the way back to μ,σ,ϕ\mu,\sigma,\phi. Same samples, trainable encoder.

The slogans usually skip two caveats that matter. Both estimators are unbiased for the same gradient; they agree in expectation. What the trick changes is variance, not correctness. And lower variance is a strong, reliable tendency, not a theorem: for a cost that varies very sharply with the code, the pathwise estimator can in principle be the noisier one. For the smooth reconstruction losses VAEs use, it wins comfortably, which is the whole reason a single sample per datapoint is enough to train.

How much does it actually help?

You can see the variance gap instead of taking it on faith. Take a one-dimensional toy: a code zN(μ,σ2)z \sim \mathcal{N}(\mu, \sigma^2) and a smooth cost f(z)=(za)2f(z) = (z-a)^2, a stand-in for the reconstruction ff from before, and estimate the one gradient μE[f]\partial_{\mu}\, \mathbb{E}[f], whose true value is 2(μa)2(\mu - a). Each estimator gets the same random draws. Drag the sample count up: both clouds of estimates converge on the true gradient (both are unbiased), but the score-function cloud stays splattered across a wide band while the reparameterized one sits in a tight knot around the answer.

Figure 6 · same target, different spread
N = 220
Single-sample gradient estimates for the two methods on a shared axis centered at the true gradient. The score-function estimates scatter wide (large standard deviation, some off the axis); the reparameterized estimates cluster tight. Both running means converge to the true value as you add samples, so both are unbiased. Only the spread differs, and the spread is what one sample fights.

That tightness is what makes the whole scheme practical. The paper's estimator uses a single code per datapoint, and it trains, because at a minibatch of a hundred the reparameterized gradients are already clean enough to point the right way.

The whole algorithm

Now assemble the pieces into the objective the paper actually optimizes. Take the bound in its reconstruction- minus-KL form (3), keep the KL analytic, and estimate only the reconstruction by sampling codes through the reparameterization. For a single datapoint with a Gaussian encoder this is the paper's eq (10):

L(θ,ϕ;x)12j=1J(1+log(σj2)μj2σj2)+1Ll=1Llogpθ ⁣(xz(l))\mathcal{L}(\boldsymbol{\theta},\boldsymbol{\phi};\mathbf{x}) \simeq \tfrac{1}{2}\sum_{j=1}^{J}\Big(1 + \log(\sigma_j^{2}) - \mu_j^{2} - \sigma_j^{2}\Big) + \frac{1}{L}\sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}\!\big(\mathbf{x}\mid\mathbf{z}^{(l)}\big)(10)

with z(l)=μ+σε(l)\mathbf{z}^{(l)} = \boldsymbol{\mu} + \boldsymbol{\sigma}\odot\boldsymbol{\varepsilon}^{(l)} and ε(l)N(0,I)\boldsymbol{\varepsilon}^{(l)} \sim \mathcal{N}(\mathbf{0},\mathbf{I}). This whole quantity is a single differentiable number. What you just built, the analytic KL plus a sampled reconstruction, is the paper's lower-variance estimator, SGVB B. (There is also a fully generic SGVB A that Monte-Carlo-estimates both terms, for a non-Gaussian prior or encoder where the KL has no closed form.) Differentiate it with respect to both θ\boldsymbol{\theta} and ϕ\boldsymbol{\phi} at once and take a gradient step. To scale to a large dataset you do this on a minibatch and rescale, eq (8):

L(θ,ϕ;X)    NMi=1ML(θ,ϕ;x(i))\mathcal{L}(\boldsymbol{\theta},\boldsymbol{\phi};\mathbf{X}) \;\simeq\; \frac{N}{M}\sum_{i=1}^{M} \mathcal{L}\big(\boldsymbol{\theta},\boldsymbol{\phi};\mathbf{x}^{(i)}\big)(8)

The paper finds L=1L = 1 sample per datapoint is enough as long as the minibatch is reasonably large, around M=100M = 100, because the minibatch already averages the per-sample noise. So in practice you draw exactly one ε\boldsymbol{\varepsilon} per datapoint. Written out, one training step is short:

# one AEVB step on a minibatch X of M datapoints; prior p(z) = N(0, I)
mu, logvar = encoder(X)                    # q(z|x) = N(mu, exp(logvar))
eps        = randn_like(mu)                # external noise, no gradient
z          = mu + exp(0.5 * logvar) * eps  # reparameterize  (eq 4)
x_hat      = decoder(z)                     # p(x|z)
recon = log_p_x_given_z(X, x_hat)          # +log-likelihood (eq 11 / 12)
kl    = -0.5 * sum(1 + logvar - mu**2 - exp(logvar))   # = +KL  (eq 10)
loss  = mean(kl - recon)                   # minimize -ELBO  (= -L)
loss.backward(); opt.step()                # grad -> encoder AND decoder

The listing hides two details. The encoder does not output σ\boldsymbol{\sigma} directly; it outputs logσ2\log \boldsymbol{\sigma}^2, the log-variance, and you recover σ=exp(12logσ2)\boldsymbol{\sigma} = \exp(\tfrac{1}{2}\log\boldsymbol{\sigma}^2). That keeps the variance positive for free and behaves under gradients. And the single loss.backward() sends gradients into the decoder θ\boldsymbol{\theta} through the reconstruction term and into the encoder ϕ\boldsymbol{\phi} through both terms, the KL directly and the reconstruction through the reparameterized code. One backward pass trains the entire model.

Make it concrete. Say x\mathbf{x} is a binarized MNIST digit, a 784-dimensional vector of zeros and ones. The encoder is an MLP with a 500-unit hidden layer that reads those 784 numbers and outputs two 20-dimensional vectors, μ\boldsymbol{\mu} and logσ2\log\boldsymbol{\sigma}^2, so the latent is J=20J = 20 dimensional. Draw one εR20\boldsymbol{\varepsilon} \in \mathbb{R}^{20} from N(0,I)\mathcal{N}(\mathbf{0},\mathbf{I}), form z=μ+σε\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma}\odot\boldsymbol{\varepsilon}, and a decoder MLP (also one 500-unit hidden layer) maps that 20-vector back to the reconstruction the pipeline figure called x^\hat{\mathbf{x}}, here a vector of 784 per-pixel Bernoulli means y^(0,1)784\hat{\mathbf{y}} \in (0,1)^{784}. The reconstruction term is the Bernoulli log-likelihood of the real pixels under those probabilities, eq (11):

logpθ(xz)=i=1784[xilogy^i+(1xi)log(1y^i)]\log p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}) = \sum_{i=1}^{784} \Big[\, x_i \log \hat{y}_i + (1-x_i)\log(1-\hat{y}_i) \,\Big](11)

and the KL term is the 20-term positive sum from above. The loss is that KL minus this reconstruction, a single scalar, and backprop updates both MLPs. (For continuous data like the Frey Face images the decoder instead outputs a mean and a variance and the reconstruction is a Gaussian log-density, eq (12); everything else is identical.) Once it is trained you can throw the encoder away and just generate:

# generation: no encoder, no data, just ancestral sampling
z     = randn(n, J)        # z ~ prior N(0, I)
x_hat = decoder(z)         # p(x|z): a brand-new sample, never seen in training

Sample a code from the prior, run the decoder, and you have a new datapoint the model invented. The encoder existed only to make training tractable. At generation time the latent space is the prior, and the decoder is the artist.

A latent space you can walk

The regularizer promised something the reconstruction alone could not: a latent space with no holes, where every point decodes to something sensible and nearby points decode to similar things. That is not just a sampling convenience, it is a learned, organized coordinate system for the data. The paper makes the point with a picture. Train a VAE with a two-dimensional latent so you can see all of it, then sweep a grid of codes across the plane and decode each one.

Move the cursor through the latent plane. The decoded digit morphs smoothly, with no seams and no gaps.

Figure 7 · the learned manifold
A grid of codes z\mathbf{z} across the 2-D latent plane, each decoded to a reconstruction. The shapes morph smoothly from cell to neighbouring cell, with no abrupt jumps and no dead spots: the KL regularizer organized the whole plane. The inset decodes the exact code under the cursor. The paper's Figure 4 is this picture on real MNIST.

The figure invites one careful detail. To make a grid that matches the prior, the paper (unlike this evenly-spaced demo grid) does not space the codes uniformly. It lays a regular grid on the unit square and passes it through the inverse Gaussian CDF, so the grid is dense where the prior puts its mass and sparse in the tails. That is the honest way to tile a Gaussian latent. The takeaway is the smoothness: because every datapoint was encoded to a Gaussian pinned near the origin, the codes pack together and overlap, and the decoder learns a continuous map off of them. Walking a straight line in latent space gives you a smooth interpolation in data space, which is the property that made VAEs a workhorse for representation learning.

What it actually showed

The paper is from 2013, and its results read best in their own terms, not through a modern lens. There is no FID, no Inception score, no CIFAR or ImageNet; those came years later. The datasets are MNIST and the Frey Face video frames. The metric is the variational lower bound itself, measured in nats, plus an estimate of the marginal likelihood for a small latent. The optimizer is Adagrad, not Adam, which Kingma would only publish the following year.

The comparison is against the two reasonable alternatives of the day, the wake-sleep algorithm and Monte Carlo EM with a Hybrid Monte Carlo sampler. On both datasets AEVB converges faster and reaches a better bound, and it keeps scaling where Monte Carlo EM, which needs an expensive sampling loop per datapoint, cannot follow onto the full dataset. The settings are modest by today's standards: minibatches of M=100M = 100, one latent sample each, 500 hidden units for MNIST and 200 for the smaller Frey Face set, weights initialized from N(0,0.01)\mathcal{N}(0, 0.01).

One finding surprised the authors. Adding more latent dimensions did not cause overfitting. A plain autoencoder with a fat bottleneck would happily memorize; this one did not, because the KL term regularizes the model for free, and superfluous latent dimensions just collapse back onto the prior and stop carrying information. The bound polices its own capacity.

What the trick opened up is larger than the experiments. The reparameterization estimator turned variational inference into something you could drop into any gradient-based pipeline, and the years after this paper are a long list of things built on it. There are honest limits, most of which became their own research programs. The encoder is only an approximation, so there is an amortization gap between it and the best per-datapoint posterior, which later work attacks with tighter bounds and richer encoders. The KL term can be too eager and shut a latent dimension off entirely, the failure called posterior collapse; turning its weight down with a single coefficient gives the β-VAE. The trick needs a continuous code, so discrete latents had to wait for the score-function estimator or a relaxation like Gumbel-Softmax. And the Gaussian decoder returns a blurry mean, which is part of why VAEs alone make soft samples.

None of that has dislodged the core idea. A latent-variable model has an intractable likelihood. Bound it, learn an encoder to approximate the posterior, and reparameterize the one sampling step so the whole thing is differentiable. That recipe is still how a great deal of modern generative modeling is trained, right down to the autoencoder that compresses images into the latent space that diffusion models like Stable Diffusion actually run in. The variational autoencoder was one of the first times you could take the clean probabilistic story, sample a code and decode it, and train it end to end with nothing fancier than backprop.

Provenance Verified against primary literature
Kingma & Welling (2013)The SGVB estimator, the reparameterization trick, and the AEVB algorithm.
Bishop, PRML ch. 9The exact decomposition log p(x) = ELBO + KL gap, and why EM needs the posterior.
Mohamed et al. (2020)Score-function vs pathwise gradients: both unbiased, very different variance.
Kingma & Welling (2019)The log-variance parameterization and the minimize-the-negative-ELBO convention.
correctionThe paper maximizes the ELBO and writes the prior KL with a minus sign. Modern code minimizes the negative ELBO, so the same KL is a positive penalty. We show both and label which sign is on screen.

Questions you might still have

?

If we never compute log p(x), how do we know training improves it?
We climb the ELBO, a lower bound on log p(x). Raising it either lifts log p(x) itself or closes the gap to the true posterior (the gap is exactly a KL). Either way the model and its encoder both get better.

?

Why add noise to the code at all? A plain autoencoder reconstructs better.
A plain autoencoder can scatter codes anywhere and leave gaps that decode to nothing. The KL term forces every encoding to be a Gaussian sitting near the prior, so the whole latent space decodes to something. That is what makes sampling and interpolation work.

?

Both gradient estimators are unbiased. So why does variance matter?
With one sample per datapoint (the paper uses L = 1), a high-variance gradient is mostly noise, and you would need far more samples or steps to average it away. The reparameterized estimator is tight enough that a single sample already trains.

?

Does the trick work for discrete latents, like words or tokens?
Not directly. z = mu + sigma·eps needs a continuous, differentiable path. Discrete latents fall back on the score-function estimator or a continuous relaxation (Gumbel-Softmax), which came years later.

Footnotes & further reading

  1. The paper: Kingma & Welling, Auto-Encoding Variational Bayes (Universiteit van Amsterdam, ICLR 2014).
  2. The ELBO / EM decomposition: Bishop, Pattern Recognition and Machine Learning, ch. 9, and Blei, Kucukelbir & McAuliffe, Variational Inference: A Review for Statisticians.
  3. Score-function vs pathwise gradients (both unbiased, different variance): Mohamed, Rosca, Figurnov & Mnih, Monte Carlo Gradient Estimation in Machine Learning.
  4. The same reparameterized estimator, developed independently and concurrently: Rezende, Mohamed & Wierstra, Stochastic Backpropagation and Approximate Inference in Deep Generative Models.
  5. The modern reference, with the log-variance convention and the loss framing: Kingma & Welling, An Introduction to Variational Autoencoders.
  6. What came next: tighter bounds (Burda et al., IWAE), discrete latents (Jang et al., Gumbel-Softmax), disentangling (Higgins et al., β-VAE), and the autoencoder under latent diffusion (Stable Diffusion).