Generative models · Variational inference

Auto-Encoding Variational Bayes

How to backpropagate through a random sample.

A latent-variable model can generate new data, but the likelihood you would train it on is an integral nobody can compute. The variational autoencoder gets around that with one trick: it rewrites a random sample so the gradient can flow straight through it, and you can train the generative model with plain backprop.

Explaining the paperAuto-Encoding Variational BayesKingma, Welling · Universiteit van Amsterdam · ICLR 2014 · arXiv:1312.6114 ↗

An autoencoder only copies its input. Rewriting the sampling step turns it into a model that invents data you have never seen.

A lot of generative modeling is organized around one wish: a machine that can produce new examples of something, whether handwritten digits, faces, or molecules. The cleanest way to set that up is a latent-variable model, in which you pick a hidden code $\mathbf{z}$ from a simple distribution, push it through a network, and read out a datapoint $\mathbf{x}$ . The network learns to turn easy random codes into hard, structured data, so a new sample is nothing more than a fresh code, drawn and decoded.

The trouble is training it. To fit the network you want to maximize the probability the model assigns to your real data. Writing that probability down is easy. Computing it is not, because it hides an integral over every possible hidden code, and that integral has no closed form and far too many codes to sum over. The exact same integral blocks the other thing you would reach for, which is figuring out which code produced a given datapoint. Both the headline likelihood and the bookkeeping you would use to get it are out of reach.

Auto-Encoding Variational Bayes, by Diederik Kingma and Max Welling, is the paper that made this work at scale. It does two things. It optimizes a tractable lower bound instead of the impossible likelihood, and it trains a small second network to guess the hidden code, so you never have to solve for it. When both networks are neural nets fit jointly, the result is the variational autoencoder. To see why it works you only need a short stack of ideas: the intractable integral, the lower bound that dodges it, the autoencoder that the bound turns into, and the rewrite that lets a random sampling step carry a gradient.

The data, and the wall

Start with the generative model itself. There is a hidden code $\mathbf{z}$ , drawn from a fixed, featureless prior, a standard Gaussian $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$ with no parameters to learn. Then a decoder network with weights $\boldsymbol{\theta}$ turns that code into a distribution over data, $p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})$ . Sampling is a two-step recipe: draw $\mathbf{z}$ , then draw $\mathbf{x}$ from the decoder.

In the figure below, simple codes on the left stream through the decoder and land on the data.

Figure 1 · the generative model

A code z drawn from a plain Gaussian prior is pushed through the decoder and lands on the thin data manifold. Generation is exactly this: sample a simple cloud, decode it onto the structured sheet where real data lives.

Now the question that trains it. How probable is one real datapoint $\mathbf{x}$ under this model? You have to account for every code that could have produced it, weighting each by how likely that code was to begin with. That sum over codes is the marginal likelihood, also called the evidence:

p_{\boldsymbol{\theta}}(\mathbf{x}) = \int p_{\boldsymbol{\theta}}(\mathbf{z})\, p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})\, d\mathbf{z}

This is an average of the decoder over the entire prior. For any single pair $(\mathbf{x},\mathbf{z})$ the thing inside the integral, the joint $p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z}) = p_{\boldsymbol{\theta}}(\mathbf{z})\,p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})$ , is cheap to evaluate: run the decoder, multiply by the prior density. The integral is the part with no closed form. The code $\mathbf{z}$ lives in a continuous, high-dimensional space, the decoder is a nonlinear network, and there is no antiderivative and no small grid of codes that covers it. So the evidence is intractable. You cannot evaluate it, and you certainly cannot differentiate it to do gradient ascent.

The same problem reaches further. By Bayes' rule, the posterior, the distribution over codes that could have produced a given $\mathbf{x}$ , is

p_{\boldsymbol{\theta}}(\mathbf{z}\mid\mathbf{x}) = \frac{p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})\, p_{\boldsymbol{\theta}}(\mathbf{z})}{p_{\boldsymbol{\theta}}(\mathbf{x})}

and the denominator is that same impossible integral. The numerator is the cheap joint; the normalizer is the intractable term. So the posterior is intractable too. This is why the classical tool for latent-variable models, the EM (Expectation-Maximization) algorithm, stalls here: its E-step needs the exact posterior to fill in the hidden codes, and we just said we do not have it. (The usual workaround, mean-field variational inference, assumes a simple factorized posterior it can solve for in closed form, but that closed form needs a conjugate likelihood, which a neural network is not.) One integral blocks all three: maximizing the likelihood directly, EM, and mean-field variational inference.

Everything the paper does is a way around that one integral. The paper stops demanding the exact posterior and instead learns a cheap stand-in for it, a second network, and optimizes a bound on the evidence that this stand-in makes computable. That stand-in has a name in the paper, the recognition model $q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})$ , an encoder: feed it a datapoint and it returns a distribution over the codes that might have made it.

A bound you can actually optimize

We cannot touch $\log p_{\boldsymbol{\theta}}(\mathbf{x})$ . But there is an exact identity that splits it into two pieces, and one of those pieces is something we can compute. Introduce the encoder $q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})$ and the log-evidence decomposes as

\log p_{\boldsymbol{\theta}}(\mathbf{x}) = D_{\mathrm{KL}}\!\big(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) \,\|\, p_{\boldsymbol{\theta}}(\mathbf{z}\mid\mathbf{x})\big) + \mathcal{L}(\boldsymbol{\theta},\boldsymbol{\phi};\mathbf{x})

(1)

The first term is a KL divergence, a measure of how far one distribution is from another. It measures the gap between our encoder $q_{\boldsymbol{\phi}}$ and the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z}\mid\mathbf{x})$ . A KL divergence is never negative, and it is zero only when the two distributions match exactly. Because the left side $\log p_{\boldsymbol{\theta}}(\mathbf{x})$ is a fixed number once $\boldsymbol{\theta}$ is fixed, and the KL gap on the right cannot go below zero, the second term $\mathcal{L}$ can never exceed the evidence. It is a lower bound on $\log p_{\boldsymbol{\theta}}(\mathbf{x})$ , and the paper calls it the variational lower bound. Today it is usually called the evidence lower bound, or ELBO.

So the bound is something we can hold and push on, and pushing it up is productive in two ways at once. Hold the decoder fixed and improve the encoder: the bound rises, and since the total is pinned, the only thing that can give is the KL gap, which shrinks. Better bound and a better posterior approximation, from the same gradient step. The bound touches the evidence exactly when the encoder equals the true posterior, the moment the gap hits zero. We never reach it with a limited encoder, but we get as close as the encoder can manage.

Drag the encoder toward the true posterior and watch the gap close. The full bar is the (fixed) evidence; the teal part is the bound you are lifting, the amber part is the slack you are squeezing out.

Figure 2 · the bound and the gap

match q28%

Top: the true posterior and your encoder q; raising the quality slides q onto it. Bottom: the full bar is the evidence

\log p(\mathbf{x})

, split into the ELBO you maximize and the KL gap to the true posterior. Close the gap and the bound becomes tight. The gap reflects the cost of using an approximate encoder.

What is this bound, concretely? Two equivalent ways to write it, and you will use both. The paper's first form keeps the joint together:

\mathcal{L}(\boldsymbol{\theta},\boldsymbol{\phi};\mathbf{x}) = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})}\!\big[\,{-\log q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})} + \log p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})\,\big]

(2)

If you have met Jensen's inequality, this is also where the bound comes from without any mention of a posterior: write $\log p_{\boldsymbol{\theta}}(\mathbf{x})$ as $\log \mathbb{E}_{q}\!\left[\,p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})/q_{\boldsymbol{\phi}}\,\right]$ , and since $\log$ is concave you can pull it inside the average to get a smaller number, $\mathbb{E}_{q}\!\left[\log\big(p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z})/q_{\boldsymbol{\phi}}\big)\right]$ . That is eq (2) again, and the size of Jensen's slack is exactly the KL gap from eq (1).

The second form is the one that makes everything compute. Split the joint $p_{\boldsymbol{\theta}}(\mathbf{x},\mathbf{z}) = p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})\,p(\mathbf{z})$ and collect terms:

\mathcal{L}(\boldsymbol{\theta},\boldsymbol{\phi};\mathbf{x}) = \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})}\!\big[\log p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})\big]}_{\text{reconstruction}} \; - \; \underbrace{D_{\mathrm{KL}}\!\big(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\big)}_{\text{regularizer}}

(3)

This KL is not the one from eq (1). Eq (1) measured distance to the true posterior $p_{\boldsymbol{\theta}}(\mathbf{z}\mid\mathbf{x})$ , the intractable thing. This one measures distance to the prior $p(\mathbf{z}) = \mathcal{N}(\mathbf{0},\mathbf{I})$ , which is fixed and known. Same symbol, different second argument, and only this one is computable. The first KL is the bound's invisible slack; this second KL is the term you actually evaluate.

Eq (3) also fixes which way the optimization runs. The paper maximizes $\mathcal{L}$ by gradient ascent, which is why the regularizer carries a minus sign. Modern code minimizes a loss, and the loss is the negative ELBO, so the same KL flips to a positive penalty:

\text{loss} = -\mathcal{L} = D_{\mathrm{KL}}\!\big(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\big) - \mathbb{E}_{q_{\boldsymbol{\phi}}}\!\big[\log p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})\big]

Maximize the bound or minimize its negative, it is the same training, but mixing the two conventions is the single most common sign error in this whole subject, so you have to know which one is on screen before you trust any sign. From here on, when a quantity has a sign, the text says whether we are maximizing the bound or minimizing the loss.

The bound is already an autoencoder

Eq (3) describes a machine, not just a formula. The encoder $q_{\boldsymbol{\phi}}$ takes a datapoint $\mathbf{x}$ and produces a distribution over codes. You draw a code $\mathbf{z}$ from it. The decoder $p_{\boldsymbol{\theta}}$ takes that code and tries to rebuild $\mathbf{x}$ . The first term in eq (3) rewards a good rebuild, and the second constrains the codes toward the prior. That is an autoencoder, an encoder feeding a decoder through a bottleneck, with one twist: the bottleneck is a distribution you sample from, not a single point, and a regularizer constrains that distribution toward the prior.

Figure 3 · the variational autoencoder

The whole machine. An input x is encoded to a Gaussian

q(\mathbf{z}\mid\mathbf{x})

(the dashed ellipse), a code z is sampled with fresh noise

\varepsilon

, and the decoder rebuilds it as x̂. Two forces train it: the top bracket pulls x̂ toward x (reconstruction), the latent term pulls q toward the prior (the KL regularizer).

The two terms pull against each other, and that tension is what the model is balancing. The reconstruction term favors codes that carry as much about $\mathbf{x}$ as possible, ideally a sharp, distinctive code per datapoint so the decoder can rebuild it perfectly. The KL term pushes every code distribution toward the same bland prior, which blurs the codes together and throws information away. A plain autoencoder, trained on the reconstruction term alone, is under no pressure to place its codes anywhere in particular. In practice the codes scatter, the space between them decodes to garbage, which is why you cannot sample from a plain autoencoder. The KL term fills in the latent spacecompletely: every encoding is a little Gaussian parked near the origin, the encodings overlap and cover the prior, and so a fresh draw from $\mathcal{N}(\mathbf{0},\mathbf{I})$ lands in a region the decoder was actually trained to decode. Without the regularizer, generation would not work.

The encoder also solves the problem from the last section. We needed the posterior and could not compute it. Instead of solving for it per datapoint, we train one network that predicts an approximate posterior for any $\mathbf{x}$ in a single forward pass. That shortcut has a name, amortized inference. Classical variational inference runs a fresh inner optimization loop for every single datapoint, from scratch, to fit that point's posterior. The encoder replaces all of those loops: the optimization cost did not vanish, it was paid once, up front, into the encoder's weights $\boldsymbol{\phi}$ during training, which is exactly what "amortized" means here. The payoff lands at test time, where encoding a new, unseen $\mathbf{x}$ is one forward pass through $\boldsymbol{\phi}$ , not an optimization, which is what makes the encoder fast and reusable.

Laying the trade out as a cost ledger shows it is a deliberate bargain, not a cost-free choice. Classical variational inference, the thing the encoder replaces, treats each datapoint as its own optimization problem: hold the datapoint fixed, and tune that point's own $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ by gradient steps until its ELBO is as high as it goes. Do that for a thousand datapoints and you have run a thousand little optimizations, and a thousand-and-first datapoint at test time means a thousand-and-first optimization, from scratch. The encoder folds all of that into one cost paid during training: instead of solving per point, you fit a single network that maps any $\mathbf{x}$ straight to its $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ . You pay more up front, the encoder has its own weights to train, and you accept a real concession in return. One network, sharing its weights across every datapoint, cannot in general hit the exact per-point optimum that a fresh optimization would find for each one; the small shortfall between the encoder's output and that ideal per-point posterior is the amortization gap. The deal is plain: a fixed training cost and a sliver of lost tightness, traded for inference on any datapoint, seen or unseen, that is a single forward pass rather than an optimization loop. For a model you intend to run on far more data than you trained it on, that trade pays off many times over.

Watch the two totals cross. The slider sweeps the number of datapoints you encode, D, on a log axis; the classical line is a fresh optimization per point, the amortized line is the encoder's up-front training paid once, then one forward pass per point. At small D the encoder has not earned its keep yet and classical is cheaper; past the crossover the nearly flat amortized line wins, and the margin only grows.

Figure 4 · amortized inference: pay once, reuse

datapoints D50 · amortized wins

Total inference cost vs the number of datapoints D (log axis). Classical VI =

D \times

a per-point optimization loop, steep through the origin. Amortized = a fixed up-front training cost plus

D \times

one forward pass, high intercept but nearly flat, so it crosses below at

D^{*}\approx 20

and the gap widens. The side panel is the amortization gap: one shared encoder misses each point's exact optimum, a small fixed quality cost that does not grow with D. Cost units are relative and illustrative; the point is the crossover.

The KL, in closed form

One of the two terms comes out in closed form. If the encoder is a diagonal Gaussian, the standard VAE choice, then for each datapoint it outputs a mean vector $\boldsymbol{\mu}$ and a vector of standard deviations $\boldsymbol{\sigma}$ :

\log q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) = \log \mathcal{N}\!\big(\mathbf{z};\, \boldsymbol{\mu},\, \boldsymbol{\sigma}^{2}\mathbf{I}\big)

(9)

Both the prior and this $q$ are Gaussian, so the KL between them has a closed form. You do not sample it, you do not approximate it, you just evaluate it. With the prior $\mathcal{N}(\mathbf{0},\mathbf{I})$ , summing over the $J$ latent dimensions, the negative KL (the form the paper adds to its maximized bound) is

-D_{\mathrm{KL}}\!\big(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\big) = \tfrac{1}{2}\sum_{j=1}^{J}\Big(1 + \log(\sigma_j^{2}) - \mu_j^{2} - \sigma_j^{2}\Big)

App. B

Look at what it favors. This expression is at its largest, exactly zero, when every $\mu_j$ is $0$ and every $\sigma_j$ is $1$ , which is to say when $q$ is the prior. Anywhere else it is negative, and the further $q$ drifts from the prior, the more negative it gets. A mean far from the origin is penalized through $-\mu_j^{2}$ ; a variance that is too large or too small is penalized through $\log(\sigma_j^2) - \sigma_j^2$ . In a minimized loss every sign flips and you penalize the positive KL instead, the form you will meet in code:

D_{\mathrm{KL}}\!\big(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\big) = \tfrac{1}{2}\sum_{j=1}^{J}\big(\mu_j^{2} + \sigma_j^{2} - \log(\sigma_j^{2}) - 1\big)

It penalizes each code distribution in proportion to its distance from the standard Gaussian, opposed by the reconstruction term, which favors codes that are spread out and informative.

Move the encoder's output for one latent dimension and watch the spring. The teal bell is $q$ , the amber bell the fixed prior; the landscape below is the penalty, brightest where it vanishes.

Figure 5 · the KL regularizer

mean μμ = -1.10

s.d. σσ = 1.80

The regularizer for one latent coordinate. The encoder q =

\mathcal{N}(\mu,\sigma^2)

is pulled toward the fixed prior

\mathcal{N}(0,1)

. The heat map is

-D_{\mathrm{KL}}

over the

(\mu,\sigma)

plane, brightest at

(0,1)

where it is zero and the bells coincide. Every other setting is penalized.

With the KL handled exactly, only one term in the bound still involves an expectation you cannot integrate: the reconstruction $\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})}\!\big[\log p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})\big]$ . Drawing samples estimates its value well enough. The obstacle is its gradient, which is the hard part.

Backprop through a random sample

Write the reconstruction's integrand as $f(\mathbf{z}) = \log p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})$ . We need the gradient of $\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})}[f(\mathbf{z})]$ with respect to the encoder's parameters $\boldsymbol{\phi}$ . (The decoder $\boldsymbol{\theta}$ is the easy half: it sits inside $f$ at a fixed code, so its gradient passes straight through the sample, the way any gradient does. The hard half is $\boldsymbol{\phi}$ , which controls which codes get drawn at all.) The catch is that $\boldsymbol{\phi}$ does not sit inside $f$ , it sits inside the distribution we are averaging over. Change $\boldsymbol{\phi}$ and you change which codes $\mathbf{z}$ get sampled, not the value of $f$ at a fixed code. You cannot push the derivative through a sampling step the way you would through a fixed sum.

There is a textbook estimator that handles this, the score-function estimator (also called REINFORCE, or the likelihood-ratio estimator):

\nabla_{\boldsymbol{\phi}}\, \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}\!\big[f(\mathbf{z})\big] = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z})}\!\big[\,f(\mathbf{z})\,\nabla_{\boldsymbol{\phi}} \log q_{\boldsymbol{\phi}}(\mathbf{z})\,\big]

It is correct. It is unbiased, and it even works for discrete codes. But it is famously noisy, and the paper dismisses it in one line for being too high-variance to use. The reason is in the estimator's form. The cost enters multiplicatively: each estimate is $f(\mathbf{z})$ times a score, so its magnitude and its noise scale with how big $f$ is in absolute terms. Add a constant to $f$ and the true gradient does not move, yet the estimator gets noisier, because it never uses how $f$ changes with the code, only its raw height. It learns by correlation: codes that happened to score well should be a little more likely. With one sample per datapoint that correlation is mostly noise.

The paper's headline idea, the reparameterization trick, sidesteps the variance. Instead of drawing $\mathbf{z}$ directly from $q_{\boldsymbol{\phi}}$ , write the same draw as a deterministic function of the encoder's outputs and a fixed, parameter-free noise source. For the Gaussian encoder it is

\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}, \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})

(4)

where $\odot$ is the elementwise product. The randomness now lives entirely in $\boldsymbol{\varepsilon}$ , which does not depend on $\boldsymbol{\phi}$ at all; the encoder's job is the deterministic part, the shift $\boldsymbol{\mu}$ and the scale $\boldsymbol{\sigma}$ . Before the rewrite, the backward pass hit the sampling step and stopped, because there is no derivative of "draw a random number" with respect to the parameters that shaped its distribution. After the reparameterization, $\mathbf{z}$ is an ordinary differentiable formula of $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ , backprop treats it like any other layer, and the gradient of $f(\mathbf{z})$ flows straight back into the encoder, using $f$ 's own slope at the drawn code. That is why the variance drops: this estimate depends on how steeply $f$ leans near $\mathbf{z}$ , never on its raw height, so a smooth cost already points the right way from a single sample.

Toggle the two wirings. The forward pass is identical; only the backward pass differs.

Figure 6 · moving the randomness out of the way

mode

Sampling z directly puts a random node on the path, and you cannot push a gradient back through "draw a sample," so the encoder's

\mu,\sigma

get nothing (the red break). Rewrite the draw as

\mathbf{z} = \mu + \sigma\,\varepsilon

with the noise

\varepsilon

supplied from outside, and the gradient flows all the way back to

\mu,\sigma,\phi

Both estimators are unbiased for the same gradient, they agree in expectation, and what the trick changes is variance, not correctness. Nor is the lower variance a theorem, only a strong, reliable tendency: for a cost that varies very sharply with the code, the pathwise estimator can in principle be the noisier one. For the smooth reconstruction losses VAEs use, it wins comfortably, which is why a single sample per datapoint is enough to train.

How much does it actually help?

You can see the variance gap instead of taking it on faith. Take a one-dimensional toy: a code $z \sim \mathcal{N}(\mu, \sigma^2)$ and a smooth cost $f(z) = (z-a)^2$ , a stand-in for the reconstruction $f$ from before, and estimate the one gradient $\partial_{\mu}\, \mathbb{E}[f]$ , whose true value is $2(\mu - a)$ . Each estimator gets the same random draws. Drag the sample count up: both clouds of estimates converge on the true gradient (both are unbiased), but the score-function cloud stays splattered across a wide band while the reparameterized one sits in a tight knot around the answer.

Figure 7 · same target, different spread

samplesN = 220

Single-sample gradient estimates for the two methods on a shared axis centered at the true gradient. The score-function estimates scatter wide (large standard deviation, some off the axis); the reparameterized estimates cluster closely. Both running means converge to the true value as you add samples, so both are unbiased. Only the spread differs, and with one sample, the spread is the limiting factor.

That low variance lets the paper's estimator use a single code per datapoint and still train: at a minibatch of a hundred, the reparameterized gradients are already clean enough to point the right way.

The whole algorithm

Now assemble the pieces into the objective the paper actually optimizes. Take the bound in its reconstruction- minus-KL form (3), keep the KL analytic, and estimate only the reconstruction by sampling codes through the reparameterization. For a single datapoint with a Gaussian encoder this is the paper's eq (10):

\mathcal{L}(\boldsymbol{\theta},\boldsymbol{\phi};\mathbf{x}) \simeq \tfrac{1}{2}\sum_{j=1}^{J}\Big(1 + \log(\sigma_j^{2}) - \mu_j^{2} - \sigma_j^{2}\Big) + \frac{1}{L}\sum_{l=1}^{L} \log p_{\boldsymbol{\theta}}\!\big(\mathbf{x}\mid\mathbf{z}^{(l)}\big)

(10)

with $\mathbf{z}^{(l)} = \boldsymbol{\mu} + \boldsymbol{\sigma}\odot\boldsymbol{\varepsilon}^{(l)}$ and $\boldsymbol{\varepsilon}^{(l)} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$ . This whole quantity is a single differentiable number. What you just built, the analytic KL plus a sampled reconstruction, is the paper's lower-variance estimator, the SGVB (Stochastic Gradient Variational Bayes) estimator, SGVB ^B. (There is also a fully generic SGVB ^A that Monte-Carlo-estimates both terms, for a non-Gaussian prior or encoder where the KL has no closed form.) Differentiate it with respect to both $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ at once and take a gradient step. To scale to a large dataset you do this on a minibatch and rescale, eq (8):

\mathcal{L}(\boldsymbol{\theta},\boldsymbol{\phi};\mathbf{X}) \;\simeq\; \frac{N}{M}\sum_{i=1}^{M} \mathcal{L}\big(\boldsymbol{\theta},\boldsymbol{\phi};\mathbf{x}^{(i)}\big)

(8)

The paper finds $L = 1$ sample per datapoint is enough as long as the minibatch is reasonably large, around $M = 100$ , because the minibatch already averages the per-sample noise. So in practice you draw exactly one $\boldsymbol{\varepsilon}$ per datapoint. Written out, one training step is short:

# one AEVB step on a minibatch X of M datapoints; prior p(z) = N(0, I)
mu, logvar = encoder(X)                    # q(z|x) = N(mu, exp(logvar))
eps        = randn_like(mu)                # external noise, no gradient
z          = mu + exp(0.5 * logvar) * eps  # reparameterize  (eq 4)
x_hat      = decoder(z)                     # p(x|z)
recon = log_p_x_given_z(X, x_hat)          # +log-likelihood (eq 11 / 12)
kl    = -0.5 * sum(1 + logvar - mu**2 - exp(logvar))   # = +KL  (eq 10)
loss  = mean(kl - recon)                   # minimize -ELBO  (= -L)
loss.backward(); opt.step()                # grad -> encoder AND decoder

The listing hides two details. The encoder does not output $\boldsymbol{\sigma}$ directly; it outputs $\log \boldsymbol{\sigma}^2$ , the log-variance, and you recover $\boldsymbol{\sigma} = \exp(\tfrac{1}{2}\log\boldsymbol{\sigma}^2)$ . The exponential keeps the variance positive no matter what the network outputs, and is well-behaved under gradients. And the single loss.backward() sends gradients into the decoder $\boldsymbol{\theta}$ through the reconstruction term and into the encoder $\boldsymbol{\phi}$ through both terms, the KL directly and the reconstruction through the reparameterized code.

Here is a concrete instance. Say $\mathbf{x}$ is a binarized MNIST digit, a 784-dimensional vector of zeros and ones. The encoder is an MLP (a plain multi-layer perceptron, i.e. a fully-connected feedforward net) with a 500-unit hidden layer that reads those 784 numbers and outputs two 20-dimensional vectors, $\boldsymbol{\mu}$ and $\log\boldsymbol{\sigma}^2$ , so the latent is $J = 20$ dimensional. Draw one $\boldsymbol{\varepsilon} \in \mathbb{R}^{20}$ from $\mathcal{N}(\mathbf{0},\mathbf{I})$ , form $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma}\odot\boldsymbol{\varepsilon}$ , and a decoder MLP (also one 500-unit hidden layer) maps that 20-vector back to the reconstruction the pipeline figure called $\hat{\mathbf{x}}$ , here a vector of 784 per-pixel Bernoulli means $\hat{\mathbf{y}} \in (0,1)^{784}$ . The reconstruction term is the Bernoulli log-likelihood of the real pixels under those probabilities, eq (11):

\log p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z}) = \sum_{i=1}^{784} \Big[\, x_i \log \hat{y}_i + (1-x_i)\log(1-\hat{y}_i) \,\Big]

(11)

and the KL term is the 20-term positive sum from above. The loss is that KL minus this reconstruction, a single scalar, and backprop updates both MLPs. (For continuous data like the Frey Face images the decoder instead outputs a mean and a variance and the reconstruction is a Gaussian log-density, eq (12); everything else is identical.) Once it is trained you can throw the encoder away and just generate:

# generation: no encoder, no data, just ancestral sampling
z     = randn(n, J)        # z ~ prior N(0, I)
x_hat = decoder(z)         # p(x|z): a brand-new sample, never seen in training

Sample a code from the prior, run the decoder, and you have a new datapoint the model invented. The encoder existed only to make training tractable. At generation time the codes come straight from the prior, and the decoder alone turns them into data.

A latent space you can walk

The regularizer promised something the reconstruction alone could not: a latent space with no holes, where every point decodes to something sensible and nearby points decode to similar things. That is not just a sampling convenience, it is a learned, organized coordinate system for the data. The paper makes the point with a picture. Train a VAE with a two-dimensional latent so you can see all of it, then sweep a grid of codes across the plane and decode each one.

Move the cursor through the latent plane. The decoded digit morphs smoothly, with no seams and no gaps.

Figure 8 · the learned manifold

A grid of codes

\mathbf{z}

across the 2-D latent plane, each decoded to a reconstruction. The shapes morph smoothly from cell to neighbouring cell, with no abrupt jumps and no dead spots: the KL regularizer organized the entire plane. The inset decodes the exact code under the cursor. The paper's Figure 4 is this picture on real MNIST.

The figure invites one careful detail. To make a grid that matches the prior, the paper (unlike this evenly-spaced demo grid) does not space the codes uniformly. It lays a regular grid on the unit square and passes it through the inverse Gaussian CDF, so the grid is dense where the prior puts its mass and sparse in the tails. That is the correct way to tile a Gaussian latent. The takeaway is the smoothness: because every datapoint was encoded to a Gaussian pinned near the origin, the codes pack together and overlap, and the decoder learns a continuous map off of them. Walking a straight line in latent space gives you a smooth interpolation in data space, which is the property that made VAEs a workhorse for representation learning.

MNIST, Frey faces, and a lower bound

The paper is from 2013, and its results read best in their own terms, not through a modern lens. There is no FID, no Inception score, no CIFAR or ImageNet; those came years later. The datasets are MNIST and the Frey Face video frames. The metric is the variational lower bound itself, measured in nats, plus an estimate of the marginal likelihood for a small latent. The optimizer is Adagrad (an early adaptive-learning-rate method), not Adam, the now-standard optimizer Kingma would only publish the following year.

The comparison is against the two reasonable alternatives of the day, the wake-sleep algorithm (an earlier scheme that trains the encoder and decoder in alternating passes) and Monte Carlo EM, which replaces the intractable E-step posterior with samples drawn by a Hybrid Monte Carlo (HMC) sampler. On both datasets AEVB (the paper's training algorithm) converges faster and reaches a better bound, and it keeps scaling where Monte Carlo EM, which needs an expensive sampling loop per datapoint, cannot follow onto the full dataset. The settings are modest by today's standards: minibatches of $M = 100$ , one latent sample each, 500 hidden units for MNIST and 200 for the smaller Frey Face set, weights initialized from $\mathcal{N}(0, 0.01)$ .

The authors were surprised that adding more latent dimensions did not cause overfitting. A plain autoencoder with a fat bottleneck would happily memorize; this one did not, because the KL term regularizes the model at no extra cost, and superfluous latent dimensions just collapse back onto the prior and stop carrying information.

The trick's impact went well beyond these experiments. The reparameterization estimator turned variational inference into something you could drop into any gradient-based pipeline, and the years after this paper are a long list of things built on it. There are real limits, most of which became their own research programs. The encoder is only an approximation, so there is an amortization gap between it and the best per-datapoint posterior, which later work attacks with tighter bounds and richer encoders. A KL term weighted too strongly can drive a latent dimension fully onto the prior so it carries no information, the failure called posterior collapse; turning its weight down with a single coefficient gives the β-VAE. The trick needs a continuous code, so discrete latents had to wait for the score-function estimator or a relaxation like Gumbel-Softmax. And the Gaussian decoder returns a blurry mean, which is part of why VAEs alone make soft samples.

None of that has dislodged the core idea. A latent-variable model has a likelihood you cannot compute, so you settle for a bound, hand the posterior to a learned encoder, and rewrite the single sampling step until a gradient can pass through it. That recipe still trains a great deal of modern generative modeling, right down to the autoencoder that compresses images into the latent space that diffusion models like Stable Diffusion actually run in (see the latent diffusion explainer). Before this paper, the clean probabilistic story (sample a code, decode it) had no clean way to train; the variational autoencoder gave it one, end to end, with nothing fancier than backprop.

Provenance Verified against primary literature

Kingma & Welling (2013)The SGVB estimator, the reparameterization trick, and the AEVB algorithm.

Bishop, PRML ch. 9The exact decomposition log p(x) = ELBO + KL gap, and why EM needs the posterior.

Mohamed et al. (2020)Score-function vs pathwise gradients: both unbiased, very different variance.

Kingma & Welling (2019)The log-variance parameterization and the minimize-the-negative-ELBO convention.

caveatThe paper maximizes the ELBO and writes the prior KL with a minus sign. Modern code minimizes the negative ELBO, so the same KL is a positive penalty. We show both and label which sign is on screen.

Questions you might still have

If we never compute log p(x), how do we know training improves it?
We climb the ELBO, a lower bound on log p(x). Raising it either lifts log p(x) itself or closes the gap to the true posterior (the gap is exactly a KL). Either way the model and its encoder both get better.

Why add noise to the code at all? A plain autoencoder reconstructs better.
A plain autoencoder can scatter codes anywhere and leave gaps that decode to nothing. The KL term forces every encoding to be a Gaussian sitting near the prior, so every region of the latent space decodes to something. That is what makes sampling and interpolation work.

Both gradient estimators are unbiased. So why does variance matter?
With one sample per datapoint (the paper uses L = 1), a high-variance gradient is mostly noise, and you would need far more samples or steps to average it away. The reparameterized estimator is precise enough that a single sample already trains.

Does the trick work for discrete latents, like words or tokens?
Not directly. z = mu + sigma·eps needs a continuous, differentiable path. Discrete latents fall back on the score-function estimator or a continuous relaxation (Gumbel-Softmax), which came years later.

Footnotes & further reading

The paper: Kingma & Welling, Auto-Encoding Variational Bayes (Universiteit van Amsterdam, ICLR 2014).
The ELBO / EM decomposition: Bishop, Pattern Recognition and Machine Learning, ch. 9, and Blei, Kucukelbir & McAuliffe, Variational Inference: A Review for Statisticians.
Score-function vs pathwise gradients (both unbiased, different variance): Mohamed, Rosca, Figurnov & Mnih, Monte Carlo Gradient Estimation in Machine Learning.
The same reparameterized estimator, developed independently and concurrently: Rezende, Mohamed & Wierstra, Stochastic Backpropagation and Approximate Inference in Deep Generative Models.
The modern reference, with the log-variance convention and the loss framing: Kingma & Welling, An Introduction to Variational Autoencoders.
What came next: tighter bounds (Burda et al., IWAE), discrete latents (Jang et al., Gumbel-Softmax), disentangling (Higgins et al., β-VAE), and the autoencoder under latent diffusion (Stable Diffusion).

Got feedback?

How could this explainer be improved? Found an error, or something unclear? I read every message.