Auto-Encoding Variational Bayes
How to backpropagate through a random sample.
A latent-variable model can dream up new data, but the likelihood you would train it on is an integral nobody can compute. The variational autoencoder gets around that with one trick: it rewrites a random sample so the gradient can flow straight through it, and a generative model you can train with plain backprop falls out.
Explaining the paperAuto-Encoding Variational BayesWhat does it take to turn an autoencoder, a thing that just copies its input, into a model that invents new data you have never seen?
Here is the dream that organizes a lot of generative modeling. You would like a machine that can produce new examples of something. New handwritten digits, new faces, new molecules. The cleanest way to set that up is a latent-variable model. Pick a hidden code from a simple distribution, push it through a network, and read out a datapoint . The network learns to turn easy random codes into hard, structured data. To make a new sample you draw a fresh code and decode it. That is the whole generative story, and it is simple.
The trouble is training it. To fit the network you want to maximize the probability the model assigns to your real data. Writing that probability down is easy. Computing it is not, because it hides an integral over every possible hidden code, and that integral has no closed form and far too many codes to sum over. The exact same integral blocks the other thing you would reach for, which is figuring out which code produced a given datapoint. Both the headline likelihood and the bookkeeping you would use to get it are out of reach.
Auto-Encoding Variational Bayes, by Diederik Kingma and Max Welling, is the paper that made this work at scale. It does two things. It optimizes a tractable lower bound instead of the impossible likelihood, and it trains a small second network to guess the hidden code, so you never have to solve for it. When both networks are neural nets fit jointly, the result is the variational autoencoder. To see why it works you only need a short stack of ideas: the intractable integral, the lower bound that dodges it, the autoencoder that the bound turns into, and the rewrite that lets a random sampling step carry a gradient.
The data, and the wall
Start with the generative model itself. There is a hidden code , drawn from a fixed, featureless prior, a standard Gaussian with no parameters to learn. Then a decoder network with weights turns that code into a distribution over data, . Sampling is a two-step recipe: draw , then draw from the decoder. A simple cloud goes in, structured data comes out.
Drag nothing, just watch: simple codes on the left stream through the decoder and land on the data.
Now the question that trains it. How probable is one real datapoint under this model? You have to account for every code that could have produced it, weighting each by how likely that code was to begin with. That sum over codes is the marginal likelihood, also called the evidence:
Read it as an average of the decoder over the whole prior. For any single pair the thing inside the integral, the joint , is cheap to evaluate: run the decoder, multiply by the prior density. The integral is what kills you. The code lives in a continuous, high-dimensional space, the decoder is a nonlinear network, and there is no antiderivative and no small grid of codes that covers it. So the evidence is intractable. You cannot evaluate it, and you certainly cannot differentiate it to do gradient ascent.
It gets worse in a useful way. By Bayes' rule, the posterior, the distribution over codes that could have produced a given , is
and the denominator is that same impossible integral. The numerator is the cheap joint; the normalizer is the wall. So the posterior is intractable too. This is why the classical tool for latent-variable models, the EM algorithm, stalls here: its E-step needs the exact posterior to fill in the hidden codes, and we just said we do not have it. (The usual workaround, mean-field variational inference, assumes a simple factorized posterior it can solve for in closed form, but that closed form needs a conjugate likelihood, which a neural network is not.) One integral, three dead ends.
Everything the paper does is a way around that one integral. The move is to stop demanding the exact posterior and instead learn a cheap stand-in for it, a second network, and to optimize a bound on the evidence that this stand-in makes computable. That stand-in has a name in the paper, the recognition model , an encoder: feed it a datapoint and it returns a distribution over the codes that might have made it.
A bound you can actually optimize
We cannot touch . But there is an exact identity that splits it into two pieces, and one of those pieces is something we can compute. Introduce the encoder and the log-evidence decomposes as
The first term is a KL divergence, a measure of how far one distribution is from another. It measures the gap between our encoder and the true posterior . A KL divergence is never negative, and it is zero only when the two distributions match exactly. That single fact is the whole lever. Because the left side is a fixed number once is fixed, and the KL gap on the right cannot go below zero, the second term can never exceed the evidence. It is a lower bound on , and the paper calls it the variational lower bound. Today it is usually called the evidence lower bound, or ELBO.
So we get a bound for free, and pushing the bound up is productive in two ways at once. Hold the decoder fixed and improve the encoder: the bound rises, and since the total is pinned, the only thing that can give is the KL gap, which shrinks. Better bound and a better posterior approximation, from the same gradient step. The bound touches the evidence exactly when the encoder equals the true posterior, the moment the gap hits zero. We never reach it with a limited encoder, but we get as close as the encoder can manage.
Drag the encoder toward the true posterior and watch the gap close. The full bar is the (fixed) evidence; the teal part is the bound you are lifting, the amber part is the slack you are squeezing out.
What is this bound, concretely? Two equivalent ways to write it, and you will use both. The paper's first form keeps the joint together:
If you have met Jensen's inequality, this is also where the bound comes from without any mention of a posterior: write as , and since is concave you can pull it inside the average to get a smaller number, . That is eq (2) again, and the size of Jensen's slack is exactly the KL gap from eq (1). Two doors into the same room.
The second form is the one that makes everything compute. Split the joint and collect terms:
A warning that has tripped up many readers: this KL is not the one from eq (1). Eq (1) measured distance to the true posterior , the intractable thing. This one measures distance to the prior , which is fixed and known. Same symbol, different second argument, and only this one is computable. The first KL is the bound's invisible slack; this second KL is the term you actually evaluate.
One bookkeeping note you have to keep straight, because it is the single most common sign error in this whole subject. The paper maximizes by gradient ascent, and writes the regularizer with a minus sign, as in eq (3). Modern code minimizes a loss, and the loss is just the negative ELBO, so the same KL flips to a positive penalty:
Maximize the bound or minimize its negative, it is the same training. But you have to know which one is on screen before you trust any sign. From here on, when a quantity has a sign, the text says whether we are maximizing the bound or minimizing the loss.
An autoencoder falls out
Read eq (3) as a machine rather than a formula. The encoder takes a datapoint and produces a distribution over codes. You draw a code from it. The decoder takes that code and tries to rebuild . The first term in eq (3) rewards a good rebuild, and the second keeps the codes tidy. That is an autoencoder, an encoder feeding a decoder through a bottleneck, with one twist: the bottleneck is a distribution you sample from, not a single point, and a regularizer keeps that distribution honest.
The two terms pull against each other, and that tension is the point. The reconstruction term wants codes that carry as much about as possible, ideally a sharp, distinctive code per datapoint so the decoder can rebuild it perfectly. The KL term wants every code distribution to look like the same bland prior, which blurs the codes together and throws information away. A plain autoencoder, with only the reconstruction term, is free to scatter its codes anywhere it likes. It will, and the space between codes decodes to garbage, which is why you cannot sample from a plain autoencoder. The KL term is what makes the latent space filled in: every encoding is a little Gaussian parked near the origin, the encodings overlap and cover the prior, and so a fresh draw from lands somewhere the decoder actually knows. The regularizer is not a tax on the model. It is the thing that makes generation possible at all.
The encoder also solves the problem from the last section. We needed the posterior and could not compute it. Instead of solving for it per datapoint, we train one network that predicts an approximate posterior for any in a single forward pass. That shortcut has a name, amortized inference: the cost of inference is paid once, in the encoder's weights , and then reused for every datapoint, instead of running a fresh optimization for each one the way classical variational inference does. The payoff lands at test time: encoding a new, unseen is one forward pass through , not an optimization, which is what makes the encoder fast and reusable.
The KL, in closed form
One of the two terms is a gift. If the encoder is a diagonal Gaussian, the standard VAE choice, then for each datapoint it outputs a mean vector and a vector of standard deviations :
Both the prior and this are Gaussian, so the KL between them has a closed form. You do not sample it, you do not approximate it, you just evaluate it. With the prior , summing over the latent dimensions, the negative KL (the form the paper adds to its maximized bound) is
Look at what it wants. This expression is at its largest, exactly zero, when every is and every is , which is to say when is the prior. Anywhere else it is negative, and the further drifts from the prior, the more negative it gets. A mean far from the origin is penalized through ; a variance that is too large or too small is penalized through . In a minimized loss every sign flips and you penalize the positive KL instead, the form you will meet in code:
It is a soft spring tethering each code distribution to the standard Gaussian, fought by the reconstruction term that wants the codes spread out and informative.
Move the encoder's output for one latent dimension and watch the spring. The teal bell is , the amber bell the fixed prior; the landscape below is the penalty, brightest where it vanishes.
With the KL handled exactly, only one term in the bound still involves an expectation you cannot integrate: the reconstruction . Drawing samples estimates its value well enough. The obstacle is its gradient, and that is where the real difficulty hides.
Backprop through a random sample
Write the reconstruction's integrand as . We need the gradient of with respect to the encoder's parameters . (The decoder is the easy half: it sits inside at a fixed code, so its gradient passes straight through the sample, the way any gradient does. The hard half is , which controls which codes get drawn at all.) And that is the subtlety: does not sit inside , it sits inside the distribution we are averaging over. Change and you change which codes get sampled, not the value of at a fixed code. You cannot push the derivative through a sampling step the way you would through a fixed sum.
There is a textbook estimator that handles this, the score-function estimator (also called REINFORCE, or the likelihood-ratio estimator):
It is correct. It is unbiased, and it even works for discrete codes. But it is famously noisy, and the paper dismisses it in one line for being too high-variance to use. The reason is in its shape. The cost enters multiplicatively: each estimate is times a score, so its magnitude and its noise scale with how big is in absolute terms. Add a constant to and the true gradient does not move, yet the estimator gets noisier, because it never uses how changes with the code, only its raw height. It learns by correlation: codes that happened to score well should be a little more likely. With one sample per datapoint that correlation is mostly noise.
The fix is the paper's headline idea, the reparameterization trick. Instead of drawing directly from , write the same draw as a deterministic function of the encoder's outputs and a fixed, parameter-free noise source. For the Gaussian encoder that is just
where is the elementwise product. The randomness now lives entirely in , which does not depend on at all. The encoder's job is the deterministic part, the shift and the scale . And because is now a plain differentiable function of and , the gradient of flows straight back through it into the encoder, using 's own slope at the drawn code. That is the difference that kills the variance: this estimate depends on how steeply leans near , never on its raw height, so a smooth cost already points the right way from a single sample. Same distribution of , completely different gradient.
Toggle the two wirings. The forward pass is identical; the backward pass is the whole story.
The slogans usually skip two caveats that matter. Both estimators are unbiased for the same gradient; they agree in expectation. What the trick changes is variance, not correctness. And lower variance is a strong, reliable tendency, not a theorem: for a cost that varies very sharply with the code, the pathwise estimator can in principle be the noisier one. For the smooth reconstruction losses VAEs use, it wins comfortably, which is the whole reason a single sample per datapoint is enough to train.
How much does it actually help?
You can see the variance gap instead of taking it on faith. Take a one-dimensional toy: a code and a smooth cost , a stand-in for the reconstruction from before, and estimate the one gradient , whose true value is . Each estimator gets the same random draws. Drag the sample count up: both clouds of estimates converge on the true gradient (both are unbiased), but the score-function cloud stays splattered across a wide band while the reparameterized one sits in a tight knot around the answer.
That tightness is what makes the whole scheme practical. The paper's estimator uses a single code per datapoint, and it trains, because at a minibatch of a hundred the reparameterized gradients are already clean enough to point the right way.
The whole algorithm
Now assemble the pieces into the objective the paper actually optimizes. Take the bound in its reconstruction- minus-KL form (3), keep the KL analytic, and estimate only the reconstruction by sampling codes through the reparameterization. For a single datapoint with a Gaussian encoder this is the paper's eq (10):
with and . This whole quantity is a single differentiable number. What you just built, the analytic KL plus a sampled reconstruction, is the paper's lower-variance estimator, SGVB B. (There is also a fully generic SGVB A that Monte-Carlo-estimates both terms, for a non-Gaussian prior or encoder where the KL has no closed form.) Differentiate it with respect to both and at once and take a gradient step. To scale to a large dataset you do this on a minibatch and rescale, eq (8):
The paper finds sample per datapoint is enough as long as the minibatch is reasonably large, around , because the minibatch already averages the per-sample noise. So in practice you draw exactly one per datapoint. Written out, one training step is short:
# one AEVB step on a minibatch X of M datapoints; prior p(z) = N(0, I)
mu, logvar = encoder(X) # q(z|x) = N(mu, exp(logvar))
eps = randn_like(mu) # external noise, no gradient
z = mu + exp(0.5 * logvar) * eps # reparameterize (eq 4)
x_hat = decoder(z) # p(x|z)
recon = log_p_x_given_z(X, x_hat) # +log-likelihood (eq 11 / 12)
kl = -0.5 * sum(1 + logvar - mu**2 - exp(logvar)) # = +KL (eq 10)
loss = mean(kl - recon) # minimize -ELBO (= -L)
loss.backward(); opt.step() # grad -> encoder AND decoderThe listing hides two details. The encoder does not output directly; it outputs , the log-variance, and you recover . That keeps the variance positive for free and behaves under gradients. And the single loss.backward() sends gradients into the decoder through the reconstruction term and into the encoder through both terms, the KL directly and the reconstruction through the reparameterized code. One backward pass trains the entire model.
Make it concrete. Say is a binarized MNIST digit, a 784-dimensional vector of zeros and ones. The encoder is an MLP with a 500-unit hidden layer that reads those 784 numbers and outputs two 20-dimensional vectors, and , so the latent is dimensional. Draw one from , form , and a decoder MLP (also one 500-unit hidden layer) maps that 20-vector back to the reconstruction the pipeline figure called , here a vector of 784 per-pixel Bernoulli means . The reconstruction term is the Bernoulli log-likelihood of the real pixels under those probabilities, eq (11):
and the KL term is the 20-term positive sum from above. The loss is that KL minus this reconstruction, a single scalar, and backprop updates both MLPs. (For continuous data like the Frey Face images the decoder instead outputs a mean and a variance and the reconstruction is a Gaussian log-density, eq (12); everything else is identical.) Once it is trained you can throw the encoder away and just generate:
# generation: no encoder, no data, just ancestral sampling
z = randn(n, J) # z ~ prior N(0, I)
x_hat = decoder(z) # p(x|z): a brand-new sample, never seen in trainingSample a code from the prior, run the decoder, and you have a new datapoint the model invented. The encoder existed only to make training tractable. At generation time the latent space is the prior, and the decoder is the artist.
A latent space you can walk
The regularizer promised something the reconstruction alone could not: a latent space with no holes, where every point decodes to something sensible and nearby points decode to similar things. That is not just a sampling convenience, it is a learned, organized coordinate system for the data. The paper makes the point with a picture. Train a VAE with a two-dimensional latent so you can see all of it, then sweep a grid of codes across the plane and decode each one.
Move the cursor through the latent plane. The decoded digit morphs smoothly, with no seams and no gaps.
The figure invites one careful detail. To make a grid that matches the prior, the paper (unlike this evenly-spaced demo grid) does not space the codes uniformly. It lays a regular grid on the unit square and passes it through the inverse Gaussian CDF, so the grid is dense where the prior puts its mass and sparse in the tails. That is the honest way to tile a Gaussian latent. The takeaway is the smoothness: because every datapoint was encoded to a Gaussian pinned near the origin, the codes pack together and overlap, and the decoder learns a continuous map off of them. Walking a straight line in latent space gives you a smooth interpolation in data space, which is the property that made VAEs a workhorse for representation learning.
What it actually showed
The paper is from 2013, and its results read best in their own terms, not through a modern lens. There is no FID, no Inception score, no CIFAR or ImageNet; those came years later. The datasets are MNIST and the Frey Face video frames. The metric is the variational lower bound itself, measured in nats, plus an estimate of the marginal likelihood for a small latent. The optimizer is Adagrad, not Adam, which Kingma would only publish the following year.
The comparison is against the two reasonable alternatives of the day, the wake-sleep algorithm and Monte Carlo EM with a Hybrid Monte Carlo sampler. On both datasets AEVB converges faster and reaches a better bound, and it keeps scaling where Monte Carlo EM, which needs an expensive sampling loop per datapoint, cannot follow onto the full dataset. The settings are modest by today's standards: minibatches of , one latent sample each, 500 hidden units for MNIST and 200 for the smaller Frey Face set, weights initialized from .
One finding surprised the authors. Adding more latent dimensions did not cause overfitting. A plain autoencoder with a fat bottleneck would happily memorize; this one did not, because the KL term regularizes the model for free, and superfluous latent dimensions just collapse back onto the prior and stop carrying information. The bound polices its own capacity.
What the trick opened up is larger than the experiments. The reparameterization estimator turned variational inference into something you could drop into any gradient-based pipeline, and the years after this paper are a long list of things built on it. There are honest limits, most of which became their own research programs. The encoder is only an approximation, so there is an amortization gap between it and the best per-datapoint posterior, which later work attacks with tighter bounds and richer encoders. The KL term can be too eager and shut a latent dimension off entirely, the failure called posterior collapse; turning its weight down with a single coefficient gives the β-VAE. The trick needs a continuous code, so discrete latents had to wait for the score-function estimator or a relaxation like Gumbel-Softmax. And the Gaussian decoder returns a blurry mean, which is part of why VAEs alone make soft samples.
None of that has dislodged the core idea. A latent-variable model has an intractable likelihood. Bound it, learn an encoder to approximate the posterior, and reparameterize the one sampling step so the whole thing is differentiable. That recipe is still how a great deal of modern generative modeling is trained, right down to the autoencoder that compresses images into the latent space that diffusion models like Stable Diffusion actually run in. The variational autoencoder was one of the first times you could take the clean probabilistic story, sample a code and decode it, and train it end to end with nothing fancier than backprop.
Questions you might still have
If we never compute log p(x), how do we know training improves it?
We climb the ELBO, a lower bound on log p(x). Raising it either lifts log p(x) itself or closes the gap to the true posterior (the gap is exactly a KL). Either way the model and its encoder both get better.
Why add noise to the code at all? A plain autoencoder reconstructs better.
A plain autoencoder can scatter codes anywhere and leave gaps that decode to nothing. The KL term forces every encoding to be a Gaussian sitting near the prior, so the whole latent space decodes to something. That is what makes sampling and interpolation work.
Both gradient estimators are unbiased. So why does variance matter?
With one sample per datapoint (the paper uses L = 1), a high-variance gradient is mostly noise, and you would need far more samples or steps to average it away. The reparameterized estimator is tight enough that a single sample already trains.
Does the trick work for discrete latents, like words or tokens?
Not directly. z = mu + sigma·eps needs a continuous, differentiable path. Discrete latents fall back on the score-function estimator or a continuous relaxation (Gumbel-Softmax), which came years later.
Footnotes & further reading
- The paper: Kingma & Welling, Auto-Encoding Variational Bayes (Universiteit van Amsterdam, ICLR 2014).
- The ELBO / EM decomposition: Bishop, Pattern Recognition and Machine Learning, ch. 9, and Blei, Kucukelbir & McAuliffe, Variational Inference: A Review for Statisticians.
- Score-function vs pathwise gradients (both unbiased, different variance): Mohamed, Rosca, Figurnov & Mnih, Monte Carlo Gradient Estimation in Machine Learning.
- The same reparameterized estimator, developed independently and concurrently: Rezende, Mohamed & Wierstra, Stochastic Backpropagation and Approximate Inference in Deep Generative Models.
- The modern reference, with the log-variance convention and the loss framing: Kingma & Welling, An Introduction to Variational Autoencoders.
- What came next: tighter bounds (Burda et al., IWAE), discrete latents (Jang et al., Gumbel-Softmax), disentangling (Higgins et al., β-VAE), and the autoencoder under latent diffusion (Stable Diffusion).
How could this explainer be improved? Found an error, or something unclear? I read every message.