Denoising Diffusion Probabilistic Models
Destroy an image with noise, then learn the walk back.
The forward half throws an image into static and needs no learning at all. The whole model is one network trained to guess the noise. Run it in reverse and a Gaussian blob turns into a picture.
What if generating an image were just denoising, run a thousand times in a row?
By 2020 the way to generate a sharp image was a GAN: pit a generator against a discriminator and hope the arms race converges. GANs made beautiful samples and were miserable to train, prone to collapsing onto a handful of outputs and to never quite settling. Likelihood models like autoregressive Transformers and flows trained more honestly but lagged on sample quality. There was an older, quieter idea sitting in the corner, diffusion probabilistic models, that nobody had ever pushed to high quality.
This paper pushed it. The recipe is almost suspiciously plain. Take a clean image and add a little Gaussian noise. Add a little more. Keep going for a thousand tiny steps until the image is indistinguishable from television static. That part is mechanical, no network involved. Now train a single network to undo one step: given a noisy image, guess the noise that was added. To generate, start from pure static and apply that one network over and over, peeling noise off a bit at a time, until an image appears.
On CIFAR-10 it reached a FID of 3.17, state of the art for unconditional generation, finally competitive with the GANs on their own turf. The paper that did it is short, and the math underneath is a tower of five ideas, each one small. We will build them in order: the forward noising, the shortcut that lets you noise to any level in one step, the reverse process and the bound that trains it, the tractable target hiding inside that bound, and the reparameterization that collapses the whole thing into "predict the noise." Stack those and the paper falls out.
The forward process: a slow drown
Start with the easy direction, the one that destroys information. Call the clean image $x_0$. The forward process (the paper also calls it the diffusion process) is a fixed Markov chain that adds a pinch of Gaussian noise at each of $T$ steps, producing a sequence of progressively noisier images $x_1, x_2, \dots, x_T$. Each step is one Gaussian:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$
Read the two pieces. The mean is the previous image shrunk by $\sqrt{1-\beta_t}$, a hair below one, so the signal fades a touch. The covariance is $\beta_t I$, a fresh splash of isotropic noise. The $\beta_t$ are a fixed variance schedule, small numbers that say how much noise to add at step $t$. They are not learned. In the paper $\beta_t$ rises linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ over $T = 1000$ steps. The shrink-then-add structure is deliberate: it keeps the total variance from blowing up, which is why this is called the variance-preserving setup.
Chaining the steps gives the whole forward trajectory as a product of these Gaussians:
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$$
Because each step only shrinks the signal and adds noise, the destination is fixed in advance: after enough steps, whatever you started from has dissolved into the same featureless Gaussian.
[Interactive figure: drag the timestep and watch a small wordmark drown in noise.]
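The chain is short enough to run directly. The sketch below is a minimal NumPy version, assuming the paper's linear schedule; the constant 64×64 "image" and the function name `forward_chain` are illustrative choices, not anything from the paper. A constant input makes leftover signal easy to spot, since it would show up as a nonzero mean.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # beta_1 .. beta_T (index 0 holds beta_1)

def forward_chain(x0, betas, rng):
    """Apply q(x_t | x_{t-1}) once per step and return x_T."""
    x = x0.copy()
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        # Shrink the previous image, then add fresh isotropic noise.
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise
    return x

rng = np.random.default_rng(0)
x0 = np.ones((64, 64))               # hypothetical "image": all pixels equal to 1
xT = forward_chain(x0, betas, rng)

# After T steps the mean should sit near 0 and the standard deviation near 1.
print(xT.mean(), xT.std())
```

Swapping in any other starting array should give roughly the same two summary numbers, which is the point: the endpoint does not depend on the input.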
That is the entire forward half. It is intentionally dumb. All of the intelligence is going to live in running it backward, and to set that up we need one more fact about the forward chain.
Jump to any noise level in one step
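Training will need noisy images at random timesteps. Marching the chain forward step by step to reach $x_t$ would be slow. The forward process has a gift that makes this free: you can sample $x_t$ directly from $x_0$ in closed form, skipping every intermediate step. Define $\alpha_t = 1 - \beta_t$ and the running product $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$. Then a chain of Gaussians collapses into a single Gaussian:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t)\,I\right) \tag{4}$$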
This is the workhorse of the whole method, so it is worth seeing why it holds. Each forward step multiplies the signal by $\sqrt{\alpha_t}$ and adds independent noise. Compose two steps and the signal is multiplied by $\sqrt{\alpha_t \alpha_{t-1}}$, while the two independent noise injections add in variance (independent Gaussians combine by summing variances). Bookkeep that all the way down and the signal coefficient becomes $\sqrt{\bar\alpha_t}$, and the variances telescope to exactly $1 - \bar\alpha_t$. The variance-preserving design is what makes the squared signal and noise weights sum to one, so the total stays put.
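For readers who want the bookkeeping spelled out, here is the two-step case. The subscripted noise variables $\epsilon_{t-1}$ and $\epsilon_t$ are notation introduced for this sketch; they are independent standard Gaussians.

```latex
\begin{aligned}
x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t \\
    &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_{t-1}}\,\epsilon_{t-1}\right) + \sqrt{1-\alpha_t}\,\epsilon_t \\
    &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2}
       + \underbrace{\sqrt{\alpha_t(1-\alpha_{t-1})}\,\epsilon_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t}_{\text{Gaussian with variance } \alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) \,=\, 1-\alpha_t\alpha_{t-1}}
\end{aligned}
```

The merged noise term has variance $1 - \alpha_t\alpha_{t-1}$, exactly one minus the squared signal coefficient. Repeating the substitution down to $x_0$ replaces $\alpha_t\alpha_{t-1}$ with the full product $\bar\alpha_t$.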
The practical form is the one you sample from. Draw a single Gaussian $\epsilon \sim \mathcal{N}(0, I)$ and set
$$x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$$
So $\bar\alpha_t$ is a single dial from clean to noise. Two coefficients ride on it: a signal weight $\sqrt{\bar\alpha_t}$ and a noise weight $\sqrt{1-\bar\alpha_t}$. Early on the signal weight is near one and the image is barely touched. By the end the signal weight has collapsed and the noise weight is near one. With the paper's schedule $\bar\alpha_t$ drops from $0.9999$ at $t = 1$ to about $4 \times 10^{-5}$ at $t = 1000$, so $\sqrt{\bar\alpha_T} \approx 0.006$ and the residual signal is a rounding error. That is the precise reason $x_T$ is a standard Gaussian regardless of the input.
[Interactive figure: the signal weight and the noise weight plotted against $t$, trading off and crossing partway through the chain.]
The crossing matters later. Around it, the image is half signal and half noise, and that middle band is where the reverse process does its real work. Near $t = 0$ there is almost nothing to fix, and near $t = T$ there is almost no signal to find.
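The two weights and their crossing point can be tabulated in a few lines. This is a small NumPy sketch assuming the paper's linear schedule; the variable names (`alpha_bar`, `t_cross`) are my own, and "crossing" is taken to mean the first $t$ at which the noise weight is at least the signal weight.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)        # index i holds beta_{i+1}
alpha_bar = np.cumprod(1.0 - betas)       # index i holds alpha_bar_{i+1}

signal = np.sqrt(alpha_bar)               # weight on x_0
noise = np.sqrt(1.0 - alpha_bar)          # weight on epsilon

# First timestep at which the noise weight is at least the signal weight,
# which is the same as alpha_bar_t <= 0.5.
t_cross = int(np.argmax(noise >= signal)) + 1

for t in (1, t_cross, T):
    print(f"t={t:4d}  signal={signal[t - 1]:.4f}  noise={noise[t - 1]:.4f}")
```

Because the weights satisfy signal² + noise² = 1 at every step, the crossing is where both equal $1/\sqrt{2}$.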
Reversing it: learn to undo a step
Now the hard direction. We want a reverse process that starts from pure noise and walks back up the chain to a clean image. The true reverse of the forward process, $q(x_{t-1} \mid x_t)$, is intractable, because computing it would require knowing the distribution of all images. So we learn an approximation. The paper's key structural bet is that when the per-step noise is small, the true reverse step is itself very close to Gaussian, so a Gaussian with a learned mean is enough to model it. Define the reverse chain as
$$p_\theta(x_{0:T}) = p(x_T)\prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$
It starts at $p(x_T) = \mathcal{N}(x_T; 0, I)$, the same Gaussian the forward process drowns everything into, and each step is a Gaussian whose mean the network predicts. (DDPM does not learn the variance; it fixes $\Sigma_\theta(x_t, t) = \sigma_t^2 I$ to a schedule constant, and reports that learning it hurt. More on $\sigma_t^2$ when we sample.)
How do you train a chain like this? With the same tool that trains a variational autoencoder: maximize a lower bound on the data's log-likelihood, equivalently minimize an upper bound on the negative log-likelihood. The latents here are just the noisy versions of the image, $x_1, \dots, x_T$, and the bound is
$$\mathbb{E}\left[-\log p_\theta(x_0)\right] \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] =: L$$
That single fraction looks unfriendly, but it untangles. Because both chains factor over $t$, the bound rewrites as a sum of per-step terms, and each term turns out to be a comparison between two Gaussians:
$$L = \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T} + \sum_{t>1}\underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t)\big)}_{L_{t-1}}\ \underbrace{-\log p_\theta(x_0 \mid x_1)}_{L_0}\Big] \tag{5}$$
Three kinds of terms. $L_T$ has no parameters (the forward process is fixed and ends at a Gaussian, so this is a near-zero constant). $L_0$ is the final decode back to a discrete image. The work is in the middle sum: for each step, the KL between the model's reverse step and a special distribution $q(x_{t-1} \mid x_t, x_0)$. That second distribution is the one to understand next, because it is the target the network is chasing, and it has a clean closed form.
The tractable target
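The plain reverse step $q(x_{t-1} \mid x_t)$ is intractable. But condition it also on the original clean image $x_0$ and it becomes a tidy Gaussian. This is the forward-process posterior: given where you are now ($x_t$) and where you started ($x_0$), what was the slightly-less-noisy $x_{t-1}$ in between?
$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde\mu_t(x_t, x_0),\ \tilde\beta_t I\right)$$
$$\tilde\mu_t(x_t, x_0) = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t, \qquad \tilde\beta_t = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t$$
Do not let the coefficients scare you. The mean is a weighted combination of two points you already have: the clean image $x_0$ and the current noisy $x_t$. One denoising step lands close to the line between them, leaning toward whichever the schedule trusts more at this $t$. And $\tilde\beta_t$ is a small variance, the wobble around that mean. (With the paper's schedule $\tilde\beta_t$ is nearly equal to $\beta_t$ across most of the chain, which is why the two variance choices for sampling behave so similarly.)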
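The posterior coefficients are easy to tabulate, and doing so shows how close $\tilde\beta_t$ sits to $\beta_t$. The sketch below assumes the paper's linear schedule and pads index 0 so that `beta[t]` lines up with the text's $\beta_t$; the helper name `posterior` and the sampled timesteps are illustrative choices of mine.

```python
import numpy as np

T = 1000
beta = np.concatenate([[0.0], np.linspace(1e-4, 0.02, T)])  # beta[t] for t = 1..T; beta[0] is padding
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)                                # alpha_bar[0] = 1 (no noise at t = 0)

def posterior(x_t, x0, t):
    """Mean and variance of q(x_{t-1} | x_t, x_0)."""
    coef_x0 = np.sqrt(alpha_bar[t - 1]) * beta[t] / (1.0 - alpha_bar[t])
    coef_xt = np.sqrt(alpha[t]) * (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t])
    beta_tilde = (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * beta[t]
    return coef_x0 * x0 + coef_xt * x_t, beta_tilde

for t in (2, 10, 100, 500, 1000):
    _, beta_tilde = posterior(0.0, 0.0, t)
    print(f"t={t:4d}  beta_t={beta[t]:.6f}  beta_tilde_t={beta_tilde:.6f}  "
          f"ratio={beta_tilde / beta[t]:.4f}")
```

The ratio is noticeably below one only in the first handful of steps. At $t = 1$ the posterior collapses onto $x_0$ with zero variance, which matches the intuition that the step before the clean image has nothing left to guess once $x_0$ is known.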
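This is the target the model copies. Both $q(x_{t-1} \mid x_t, x_0)$ and the model step $p_\theta(x_{t-1} \mid x_t)$ are Gaussians whose variances are fixed rather than learned, so the only part of the KL in (5) that depends on the network is a squared distance between their means. Up to a constant,
$$L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\left\|\tilde\mu_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right] + C \tag{8}$$
That is the whole training signal, stated plainly: make the network's predicted mean $\mu_\theta(x_t, t)$ match the posterior mean $\tilde\mu_t(x_t, x_0)$.
[Interactive figure: a fixed clean point $x_0$ and a noisy point $x_t$, showing where one reverse step lands as $t$ slides.]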
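The step from "KL between Gaussians" to "squared distance between means" can be checked numerically. The sketch uses the standard closed-form KL for isotropic Gaussians; the random vectors merely stand in for the posterior mean and the network's predicted mean, and the function name is my own.

```python
import numpy as np

def kl_isotropic(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q * I) || N(mu_p, var_p * I) ) with scalar variances."""
    d = mu_q.size
    return 0.5 * (d * (var_q / var_p - 1.0 + np.log(var_p / var_q))
                  + np.sum((mu_q - mu_p) ** 2) / var_p)

rng = np.random.default_rng(0)
mu_tilde = rng.standard_normal(8)   # stands in for the posterior mean
mu_theta = rng.standard_normal(8)   # stands in for the network's predicted mean
sigma2 = 0.01

# Equal variances: the KL is exactly the scaled squared distance between means.
kl_equal = kl_isotropic(mu_tilde, sigma2, mu_theta, sigma2)
mse_form = np.sum((mu_tilde - mu_theta) ** 2) / (2.0 * sigma2)

# Unequal but fixed variances: the leftover term does not depend on mu_theta.
other_mu = rng.standard_normal(8)
gap_a = kl_isotropic(mu_tilde, 0.008, mu_theta, sigma2) - mse_form
gap_b = (kl_isotropic(mu_tilde, 0.008, other_mu, sigma2)
         - np.sum((mu_tilde - other_mu) ** 2) / (2.0 * sigma2))

print(kl_equal, mse_form, gap_a, gap_b)
```

The two gaps agree, which is the constant $C$ in (8): it shifts the loss but contributes no gradient to the network.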
Predict the noise, not the mean
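The network could predict $\tilde\mu_t$ directly. The paper found a better target by substituting the closed form (4) into the posterior mean. Recall $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$. That means the clean image is recoverable from the noisy one if you know the noise: $x_0 = \left(x_t - \sqrt{1-\bar\alpha_t}\,\epsilon\right)/\sqrt{\bar\alpha_t}$. Plug that into $\tilde\mu_t$ and the algebra simplifies to a mean written entirely in terms of $x_t$ and the noise $\epsilon$. So instead of predicting the mean, have the network predict the noise, $\epsilon_\theta(x_t, t)$, and assemble the mean from it:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) \tag{11}$$
Read it as a recipe for one reverse step. Take the noisy $x_t$, subtract a scaled version of the network's noise guess, and rescale by $1/\sqrt{\alpha_t}$ to undo that step's shrink. The network only ever has to answer one question, at every noise level: what noise is in this image?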
Why is "guess the noise" the same as "denoise"? Because the noise and the clean image are two ends of the same identity (4): pin one and the other is fixed. Predicting $\epsilon$, predicting $x_0$, and predicting $\tilde\mu_t$ are three views of one quantity. Geometrically, the clean point $x_0$ gets shrunk to $\sqrt{\bar\alpha_t}\,x_0$, and a noise vector $\sqrt{1-\bar\alpha_t}\,\epsilon$ pushes it out to $x_t$. That arrow is exactly what the network is asked to return.
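The claim that the noise form of the mean and the posterior mean are the same quantity is simple to verify with numbers. In this sketch the true noise plays the role of a perfect network; the 16-element vector, the timestep 300, and the variable names are arbitrary choices for illustration.

```python
import numpy as np

T = 1000
beta = np.concatenate([[0.0], np.linspace(1e-4, 0.02, T)])  # beta[t], t = 1..T; beta[0] is padding
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=16)     # hypothetical flattened "image"
eps = rng.standard_normal(16)
t = 300                                   # arbitrary timestep for the check

# Closed-form noising: jump from x_0 to x_t in one shot.
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Posterior mean written in terms of x_0 and x_t.
mu_posterior = (np.sqrt(alpha_bar[t - 1]) * beta[t] / (1.0 - alpha_bar[t]) * x0
                + np.sqrt(alpha[t]) * (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * x_t)

# Noise-parameterized mean, with the true noise in place of the network's guess.
mu_from_eps = (x_t - beta[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha[t])

# Inverting the noising identity recovers x_0 from x_t and the noise.
x0_recovered = (x_t - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])

print(np.max(np.abs(mu_posterior - mu_from_eps)))
print(np.max(np.abs(x0_recovered - x0)))
```

Both differences are at floating-point noise level, so a network that predicted the noise perfectly would reproduce the posterior mean exactly.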
Now the payoff. Writing $\mu_\theta$ in terms of $\epsilon_\theta$ turns the mean distance (8) into a distance between the true noise and the predicted noise, with a per-step weight out front. The paper then makes one empirical move: drop that weight. The training objective becomes a plain mean-squared error on the noise:
$$L_{\text{simple}}(\theta) = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right)\right\|^2\right] \tag{14}$$
That is the entire loss. Sample an image, sample a timestep, sample noise, build the noised image with one use of (4), ask the network for the noise, penalize the squared error. No adversary, no sampling during training, no chain to unroll. Dropping the weight is not free in theory (it stops being the tightest likelihood bound), but it down-weights the easy near-clean steps and pushes the network to spend its capacity on the harder middle of the chain, and it produced the paper's best samples. The connection the authors prize is right here: $L_{\text{simple}}$ is denoising score matching across noise levels, the objective from Song and Ermon, and $\epsilon_\theta$ is, up to scale, an estimate of the score (this is Tweedie's formula: a Gaussian denoiser is a score estimator). Two research lines, one objective.
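For reference, here is the weighted form that the mean-matching term takes under the noise parameterization (the paper's eq. 12), followed by the scale factor that links the noise prediction to the score of the noised data distribution $q(x_t)$. The second line is the standard identification and is approximate, since it holds only to the extent that the network has learned the optimal denoiser.

```latex
L_{t-1} - C = \mathbb{E}_{x_0,\,\epsilon}\!\left[
  \frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar\alpha_t)}
  \left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,\ t\right)\right\|^2
\right]

\nabla_{x_t}\log q(x_t) \;\approx\; -\,\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar\alpha_t}}
```

With $\sigma_t^2 = \beta_t$ the weight is $\beta_t / \big(2\alpha_t(1-\bar\alpha_t)\big)$, which is about $0.5$ at $t = 1$ and about $0.01$ at $t = 1000$ under the paper's schedule. Replacing it with a constant therefore shifts emphasis away from the smallest $t$, which is the down-weighting described above.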
Sampling: the walk back
Training is done. To generate, run the reverse chain. Start from pure noise $x_T \sim \mathcal{N}(0, I)$ and step down to $x_0$. Each step computes the mean from the network's noise guess via (11), then adds a dab of fresh noise:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I)$$
(Note $1 - \alpha_t = \beta_t$, so this is exactly the mean from (11) plus the noise term.) The added $\sigma_t z$ is what keeps this a proper sample from a stochastic process rather than a deterministic average. The variance $\sigma_t^2$ is a fixed schedule choice, not learned. The paper tried $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde\beta_t$ and found them comparable; the released code defaults to $\sigma_t^2 = \beta_t$. On the very last step ($t = 1$) the noise is dropped, so $x_0 = \mu_\theta(x_1, 1)$ and you read off a clean image.
[Interactive figure: run the sampler and watch noise condense onto the data.]
Notice what each step is doing. The network looks at the current noisy image, guesses the noise, and the update nudges the image a little toward the implied clean version, then re-roughens it slightly so the chain can keep exploring. Early steps (high $t$) are coarse: the noise drowns out detail and the best guess is close to a blurry average. Late steps (low $t$) are fine: the image is nearly clean and the network sharpens edges. Coarse-to-fine, exactly the order the schedule's middle band predicted.
One training step, one sampling step
Make it fully concrete with CIFAR-10 shapes. An image is $3 \times 32 \times 32 = 3072$ numbers, scaled from the integer range $\{0, 1, \dots, 255\}$ linearly into $[-1, 1]$. The schedule has $T = 1000$ steps, linear from $10^{-4}$ to $0.02$, so $\bar\alpha_t$ is precomputed once as a length-1000 table. The network is a U-Net that takes a noisy image plus the timestep $t$ (fed in through a sinusoidal embedding, the same trick Transformers use for position) and outputs a tensor the same shape as the image: its guess of the noise in every pixel.
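The two preprocessing pieces mentioned here, pixel scaling and the timestep embedding, might look like the following. This is a sketch under stated assumptions: the embedding width of 128, the `max_period` of 10000, and the exact frequency spacing follow the common Transformer recipe and are not taken from the paper or its released code; the function names are hypothetical.

```python
import numpy as np

def scale_image(img_uint8):
    """Map integer pixels in {0, ..., 255} linearly onto [-1, 1]."""
    return img_uint8.astype(np.float64) / 127.5 - 1.0

def timestep_embedding(t, dim=128, max_period=10000.0):
    """Transformer-style sinusoidal embedding of integer timesteps t (shape [batch])."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down toward 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(args), np.cos(args)], axis=1)

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(3, 32, 32), dtype=np.uint8)   # random stand-in for a CIFAR-10 image
x0 = scale_image(img)
emb = timestep_embedding(np.array([1, 500, 1000]))

print(x0.shape, x0.min(), x0.max(), emb.shape)
```

The embedding gives the network a smooth, fixed-size description of $t$, so one set of weights can serve every noise level.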
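A training step. Draw a clean image $x_0$ (shape $3 \times 32 \times 32$). Draw a timestep $t$ uniformly from $\{1, \dots, T\}$ and look up $\bar\alpha_t$ in the table, which fixes the signal weight $\sqrt{\bar\alpha_t}$ and the noise weight $\sqrt{1-\bar\alpha_t}$. Draw $\epsilon \sim \mathcal{N}(0, I)$ (same shape). Build $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$ in one line. Feed $(x_t, t)$ to the U-Net, get back $\epsilon_\theta(x_t, t)$, and the loss is the single number $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$. Backprop, step, repeat: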
# Algorithm 1: training (one step)
x0 = sample_batch()                                  # clean images, scaled to [-1, 1]
t = randint(1, T)                                    # uniform timestep in {1, ..., T}
eps = randn_like(x0)                                 # the noise we add (the answer)
xt = sqrt(abar[t]) * x0 + sqrt(1 - abar[t]) * eps    # eq (4)
loss = mse(eps, eps_theta(xt, t))                    # L_simple, eq (14)
opt.zero_grad(); loss.backward(); opt.step()         # clear old grads, backprop, update

Sampling reverses it. Start from $x_T$, pure Gaussian noise of shape $3 \times 32 \times 32$. For $t = T, T-1, \dots, 1$, call the network once, assemble the mean with (11), add $\sigma_t z$ (except on the last step), and pass the result down. After 1000 network calls you have a sample $x_0$:
# Algorithm 2: sampling
def sample(n, shape):
    x = randn(n, *shape)                     # x_T ~ N(0, I)
    for t in range(T, 0, -1):                # t = T, T-1, ..., 1
        eps = eps_theta(x, t)                # the network's noise guess
        mean = (x - (1 - alpha[t]) / sqrt(1 - abar[t]) * eps) / sqrt(alpha[t])  # eq (11)
        z = randn_like(x) if t > 1 else 0    # no fresh noise on the final step
        x = mean + sqrt(sigma2[t]) * z       # sigma2[t] = beta[t] (fixedlarge)
    return x                                 # x_0, a sample

One thing the shapes make obvious: the cost. Training is cheap per step, just one forward and backward pass on one noised image. Sampling is the expensive part, because it is sequential and needs one network call per timestep. A thousand calls to make a single picture. That is the bill diffusion pays for stable training and sharp samples, and it is the bottleneck every follow-up paper has tried to cut.
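To see the sampling loop run end to end without training a U-Net, one option is a toy problem where the optimal noise predictor is known in closed form. The sketch below assumes one-dimensional data $x_0 \sim \mathcal{N}(m, s^2)$, for which $\mathbb{E}[\epsilon \mid x_t]$ is a linear function of $x_t$; that analytic function stands in for a trained $\epsilon_\theta$. The toy distribution, the names `eps_optimal` and `sample`, and the choice $\sigma_t^2 = \beta_t$ are all assumptions of this illustration, not the paper's experimental setup.

```python
import numpy as np

T = 1000
beta = np.concatenate([[0.0], np.linspace(1e-4, 0.02, T)])  # beta[t], t = 1..T; beta[0] is padding
alpha = 1.0 - beta
abar = np.cumprod(alpha)

# Toy data distribution (an assumption for this sketch): x_0 ~ N(m, s^2).
m, s = 2.0, 0.5

def eps_optimal(x, t):
    """E[eps | x_t] for Gaussian data; stands in for a trained eps_theta."""
    var_t = abar[t] * s**2 + (1.0 - abar[t])             # Var(x_t)
    return np.sqrt(1.0 - abar[t]) * (x - np.sqrt(abar[t]) * m) / var_t

def sample(n, rng):
    """Ancestral sampling with sigma_t^2 = beta_t."""
    x = rng.standard_normal(n)                            # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        eps = eps_optimal(x, t)
        mean = (x - beta[t] / np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(alpha[t])
        z = rng.standard_normal(n) if t > 1 else 0.0      # no fresh noise on the last step
        x = mean + np.sqrt(beta[t]) * z
    return x

samples = sample(20000, np.random.default_rng(0))

# The sample mean and standard deviation should land close to m and s.
print(samples.mean(), samples.std())
```

Starting from standard normal noise, a thousand small steps move the cloud of samples onto the target distribution. A real model differs only in that `eps_optimal` is replaced by a learned network and the data are images rather than scalars.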
So what does it actually do
On unconditional CIFAR-10, the model with the $L_{\text{simple}}$ objective reaches an Inception Score of $9.46$ and a FID of $3.17$, which the paper reports as state of the art for the unconditional setting, ahead of the strong GANs and most class-conditional models (one tuned conditional GAN, StyleGAN2 with ADA, scores lower at $2.67$). On LSUN it produced samples on par with ProgressiveGAN, and on CelebA-HQ the faces are sharp. A class of model nobody had pushed to high quality was suddenly competitive with the best.
The ablation is the part that justifies the design. Predicting the noise with the unweighted $L_{\text{simple}}$ clearly beat predicting the mean and beat training on the full weighted bound: FID $3.17$ for $\epsilon$-prediction with $L_{\text{simple}}$ versus $13.51$ for the same parameterization on the true bound. Learning the reverse-process variance, rather than fixing it, made training unstable. The simple choices won.
The honest weakness is log-likelihood. As a likelihood model DDPM is unremarkable: about $3.75$ bits/dim on CIFAR-10, behind the best autoregressive models. The authors trace it to where the bits go. More than half the codelength describes imperceptible pixel-level detail, so the model is an excellent lossy compressor that spends its likelihood budget on things the eye does not see. Trading likelihood for samples is exactly the bet $L_{\text{simple}}$ makes by dropping the weighting.
And the cost named above is real: a thousand sequential network calls per image. That single number set the agenda for the next few years. DDIM made the sampler deterministic and skippable so you could use tens of steps instead of a thousand. Latent diffusion moved the whole process into a compressed latent space to cut the per-step cost, which is what put diffusion behind Stable Diffusion. The score-based view this paper made explicit grew into the SDE framework that unified the field. But the spine never changed. Add noise on a fixed schedule, train one network to guess the noise, then walk the noise back out. The destroying half was always free. The paper's contribution was showing that the rebuilding half is just denoising, asked a thousand times.
Questions you might still have
Why does x_T end up as plain N(0, I), no matter the image?
Because the closed form (4) scales the signal by √ᾱₜ and adds noise of variance 1−ᾱₜ. With the paper’s schedule ᾱₜ falls to about 4×10⁻⁵ by t = 1000, so √ᾱₜ ≈ 0.006 (the image is gone) and the variance is ≈ 1. Every starting image lands in the same standard Gaussian.
Why predict the noise ε instead of the clean image x₀ or the mean μ̃?
They carry the same information (any one determines the others through eq 4). But ε-prediction makes the reverse mean (11) fall out cleanly and turns the loss into a plain MSE on a unit-variance target. The authors tried predicting μ̃ and x₀ too; ε-prediction with the unweighted loss gave the best samples.
If the loss drops the weighting from the true bound, is it still a valid objective?
It is a re-weighted variational bound. Dropping the per-t weights down-weights the easy, low-noise steps and lets the network spend capacity on the harder middle steps. It is no longer the tightest likelihood bound, which is why the model’s log-likelihoods are only okay, but it produces better samples.
Footnotes & further reading
- The paper: Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (UC Berkeley, NeurIPS 2020). Code.
- The framework DDPM builds on: Sohl-Dickstein, Weiss, Maheswaranathan, Ganguli, Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015).
- The denoising-score-matching line the ε-objective coincides with: Song, Ermon, Generative Modeling by Estimating Gradients of the Data Distribution (NCSN, 2019), and Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011).
- The faster, deterministic sampler: Song, Meng, Ermon, Denoising Diffusion Implicit Models (DDIM).
- The continuous-time unification of score-based and diffusion models: Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, Score-Based Generative Modeling through Stochastic Differential Equations.
- Diffusion in a compressed latent space, the basis of Stable Diffusion: Rombach, Blattmann, Lorenz, Esser, Ommer, High-Resolution Image Synthesis with Latent Diffusion Models.