
Denoising Diffusion Probabilistic Models

Destroy an image with noise, then learn the walk back.

The forward half throws an image into static and needs no learning at all. The whole model is one network trained to guess the noise. Run it in reverse and a Gaussian blob turns into a picture.

Explaining the paper: Denoising Diffusion Probabilistic Models. Ho, Jain, Abbeel · UC Berkeley · NeurIPS 2020 · arXiv:2006.11239

What if generating an image were just denoising, run a thousand times in a row?

By 2020 the way to generate a sharp image was a GAN: pit a generator against a discriminator and hope the arms race converges. GANs made beautiful samples and were miserable to train, prone to collapsing onto a handful of outputs and to never quite settling. Likelihood models like autoregressive Transformers and flows trained more honestly but lagged on sample quality. There was an older, quieter idea sitting in the corner, diffusion probabilistic models, that nobody had ever pushed to high quality.

This paper pushed it. The recipe is almost suspiciously plain. Take a clean image and add a little Gaussian noise. Add a little more. Keep going for a thousand tiny steps until the image is indistinguishable from television static. That part is mechanical, no network involved. Now train a single network to undo one step: given a noisy image, guess the noise that was added. To generate, start from pure static and apply that one network over and over, peeling noise off a bit at a time, until an image appears.

On CIFAR-10 it reached a FID of 3.17, state of the art for unconditional generation, finally competitive with the GANs on their own turf. The paper that did it is short, and the math underneath is a tower of five ideas, each one small. We will build them in order: the forward noising, the shortcut that lets you noise to any level in one step, the reverse process and the bound that trains it, the tractable target hiding inside that bound, and the reparameterization that collapses the whole thing into "predict the noise." Stack those and the paper falls out.

The forward process: a slow drown

Start with the easy direction, the one that destroys information. Call the clean image $\mathbf{x}_0$. The forward process (the paper also calls it the diffusion process) is a fixed Markov chain that adds a pinch of Gaussian noise at each of $T$ steps, producing a sequence of progressively noisier images $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T$. Each step is one Gaussian:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) := \mathcal{N}\!\big(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t\mathbf{I}\big) \tag{2}$$

Read the two pieces. The mean is the previous image shrunk by $\sqrt{1-\beta_t}$, a hair below one, so the signal fades a touch. The covariance is $\beta_t\mathbf{I}$, a fresh splash of isotropic noise. The $\beta_t$ are a fixed variance schedule, small numbers that say how much noise to add at step $t$. They are not learned. In the paper $\beta_t$ rises linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ over $T = 1000$ steps. The shrink-then-add structure is deliberate: it keeps the total variance from blowing up, which is why this is called the variance-preserving setup.

Chaining the steps gives the whole forward trajectory as a product of these Gaussians:

$$q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$$

Because each step only adds noise, the chain has no memory of structure to preserve and the destination is fixed in advance. After enough steps, whatever you started from has dissolved into the same featureless Gaussian. Drag the timestep below and watch a small wordmark drown:

Figure 1 · the forward process
A cloud of structured data (here the letters "ip") is shrunk toward the origin and buried under Gaussian noise as the step t climbs. By the last step the shape is gone and only static remains. No network is involved; the forward process is fixed.
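
The forward chain described above can be simulated directly. The following is a minimal NumPy sketch, not the paper's code: the helper name `forward_step` and the toy "image" of 20,000 identical pixels are assumptions made for illustration. It marches through all $T$ steps of equation (2) with the linear schedule and reports the statistics of what is left.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear schedule; betas[t-1] holds beta_t

def forward_step(x_prev, beta_t, rng):
    """One step of q(x_t | x_{t-1}): shrink the signal, then add fresh noise."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x = np.ones(20_000)                  # toy "image": every pixel equals 1.0 (illustrative)
for beta_t in betas:                 # march through all T steps, no shortcuts
    x = forward_step(x, beta_t, rng)

final_mean, final_std = x.mean(), x.std()
print(f"after {T} steps: mean={final_mean:.3f}, std={final_std:.3f}")
```

The constant input is deliberately extreme: if even that ends up with mean near zero and standard deviation near one, the starting structure has been erased.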

That is the entire forward half. It is intentionally dumb. All of the intelligence is going to live in running it backward, and to set that up we need one more fact about the forward chain.

Jump to any noise level in one step

Training will need noisy images at random timesteps. Marching the chain forward step by step to reach $\mathbf{x}_{700}$ would be slow. The forward process has a gift that makes this free: you can sample $\mathbf{x}_t$ directly from $\mathbf{x}_0$ in closed form, skipping every intermediate step. Define $\alpha_t := 1 - \beta_t$ and the running product $\bar{\alpha}_t := \prod_{s=1}^{t}\alpha_s$. Then a chain of Gaussians collapses into a single Gaussian:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\big(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big) \tag{4}$$

This is the workhorse of the whole method, so it is worth seeing why it holds. Each forward step multiplies the signal by $\sqrt{\alpha_t}$ and adds independent noise. Compose two steps and the signal is multiplied by $\sqrt{\alpha_t}\sqrt{\alpha_{t-1}}$, while the two independent noise injections add in variance (independent Gaussians combine by summing variances, and the earlier injection is shrunk by $\alpha_t$ along with the signal). Bookkeep that all the way down and the signal coefficient becomes $\sqrt{\bar{\alpha}_t}$, and the variances telescope to exactly $1-\bar{\alpha}_t$. The variance-preserving design is what makes the squared signal and noise weights sum to one, so the total stays put.
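
The bookkeeping argument can be checked numerically without sampling anything. The sketch below, which assumes the paper's linear schedule, tracks the coefficient on $\mathbf{x}_0$ and the accumulated noise variance one step at a time and compares both with the closed form (4). The variable names are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)            # abar[t-1] holds alpha-bar_t

signal, var = 1.0, 0.0               # at t = 0: full signal, no noise
max_gap = 0.0
for t in range(1, T + 1):
    a, b = alphas[t - 1], betas[t - 1]
    signal = np.sqrt(a) * signal     # each step shrinks the signal by sqrt(alpha_t)
    var = a * var + b                # old noise is shrunk too, then beta_t of new noise is added
    gap = max(abs(signal - np.sqrt(abar[t - 1])), abs(var - (1.0 - abar[t - 1])))
    max_gap = max(max_gap, gap)

print(f"largest mismatch with the closed form: {max_gap:.2e}")
```

The one-line recursion `var = a * var + b` is the whole telescoping argument: substituting `var = 1 - abar_prev` gives `a - abar + b = 1 - abar`, because `a + b = 1`.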

The practical form is the one you sample from. Draw a single Gaussian $\boldsymbol{\epsilon}$ and set

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$$

So $\bar{\alpha}_t$ is a single dial from clean to noise. Two coefficients ride on it: a signal weight $\sqrt{\bar{\alpha}_t}$ and a noise weight $\sqrt{1-\bar{\alpha}_t}$. Early on the signal weight is near one and the image is barely touched. By the end the signal weight has collapsed and the noise weight is near one. With the paper's schedule $\bar{\alpha}_t$ drops from $0.9999$ at $t=1$ to about $4\times 10^{-5}$ at $t=1000$, so $\sqrt{\bar{\alpha}_T} \approx 0.006$ and the residual signal is a rounding error. That is the precise reason $\mathbf{x}_T$ is a standard Gaussian regardless of the input. Watch the two weights trade off, and note where they cross:

Figure 2 · the noise schedule
The signal weight √āₜ falls from 1 to near 0; the noise weight √(1−āₜ) rises the other way. They cross near t ≈ 259, where the signal-to-noise ratio is 1. Past the crossing the image is mostly noise. The closed form (4) is what lets training sample any t directly.
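
The schedule numbers quoted above are easy to reproduce. This is a small sketch assuming the linear schedule with $T = 1000$; it builds the $\bar{\alpha}_t$ table once and reads off the end points and the first timestep at which noise outweighs signal. The names `signal_w`, `noise_w`, and `t_cross` are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)       # abar[t-1] holds alpha-bar_t

signal_w = np.sqrt(abar)             # weight on x_0 in eq. (4)
noise_w = np.sqrt(1.0 - abar)        # weight on epsilon in eq. (4)
snr = abar / (1.0 - abar)            # signal-to-noise ratio in variance terms

# First timestep at which the SNR has dropped below 1 (noise outweighs signal).
t_cross = int(np.argmax(snr < 1.0)) + 1

print(f"abar_1    = {abar[0]:.4f}")
print(f"abar_1000 = {abar[-1]:.2e}, signal weight = {signal_w[-1]:.4f}")
print(f"abar_400  = {abar[399]:.3f}")
print(f"SNR falls below 1 at t = {t_cross}")
```

Because the table is computed once and indexed by `t`, drawing a noisy image at any level costs one lookup and one multiply-add, which is the property training relies on.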

The crossing matters later. Around it, the image is half signal and half noise, and that middle band is where the reverse process does its real work. Near $t=0$ there is almost nothing to fix, and near $t=T$ there is almost no signal to find.

Reversing it: learn to undo a step

Now the hard direction. We want a reverse process that starts from pure noise and walks back up the chain to a clean image. The true reverse of the forward process, $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$, is intractable, because computing it would require knowing the distribution of all images. So we learn an approximation. The paper's key structural bet is that when the per-step noise $\beta_t$ is small, the true reverse step is itself very close to Gaussian, so a Gaussian with a learned mean is enough to model it. Define the reverse chain as

$$p_\theta(\mathbf{x}_{0:T}) := p(\mathbf{x}_T)\prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t), \quad p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t) := \mathcal{N}\!\big(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t,t),\ \boldsymbol{\Sigma}_\theta(\mathbf{x}_t,t)\big) \tag{1}$$

It starts at $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0},\mathbf{I})$, the same Gaussian the forward process drowns everything into, and each step is a Gaussian whose mean $\boldsymbol{\mu}_\theta$ the network predicts. (DDPM does not learn the variance; it fixes $\boldsymbol{\Sigma}_\theta = \sigma_t^2\mathbf{I}$ to a schedule constant, and reports that learning it hurt. More on $\sigma_t$ when we sample.)

How do you train a chain like this? With the same tool that trains a variational autoencoder: maximize a lower bound on the data's log-likelihood, equivalently minimize an upper bound on the negative log-likelihood. The latents $\mathbf{x}_{1:T}$ here are just the noisy versions of the image, and the bound is

$$\mathbb{E}\!\left[-\log p_\theta(\mathbf{x}_0)\right] \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\right] =: L \tag{3}$$

That single fraction looks unfriendly, but it untangles. Because both chains factor over $t$, the bound rewrites as a sum of per-step terms, and each term turns out to be a comparison between two Gaussians:

$$L = \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\!\big(q(\mathbf{x}_T\mid\mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big)}_{L_T} + \sum_{t>1}\underbrace{D_{\mathrm{KL}}\!\big(q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)\big)}_{L_{t-1}} \;\underbrace{-\,\log p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)}_{L_0}\Big] \tag{5}$$

Three kinds of terms. $L_T$ has no parameters (the forward process is fixed and ends at a Gaussian, so this is a near-zero constant). $L_0$ is the final decode back to a discrete image. The work is in the middle sum: for each step, the KL between the model's reverse step $p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$ and a special distribution $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$. That second distribution is the one to understand next, because it is the target the network is chasing, and it has a clean closed form.
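
For readers who want the missing algebra between (3) and (5), here is a sketch of the standard rearrangement (the paper gives a version of it in its appendix). It uses only the two factorizations already defined plus Bayes' rule on the forward step; the intermediate lines are a reconstruction and are not quoted from the paper.

```latex
\begin{aligned}
L &= \mathbb{E}_q\Big[-\log p(\mathbf{x}_T)
     - \sum_{t\ge 1}\log\frac{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{q(\mathbf{x}_t\mid\mathbf{x}_{t-1})}\Big] \\
% Bayes' rule, valid for t > 1 because the forward chain is Markov:
%   q(x_t | x_{t-1}) = q(x_{t-1} | x_t, x_0) q(x_t | x_0) / q(x_{t-1} | x_0)
  &= \mathbb{E}_q\Big[-\log p(\mathbf{x}_T)
     - \sum_{t>1}\log\frac{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)}
     - \sum_{t>1}\log\frac{q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)}{q(\mathbf{x}_t\mid\mathbf{x}_0)}
     - \log\frac{p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)}{q(\mathbf{x}_1\mid\mathbf{x}_0)}\Big] \\
% The second sum telescopes to log q(x_1 | x_0) - log q(x_T | x_0):
  &= \mathbb{E}_q\Big[-\log\frac{p(\mathbf{x}_T)}{q(\mathbf{x}_T\mid\mathbf{x}_0)}
     - \sum_{t>1}\log\frac{p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)}{q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)}
     - \log p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)\Big]
\end{aligned}
```

Taking the expectation of each log-ratio over the appropriate conditional of $q$ turns the first two groups into the KL terms $L_T$ and $L_{t-1}$ of (5), and leaves the last term as $L_0$.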

The tractable target

The plain reverse step $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$ is intractable. But condition it also on the original clean image $\mathbf{x}_0$ and it becomes a tidy Gaussian. This is the forward-process posterior: given where you are now ($\mathbf{x}_t$) and where you started ($\mathbf{x}_0$), what was the slightly-less-noisy $\mathbf{x}_{t-1}$ in between?

$$q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0) = \mathcal{N}\!\big(\mathbf{x}_{t-1};\ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0),\ \tilde{\beta}_t\mathbf{I}\big) \tag{6}$$

$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t, \qquad \tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t \tag{7}$$

Do not let the coefficients scare you. The mean $\tilde{\boldsymbol{\mu}}_t$ is a weighted blend of two points you already have: the clean image $\mathbf{x}_0$ and the current noisy $\mathbf{x}_t$. (The two weights sum to nearly, but not exactly, one.) One denoising step lands close to the line between them, leaning toward whichever the schedule trusts more at this $t$. And $\tilde{\beta}_t$ is a small variance, the wobble around that mean. (With the paper's schedule $\tilde{\beta}_t$ is nearly equal to $\beta_t$ across most of the chain, e.g. $0.01003$ versus $0.01004$ at $t=500$, which is why the two variance choices for sampling behave so similarly.)
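
The posterior coefficients in (7) are just three more precomputed tables. The sketch below assumes the linear schedule and the convention $\bar{\alpha}_0 = 1$; the array and function names (`coef_x0`, `coef_xt`, `beta_tilde`, `posterior_mean`) are made up for this example.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)
abar_prev = np.concatenate(([1.0], abar[:-1]))   # alpha-bar_{t-1}, with alpha-bar_0 = 1

# Coefficients of eq. (7); index t-1 holds the value for timestep t.
coef_x0 = np.sqrt(abar_prev) * betas / (1.0 - abar)
coef_xt = np.sqrt(alphas) * (1.0 - abar_prev) / (1.0 - abar)
beta_tilde = (1.0 - abar_prev) / (1.0 - abar) * betas

def posterior_mean(x0, xt, t):
    """Mean of q(x_{t-1} | x_t, x_0) for a timestep t in 1..T."""
    return coef_x0[t - 1] * x0 + coef_xt[t - 1] * xt

for t in (1, 500, 1000):
    i = t - 1
    print(f"t={t:4d}  w_x0={coef_x0[i]:.5f}  w_xt={coef_xt[i]:.5f}  "
          f"beta={betas[i]:.5f}  beta_tilde={beta_tilde[i]:.5f}")
```

At $t = 1$ the weight on $\mathbf{x}_0$ is one and $\tilde{\beta}_1 = 0$, since conditioning on the clean image leaves nothing uncertain about $\mathbf{x}_0$ itself. Deeper into the chain almost all the weight sits on $\mathbf{x}_t$.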

This is the target the model copies. Both $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$ and the model step $p_\theta$ are Gaussians whose variances are fixed rather than learned, so the only part of the KL in (5) that depends on $\theta$ is a squared distance between their means. Up to a constant $C$ that does not depend on $\theta$,

$$L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\big\lVert \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t,t)\big\rVert^2\right] + C \tag{8}$$

That is the whole training signal, stated plainly: make the network's predicted mean $\boldsymbol{\mu}_\theta$ match the posterior mean $\tilde{\boldsymbol{\mu}}_t$. The figure below fixes a clean point and a noisy point and shows where one reverse step lands as you slide $t$:

Figure 3 · the posterior target
Given the noisy xₜ and the clean x₀, the posterior says the less-noisy xₜ₋₁ is a Gaussian (the faint cloud) centered on a weighted blend of the two, with variance β̃ₜ. As t falls toward 0, the blend leans harder toward x₀: the network just has to reproduce that center.

Predict the noise, not the mean

The network could predict $\tilde{\boldsymbol{\mu}}_t$ directly. The paper found a better target by substituting the closed form (4) into the posterior mean. Recall $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$. That means the clean image is recoverable from the noisy one if you know the noise: $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon})/\sqrt{\bar{\alpha}_t}$. Plug that into $\tilde{\boldsymbol{\mu}}_t$ and the algebra simplifies to a mean written entirely in terms of $\mathbf{x}_t$ and the noise $\boldsymbol{\epsilon}$. So instead of predicting the mean, have the network predict the noise, $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$, and assemble the mean from it:

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t,t) = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right) \tag{11}$$

Read it as a recipe for one reverse step. Take the noisy $\mathbf{x}_t$, subtract a scaled version of the network's noise guess, and rescale by $1/\sqrt{\alpha_t}$ to undo that step's shrink. The network only ever has to answer one question, at every noise level: what noise is in this image?

Why is "guess the noise" the same as "denoise"? Because the noise and the clean image are two ends of the same identity (4): pin one and the other is fixed. Predicting $\boldsymbol{\epsilon}$, predicting $\mathbf{x}_0$, and predicting $\tilde{\boldsymbol{\mu}}_t$ are three views of one quantity. The figure shows the geometry: the clean point gets shrunk to $\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0$, and a noise vector pushes it out to $\mathbf{x}_t$. That arrow is exactly what the network is asked to return.

Figure 4 · the noise-prediction target
The clean x₀ is scaled down to √āₜ·x₀, then the added noise √(1−āₜ)·ε pushes it to the noisy xₜ. The network outputs ε, the direction of that arrow. Knowing ε is the same as knowing x₀, since (4) ties them together. As t grows the signal shrinks and the noise takes over.
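
The "three views of one quantity" claim can be verified numerically. In this sketch the true noise stands in for a perfect network, and the helper names (`x0_from_eps`, `mean_from_x0`, `mean_from_eps`) are invented for the example. It recovers $\mathbf{x}_0$ from $(\mathbf{x}_t, \boldsymbol{\epsilon})$, then checks that the posterior mean (7) and the noise form (11) give the same vector.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)
abar_prev = np.concatenate(([1.0], abar[:-1]))   # alpha-bar_{t-1}, alpha-bar_0 = 1

def x0_from_eps(xt, eps, t):
    """Invert eq. (4): recover the clean image from x_t and the noise."""
    return (xt - np.sqrt(1.0 - abar[t - 1]) * eps) / np.sqrt(abar[t - 1])

def mean_from_x0(xt, x0, t):
    """Posterior mean written in terms of x_0, eq. (7)."""
    i = t - 1
    c0 = np.sqrt(abar_prev[i]) * betas[i] / (1.0 - abar[i])
    ct = np.sqrt(alphas[i]) * (1.0 - abar_prev[i]) / (1.0 - abar[i])
    return c0 * x0 + ct * xt

def mean_from_eps(xt, eps, t):
    """The same mean written in terms of the noise, eq. (11)."""
    i = t - 1
    return (xt - betas[i] / np.sqrt(1.0 - abar[i]) * eps) / np.sqrt(alphas[i])

rng = np.random.default_rng(0)
t = 300
x0 = rng.uniform(-1.0, 1.0, size=3072)           # stand-in for a flattened CIFAR-10 image
eps = rng.standard_normal(3072)
xt = np.sqrt(abar[t - 1]) * x0 + np.sqrt(1.0 - abar[t - 1]) * eps

x0_rec = x0_from_eps(xt, eps, t)
gap = np.max(np.abs(mean_from_x0(xt, x0_rec, t) - mean_from_eps(xt, eps, t)))
print(f"max |eq.(7) - eq.(11)| = {gap:.2e}")
```

With a real network, `eps` would be replaced by the prediction $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$, and `x0_from_eps` would then give the network's implied guess of the clean image.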

Now the payoff. Writing $L_{t-1}$ in terms of $\boldsymbol{\epsilon}_\theta$ turns the mean distance (8) into a distance between the true noise and the predicted noise, with a per-step weight out front. The paper then makes one empirical move: drop that weight. The training objective becomes a plain mean-squared error on the noise:

$$L_{\text{simple}}(\theta) := \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\!\left[\big\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\ t\big)\big\rVert^2\right] \tag{14}$$

That is the entire loss. Sample an image, sample a timestep, sample noise, build the noised image with one use of (4), ask the network for the noise, penalize the squared error. No adversary, no sampling during training, no chain to unroll. Dropping the weight is not free in theory (it stops being the tightest likelihood bound), but it down-weights the easy near-clean steps and pushes the network to spend its capacity on the harder middle of the chain, and it produced the paper's best samples. The connection the authors prize is right here: $L_{\text{simple}}$ is denoising score matching across noise levels, the objective from Song and Ermon, and $\boldsymbol{\epsilon}_\theta$ is, up to a negative scale factor, an estimate of the score $\nabla_{\mathbf{x}}\log q(\mathbf{x}_t)$ (this is Tweedie's formula: a Gaussian denoiser is a score estimator). Two research lines, one objective.
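
To see what "drop that weight" changes, it helps to tabulate the weight itself. The expression below is the coefficient on the noise MSE in the paper's equation (12), which this article does not reproduce: $\beta_t^2 / \big(2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)\big)$. The sketch assumes the $\sigma_t^2 = \beta_t$ variance choice and the linear schedule.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

sigma2 = betas                        # assumption: the sigma_t^2 = beta_t variance choice
# Per-timestep weight on ||eps - eps_theta||^2 in the variational bound.
weight = betas**2 / (2.0 * sigma2 * alphas * (1.0 - abar))

for t in (1, 10, 100, 259, 500, 1000):
    print(f"t={t:4d}  bound weight on the noise MSE = {weight[t - 1]:.4f}")
```

Under these assumptions the bound puts by far its largest weight on the smallest timesteps. Replacing every weight with 1, as $L_{\text{simple}}$ does, therefore shifts relative emphasis away from the near-clean steps and toward the noisier ones.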

Sampling: the walk back

Training is done. To generate, run the reverse chain. Start from pure noise $\mathbf{x}_T\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and step down to $\mathbf{x}_0$. Each step computes the mean from the network's noise guess via (11), then adds a dab of fresh noise:

$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right) + \sigma_t\,\mathbf{z}, \qquad \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$$

(Note $1-\alpha_t = \beta_t$, so this is exactly the mean from (11) plus the noise term.) The added $\sigma_t\mathbf{z}$ is what keeps this a proper sample from a stochastic process rather than a deterministic average. The variance $\sigma_t^2$ is a fixed schedule choice, not learned. The paper tried $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde{\beta}_t$ and found them comparable; the released code defaults to $\sigma_t^2 = \beta_t$. On the very last step ($t=1$) the noise is dropped, so $\mathbf{z}=\mathbf{0}$ and you read off a clean image. Run it and watch noise condense onto the data:

Figure 5 · reverse sampling (200 denoising steps, xₜ → x₀)
Sampling runs the chain backward. Points start as a featureless Gaussian blob and, over many small denoising steps each using equation (11), condense onto the data clusters. This run uses the exact optimal noise-predictor for a mixture; a real ε_θ is a trained network.

Notice what each step is doing. The network looks at the current noisy image, guesses the noise, and the update nudges the image a little toward the implied clean version, then re-roughens it slightly so the chain can keep exploring. Early steps (high $t$) are coarse: the noise drowns out detail and the best guess is close to a blurry average. Late steps (low $t$) are fine: the image is nearly clean and the network sharpens edges. Coarse-to-fine, exactly the order the schedule's middle band predicted.

One training step, one sampling step

Make it fully concrete with CIFAR-10 shapes. An image is $32\times 32\times 3 = 3072$ numbers, scaled from the integer range $\{0,\dots,255\}$ linearly into $[-1,1]$. The schedule has $T=1000$ steps, $\beta$ linear from $10^{-4}$ to $0.02$, so $\bar{\alpha}_t$ is precomputed once as a length-1000 table. The network $\boldsymbol{\epsilon}_\theta$ is a U-Net that takes a noisy image plus the timestep $t$ (fed in through a sinusoidal embedding, the same trick Transformers use for position) and outputs a tensor the same $32\times 32\times 3$ shape as the image: its guess of the noise in every pixel.
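
The two input conventions in that paragraph, pixel scaling and the timestep embedding, can be sketched in a few lines. This is an illustrative version, not the released code: the function names, the embedding width of 128, and the choice to concatenate sines and cosines (rather than interleave them) are all assumptions.

```python
import numpy as np

def scale_pixels(img_uint8):
    """Map integer pixels {0, ..., 255} linearly onto [-1, 1]."""
    return img_uint8.astype(np.float64) / 127.5 - 1.0

def timestep_embedding(t, dim=128, max_period=10_000.0):
    """Transformer-style sinusoidal embedding of integer timesteps.

    Returns an array of shape (len(t), dim). Assumes dim is even and that
    the sine half and cosine half are concatenated (an illustrative layout).
    """
    t = np.asarray(t, dtype=np.float64).reshape(-1, 1)
    half = dim // 2
    # Geometric ladder of frequencies from 1 down to roughly 1 / max_period.
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    angles = t * freqs                                   # shape (batch, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

img = np.array([[0, 128, 255]], dtype=np.uint8)
scaled = scale_pixels(img)
emb = timestep_embedding([0, 400, 1000])
print(scaled)
print(emb.shape)
```

The embedding gives the network a smooth, fixed-width description of the noise level, so one set of weights can serve every timestep.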

A training step. Draw a clean image $\mathbf{x}_0$ (shape $3072$). Draw a timestep, say $t=400$, for which $\bar{\alpha}_{400}\approx 0.195$, so the signal weight is $\sqrt{\bar{\alpha}_{400}}\approx 0.44$ and the noise weight is $\sqrt{1-\bar{\alpha}_{400}}\approx 0.90$ (already mostly noise). Draw $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ (same shape). Build $\mathbf{x}_{400} = 0.44\,\mathbf{x}_0 + 0.90\,\boldsymbol{\epsilon}$ in one line. Feed $(\mathbf{x}_{400}, 400)$ to the U-Net, get back $\boldsymbol{\epsilon}_\theta$, and the loss is the single number $\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\rVert^2$. Backprop, step, repeat:

# Algorithm 1: training (one step)
x0    = sample_batch()                  # clean image, scaled to [-1, 1]
t     = randint(1, T)                   # uniform timestep
eps   = randn_like(x0)                  # the noise we add (the answer)
xt    = sqrt(abar[t]) * x0 + sqrt(1 - abar[t]) * eps   # eq (4)
loss  = mse(eps, eps_theta(xt, t))      # L_simple, eq (14)
loss.backward(); opt.step()
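
The pseudocode above leaves the network and optimizer abstract. As a runnable companion, here is a NumPy sketch of just the loss computation, with two stand-in predictors instead of a U-Net. Everything named here (`l_simple`, `zero_model`, `oracle_model`, the constant toy dataset) is hypothetical, and the loss is averaged per element rather than summed.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
abar = np.cumprod(1.0 - betas)                         # abar[t-1] holds alpha-bar_t

def l_simple(eps_model, x0, rng):
    """Monte Carlo estimate of L_simple (per-element mean) for a batch x0 of shape (n, d)."""
    n = x0.shape[0]
    t = rng.integers(1, T + 1, size=n)                 # uniform over {1, ..., T}
    eps = rng.standard_normal(x0.shape)                # the noise we add (the answer)
    a = abar[t - 1][:, None]                           # broadcast over the pixel axis
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps      # eq. (4)
    return np.mean((eps - eps_model(xt, t)) ** 2)

# Toy dataset: every "image" is the same constant vector, so the noise is exactly recoverable.
x_star = np.full(3072, 0.5)
batch = np.tile(x_star, (64, 1))

def zero_model(xt, t):
    """Baseline that always answers 'there is no noise'."""
    return np.zeros_like(xt)

def oracle_model(xt, t):
    """Perfect predictor for this toy dataset: solve eq. (4) for eps."""
    a = abar[t - 1][:, None]
    return (xt - np.sqrt(a) * x_star) / np.sqrt(1.0 - a)

rng = np.random.default_rng(0)
loss_zero = l_simple(zero_model, batch, rng)
loss_oracle = l_simple(oracle_model, batch, rng)
print(f"zero predictor: {loss_zero:.3f}   oracle: {loss_oracle:.2e}")
```

The baseline's loss sits near 1 because the target is unit-variance noise at every timestep, which is one reason the objective is so well behaved. A trained network lands somewhere between the two.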

Sampling reverses it. Start from $\mathbf{x}_{1000}$, pure Gaussian noise of shape $3072$. For $t = 1000, 999, \dots, 1$, call the network once, assemble the mean with (11), add $\sigma_t\mathbf{z}$ (except on the last step), and pass the result down. After 1000 network calls you have a sample $\mathbf{x}_0$:

# Algorithm 2: sampling
x = randn(n, *shape)                    # x_T ~ N(0, I)
for t in [T, T-1, ..., 1]:
    eps = eps_theta(x, t)               # the network's noise guess
    mean = (x - (1 - alpha[t]) / sqrt(1 - abar[t]) * eps) / sqrt(alpha[t])  # eq 11
    z = randn_like(x) if t > 1 else 0
    x = mean + sqrt(sigma2[t]) * z       # sigma2[t] = beta[t]  (fixedlarge)
return x                                 # x_0, a sample
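
Algorithm 2 can be run end to end if a closed-form noise predictor replaces the trained network. The sketch below assumes a toy one-dimensional dataset, $x_0 \sim \mathcal{N}(2, 0.5^2)$, for which the optimal predictor $\mathbb{E}[\epsilon \mid x_t]$ is known exactly. The names `eps_oracle`, `DATA_MEAN`, and `DATA_STD` are illustrative; a real $\boldsymbol{\epsilon}_\theta$ is a trained U-Net.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

DATA_MEAN, DATA_STD = 2.0, 0.5        # toy data distribution (an assumption for this demo)

def eps_oracle(x, t):
    """Exact E[eps | x_t] when x_0 ~ N(DATA_MEAN, DATA_STD^2); stands in for the network."""
    a = abar[t - 1]
    marginal_var = a * DATA_STD**2 + (1.0 - a)          # Var[x_t] under eq. (4)
    return np.sqrt(1.0 - a) * (x - np.sqrt(a) * DATA_MEAN) / marginal_var

def sample(n, rng):
    x = rng.standard_normal(n)                          # x_T ~ N(0, 1)
    for t in range(T, 0, -1):                           # t = T, T-1, ..., 1
        i = t - 1
        eps = eps_oracle(x, t)
        mean = (x - betas[i] / np.sqrt(1.0 - abar[i]) * eps) / np.sqrt(alphas[i])  # eq. (11)
        z = rng.standard_normal(n) if t > 1 else np.zeros(n)   # no noise on the last step
        x = mean + np.sqrt(betas[i]) * z                # sigma_t^2 = beta_t
    return x

samples = sample(20_000, np.random.default_rng(0))
print(f"sample mean={samples.mean():.3f}, std={samples.std():.3f}")
```

Starting from standard Gaussian noise, the samples end up close to the toy data distribution, which is the behavior the pseudocode promises. The loop also makes the cost visible: `eps_oracle` is called once per timestep, and with a real network each of those calls is a full U-Net forward pass.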

One thing the shapes make obvious: the cost. Training is cheap per step, just one forward and backward pass on one noised image. Sampling is the expensive part, because it is sequential and needs one network call per timestep. A thousand calls to make a single picture. That is the bill diffusion pays for stable training and sharp samples, and it is the bottleneck every follow-up paper has tried to cut.

So what does it actually do

On unconditional CIFAR-10, the model with the $L_{\text{simple}}$ objective reaches an Inception Score of $9.46$ and a FID of $3.17$, which the paper reports as state of the art for the unconditional setting, ahead of the strong GANs and most class-conditional models (one tuned conditional GAN, StyleGAN2 with ADA, scores lower at $2.67$). On $256\times 256$ LSUN it produced samples on par with ProgressiveGAN, and on CelebA-HQ the faces are sharp. A class of model nobody had pushed to high quality was suddenly competitive with the best.

The ablation is the part that justifies the design. Predicting the noise $\boldsymbol{\epsilon}$ with the unweighted $L_{\text{simple}}$ clearly beat predicting the mean $\tilde{\boldsymbol{\mu}}$ and beat training on the full weighted bound: FID $3.17$ for $\boldsymbol{\epsilon}$-prediction with $L_{\text{simple}}$ versus $13.5$ for the same parameterization on the true bound. Learning the reverse-process variance, rather than fixing it, made training unstable. The simple choices won.

The honest weakness is log-likelihood. As a likelihood model DDPM is unremarkable: about $3.75$ bits/dim on CIFAR-10, behind the best autoregressive models. The authors trace it to where the bits go. More than half the codelength describes imperceptible pixel-level detail, so the model is an excellent lossy compressor that spends its likelihood budget on things the eye does not see. Trading likelihood for samples is exactly the bet $L_{\text{simple}}$ makes by dropping the weighting.

And the cost named above is real: a thousand sequential network calls per image. That single number set the agenda for the next few years. DDIM made the sampler deterministic and skippable so you could use tens of steps instead of a thousand. Latent diffusion moved the whole process into a compressed latent space to cut the per-step cost, which is what put diffusion behind Stable Diffusion. The score-based view this paper made explicit grew into the SDE framework that unified the field. But the spine never changed. Add noise on a fixed schedule, train one network to guess the noise, then walk the noise back out. The destroying half was always free. The paper's contribution was showing that the rebuilding half is just denoising, asked a thousand times.

Provenance · Verified against primary literature
DDPM (2020): Ho, Jain, Abbeel. The forward/reverse processes, the variational bound, the ε-parameterization (11), L_simple (14), and the CIFAR-10 numbers.
Official code: hojonathanho/diffusion, diffusion_utils_2.py. Confirms the linear β schedule, posterior coefficients (7), and the default sampler.
Sohl-Dickstein et al. (2015): The original diffusion-probabilistic-model framework DDPM builds on.
Song & Ermon (2019): Denoising score matching across noise levels, the connection DDPM makes explicit.
Tweedie / Robbins (1956): A Gaussian denoiser is a score estimator, the bridge between ε-prediction and the score.
Correction: No paper-vs-code discrepancy in the math: the released diffusion_utils_2.py matches equations (4), (7), (11), and Algorithm 2 exactly. One default worth naming: the code samples with σₜ² = βₜ (model_var_type "fixedlarge"), the larger of the two choices the paper says performed similarly.

Questions you might still have

Why does x_T end up as plain N(0, I), no matter the image?
Because the closed form (4) scales the signal by √āₜ and adds noise of variance 1−āₜ. With the paper’s schedule āₜ falls to about 4×10⁻⁵ by t = 1000, so √āₜ ≈ 0.006 (the image is gone) and the variance is ≈ 1. Every starting image lands in the same standard Gaussian.

Why predict the noise ε instead of the clean image x₀ or the mean μ̃?
They carry the same information (any one determines the others through eq 4). But ε-prediction makes the reverse mean (11) fall out cleanly and turns the loss into a plain MSE on a unit-variance target. The authors tried predicting μ̃ and x₀ too; ε-prediction with the unweighted loss gave the best samples.

If the loss drops the weighting from the true bound, is it still a valid objective?
It is a re-weighted variational bound. Dropping the per-t weights down-weights the easy, low-noise steps and lets the network spend capacity on the harder middle steps. It is no longer the tightest likelihood bound, which is why the model’s log-likelihoods are only okay, but it produces better samples.

Footnotes & further reading

  1. The paper: Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (UC Berkeley, NeurIPS 2020). Code.
  2. The framework DDPM builds on: Sohl-Dickstein, Weiss, Maheswaranathan, Ganguli, Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015).
  3. The denoising-score-matching line the ε-objective coincides with: Song, Ermon, Generative Modeling by Estimating Gradients of the Data Distribution (NCSN, 2019), and Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011).
  4. The faster, deterministic sampler: Song, Meng, Ermon, Denoising Diffusion Implicit Models (DDIM).
  5. The continuous-time unification of score-based and diffusion models: Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, Score-Based Generative Modeling through Stochastic Differential Equations.
  6. Diffusion in a compressed latent space, the basis of Stable Diffusion: Rombach, Blattmann, Lorenz, Esser, Ommer, High-Resolution Image Synthesis with Latent Diffusion Models.