Denoising Diffusion Probabilistic Models

Destroy an image with noise, then learn to reverse it.

The forward half throws an image into static and needs no learning at all. The model is a single network trained to guess the noise. Run it in reverse and a Gaussian blob turns into a picture.

Explaining the paper: Denoising Diffusion Probabilistic Models. Ho, Jain, Abbeel · UC Berkeley · NeurIPS 2020 · arXiv:2006.11239

Generating an image is denoising, run a thousand times in a row, starting from pure static.

By 2020 the way to generate a sharp image was a GAN: pit a generator against a discriminator and hope the arms race converges. GANs made beautiful samples and were miserable to train, prone to collapsing onto a handful of outputs and to never quite settling. Likelihood-based models such as autoregressive Transformers and flows trained with a proper likelihood objective but lagged on sample quality. An older idea, diffusion probabilistic models, had been sitting largely unused since 2015, and nobody had pushed it to high sample quality.

This paper pushed it. The recipe is plain. Take a clean image and add a little Gaussian noise. Add a little more. Keep going for a thousand tiny steps until the image is indistinguishable from television static. That part is mechanical, no network involved. Now train a single network to undo one step: given a noisy image, guess the noise that was added. To generate, start from pure static and apply that one network over and over, peeling noise off a bit at a time, until an image appears.

The paper is short, and the math underneath comes down to five small ideas: the forward noising, the shortcut that lets you noise to any level in one step, the reverse process and the bound that trains it, the tractable target inside that bound, and the reparameterization that folds everything into "predict the noise." They add up to a method that worked: on CIFAR-10 it reached an FID (Fréchet Inception Distance, a sample-quality score where lower is better) of 3.17, state of the art at the time for unconditional generation, finally competitive with the GANs on their own turf.

The forward process: add noise until it is static

The easy direction comes first: the one that destroys information. Call the clean image $\mathbf{x}_0$ . The forward process (the paper also calls it the diffusion process) is a fixed Markov chain that adds a pinch of Gaussian noise at each of $T$ steps, producing a sequence of progressively noisier images $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T$ . Each step is one Gaussian:

q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) := \mathcal{N}\!\big(\mathbf{x}_t;\ \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\ \beta_t\mathbf{I}\big)

(2)

The expression has two pieces. The mean is the previous image shrunk by $\sqrt{1-\beta_t}$, a hair below one, so the signal fades a touch. The covariance is $\beta_t\mathbf{I}$, a fresh splash of isotropic noise. The $\beta_t$ are a fixed variance schedule, small numbers that say how much noise to add at step $t$. They are not learned. In the paper $\beta_t$ rises linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$ over $T = 1000$ steps. The shrink-then-add structure is deliberate, the variance-preserving setup: without the shrink, a thousand successive noise additions would pile variance on top of variance with no ceiling. With it, each step maps a per-pixel variance $v$ to $(1-\beta_t)\,v + \beta_t$, so the gap between the total variance and one shrinks by a factor of $1-\beta_t$ at every step, and every image decays into the same standard Gaussian, the one known distribution sampling will later start from.

Chaining the steps gives the forward trajectory as a product of these Gaussians:

q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) := \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1})

Because each step only adds noise, the chain has no memory of structure to preserve and the destination is fixed in advance. After enough steps, whatever you started from has dissolved into the same featureless Gaussian. Drag the timestep below and watch a small wordmark drown:

Figure 1 · the forward process

A cloud of structured data (here the letters "ip") is shrunk toward the origin and buried under Gaussian noise as the step t climbs. By the last step the shape is gone and only static remains. No network is involved; the forward process is fixed.
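The forward chain in (2) is short enough to simulate directly. What follows is a minimal NumPy sketch, assuming the paper's linear schedule; the two-valued toy dataset and the names `betas` and `forward_chain` are illustrative choices, not anything from the paper or its code.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # linear schedule; betas[t-1] holds beta_t

def forward_chain(x0, betas, rng):
    """Apply the single-step Gaussian of eq. (2) T times and return x_T."""
    x = x0.copy()
    for beta in betas:
        noise = rng.standard_normal(x.shape)
        x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * noise  # shrink, then add
    return x

rng = np.random.default_rng(0)
# Toy "dataset": 20,000 scalars sitting at -1 or +1 (clearly not Gaussian).
x0 = rng.choice([-1.0, 1.0], size=20_000)
xT = forward_chain(x0, betas, rng)
xT_mean, xT_std = xT.mean(), xT.std()
```

With this setup `xT_mean` should come out close to 0 and `xT_std` close to 1, even though the starting points were nothing like a Gaussian.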

The forward half ends there, and it does no learning. All of the learning happens in running it backward. That needs one more fact about the forward chain.

Jump to any noise level in one step

Training will need noisy images at random timesteps. Marching the chain forward step by step to reach $\mathbf{x}_{700}$ would be slow. The forward process has a property that avoids the step-by-step march: you can sample $\mathbf{x}_t$ directly from $\mathbf{x}_0$ in closed form, skipping every intermediate step. Define $\alpha_t := 1 - \beta_t$ and the running product $\bar{\alpha}_t := \prod_{s=1}^{t}\alpha_s$ . Then a chain of Gaussians collapses into a single Gaussian:

q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\big(\mathbf{x}_t;\ \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big)

(4)

This equation reappears throughout the method, so it is worth seeing why it holds. Each forward step multiplies the signal by $\sqrt{\alpha_t}$ and adds independent noise. Compose two steps and the signal is multiplied by $\sqrt{\alpha_t}\sqrt{\alpha_{t-1}}$, while the two independent noise injections add in variance (independent Gaussians combine by summing variances). Bookkeep that all the way down and the signal coefficient becomes $\sqrt{\bar{\alpha}_t}$, and the variances telescope to exactly $1-\bar{\alpha}_t$. The variance-preserving design is what makes the squares of the signal and noise weights sum to one, so the total stays put. And the single-Gaussian result rests on an asymmetry: independent noises never cancel, so their variances only ever stack, which is why the chain compresses to one Gaussian.
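The two-step case can be written out in full. This is a worked sketch using only the definitions above, with $\boldsymbol{\epsilon}_{t-1}$ and $\boldsymbol{\epsilon}_t$ standing for the independent standard Gaussians drawn at the two steps (that naming is ours):

```latex
\begin{aligned}
\mathbf{x}_t
  &= \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_t \\
  &= \sqrt{\alpha_t}\Big(\sqrt{\alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-1}\Big) + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_t \\
  &= \sqrt{\alpha_t\alpha_{t-1}}\,\mathbf{x}_{t-2}
     + \underbrace{\sqrt{\alpha_t(1-\alpha_{t-1})}\,\boldsymbol{\epsilon}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_t}_{\text{Gaussian with variance } \alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) \;=\; 1-\alpha_t\alpha_{t-1}}
\end{aligned}
```

The merged noise term has the same form as a single step with $\alpha_t\alpha_{t-1}$ in place of $\alpha_t$, so repeating the argument down to $\mathbf{x}_0$ gives (4).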

The practical form is the one you sample from. Draw a single Gaussian $\boldsymbol{\epsilon}$ and set

\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})

So $\bar{\alpha}_t$ is a single dial from clean to noise. Two coefficients ride on it: a signal weight $\sqrt{\bar{\alpha}_t}$ and a noise weight $\sqrt{1-\bar{\alpha}_t}$. Early on the signal weight is near one and the image is barely touched. By the end the signal weight has fallen away and the noise weight is near one. With the paper's schedule $\bar{\alpha}_t$ drops from $0.9999$ at $t=1$ to about $4\times 10^{-5}$ at $t=1000$, so $\sqrt{\bar{\alpha}_T} \approx 0.006$ and the residual signal is a rounding error. That is the precise reason $\mathbf{x}_T$ is, to a very good approximation, a standard Gaussian regardless of the input. The two weights trade off, and they cross:
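The closed form is a one-liner once the running product is tabulated. A minimal NumPy sketch, assuming the linear schedule quoted above; `q_sample` is a name chosen here for illustration, and the uniform random array is only a stand-in for real images.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # beta_1 ... beta_T
alphas = 1.0 - betas
abar = np.cumprod(alphas)              # abar[t-1] holds alpha-bar_t

def q_sample(x0, t, eps):
    """Closed form (4): jump from x_0 straight to x_t (t is 1-indexed)."""
    a = abar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(4, 3072))   # stand-in for four flattened images
eps = rng.standard_normal(x0.shape)
x700 = q_sample(x0, 700, eps)                 # no need to visit steps 1..699

abar_first, abar_last = abar[0], abar[-1]
signal_weight_last = np.sqrt(abar_last)
```

Here `abar_first` is 0.9999, `abar_last` is on the order of $4\times 10^{-5}$, and `signal_weight_last` is roughly 0.006, matching the figures in the text.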

Figure 2 · the noise schedule

The signal weight √āₜ falls from 1 to near 0; the noise weight √(1−āₜ) rises the other way. They cross near t ≈ 259, where the signal-to-noise ratio is 1. Past the crossing the image is mostly noise. The closed form (4) lets training sample any t directly.

That crossing will matter once the chain runs backward: around it the image is half signal and half noise, and the reverse process does its real work in that middle band. Near $t=0$ there is almost nothing to fix, and near $t=T$ there is almost no signal to find.
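The crossing point can be located numerically. A small sketch, assuming the same linear schedule and defining the signal-to-noise ratio in variance terms as $\bar{\alpha}_t/(1-\bar{\alpha}_t)$ (the variable names are ours):

```python
import numpy as np

T = 1000
abar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))   # abar[t-1] = alpha-bar_t
snr = abar / (1.0 - abar)                             # signal variance / noise variance

# 1-indexed step whose SNR is closest to 1 on a log scale
t_cross = int(np.argmin(np.abs(np.log(snr)))) + 1
abar_at_cross = abar[t_cross - 1]
```

`t_cross` should land within a step or so of the $t \approx 259$ quoted in the Figure 2 caption, with `abar_at_cross` very close to one half.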

Reversing it: learn to undo a step

Now the hard direction. We want a reverse process that starts from pure noise and walks back up the chain to a clean image. The true reverse of the forward process, $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$ , is intractable, because computing it would require knowing the distribution of all images. So we learn an approximation. The paper's key structural assumption is that when the per-step noise $\beta_t$ is small, the true reverse step is itself very close to Gaussian, so a Gaussian with a learned mean is enough to model it. Define the reverse chain as

p_\theta(\mathbf{x}_{0:T}) := p(\mathbf{x}_T)\prod_{t=1}^{T} p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t), \quad p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t) := \mathcal{N}\!\big(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t,t),\ \boldsymbol{\Sigma}_\theta(\mathbf{x}_t,t)\big)

(1)

It starts at $p(\mathbf{x}_T) = \mathcal{N}(\mathbf{0},\mathbf{I})$, the same Gaussian the forward process drowns everything into, and each step is a Gaussian whose mean $\boldsymbol{\mu}_\theta$ the network predicts. (DDPM does not learn the variance; it fixes $\boldsymbol{\Sigma}_\theta = \sigma_t^2\mathbf{I}$ to a schedule constant, and reports that learning it made training unstable and samples worse; the choice of $\sigma_t$ is settled at sampling time.)

How do you train a chain like this? With the same tool that trains a variational autoencoder (see the VAE explainer): maximize a lower bound on the data's log-likelihood (the ELBO), equivalently minimize an upper bound on the negative log-likelihood. The latents $\mathbf{x}_{1:T}$ here are the noisy versions of the image, and the bound is

\mathbb{E}\!\left[-\log p_\theta(\mathbf{x}_0)\right] \le \mathbb{E}_q\!\left[-\log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T}\mid\mathbf{x}_0)}\right] =: L

(3)

That single fraction is hard to read directly, but it simplifies. Because both chains factor over $t$ , the bound rewrites as a sum of per-step terms, and each term is a comparison between two Gaussians:

L = \mathbb{E}_q\Big[\underbrace{D_{\mathrm{KL}}\!\big(q(\mathbf{x}_T\mid\mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big)}_{L_T} + \sum_{t>1}\underbrace{D_{\mathrm{KL}}\!\big(q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)\big)}_{L_{t-1}} \;\underbrace{-\,\log p_\theta(\mathbf{x}_0\mid\mathbf{x}_1)}_{L_0}\Big]

(5)

Three kinds of terms. $L_T$ has no parameters (the forward process is fixed and ends at a Gaussian, so this is a near-zero constant). $L_0$ is the final decode back to a discrete image. The work is in the middle sum: for each step, the KL between the model's reverse step $p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$ and a special distribution $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$ . That second distribution is the target the loss compares the model's reverse step against, and it has a clean closed form.
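How close to zero is $L_T$? A rough check, assuming the paper's schedule and data scaled to $[-1, 1]$, using the standard closed-form KL between a one-dimensional Gaussian and $\mathcal{N}(0,1)$; the helper name is ours.

```python
import numpy as np

T = 1000
abar_T = np.prod(1.0 - np.linspace(1e-4, 0.02, T))   # alpha-bar_T

def kl_to_standard_normal(mean, var):
    """KL( N(mean, var) || N(0, 1) ), in nats per dimension."""
    return 0.5 * (var + mean**2 - 1.0 - np.log(var))

# Worst case for data scaled to [-1, 1]: a pixel sitting at the boundary.
x0 = 1.0
L_T_per_dim = kl_to_standard_normal(np.sqrt(abar_T) * x0, 1.0 - abar_T)
```

The result is a few times $10^{-5}$ nats per dimension at most, which is why the term can be ignored during training.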

The tractable target

The plain reverse step $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t)$ is intractable. But condition it also on the original clean image $\mathbf{x}_0$ and it becomes a tidy Gaussian. This is the forward-process posterior: given where you are now ( $\mathbf{x}_t$ ) and where you started ( $\mathbf{x}_0$ ), what was the slightly-less-noisy $\mathbf{x}_{t-1}$ in between?

q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0) = \mathcal{N}\!\big(\mathbf{x}_{t-1};\ \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0),\ \tilde{\beta}_t\mathbf{I}\big)

(6)

\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t, \qquad \tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t

(7)

The coefficients look heavy but reduce to a simple average. The mean $\tilde{\boldsymbol{\mu}}_t$ is a weighted average of two points you already have: the clean image $\mathbf{x}_0$ and the current noisy $\mathbf{x}_t$ . One denoising step lands somewhere on the line between them, pulled toward whichever the schedule weights more heavily at this $t$ . The two coefficients in (7) work that way: one scales $\mathbf{x}_0$ , the other scales $\mathbf{x}_t$ , and the schedule shifts the weight between them, near $t=0$ leaning almost entirely on the clean image, near $t=T$ almost entirely on the noisy one. And $\tilde{\beta}_t$ is a small variance, the wobble around that mean. (With the paper's schedule $\tilde{\beta}_t$ is nearly equal to $\beta_t$ across most of the chain, e.g. $0.01003$ versus $0.01004$ at $t=500$ , which is why the two variance choices for sampling behave so similarly.)
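The coefficients in (7) are cheap to tabulate for the whole chain. A minimal NumPy sketch under the paper's linear schedule, with the convention $\bar{\alpha}_0 := 1$; the array and function names are illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)
abar_prev = np.concatenate([[1.0], abar[:-1]])   # alpha-bar_{t-1}, with alpha-bar_0 := 1

# Coefficients of eq. (7); index t-1 holds the value for step t.
coef_x0 = np.sqrt(abar_prev) * betas / (1.0 - abar)
coef_xt = np.sqrt(alphas) * (1.0 - abar_prev) / (1.0 - abar)
beta_tilde = (1.0 - abar_prev) / (1.0 - abar) * betas

def posterior_mean(x0, xt, t):
    """Mean of q(x_{t-1} | x_t, x_0) for a 1-indexed step t."""
    return coef_x0[t - 1] * x0 + coef_xt[t - 1] * xt
```

At $t=1$ the weight on $\mathbf{x}_0$ is 1 and the weight on $\mathbf{x}_t$ is 0; at $t=T$ the weights are reversed to within about a percent. `beta_tilde[499]` and `betas[499]` reproduce the $0.01003$ versus $0.01004$ comparison at $t=500$.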

The model is trained to match this target. Both $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$ and the model step $p_\theta$ are Gaussians whose variances are fixed (neither depends on $\theta$), so the KL in (5) becomes a squared distance between their means plus a term the network cannot change. Up to that constant,

L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\big\lVert \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t,\mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t,t)\big\rVert^2\right] + C

(8)
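For reference, (8) follows from the standard KL divergence between two isotropic Gaussians in $d$ dimensions. Written with the posterior variance $\tilde{\beta}_t$ and the model variance $\sigma_t^2$:

```latex
D_{\mathrm{KL}}\!\Big(\mathcal{N}\big(\tilde{\boldsymbol{\mu}}_t,\ \tilde{\beta}_t\mathbf{I}\big)\ \Big\|\ \mathcal{N}\big(\boldsymbol{\mu}_\theta,\ \sigma_t^2\mathbf{I}\big)\Big)
= \frac{1}{2\sigma_t^2}\big\lVert \tilde{\boldsymbol{\mu}}_t - \boldsymbol{\mu}_\theta \big\rVert^2
+ \underbrace{\frac{d}{2}\left(\frac{\tilde{\beta}_t}{\sigma_t^2} - 1 - \ln\frac{\tilde{\beta}_t}{\sigma_t^2}\right)}_{C:\ \text{no dependence on } \theta}
```

The second term vanishes when $\sigma_t^2 = \tilde{\beta}_t$ and is a fixed positive constant otherwise; either way only the squared distance between the means carries gradient.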

The training signal reduces to one instruction: make the network's predicted mean $\boldsymbol{\mu}_\theta$ match the posterior mean $\tilde{\boldsymbol{\mu}}_t$ . The figure below fixes a clean point and a noisy point and shows where one reverse step lands as you slide $t$ :

Figure 3 · the posterior target

Given the noisy xₜ and the clean x₀, the posterior says the less-noisy xₜ₋₁ is a Gaussian (the faint cloud) centered on a weighted blend of the two, with variance β̃ₜ. As t falls toward 0, the blend leans harder toward x₀: the network only has to reproduce that center.

Predict the noise, not the mean

The network could predict $\tilde{\boldsymbol{\mu}}_t$ directly. The paper found a better target by substituting the closed form (4) into the posterior mean. Recall $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}$ . That means the clean image is recoverable from the noisy one if you know the noise: $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon})/\sqrt{\bar{\alpha}_t}$ . Plug that into $\tilde{\boldsymbol{\mu}}_t$ and the algebra simplifies to a mean written entirely in terms of $\mathbf{x}_t$ and the noise $\boldsymbol{\epsilon}$ . So instead of predicting the mean, have the network predict the noise, $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$ , and assemble the mean from it:

\boldsymbol{\mu}_\theta(\mathbf{x}_t,t) = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right)

(11)

This is a recipe for one reverse step. Take the noisy $\mathbf{x}_t$ , subtract a scaled version of the network's noise guess, and rescale by $1/\sqrt{\alpha_t}$ to undo that step's shrink. The network computes one output at every noise level: an estimate of the noise present in the image.
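The algebra behind (11) takes three lines. Substitute $\mathbf{x}_0 = (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon})/\sqrt{\bar{\alpha}_t}$ into (7) and use $\sqrt{\bar{\alpha}_{t-1}}/\sqrt{\bar{\alpha}_t} = 1/\sqrt{\alpha_t}$ and $\beta_t = 1-\alpha_t$:

```latex
\begin{aligned}
\tilde{\boldsymbol{\mu}}_t
  &= \frac{\beta_t}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}\Big(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}\Big)
   + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t \\
  &= \frac{\beta_t + \alpha_t - \bar{\alpha}_t}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}\,\mathbf{x}_t
   - \frac{\beta_t}{\sqrt{\alpha_t}\,\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon} \\
  &= \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}\right)
\end{aligned}
```

The last step uses $\beta_t + \alpha_t = 1$, so the $\mathbf{x}_t$ coefficient collapses to $1/\sqrt{\alpha_t}$. Replacing the true $\boldsymbol{\epsilon}$ with the network's guess $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)$ gives (11).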

Why is "guess the noise" the same as "denoise"? Because the noise and the clean image are two ends of the same identity (4): pin one and the other is fixed. Predicting $\boldsymbol{\epsilon}$ , predicting $\mathbf{x}_0$ , and predicting $\tilde{\boldsymbol{\mu}}_t$ are three views of one quantity. The figure shows the geometry: the clean point gets shrunk to $\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0$ , and a noise vector pushes it out to $\mathbf{x}_t$ . That arrow is exactly what the network outputs.

Figure 4 · the noise-prediction target

The clean x₀ is scaled down to √āₜ·x₀, then the added noise √(1−āₜ)·ε pushes it to the noisy xₜ. The network outputs ε, the direction of that arrow. Knowing ε is the same as knowing x₀, since (4) ties them together. As t grows the signal shrinks and the noise takes over.

The ε-parameterization now simplifies the loss. Writing $L_{t-1}$ in terms of $\boldsymbol{\epsilon}_\theta$ turns the mean distance (8) into a distance between the true noise and the predicted noise, with a per-step weight out front. The paper then makes one empirical move: drop that weight. The training objective becomes a plain mean-squared error on the noise:

L_{\text{simple}}(\theta) := \mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\!\left[\big\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\ t\big)\big\rVert^2\right]

(14)

That line is the full training loss. Sample an image, sample a timestep, sample noise, build the noised image with one use of (4), ask the network for the noise, penalize the squared error. No adversary, no sampling during training, no chain to unroll. The ε-parameterization also buys a well-conditioned target: $\boldsymbol{\epsilon}$ has unit variance at every timestep, so the regression target keeps the same statistical scale whether the input is nearly clean or nearly static, with no schedule-dependent scaling left for the network to correct for. Dropping the weight has a theoretical cost (the result is a reweighted bound, no longer the tightest likelihood bound), but it down-weights the easy near-clean steps and pushes the network to spend its capacity on the harder middle of the chain, and it produced the paper's best samples. The authors highlight one connection here: $L_{\text{simple}}$ resembles denoising score matching across noise levels, the objective from Song and Ermon, and $\boldsymbol{\epsilon}_\theta$ is, up to a negative scale factor, an estimate of the score: $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t) \approx -\sqrt{1-\bar{\alpha}_t}\,\nabla_{\mathbf{x}_t}\log q(\mathbf{x}_t)$. The score is the gradient of the log-density and points toward higher-probability (more image-like) configurations, so the predicted noise points the opposite way, which is why the sampler subtracts it. This is Tweedie's formula: a Gaussian denoiser is a score estimator.
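To see what dropping the weight does, it helps to tabulate it. In the paper's weighted ε-form of the bound, the factor in front of $\lVert\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_\theta\rVert^2$ is $\beta_t^2 / \big(2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)\big)$. The sketch below evaluates it under the linear schedule, assuming the $\sigma_t^2=\beta_t$ variance choice; the variable names are ours.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)
sigma2 = betas                                   # assumes the sigma_t^2 = beta_t choice

# Per-step weight on ||eps - eps_theta||^2 in the weighted bound
weight = betas**2 / (2.0 * sigma2 * alphas * (1.0 - abar))

w_first, w_last = weight[0], weight[-1]
ratio = w_first / w_last
```

Under these assumptions the weight is about 0.5 at $t=1$ and about 0.01 at $t=T$, so the bound leans on the nearly clean steps far more heavily than on the noisy ones. Setting every weight to one, as $L_{\text{simple}}$ does, removes that emphasis.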

Sampling: run the reverse process

Training is done. To generate, run the reverse chain. Start from pure noise $\mathbf{x}_T\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and step down to $\mathbf{x}_0$ . Each step computes the mean from the network's noise guess via (11), then adds a dab of fresh noise:

\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\!\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t,t)\right) + \sigma_t\,\mathbf{z}, \qquad \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})

(Note $1-\alpha_t = \beta_t$ , so this is exactly the mean from (11) plus the noise term.) The added $\sigma_t\mathbf{z}$ is what keeps this a proper sample from a stochastic process rather than a deterministic average. The variance $\sigma_t^2$ is a fixed schedule choice, not learned. The paper tried $\sigma_t^2 = \beta_t$ and $\sigma_t^2 = \tilde{\beta}_t$ and found them comparable; the released code defaults to $\sigma_t^2 = \beta_t$ . On the very last step ( $t=1$ ) the noise is dropped, so $\mathbf{z}=\mathbf{0}$ and you read off a clean image. Run it and watch noise condense onto the data:

Figure 5 · reverse sampling

200 denoising steps · xₜ → x₀

Sampling runs the chain backward. Points start as a featureless Gaussian blob and, over many small denoising steps each using equation (11), condense onto the data clusters. This run uses the exact optimal noise-predictor for a mixture; a real ε_θ is a trained network.

Each step feeds the current noisy image to the network, which outputs a noise estimate, and the update nudges the image a little toward the implied clean version, then re-roughens it slightly so the chain can keep exploring. Early steps (high $t$ ) are coarse: the noise washes out detail and the best guess is close to a blurry average. Late steps (low $t$ ) are fine: the image is nearly clean and the network sharpens edges.
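The whole sampler can be run end to end on a toy problem where the best possible noise predictor is known in closed form, in the same spirit as Figure 5. The sketch below assumes one-dimensional Gaussian data $x_0\sim\mathcal{N}(2, 0.5^2)$, for which $\mathbb{E}[\epsilon\mid x_t]$ follows from ordinary Gaussian conditioning; `eps_optimal` stands in for a trained $\boldsymbol{\epsilon}_\theta$, and every name here is illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
abar = np.cumprod(alphas)

DATA_MEAN, DATA_STD = 2.0, 0.5   # toy 1-D "dataset": x_0 ~ N(2, 0.5^2)

def eps_optimal(x, t):
    """Exact E[eps | x_t] for the Gaussian toy data (stand-in for a trained network)."""
    a = abar[t - 1]
    marginal_var = a * DATA_STD**2 + (1.0 - a)        # Var[x_t]
    return np.sqrt(1.0 - a) * (x - np.sqrt(a) * DATA_MEAN) / marginal_var

def sample(n, rng):
    x = rng.standard_normal(n)                        # x_T ~ N(0, 1)
    for t in range(T, 0, -1):
        eps = eps_optimal(x, t)
        mean = (x - betas[t - 1] / np.sqrt(1.0 - abar[t - 1]) * eps) / np.sqrt(alphas[t - 1])
        z = rng.standard_normal(n) if t > 1 else np.zeros(n)   # no noise on the last step
        x = mean + np.sqrt(betas[t - 1]) * z          # sigma_t^2 = beta_t
    return x

samples = sample(20_000, np.random.default_rng(0))
sample_mean, sample_std = samples.mean(), samples.std()
```

Starting from standard-normal noise, the samples should end up with mean close to 2 and standard deviation close to 0.5, the statistics of the toy data.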

One training step, one sampling step

Make it fully concrete with CIFAR-10 shapes. An image is $32\times 32\times 3 = 3072$ numbers, scaled from the integer range $\{0,\dots,255\}$ linearly into $[-1,1]$ . The schedule has $T=1000$ steps, $\beta$ linear from $10^{-4}$ to $0.02$ , so $\bar{\alpha}_t$ is precomputed once as a length-1000 table. The network $\boldsymbol{\epsilon}_\theta$ is a U-Net (the encoder-decoder image-to-image architecture; see the U-Net explainer) that takes a noisy image plus the timestep $t$ (fed in through a sinusoidal embedding, the same trick Transformers use for position) and outputs a tensor the same $32\times 32\times 3$ shape as the image: its guess of the noise in every pixel.
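The two preprocessing pieces mentioned here, pixel scaling and the timestep embedding, can be sketched in a few lines. This is a hedged illustration: the embedding follows the generic Transformer sin/cos convention, and the width of 128 and frequency base of 10000 are assumptions made for the example, not values taken from the paper or its code.

```python
import numpy as np

def scale_to_unit_range(pixels_uint8):
    """Map integer pixels {0, ..., 255} linearly onto [-1, 1]."""
    return pixels_uint8.astype(np.float64) / 127.5 - 1.0

def timestep_embedding(t, dim):
    """Transformer-style sinusoidal embedding of integer timesteps, shape [len(t), dim]."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)   # geometric frequency ladder
    angles = np.asarray(t, dtype=np.float64)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

image = np.random.default_rng(0).integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
x0 = scale_to_unit_range(image)                      # shape (32, 32, 3), values in [-1, 1]
emb = timestep_embedding([1, 400, 1000], dim=128)    # one row per timestep
```

Each timestep gets its own distinct vector, which is what lets a single network behave differently at different noise levels.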

A training step. Draw a clean image $\mathbf{x}_0$ (shape $3072$ ). Draw a timestep, say $t=400$ , for which $\bar{\alpha}_{400}\approx 0.195$ , so the signal weight is $\sqrt{\bar{\alpha}_{400}}\approx 0.44$ and the noise weight is $\sqrt{1-\bar{\alpha}_{400}}\approx 0.90$ (already mostly noise). Draw $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ (same shape). Build $\mathbf{x}_{400} = 0.44\,\mathbf{x}_0 + 0.90\,\boldsymbol{\epsilon}$ in one line. Feed $(\mathbf{x}_{400}, 400)$ to the U-Net, get back $\boldsymbol{\epsilon}_\theta$ , and the loss is the single number $\lVert \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\rVert^2$ . Backprop, step, repeat:

# Algorithm 1: training (one step)
x0    = sample_batch()                  # clean image, scaled to [-1, 1]
t     = randint(1, T)                   # uniform timestep in {1, ..., T}
eps   = randn_like(x0)                  # the noise we add (the answer)
xt    = sqrt(abar[t]) * x0 + sqrt(1 - abar[t]) * eps   # eq (4); abar is indexed by t
loss  = mse(eps, eps_theta(xt, t))      # L_simple, eq (14)
opt.zero_grad()                         # clear gradients left over from the previous step
loss.backward(); opt.step()

Sampling reverses it. Start from $\mathbf{x}_{1000}$ , pure Gaussian noise of shape $3072$ . For $t = 1000, 999, \dots, 1$ , call the network once, assemble the mean with (11), add $\sigma_t\mathbf{z}$ (except on the last step), and pass the result down. After 1000 network calls you have a sample $\mathbf{x}_0$ :

# Algorithm 2: sampling
x = randn(n, *shape)                    # x_T ~ N(0, I)
for t in range(T, 0, -1):               # t = T, T-1, ..., 1
    eps = eps_theta(x, t)               # the network's noise guess
    mean = (x - (1 - alpha[t]) / sqrt(1 - abar[t]) * eps) / sqrt(alpha[t])  # eq 11
    z = randn_like(x) if t > 1 else 0
    x = mean + sqrt(sigma2[t]) * z       # sigma2[t] = beta[t]  (fixedlarge)
return x                                 # x_0, a sample

Written out in shapes like this, the cost is obvious too. Training is cheap per step: one forward and backward pass on one noised image. Sampling is the expensive part, because it is sequential and needs one network call per timestep, a thousand calls to make a single picture. This sequential sampling is the price of diffusion's stable training and sharp samples, and the bottleneck every follow-up paper has tried to cut.

The shapes also make clear what kind of data this works on. Every equation here noises a continuous vector: the closed form (4) multiplies $\mathbf{x}_0$ by $\sqrt{\bar{\alpha}_t}$ and adds real-valued Gaussian noise, so $\mathbf{x}_0$ has to be something you can scale and average. For images that is pixels, the 3072 real numbers above. The same recipe runs on any other continuous tensor, an audio waveform or a learned embedding, with nothing else changed. Text breaks this assumption. You cannot add a little Gaussian noise to a token id; the integer 4017 plus 0.3 is not a word, and halfway between two token ids is nothing. The usual workaround is to diffuse in a continuous space the tokens map into: embed the length- $L$ sequence of ids into a continuous $[L, d]$ embedding tensor, noise that tensor with (4), and snap the denoised result back to the nearest tokens at the end. The discrete ids themselves are never noised, only their continuous embeddings are. So vanilla DDPM is a continuous-data method, not a text model; making diffusion work on language took extra machinery, and that is its own line of papers, not anything in this one.
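The embed, noise, and round-back workaround described above can be sketched with a random table standing in for a learned embedding. The vocabulary size, sequence length, and embedding width below are toy values chosen for illustration, and nearest-neighbor rounding is just one simple way to snap back to tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, L, D = 50, 8, 16                       # toy sizes, chosen for illustration
embedding = rng.standard_normal((VOCAB, D))   # stand-in for a learned embedding table

def embed(ids):
    """[L] integer ids -> [L, D] continuous tensor."""
    return embedding[ids]

def nearest_tokens(x):
    """Snap each row of x back to the id of the closest embedding vector."""
    d2 = ((x[:, None, :] - embedding[None, :, :]) ** 2).sum(axis=-1)   # [L, VOCAB]
    return d2.argmin(axis=1)

ids = rng.integers(0, VOCAB, size=L)
abar_t = 0.9999                               # a nearly clean noise level, as at t = 1
x0 = embed(ids)
xt = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * rng.standard_normal(x0.shape)   # eq. (4)
recovered = nearest_tokens(xt)
```

At this low noise level the rounding recovers the original ids. At high noise levels it would not, and closing that gap is the job a denoising network would have to do.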

Sharp samples, mediocre likelihood

On unconditional CIFAR-10, the model with the $L_{\text{simple}}$ objective reaches an Inception Score of $9.46$ (a sample-quality score where higher is better) and an FID of $3.17$, which the paper reports as state of the art for the unconditional setting, ahead of the strong GANs and most class-conditional models (one tuned conditional GAN, StyleGAN2 with ADA, scores lower at $2.67$). On $256\times 256$ LSUN it produced samples on par with ProgressiveGAN, and on CelebA-HQ the faces are sharp.

The ablations put these design choices side by side. Predicting the noise $\boldsymbol{\epsilon}$ with the unweighted $L_{\text{simple}}$ clearly beat predicting the mean $\tilde{\boldsymbol{\mu}}$ and beat training on the full weighted bound: FID $3.17$ for $\boldsymbol{\epsilon}$-prediction with $L_{\text{simple}}$ versus $13.5$ for the same parameterization on the true bound. Learning the reverse-process variance, rather than fixing it, made training unstable.

As a likelihood model DDPM is unremarkable: about $3.75$ bits/dim on CIFAR-10 (bits per dimension: the average number of bits needed to encode each pixel value, lower meaning a tighter likelihood), behind the best autoregressive models. The authors trace it to where the bits go. More than half the codelength describes imperceptible pixel-level detail, so the model is an excellent lossy compressor that spends its likelihood budget on things the eye does not see. Trading likelihood for samples is exactly what $L_{\text{simple}}$ does by dropping the weighting.
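For scale, bits per dimension is just the negative log-likelihood expressed in bits and divided by the number of dimensions. A small sketch of the conversion for CIFAR-10; the function name is ours.

```python
import math

def bits_per_dim(nll_nats, num_dims):
    """Convert a per-image negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2.0))

D = 32 * 32 * 3                           # CIFAR-10 dimensions per image
total_bits = 3.75 * D                     # codelength implied by 3.75 bits/dim
nll_nats = total_bits * math.log(2.0)     # the same quantity in nats
check = bits_per_dim(nll_nats, D)         # round-trips back to 3.75
```

So 3.75 bits/dim corresponds to 11,520 bits per image, against the 24,576 bits of the raw 8-bit pixels.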

And the cost is real: a thousand sequential network calls per image. That single number shaped the research direction for the next few years. DDIM (Denoising Diffusion Implicit Models) made the sampler deterministic and skippable so you could use tens of steps instead of a thousand. Latent diffusion moved the process into a compressed latent space to cut the per-step cost, which is what put diffusion behind Stable Diffusion. The score-based view this paper made explicit grew into the SDE (stochastic-differential-equation) framework that unified the field (see the Score-SDE explainer). Every one of those follow-ups kept the same spine: add noise on a fixed schedule, train one network to guess the noise, walk the noise back out. Destroying the image never needed a network. What this paper established is that rebuilding it is the same denoising step, asked a thousand times over.

Provenance: verified against primary literature

DDPM (2020) · Ho, Jain, Abbeel: the forward/reverse processes, the variational bound, the ε-parameterization (11), L_simple (14), and the CIFAR-10 numbers.

Official code · hojonathanho/diffusion, diffusion_utils_2.py. Confirms the linear β schedule, posterior coefficients (7), and the default sampler.

Sohl-Dickstein et al. (2015) · The original diffusion-probabilistic-model framework DDPM builds on.

Song & Ermon (2019) · Denoising score matching across noise levels, the connection DDPM makes explicit.

Tweedie / Robbins (1956) · A Gaussian denoiser is a score estimator, the bridge between ε-prediction and the score.

Caveat · No paper-vs-code discrepancy in the math: the released diffusion_utils_2.py matches equations (4), (7), (11), and Algorithm 2 exactly. One default to name: the code samples with σₜ² = βₜ (model_var_type "fixedlarge"), the larger of the two choices the paper says performed similarly.

Questions you might still have

Why does x_T end up as plain N(0, I), no matter the image?
Because the closed form (4) scales the signal by √āₜ and adds noise of variance 1−āₜ. With the paper’s schedule āₜ falls to about 4×10⁻⁵ by t = 1000, so √āₜ ≈ 0.006 (the image is gone) and the variance is ≈ 1. Every starting image lands in the same standard Gaussian.

Why predict the noise ε instead of the clean image x₀ or the mean μ̃?
They carry the same information (any one determines the others through eq 4). But ε-prediction makes the reverse mean (11) fall out cleanly and turns the loss into a plain MSE on a unit-variance target. The authors tried predicting μ̃ and x₀ too; ε-prediction with the unweighted loss gave the best samples.

If the loss drops the weighting from the true bound, is it still a valid objective?
It is a re-weighted variational bound. Dropping the per-t weights down-weights the easy, low-noise steps and lets the network spend capacity on the harder middle steps. It is no longer the tightest likelihood bound, which is why the model’s log-likelihoods are only okay, but it produces better samples.

Footnotes & further reading

The paper: Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (UC Berkeley, NeurIPS 2020). Code: hojonathanho/diffusion.
The framework DDPM builds on: Sohl-Dickstein, Weiss, Maheswaranathan, Ganguli, Deep Unsupervised Learning using Nonequilibrium Thermodynamics (2015).
The denoising-score-matching line the ε-objective coincides with: Song, Ermon, Generative Modeling by Estimating Gradients of the Data Distribution (NCSN, 2019), and Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011).
The faster, deterministic sampler: Song, Meng, Ermon, Denoising Diffusion Implicit Models (DDIM).
The continuous-time unification of score-based and diffusion models: Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, Score-Based Generative Modeling through Stochastic Differential Equations.
Diffusion in a compressed latent space, the basis of Stable Diffusion: Rombach, Blattmann, Lorenz, Esser, Ommer, High-Resolution Image Synthesis with Latent Diffusion Models.
