Verified · arXiv:2403.18103 · 26 min
Diffusion · Foundations

Tutorial on Diffusion Models for Imaging and Vision

The same diffusion model, built four ways.

A variational autoencoder, a noise predictor, a score field, and a stochastic differential equation look like four unrelated methods. Chan's tutorial builds each from the ground up and shows the four are one model seen from four sides.

Explaining the paper: Tutorial on Diffusion Models for Imaging and Vision. Stanley H. Chan · Purdue · 2024 · arXiv:2403.18103

There are four standard ways to write down a diffusion model. On the page they look unrelated. They are the same model, and seeing why is the fastest way to actually understand any of them.

Start with the job a generative model is hired to do. You have a pile of real images, faces say, and you want a machine that produces new faces that look just as real, none of them copies of the training set. Phrase it in probability and it gets sharp: real faces follow some distribution $p(\mathbf{x})$ over the space of all images, and you want to draw fresh samples from $p(\mathbf{x})$. What makes it hard is that you never get to see $p(\mathbf{x})$. You only ever hold a finite handful of samples from it, and the distribution itself, the thing that tells a face from static, is exactly what you are missing.

Diffusion models reach $p(\mathbf{x})$ through a back door. It is trivially easy to destroy an image: add a little Gaussian noise, then a little more, and after enough steps any face dissolves into featureless static, the same static no matter which face you started from. That forward direction throws information away and needs no learning at all. So learn to run it backward. Train a network to undo one small notch of noise, chain those undo-steps from pure static back to a clean image, and you have a sampler for $p(\mathbf{x})$. Every diffusion model is this one idea. What changes between the famous papers is only the language used to describe the undo-step.

Chan's tutorial walks four of those languages in dependency order, and that order is the plan for this page: the variational recipe behind a VAE, which sets up the encode-decode-bound machinery; DDPM, which stacks that recipe into a long noising chain and collapses its loss into "predict the noise"; score matching, which arrives at the same target through a completely different door; and the stochastic differential equation that contains both. None of the four is hard on its own. The reward for holding all four at once is that a fact stated obscurely in one becomes obvious in another.

VAE: the variational recipe

The variational autoencoder is the warm-up, and it introduces every moving part the rest of the page reuses: a latent variable, an encoder and decoder, a lower bound to optimize, and a sampling trick that lets gradients pass through randomness. The setup is two networks facing each other. An encoder takes an image $\mathbf{x}$ and produces a compact latent code $\mathbf{z}$; a decoder takes a code $\mathbf{z}$ and reconstructs an image. If you also force the codes to follow a simple distribution you can sample, usually a standard Gaussian $p(\mathbf{z}) = \mathcal{N}(\mathbf{0},\mathbf{I})$, then generation is easy: draw a random $\mathbf{z}$ and decode it.

Because everything is probabilistic, the encoder and decoder are written as conditional distributions. The encoder is approximated by $q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})$, a Gaussian whose mean and variance are output by a network with weights $\boldsymbol{\phi}$; the decoder is $p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})$, a Gaussian centered on the network's reconstruction. Why a Gaussian for the encoder and not something more expressive? Honestly, convenience: a diagonal Gaussian has a mean and a variance you can output directly, its samples have a closed form, and the one integral you need against it stays tractable. The field runs on that one tractable choice.

That leaves the objective. You would like to maximize the likelihood the model assigns to real data, $\log p(\mathbf{x})$, but that quantity contains an integral over every possible latent code and is hopeless to compute. So you optimize a quantity you can compute instead, the Evidence Lower BOund (ELBO), which sits below $\log p(\mathbf{x})$ by a measurable amount:

$$\log p(\mathbf{x}) = \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})}\!\left[\log \frac{p(\mathbf{x},\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})}\right]}_{\text{ELBO}(\mathbf{x})} \;+\; \mathbb{D}_{\text{KL}}\!\big(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z}\mid\mathbf{x})\big) \tag{1}$$

That second term is a Kullback–Leibler divergence, a measure of how far one distribution is from another. It is never negative, so the ELBO can never exceed $\log p(\mathbf{x})$; it is a genuine lower bound, and pushing it up pushes up the likelihood it tracks. One subtlety is easy to misread: the KL in eq (1), the gap between the bound and the truth, compares the encoder $q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})$ against the true posterior $p(\mathbf{z}\mid\mathbf{x})$, the genuinely unknowable object. A different KL, against the prior, lives inside the ELBO itself, and the two are not the same term.

Rearranging the ELBO splits it into the two jobs a VAE is actually doing:

$$\text{ELBO}(\mathbf{x}) = \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x})}\big[\log p_{\boldsymbol{\theta}}(\mathbf{x}\mid\mathbf{z})\big]}_{\text{reconstruction: decode } \mathbf{z} \text{ back to } \mathbf{x}} \;-\; \underbrace{\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{z}\mid\mathbf{x}) \,\|\, p(\mathbf{z})\big)}_{\text{prior matching: keep codes Gaussian}} \tag{2}$$

The first term rewards the decoder for rebuilding the input from its code. The second is the regularizer: it penalizes the encoder when the codes it emits stray from the standard-Gaussian prior, the penalty that keeps the latent space samplable later. The two pull against each other: a perfect reconstruction would memorize an arbitrary code, a perfect Gaussian would ignore the input, and the trained model settles where it does both passably. For Gaussians both terms are closed form: the reconstruction is a scaled squared error between the input and the decode, and the prior KL is a tidy formula in the encoder's predicted mean and variance.

One mechanical obstacle stands in the way, and its solution recurs everywhere downstream. Training needs the gradient of the ELBO with respect to the encoder weights $\boldsymbol{\phi}$, but $\boldsymbol{\phi}$ sits inside a sampling step (you draw $\mathbf{z}$ from $q_{\boldsymbol{\phi}}$), and you cannot differentiate through a coin flip. The reparameterization trick moves the randomness out of the way: instead of sampling $\mathbf{z}$ directly, write it as the encoder's mean and standard deviation applied to a fixed external noise draw,

$$\mathbf{z} = \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}) + \boldsymbol{\sigma}_{\boldsymbol{\phi}}(\mathbf{x}) \odot \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I}) \tag{3}$$

Now $\boldsymbol{\epsilon}$ carries the randomness and $\boldsymbol{\phi}$ only appears in plain differentiable arithmetic, so gradients flow straight through. There is an alternative, the score-function (REINFORCE) estimator, that differentiates the sampling directly without this rewrite, but its gradients are so noisy as to be impractical here, which is why reparameterization is the default. Hold onto this move: writing a noisy quantity as "a clean value plus scaled standard noise" is exactly how the diffusion forward process will be written in the next section.
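The gap between the two estimators can be seen on a toy problem. The sketch below is an illustration under assumptions that are not in the tutorial: a scalar Gaussian with hand-picked `mu` and `sigma`, and a made-up objective $f(z) = z^2$ whose true gradient with respect to the mean is $2\mu$. Both estimators are unbiased; the difference is how much they scatter.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.0, 1.0, 200_000   # illustrative values, not from the tutorial

eps = rng.standard_normal(n)
z = mu + sigma * eps               # eq (3): all randomness lives in eps

# Toy objective f(z) = z**2, so E[f] = mu**2 + sigma**2 and dE[f]/dmu = 2*mu.
grad_reparam = 2.0 * z                        # pathwise: df/dz * dz/dmu
grad_score_fn = z**2 * (z - mu) / sigma**2    # REINFORCE: f(z) * dlog q/dmu

print("true gradient  :", 2.0 * mu)
print("reparameterized:", grad_reparam.mean(), "variance", grad_reparam.var())
print("score-function :", grad_score_fn.mean(), "variance", grad_score_fn.var())
```

Both averages should land near the true gradient, with the score-function samples spread several times wider. In a real VAE the objective is the ELBO and the gradient is taken by automatic differentiation, but the mechanism is the same.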

Before the figure, the claim to check: the ELBO is a lower bound that you raise by closing the gap to the true posterior, and when the encoder matches that posterior the bound becomes tight and the gap is zero. Drag the slider to slide the approximation onto the truth and watch the amber gap collapse while the teal bound fills the bar:

Figure 1 · the evidence lower bound
Top: the true posterior (amber) and the encoder's guess $q(\mathbf{z}\mid\mathbf{x})$. Bottom: the full bar is the evidence $\log p(\mathbf{x})$, split into the ELBO and the KL gap. As $q$ slides onto the posterior the gap shrinks to zero and the bound goes tight. The VAE optimizes the ELBO because the evidence above it is out of reach.
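The split in eq (1) can be checked numerically in a model small enough that every term has a closed form. The sketch below assumes a one-dimensional toy that is not in the tutorial: prior $\mathcal{N}(0,1)$, decoder $\mathcal{N}(z, s^2)$, and a Gaussian encoder $\mathcal{N}(m, v)$. In that toy the evidence and the true posterior are both Gaussian, so ELBO plus gap can be compared against $\log p(x)$ directly.

```python
import numpy as np

def log_normal(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

def kl_normal(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def elbo(x, m, v, s2):
    """Eq (2) for prior N(0, 1), decoder N(z, s2), encoder q = N(m, v)."""
    recon = -0.5 * np.log(2 * np.pi * s2) - ((x - m) ** 2 + v) / (2 * s2)
    return recon - kl_normal(m, v, 0.0, 1.0)

x, s2 = 1.0, 1.0                                  # toy observation, decoder variance
log_px = log_normal(x, 0.0, 1.0 + s2)             # exact evidence
post_m, post_v = x / (1.0 + s2), s2 / (1.0 + s2)  # exact posterior p(z | x)

for m, v in [(-1.0, 2.0), (0.0, 1.0), (post_m, post_v)]:
    gap = kl_normal(m, v, post_m, post_v)         # the KL term of eq (1)
    e = elbo(x, m, v, s2)
    print(f"q=N({m:.2f},{v:.2f})  ELBO={e:.4f}  gap={gap:.4f}  ELBO+gap={e + gap:.4f}")
print(f"log p(x) = {log_px:.4f}")
```

The last column should match $\log p(x)$ on every row, and the gap should vanish only on the row where the encoder equals the posterior.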

So a VAE is a one-step story: encode to a latent, decode back, and train the round trip with a bound you can actually compute. A diffusion model is the same story stretched over hundreds of tiny steps, with one decision that makes every step trivial.

DDPM: predict the noise

DDPM (denoising diffusion probabilistic models) takes the VAE's encode-decode idea and makes two changes. The latent is no longer a compressed code; it is the image itself, at the same size, with noise piled on. And there is not one latent but a whole ladder of them, $\mathbf{x}_1$ through $\mathbf{x}_T$, each a little noisier than the last, with $\mathbf{x}_0$ the clean image and $\mathbf{x}_T$ pure noise. The "encoder" is fixed and requires no learning: it just adds noise on a schedule.

Each forward step nudges the signal down a hair and adds a little noise:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\!\big(\mathbf{x}_t \,;\, \sqrt{\alpha_t}\,\mathbf{x}_{t-1},\; (1-\alpha_t)\,\mathbf{I}\big) \tag{4}$$

Here $\alpha_t$ is a number just below 1, set by a fixed schedule. The two coefficients are tied on purpose: the signal is scaled by $\sqrt{\alpha_t}$ and the noise variance is exactly the leftover $1-\alpha_t$. That pairing is what keeps the overall scale from drifting. If the incoming image has unit variance, the outgoing one does too, since $\alpha_t \cdot 1 + (1-\alpha_t) = 1$. The process is called variance preserving for exactly this reason, and it is why diffusion models scale their pixels to a fixed range first: with unit-variance data the cloud neither blows up nor shrinks to a point as noise accumulates, it just loses its structure.[1]

You would think reaching $\mathbf{x}_t$ means simulating all $t$ steps one by one. It does not, thanks to the single most useful fact in DDPM. Because each step is an affine map plus independent Gaussian noise, composing many steps is still just "scaled clean image plus Gaussian noise," and the variances simply add (you add variances, not standard deviations, and no change-of-variables correction is needed). The whole chain telescopes into one jump:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \qquad \bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i, \quad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}) \tag{5}$$

One cumulative number $\bar{\alpha}_t$, the product of all the per-step scales, sets where you are on the trip from clean to noise. The two scalars are the dial: at, say, $\bar{\alpha}_t = 0.5$ you get $\mathbf{x}_t = 0.71\,\mathbf{x}_0 + 0.71\,\boldsymbol{\epsilon}$, half signal and half noise; as $\bar{\alpha}_t \to 0$ the signal term vanishes and you are left with a standard Gaussian. Notice this has the exact shape of the reparameterization trick from eq (3): a clean value, scaled, plus scaled standard noise. The VAE's sampling step and the diffusion forward step are the same algebraic form. Drag the step $t$ and watch one cloud of points get scaled toward the origin while noise of variance $1-\bar{\alpha}_t$ fills in:

Figure 2 · the forward process
The forward marginal scales the clean signal by $\sqrt{\bar{\alpha}_t}$ and adds noise of variance $1-\bar{\alpha}_t$. Drag $t$: the clean shape (amber) is scaled down by $\sqrt{\bar{\alpha}_t}$ while noise climbs to fill the gap, ending as a featureless blob. The readout shows the two tied scalars. You can land at any $t$ in one shot, no stepping required.
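The one-shot jump of eq (5) can be checked against the step-by-step chain of eq (4). The sketch below assumes a short linear schedule for $1-\alpha_t$ and a scalar "image", both chosen only for illustration, and compares the mean and variance reached by the two routes.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
betas = np.linspace(1e-3, 0.2, T)   # assumed linear schedule, illustration only
alphas = 1.0 - betas
abar = np.cumprod(alphas)           # abar[t-1] is the cumulative product for step t

x0 = np.full(100_000, 2.0)          # many copies of one "clean" scalar

# Route 1: apply eq (4) one step at a time.
x = x0.copy()
for a in alphas:
    x = np.sqrt(a) * x + np.sqrt(1.0 - a) * rng.standard_normal(x.shape)

# Route 2: jump straight to step T with eq (5).
x_jump = np.sqrt(abar[-1]) * x0 + np.sqrt(1.0 - abar[-1]) * rng.standard_normal(x0.shape)

print("stepped : mean %.3f  var %.3f" % (x.mean(), x.var()))
print("one-shot: mean %.3f  var %.3f" % (x_jump.mean(), x_jump.var()))
print("eq (5)  : mean %.3f  var %.3f" % (np.sqrt(abar[-1]) * 2.0, 1.0 - abar[-1]))
```

The three rows should agree up to sampling error, which is the telescoping claim in numbers: the noise variance accumulated over the chain is exactly $1-\bar{\alpha}_T$.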

The forward direction is free. The model is the reverse: a network that takes a noised $\mathbf{x}_t$ and walks it one step back toward the data. Following the VAE template, each reverse step is a small Gaussian decoder, a network that predicts the mean $\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)$ of the next, cleaner state, and the stack is trained by the same ELBO, which here breaks into one denoising term per noise level. The miracle that makes this practical is that the ideal target for each step is known in closed form. If you are allowed to peek at the clean $\mathbf{x}_0$, the true reverse step $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$ is itself a Gaussian whose mean is a fixed blend of where you are and where you started:

$$\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0) = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,(1-\alpha_t)}{1-\bar{\alpha}_t}\,\mathbf{x}_0 \tag{6}$$

So matching the network's mean $\boldsymbol{\mu}_{\boldsymbol{\theta}}$ to this target is, after the algebra clears, just asking the network to recover $\mathbf{x}_0$ from $\mathbf{x}_t$. And there is a slicker target still. Rearranging eq (5) writes the clean image purely in terms of the noised $\mathbf{x}_t$ and the noise $\boldsymbol{\epsilon}$, so knowing $\mathbf{x}_0$ is the same as knowing the noise that was added. Substituting that in turns the mean into a formula that depends only on a noise estimate $\hat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}$:

$$\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\hat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_t,t)\right) \tag{7}$$
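The substitution is short enough to write out. This is a sketch using only eq (5), eq (6), and the identity $\bar{\alpha}_t = \alpha_t\,\bar{\alpha}_{t-1}$, with the true noise $\boldsymbol{\epsilon}$ standing where the network's estimate will go:

$$
\begin{aligned}
\mathbf{x}_0 &= \frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}}{\sqrt{\bar{\alpha}_t}} && \text{(eq (5) solved for } \mathbf{x}_0\text{)} \\
\boldsymbol{\mu}_q &= \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t + \frac{1-\alpha_t}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}\left(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}\right) && \text{(using } \sqrt{\bar{\alpha}_{t-1}}/\sqrt{\bar{\alpha}_t} = 1/\sqrt{\alpha_t}\text{)} \\
&= \frac{(\alpha_t - \bar{\alpha}_t) + (1-\alpha_t)}{\sqrt{\alpha_t}\,(1-\bar{\alpha}_t)}\,\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{\alpha_t}\,\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon} \\
&= \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}\right)
\end{aligned}
$$

Replacing the unknown $\boldsymbol{\epsilon}$ with the network's estimate gives eq (7).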

The clean image, the posterior mean of eq (6), and the added noise are three names for the same regression target, and the diffusion ELBO, with its weighting dropped, collapses into one of the simplest objectives in machine learning: show the network a noised image, ask it to name the noise, penalize the squared error.

$$\mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{x}_0,\,t,\,\boldsymbol{\epsilon}}\Big[\big\lVert\, \hat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\; t) - \boldsymbol{\epsilon} \,\big\rVert^2\Big] \tag{8}$$

That the three targets are mathematically interchangeable does not make them practically equal: Ho and collaborators found predicting the noise trained more stably and produced better samples early on than predicting the clean image, and the move to this unweighted loss was a deliberate, quality-improving simplification, not a free identity. Here it is as code. Training is a plain regression loop with no chain to backpropagate through, since eq (5) lets you jump to any tt directly:

# one DDPM training step: predict the noise you added
x0   = sample_data()                 # a clean image, shape [d]
t    = randint(1, T)                 # a random noise level 1..T
eps  = randn_like(x0)                # the noise (this is the target)
xt   = sqrt(abar[t]) * x0 + sqrt(1 - abar[t]) * eps
loss = mse(eps_hat(xt, t), eps)      # one net, every t shares weights
loss.backward()                      # plain regression, no chain to run
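To see the regression do something concrete without a neural network, the sketch below swaps in the smallest possible "model": a single slope `w` at one fixed noise level, fit by ordinary least squares. Everything here is an assumption made for illustration (scalar Gaussian data with standard deviation `s`, one level `abar_t`, a linear predictor). For Gaussian data the ideal noise predictor happens to be linear, so the fit can be compared against a closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
s = 0.5          # assumed data std: x0 ~ N(0, s^2), a stand-in for images
abar_t = 0.6     # one assumed noise level; a real model shares weights across t
n = 200_000

x0 = s * rng.standard_normal(n)
eps = rng.standard_normal(n)                             # the regression target
xt = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps    # eq (5)

# "Network": eps_hat(xt) = w * xt. Minimizing eq (8) over w is least squares.
w_fit = np.dot(xt, eps) / np.dot(xt, xt)

# For Gaussian data, E[eps | xt] is linear in xt with this slope.
w_opt = np.sqrt(1 - abar_t) / (abar_t * s**2 + 1 - abar_t)
print("fitted slope %.4f   closed form %.4f" % (w_fit, w_opt))
```

The two slopes should agree to a couple of decimal places. A real DDPM replaces the slope with a deep network conditioned on $t$ and the closed-form solve with gradient descent, but the target and the loss are the ones shown.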

Sampling reverses the ladder. Start from pure noise, and at each level use the network's noise estimate to take the reverse step of eq (7), adding a touch of fresh noise at every step except the last (the reverse process is itself stochastic, which keeps the samples diverse):

# DDPM sampling: walk x_T (pure noise) down to a clean x_0
def sample(d):
    x = randn(d)                         # start: a featureless Gaussian
    for t in range(T, 0, -1):
        eps  = eps_hat(x, t)             # the network's noise estimate
        mean = (x - (1 - a[t]) / sqrt(1 - abar[t]) * eps) / sqrt(a[t])
        z    = randn(d) if t > 1 else 0  # inject noise except on the last step
        # sigma_q[t]**2 = (1 - a[t]) * (1 - abar[t-1]) / (1 - abar[t]), the posterior variance
        x    = mean + sigma_q[t] * z     # one reverse step
    return x                             # a fresh sample from p(x)
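The sampling loop can be run end to end if something stands in for the trained network. The sketch below makes that substitution explicit: the data is assumed to be a scalar Gaussian $\mathcal{N}(m, s^2)$, for which the ideal noise estimate $\mathbb{E}[\boldsymbol{\epsilon}\mid\mathbf{x}_t]$ has a closed form, and `eps_hat` returns that formula instead of a network output. The linear schedule and `T = 1000` are also assumptions. The step size uses the variance of $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
a = np.concatenate(([1.0], 1.0 - betas))    # a[t] for t = 1..T; index 0 is padding
abar = np.cumprod(a)                        # abar[0] = 1 means "no noise yet"
m, s = 2.0, 0.5                             # assumed toy data: x0 ~ N(m, s^2)

def eps_hat(x, t):
    """Stand-in for the trained network: exact E[eps | x_t] for Gaussian data."""
    var_t = abar[t] * s**2 + 1.0 - abar[t]
    return np.sqrt(1.0 - abar[t]) * (x - np.sqrt(abar[t]) * m) / var_t

def sample(n):
    x = rng.standard_normal(n)              # x_T: pure noise
    for t in range(T, 0, -1):
        mean = (x - (1 - a[t]) / np.sqrt(1 - abar[t]) * eps_hat(x, t)) / np.sqrt(a[t])  # eq (7)
        sigma_q = np.sqrt((1 - a[t]) * (1 - abar[t - 1]) / (1 - abar[t]))
        z = rng.standard_normal(n) if t > 1 else 0.0   # no noise on the last step
        x = mean + sigma_q * z
    return x

x = sample(20_000)
print("sample mean %.3f (target %.1f), std %.3f (target %.1f)" % (x.mean(), m, x.std(), s))
```

Starting from unit Gaussian noise, the samples should come out centered near `m` with spread near `s`, up to a small discretization error. With a learned `eps_hat` the loop is unchanged; only the function body differs.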

Concretely, with a 12-step schedule, one training update means: draw a clean image, pick a level like $t=7$, draw $\boldsymbol{\epsilon}$, build the noised $\mathbf{x}_7$ from eq (5), ask the network for $\hat{\boldsymbol{\epsilon}}$, and step on the squared error $\lVert\hat{\boldsymbol{\epsilon}} - \boldsymbol{\epsilon}\rVert^2$. The ladder of hundreds of latents, the per-step ELBO, the closed-form posterior, all of it reduces to one job, a denoiser trained by least squares. Which raises a question the next section answers from a completely different start: what, exactly, is a denoiser computing?

Score matching and Langevin

Forget diffusion for a moment and ask a bare question: given the data distribution $p(\mathbf{x})$, how would you draw samples from it directly? Score-matching Langevin dynamics (SMLD), the line of work from Song and Ermon, answers with two ideas that, assembled, land on the same network DDPM trained, by a road that never mentions a noising chain.

The first idea is the score. For a density $p(\mathbf{x})$, its score is the gradient of the log-density,

$$\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x}) \tag{9}$$

At every point in space this is an arrow pointing in the direction the density climbs fastest, toward where data is dense. (This is the Stein score, a gradient with respect to the data point $\mathbf{x}$, not with respect to any parameters; the word "score" is overloaded.[2]) The score has a property that makes it lovely to estimate: it ignores the normalizing constant. A density is really $p(\mathbf{x}) = \tilde{p}(\mathbf{x})/Z$ with an intractable $Z$ out front to make it integrate to one, but $\nabla_{\mathbf{x}} \log Z = 0$, so the intractable normalizer, the single hardest object in the problem, drops out of the score completely. Drag the noise level and watch the field stiffen toward the nearest cluster at low noise and relax toward the global center at high noise:

Figure 3 · the score field
The score $\nabla\log p_\sigma(\mathbf{x})$ is a vector field: every arrow points where probability mass is densest. At small $\sigma$ it points to the nearest data cluster; at large $\sigma$ the clusters blur into one hill and the arrows point gently to the global center. A perfect compass toward data, if only you had it.
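The claim that the normalizer drops out is easy to test numerically. The sketch below assumes a made-up unnormalized double-well density $\tilde{p}(x) = \exp(-x^4/4 + x^2/2)$, divides it by several arbitrary constants `Z`, and estimates the score by finite differences each time. The helper names are hypothetical.

```python
import numpy as np

def log_p_tilde(x):
    """Unnormalized log-density of an assumed double-well toy (true Z never computed)."""
    return -x**4 / 4.0 + x**2 / 2.0

def score_fd(log_density, x, h=1e-5):
    """Central finite-difference estimate of d/dx log_density(x)."""
    return (log_density(x + h) - log_density(x - h)) / (2.0 * h)

x = np.linspace(-2.0, 2.0, 9)
analytic = x - x**3                          # derivative of log_p_tilde
for Z in (1.0, 3.7, 1e6):                    # any positive constant will do
    s = score_fd(lambda u: log_p_tilde(u) - np.log(Z), x)
    print("Z = %-8g max |score - analytic| = %.2e" % (Z, np.abs(s - analytic).max()))
```

The error should sit at finite-difference noise level for every `Z`, because subtracting $\log Z$ shifts the log-density by a constant and a constant has no gradient.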

The second idea turns that compass into a sampler. Langevin dynamics says: start anywhere, repeatedly step a little way along the score, and add a pinch of fresh Gaussian noise each step,

$$\mathbf{x}_{t+1} = \mathbf{x}_t + \tau\,\mathbf{s}(\mathbf{x}_t) + \sqrt{2\tau}\,\mathbf{z}, \qquad \mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I}) \tag{10}$$

Without the noise term this is plain gradient ascent on the log-density, and it would march every particle to the nearest peak and stop, giving you the single most likely point, not a sample. The injected noise converts climbing into sampling: it keeps the particle jostling around the high-density regions so that, over time, it visits each region as often as the distribution says it should. Drag the noise multiplier from zero (pure ascent, the cloud collapses onto the peaks and freezes) up through one (the cloud spreads across all the modes and keeps moving, the way a true sample should) and on to too much (the kick overpowers the score and the cloud smears past the data):

Figure 4 · Langevin dynamics
Particles climb the score toward the data while a Gaussian kick jostles them. At $k=0$ they collapse onto the peaks (that is mode-finding, not sampling); near $k=1$ they spread across every cluster and keep moving, reproducing the distribution; above $k=1$ the noise wins and they scatter. Without the kick it only finds peaks; with it, it samples.
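The contrast between $k=0$ and $k=1$ can be reproduced in a few lines. The sketch below assumes a one-dimensional target, an equal mixture of two Gaussians at $\pm\mu$, whose score is available in closed form, and runs eq (10) with the noise term multiplied by `k`. The step size, particle count, and starting cloud are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sig = 2.0, 0.5     # assumed target: 0.5*N(-mu, sig^2) + 0.5*N(+mu, sig^2)

def score(x):
    """Exact score of the two-component mixture above."""
    return (mu * np.tanh(mu * x / sig**2) - x) / sig**2

def langevin(k, n=5000, steps=2000, tau=0.01):
    x = 3.0 * rng.standard_normal(n)       # start from a broad cloud
    for _ in range(steps):
        noise = rng.standard_normal(n)
        x = x + tau * score(x) + k * np.sqrt(2.0 * tau) * noise   # eq (10), kick scaled by k
    return x

for k in (0.0, 1.0):
    x = langevin(k)
    right = x[x > 0]
    print("k=%.0f  share in right mode %.2f  spread inside it %.3f"
          % (k, right.size / x.size, right.std()))
```

With `k = 0` the spread inside each mode should collapse to essentially zero, since every particle has climbed to a peak. With `k = 1` the spread should sit near `sig`, which is what a sample from the mixture looks like.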

That leaves one gap: you do not have the score, only data. You cannot regress against a target you cannot evaluate. The rescue is denoising score matching (Vincent, 2011), and it is the same trick DDPM used. Blur the data with a known Gaussian, perturbing a clean $\mathbf{x}'$ to $\mathbf{x} = \mathbf{x}' + \sigma\mathbf{z}$. For that single Gaussian kernel the score is known exactly, it points straight back along the noise you added, so you can train a network $\mathbf{s}_{\boldsymbol{\theta}}$ to match it:

$$J(\boldsymbol{\theta}) = \mathbb{E}\left[\tfrac{1}{2}\left\lVert \mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}' + \sigma\mathbf{z}) + \frac{\mathbf{z}}{\sigma} \right\rVert^2\right] \tag{11}$$

The target is $-\mathbf{z}/\sigma$, the negative noise scaled by $1/\sigma$. That coefficient is worth slowing down on. The kernel score points straight back along the noise you added, and rewriting it in the unit noise $\mathbf{z}$ shows where the $\sigma$ goes:

$$\nabla_{\mathbf{x}}\log q(\mathbf{x}\mid\mathbf{x}') = -\frac{\mathbf{x}-\mathbf{x}'}{\sigma^2} = -\frac{\sigma\mathbf{z}}{\sigma^2} = -\frac{\mathbf{z}}{\sigma}$$

since $\mathbf{x} = \mathbf{x}' + \sigma\mathbf{z}$ makes $\mathbf{x}-\mathbf{x}' = \sigma\mathbf{z}$. The $1/\sigma^2$ on the raw $(\mathbf{x}-\mathbf{x}')$ form and the $1/\sigma$ on the unit-noise form are the same coefficient in two variables, consistent with everything else on this page, including the denoiser-to-score bridge in the next section.[3]
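The coefficient can also be checked by fitting. The sketch below assumes scalar Gaussian data $\mathcal{N}(0, s^2)$, for which the score of the blurred density is exactly $-x/(s^2+\sigma^2)$, and fits a one-parameter score model $s_\theta(x) = w\,x$ by least squares against two targets: the $-z/\sigma$ of eq (11), and the same thing with the wrong power, $-z/\sigma^2$. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
s, sigma, n = 1.5, 0.7, 200_000        # assumed data std and noise level

x_clean = s * rng.standard_normal(n)   # x' ~ N(0, s^2)
z = rng.standard_normal(n)
x = x_clean + sigma * z                # perturbed sample

def fit_slope(target):
    """Least-squares slope w for the linear score model s_theta(x) = w * x."""
    return np.dot(x, target) / np.dot(x, x)

w_true = -1.0 / (s**2 + sigma**2)      # exact score slope of N(0, s^2 + sigma^2)
w_good = fit_slope(-z / sigma)         # eq (11): target -z/sigma
w_bad = fit_slope(-z / sigma**2)       # same fit with the wrong power of sigma
print("true %.4f   target z/sigma %.4f   target z/sigma^2 %.4f" % (w_true, w_good, w_bad))
```

Only the $-z/\sigma$ target should recover the true slope; the other should be off by a factor of $1/\sigma$.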

Stare at eq (11) next to DDPM's eq (8). Both perturb a clean sample with Gaussian noise; both train a network to recover that noise (up to the $1/\sigma$ scaling). They are the same regression. The score model and the noise predictor are, mechanically, the same network with the same job. The tutorial then adds the practical touch that makes it work across all noise levels. A score fit at one fixed $\sigma$ is only accurate near that blur, useless for the rest of a sampling run that sweeps from high noise to low, so you condition a single network on $\sigma$ and train it across the whole ladder the sampler will walk (this is the "noise-conditional score network"), then sample with Langevin from high noise down to low. Why the score model and the noise predictor should coincide so exactly is what the next section settles.

A denoiser is a score

DDPM trains a denoiser. SMLD trains a score model. The two papers read nothing alike, and yet the previous section claimed their objectives are the same regression. The fact underneath is a result from the 1950s that the tutorial leans on without ever naming: Tweedie's formula.[4]

Set up the cleanest version. Take any clean signal $\mathbf{x}$ from any distribution, add Gaussian noise to get $\mathbf{z} = \mathbf{x} + \sigma\boldsymbol{\epsilon}$, and ask for the best possible guess of $\mathbf{x}$ given $\mathbf{z}$. "Best" in the least-squares sense has a known answer: among all estimators $g$, the one that minimizes $\mathbb{E}\lVert g(\mathbf{z}) - \mathbf{x}\rVert^2$ is the posterior mean $\mathbb{E}[\mathbf{x}\mid\mathbf{z}]$, the average of every clean signal that could have produced this noisy $\mathbf{z}$. (The tutorial waves at this step as "some magical" move and points to a textbook; the one line behind it is that squared error is always minimized by the conditional mean.) So an optimally trained denoiser is $D(\mathbf{z}) = \mathbb{E}[\mathbf{x}\mid\mathbf{z}]$.

Tweedie's formula says that for Gaussian noise this posterior mean and the score of the noised density are the same arrow, up to a constant:

$$\nabla_{\mathbf{z}}\log p_\sigma(\mathbf{z}) = \frac{\mathbb{E}[\mathbf{x}\mid\mathbf{z}] - \mathbf{z}}{\sigma^2} = \frac{D(\mathbf{z}) - \mathbf{z}}{\sigma^2} \tag{12}$$

Read the right side in plain terms: "your best guess of the clean signal, minus where you currently are." Of course that vector points from the noisy point toward the clean data, and dividing by $\sigma^2$ sets its length. The direction in which the noised density increases is, exactly, the direction from noisy toward denoised. The derivation is three lines and uses only the chain rule: the noised density is the clean one blurred by a Gaussian, and differentiating that Gaussian pulls down the clean factor $(\mathbf{x}-\mathbf{z})/\sigma^2$, which the integral turns into the posterior mean. Crucially, it needs only the noise to be Gaussian; the data distribution can be anything at all.

Everything hinges on this. It says estimating a score and denoising an image are the same task, so DDPM (which trains a denoiser) and SMLD (which trains a score) were always training the same object, and Tweedie's constant $1/\sigma^2$ is why eq (11) carries a $1/\sigma$ rather than a bare unit. You can check the equality by hand. The figure computes both arrows independently at any probe you drag, the denoiser direction $(D-\mathbf{z})/\sigma^2$ in amber and the score $\nabla\log p_\sigma$ in teal, and they sit exactly on top of each other, cosine pinned at 1.000, at every position and every noise level:

Figure 5 · the denoiser is the score
A Gaussian mixture blurred by $\sigma$ (the amber heat is $p_\sigma$). Drag the probe. The wide amber arrow is $(D(\mathbf{z})-\mathbf{z})/\sigma^2$ with $D$ the exact posterior mean; the thin teal arrow is the score, computed by a separate route. They coincide everywhere, cosine 1.000. That is Tweedie's formula, and it is why a denoiser and a score are the same network.
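The same check can be run without the figure. The sketch below assumes a one-dimensional Gaussian mixture with hand-picked means, weights, and component variance, none of them from the tutorial. It computes the denoiser side of eq (12) from the exact posterior mean and the score side by finite-differencing the blurred log-density, two routes that share no code beyond the mixture parameters.

```python
import numpy as np

mus = np.array([-2.0, 1.0, 3.0])    # assumed 1-D Gaussian-mixture data
w = np.array([0.3, 0.5, 0.2])       # mixture weights
s2 = 0.2**2                         # per-component data variance

def log_p_sigma(z, sigma):
    """Log-density of the data after blurring with N(0, sigma^2)."""
    v = s2 + sigma**2
    comps = w * np.exp(-(z - mus) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
    return np.log(comps.sum())

def denoiser(z, sigma):
    """Exact posterior mean E[x | z] for this mixture."""
    v = s2 + sigma**2
    resp = w * np.exp(-(z - mus) ** 2 / (2 * v))
    resp /= resp.sum()                                 # posterior weight per component
    return np.sum(resp * (mus + s2 / v * (z - mus)))   # weighted per-component means

sigma, h = 0.8, 1e-5
for z in (-3.0, -0.5, 0.4, 2.2):
    via_denoiser = (denoiser(z, sigma) - z) / sigma**2
    via_density = (log_p_sigma(z + h, sigma) - log_p_sigma(z - h, sigma)) / (2 * h)
    print("z=%5.1f  (D-z)/sigma^2 = %+.6f   d/dz log p_sigma = %+.6f"
          % (z, via_denoiser, via_density))
```

The two columns should agree to finite-difference precision at every probe, which is eq (12) evaluated pointwise.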

With the bridge in hand the two stories merge into one: a diffusion model trains a network that, read one way, denoises, and read the other way, estimates the score, two readings of a single trained object. The last view zooms all the way out and shows that the noising chain itself was a discretized version of a single continuous equation, one that contains DDPM and SMLD as two settings of the same dial.

The SDE that holds both

DDPM adds noise in $T$ discrete steps; SMLD blurs with a ladder of noise scales $\sigma_1 < \dots < \sigma_N$. Shrink the step size toward zero and both become a stochastic differential equation (SDE), the continuous-time description of a particle being pushed around by a smooth force plus constant random buffeting:

$$d\mathbf{x} = \underbrace{\mathbf{f}(\mathbf{x},t)}_{\text{drift}}\,dt + \underbrace{g(t)}_{\text{diffusion}}\,d\mathbf{w} \tag{13}$$

The drift $\mathbf{f}$ is the deterministic push; the diffusion $g$ scales the random kick $d\mathbf{w}$ (a Wiener process, the continuous limit of adding independent Gaussian steps). The forward SDE only dissolves the data into noise; what matters is running it backward. A 1982 result of Anderson gives the reverse-time SDE: the same process, played in reverse, is another SDE whose drift is corrected by the score,

$$d\mathbf{x} = \big[\,\mathbf{f}(\mathbf{x},t) - g(t)^2\,\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\,\big]\,dt + g(t)\,d\bar{\mathbf{w}} \tag{14}$$

To generate, start from noise at the final time and integrate eq (14) backward: the score term contributes a drift toward the data while the diffusion term keeps the path stochastic. The only unknown on the right is that score, the very thing Tweedie told us a denoiser computes, so the network you already trained is the drift of this equation. One coefficient trips people up elsewhere: the score in eq (14) carries the full $g(t)^2$, not half of it. The half-$g^2$ belongs to a different object, the deterministic probability-flow ODE, which the tutorial does not cover; the tutorial's generation story is this stochastic reverse SDE.[5]
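Integrating eq (14) backward can be tried on a case where the score is known exactly. The sketch below assumes the simplest forward SDE, zero drift and constant diffusion $g(t) = S$ on $t \in [0,1]$, applied to scalar Gaussian data $\mathcal{N}(m, s^2)$, so that $p_t = \mathcal{N}(m, s^2 + S^2 t)$ and its score is a one-line formula. The integrator is plain Euler-Maruyama; every name and constant is an illustrative choice, and a trained network would take the place of `score`.

```python
import numpy as np

rng = np.random.default_rng(0)
m, s = 1.0, 0.5       # assumed toy data: x(0) ~ N(m, s^2)
S = 5.0               # assumed constant diffusion g(t) = S, with drift f = 0
N, n = 5000, 5000     # time steps on [0, 1] and number of particles
dt = 1.0 / N

def score(x, t):
    """Exact score of p_t = N(m, s^2 + S^2 t); a trained network would replace this."""
    return -(x - m) / (s**2 + S**2 * t)

x = S * rng.standard_normal(n)           # approximate p_1 by N(0, S^2)
for i in range(N, 0, -1):                # integrate eq (14) from t = 1 down to t = 0
    t = i * dt
    drift = -(S**2) * score(x, t)        # f - g^2 * score, with f = 0
    # a backward step of size dt subtracts the drift and adds a fresh Gaussian kick
    x = x - drift * dt + S * np.sqrt(dt) * rng.standard_normal(n)

print("mean %.3f (target %.1f)   std %.3f (target %.1f)" % (x.mean(), m, x.std(), s))
```

The cloud starts about ten times wider than the data and should finish close to the data's mean and spread, with a small bias from the finite step size and from starting at $\mathcal{N}(0, S^2)$ instead of the true $p_1$.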

With the reverse SDE as the generator, the unification is one line: DDPM and SMLD are the same eq (13) with different choices of $\mathbf{f}$ and $g$.
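Written out, the two settings look like this. The table is the standard parameterization from Song et al. (2021), given here as a reminder and not quoted from the tutorial; $\beta(t)$ is the continuous-time noise rate that plays the role of $1-\alpha_t$, and $\sigma(t)$ is the continuous-time version of SMLD's noise ladder.

$$
\begin{array}{lll}
 & \mathbf{f}(\mathbf{x},t) & g(t) \\ \hline
\text{VP (DDPM)} & -\tfrac{1}{2}\,\beta(t)\,\mathbf{x} & \sqrt{\beta(t)} \\
\text{VE (SMLD)} & \mathbf{0} & \sqrt{\dfrac{d\,\sigma(t)^2}{dt}}
\end{array}
$$

The VP drift shrinks the signal while noise is added, which is what holds the total variance near 1; the VE setting has no drift, so the signal stays put and the variance only grows.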

Same forward equation, two settings; same reverse equation, same trained score. Drag the time and watch both run: the VP signal slides toward the center and merges into a unit Gaussian with its total variance held near 1, while the VE signal stays put and drowns under a cloud whose variance climbs to $1+\sigma^2$. The variance is tracked two different ways, but both land at the same featureless Gaussian you can sample and then reverse:

Figure 6 · two conventions, one process
The same noising, two ways. VP (DDPM): the signal mean is shrunk by $\sqrt{\bar{\alpha}_t}$ while noise fills in, total variance staying near 1, ending at a unit Gaussian. VE (SMLD): the mean is fixed and the variance explodes to $1+\sigma^2$. The amber ticks mark the clean data. Both end at a Gaussian, which is why the SDE view treats them as one equation.

The SDE view also hands over a better sampler. DDPM's reverse step is only a crude first-order solver of eq (14), so its samples carry numerical error. The predictor-corrector sampler fixes that: each step takes one reverse-SDE step (the predictor), then runs a few Langevin steps from eq (10) at the current noise level (the corrector) to nudge the sample back onto the true distribution before moving on. Predictor plus corrector, a numerical SDE step refined by score-based MCMC, the two halves of the page meeting in one algorithm.

Four views, one model

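A diffusion model destroys data with Gaussian noise and learns to undo one step of it. That single undo-step has four faithful descriptions, and the value of the tutorial is that it makes you fluent in all four: the variational bound of the VAE, the noise predictor of DDPM, the score field of score matching, and the SDE that contains both.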

The four are not rival methods to choose between. They are four coordinate systems on one object, and a quantity that is mysterious in one, the intractable normalizer, the per-step ELBO, the schedule of scalars, becomes a triviality in another. That is why the tutorial is worth the climb: it gets you to stop seeing four methods and start seeing one. Everything built on top, latent diffusion, classifier-free guidance, the distilled few-step samplers, is a variation played on this one theme.

Provenance: verified against primary literature
Kingma & Welling (2014): VAE, the ELBO, and the reparameterization trick (the variational foundation).
Ho, Jain, Abbeel (2020): DDPM, the forward marginal and the predict-the-noise objective.
Song & Ermon (2019): Score matching with annealed Langevin dynamics (SMLD / NCSN).
Song et al. (2021): The SDE view, forward/reverse-time SDE, VP = DDPM, VE = SMLD.
Vincent (2011); Anderson (1982): Denoising score matching; the reverse-time SDE.
Robbins / Miyasawa / Efron: Tweedie’s formula, a denoiser is a score (the tutorial leaves it unnamed).
Correction: The tutorial's denoising score-matching loss (eqs 84-86) prints the target as z/σ². The correct coefficient is z/σ: the perturbation is x = x' + σz, so the kernel score −(x−x')/σ² equals −z/σ, not −z/σ². We teach the corrected loss ‖s_θ(x'+σz) + z/σ‖².

Questions you might still have

Adding noise destroys the image. How can that ever help you generate one?
You never use the forward noising to generate. You use it as free training data: corrupt a clean image yourself, so you know the answer, and train a network to undo one notch of corruption. Generation runs that learned undo-step backward, from pure noise up to a clean sample. Destroying is easy and needs no learning; all the intelligence is in the reverse.

Are DDPM and score-based models the same thing, or just similar?
The same thing. Ho’s DDPM predicts the noise; Song & Ermon’s SMLD estimates the score; Tweedie’s formula says a denoiser and a score are one arrow up to a constant. The SDE view (Song et al. 2021) makes it exact: DDPM is the variance-preserving SDE, SMLD is the variance-exploding SDE, and both run the same reverse-time equation to sample.

Why predict the added noise instead of the clean image directly?
Mathematically they are interchangeable: the clean image, the posterior mean, and the noise are three labels for the same regression target. Empirically they are not neutral. Ho et al. found predicting the noise trained more stably and gave better early samples than predicting the clean image, which is why the noise target became standard.

Does the tutorial cover deterministic sampling (DDIM / the probability-flow ODE)?
Partly. DDPM’s own reverse chain is stochastic — it injects fresh Gaussian noise at almost every step — but §2.6 of the tutorial covers DDIM, an accelerated, non-Markovian sampler whose injected-noise level σt can be dialed to zero for a fully deterministic update. So the deterministic DDIM sampler is here. What the tutorial does not cover is the probability-flow ODE, the deterministic twin that reproduces the marginals of the reverse-time SDE without injected noise — that is the contribution of Song et al. 2021, which we cover on the score-based SDE page.

Is this the same idea as flow matching?
Closely related. Flow matching learns a velocity field that transports noise to data along straight-ish paths, and the diffusion ODE is one instance of that picture. The tutorial here stays in the noise-then-denoise framing; flow matching is its own road to the same destination.

Footnotes & further reading

  1. Variance preservation is a fixed-point property, not an identity for any input: the total variance stays exactly 1 only when the data already has unit variance, and otherwise relaxes toward 1, which is why the pixels are normalized first. We follow the tutorial's (and Ho et al.'s) convention: $\alpha_t$ is the per-step scale and $\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$ the cumulative product, matching the DDPM page.
  2. The word covers two gradients: the tutorial defines the Stein score $\nabla_{\mathbf{x}}\log p(\mathbf{x})$ (§3.2, Eqn 3.3), with respect to the data point, separately from the ordinary score $\nabla_{\boldsymbol{\theta}}\log p(\mathbf{x})$ (Eqn 3.4), with respect to the parameters.
  3. Denoising score matching: Pascal Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011). The canonical target for the kernel $\mathcal{N}(\mathbf{x};\mathbf{x}',\sigma^2\mathbf{I})$ is $(\mathbf{x}'-\mathbf{x})/\sigma^2 = -\mathbf{z}/\sigma$, the same target the tutorial prints in Eqn 3.11.
  4. Tweedie's formula is usually credited to Robbins (1956) via a personal communication from Maurice Tweedie, with the Gaussian case also in Miyasawa (1961); see Bradley Efron, Tweedie's Formula and Selection Bias (2011). The tutorial uses the identity but never names it.
  5. The score / SDE / reverse-time unification: Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, Score-Based Generative Modeling through Stochastic Differential Equations (2021), explained on our score-based SDE page. The reverse-time SDE is Anderson (1982); the deterministic probability-flow ODE twin is the same paper's, and the tutorial omits it.
  6. The paper: Stanley H. Chan, Tutorial on Diffusion Models for Imaging and Vision (Purdue University, 2024). The VAE foundation is Kingma & Welling, Auto-Encoding Variational Bayes (2014); DDPM is Ho, Jain & Abbeel, Denoising Diffusion Probabilistic Models (2020); flow matching, a kindred road to the same place, is on our flow-matching page.