Verified · arXiv:2011.13456 · 26 min
Diffusion · Theory

Score-Based Generative Modeling through Stochastic Differential Equations

Noising data is a smooth diffusion you can run backward to make new data.

Take the noising in a diffusion model to its continuous limit and it becomes one stochastic differential equation. Reversing that equation generates data, and it needs only the score: at every point, the direction toward where data is dense.

Explaining the paper: Score-Based Generative Modeling through Stochastic Differential Equations. Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole · Stanford & Google Brain · ICLR 2021 · arXiv:2011.13456

Adding noise to a picture until it is static takes no skill. Removing it until a new picture appears is the whole problem of generative modeling.

By late 2020 two families of generative models had converged on the same recipe without anyone saying so out loud. One, score matching with Langevin dynamics, took a clean image, added Gaussian noise at a ladder of growing scales, and trained a network to estimate the score at each rung, then walked back down the ladder with a noisy sampler. The other, denoising diffusion, defined a chain that corrupted an image one small step at a time and trained a network to undo each step. The two looked like different animals. They were trained differently, sampled differently, and motivated by different math.

This paper's move is to take the ladder of noise scales and let the number of rungs go to infinity. A ladder with infinitely many, infinitely close rungs is no longer a ladder; it is a continuous process, and a continuous process driven by random noise has a standard name: a stochastic differential equation, an SDE. Once the noising is one continuous SDE, generation is the same SDE run backward in time, and a classic result says exactly how to run a diffusion backward. The reverse process needs one ingredient, the score of the noised data at each instant, which is precisely what both earlier families were already learning.

Seeing it this way buys two things. The two rival methods are discretizations of two specific SDEs, so the field's split was a matter of bookkeeping. And the continuous picture hands you machinery the discrete one hid: a deterministic sampler that computes exact likelihoods, a way to clean up a sampler's errors mid-run, and a way to steer generation toward a class or fill in a missing region, all from a single model trained once.

A few ideas carry the argument: noising taken to a continuous SDE, the reverse-time SDE that needs only the score, learning that score by plain denoising, and the deterministic ODE that the same score yields. Each is approachable on its own.

From a ladder of noise to a continuous flow

Picture every image a model could produce as a point in an enormous space. A modest photo has hundreds of thousands of pixels, so it is a point in a space of hundreds of thousands of dimensions, and real photos occupy almost none of it. Scramble the pixels at random and you get static, essentially never a face. The real data sits on a thin, folded, lower-dimensional surface floating in a vast emptiness. Generation is the problem of landing on that surface starting from somewhere random in the void.

The paper studies that surface by first watching it dissolve. Define a process $\mathbf{x}(t)$ over a time variable $t$ running from $0$ to $T$. At $t=0$ the process is a real data sample; as $t$ grows, noise is mixed in continuously until, at $t=T$, every trace of the data is gone and what remains is a fixed, structureless cloud, the prior, a plain Gaussian you can sample from trivially. Written as an SDE, the rule for the forward process is:

$$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)\,dt + g(t)\,d\mathbf{w} \tag{5}$$

Read it as a recipe for one infinitesimal step. The first term, $\mathbf{f}(\mathbf{x}, t)\,dt$, is the drift: a deterministic nudge that depends on where you are. The second term, $g(t)\,d\mathbf{w}$, is the diffusion: a kick of fresh Gaussian noise, where $\mathbf{w}$ is Brownian motion (the continuous-time random walk) and the scalar $g(t)$ sets how hard the kick is at time $t$. Drift plus a random kick, applied over and over in vanishingly small steps, is all an SDE is. Crucially, this forward process is prescribed: it has no learned parameters and depends on nothing about the data. We choose $\mathbf{f}$ and $g$ so the data reliably ends up as the simple prior, and that is the only requirement.
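The "drift plus a random kick" recipe can be sketched as an Euler-Maruyama loop. This is a minimal illustration, not the paper's code: the helper name `euler_maruyama`, the constant `BETA = 10`, the starting point 3.0, and the particle and step counts are all assumptions chosen so the example runs quickly with the standard library alone.

```python
import math
import random


def euler_maruyama(x0, drift, diffusion, T=1.0, n_steps=200, rng=None):
    """Simulate dx = drift(x, t) dt + diffusion(t) dw for one scalar particle."""
    if rng is None:
        rng = random.Random(0)
    dt = T / n_steps
    x, t = x0, 0.0
    path = [x]
    for _ in range(n_steps):
        kick = diffusion(t) * math.sqrt(dt) * rng.gauss(0.0, 1.0)  # g(t) dw
        x = x + drift(x, t) * dt + kick                            # f dt + g dw
        t += dt
        path.append(x)
    return path


# Assumed example coefficients: a drift toward the origin with constant BETA.
BETA = 10.0


def vp_drift(x, t):
    return -0.5 * BETA * x


def vp_diffusion(t):
    return math.sqrt(BETA)


rng = random.Random(0)
# Every particle starts at the same "data point" x(0) = 3.
finals = [euler_maruyama(3.0, vp_drift, vp_diffusion, rng=rng)[-1] for _ in range(2000)]
mean = sum(finals) / len(finals)
var = sum((x - mean) ** 2 for x in finals) / len(finals)
print(f"mean at T: {mean:.3f}, variance at T: {var:.3f}")
```

With these assumed coefficients the particles should end up spread roughly like a unit Gaussian around the origin, whatever point they started from, which is the sense in which the forward process forgets the data.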

Drag time forward and watch a set of data clusters come apart into the prior. One particle's actual jagged path is drawn on top, so you can see the random walk underneath the smooth spreading of the cloud:

Figure 1 · the forward SDE
The forward SDE carries data (the amber clusters) to a fixed Gaussian prior as time $t$ runs from 0 to 1. Drag $t$ and the structure washes out into one featureless blob at the origin. The pale line is a single particle's real stochastic trajectory: drift plus a random kick at every step.

The forward direction destroys information and needs no learning at all. What takes learning is the return trip, from the featureless prior at $t=T$ back onto the data surface at $t=0$.

Reverse the flow, and you only need the score

Running a random process backward sounds hopeless. Stir cream into coffee and you cannot un-stir it. But a result from Brian Anderson in 1982 says something stronger holds for diffusions: the time-reversal of a diffusion is itself a diffusion, with its own drift and the same size of random kick, and the reverse drift can be written down exactly. For the forward SDE (5), the reverse-time process is:

$$d\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - g(t)^2\,\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\big]\,dt + g(t)\,d\bar{\mathbf{w}} \tag{6}$$

Here $p_t(\mathbf{x})$ is the distribution of the noised data at time $t$, the cloud you watched the clusters dissolve into, and $\bar{\mathbf{w}}$ is Brownian motion for time flowing backward from $T$ to $0$ (the $dt$ here is an infinitesimal negative step). Next to the forward SDE, almost everything reappears: the same $\mathbf{f}$ and the same $g(t)$ on the noise. The drift is no longer just $\mathbf{f}$, though. One new object is subtracted from it, the term $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$, and that single subtraction flips the forward process into its reverse.

That term is called the score. It is the gradient of the log-density of the noised data with respect to the data point itself (the Stein score: a gradient in data space, as opposed to the Fisher score from statistics, the gradient with respect to a model's parameters). At any location $\mathbf{x}$ it points in the direction the noised density climbs fastest, toward wherever data is densest nearby, a compass needle for generation. If you knew it at every noise level, generation would be mechanical: drop a particle in the prior and let the reverse SDE (6) carry it home.

Two features make the score the right object to chase. It ignores normalizing constants. A density carries a $1/Z$ out front, where $Z$ is the total mass of the unnormalized density, its integral over the entire high-dimensional space, and that integral is rarely computable, which is exactly why likelihood-based modeling is hard: methods that need it, like normalizing flows, bend their architecture around staying able to compute it. The score never has to, because the log turns the quotient into a difference before the gradient is taken:

$$\nabla_{\mathbf{x}} \log\!\big(p/Z\big) = \nabla_{\mathbf{x}}\log p - \nabla_{\mathbf{x}}\log Z$$

and the second term vanishes because $Z$ does not depend on $\mathbf{x}$. The score deletes $Z$ before it can cause trouble. It also behaves the way intuition demands as the noise level changes: at low noise the density is sharp, concentrated tightly on the data, so the score points firmly at the nearest cluster; at high noise the data has blurred into one broad hill, so the score points weakly toward the center of mass.
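The cancellation of the normalizing constant can be checked numerically. The sketch below uses an assumed one-dimensional unnormalized density, $\exp(x^2/2 - x^4/4)$, picked only because its score $x - x^3$ is easy to write down; the helper names are hypothetical. Dividing the density by any constant shifts its log by a constant, and a finite-difference gradient does not see the shift.

```python
def log_unnormalized(x):
    """Log of an assumed unnormalized double-well density exp(x^2/2 - x^4/4)."""
    return 0.5 * x**2 - 0.25 * x**4


def score_fd(log_density, x, h=1e-5):
    """Central finite-difference estimate of d/dx log_density(x)."""
    return (log_density(x + h) - log_density(x - h)) / (2.0 * h)


def scaled(log_density, log_z):
    """The same density divided by a constant Z = exp(log_z)."""
    return lambda x: log_density(x) - log_z


for x in (-1.5, 0.3, 2.0):
    exact = x - x**3                                   # analytic score
    raw = score_fd(log_unnormalized, x)
    normalized = score_fd(scaled(log_unnormalized, log_z=7.0), x)
    print(f"x={x:+.1f}  exact={exact:+.4f}  raw={raw:+.4f}  with Z={normalized:+.4f}")
```

The three columns should agree to within finite-difference error, for any value of `log_z`.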

Play with the noise level and watch the field stiffen and relax: sharp and decisive near the data when noise is low, soft and global when noise is high:

Figure 2 · the score field
The score $\nabla\log p_\sigma(\mathbf{x})$ is a vector field: at every point it aims where the noised density is densest. At low noise it points firmly at the nearest data cluster; at high noise the clusters blur together and it points gently to the global center. The noise level is large early in generation and shrinks toward zero at the end.

So the entire problem has been squeezed down to one estimation task: get the score $\nabla_{\mathbf{x}}\log p_t(\mathbf{x})$ at every noise level, plug it into the reverse SDE, and integrate from prior to data. Which would still be daunting, except that estimating the score is a plain supervised-learning problem in disguise.

Learning the score by denoising

You cannot directly supervise "the score," because you have no table of ground-truth arrows to regress against. The trick that has been known since Hyvärinen's score matching (2005) is that you do not need one. There is an objective whose minimizer is exactly the score, and a denoising version of it, due to Vincent (2011), that you can write down and optimize directly. Train a time-dependent network $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}, t)$ to minimize:

$$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}}\,\mathbb{E}_t\Big\{\lambda(t)\,\mathbb{E}_{\mathbf{x}(0)}\,\mathbb{E}_{\mathbf{x}(t)\mid\mathbf{x}(0)}\big[\,\big\lVert \mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}(t), t) - \nabla_{\mathbf{x}(t)}\log p_{0t}(\mathbf{x}(t)\mid\mathbf{x}(0))\big\rVert_2^2\,\big]\Big\} \tag{7}$$

The expression is long but the idea is short. Pick a clean sample $\mathbf{x}(0)$ and a random time $t$, add the prescribed noise to get $\mathbf{x}(t)$, and ask the network to match the score of the known noising kernel $p_{0t}(\mathbf{x}(t)\mid\mathbf{x}(0))$, the single Gaussian that carried this particular $\mathbf{x}(0)$ to this $\mathbf{x}(t)$. The reason this works is Vincent's result, and the intuition behind it is short: the full noised density $p_t$ is the mixture of every per-sample Gaussian, so its true score at a point is the average of the individual kernel scores, weighted by how likely each clean sample was to produce that noised point. Any single-sample target is a noisy draw from that average, so regressing on it drives the network to the true score in expectation. The single-sample and full-density objectives differ only by a constant that does not depend on $\boldsymbol{\theta}$, so they share a gradient, and the cheap target trains the expensive one.

And the cheap target is genuinely cheap. As long as the drift $\mathbf{f}$ is affine in $\mathbf{x}$ (linear plus an offset), the noising kernel is a Gaussian whose mean and variance have closed forms, so its score is a one-liner: a Gaussian of mean $\boldsymbol{\mu}$ and standard deviation $\sigma$ has score $(\boldsymbol{\mu} - \mathbf{x})/\sigma^2$, which just points from your noised sample back toward the clean mean. Building the training target is one subtraction and one divide. The weighting $\lambda(t)$ rebalances the noise levels: the squared-error target is huge at low noise and tiny at high noise, so setting $\lambda(t)$ inversely proportional to the expected squared magnitude of the target score at each level flattens those scales, and every level pulls on $\boldsymbol{\theta}$ with comparable force.

# learn the time-dependent score by denoising
x0     = sample_data()                 # a clean sample x(0)
t      = uniform(eps, T)               # a random time; small eps > 0 (assumed) keeps std away from 0
mean, std = perturb_kernel(x0, t)      # closed form: affine drift
x_t    = mean + std * randn_like(x0)   # the noised sample x(t)
target = -(x_t - mean) / std**2        # score of that Gaussian kernel
loss   = lam(t) * mse(s_theta(x_t, t), target)
loss.backward()                        # one gradient step on theta

The training loop has three moves: corrupt a sample by a known Gaussian, regress the network onto the direction back toward the clean mean, and repeat across all times. No sampling during training, no adversary, no chain to unroll. A single regression objective leaves you with a network that approximates the score at every noise level, the one ingredient the reverse SDE (6) requires.
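Vincent's averaging argument can be checked on a toy problem: the score of the full noised density should equal the posterior-weighted average of the per-sample kernel scores. The sketch below assumes a three-point dataset and a fixed Gaussian kernel of width `SIGMA`; both are illustrative choices, and the function names are hypothetical.

```python
import math

DATA = [-2.0, 0.5, 3.0]   # assumed toy dataset of clean samples x(0)
SIGMA = 1.2               # assumed noise level of the kernel N(x(0), SIGMA^2)


def kernel_pdf(x_t, x0):
    """Gaussian noising kernel p_0t(x_t | x0)."""
    z = (x_t - x0) / SIGMA
    return math.exp(-0.5 * z * z) / (SIGMA * math.sqrt(2.0 * math.pi))


def kernel_score(x_t, x0):
    """Score of the kernel with respect to x_t: (mean - x) / sigma^2."""
    return (x0 - x_t) / SIGMA**2


def averaged_kernel_score(x_t):
    """Average of per-sample kernel scores, weighted by p(x0 | x_t)."""
    weights = [kernel_pdf(x_t, x0) for x0 in DATA]
    total = sum(weights)
    return sum(w / total * kernel_score(x_t, x0) for w, x0 in zip(weights, DATA))


def marginal_score_fd(x_t, h=1e-5):
    """Finite-difference score of the full noised density p_t (the mixture)."""
    def log_p(x):
        return math.log(sum(kernel_pdf(x, x0) for x0 in DATA) / len(DATA))
    return (log_p(x_t + h) - log_p(x_t - h)) / (2.0 * h)


for x_t in (-3.0, 0.0, 1.7):
    print(x_t, averaged_kernel_score(x_t), marginal_score_fd(x_t))
```

The two score columns should match to within finite-difference error. A network trained on single-sample targets sees only one term of that average at a time, and the squared-error loss is what pulls it to the weighted mean.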

One framework, two conventions: VE and VP

With the continuous picture in hand, the two rival methods stop looking like rivals. Each is the continuous limit of a particular noising scheme, which is to say each is a particular choice of $\mathbf{f}$ and $g$ in the forward SDE (5).

The score-matching line (Song & Ermon, 2019) added noise without ever touching the signal:

$$\mathbf{x}_i = \mathbf{x}_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2}\,\mathbf{z}_{i-1} \tag{8}$$

piling on larger and larger Gaussian noise at each step. Taken to the limit, that chain has no drift at all, only a growing diffusion:

$$d\mathbf{x} = \sqrt{\tfrac{d[\sigma^2(t)]}{dt}}\,d\mathbf{w} \tag{9}$$

Because nothing pulls the signal back, the variance of the process grows without bound as time runs on, so the paper names this the Variance Exploding (VE) SDE. The prior it heads toward is a very wide Gaussian, wide enough to swamp any data.

DDPM (Ho et al., 2020) instead shrank the signal a little on every step as it added noise:

$$\mathbf{x}_i = \sqrt{1-\beta_i}\,\mathbf{x}_{i-1} + \sqrt{\beta_i}\,\mathbf{z}_{i-1} \tag{10}$$

The $\sqrt{1-\beta_i}$ factor is a deliberate counterweight. Scaling the data by $\sqrt{1-\beta_i}$ multiplies its variance by $(1-\beta_i)$, the fresh noise adds back $\beta_i$, and the two sum to one, so a unit-variance input stays exactly unit-variance. Its continuous limit therefore carries a drift that pulls toward the origin:

$$d\mathbf{x} = -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,dt + \sqrt{\beta(t)}\,d\mathbf{w} \tag{11}$$

With unit-variance data the total variance stays pinned at 1 for all time, so this is the Variance Preserving (VP) SDE, and its prior is the standard unit Gaussian. DDPM's own step-by-step sampler is one particular discretization of this SDE run in reverse. The choice between VE and VP is the choice between leaving the signal alone while noise floods in, or shrinking the signal so the cloud never widens.

The paper also proposes a third option, the sub-VP SDE, built by damping the VP noise term so the variance dips below the VP curve through the middle of the trip:

$$d\mathbf{x} = -\tfrac{1}{2}\beta(t)\,\mathbf{x}\,dt + \sqrt{\beta(t)\big(1 - e^{-2\int_0^t \beta(s)\,ds}\big)}\,d\mathbf{w} \tag{12}$$

That extra factor vanishes at $t=0$ and climbs toward 1, so sub-VP injects less noise than VP early on while still reaching the same prior. Likelihood is dominated by the low-noise end, where the fine detail of an image lives, and a gentler early schedule lets the score network spend its capacity resolving that detail instead of fighting noise it did not need to add. That is why sub-VP posts the best bits-per-dim of the three, and why it is the schedule to reach for when exact likelihood is the goal.

Scrub time and compare the three. The claim to check is the one the names make: VE's variance runs off the top of the chart, VP sits exactly on the variance-equals-1 line, and sub-VP dips to about $0.75$ through the middle before returning to 1, never rising above VP:

Figure 3 · VE, VP, and sub-VP
The variance of the noised data over time, on a log scale. VE (the continuous limit of the score-matching line) explodes from 1 toward thousands. VP (the limit of DDPM) is pinned flat at 1. sub-VP dips to about 0.75 through the middle and returns to 1, staying at or below VP everywhere. Scrub $t$ to read each value.
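The three variance curves can be reproduced from the closed-form noising kernels. The sketch below is an illustration under stated assumptions: a linear $\beta(t)$ and a geometric $\sigma(t)$, with schedule constants picked for the example, unit-variance data, and hypothetical helper names. Each kernel function returns the factor multiplying the clean sample and the standard deviation of the added noise.

```python
import math

# Assumed schedule constants, for illustration only.
BETA_MIN, BETA_MAX = 0.1, 20.0      # linear beta(t) for VP and sub-VP
SIGMA_MIN, SIGMA_MAX = 0.01, 50.0   # geometric sigma(t) for VE


def beta_integral(t):
    """Integral of beta(s) = BETA_MIN + s * (BETA_MAX - BETA_MIN) from 0 to t."""
    return BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t


def ve_kernel(t):
    """(mean scale, std) of the noising kernel for the VE SDE (9)."""
    sigma_t = SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t
    return 1.0, math.sqrt(sigma_t**2 - SIGMA_MIN**2)


def vp_kernel(t):
    """(mean scale, std) of the noising kernel for the VP SDE (11)."""
    decay = math.exp(-beta_integral(t))
    return math.sqrt(decay), math.sqrt(1.0 - decay)


def subvp_kernel(t):
    """(mean scale, std) of the noising kernel for the sub-VP SDE (12)."""
    decay = math.exp(-beta_integral(t))
    return math.sqrt(decay), 1.0 - decay


def total_variance(kernel, t, data_var=1.0):
    """Variance of x(t) when the clean data has variance data_var."""
    scale, std = kernel(t)
    return scale**2 * data_var + std**2


ts = [i / 1000 for i in range(1001)]
vp_values = [total_variance(vp_kernel, t) for t in ts]
subvp_values = [total_variance(subvp_kernel, t) for t in ts]
print("VP range:", min(vp_values), max(vp_values))
print("sub-VP minimum:", min(subvp_values))
print("VE at t=1:", total_variance(ve_kernel, 1.0))
```

Writing $u = e^{-\int_0^t \beta}$, the sub-VP total is $u + (1-u)^2 = 1 - u + u^2$, which bottoms out at $0.75$ when $u = 1/2$ regardless of the schedule constants, so the dip does not depend on the assumed values.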

This is more than tidiness. The same code, swapping one SDE for another, sweeps a design space the discrete methods kept apart, and the paper uses that freedom to find architectures and schedules that beat both predecessors. The split between score matching and diffusion was never a fork in the road, only two stops along one continuous dial.

The probability-flow ODE

The reverse SDE (6) is random: each step adds a fresh kick, so two runs from the same prior sample trace different paths. The paper's next result is that every such diffusion has a deterministic shadow, an ordinary differential equation with no noise term, whose cloud of outcomes matches the SDE's at every single time:

$$d\mathbf{x} = \big[\mathbf{f}(\mathbf{x}, t) - \tfrac{1}{2}g(t)^2\,\nabla_{\mathbf{x}}\log p_t(\mathbf{x})\big]\,dt \tag{13}$$

The paper calls this the probability flow ODE. It is identical to the reverse SDE except for two changes: the random $g(t)\,d\bar{\mathbf{w}}$ term is gone, and the coefficient on the score has dropped from $g(t)^2$ to exactly half, $\tfrac{1}{2}g(t)^2$. That half is not a typo, and it traces back to the conservation law the noised density obeys, the Fokker-Planck equation: density is never created or destroyed, it only flows. That equation has two parts, a transport term for the drift and a diffusion term for the random spreading, and the diffusion term carries a literal $\tfrac{1}{2}$, the factor that always rides with a second derivative. Rewriting that second-order spreading as a first-order flow, a smooth velocity field in place of random kicks, moves that same $\tfrac{1}{2}$ onto the score. The half is inherited from the diffusion term. So the reverse SDE keeps the full $g(t)^2$ and its random kick, while the ODE keeps half the score and no kick at all. Both push the same density around, which is why they agree at every noise level.
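The agreement between the two samplers can be checked on a toy case where the score is known exactly. The sketch below assumes one-dimensional Gaussian data $N(M, S^2)$ and a VP SDE with constant $\beta$, so every noised density is Gaussian and `score` is a formula rather than a network. All names and constants are illustrative assumptions, and the Euler steps are a simple discretization rather than the paper's solvers.

```python
import math
import random

# Assumed toy setup: 1-D data N(M, S^2), VP SDE with constant beta, T = 1.
M, S, BETA, T = 2.0, 0.5, 10.0, 1.0


def marginal(t):
    """Mean and variance of the noised density p_t (Gaussian in this toy case)."""
    a = math.exp(-0.5 * BETA * t)
    return M * a, S * S * a * a + 1.0 - a * a


def score(x, t):
    """Exact score of p_t; a trained network would stand in for this."""
    mean, var = marginal(t)
    return -(x - mean) / var


def sample_backward(x, n_steps=500, stochastic=True, rng=None):
    """Euler steps from t = T down to 0 for the reverse SDE (6) or the ODE (13)."""
    dt = T / n_steps
    for k in range(n_steps):
        t = T - k * dt
        coeff = 1.0 if stochastic else 0.5        # g^2 on the SDE, g^2 / 2 on the ODE
        drift = -0.5 * BETA * x - coeff * BETA * score(x, t)
        x = x - drift * dt                        # stepping backward in time
        if stochastic:
            x += math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
    return x


def stats(xs):
    """Sample mean and standard deviation."""
    mean = sum(xs) / len(xs)
    return mean, math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))


rng = random.Random(0)
prior = [rng.gauss(0.0, 1.0) for _ in range(1000)]
sde_out = [sample_backward(x, stochastic=True, rng=rng) for x in prior]
ode_out = [sample_backward(x, stochastic=False) for x in prior]
print("reverse SDE mean, std:", stats(sde_out))
print("prob. flow ODE mean, std:", stats(ode_out))
```

Both clouds should land near mean 2 and standard deviation 0.5, even though the SDE paths are random and the ODE maps each prior point to one fixed output.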

Watch a stochastic run and its deterministic twin side by side. They take visibly different paths, the SDE jagged and the ODE smooth, but the cloud of points looks the same in both panels at every moment:

Figure 4 · same marginals, different paths
Both panels generate from the same prior samples toward the same three data clusters. The reverse SDE (left) injects a fresh kick every step and traces jagged paths; the probability-flow ODE (right) is deterministic and smooth. Press play or scrub: the cloud of points stays statistically identical between the two, even though no single path matches.

A deterministic process you can run forward and backward is a powerful thing to have. Because the ODE is smooth and reversible, it is an instance of a neural ODE, a continuous flow whose velocity is a learned network, and that connection hands over exact likelihood. As the flow squeezes or stretches a tiny blob of probability mass, the density inside it must rise or fall to keep the total mass at one; track that local volume change all the way along the trajectory and you accumulate the exact log-probability of the image, with no bound and no sampling. The same determinism gives a unique latent code for any input, lets you interpolate smoothly in that latent, and lets an adaptive ODE solver pick its own step size. None of that is available from a noisy sampler. For the VE case in particular, where there is no drift and $g(t)^2 = d[\sigma^2(t)]/dt$, the ODE (13) simplifies to a clean rule in the noise level alone:

$$\frac{d\mathbf{x}}{d\sigma} = -\sigma\,\nabla_{\mathbf{x}}\log p_\sigma(\mathbf{x})$$

Many later fast samplers integrate an ODE of this form. (It is a direct specialization of the ODE (13), not a separate equation the paper writes in this form.)
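The exact-likelihood claim for the probability-flow ODE can also be checked on a toy case. The sketch below assumes one-dimensional Gaussian data and a VP SDE with constant $\beta$, so the velocity field and its divergence are closed-form; in a real model the velocity comes from the score network and the divergence has to be computed or estimated. It carries a point forward from $t=0$ to $T$, accumulates the divergence, and adds the prior log-density at the endpoint. Names and constants are illustrative assumptions.

```python
import math

# Assumed toy setup: 1-D data N(M, S^2), VP SDE with constant beta, T = 1.
M, S, BETA, T = 2.0, 0.5, 10.0, 1.0


def marginal(t):
    """Mean and variance of the noised density p_t."""
    a = math.exp(-0.5 * BETA * t)
    return M * a, S * S * a * a + 1.0 - a * a


def velocity_and_divergence(x, t):
    """Probability-flow velocity f - (g^2 / 2) * score, and its derivative in x."""
    mean, var = marginal(t)
    velocity = -0.5 * BETA * x + 0.5 * BETA * (x - mean) / var
    divergence = -0.5 * BETA + 0.5 * BETA / var
    return velocity, divergence


def log_likelihood(x0, n_steps=2000):
    """log p_0(x0) = log p_T(x(T)) + integral of the divergence along the path."""
    dt = T / n_steps
    x, accumulated = x0, 0.0
    for k in range(n_steps):
        velocity, divergence = velocity_and_divergence(x, k * dt)
        x += velocity * dt
        accumulated += divergence * dt
    log_prior = -0.5 * math.log(2.0 * math.pi) - 0.5 * x * x   # N(0, 1) at time T
    return log_prior + accumulated


def exact_log_density(x0):
    """Closed-form log-density of the assumed data distribution N(M, S^2)."""
    return -0.5 * math.log(2.0 * math.pi * S * S) - 0.5 * ((x0 - M) / S) ** 2


for x0 in (1.5, 2.0, 2.3):
    print(x0, log_likelihood(x0), exact_log_density(x0))
```

The two columns should agree up to the Euler discretization error and the small mismatch between the true density at $T$ and the unit-Gaussian prior.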

Predict, then correct

Solving the reverse SDE on a computer means taking finite steps, and finite steps accumulate error. After a predictor step (any numerical SDE solver) the sample has drifted slightly off where it should be: its distribution no longer quite matches the true noised density $p_t$ for the current time. You might reach for smaller steps, but finer integration only tracks the trajectory more closely; it never checks whether the cloud of samples still has the right shape. The paper adds a second move that checks the shape directly.

The fix uses something the solver does not: we already have the score, and the score is everything Langevin dynamics needs to sample a distribution directly. A Langevin step nudges a point uphill along the score and adds a calibrated jitter:

$$\mathbf{x} \leftarrow \mathbf{x} + \epsilon\,\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}, t) + \sqrt{2\epsilon}\,\mathbf{z}$$

The two terms balance. The score term alone is gradient ascent on the density, which would march every point to the single highest peak and collapse the cloud. The $\sqrt{2\epsilon}\,\mathbf{z}$ noise pushes back out, and at the calibrated size the two settle into the full spread of $p_t$ rather than its peak. That balance is why iterating the step has $p_t$ as its stationary distribution, so a few steps pull a sample back onto the correct density without advancing the clock. The paper pairs the two roles into a predictor-corrector sampler: the predictor takes one step down in time, then the corrector runs a handful of Langevin steps to re-settle the sample onto $p_t$ before the next step. The framing also reveals the old samplers as special cases. The score-matching method was predictor-free, all corrector (an identity predictor plus annealed Langevin). DDPM was corrector-free, all predictor: its ancestral sampler draws each cleaner step directly from the previous noisier one down the chain, with nothing in between to re-settle the distribution (an identity corrector). The predictor-corrector sampler is the general recipe both had been approximating.

# Predictor-Corrector sampling of the reverse SDE
def pc_sample(N, M, eps):
    x = sample_prior()                       # x(T) ~ prior, e.g. N(0, sigma_max^2 I) for VE
    for i in range(N, 0, -1):                # walk time T -> 0
        x = predictor_step(x, t[i], t[i-1])  # one reverse-SDE solver step, t[i] -> t[i-1]
        for _ in range(M):                   # M Langevin corrector steps at the new time
            z = randn_like(x)
            x = x + eps * s_theta(x, t[i-1]) + sqrt(2 * eps) * z
    return x                                 # an approximate sample from p_0

Drag the number of corrector steps. At zero the cloud carries the predictor's error, here drawn as a too-tight clump at each mode; add Langevin steps and the cloud relaxes outward until it matches the true noised density:

Figure 5 · the corrector
The faint amber cloud is the true noised density $p_\sigma$ at this level. The teal cloud is the sample after a predictor step that left it too tightly packed. Each Langevin corrector step nudges along the score and adds calibrated noise; drag $M$ up and the cloud relaxes onto the target. The corrector reuses the same score the predictor used.

That "stationary distribution" claim comes with a qualifier. The clean theory holds in the limit of infinitely many, infinitely small Langevin steps; a finite number of finite steps leaves a small bias. That bias is why the procedure anneals across noise levels and alternates prediction with correction rather than leaning on either alone, and the paper measures that the combination beats either the predictor or the corrector run by itself for the same compute budget.
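The size of that finite-step bias can be worked out exactly in the simplest case. Assume the target is a one-dimensional unit Gaussian, whose score is $-x$. One Langevin step is then $x \leftarrow (1-\epsilon)x + \sqrt{2\epsilon}\,z$, so the variance follows a simple recursion and settles at $1/(1-\epsilon/2)$ rather than at 1. The sketch below iterates that recursion from a too-tight starting variance of 0.25 (an illustrative value); the function name is hypothetical.

```python
def langevin_variance(var0, eps, n_steps):
    """Variance after n Langevin steps targeting N(0, 1), whose score is -x.

    One step is x <- (1 - eps) * x + sqrt(2 * eps) * z, so the variance obeys
    var <- (1 - eps)^2 * var + 2 * eps exactly.
    """
    var = var0
    for _ in range(n_steps):
        var = (1.0 - eps) ** 2 * var + 2.0 * eps
    return var


for eps in (0.2, 0.05, 0.01):
    settled = langevin_variance(var0=0.25, eps=eps, n_steps=5000)
    predicted = 1.0 / (1.0 - eps / 2.0)
    print(f"eps={eps}: settled variance {settled:.4f}, predicted {predicted:.4f}")
```

The settled variance overshoots the target by a margin that shrinks with the step size, which is the small bias the qualifier describes: smaller steps are more accurate but need more of them to re-settle the cloud.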

Steering generation without retraining

One unconditional model, trained once, can be steered after the fact. Suppose you want samples consistent with some condition $\mathbf{y}$: a class label, the visible part of a half-erased image, the grayscale version of a photo. The conditional reverse SDE is the unconditional one with a single extra term:

$$d\mathbf{x} = \big\{\mathbf{f}(\mathbf{x}, t) - g(t)^2\big[\nabla_{\mathbf{x}}\log p_t(\mathbf{x}) + \nabla_{\mathbf{x}}\log p_t(\mathbf{y}\mid\mathbf{x})\big]\big\}\,dt + g(t)\,d\bar{\mathbf{w}} \tag{14}$$

The first gradient inside the bracket is the unconditional score we already trained. The second, $\nabla_{\mathbf{x}}\log p_t(\mathbf{y}\mid\mathbf{x})$, is a guidance term: how the log-likelihood of the condition changes as you move $\mathbf{x}$. Adding the two gradients steers the same flow toward the region where the condition holds. The guidance can come from a small auxiliary classifier or, for problems like inpainting, from the structure of the problem itself, with no extra training. The unconditional model never has to know which conditioning you will eventually ask for.
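The sum of the two gradients is Bayes' rule in log form: $\log p(\mathbf{x}\mid\mathbf{y}) = \log p(\mathbf{x}) + \log p(\mathbf{y}\mid\mathbf{x}) - \log p(\mathbf{y})$, and the last term has no gradient in $\mathbf{x}$. The sketch below checks this on an assumed one-dimensional, two-class mixture with unit-variance classes and equal priors, where the exact class posterior plays the role of the classifier. Names and constants are illustrative.

```python
import math

MEANS = {0: -2.0, 1: 2.0}   # assumed class centers; each class is N(mean, 1)


def gaussian_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def class_posterior(y, x):
    """p(y | x) for equal class priors: the stand-in 'classifier'."""
    total = sum(gaussian_pdf(x, m) for m in MEANS.values())
    return gaussian_pdf(x, MEANS[y]) / total


def unconditional_score(x):
    """Score of the two-class mixture p(x)."""
    return sum(class_posterior(y, x) * (MEANS[y] - x) for y in MEANS)


def guidance(y, x, h=1e-5):
    """Finite-difference gradient of log p(y | x) with respect to x."""
    def log_post(u):
        return math.log(class_posterior(y, u))
    return (log_post(x + h) - log_post(x - h)) / (2.0 * h)


def conditional_score(y, x):
    """Unconditional score plus guidance, as in the bracket of (14)."""
    return unconditional_score(x) + guidance(y, x)


for x in (-1.0, 0.0, 2.5):
    # The score of class 1 alone, N(2, 1), is simply 2 - x.
    print(x, conditional_score(1, x), MEANS[1] - x)
```

The guided score should match the score of the chosen class alone, which is why adding the classifier gradient funnels every particle toward that class.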

Pick a class and watch every particle get funneled to it. With no class selected the guidance term is zero, so the unconditional score spreads particles across all the data; select a class and the second gradient redirects the same flow to that one target:

Figure 6 · controllable generation
Three labeled classes. With no class selected, the unconditional score sends particles to all three. Select a class and a guidance gradient is added to that same score, funneling every particle to the chosen target. One model, no retraining, different conditions.

The paper demonstrates this on class-conditional generation, inpainting, and colorization, all from a single unconditional score model. The continuous formulation makes this clean: conditioning is one more gradient added to the drift, and the reverse SDE accepts that gradient from any source.

What it achieves

The framework was tidy, and it also set records. On CIFAR-10 unconditional generation, the best sample-quality model (a VE SDE with an improved architecture, deepened, sampled with the predictor-corrector scheme) reached an Inception score of $9.89$ and an FID of $2.20$. Inception score rewards samples a classifier finds both confidently recognizable and spread across classes, with higher better and roughly 11 the practical ceiling on CIFAR-10, so 9.89 sits near the top. FID measures the distance between the statistics of generated and real images, lower being better, and an FID near 2 means the generated set is statistically almost indistinguishable from real CIFAR-10. Those were the best reported numbers at the time for generating CIFAR-10 without class labels, and the FID even beat the strongest class-conditional GAN of the day (StyleGAN2-ADA, FID $2.42$), which had the easier job of being told what to draw.

The deterministic ODE side delivered on likelihoods. A separate model (a sub-VP SDE, also deepened, trained with the continuous score-matching objective) reached $2.99$ bits per dimension on CIFAR-10. Bits per dimension is the model's assigned cost to store one real image, averaged over each pixel and color channel, and lower means the model finds real images more probable. It was a record at the time on uniformly dequantized CIFAR-10, and it is an exact figure: the probability-flow ODE computes the likelihood itself, whereas a variational model like a VAE can only report an upper bound on bits per dimension, since its ELBO bounds the true log-likelihood from one side. Best quality and best likelihood came from different models, a VE for sharp samples and a sub-VP for tight likelihoods, which is itself the practical lesson: the right SDE depends on what you are optimizing.
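The unit itself is a one-line conversion: take the negative log-likelihood of an image in nats, divide by the number of dimensions, and divide by $\ln 2$ to turn nats into bits. The sketch below assumes the log-likelihood is already expressed for pixel values on their original integer scale; the function name and the example log-likelihood are hypothetical, chosen so the result lands on the figure quoted above.

```python
import math


def bits_per_dim(log_likelihood_nats, num_dims):
    """Convert one image's log-likelihood in nats into bits per dimension."""
    return -log_likelihood_nats / (num_dims * math.log(2.0))


CIFAR10_DIMS = 32 * 32 * 3   # pixels times color channels

# Hypothetical log-likelihood, constructed to correspond to 2.99 bits/dim.
example_log_likelihood = -2.99 * CIFAR10_DIMS * math.log(2.0)
print(example_log_likelihood, bits_per_dim(example_log_likelihood, CIFAR10_DIMS))
```

Because the total is divided by 3072 dimensions, a difference of a few hundredths of a bit per dimension corresponds to a large difference in the log-likelihood of a whole image.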

And scale held up. Combining the architecture and the new samplers, the paper produced the first high-fidelity $1024 \times 1024$ images ever generated by a score-based model, on CelebA-HQ faces. The limitation the authors name is real: even with the improved samplers, drawing a sample still takes many network evaluations, so generation is slower than a GAN's single forward pass. Closing that gap, fast sampling with the stable training of score models, is the thread two lines of follow-up work have pulled on ever since: faster ODE solvers that reach a good sample in a handful of steps, and distillation, which trains a small student network to reproduce in one or two passes what the full sampler does in hundreds.

The reframing outlived the records. A diffusion model is one continuous SDE that turns data into noise; generation runs that SDE backward, which needs only the score; the score is learned by denoising; and the same score gives a deterministic probability-flow ODE, a corrector, and conditional control. Nearly every diffusion system since, Stable Diffusion among them, stands on this continuous-time view, and later work has gone the other way, reading the residual layers of an ordinary network as the steps of exactly this reverse process.

Provenance · Verified against primary literature
Anderson (1982): The reverse of a diffusion is itself a diffusion, driven by the score: the reverse-time SDE.
Hyvärinen (2005) / Vincent (2011): Score matching, and the denoising form that makes the training loss tractable.
SMLD / NCSN (Song & Ermon, 2019): Score matching with Langevin dynamics; its continuous limit is the VE SDE.
DDPM (Ho et al., 2020): Denoising diffusion; its continuous limit is the VP SDE, sampled by one reverse-SDE discretization.
Neural ODEs (Chen et al., 2018): The probability-flow ODE is a neural ODE, the route to exact likelihoods.
Correction: The reverse-time SDE carries the full g(t)² on the score (Eq 6); the probability-flow ODE carries exactly half, ½ g(t)² (Eq 13). Both coefficients are deliberate, and we keep them distinct.

Questions you might still have

If the SDE and the ODE give the same samples, why keep both?
They share marginals, not jobs. The reverse SDE injects fresh noise every step, which can wash out earlier errors and tends to give better sample quality. The probability-flow ODE is deterministic, which gives exact likelihoods, a unique latent encoding, and fast adaptive solvers. You pick the one whose extra property you want.

Why does the ODE have half the score coefficient of the SDE?
The spread of the noised density is fixed, but it can be reproduced two ways: by random kicks or by a smooth velocity field. Folding all of that spreading into a deterministic flow uses only half of the score term the noisy version carries, so the reverse SDE keeps the full g(t)² and a random kick while the ODE keeps half the score and no kick.

Is this just DDPM?
DDPM is one point in this framework. Its noising is the discrete form of the variance-preserving SDE, and its sampler is one discretization of that SDE in reverse. The continuous view adds the variance-exploding family (the older score-matching line), the sub-VP family, the deterministic probability-flow ODE with exact likelihoods, the predictor-corrector sampler, and conditional generation.

How do you steer what it generates?
The conditional reverse SDE is the unconditional one plus a guidance gradient, the gradient of the log-likelihood of your condition. One unconditional score model can then be steered to a class label, to fill in a masked region, or to add color to a grayscale photo, all with no retraining, because the reverse SDE uses that extra gradient regardless of its source.

Footnotes & further reading

  1. The paper: Song, Sohl-Dickstein, Kingma, Kumar, Ermon, Poole, Score-Based Generative Modeling through Stochastic Differential Equations (ICLR 2021). Code.
  2. The time-reversal of a diffusion: Anderson, Reverse-time diffusion equation models (Stochastic Processes and their Applications, 1982).
  3. The two ancestors this paper unifies: Song & Ermon, Generative Modeling by Estimating Gradients of the Data Distribution (NCSN / SMLD, 2019), and Ho et al., Denoising Diffusion Probabilistic Models (DDPM, 2020), building on Sohl-Dickstein et al. (2015).
  4. Score matching and its denoising form: Hyvärinen, Estimation of Non-Normalized Statistical Models by Score Matching (2005), and Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011).
  5. The probability-flow ODE as a neural ODE: Chen et al., Neural Ordinary Differential Equations (2018); related deterministic-flow constructions for Fokker-Planck appear in Maoutsa et al. (2020).
  6. What the framework grew into: the latent-space, text-conditioned descendant in Latent Diffusion (Stable Diffusion), the practical VE conventions and noise schedules of Karras et al., EDM (2022), and the residual-network reading in DiffusionBlocks.