Diffusion · Generative models

Generative Modeling by Estimating Gradients of the Data Distribution

Learn the direction toward more data, then follow it from noise to an image.

No adversary, no normalizing constant, no Markov chain during training. Estimate one vector field, the gradient of the log-density, at a ladder of noise levels, and a noisy walk down that ladder turns static into samples.

Explaining the paperGenerative Modeling by Estimating Gradients of the Data DistributionYang Song, Stefano Ermon · Stanford · NeurIPS 2019 · arXiv:1907.05600 ↗

Start from a screen of static. Nudge it, over and over, in the direction that makes it look a little more like real data, with a small random jitter on every step, and a face fades in out of the noise.

By 2019 generative modeling had two dominant recipes, and each paid a tax. Likelihood-based models, the family that includes the variational autoencoder and autoregressive models, optimize the probability a model assigns to the data. To do that they need a normalized probability, one that sums to one over all possible images, which forces either restrictive architectures or a surrogate objective standing in for the real one. Generative adversarial networks drop the probability entirely and train a generator against a discriminator, which sidesteps normalization but buys an unstable two-player game whose loss tells you almost nothing about how good the model is.

This paper takes a third route that needs neither. The object it learns is not a probability and not a generator. It is a vector field: at every point in image space, an arrow saying which way to move to make the image more probable. Learn that field and you can sample by walking along it. The whole method is a few ideas stacked in order: what that field is and why it dodges the normalizing constant, how a noisy walk along it draws samples, the two concrete ways the naive version falls apart, and the single idea, adding noise at many scales, that repairs all of them at once.

The score: a direction toward data

Write $p(\mathbf{x})$ for the density of the data: high where realistic images cluster, near zero in the vast emptiness of pixel combinations that look like nothing. The quantity this paper is built on is the gradient of its logarithm, called the score:

\mathbf{s}(\mathbf{x}) = \nabla_{\mathbf{x}} \log p(\mathbf{x})

(1)

Read it literally. The gradient of $\log p$ with respect to $\mathbf{x}$ points in the direction that increases the log-density fastest. Stack one such arrow at every point and you get a field that everywhere points toward where data is denser. This is the Stein score, the gradient with respect to the data point. Keep it apart from the older statistics term, the Fisher score, which is the gradient with respect to a model's parameters and answers a different question. Same word, different variable.

One property makes the score worth estimating at all, and it is why this line of work exists. Any density is really some unnormalized density $\tilde{p}(\mathbf{x})$ divided by a normalizing constant $Z = \int \tilde{p}(\mathbf{x})\,d\mathbf{x}$ , the integral over the entire space that makes the probabilities sum to one. That $Z$ is the curse of probabilistic modeling: a sum over every possible image, hopeless to compute. But the log turns the division into a subtraction, $\log p = \log \tilde{p} - \log Z$ , and $Z$ does not depend on $\mathbf{x}$ , so the instant you take the gradient it vanishes:

\nabla_{\mathbf{x}} \log p(\mathbf{x}) = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x}) - \underbrace{\nabla_{\mathbf{x}} \log Z}_{=\,0} = \nabla_{\mathbf{x}} \log \tilde{p}(\mathbf{x})

The score captures the shape of the density, every hill and valley, with no reference to its total volume. Energy-based models spend enormous effort estimating $Z$ ; the score steps around it entirely. That sidestep is why score-based modeling is practical, and the rest of the paper is about how to estimate this field and how to turn it into samples.

The field also behaves the way intuition wants as you blur the data. Picture the density at a chosen blur radius $\sigma$ , the data smeared by a Gaussian of that width. At small $\sigma$ the density is sharp and concentrated, so the arrows snap toward the nearest cluster of real data. At large $\sigma$ everything smears into one broad hill and the arrows point gently toward the global center of mass. Drag the slider and watch the field stiffen and relax:

Figure 1 · the score field

noise σσ = 1.00

The score

\nabla\log p_\sigma

is a vector field: at every point it aims where probability mass is densest. At small

\sigma

it points to the nearest data cluster; at large

\sigma

the clusters blur together and it points gently toward the global center. If you had this field, generation would be a walk along it.

So the entire game reduces to two questions. How do you estimate this field from a pile of data with no ground-truth arrows? And once you have it, how do you turn a field of arrows into actual samples? Take the second one first, because its answer is older and sets up everything else.

Following the score: Langevin dynamics

Suppose someone hands you the exact score field. The obvious move is to drop a point somewhere and keep stepping in the arrow direction, climbing uphill in log-density. That is gradient ascent, and it does the wrong thing: it walks every starting point to the nearest peak and stops. You would get a handful of the most probable images over and over, never the rich variety the data actually has. Climbing to a peak is finding the single most likely point, the mode. Drawing a sample means visiting points in proportion to how probable they are, peaks often and valleys rarely, never freezing.

Langevin dynamics solves this, and it changes gradient ascent in exactly one place: it adds a fresh kick of Gaussian noise on every step. Starting from a prior sample $\tilde{\mathbf{x}}_0$ and a small step size $\epsilon$ , it repeats

\tilde{\mathbf{x}}_t = \tilde{\mathbf{x}}_{t-1} + \frac{\epsilon}{2}\,\nabla_{\mathbf{x}} \log p(\tilde{\mathbf{x}}_{t-1}) + \sqrt{\epsilon}\,\mathbf{z}_t, \qquad \mathbf{z}_t \sim \mathcal{N}(\mathbf{0},\mathbf{I})

(2)

Half a step of drift up the score, plus a jolt of noise scaled by $\sqrt{\epsilon}$ . As $\epsilon \to 0$ and the number of steps grows, the points this produces are distributed exactly as $p$ : the recipe samples the density using nothing but its score, no normalizing constant required. (Strictly there is a Metropolis accept-reject correction for finite $\epsilon$ , which the paper drops as negligible when the steps are small.) The injected noise does the work. With it, a point near a peak gets knocked back out and keeps exploring; without it, it sticks. The drift keeps the walk near high-density regions; the noise keeps it moving so it covers them in proportion to their mass.

Below, a swarm of points runs the update on a landscape with several wells. The figure shows the role of that noise term directly: with the noise multiplier near zero every point collapses onto a peak and freezes, which is optimization. Turn it up and the cloud spreads across all the wells and keeps churning, reproducing the distribution it is sampling from.

Figure 2 · sampling vs. optimizing

noise ×kk = 1.00

Langevin dynamics: points climb the score toward the data while a Gaussian kick jostles them. Set the noise multiplier to zero and they collapse onto the peaks and stop (that is finding modes, not sampling); around one they spread across every cluster and keep moving, drawing samples. The injected noise alone separates sampling from optimizing.

Nothing else is needed to sample. Train a network so that $\mathbf{s}_{\boldsymbol\theta}(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$ , run (2) with the network standing in for the true score, and you have a generative model. It is a clean plan, and run naively on real data it fails twice over. Seeing exactly how it fails motivates everything the paper actually does.

Two places this breaks

The first failure is geometric. Real data does not fill its space. A 32-by-32 color image lives in about three thousand dimensions, yet the images that look like anything occupy a thin, lower-dimensional sheet inside it, the data manifold. This picture is called the manifold hypothesis, and it wrecks the score. Off the sheet the density is exactly zero, so $\log p$ is minus infinity and its gradient, the score, is simply undefined there. There is no "uphill" to point to in directions that leave the manifold. Worse, the standard way to fit a score is a consistent estimator, meaning it converges to the truth given enough data, only when the data has support everywhere; on a thin manifold that guarantee evaporates, and the fit has no reason to be right.

The second failure is about empty space, and it has two halves. Fitting the score is an average over training examples, so in the wide gaps between clusters, where no samples ever land, there is no signal and the learned arrows are guesses. And even granting perfect arrows, a subtler problem remains, the one that actually sinks the naive sampler. Consider data that is a mix of two well-separated clusters, $p = \pi\,p_1 + (1-\pi)\,p_2$ , where $\pi$ is the fraction of mass in the first cluster, say $\pi = 0.8$ . Inside the first cluster, where $p_2$ is essentially zero, the log-density is $\log(\pi\,p_1) = \log \pi + \log p_1$ . Now take the gradient. The term $\log\pi$ is a constant, so its gradient is zero, and the score collapses to $\nabla\log p_1$ :

\nabla_{\mathbf{x}} \log p(\mathbf{x}) = \nabla_{\mathbf{x}} \log p_1(\mathbf{x}) \quad\text{on the first cluster, with no trace of } \pi

The mixing weight has been annihilated by the gradient. The score inside a cluster reflects the shape of that cluster and carries no information about how much total mass it holds relative to the other. A sampler that only ever consults the score therefore has no way to recover $\pi$ . Drop points by Langevin into two separated clusters and they split by the geometry of which basin they started nearest, not by the 80/20 weights, and across the empty gap between clusters the score is essentially zero, so nothing can shuttle probability from one to the other to fix the ratio. This is not an artifact of a badly trained network. It happens with the exact, ground-truth score, and that part is the one to hold onto: the failure is structural, not an estimation error. The next two sections build the cure, and the figure that proves it comes with them.

One fix: blur the data with noise

Every one of those problems has the same root, that the data is a thin, spiky thing with vast dead space around it, and one move treats all of them: deliberately blur the data by adding Gaussian noise. Perturbing each clean point $\mathbf{x}$ with noise of standard deviation $\sigma$ defines a smoothed density

q_\sigma(\mathbf{x}) = \int p_{\text{data}}(\mathbf{t})\,\mathcal{N}(\mathbf{x}\mid \mathbf{t}, \sigma^2\mathbf{I})\,d\mathbf{t}

(3)

which is $p_{\text{data}}$ convolved with a Gaussian, the data seen through frosted glass of roughness $\sigma$ . This one operation pays off three times. The blurred density has support everywhere, since a Gaussian has tails over the whole space, so the manifold problem is gone and the score of $q_\sigma$ is defined at every point. The blur spreads samples into the gaps, so the dead low-density regions now carry signal. And at a large enough $\sigma$ , the two clusters smear into each other so a path of nonzero density connects them, which is exactly what restores the lost mixing weight: once the supports overlap, the score depends on $\pi$ again.

Drag $\sigma$ from near zero upward in the figure. At the bottom the data is a hairline curve and the score lives only on it, undefined in the black around it; as the noise grows, the curve thickens into a band and then floods the whole frame with a well-defined field pointing home.

Figure 3 · noise gives the score support

noise σσ = 0.06

The data manifold is a thin 1-D curve in the plane. At

\sigma \approx 0

the score is defined only on the hairline and undefined off it. Raise

\sigma

and

q_\sigma

gains full support, the dead gaps fill with signal, and the field points toward the data everywhere. Two of the three fixes are visible here; connecting separated modes shows up in Figure 5.

There is an obvious objection, and the paper meets it head on with a baseline. If a touch of noise fixes the manifold problem, why not add the smallest possible amount and be done? Because small noise only buys the first repair. A tiny $\sigma$ thickens the manifold just enough to define the score, but it does nothing for the empty regions far from data and nothing to connect separated modes, so a sampler built on it still fails. The paper checks this directly: a single small noise level of $\sigma = 0.01$ on its own cannot generate reasonable images. You need large noise too, for the gaps and the modes, and small noise for fidelity. That tension, large for reach and small for detail, is why a single $\sigma$ will not do and why the method ends up using a whole ladder of them.

Learning the score without the answer

Now the first question comes due: how do you fit a network to the score when you have no ground-truth arrows to regress against? You cannot write down a labeled target $\nabla\log p(\mathbf{x})$ ; that is the very thing you do not know. The classical answer is score matching (Hyvärinen, 2005), and it is a small piece of calculus that makes the unknown target disappear. The thing you wish you could minimize is the squared error between your network and the true score, $\tfrac{1}{2}\,\mathbb{E}_{p}\big[\lVert \mathbf{s}_{\boldsymbol\theta} - \nabla\log p\rVert^2\big]$ . Expand the square and one cross-term involves the unknown $\nabla\log p$ . Integration by parts trades that term for the divergence of the network, leaving an objective that never mentions the true score:

\mathbb{E}_{p_{\text{data}}(\mathbf{x})}\!\left[\operatorname{tr}\!\big(\nabla_{\mathbf{x}} \mathbf{s}_{\boldsymbol\theta}(\mathbf{x})\big) + \tfrac{1}{2}\lVert \mathbf{s}_{\boldsymbol\theta}(\mathbf{x})\rVert_2^2\right]

(4)

(The factor of one half sits only on the squared-norm term; the trace term is bare.) The minimizer of (4) is the true score, and the price of admission was a single integration by parts. But that price hides a sting in the trace term. $\operatorname{tr}(\nabla_{\mathbf{x}} \mathbf{s}_{\boldsymbol\theta})$ is the sum of the diagonal of the network's Jacobian, and computing it for a $D$ -dimensional image takes work that grows with $D$ , one backward pass per dimension. For three-thousand-dimensional images that does not scale, and the next two methods exist to get around that trace.

Sliced score matching keeps the trace but estimates it cheaply. Instead of all $D$ diagonal entries, it projects the Jacobian onto a single random direction $\mathbf{v}$ and uses $\mathbf{v}^\top (\nabla \mathbf{s}_{\boldsymbol\theta})\,\mathbf{v}$ , whose average over random $\mathbf{v}$ equals the trace (a standard trick: the expected quadratic form recovers the diagonal sum). One forward-mode derivative computes it, at roughly four times the cost of a plain forward pass, and it estimates the score of the data as it is, no blurring. Denoising score matching (Vincent, 2011) goes further and removes the trace entirely, by giving up on the clean data and aiming at the blurred $q_\sigma$ instead. For Gaussian noise the per-example target it needs can be written in closed form, with no trace anywhere, and that closed form is why the method is cheap to train. The score of the conditional, the noised sample given the clean one, is

\nabla_{\tilde{\mathbf{x}}} \log q_\sigma(\tilde{\mathbf{x}}\mid\mathbf{x}) = -\frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^2} = \frac{\mathbf{x} - \tilde{\mathbf{x}}}{\sigma^2}

(5)

an arrow pointing from the noised sample straight back to the clean original it came from, scaled by $1/\sigma^2$ . You make the noise yourself, so you always know that arrow. It is a free supervised label, no trace and no unknown target. Train the network to regress onto it:

\ell(\boldsymbol\theta;\sigma) = \tfrac{1}{2}\,\mathbb{E}_{p_{\text{data}}(\mathbf{x})}\,\mathbb{E}_{\tilde{\mathbf{x}}\sim\mathcal{N}(\mathbf{x},\sigma^2\mathbf{I})}\!\left[\Big\lVert \mathbf{s}_{\boldsymbol\theta}(\tilde{\mathbf{x}},\sigma) + \frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^2}\Big\rVert_2^2\right]

(6)

The plus sign is right, not a typo: the target is the negative quantity $-(\tilde{\mathbf{x}}-\mathbf{x})/\sigma^2$ , and subtracting it inside the norm flips the sign. There is a catch worth stating plainly, because it is the most misread fact about this method. Any single label points at one specific clean image, but a given noised $\tilde{\mathbf{x}}$ could have come from many clean images, so the network cannot match any individual label. What it learns is their average, and that average is the score of the blurred distribution $q_\sigma$ , not the clean $p_{\text{data}}$ . The two agree only when $\sigma$ is small enough that the blur barely changes anything. At large $\sigma$ the network deliberately learns the score of a heavily smoothed density, and that is the feature, not a bug: the smoothed score is the one with full support and connected modes.

In the figure, the thin teal arrows are the individual labels, each pointing back to its own clean source and each fairly noisy. Drag the probe and the bold amber arrow is the average of the labels nearby; the white arrow is the true score of $q_\sigma$ computed independently. They line up. The messy per-sample supervision averages out to the exact thing you wanted.

Figure 4 · the free label averages to the score

noise σσ = 0.70

Clean data in amber; around it a cloud of self-noised samples in teal, each with a thin arrow back to its source, the self-made target

(\mathbf{x}-\tilde{\mathbf{x}})/\sigma^2

. Drag the probe: the bold amber arrow is the proximity-weighted average of the nearby labels, and it tracks the white true score

\nabla\log q_\sigma

. Noisy labels, exact average.

That averaging has a name in its own right, Tweedie's formula (Robbins, 1956). The optimal denoiser (the function returning the average clean image behind a noised one) and the score of the blurred density are the same object up to a factor of $\sigma^2$ : $\nabla\log q_\sigma(\tilde{\mathbf{x}}) = (\mathbb{E}[\mathbf{x}\mid\tilde{\mathbf{x}}] - \tilde{\mathbf{x}})/\sigma^2$ . A denoiser is a score estimator. NCSN does not use that framing directly, it predicts the score, but it is why the later DDPM line, which predicts the noise instead, is the same idea wearing different clothes.

One network for every noise level

The pieces are now in hand. Use denoising score matching, because it is cheap and it naturally targets the blurred density you actually want. Use many noise levels, because large noise fixes the gaps and the modes while small noise preserves detail. The paper ties these together into the Noise Conditional Score Network (NCSN): a single network $\mathbf{s}_{\boldsymbol\theta}(\mathbf{x},\sigma)$ that takes the noise level as an extra input and learns the score of $q_\sigma$ at every level at once. The levels are a geometric ladder

\{\sigma_i\}_{i=1}^{L}, \qquad \frac{\sigma_1}{\sigma_2} = \cdots = \frac{\sigma_{L-1}}{\sigma_L} > 1, \qquad \sigma_1 \text{ large},\ \ \sigma_L \approx 0

(7)

with $\sigma_1$ big enough to bridge the modes and $\sigma_L$ small enough that $q_{\sigma_L}$ is nearly the clean data. For images the paper uses $L = 10$ levels from $\sigma_1 = 1$ down to $\sigma_{10} = 0.01$ , a noise of $0.01$ being invisible to the eye on pixels in $[0,1]$ . Why one shared network conditioned on $\sigma$ rather than ten separate ones? The score fields at adjacent levels are nearly the same field, so a single network amortizes the work and the levels reinforce each other instead of each starting from scratch. The full objective is the per-level loss (6) summed over the ladder:

\mathcal{L}(\boldsymbol\theta) = \frac{1}{L}\sum_{i=1}^{L} \lambda(\sigma_i)\,\ell(\boldsymbol\theta;\sigma_i)

(8)

The weights $\lambda(\sigma_i)$ matter more than they look. Without them the high-noise levels, where the target arrow $(\mathbf{x}-\tilde{\mathbf{x}})/\sigma^2$ is small, would contribute almost nothing to the gradient and the low-noise levels would dominate. The paper notices empirically that a well-trained score has magnitude $\lVert \mathbf{s}_{\boldsymbol\theta}(\mathbf{x},\sigma)\rVert_2 \propto 1/\sigma$ , which suggests setting $\lambda(\sigma) = \sigma^2$ : then each level's loss term comes out to $\tfrac{1}{2}\,\mathbb{E}\big[\lVert \sigma\,\mathbf{s}_{\boldsymbol\theta} + (\tilde{\mathbf{x}}-\mathbf{x})/\sigma\rVert^2\big]$ , where the residual $(\tilde{\mathbf{x}}-\mathbf{x})/\sigma$ is a unit Gaussian and $\sigma\,\mathbf{s}_{\boldsymbol\theta}$ has magnitude near one, so every level contributes at the same order of magnitude. This weighting is a calibrated guess from an observation, not a result derived from first principles, and it should be read that way.

Make it concrete. Say the data is $32\times32$ color images, so $D = 3072$ , and take the ten-level ladder above. One training step: draw a clean image $\mathbf{x}$ , pick a level, say $\sigma = 0.16$ , draw fresh noise $\boldsymbol\epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I})$ of shape $3072$ , and form the noised input $\tilde{\mathbf{x}} = \mathbf{x} + 0.16\,\boldsymbol\epsilon$ . The label is $(\mathbf{x}-\tilde{\mathbf{x}})/0.16^2 = -\boldsymbol\epsilon/0.16$ , which is just the noise you added, flipped and rescaled. Feed $(\tilde{\mathbf{x}}, 0.16)$ through the network, regress its output onto that label, scale the squared error by $\sigma^2 = 0.0256$ , and step. No Markov chain, no second network, no normalizing constant ever appears:

# one training step for the noise-conditional score net s_theta
x      = sample_batch()             # clean data  [n, D]
sigma  = choice(sigmas)             # a level from the geometric ladder
eps    = randn_like(x)              # fresh Gaussian noise
x_t    = x + sigma * eps            # noised input:  x~ = x + sigma*eps
target = (x - x_t) / sigma**2       # = -eps/sigma : the free label
loss   = sigma**2 * mse(s_theta(x_t, sigma), target)   # weight lambda=sigma^2
loss.backward(); opt.step()         # no MCMC, no adversary, no normalizer

The architecture is a U-Net with dilated convolutions, the same dense-prediction backbone that works for image segmentation, because the output has to be an image-shaped field the same size as the input. The noise level $\sigma$ is fed in through a conditional version of instance normalization, so the one network can shift its behavior across the ladder. Training stops there, and what remains is turning the trained field into pictures.

Annealed Langevin dynamics

Plain Langevin on the data score gets the mode weights wrong, as the break section showed, and a single noise level cannot rescue it. The sampler that does is annealed Langevin dynamics, and the idea is borrowed straight from metallurgy. Heat the metal so its atoms roam freely and find the right global arrangement, then cool it slowly so they lock in without disturbing what they found. Here the noise level $\sigma$ plays the role of temperature. Start hot, at the largest $\sigma_1$ , where the modes are smeared into one connected hill so points can flow between them and settle into the correct proportions. Then cool down the ladder, rerunning Langevin at each level, carrying the points from one level into the next:

\tilde{\mathbf{x}}_t \leftarrow \tilde{\mathbf{x}}_{t-1} + \frac{\alpha_i}{2}\,\mathbf{s}_{\boldsymbol\theta}(\tilde{\mathbf{x}}_{t-1},\sigma_i) + \sqrt{\alpha_i}\,\mathbf{z}_t, \qquad \alpha_i = \epsilon\cdot\frac{\sigma_i^2}{\sigma_L^2}

(9)

This is the Langevin step (2) with the network's score at the current level and a level-dependent step $\alpha_i$ . The step size shrinks as you cool, scaling with $\sigma_i^2$ and normalized by the smallest level $\sigma_L$ so that the last and finest level uses step $\alpha_L = \epsilon$ . The $\sigma_i^2$ scaling is not arbitrary: it keeps the ratio between the score step and the noise step fixed across levels, so the sampler takes bold strides when the field is coarse and small careful ones near the sharp data, the same balance at every scale. The walk starts from pure noise and ends, after the last level, with a sample:

# annealed Langevin dynamics: walk the noise ladder high -> low
x = uniform_noise(n, D)                    # start: pure noise
for sigma in sigmas:                       # sigma_1 (high) ... sigma_L (low)
    alpha = eps * sigma**2 / sigmas[-1]**2 # step size, alpha ~ sigma^2
    for t in range(T):                     # T Langevin steps at this level
        z = randn_like(x)
        x = x + (alpha/2) * s_theta(x, sigma) + sqrt(alpha) * z
return x                                    # a sample from p_data

The figure makes the payoff checkable. Two clusters carry unequal weight, 80 percent on the left and 20 on the right, the same 1/5-versus-4/5 mixture the paper uses. From the same start, plain Langevin at one fixed level splits the points by which basin they began in, landing near 50/50 because it cannot cross the empty gap, exactly the structural failure from before. Switch to annealed Langevin and watch the high-noise levels let the cloud redistribute, so as the noise cools the split lands near the true 80/20. Toggle the two samplers and scrub the chain:

Figure 5 · annealing recovers the weights

σ = 0.42

Two unequal modes (left 80%, right 20%). From the same start, plain Langevin splits the points by geometry (about 50/50) and is trapped by the empty gap. Annealed Langevin walks the noise ladder down, lets the high-noise levels move points between modes, and lands near the true 80/20. The bottom bar reads off the split.

Annealing converts the noised scores, each individually defined and well-estimated, into a sampler that also gets the global structure right. The largest level does the hard part, setting how much mass goes where; each smaller level refines the picture without undoing that split; the smallest level, where $q_{\sigma_L} \approx p_{\text{data}}$ , polishes it into a clean sample. For images the paper runs $T = 100$ Langevin steps at each of the ten levels from a uniform-noise start, with $\epsilon = 2\times10^{-5}$ , and reports the result is robust to those choices.

What it made, and where it sits

The receipts are on CIFAR-10. As an unconditional model, with no class labels, NCSN reaches an inception score of $8.87 \pm 0.12$ , where the inception score rates how recognizable and varied the samples look, higher being better. That was a new state of the art for unconditional models, and it beat most reported numbers for class-conditional models too, which is the headline the paper claims and the one to state precisely. Its FID, the Fréchet inception distance, a separate quality score where lower is better, came in at $25.32$ , and that number was "comparable to top existing models," not a record. SNGAN's $21.7$ is better, and class-conditional BigGAN, at inception $9.22$ and FID $14.73$ , beats NCSN on both. The accurate summary is narrow: best inception among unconditional models, competitive on FID. It is not "NCSN beat the GANs."

Read those numbers against what they cost. A GAN of the era reached comparable quality only through an unstable adversarial game with no usable training signal; NCSN matched it with a plain regression loss, a single network, and no sampling at all during training. The same objective even hands you a number to compare two models with, something the GAN objective never could. The quality is competitive; the route to it is far calmer.

This paper is the hinge the modern diffusion line turns on. Its scattered noise levels are a discretization of a continuous process, and the score-SDE paper that came next made that explicit: NCSN is the variance-exploding branch, the one that keeps the signal fixed and lets the noise variance grow without bound, while DDPM is the variance-preserving sibling that shrinks the signal as it adds noise. Both estimate a score and both run it backward; they differ in the bookkeeping. One caution for anyone reading across the lineage: the very large maximum noise levels and the rescaled $\mathbf{s}_{\boldsymbol\theta}(\mathbf{x})/\sigma$ parameterization often attributed to this work are refinements from the 2020 follow-up (NCSNv2), not the original. The 2019 paper is the one that established the frame this field now lives in: do not learn the probability, learn its gradient, blur to give that gradient meaning everywhere, and anneal the blur away while you sample.

Provenance Verified against primary literature

Score matching (Hyvärinen, 2005)Integration by parts removes the unknown true score (Eq 4).

Denoising SM (Vincent, 2011)Closed-form Gaussian target; learns the blurred score q_sigma.

Tweedie / Robbins (1956)A denoiser equals a score estimator up to sigma^2.

Langevin dynamicsScore-only sampler; the noise term makes it sample, not optimize.

Score-SDE (Song et al., 2021)NCSN is the variance-exploding SDE, discretized.

correctionA common reading blames Langevin's wrong mode weights on a poorly trained score. The paper's own Figure 3 uses the EXACT score and still gets the weights wrong: the gradient annihilates the mixing weight (∇log(π·p₁) = ∇log p₁), so the failure is structural, not an estimation error. Annealing through high noise connects the modes and restores the weight.

Questions you might still have

Why learn the score of the noised data instead of the real data?
The clean-data score is undefined off the thin data manifold and carries no signal in the empty regions between clusters, so it cannot be fit reliably. The noised density q_σ has support everywhere, so its score is defined and learnable at every point. The smallest level σ≈0.01 is close enough to the clean data that its score is the one the sampler finishes on.

If Langevin only needs the score, why add the random noise term at all?
Without it, the update is plain gradient ascent and every point climbs to the nearest peak and freezes, which finds the single most likely image, not a sample. The injected √ε z noise knocks points back out of the peaks so they visit the full distribution in proportion to its probability.

Why does plain Langevin get the mode weights wrong even with a perfect score?
Inside one cluster the mixing weight appears only as an additive constant log π in the log-density, and the gradient kills constants, so the score is identical whatever the weight. Across the empty gap the score is near zero, so probability cannot move between clusters to fix the ratio. High noise overlaps the clusters and restores the weight; annealing carries that fix down to low noise.

How is this different from DDPM?
Same score idea, different bookkeeping. NCSN keeps the signal fixed and lets the noise variance explode (variance-exploding), and samples with annealed Langevin. The DDPM line shrinks the signal as it adds noise (variance-preserving), predicts the noise rather than the score, and samples with a fixed learned reverse chain. The score-SDE paper later unified both as discretizations of one stochastic differential equation.

Was NCSN actually state of the art?
On inception score among unconditional models, yes: 8.87, beating most class-conditional numbers too. On FID it was only comparable, 25.32, with SNGAN at 21.7 and class-conditional BigGAN ahead on both metrics (9.22 / 14.73). The accurate claim is best unconditional inception, competitive FID, not an across-the-board win over GANs.

Footnotes & further reading

The paper: Yang Song and Stefano Ermon, Generative Modeling by Estimating Gradients of the Data Distribution (Stanford, NeurIPS 2019). Code.
Score matching: Aapo Hyvärinen, Estimation of Non-Normalized Statistical Models by Score Matching (JMLR 2005). The trace identity and the full-support consistency condition are Theorems 1 and 2.
Denoising score matching: Pascal Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011), with the closed-form Gaussian target. The denoiser-equals-score identity goes back to Robbins (1956) and Miyasawa (1961), later popularized as Tweedie's formula by Efron (2011).
Sliced score matching: Yang Song, Sahaj Garg, Jiaxin Shi, Stefano Ermon, Sliced Score Matching (UAI 2019), which estimates the Jacobian trace by random projection.
The continuous-time unification, where NCSN becomes the variance-exploding SDE: Score-Based Generative Modeling through SDEs (Song et al., 2021), and the variance-preserving sibling, DDPM (Ho et al., 2020).
The follow-up that revised the noise schedule, the conditioning, and the step sizes (and is the source of the very large $\sigma_{\max}$ and the $\mathbf{s}_{\boldsymbol\theta}(\mathbf{x})/\sigma$ parameterization often credited to the original): Improved Techniques for Training Score-Based Generative Models (NCSNv2, Song & Ermon, 2020).

Got feedback?

How could this explainer be improved? Found an error, or something unclear? I read every message.