
The Geometry of Noise: Why Diffusion Models Don't Need Noise Conditioning

A diffusion model doesn't need to be told how noisy its input is.

Standard diffusion models are handed the noise level at every step. Drop that input and the best ones keep working, because the geometry of high dimensions already encodes it. Whether generation survives depends on what the model is trained to predict.

Explaining the paper: The Geometry of Noise: Why Diffusion Models Don't Need Noise Conditioning. Sahraee-Ardakan, Delbracio, Milanfar · Google · 2026 · arXiv:2602.18428

Add noise in high dimensions and its length barely varies, so how far a point sits from the data all but fixes the level it was noised at.

Nearly every diffusion or flow model is given the noise level as an input. You show the model a noisy version of the data, and you also tell it how much noise is on that input: a single number, usually written $t$ or $\sigma$ (you can read $t$ as how far the noising process has run). The model uses that number to decide how hard to clean. The architecture injects the noise level into every layer, on the assumption that the model needs it. What the model should do at high noise (guess the rough shape of an image) is nothing like what it should do at low noise (refine the last few details). How could one network do both without being told how noisy its input is?

Recent work showed you can remove that input entirely. Sun and collaborators (Is Noise Conditioning Necessary for Denoising Generative Models?) trained a single network that sees only the noisy image, never the noise level, and it still generates clean samples. Equilibrium Matching trains the field to keep the data as its fixed points, with no time index at all, so it is noise-blind by design. The literature calls this the autonomous (or noise-blind) setting: one static vector field $f(\mathbf{u})$, the same function of the input no matter what noise produced it. A single static field somehow has to serve both pure noise and nearly-clean data, and it has to stay stable right at the data, where the gradient of the log-density diverges.

This paper, from Sahraee-Ardakan, Delbracio, and Milanfar at Google, is the theory of why blind models work. It answers two questions. How does a model know the noise level it was never told? And why do some blind models generate beautifully while others collapse into static? Geometry resolves the first; a stability race resolves the second. The argument runs in a few steps: what the conditioned model is, what happens when you remove the noise input, how high-dimensional geometry supplies the level, what landscape the blind model descends, and why one parameterization survives it while another shatters.

Conditioning on the noise level

The forward process is the easy half of any diffusion model: it takes a clean datapoint $\mathbf{x}$ and a noise level indexed by $t$ , and mixes them into a noisy observation.

\mathbf{u}_t = a(t)\,\mathbf{x} + b(t)\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})

(2)

The two schedule functions do the bookkeeping: $a(t)$ scales the signal and $b(t)$ scales the noise. As $t \to 0$ you sit on clean data; crank $t$ up and the noise term swamps everything until the observation is featureless. Different model families pick different $a, b$. DDPM (the original denoising-diffusion model) shrinks the signal as it adds noise so the total variance stays pinned ($a^2+b^2=1$); EDM (the design from Karras et al.) keeps the signal at full scale while the noise term grows; flow matching (a straight line from data to noise) slides linearly with $a=1-t$, $b=t$. The ratio that summarizes how much signal survives is the signal-to-noise ratio (SNR):

\text{SNR}(t) = \frac{a^2(t)}{b^2(t)}

(3)
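The three schedule families can be written as small functions. This is a minimal sketch: the cosine form chosen for the variance-preserving pair and the choice $\sigma = t$ for EDM are illustrative assumptions, not taken from the paper, and the function names are hypothetical.

```python
import math

def ddpm_schedule(t):
    """Variance-preserving pair with a^2 + b^2 = 1 (the cosine form is an assumption)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def edm_schedule(t):
    """Signal kept at full scale; the noise scale is taken to be sigma = t (assumption)."""
    return 1.0, t

def flow_matching_schedule(t):
    """Straight line from data (t = 0) to noise (t = 1)."""
    return 1.0 - t, t

def snr(schedule, t):
    """Signal-to-noise ratio a(t)^2 / b(t)^2 from equation (3)."""
    a, b = schedule(t)
    return (a * a) / (b * b)

# SNR falls as t grows for every family; print a few values to compare them.
for name, sched in [("ddpm", ddpm_schedule), ("edm", edm_schedule), ("fm", flow_matching_schedule)]:
    print(name, [round(snr(sched, t), 3) for t in (0.1, 0.5, 0.9)])
```

Only the pair $(a, b)$ differs between families; everything downstream depends on them through the same formulas.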

Training is a plain regression. You show the model a noisy $\mathbf{u}_t$ and ask it to predict a linear target $r = c(t)\mathbf{x} + d(t)\boldsymbol{\epsilon}$ , scored by squared error:

\mathcal{L}(f) = \mathbb{E}_{\mathbf{x},\boldsymbol{\epsilon},t}\big[\,\lVert f(\mathbf{u}_t) - (c(t)\mathbf{x} + d(t)\boldsymbol{\epsilon}) \rVert^2\,\big]

(4)

Squared error has one minimizer: the average of the thing you are predicting, conditioned on everything you can see. Here that is the conditional mean of the target, given the observation and the noise level.

f^*_t(\mathbf{u}) = \mathbb{E}_{\mathbf{x},\boldsymbol{\epsilon}\mid\mathbf{u},t}\big[\,c(t)\mathbf{x} + d(t)\boldsymbol{\epsilon}\,\big]

(5)

This is a time-dependent vector field: a different arrow at every (position, noise level) pair. The four coefficients $(a,b,c,d)$ are the only difference between the famous models. DDPM predicts the noise $\boldsymbol{\epsilon}$ ($c{=}0, d{=}1$), EDM predicts the clean signal $\mathbf{x}$ ($c{=}1, d{=}0$), flow matching predicts a velocity, the straight-line direction from data toward noise ($c{=}{-}1, d{=}1$). The $t$ input is the switch that lets one network be a different function at each noise level.
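To make the role of $(c, d)$ concrete, here is a hedged sketch of how one training pair could be built from equation (2) and the linear target. The helper names (`TARGETS`, `make_training_pair`, `squared_error`) are hypothetical, and plain Python lists stand in for tensors.

```python
import random

# (c, d) coefficients of the regression target r = c*x + d*eps, as listed in the text.
TARGETS = {
    "noise": (0.0, 1.0),      # DDPM: predict eps
    "signal": (1.0, 0.0),     # EDM: predict x
    "velocity": (-1.0, 1.0),  # flow matching: predict eps - x
}

def make_training_pair(x, a, b, c, d, rng):
    """Return (u_t, r): the noisy input from equation (2) and its regression target."""
    eps = [rng.gauss(0.0, 1.0) for _ in x]
    u = [a * xi + b * ei for xi, ei in zip(x, eps)]
    r = [c * xi + d * ei for xi, ei in zip(x, eps)]
    return u, r

def squared_error(pred, target):
    """Per-sample squared error, the quantity averaged in equation (4)."""
    return sum((p - q) ** 2 for p, q in zip(pred, target))

rng = random.Random(0)
x = [1.0, -2.0, 0.5]
t = 0.25  # flow-matching schedule: a = 1 - t, b = t
u, r = make_training_pair(x, 1.0 - t, t, *TARGETS["velocity"], rng)
```

For the velocity target the pair satisfies $\mathbf{u}_t = \mathbf{x} + t\,r$, which is the straight-line picture: start at the data and move a fraction $t$ along the velocity.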

A switch seems necessary. At high noise the field should pull any point toward the blurry average of all the data, since that is the best guess when you can barely see. At low noise it should pin a point to the nearest sharp datapoint. Those are opposite fields, and $t$ is what tells the network which one to be. Drop $t$ and you seem to lose the switch. You do not.

With the noise level removed, the network sees only $\mathbf{u}$ and must output a single vector $f(\mathbf{u})$ , the same function no matter what noise produced the input. The optimal such field is not mysterious. It is still the least-squares answer, but you can no longer condition on $t$ because you do not have it, so the best you can do is average over your uncertainty about it:

f^*(\mathbf{u}) = \mathbb{E}_{t\mid\mathbf{u}}\big[\,f^*_t(\mathbf{u})\,\big]

(6)

The weight in that average is the posterior over the noise level, $p(t\mid\mathbf{u})$ : your belief about which $t$ produced this observation, by Bayes' rule from the likelihood of seeing $\mathbf{u}$ at each level. So the optimal blind field is a posterior-weighted blend of all the conditioned fields. This is the law of iterated expectations: your best blind guess is the average of your informed guesses, each weighted by how likely its scenario is.
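For a finite dataset the blend in (6) can be computed in closed form. The sketch below assumes the flow-matching schedule ($a = 1-t$, $b = t$), the velocity target, a uniform prior over a discrete grid of $t$ values, and a hypothetical three-point dataset; none of these choices comes from the paper.

```python
import math

def log_gauss(u, mean, b):
    """Log density of an isotropic Gaussian N(mean, b^2 I) evaluated at u."""
    dim = len(u)
    sq = sum((ui - mi) ** 2 for ui, mi in zip(u, mean))
    return -0.5 * sq / (b * b) - dim * math.log(b) - 0.5 * dim * math.log(2.0 * math.pi)

def softmax(logits):
    """Return normalized weights and the log of the normalizer (log-sum-exp)."""
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    s = sum(w)
    return [wi / s for wi in w], m + math.log(s)

def conditioned_velocity(u, data, t):
    """f*_t(u) for flow matching (a = 1 - t, b = t, c = -1, d = 1), plus log p(u | t)."""
    a, b = 1.0 - t, t
    logits = [log_gauss(u, [a * xi for xi in x], b) for x in data]
    w, log_norm = softmax(logits)
    # Denoiser E[x | u, t]: a softmax-weighted average of the data points.
    denoised = [sum(wk * x[i] for wk, x in zip(w, data)) for i in range(len(u))]
    # With these coefficients the conditioned field reduces to (u - D) / t.
    field = [(ui - di) / t for ui, di in zip(u, denoised)]
    return field, log_norm - math.log(len(data))

def blind_field(u, data, t_grid):
    """Equation (6): posterior-weighted blend of the conditioned fields."""
    fields, log_liks = zip(*(conditioned_velocity(u, data, t) for t in t_grid))
    posterior, _ = softmax(list(log_liks))  # uniform prior on the grid (assumption)
    blended = [sum(p * f[i] for p, f in zip(posterior, fields)) for i in range(len(u))]
    return blended, posterior

data = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.5)]
t_grid = [0.02 * k for k in range(1, 50)]  # t from 0.02 to 0.98
field, posterior = blind_field((3.0, 3.0), data, t_grid)
```

The returned `posterior` is $p(t\mid\mathbf{u})$ on the grid, so the same function also shows where the belief about the noise level sits for any probe point.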

That blend is enough to generate. Below is the optimal blind field for a small five-point dataset, computed in closed form. The field never changes, and nothing tells it the noise level. Press play and a ring of pure-noise particles rides this one frozen field inward and settles exactly onto the data, where the field vanishes, then stops.

Figure 1 · one static field generates


A single time-invariant field f*(u), with no noise level fed in anywhere. Press play or scrub: noise particles follow this one unchanging field onto the data points and stop. The data are stable equilibria, where the field is zero. Autonomous generation: one frozen field carries noise to data.

How can an average over noise levels be the right field anywhere? At a given $\mathbf{u}$ , if the posterior $p(t\mid\mathbf{u})$ is spread across many levels, then (6) blends a high-noise field (points to the center) with a low-noise field (points to a specific mode), and the average of two contradictory arrows should be meaningless. But the posterior is usually not spread at all.

Drag the probe below and watch $p(t\mid\mathbf{u})$ directly. Out in the void the curve is broad, placing most of its weight on large noise levels. Slide the probe onto a datapoint and the curve collapses to a spike at $t\to 0$ , because an observation sitting on the data is indistinguishable from clean data.

Figure 2 · the posterior over the noise level


The posterior p(t | u) over the noise level, for a draggable probe. Far from the data it is broad and sits at large t; on a data point it collapses to a spike at t → 0. The most likely level t̂ is marked. The observation's position encodes the noise level, even when the model is never told it.

When the posterior is a spike at some $\hat{t}$ , the average in (6) reduces to the single conditioned field at $\hat{t}$ , so the blind field equals the model that knew the noise level all along. To make that concrete it helps to rewrite (6) in terms of the denoiser $D^*_t(\mathbf{u}) = \mathbb{E}[\mathbf{x}\mid\mathbf{u},t]$ , the best guess of the clean data at level $t$ :

f^*(\mathbf{u}) = \mathbb{E}_{t\mid\mathbf{u}}\!\left[\,\frac{d(t)}{b(t)}\,\mathbf{u} + \Big(c(t) - \frac{d(t)a(t)}{b(t)}\Big) D^*_t(\mathbf{u})\,\right]

(7)
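The step from (5) to (7) is short enough to spell out. This is a sketch of the algebra using only the definitions already given, with the $t$ arguments of $a, b, c, d$ left implicit:

```latex
\begin{aligned}
\boldsymbol{\epsilon} &= \frac{\mathbf{u} - a\,\mathbf{x}}{b}
  && \text{(solve (2) for the noise)} \\
c\,\mathbf{x} + d\,\boldsymbol{\epsilon} &= \frac{d}{b}\,\mathbf{u} + \Big(c - \frac{d\,a}{b}\Big)\mathbf{x}
  && \text{(substitute into the target)} \\
f^*_t(\mathbf{u}) &= \frac{d}{b}\,\mathbf{u} + \Big(c - \frac{d\,a}{b}\Big) D^*_t(\mathbf{u})
  && \text{(take } \mathbb{E}[\,\cdot \mid \mathbf{u}, t\,] \text{; only } \mathbf{x} \text{ is random)}
\end{aligned}
```

Averaging the last line over $p(t\mid\mathbf{u})$, as (6) prescribes, gives (7).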

Every parameterization is some affine mix of "where you are" ( $\mathbf{u}$ ) and "where the clean data probably is" ( $D^*_t$ ). So the blind model is Bayes-optimal averaging, and it matches the informed model exactly when the posterior over the noise level is peaked. The real question is no longer whether the average is reasonable; it is when the average is sharp. That is a question about geometry.

Reading the noise level off the geometry

In high dimensions, noise has a very predictable size, and a blind model relies on exactly that. Add Gaussian noise of level $\sigma$ to a point in $D$ dimensions and the noise vector's length is almost exactly $\sigma\sqrt{D}$, with relative fluctuations that shrink like $1/\sqrt{D}$. (This is concentration of measure: a high-dimensional Gaussian puts essentially all its mass on a thin spherical shell, not near its center.)

So suppose the data does not fill the space but sits on a thin $d$-dimensional manifold (a curved lower-dimensional sheet) inside a large $D$-dimensional ambient space. The part of a noisy observation that sticks out off the manifold is pure noise in the remaining $D-d$ directions, so its length is about $\sigma\sqrt{D-d}$. That length is essentially the noise level. Measuring the distance $r$ from the observation to the manifold gives an estimate

\hat{\sigma} = \frac{r}{\sqrt{D-d}}, \qquad \text{spread shrinking like } \frac{1}{\sqrt{D-d}}.

The bigger the gap between the ambient dimension and the manifold's intrinsic dimension, the tighter the estimate. Two different noise levels live on two shells of different radius; in low dimensions the shells are fat and overlap, so the level is ambiguous, and in high dimensions they are razor-thin and disjoint, so the level is read straight off the radius. Drag the dimension below and watch the two shells separate.

Figure 3 · the blessing of dimensionality


The estimated noise level σ̂ = r/√(D−d) for a low and a high true level, as the ambient dimension D grows. The shaded overlap is the ambiguity. At D = 8 the two levels already separate; by D = 128 they are disjoint, so the geometry pins the noise level and the posterior p(t | u) collapses. High dimensions make the blind average exact.
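The concentration claim is easy to check by simulation. The sketch below assumes a flat manifold, so that the distance $r$ is just the length of the noise in the $D-d$ normal directions; the function names and the sample sizes are illustrative choices.

```python
import math
import random
import statistics

def estimate_sigma(sigma, ambient_dim, manifold_dim, rng):
    """Draw off-manifold Gaussian noise and read the level off its length: r / sqrt(D - d)."""
    n = ambient_dim - manifold_dim  # only the D - d normal directions contribute to r
    r = math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n)))
    return r / math.sqrt(n)

def relative_spread(sigma, ambient_dim, manifold_dim=1, trials=2000, seed=0):
    """Mean of the estimate and its standard deviation relative to the true level."""
    rng = random.Random(seed)
    estimates = [estimate_sigma(sigma, ambient_dim, manifold_dim, rng) for _ in range(trials)]
    return statistics.mean(estimates), statistics.stdev(estimates) / sigma

# The relative spread should shrink as D - d grows.
for D in (3, 9, 129):
    mean, spread = relative_spread(sigma=0.5, ambient_dim=D)
    print(D, round(mean, 3), round(spread, 3))
```

With a curved manifold the normal component is no longer exactly Gaussian in $D-d$ directions, so this is a best-case illustration of the estimate.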

The 1D plot above superimposes the two bells so you can watch their overlap shrink as $D$ grows. The shells themselves are a 3D-shaped picture: two concentric spheres of points around the single data point, fatter for high noise, thinner for low. Figure 4 draws the drawable case, $D=3$ with a single data point at the origin ( $d=0$ ), and lets you rotate in any direction to see both spheres at once. The real model lives in hundreds of dimensions where the same spheres are paper-thin and far apart, exactly the regime the 1D bells reach as you crank $D$ up.

Figure 4 · the two shells, in the drawable case

A rotatable $D=3$ picture of a single data point with two shells of noisy observations around it. The teal inner shell sits at radius $\sigma_{\text{lo}}\sqrt{D-d}\approx 0.69$, the amber outer shell at $\sigma_{\text{hi}}\sqrt{D-d}\approx 1.73$. Drag in any direction to rotate; even at $D=3$ the two shells are visibly distinct spheres. In real high dimensions the shells thin out and pull apart at exactly the rate the bells in Figure 3 separate. For a higher-dimensional data manifold the same shells become tubes ($d=1$, a line) or thicker sheets ($d=2$); the math is identical, only the rendering changes.

Two caveats. The geometry only makes the noise level recoverable from the observation; whether a trained network actually exploits the distance $r$ to recover it is an empirical question, and the answer (from Sun et al.) is that it does. And there is a second, stronger reason the posterior is sharp, one that needs no high dimension at all. As an observation approaches the data manifold it becomes indistinguishable from clean data, so the smallest noise levels dominate and $p(t\mid\mathbf{u})$ concentrates at $t\to 0$ by sheer proximity. That near-manifold collapse keeps generation stable at the end, regardless of dimension.

So the noise level is not lost when you stop feeding it in. It is encoded in the observation's geometry, and the model recovers it from there. Wherever the posterior is peaked (which is almost everywhere) the blind field equals the informed field. But "equals a good field" is a statement about the target. Whether you can safely follow that field all the way down to the data is a separate question, and it has a geometric obstruction.

An infinitely deep well

The paper's reframing is that a blind model is not chasing a moving target at all. It is descending a single fixed landscape: the marginal energy. Define it as the negative log-likelihood of a noisy datapoint at some unknown noise level,

E_{\text{marg}}(\mathbf{u}) = -\log p(\mathbf{u}), \qquad p(\mathbf{u}) = \int p(\mathbf{u}\mid t)\,p(t)\,dt

(1)

where the integral averages the noisy-data density over a prior on the noise level. Generation is then rolling downhill on this one static potential toward where noisy data is most plausible, which is the clean data itself. A static field can generate because it is the gradient field of a static potential, so it needs no noise-level input, only a slope to follow downhill.

The slope is the posterior-averaged score. To write it in terms of the denoiser we use Tweedie's formula, the empirical-Bayes identity (Robbins 1956, Efron 2011) that the conditional score points from your noisy point toward your best guess of the clean one:

\nabla_{\mathbf{u}} \log p(\mathbf{u}\mid t) = \frac{a(t)\,D^*_t(\mathbf{u}) - \mathbf{u}}{b(t)^2}

(10)
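Equation (10) can be checked numerically on a toy problem. The sketch below assumes a finite one-dimensional dataset, so that $p(\mathbf{u}\mid t)$ is an equal-weight Gaussian mixture, and compares the Tweedie expression with a finite-difference derivative of the log-density. The dataset and the values of $a$, $b$ are arbitrary illustrative choices.

```python
import math

def _logits(u, data, a, b):
    """Unnormalized log weights of each data point given the observation u."""
    return [-0.5 * ((u - a * x) / b) ** 2 for x in data]

def log_p(u, data, a, b):
    """log p(u | t) for an equal-weight mixture of N(a * x_k, b^2)."""
    terms = _logits(u, data, a, b)
    m = max(terms)
    log_sum = m + math.log(sum(math.exp(v - m) for v in terms))
    return log_sum - math.log(len(data) * b * math.sqrt(2.0 * math.pi))

def denoiser(u, data, a, b):
    """E[x | u, t]: softmax-weighted average of the data points."""
    terms = _logits(u, data, a, b)
    m = max(terms)
    w = [math.exp(v - m) for v in terms]
    return sum(wi * x for wi, x in zip(w, data)) / sum(w)

def score_tweedie(u, data, a, b):
    """Right-hand side of equation (10)."""
    return (a * denoiser(u, data, a, b) - u) / (b * b)

def score_finite_difference(u, data, a, b, h=1e-5):
    """Left-hand side of equation (10), by central differences."""
    return (log_p(u + h, data, a, b) - log_p(u - h, data, a, b)) / (2.0 * h)

data = [-1.0, 0.0, 1.0]
gap = abs(score_tweedie(0.3, data, 0.7, 0.4) - score_finite_difference(0.3, data, 0.7, 0.4))
```

The two sides agree up to finite-difference error, and the factor $a(t)$ in front of the denoiser is what makes them agree when the schedule scales the signal.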

Averaging that over the posterior on $t$ gives the gradient of the marginal energy:

\nabla_{\mathbf{u}} E_{\text{marg}}(\mathbf{u}) = \mathbb{E}_{t\mid\mathbf{u}}\!\left[\,\frac{\mathbf{u} - a(t)\,D^*_t(\mathbf{u})}{b(t)^2}\,\right]

(11)

That gradient carries a $1/b(t)^2$ , and $b(t)\to 0$ as you near the data, so the marginal energy has an infinitely deep, infinitely steep well at every datapoint:

\lim_{\mathbf{u}\to\mathbf{x}_k} \lVert \nabla_{\mathbf{u}} E_{\text{marg}}(\mathbf{u}) \rVert = \infty

(12)

A neural network outputs finite vectors. A finite field cannot equal an infinite gradient. So either a blind model cannot represent the landscape near the data, exactly where generation finishes and matters most, or something is rescuing it. The 3D view makes this concrete: lift the data plane and the marginal energy becomes a surface with pits at every data point. Toggle between the energy itself, the gradient magnitude, and the preconditioned field, and rotate to see how each surface treats the data.

Figure 5 · the landscape, in three views

The same three data points laid out on a 2D plane, with the field drawn as a rotatable landscape. The energy view shows pits diving to negative infinity at every point (the floor is clipped). The gradient view shows tall spikes diverging upward at every point: the singularity the paper formalizes in equation (12). The preconditioned view shows the smooth bounded bowl the gain leaves behind, with the data sitting at the floor as a stable equilibrium. Drag to rotate in any direction. The qualitative shape is the paper's claim; the constants are chosen for legibility.

The same math in one dimension makes the cancellation exact. Three data points sit on a line, and the three quantities involved (energy, raw gradient, and the field the model actually follows) are plotted as three curves. The raw gradient runs off the top of the chart while the bold teal product, the field f*, stays finite and crosses zero at every datum.

Figure 6 · the singularity, and the gain that cancels it


A one-dimensional toy with three data points symmetric around zero. The faint amber curve is the marginal energy, plunging into a deep well at each datum. The red curve is the raw gradient $\|\nabla E\|$, which diverges at the data (the spikes are clipped; tagged →∞). The bold teal curve is the magnitude $\|f^*\|$ of the field the model follows. It stays bounded everywhere and vanishes at every data point, marking the three stable equilibria (teal dots at the baseline). The curve also dips to zero at the midpoints between adjacent data points; those are unstable saddles where the conditional mean of the data equals $u$ by symmetry, harmless for the dynamics (the model would slide off either way). Drag the probe and watch the gain vanish at the same rate the gradient blows up.

The singularity cancels. The blind field decomposes into three pieces: a scaled copy of the energy gradient (the natural-gradient term, a gradient rescaled by a local metric), a transport correction, and a linear drift.

f^*(\mathbf{u}) = \underbrace{\overline{\lambda}(\mathbf{u})\,\nabla E_{\text{marg}}(\mathbf{u})}_{\text{natural gradient}} + \underbrace{\mathbb{E}_{t\mid\mathbf{u}}\big[(\lambda(t)-\overline{\lambda})(\nabla E_t - \nabla E_{\text{marg}})\big]}_{\text{transport correction}} + \underbrace{\overline{c}_{\text{scale}}(\mathbf{u})\,\mathbf{u}}_{\text{drift}}

(14)

Take the three terms in order. The first is the energy gradient scaled by an effective gain $\overline{\lambda}(\mathbf{u}) = \mathbb{E}_{t\mid\mathbf{u}}[\lambda(t)]$, where $\lambda(t) = \tfrac{b}{a}(da - cb)$. The second, the transport correction, measures how far the per-noise-level gradients sit from their own average; it is nonzero only when the posterior is spread across several levels, and it dies once the posterior concentrates on one. The third is a linear drift the noise schedule contributes.

The gain cancels the singularity. Near the data $b\to 0$, so the $1/b^2$ in the gradient sends it to infinity, while $\lambda$ carries factors of $b$ that vanish at the matching rate, leaving the product $\overline{\lambda}\,\nabla E_{\text{marg}}$ finite. It is like descending an infinitely steep slope while shortening your stride toward zero at the same rate, so every step stays finite.

The paper calls this a Riemannian gradient flow: the gain acts as a local metric, a position-dependent rescaling of distance, that turns the singular landscape into a smooth, finite-speed descent with a stable resting point at the data.¹ Because the transport correction dies wherever the posterior concentrates (high dimension, or near the manifold), what is left is a clean preconditioned gradient flow.
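The decomposition follows from (7), (10) and (11) in a few lines. The sketch below writes $E_t = -\log p(\mathbf{u}\mid t)$ for the per-level energy, which is my reading of the $\nabla E_t$ that appears in (14), and leaves the $t$ arguments implicit. The last line evaluates $\lambda$ for the three standard targets using the $(c, d)$ values given earlier.

```latex
\begin{aligned}
\nabla E_t(\mathbf{u}) &= \frac{\mathbf{u} - a\,D^*_t(\mathbf{u})}{b^2}
  \;\;\Longrightarrow\;\; D^*_t(\mathbf{u}) = \frac{\mathbf{u} - b^2\,\nabla E_t(\mathbf{u})}{a} \\
f^*_t(\mathbf{u}) &= \frac{d}{b}\,\mathbf{u} + \Big(c - \frac{d\,a}{b}\Big)\frac{\mathbf{u} - b^2\,\nabla E_t(\mathbf{u})}{a}
  = \frac{c}{a}\,\mathbf{u} + \underbrace{\frac{b}{a}\,(d\,a - c\,b)}_{\lambda(t)}\,\nabla E_t(\mathbf{u}) \\
\mathbb{E}_{t\mid\mathbf{u}}\big[\lambda\,\nabla E_t\big] &= \overline{\lambda}\,\nabla E_{\text{marg}}
  + \mathbb{E}_{t\mid\mathbf{u}}\big[(\lambda - \overline{\lambda})(\nabla E_t - \nabla E_{\text{marg}})\big],
  \qquad \text{since } \mathbb{E}_{t\mid\mathbf{u}}[\nabla E_t] = \nabla E_{\text{marg}} \text{ by (11)} \\
\lambda_{\text{noise}} &= b, \qquad
\lambda_{\text{signal}} = -\frac{b^2}{a}, \qquad
\lambda_{\text{velocity}} = \frac{t}{1-t} \quad (a = 1-t,\; b = t)
\end{aligned}
```

Under this reading the drift coefficient in (14) would be the posterior average of $c(t)/a(t)$; the explainer does not define it, so treat that as an inference. In every case $\lambda$ vanishes as $b \to 0$, which is the cancellation described above.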

The same decomposition holds at every point on the plane. The figure samples a grid of locations and, at each, draws the three terms of equation (14) head to tail: a teal natural-gradient arrow, then an amber transport-correction arrow, then a small drift arrow, with a dashed line showing the total. The data points sit inside amber rings; the chains shrink to zero as you approach them. Slide the posterior sharpness and watch the amber transport correction die away as the posterior concentrates; what is left is the teal natural-gradient flow plus the drift.

Figure 7 · the field, decomposed


Equation (14) on a 2D plane. At every grid point the three components of f*(u) are drawn head to tail: natural gradient (bounded, vanishes at data), transport correction (perpendicular, vanishes with posterior concentration and near data), and a small drift (linear in u). The dashed line is their sum, the field f* the model actually follows. Toggle to isolate one term or watch the sum directly. Slide the posterior sharpness from broad to concentrated and watch the transport correction die away. The qualitative shape of each term is what the paper proves; the toy lets you see them combine.

So the infinite well never appears in the field the model learns. That field is bounded, with the data as a stable attractor, which is what you saw the particles settle into in Figure 1. Whether a real numerical sampler can follow that bounded field down to the data is a separate question.

Which parameterization survives

Having a bounded field is not the same as being able to follow it; that is the question of the sampling dynamics. You integrate an ODE driven by the field, and a bounded field divided by a vanishing noise scale can still make that ODE stiff (one that forces a numerical solver into vanishingly small steps) and explosive. The sampler integrates

\frac{d\mathbf{u}}{dt} = \mu(t)\,\mathbf{u} + \nu(t)\,f^*(\mathbf{u})

(19)

where $\mu(t)$ is the schedule's drift and $\nu(t)$ is the effective gain of the parameterization. Even though the network $f^*$ is never told the noise level, the sampler still uses $t$ , because $\mu$ and $\nu$ depend on it.
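A minimal Euler integrator for (19) shows where $t$ still enters. The field used here is a hypothetical toy, not the paper's: a single data point at the origin under flow matching, where the conditioned velocity is $\mathbf{u}/t$ and the noise level is replaced by the geometric estimate $\lVert\mathbf{u}\rVert/\sqrt{D}$. The coefficients $\mu = 0$, $\nu = 1$ are the flow-matching choice, where the ODE is simply $d\mathbf{u}/dt = f$.

```python
import math
import random

def toy_blind_velocity(u):
    """Hypothetical noise-blind velocity field for ONE data point at the origin.

    With flow matching u_t = t * eps, so the conditioned velocity is u / t.
    Replacing t by the estimate ||u|| / sqrt(D) gives sqrt(D) * u / ||u||.
    """
    norm = math.sqrt(sum(ui * ui for ui in u))
    if norm == 0.0:
        return [0.0 for _ in u]  # the data point is an equilibrium
    scale = math.sqrt(len(u)) / norm
    return [scale * ui for ui in u]

def euler_sample(u, field, mu, nu, steps):
    """Integrate du/dt = mu(t) * u + nu(t) * field(u) from t = 1 down to t = 0."""
    dt = -1.0 / steps
    for k in range(steps):
        t = 1.0 + k * dt
        f = field(u)  # the field never sees t; only the coefficients do
        u = [ui + dt * (mu(t) * ui + nu(t) * fi) for ui, fi in zip(u, f)]
    return u

def length(u):
    return math.sqrt(sum(ui * ui for ui in u))

rng = random.Random(0)
start = [rng.gauss(0.0, 1.0) for _ in range(256)]
end = euler_sample(start, toy_blind_velocity, mu=lambda t: 0.0, nu=lambda t: 1.0, steps=200)
```

In this toy the sample shrinks radially at a constant rate and arrives near the origin at $t = 0$; the residual distance reflects how far the starting length was from $\sqrt{D}$, plus one Euler step.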

To judge stability, compare the blind sampler against an oracle that knows the true $t$ at every step. Subtracting the two cancels the shared drift term and leaves the error introduced purely by being blind:

\lVert \Delta\mathbf{v}(\mathbf{u},t) \rVert = \underbrace{|\nu(t)|}_{\text{gain}} \cdot \underbrace{\lVert f^*(\mathbf{u}) - f^*_t(\mathbf{u}) \rVert}_{\text{estimation error}}

(22)

Stability is a race as $t\to 0$ . The estimation error falls (the posterior over the noise level concentrates), but the gain $\nu(t)$ may diverge. Which of the two wins decides the outcome, and the three standard targets land in three different places.

Noise prediction (DDPM, and DDIM, its deterministic-sampler variant) has gain $\nu\sim 1/b$ , which diverges. Worse, its estimation error does not fall to zero. It floors at a positive "Jensen gap": because $1/b$ is convex, the posterior-average of $1/b$ exceeds $1/(\text{average } b)$ , and that gap stays nonzero unless the posterior is a perfect spike. A diverging gain times a floored error goes to infinity.
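The Jensen gap is a two-line computation. The sketch below uses a hypothetical three-level discrete posterior over the noise scale $b$; the levels and weights are made up for illustration.

```python
def jensen_gap(levels, weights):
    """E[1/b] - 1/E[b] under a discrete posterior over the noise scale b."""
    total = sum(weights)
    probs = [w / total for w in weights]
    mean_inverse = sum(p / b for p, b in zip(probs, levels))
    inverse_mean = 1.0 / sum(p * b for p, b in zip(probs, levels))
    return mean_inverse - inverse_mean

# A spread posterior leaves a positive gap; a perfect spike leaves none.
spread_posterior = jensen_gap([0.05, 0.10, 0.20], [1.0, 2.0, 1.0])
spike_posterior = jensen_gap([0.05, 0.10, 0.20], [0.0, 1.0, 0.0])
```

Because $1/b$ is convex, the gap is never negative, and any residual spread in the posterior keeps it strictly positive.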

Signal prediction (EDM) has an even worse gain, $\nu\sim 1/b^2$ . But near a discrete data manifold the denoising error decays exponentially, like $e^{-C/b^2}$ . Near a data point the denoiser is a softmax over the data points weighted by a Gaussian in distance, so once you sit close to one point every other point is suppressed by a Gaussian factor in its squared distance, and the error to the nearest point dies that fast. Exponential decay beats any polynomial blow-up, so the product goes to zero: a stronger gain singularity than noise prediction, yet still stable, because the error crashes faster than the gain climbs.
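The softmax structure makes the exponential decay easy to see. This sketch assumes a two-point dataset on a line with $a = 1$, and uses the distance from the denoiser output to the nearest data point as a stand-in for the denoising error; the offset of the probe and the list of noise levels are illustrative assumptions.

```python
import math

def denoiser(u, data, b):
    """E[x | u] for a finite 1D dataset with a = 1: a Gaussian-weighted softmax over the points."""
    logits = [-0.5 * ((u - x) / b) ** 2 for x in data]
    m = max(logits)
    w = [math.exp(l - m) for l in logits]
    return sum(wi * x for wi, x in zip(w, data)) / sum(w)

def amplified_error(b, data=(-1.0, 1.0), nearest=1.0, offset=0.5):
    """(1 / b^2) * |D(u) - nearest| for a probe sitting offset * b away from the nearest point."""
    u = nearest - offset * b
    return abs(denoiser(u, data, b) - nearest) / (b * b)

# The 1/b^2 gain grows, but the product still collapses as b shrinks.
curve = [(b, amplified_error(b)) for b in (0.8, 0.6, 0.4, 0.3, 0.2)]
```

Even after multiplying by the $1/b^2$ gain, the values fall toward zero as $b$ shrinks, which is the sense in which the error outruns the gain.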

Velocity prediction (flow matching) has gain $\nu = 1$ , flat. There is nothing to amplify, so the error stays bounded and the dynamics are inherently stable. Toggle the target below and watch the drift error $\Delta\mathbf{v}$ as the noise vanishes: only noise prediction runs to infinity.

Figure 8 · the stability race

The drift error Δv = gain × estimation error as t → 0, for each target. Faint curves show the gain rising and the error falling; the bold curve is their product. Noise prediction diverges (unstable). Signal and velocity stay bounded. The single coefficient ν(t) sets the outcome.
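The three outcomes can be tabulated from the scalings quoted above. This is only a sketch of the asymptotics: the error floor, the decay constant, and the bounded error used for velocity are hypothetical constants chosen for illustration, not values from the paper.

```python
import math

def drift_error(target, b, error_floor=0.05, decay_constant=1.0):
    """Gain times estimation error, using only the scalings quoted in the text."""
    if target == "noise":
        # Gain ~ 1/b, error floored at a positive Jensen gap.
        return (1.0 / b) * error_floor
    if target == "signal":
        # Gain ~ 1/b^2, error ~ exp(-C / b^2).
        return (1.0 / (b * b)) * math.exp(-decay_constant / (b * b))
    if target == "velocity":
        # Gain = 1, so a bounded error is passed through unamplified.
        return 1.0 * error_floor
    raise ValueError(f"unknown target: {target}")

noise_levels = [0.5, 0.2, 0.1, 0.05]
race = {name: [drift_error(name, b) for b in noise_levels]
        for name in ("noise", "signal", "velocity")}
```

Reading each list left to right follows $b \to 0$: the noise entry grows without bound, the signal entry collapses, and the velocity entry stays flat.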

This makes the title's claim precise. You can drop the noise-level input, but only if your parameterization has a bounded gain (velocity) or a self-correcting one (signal). Velocity-based models are inherently safe; noise-prediction models are structurally broken when blind. The difference is one coefficient, $\nu(t)$ : bounded for velocity, divergent for noise.

What collapses, what survives

On the CIFAR-10 numbers Sun et al. report, run blind, a noise-prediction DDIM collapses to FID $40.90$, while blind flow matching and uEDM (both velocity) reach $2.61$ and $2.23$. FID is a distance between the real and generated image distributions where lower is better, and on CIFAR-10 anything around $2$ to $3$ is near the best published, while $40$ means the samples are visibly broken. That roughly eighteen-fold gap is not a tuning artifact. It is the $1/b$ gain singularity turning ordinary estimation noise into garbage. The paper's own runs on CIFAR-10, SVHN, and Fashion-MNIST show the same pattern by eye: blind noise-prediction images are dominated by high-frequency artifacts, while blind velocity images are sharp and match the conditioned models.

The dimensionality experiment makes the geometry visible. Take a 2D ring of data, embed it in a $D$-dimensional space, and sweep $D$. At $D=2$ both blind models struggle: the shells overlap, the posterior is ambiguous, and the samples are diffuse. At $D=8$ and $32$ velocity is already clean while noise prediction is scattered. At $D=128$ the concentration is so sharp that even the unstable noise-prediction model converges, because its estimation error finally crashes faster than its gain diverges.

So do diffusion models need noise conditioning? Not always: a velocity- or signal-based model can drop the explicit noise input, because the geometry supplies the level and the dynamics stay bounded, while a noise-prediction model effectively still needs it, because run blind it is structurally unstable. The clean argument leans on data living on a low-dimensional manifold inside a high-dimensional space, and the exponential-stability case for signal prediction assumes a discrete manifold, so the result is strongest exactly where real image data tends to live.

Wherever the geometry fixes the noise level, the blind model matches the informed one, and whether the sampler reaches the data comes down to the gain coefficient. For the stable parameterizations, then, the noise-level input that standard diffusion models carry was information the observation already held. The model was never told the level; it recovers it from the geometry of its own input.

Provenance: verified against primary literature

Tweedie / Robbins / Efron: A denoiser is the conditional score (Robbins 1956, Efron 2011, Vincent 2011); the signal-scaling a(t) factor is essential.

Sun et al. (2025): The empirical result that blind models generate; the CIFAR-10 FID figures are their benchmark, which this paper explains.

EDM / Flow Matching / EqM: The unified affine schedule and the (a,b,c,d) targets for each model (Karras 2022; Lipman 2022; Wang & Du).

Concentration of measure: σ̂ = r/√(D−d) is a nearly unbiased noise-level estimate in high dimensions; its relative spread shrinks like 1/√(D−d) (Vershynin; Amari for natural gradient).

Caveat: The result is a touch narrower than the title suggests. Noise conditioning is dispensable only for stable (velocity- or signal-based) parameterizations; a noise-prediction model trained blind is structurally unstable. The clean argument also assumes data on a low-dimensional manifold in a high-dimensional space, and the exponential-stability case for signal prediction assumes a discrete manifold.

Questions you might still have

If the network never sees the noise level, how does it know how much to denoise?
In high dimensions the noise level is written into the geometry of the observation: the distance from the point to the data manifold fixes it (σ̂ = r/√(D−d)), so the posterior over the noise level collapses to a near-certain estimate. The model recovers the noise level from the geometry instead of being told it.

Then why does a blind DDPM fail while blind flow matching works?
Stability is a race as the noise vanishes. Noise prediction has a gain that grows like 1/b and amplifies the residual uncertainty (a nonzero "Jensen gap") into a blow-up; velocity prediction has a gain of exactly 1, so nothing amplifies the error.

Does this mean noise conditioning is useless?
No. The claim is narrower than that: a velocity- or signal-based model can drop the explicit noise input because the geometry supplies it and the dynamics stay bounded. A noise-prediction model cannot. And the clean theory leans on data sitting on a low-dimensional manifold inside a high-dimensional space.

What is the marginal energy, exactly?
It is −log p(u), where p(u) averages the noisy-data densities over every noise level. It is the single static landscape a blind model descends, and it has an infinitely deep well at every data point.

Footnotes & further reading

¹ The gain acts as a conformal metric, a scalar function times the identity that locally rescales distance. If the update is written as gain times gradient, the implied metric is the inverse gain, so "the gain is the metric" is right in spirit, though the exact statement should be pinned down in the algebra.
The paper: Sahraee-Ardakan, Delbracio, Milanfar, The Geometry of Noise: Why Diffusion Models Don't Need Noise Conditioning (Google, 2026).
The empirical result this explains: Sun, Jiang, Zhao, He, Is Noise Conditioning Necessary for Denoising Generative Models? (2025), which also supplies the unified affine schedule and the CIFAR-10 benchmark.
The diffusion design space and the EDM conventions: Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models. Flow Matching: Lipman et al., Flow Matching for Generative Modeling.
Tweedie's formula and the denoiser-score link: Robbins (1956), Efron (2011), and Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011).
The concurrent, kindred analysis of blind denoising in high dimensions: Kadkhodaie et al., Blind Denoising Diffusion Models and the Blessings of Dimensionality. Natural gradient: Amari (1998).
