Verified · arXiv:2602.18428 · 26 min
Diffusion · Theory

The Geometry of Noise: Why Diffusion Models Don't Need Noise Conditioning

A diffusion model doesn't need to be told how noisy its input is.

Standard diffusion models are handed the noise level at every step. Drop that input and the best ones keep working, because the geometry of high dimensions hands the noise level back. The catch is which kind of model you drop it from.

Explaining the paper: The Geometry of Noise: Why Diffusion Models Don't Need Noise Conditioning. Sahraee-Ardakan, Delbracio, Milanfar · Google · 2026 · arXiv:2602.18428

What if the noise-level input that every diffusion model is built around turns out to be redundant?

Every diffusion or flow model is built around a clock. You show the model a noisy version of the data and you also tell it how much noise is on the input: a single number, the noise level, usually written $t$ or $\sigma$. The model uses that number to decide how hard to clean. The whole architecture carries the noise level around, injecting it into every layer, because the folklore is that the model genuinely needs it. The right move at high noise (guess the rough shape of an image) is nothing like the right move at low noise (sharpen the last few details), and how would one network do both without being told where on the schedule it stands?

Recent work showed you can rip the clock out. Sun and collaborators (Is noise conditioning necessary?) trained a single network that sees only the noisy image, never the noise level, and it still generates clean samples. Equilibrium Matching trains the field to keep the data as its fixed points, with no time index at all, so it is noise-blind by design. The field calls this the autonomous (or noise-blind) setting: one static vector field $f(\mathbf{u})$, the same function of the input no matter what noise produced it. It is a real puzzle rather than a free lunch. A single static rulebook somehow has to serve pure noise and nearly-clean data both, and it has to stay stable right at the data, where, as we will see, the math wants to blow up.

This paper, from Sahraee-Ardakan, Delbracio, and Milanfar at Google, is the theory of why blind models work. It answers two questions. How does a model know the noise level it was never told? And why do some blind models generate beautifully while others collapse into static? The answer to the first is geometry; the answer to the second is a stability race. The argument is a tower: what the conditioned model is, what happens when you drop the clock, how high-dimensional geometry leaks the noise level back, what landscape the blind model is really descending, and why one parameterization survives the descent and another shatters.

Conditioning on the clock

The forward process is the easy half of any diffusion model: it takes a clean datapoint $\mathbf{x}$ and a noise level indexed by $t$, and mixes them into a noisy observation.

$$\mathbf{u}_t = a(t)\,\mathbf{x} + b(t)\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}) \tag{2}$$

The two schedule functions do the bookkeeping: $a(t)$ scales the signal and $b(t)$ scales the noise. As $t \to 0$ you sit on clean data; crank $t$ up and the noise term swamps everything until the observation is featureless. Different model families pick different $a, b$. DDPM shrinks the signal as it adds noise so the total variance stays pinned ($a^2+b^2=1$); EDM leaves the signal alone and lets the noise grow; flow matching slides linearly from data to noise with $a=1-t$, $b=t$. The ratio that summarizes how much signal survives is the signal-to-noise ratio:

$$\text{SNR}(t) = \frac{a^2(t)}{b^2(t)} \tag{3}$$

Training is a plain regression. You show the model a noisy $\mathbf{u}_t$ and ask it to predict a linear target $r = c(t)\mathbf{x} + d(t)\boldsymbol{\epsilon}$, scoring squared error:

$$\mathcal{L}(f) = \mathbb{E}_{\mathbf{x},\boldsymbol{\epsilon},t}\big[\,\lVert f(\mathbf{u}_t) - (c(t)\mathbf{x} + d(t)\boldsymbol{\epsilon}) \rVert^2\,\big] \tag{4}$$

Squared error has one minimizer, and you already know it: the average of the thing you are predicting, conditioned on everything you can see. So the best conditioned model is the conditional mean of the target, given the observation and the noise level.

$$f^*_t(\mathbf{u}) = \mathbb{E}_{\mathbf{x},\boldsymbol{\epsilon}\mid\mathbf{u},t}\big[\,c(t)\mathbf{x} + d(t)\boldsymbol{\epsilon}\,\big] \tag{5}$$

This is a time-dependent vector field: a different arrow at every (position, noise level) pair. The four coefficients $(a,b,c,d)$ are the only difference between the famous models. DDPM predicts the noise $\boldsymbol{\epsilon}$ ($c=0$, $d=1$), EDM predicts the clean signal $\mathbf{x}$ ($c=1$, $d=0$), flow matching predicts a velocity ($c=-1$, $d=1$). The $t$ input is the switch that lets one network be a different function at each noise level.
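The bookkeeping in equations (2) to (4) fits in a few lines. The sketch below is illustrative only: the function names are made up for this example, the DDPM row uses the simplification $b = t$, $a = \sqrt{1-t^2}$ (real DDPM schedules are defined through cumulative products), and the EDM row sets $\sigma = t$.

```python
import numpy as np

def schedule(name, t):
    """Return (a, b, c, d) at noise index t in (0, 1). Hypothetical helper."""
    t = np.asarray(t, dtype=float)
    one, zero = np.ones_like(t), np.zeros_like(t)
    if name == "ddpm":   # variance preserving (a^2 + b^2 = 1), predicts the noise
        return np.sqrt(1.0 - t**2), t, zero, one
    if name == "edm":    # signal untouched, predicts the clean signal
        return one, t, one, zero
    if name == "flow":   # linear interpolation, predicts the velocity
        return 1.0 - t, t, -one, one
    raise ValueError(name)

def forward(name, x, eps, t):
    """Noisy observation, regression target, and SNR for one (x, eps, t) triple."""
    a, b, c, d = schedule(name, t)
    u = a * x + b * eps      # equation (2)
    r = c * x + d * eps      # target inside the loss, equation (4)
    snr = a**2 / b**2        # equation (3)
    return u, r, snr

rng = np.random.default_rng(0)
x, eps = rng.normal(size=4), rng.normal(size=4)
u, r, snr = forward("flow", x, eps, t=0.5)
# For flow matching at t = 0.5: u is the midpoint of x and eps, r = eps - x, SNR = 1.
```

Swapping the `name` argument changes nothing but the four coefficients, which is the point of the unified notation.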

And you can feel why a switch seems necessary. At high noise the field should haul any point toward the blurry average of all the data, since that is genuinely the best guess when you can barely see. At low noise it should pin a point to the nearest sharp datapoint. Those are opposite fields, and $t$ is what tells the network which one to be. Drop $t$ and you seem to lose the switch. The rest of the paper is about why you do not.

Dropping the clock

Take the noise level away. Now the network sees only $\mathbf{u}$ and must output a single vector $f(\mathbf{u})$, the same function no matter what noise produced the input. The optimal such field is not mysterious. It is still the least-squares answer, but you can no longer condition on $t$ because you do not have it, so the best you can do is average over your uncertainty about it:

$$f^*(\mathbf{u}) = \mathbb{E}_{t\mid\mathbf{u}}\big[\,f^*_t(\mathbf{u})\,\big] \tag{6}$$

The weight in that average is the posterior over the noise level, $p(t\mid\mathbf{u})$: your belief about which $t$ produced this observation, by Bayes' rule from the likelihood of seeing $\mathbf{u}$ at each level. So the optimal blind field is a posterior-weighted blend of all the conditioned fields. This is just the law of iterated expectations, the same move that says your best blind guess is the average of your informed guesses, each weighted by how likely its scenario is.
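For a finite dataset the blend in (6) can be evaluated exactly. Here is a minimal sketch under assumptions that are this example's, not the paper's: a flow-matching schedule ($a=1-t$, $b=t$, velocity target), five points on a unit ring, and a uniform prior over a discrete grid of noise levels. `blind_field` is a hypothetical name.

```python
import numpy as np

def blind_field(u, data, ts):
    """Posterior-weighted blend of conditioned velocity fields at one point u."""
    D = u.shape[0]
    a, b = 1.0 - ts, ts                                   # flow-matching schedule
    # Squared distance from u to every scaled datapoint a(t) x_k: shape (T, K).
    sq = ((u[None, None, :] - a[:, None, None] * data[None, :, :]) ** 2).sum(-1)
    log_lik = -sq / (2.0 * b[:, None] ** 2) - D * np.log(b)[:, None]
    row_max = log_lik.max(axis=1, keepdims=True)
    w = np.exp(log_lik - row_max)
    log_p_u_given_t = row_max[:, 0] + np.log(w.sum(axis=1))  # log p(u | t) + const
    w /= w.sum(axis=1, keepdims=True)                     # weights over datapoints
    denoised = w @ data                                   # D*_t(u) for each t
    f_t = (u[None, :] - denoised) / b[:, None]            # conditioned velocity
    post = np.exp(log_p_u_given_t - log_p_u_given_t.max())
    post /= post.sum()                                    # p(t | u), uniform prior
    return post @ f_t, post

angles = 2 * np.pi * np.arange(5) / 5
data = np.stack([np.cos(angles), np.sin(angles)], axis=1)
ts = np.linspace(0.02, 1.0, 50)

f_far, post_far = blind_field(np.array([4.0, 4.0]), data, ts)
f_on, post_on = blind_field(data[0].copy(), data, ts)
print("most likely t far from data:", ts[np.argmax(post_far)])
print("most likely t on a datapoint:", ts[np.argmax(post_on)])
```

Far from the ring the posterior piles up at the largest noise level on the grid; on a datapoint it piles up at the smallest. The field itself never receives $t$.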

That blend is enough to generate. Below is the optimal blind field for a small five-point dataset, computed in closed form. Watch one thing: the field never changes. There is no clock feeding it. Press play and a ring of pure-noise particles rides this one frozen field inward and settles exactly onto the data, where the field vanishes, then stops.

Figure 1 · one static field generates
A single time-invariant field f*(u), with no noise level fed in anywhere. Press play or scrub: noise particles follow this one unchanging field onto the data points and stop. The data are stable equilibria, where the field is zero. Autonomous generation: one frozen field carries noise to data.

Now the puzzle sharpens. How can an average over noise levels be the right field anywhere? At a given $\mathbf{u}$, if the posterior $p(t\mid\mathbf{u})$ is spread across many levels, then (6) blends a high-noise field (points to the center) with a low-noise field (points to a specific mode), and the average of two contradictory arrows should be mush. The resolution is that the posterior is usually not spread at all.

Drag the probe below and watch $p(t\mid\mathbf{u})$ directly. Out in the void the curve is broad: the model is genuinely unsure of the noise level, and it rightly leans toward "a lot." Slide the probe onto a datapoint and the curve collapses to a spike at $t\to 0$, because an observation sitting on the data is indistinguishable from clean data.

Figure 2 · the posterior over the noise level
The posterior p(t | u) over the noise level, for a draggable probe. Far from the data it is broad and sits at large t; on a data point it collapses to a spike at t → 0. The most likely level t̂ is marked. The observation's position encodes the noise level, even with no clock fed in.

When the posterior is a spike at some $\hat{t}$, the average in (6) is just the single conditioned field at $\hat{t}$, so the blind field equals the model that knew the noise level all along. To make that concrete it helps to rewrite (6) in terms of the denoiser $D^*_t(\mathbf{u}) = \mathbb{E}[\mathbf{x}\mid\mathbf{u},t]$, the best guess of the clean data at level $t$:

$$f^*(\mathbf{u}) = \mathbb{E}_{t\mid\mathbf{u}}\!\left[\,\frac{d(t)}{b(t)}\,\mathbf{u} + \Big(c(t) - \frac{d(t)\,a(t)}{b(t)}\Big) D^*_t(\mathbf{u})\,\right] \tag{7}$$

Every parameterization is some affine mix of "where you are" ($\mathbf{u}$) and "where the clean data probably is" ($D^*_t$). So the blind model is Bayes-optimal averaging, and it matches the informed model exactly when the posterior over the noise level is sharp. The real question is no longer whether the average is reasonable; it is when the average is sharp. Geometry answers that.
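The coefficients in (7) follow from one substitution, assuming $b(t) > 0$: solve the forward process (2) for $\boldsymbol{\epsilon}$, plug it into the target, and take the conditional mean given $\mathbf{u}$ and $t$. The final line specializes to flow matching as a check.

```latex
\begin{aligned}
\boldsymbol{\epsilon} &= \frac{\mathbf{u} - a(t)\,\mathbf{x}}{b(t)} \\[4pt]
c(t)\,\mathbf{x} + d(t)\,\boldsymbol{\epsilon}
  &= \frac{d(t)}{b(t)}\,\mathbf{u} + \Big(c(t) - \frac{d(t)\,a(t)}{b(t)}\Big)\,\mathbf{x} \\[4pt]
f^*_t(\mathbf{u})
  &= \frac{d(t)}{b(t)}\,\mathbf{u} + \Big(c(t) - \frac{d(t)\,a(t)}{b(t)}\Big)\,D^*_t(\mathbf{u}) \\[4pt]
\text{flow matching } (a = 1-t,\; b = t,\; c = -1,\; d = 1):\quad
f^*_t(\mathbf{u}) &= \frac{\mathbf{u} - D^*_t(\mathbf{u})}{t}
\end{aligned}
```

Averaging the third line over $p(t\mid\mathbf{u})$ gives (7).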

Reading the clock off the geometry

In high dimensions, noise has a very predictable size, and that is what lets a blind model cope. Add Gaussian noise of level $\sigma$ to a point in $D$ dimensions and the noise vector's length is almost exactly $\sigma\sqrt{D}$, with vanishing relative wiggle. (This is concentration of measure: a high-dimensional Gaussian puts essentially all its mass on a thin spherical shell, not near its center.)

So suppose the data does not fill the space but sits on a thin $d$-dimensional manifold (a curved lower-dimensional sheet) inside a large $D$-dimensional ambient space. The part of a noisy observation that sticks out off the manifold is pure noise in the remaining $D-d$ directions, so its length is about $\sigma\sqrt{D-d}$. That length basically is the noise level. Measuring the distance $r$ from the observation to the manifold gives an estimate

$$\hat{\sigma} = \frac{r}{\sqrt{D-d}}, \qquad \text{spread shrinking like } \frac{1}{\sqrt{D-d}}.$$

The bigger the gap between the ambient dimension and the manifold's intrinsic dimension, the sharper the estimate. Two different noise levels live on two shells of different radius; in low dimensions the shells are fat and overlap, so the level is ambiguous, and in high dimensions they are razor-thin and disjoint, so the level is read off for free. Drag the dimension below and watch the two shells separate.
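A quick Monte Carlo check of that shrinking spread. This is a sketch under a simplifying assumption: the off-manifold part of the observation is modeled as isotropic Gaussian noise in $k = D-d$ dimensions, so $r$ is just the norm of that noise vector. The two noise levels 0.4 and 1.0 are arbitrary choices for illustration.

```python
import numpy as np

def sigma_hat_stats(sigma, k, n=4000, seed=0):
    """Mean and std of sigma_hat = r / sqrt(k), with r the norm of k-dim noise."""
    rng = np.random.default_rng(seed)
    r = np.linalg.norm(sigma * rng.normal(size=(n, k)), axis=1)
    est = r / np.sqrt(k)
    return est.mean(), est.std()

for k in (2, 8, 32, 128, 512):
    lo_mean, lo_std = sigma_hat_stats(0.4, k)
    hi_mean, hi_std = sigma_hat_stats(1.0, k)
    # Separation between the two levels, measured in pooled standard deviations.
    gap = (hi_mean - lo_mean) / (lo_std + hi_std)
    print(f"k={k:4d}  low: {lo_mean:.3f}±{lo_std:.3f}  high: {hi_mean:.3f}±{hi_std:.3f}  gap={gap:.1f}")
```

The standard deviation of the estimate falls roughly as $\sigma/\sqrt{2k}$, so the two levels go from overlapping at small $k$ to cleanly separated at large $k$.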

Figure 3 · the blessing of dimensionality
The estimated noise level σ̂ = r/√(D−d) for a low and a high true level, as the ambient dimension D grows. The shaded overlap is the ambiguity. At D = 8 the two levels already separate; by D = 128 they are disjoint, so the geometry pins the noise level and the posterior p(t | u) collapses. High dimensions make the blind average sharp.

The 1D plot above lays the bells on top of each other so the overlap shrinks visibly with $D$. The shells themselves are a 3D-shaped picture: two concentric spheres of points around the single data point, fatter for high noise, thinner for low. Figure 4 draws the drawable case, $D=3$ with a single data point at the origin ($d=0$), and lets you rotate in any direction to see both spheres at once. The real model lives in hundreds of dimensions where the same spheres are paper-thin and far apart, exactly the regime the 1D bells reach as you crank $D$ up.

Figure 4 · the two shells, in the drawable case
A rotatable $D=3$ picture of a single data point with two shells of noisy observations around it. The teal inner shell sits at radius $\sigma_{\text{lo}}\sqrt{D-d}\approx 0.69$, the amber outer shell at $\sigma_{\text{hi}}\sqrt{D-d}\approx 1.73$. Drag in any direction to rotate; even at $D=3$ the two shells are visibly distinct spheres. In real high dimensions the shells thin out and pull apart at exactly the rate the bells in Figure 3 separate. For a higher-dimensional data manifold the same shells become tubes ($d=1$, a line) or thicker sheets ($d=2$); the math is identical, only the rendering changes.

Two caveats. The geometry only makes the noise level recoverable from the observation; whether a trained network actually exploits the distance $r$ to recover it is an empirical fact, and the answer (from Sun et al.) is that it does. And there is a second, stronger reason the posterior is sharp, one that needs no high dimension at all. As an observation approaches the data manifold it becomes indistinguishable from clean data, so the smallest noise levels dominate and $p(t\mid\mathbf{u})$ concentrates at $t\to 0$ by sheer proximity. That near-manifold collapse is what keeps generation stable at the finish line, regardless of dimension.

So the clock is not gone. It is encoded in the picture, and the model reads it off. Wherever the posterior is sharp (which is almost everywhere) the blind field equals the informed field. But "equals a good field" is a statement about the target. Whether you can safely follow that field all the way down to the data is a separate question, and it has a sharp geometric obstruction.

An infinitely deep well

The paper's reframing is that a blind model is not chasing a moving target at all. It is descending a single fixed landscape: the marginal energy. Define it as the negative log-likelihood of a noisy datapoint at some unknown noise level,

$$E_{\text{marg}}(\mathbf{u}) = -\log p(\mathbf{u}), \qquad p(\mathbf{u}) = \int p(\mathbf{u}\mid t)\,p(t)\,dt \tag{1}$$

where the integral averages the noisy-data density over a prior on the noise level. Generation is then rolling downhill on this one static potential toward where noisy data is most plausible, which is the clean data itself. That is the deep reason a static field can generate at all: it is the gradient field of a static potential, so it needs no clock, just a slope.

The slope is the posterior-averaged score. To write it in terms of the denoiser we use Tweedie's formula, the empirical-Bayes identity (Robbins 1956, Efron 2011) that the conditional score points from your noisy point toward your best guess of the clean one:

$$\nabla_{\mathbf{u}} \log p(\mathbf{u}\mid t) = \frac{a(t)\,D^*_t(\mathbf{u}) - \mathbf{u}}{b(t)^2} \tag{10}$$

Averaging that over the posterior on $t$ gives the gradient of the marginal energy:

$$\nabla_{\mathbf{u}} E_{\text{marg}}(\mathbf{u}) = \mathbb{E}_{t\mid\mathbf{u}}\!\left[\,\frac{\mathbf{u} - a(t)\,D^*_t(\mathbf{u})}{b(t)^2}\,\right] \tag{11}$$
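The step from the conditional score (10) to the marginal gradient (11) is a short calculation. A sketch, assuming differentiation can be exchanged with the integral over $t$ in (1):

```latex
\begin{aligned}
\nabla_{\mathbf{u}} E_{\text{marg}}(\mathbf{u})
  &= -\frac{\nabla_{\mathbf{u}}\, p(\mathbf{u})}{p(\mathbf{u})}
   = -\frac{1}{p(\mathbf{u})}\int p(t)\,\nabla_{\mathbf{u}}\, p(\mathbf{u}\mid t)\,dt \\[4pt]
  &= -\int \frac{p(\mathbf{u}\mid t)\,p(t)}{p(\mathbf{u})}\;\nabla_{\mathbf{u}} \log p(\mathbf{u}\mid t)\,dt
   = -\,\mathbb{E}_{t\mid\mathbf{u}}\big[\nabla_{\mathbf{u}} \log p(\mathbf{u}\mid t)\big] \\[4pt]
  &= \mathbb{E}_{t\mid\mathbf{u}}\!\left[\frac{\mathbf{u} - a(t)\,D^*_t(\mathbf{u})}{b(t)^2}\right]
\end{aligned}
```

The second line uses $\nabla p = p\,\nabla \log p$ and recognizes $p(\mathbf{u}\mid t)\,p(t)/p(\mathbf{u})$ as the posterior $p(t\mid\mathbf{u})$; the third substitutes Tweedie's formula (10).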

That gradient hides the problem. It carries a $1/b(t)^2$, and $b(t)\to 0$ as you near the data, so the marginal energy has an infinitely deep, infinitely steep well at every datapoint:

$$\lim_{\mathbf{u}\to\mathbf{x}_k} \lVert \nabla_{\mathbf{u}} E_{\text{marg}}(\mathbf{u}) \rVert = \infty \tag{12}$$

A neural network outputs finite vectors. A finite field cannot equal an infinite gradient. So either a blind model cannot represent the landscape near the data, exactly where generation finishes and matters most, or something is rescuing it. Picture the landscape first: lift the data plane and the marginal energy becomes a 3D surface with pits at every data point. Toggle between the energy itself, the gradient magnitude, and the preconditioned field, and rotate to see how each surface treats the data.

Figure 5 · the landscape, in three views
The same three data points laid out on a 2D plane, with the field drawn as a rotatable landscape. The energy view shows pits diving to negative infinity at every point (the floor is clipped). The gradient view shows tall spikes diverging upward at every point: the singularity the paper formalizes in equation (12). The preconditioned view shows the smooth bounded bowl the gain leaves behind, with the data sitting at the floor as a stable equilibrium. Drag to rotate in any direction. The qualitative shape is the paper's claim; the constants are chosen for legibility.

A 1D slice through the same picture pins down the math, with the three surfaces plotted as three curves along a line through the data. Drag the probe and watch the gain vanish at the same rate the gradient blows up: the red curve runs off the top of the chart while the bold teal product, the field the model actually follows, stays finite and crosses zero at every datum.

Figure 6 · the singularity, and the gain that cancels it
A one-dimensional slice through the three data points. The faint amber curve is the marginal energy, plunging into a deep well at each. The red curve is the raw gradient ‖∇E‖, which diverges at the data. The bold teal field f* stays bounded and crosses zero at each data point: a stable resting place inside an infinite well. Drag the probe and watch the gain vanish at the same rate the gradient blows up.

The rescue is a cancellation. The blind field decomposes into three pieces: a scaled copy of the energy gradient (the natural-gradient term), a transport correction, and a linear drift.

$$f^*(\mathbf{u}) = \underbrace{\overline{\lambda}(\mathbf{u})\,\nabla E_{\text{marg}}(\mathbf{u})}_{\text{natural gradient}} + \underbrace{\mathbb{E}_{t\mid\mathbf{u}}\big[(\lambda(t)-\overline{\lambda})(\nabla E_t - \nabla E_{\text{marg}})\big]}_{\text{transport correction}} + \underbrace{\overline{c}_{\text{scale}}(\mathbf{u})\,\mathbf{u}}_{\text{drift}} \tag{14}$$

Read the three terms in order. The first is the energy gradient scaled by an effective gain $\overline{\lambda}(\mathbf{u}) = \mathbb{E}_{t\mid\mathbf{u}}[\lambda(t)]$, where $\lambda(t) = \tfrac{b}{a}(da - cb)$. The second, the transport correction, measures how far the per-noise-level gradients sit from their own average; it is nonzero only when the posterior is spread across several levels, and it dies once the posterior concentrates on one. The third is a linear drift the noise schedule contributes. The gain is what defuses the singularity. Near the data $b\to 0$, so the $1/b^2$ in the gradient sends it to infinity, while $\lambda$ carries factors of $b$ that vanish at the matching rate, leaving the product $\overline{\lambda}\,\nabla E_{\text{marg}}$ finite. It is like walking down an infinitely steep cliff while shortening your stride to zero at the matching rate, so every step stays finite. The paper calls this a Riemannian gradient flow: the gain acts as a local metric, a position-dependent rescaling of distance, that turns the singular landscape into a smooth, finite-speed descent with a stable resting point at the data.[1] Because the transport correction dies wherever the posterior concentrates (high dimension, or near the manifold), what is left is a clean preconditioned gradient flow.
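At a single noise level the decomposition is plain algebra, which makes it easy to check numerically. The sketch below assumes the same simplified schedules as elsewhere in this explainer (DDPM with $b=t$, $a=\sqrt{1-t^2}$; EDM with $\sigma=t$) and uses a per-level drift coefficient $c/a$, which falls out of the algebra here rather than being quoted from the paper. Function names are hypothetical.

```python
import numpy as np

def coeffs(name, t):
    """(a, b, c, d) for three illustrative parameterizations."""
    if name == "ddpm":
        return np.sqrt(1.0 - t**2), t, 0.0, 1.0
    if name == "edm":
        return 1.0, t, 1.0, 0.0
    if name == "flow":
        return 1.0 - t, t, -1.0, 1.0
    raise ValueError(name)

def gain(name, t):
    """lambda(t) = (b / a) * (d * a - c * b)."""
    a, b, c, d = coeffs(name, t)
    return (b / a) * (d * a - c * b)

def identity_residual(name, t, u, denoised):
    """Max abs difference between f_t and lambda * grad E_t + (c / a) * u."""
    a, b, c, d = coeffs(name, t)
    f_t = (d / b) * u + (c - d * a / b) * denoised   # one level of equation (7)
    grad_E_t = (u - a * denoised) / b**2             # negative of equation (10)
    rebuilt = gain(name, t) * grad_E_t + (c / a) * u
    return float(np.max(np.abs(f_t - rebuilt)))

rng = np.random.default_rng(0)
u, denoised = rng.normal(size=3), rng.normal(size=3)
for name in ("ddpm", "edm", "flow"):
    print(name, "residual:", identity_residual(name, 0.3, u, denoised),
          " gain at t=1e-3:", gain(name, 1e-3))
```

Under these schedules the gain shrinks like $b$ for noise and velocity prediction and like $b^2$ (with a negative sign) for signal prediction, which is the vanishing factor that offsets the diverging gradient.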

See it at every point on the plane. The figure samples a grid of locations and, at each, draws the three terms of equation (14) head to tail: a teal natural-gradient arrow, then an amber transport-correction arrow, then a small drift arrow, with a dashed line showing the total. The data points sit inside amber rings; the chains shrink to zero as you approach them. Slide the posterior sharpness and watch the amber transport correction die away as the posterior concentrates; what is left is the teal natural-gradient flow plus the drift.

Figure 7 · the field, decomposed
Equation (14) on a 2D plane. At every grid point the three components of f*(u) are drawn head to tail: natural gradient (bounded, vanishes at data), transport correction (perpendicular, vanishes with posterior concentration and near data), and a small drift (linear in u). The dashed line is their sum, the field f* the model actually follows. Toggle to isolate one term or watch the sum directly. Slide the posterior sharpness from broad to concentrated and watch the transport correction die away. Honest schematic: the qualitative shape of each term is what the paper proves; the toy lets you see them combine.

So the infinite well is a mirage from the model's point of view. The field it learns is bounded, with the data as a stable attractor, which is what you saw the particles settle into in Figure 1. The target is fine. But following a fine target with a real numerical sampler is where blind models actually live or die.

Which parameterization survives

Keep the two questions apart, because conflating them is the easiest mistake here. The last section showed the learned target is bounded. This section is about the sampling dynamics: you do not get the field for free, you integrate an ODE with it, and a bounded target divided by a vanishing noise scale can still produce a stiff, explosive equation. The sampler integrates

$$\frac{d\mathbf{u}}{dt} = \mu(t)\,\mathbf{u} + \nu(t)\,f^*(\mathbf{u}) \tag{19}$$

where $\mu(t)$ is the schedule's drift and $\nu(t)$ is the effective gain of the parameterization. Note the subtlety the paper is careful about: even though the network $f^*$ has no clock, the sampler does, because $\mu$ and $\nu$ depend on $t$. The architecture is autonomous; the integration schedule around it is not.
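To make the split between a clockless field and a clocked sampler concrete, here is a toy Euler integrator for equation (19). Everything specific is an assumption of this sketch: the velocity parameterization ($\mu = 0$, $\nu = 1$), a closed-form blind field for five random datapoints in 16 dimensions, a uniform prior over a grid of noise levels, and fixed step sizes. It is not the paper's experimental setup.

```python
import numpy as np

def blind_velocity(U, data, grid):
    """Closed-form blind flow-matching field for a batch U of shape (N, D)."""
    a, b = 1.0 - grid, grid
    D = U.shape[1]
    diff = U[:, None, None, :] - a[None, :, None, None] * data[None, None, :, :]
    sq = (diff ** 2).sum(-1)                                   # (N, T, K)
    log_lik = -sq / (2.0 * b[None, :, None] ** 2) - D * np.log(b)[None, :, None]
    w = np.exp(log_lik - log_lik.max(axis=2, keepdims=True))
    w /= w.sum(axis=2, keepdims=True)
    denoised = w @ data                                        # (N, T, D)
    f_t = (U[:, None, :] - denoised) / b[None, :, None]        # per-level velocity
    m = log_lik.max(axis=2)
    log_pt = m + np.log(np.exp(log_lik - m[..., None]).sum(axis=2))
    post = np.exp(log_pt - log_pt.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)                    # p(t | u) per sample
    return (post[..., None] * f_t).sum(axis=1)

def euler_sample(U, data, grid, mu=lambda t: 0.0, nu=lambda t: 1.0):
    """Integrate equation (19) from t = grid[-1] down to grid[0]."""
    ts = grid[::-1]
    for t_hi, t_lo in zip(ts[:-1], ts[1:]):
        # Only mu and nu see the clock; the field sees U alone.
        drift = mu(t_hi) * U + nu(t_hi) * blind_velocity(U, data, grid)
        U = U + (t_lo - t_hi) * drift                          # negative step in t
    return U

def nearest_dist(U, data):
    return np.linalg.norm(U[:, None, :] - data[None, :, :], axis=-1).min(axis=1)

rng = np.random.default_rng(0)
data = rng.normal(size=(5, 16))
grid = np.linspace(0.02, 1.0, 50)
U0 = rng.normal(size=(64, 16))
U1 = euler_sample(U0, data, grid)
print("mean distance to nearest datapoint:",
      nearest_dist(U0, data).mean(), "->", nearest_dist(U1, data).mean())
```

Passing different `mu` and `nu` callables is where a noise- or signal-prediction sampler would differ; the field function stays untouched.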

To judge stability, compare the blind sampler against an oracle that knows the true $t$ at every step. Subtracting the two cancels the shared drift term and leaves the error introduced purely by being blind:

$$\lVert \Delta\mathbf{v}(\mathbf{u},t) \rVert = \underbrace{|\nu(t)|}_{\text{gain}} \cdot \underbrace{\lVert f^*(\mathbf{u}) - f^*_t(\mathbf{u}) \rVert}_{\text{estimation error}} \tag{22}$$

Stability is a race as $t\to 0$. The estimation error falls (the posterior over the noise level concentrates, from the last section), but the gain $\nu(t)$ may diverge. Whoever wins decides everything, and the three standard targets land in three different places.

Noise prediction (DDPM, DDIM) has gain $\nu\sim 1/b$, which diverges. Worse, its estimation error does not fall to zero. It floors at a positive "Jensen gap": because $1/b$ is convex, the posterior-average of $1/b$ exceeds $1/(\text{average } b)$, and that gap stays nonzero unless the posterior is a perfect spike. A diverging gain times a floored error goes to infinity. Blind noise prediction blows up.
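The Jensen gap is easy to see on a toy posterior. The sketch below is only an illustration of the inequality $\mathbb{E}[1/b] \ge 1/\mathbb{E}[b]$; the two-point posterior and the function name are made up for this example.

```python
import numpy as np

def jensen_gap(b_values, weights):
    """E[1/b] - 1/E[b] under a discrete posterior over the noise scale b."""
    b = np.asarray(b_values, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(w @ (1.0 / b) - 1.0 / (w @ b))

# Posterior split evenly between b = 0.1 and b = 0.3:
# E[1/b] = (10 + 10/3) / 2 = 20/3 and 1/E[b] = 5, so the gap is 5/3.
print(jensen_gap([0.1, 0.3], [0.5, 0.5]))

# A perfect spike has no gap.
print(jensen_gap([0.1, 0.3], [1.0, 0.0]))
```

Any spread in the posterior leaves a strictly positive gap, and that residual is what the $1/b$ gain then amplifies.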

Signal prediction (EDM) has an even worse gain, $\nu\sim 1/b^2$. But near a discrete data manifold the denoising error decays exponentially, like $e^{-C/b^2}$. Near a data point the denoiser is a softmax over the data points weighted by a Gaussian in distance, so once you sit close to one point every other point is suppressed by a Gaussian factor in its squared distance, and the error to the nearest point dies that fast. Exponential decay beats any polynomial blow-up, so the product goes to zero: a stronger gain singularity, and still stable, because the error crashes faster than the gain climbs.
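A two-point example shows the exponential winning. This sketch assumes a one-dimensional dataset with two points a distance `gap` apart, an EDM-style schedule ($a = 1$), and a probe sitting one noise scale $b$ away from the nearer point; it measures only the denoiser's pull toward the wrong point, multiplied by the $1/b^2$ gain.

```python
import numpy as np

def amplified_denoising_error(b, gap=1.0):
    """(1 / b^2) * |D(u) - x0| for data {x0, x0 + gap}, probe at u = x0 + b."""
    # Log-odds of the near point over the far one:
    # ((gap - b)^2 - b^2) / (2 b^2) = (gap^2 - 2 * gap * b) / (2 b^2)
    z = (gap**2 - 2.0 * gap * b) / (2.0 * b**2)
    w_far = np.exp(-np.logaddexp(0.0, z))   # softmax weight on the far point
    return w_far * gap / b**2

for b in (0.5, 0.3, 0.2, 0.1, 0.05):
    print(f"b={b:4.2f}  gain=1/b^2={1 / b**2:7.1f}  gain*error={amplified_denoising_error(b):.3e}")
```

The gain column climbs while the product collapses, because the softmax weight on the far point falls like $e^{-C/b^2}$.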

Velocity prediction (flow matching) has gain $\nu = 1$, flat. There is nothing to amplify, so the error stays bounded and the dynamics are inherently stable. Toggle the target below and watch the drift error $\Delta\mathbf{v}$ as the noise vanishes: only noise prediction runs to infinity.

Figure 8 · the stability race
The drift error Δv = gain × estimation error as t → 0, for each target. Faint curves show the gain rising and the error falling; the bold curve is their product. Noise prediction diverges (unstable). Signal and velocity stay bounded. The same blindness, opposite fates, decided by the single coefficient ν(t).

That is the title's claim, made precise. You can drop the noise-level input, but only if your parameterization has a bounded gain (velocity) or a self-correcting one (signal). Velocity-based models are inherently safe; noise-prediction models are structurally broken when blind. The difference is one coefficient, $\nu(t)$: bounded for velocity, divergent for noise.

What this actually says

On the CIFAR-10 numbers Sun et al. report, run blind, a noise-prediction DDIM collapses to FID 40.90, while blind flow matching and uEDM (both velocity) reach 2.61 and 2.23. FID is a distance between the real and generated image distributions where lower is better, and on CIFAR-10 anything around 2 to 3 is near the best published, while 40 means the samples are visibly broken. That sixteen- to eighteen-fold gap is not a tuning artifact. It is the $1/b$ gain singularity turning ordinary estimation noise into garbage. The paper's own runs on CIFAR-10, SVHN, and Fashion-MNIST show the same pattern by eye: blind noise-prediction images are dominated by high-frequency artifacts, while blind velocity images are sharp and match the conditioned models.

The dimensionality experiment makes the geometry visible. Take a 2D ring of data, embed it in a $D$-dimensional space, and sweep $D$. At $D=2$ both blind models struggle: the shells overlap, the posterior is ambiguous, and the samples are diffuse. At $D=8$ and $32$ velocity is already clean while noise prediction is scattered. At $D=128$ the concentration is so sharp that even the unstable noise-prediction model converges, because its estimation error finally crashes faster than its gain diverges. The blessing of dimensionality, start to finish.

So do diffusion models need noise conditioning? The short answer is no. The precise answer is that a velocity- or signal-based model can drop the explicit noise input, because the geometry supplies the level and the dynamics stay bounded, while a noise-prediction model effectively still needs it, because run blind it is structurally unstable. The clean argument leans on data living on a low-dimensional manifold inside a high-dimensional space, and the exponential-stability case for signal prediction assumes a discrete manifold, so the result is sharpest exactly where real image data tends to live.

Step back and the whole thing is four facts. A blind model is the posterior-weighted average of the models that knew the noise level. High-dimensional geometry makes that posterior sharp, so the average is accurate. The landscape it descends has an infinite singularity at the data, but a vanishing gain preconditions it into a stable attractor. And whether the sampler survives the descent comes down to one gain coefficient. Put together, the clock that every diffusion model carries was, for the stable parameterizations, information the geometry already held. The model was never really using the clock. It was reading the time off the noise.

Provenance: verified against primary literature
Tweedie / Robbins / Efron: A denoiser is the conditional score (Robbins 1956, Efron 2011, Vincent 2011); the signal-scaling a(t) factor is essential.
Sun et al. (2025): The empirical result that blind models generate; the CIFAR-10 FID figures are their benchmark, which this paper explains.
EDM / Flow Matching / EqM: The unified affine schedule and the (a,b,c,d) targets for each model (Karras 2022; Lipman 2022; Wang & Du).
Concentration of measure: σ̂ = r/√(D−d) is the asymptotically unbiased noise-level estimate; its spread shrinks like 1/√(D−d) (Vershynin; Amari for natural gradient).
Correction: The title overstates the result. Noise conditioning is dispensable only for stable (velocity- or signal-based) parameterizations; a noise-prediction model trained blind is structurally unstable. The clean argument also assumes data on a low-dimensional manifold in a high-dimensional space, and the exponential-stability case for signal prediction assumes a discrete manifold.

Questions you might still have

If the network never sees the noise level, how does it know how much to denoise?
In high dimensions the noise level is written into the geometry of the observation: the distance from the point to the data manifold pins it down (σ̂ = r/√(D−d)), so the posterior over the noise level collapses to a near-certain estimate. The model reads the clock off the picture.

Then why does a blind DDPM fail while blind flow matching works?
Stability is a race as the noise vanishes. Noise prediction has a gain that grows like 1/b and amplifies the residual uncertainty (a nonzero "Jensen gap") into a blow-up; velocity prediction has a gain of exactly 1, so nothing amplifies the error. Same blindness, opposite fate.

Does this mean noise conditioning is useless?
No. The honest claim is narrower: a velocity- or signal-based model can drop the explicit noise input because the geometry supplies it and the dynamics stay bounded. A noise-prediction model cannot. And the clean theory leans on data sitting on a low-dimensional manifold inside a high-dimensional space.

What is the marginal energy, exactly?
It is −log p(u), where p(u) averages the noisy-data densities over every noise level. It is the single static landscape a blind model descends, and it has an infinitely deep well at every data point.

Footnotes & further reading

  1. The gain acts as a conformal metric, a scalar function times the identity that locally rescales distance. One bookkeeping subtlety: if the update is written as gain times gradient, the implied metric is the inverse gain, so "the gain is the metric" is right in spirit but worth pinning down in the algebra.
  2. The paper: Sahraee-Ardakan, Delbracio, Milanfar, The Geometry of Noise: Why Diffusion Models Don't Need Noise Conditioning (Google, 2026).
  3. The empirical result this explains: Sun, Jiang, Zhao, He, Is Noise Conditioning Necessary for Denoising Generative Models? (2025), which also supplies the unified affine schedule and the CIFAR-10 benchmark.
  4. The diffusion design space and the EDM conventions: Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models. Flow Matching: Lipman et al., Flow Matching for Generative Modeling.
  5. Tweedie's formula and the denoiser-score link: Robbins (1956), Efron (2011), and Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011).
  6. The concurrent, kindred analysis of blind denoising in high dimensions: Kadkhodaie et al., Blind Denoising Diffusion Models and the Blessings of Dimensionality. Natural gradient: Amari (1998).