The Geometry of Noise: Why Diffusion Models Don't Need Noise Conditioning
A diffusion model doesn't need to be told how noisy its input is.
Standard diffusion models are handed the noise level at every step. Drop that input and the best ones keep working, because the geometry of high dimensions hands the noise level back. The catch is which kind of model you drop it from.
What if the noise-level input that every diffusion model is built around turns out to be redundant?
Every diffusion or flow model is built around a clock. You show the model a noisy version of the data and you also tell it how much noise is on the input: a single number, the noise level, usually written $t$ or $\sigma$. The model uses that number to decide how hard to clean. The whole architecture carries the noise level around, injecting it into every layer, because the folklore is that the model genuinely needs it. The right move at high noise (guess the rough shape of an image) is nothing like the right move at low noise (sharpen the last few details), and how would one network do both without being told where on the schedule it stands?
Recent work showed you can rip the clock out. Sun and collaborators (Is noise conditioning necessary?) trained a single network that sees only the noisy image, never the noise level, and it still generates clean samples. Equilibrium Matching trains the field to keep the data as its fixed points, with no time index at all, so it is noise-blind by design. The field calls this the autonomous (or noise-blind) setting: one static vector field $f(u)$, the same function of the input $u$ no matter what noise produced it. It is a real puzzle rather than a free lunch. A single static rulebook somehow has to serve pure noise and nearly-clean data both, and it has to stay stable right at the data, where, as we will see, the math wants to blow up.
This paper, from Sahraee-Ardakan, Delbracio, and Milanfar at Google, is the theory of why blind models work. It answers two questions. How does a model know the noise level it was never told? And why do some blind models generate beautifully while others collapse into static? The answer to the first is geometry; the answer to the second is a stability race. The argument is a tower: what the conditioned model is, what happens when you drop the clock, how high-dimensional geometry leaks the noise level back, what landscape the blind model is really descending, and why one parameterization survives the descent and another shatters.
Conditioning on the clock
The forward process is the easy half of any diffusion model: it takes a clean datapoint $x$ and a noise level indexed by $t$, and mixes them into a noisy observation $u$.
$$u = a(t)\,x + b(t)\,\varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$
The two schedule functions do the bookkeeping: $a(t)$ scales the signal and $b(t)$ scales the noise. Near $t = 0$ you sit on clean data; crank $t$ up and the noise term swamps everything until the observation is featureless. Different model families pick different $(a, b)$. DDPM shrinks the signal as it adds noise so the total variance stays pinned ($a^2 + b^2 = 1$); EDM leaves the signal alone and lets the noise grow; flow matching slides linearly from data to noise with $a = 1 - t$, $b = t$. The ratio that summarizes how much signal survives is the signal-to-noise ratio:
$$\mathrm{SNR}(t) = \frac{a(t)^2}{b(t)^2}$$
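To make the bookkeeping concrete, here is a minimal sketch of the three schedule families and the signal-to-noise ratio, writing the observation as `a * x + b * eps`. The function names are hypothetical, and the cosine form used for the variance-preserving case is an assumption chosen only because it satisfies `a**2 + b**2 == 1`; it is not the discrete schedule of the original DDPM.

```python
import math

# Hypothetical helper names. Each schedule returns (a, b): signal scale, noise scale.
def ddpm_schedule(t):
    """Variance-preserving: a^2 + b^2 = 1 (cosine form assumed for illustration)."""
    return math.cos(0.5 * math.pi * t), math.sin(0.5 * math.pi * t)

def edm_schedule(t):
    """Signal left alone, noise scale grows with t."""
    return 1.0, t

def flow_schedule(t):
    """Linear slide from data (t = 0) to noise (t = 1)."""
    return 1.0 - t, t

def forward(x, eps, t, schedule):
    """Noisy observation u = a(t) x + b(t) eps, elementwise on plain lists."""
    a, b = schedule(t)
    return [a * xi + b * ei for xi, ei in zip(x, eps)]

def snr(t, schedule):
    """Signal-to-noise ratio a(t)^2 / b(t)^2."""
    a, b = schedule(t)
    return (a / b) ** 2

for name, sched in [("ddpm", ddpm_schedule), ("edm", edm_schedule), ("flow", flow_schedule)]:
    # SNR falls as t grows, whichever family is used.
    print(name, [round(snr(t, sched), 3) for t in (0.1, 0.5, 0.9)])
```

All three families agree on the qualitative picture: the ratio is large near clean data and falls toward zero as the noise takes over.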
Training is a plain regression. You show the model a noisy $u$ and ask it to predict a linear target $c(t)\,x + d(t)\,\varepsilon$, scoring squared error:
$$\mathcal{L}(f) = \mathbb{E}_{x,\,\varepsilon,\,t}\,\big\| f(u, t) - \big(c(t)\,x + d(t)\,\varepsilon\big) \big\|^2$$
Squared error has one minimizer, and you already know it: the average of the thing you are predicting, conditioned on everything you can see. So the best conditioned model is the conditional mean of the target, given the observation and the noise level.
$$f^*(u, t) = \mathbb{E}\big[\,c(t)\,x + d(t)\,\varepsilon \,\big|\, u,\, t\,\big]$$
This is a time-dependent vector field: a different arrow at every (position, noise level) pair. The four coefficients $(a, b, c, d)$ are the only difference between the famous models. DDPM predicts the noise ($c = 0$, $d = 1$), EDM predicts the clean signal ($c = 1$, $d = 0$), flow matching predicts a velocity ($c = -1$, $d = 1$). The input $t$ is the switch that lets one network be a different function at each noise level.
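As a small illustration of the regression targets, the sketch below builds the target `c * x + d * eps` for the three families and checks the least-squares fact the argument rests on: the average of the targets beats any nudged guess. The `(c, d)` pairs are this sketch's reading of "predict the noise", "predict the clean signal" and "predict a velocity"; the velocity is taken as `eps - x`, which assumes the flow-matching schedule `a = 1 - t`, `b = t`. The sample values are made up.

```python
# (c, d) pairs for target = c * x + d * eps (assumed reading of the three families).
TARGETS = {
    "noise":    (0.0, 1.0),    # DDPM-style: predict eps
    "signal":   (1.0, 0.0),    # EDM-style: predict x
    "velocity": (-1.0, 1.0),   # flow matching: predict eps - x
}

def target(x, eps, kind):
    """Linear regression target for one (x, eps) pair."""
    c, d = TARGETS[kind]
    return [c * xi + d * ei for xi, ei in zip(x, eps)]

def squared_error(pred, x, eps, kind):
    """Squared distance between a prediction and the target."""
    r = target(x, eps, kind)
    return sum((p - ri) ** 2 for p, ri in zip(pred, r))

# Made-up (x, eps) pairs standing in for everything consistent with one observation.
samples = [([1.0], [0.5]), ([1.0], [-0.5]), ([-1.0], [0.2])]

def avg_loss(pred, kind="velocity"):
    return sum(squared_error(pred, x, e, kind) for x, e in samples) / len(samples)

# The mean of the targets is the least-squares prediction.
targets = [target(x, e, "velocity") for x, e in samples]
best = [sum(r[0] for r in targets) / len(targets)]
print(avg_loss(best), avg_loss([best[0] + 0.1]), avg_loss([best[0] - 0.1]))
```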
And you can feel why a switch seems necessary. At high noise the field should haul any point toward the blurry average of all the data, since that is genuinely the best guess when you can barely see. At low noise it should pin a point to the nearest sharp datapoint. Those are opposite fields, and $t$ is what tells the network which one to be. Drop $t$ and you seem to lose the switch. The rest of the paper is about why you do not.
Dropping the clock
Take the noise level away. Now the network sees only $u$ and must output a single vector $f(u)$, the same function no matter what noise produced the input. The optimal such field is not mysterious. It is still the least-squares answer, but you can no longer condition on $t$ because you do not have it, so the best you can do is average over your uncertainty about it:
$$f^*(u) = \mathbb{E}_{t \mid u}\big[f^*(u, t)\big] = \int f^*(u, t)\, p(t \mid u)\, dt \tag{6}$$
The weight in that average is the posterior over the noise level, $p(t \mid u)$: your belief about which $t$ produced this observation, by Bayes' rule from the likelihood of seeing $u$ at each level. So the optimal blind field is a posterior-weighted blend of all the conditioned fields. This is just the law of iterated expectations, the same move that says your best blind guess is the average of your informed guesses, each weighted by how likely its scenario is.
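For a finite dataset the posterior-weighted blend can be computed in closed form. The sketch below is one way to do it under stated assumptions: a toy three-point dataset in 2D, the flow-matching schedule `a = 1 - t`, `b = t`, a uniform prior over a grid of noise levels, and the velocity target `eps - x`. All names are hypothetical, and none of this is taken from the paper's code.

```python
import math

DATA = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0)]   # toy 2D dataset (assumption)
T_GRID = [0.02 * k for k in range(1, 50)]       # noise levels 0.02..0.98, uniform prior

def log_lik_terms(u, t):
    """log N(u; a x_i, b^2 I) for each datapoint, with a = 1 - t and b = t."""
    a, b = 1.0 - t, t
    dim = len(u)
    out = []
    for x in DATA:
        sq = sum((uj - a * xj) ** 2 for uj, xj in zip(u, x))
        out.append(-0.5 * sq / b**2 - dim * math.log(b) - 0.5 * dim * math.log(2 * math.pi))
    return out

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def noise_posterior(u):
    """p(t | u) on T_GRID by Bayes' rule; the uniform prior cancels in the normalization."""
    log_p = [logsumexp(log_lik_terms(u, t)) - math.log(len(DATA)) for t in T_GRID]
    z = logsumexp(log_p)
    return [math.exp(lp - z) for lp in log_p]

def conditioned_velocity(u, t):
    """E[eps - x | u, t] = (u - x_hat) / t, where x_hat is a softmax average of the data."""
    terms = log_lik_terms(u, t)
    z = logsumexp(terms)
    w = [math.exp(v - z) for v in terms]
    x_hat = [sum(wi * x[j] for wi, x in zip(w, DATA)) for j in range(len(u))]
    return [(uj - xh) / t for uj, xh in zip(u, x_hat)]

def blind_velocity(u):
    """Posterior-weighted blend of the conditioned fields: no noise level is passed in."""
    post = noise_posterior(u)
    fields = [conditioned_velocity(u, t) for t in T_GRID]
    return [sum(p * f[j] for p, f in zip(post, fields)) for j in range(len(u))]

near, far = noise_posterior((0.98, 0.0)), noise_posterior((3.0, 3.0))
print("most likely t near the data:", T_GRID[near.index(max(near))])
print("most likely t far from data:", T_GRID[far.index(max(far))])
print("blind velocity at (0.5, 0.5):", blind_velocity((0.5, 0.5)))
```

Close to a datapoint the posterior piles onto the smallest level on the grid; far from every datapoint it piles onto the largest. In both regimes the blend is dominated by a single conditioned field.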
That blend is enough to generate. Below is the optimal blind field for a small five-point dataset, computed in closed form. Watch one thing: the field never changes. There is no clock feeding it. Press play and a ring of pure-noise particles rides this one frozen field inward and settles exactly onto the data, where the field vanishes, then stops.
Now the puzzle sharpens. How can an average over noise levels be the right field anywhere? At a given $u$, if the posterior is spread across many levels, then (6) blends a high-noise field (points to the center) with a low-noise field (points to a specific mode), and the average of two contradictory arrows should be mush. The resolution is that the posterior is usually not spread at all.
Drag the probe below and watch $p(t \mid u)$ directly. Out in the void the curve is broad: the model is genuinely unsure of the noise level, and it rightly leans toward "a lot." Slide the probe onto a datapoint and the curve collapses to a spike at $t = 0$, because an observation sitting on the data is indistinguishable from clean data.
When the posterior is a spike at some $t^*$, the average in (6) is just the single conditioned field at $t^*$, so the blind field equals the model that knew the noise level all along. To make that concrete it helps to rewrite (6) in terms of the denoiser $\hat{x}_t(u) = \mathbb{E}[x \mid u, t]$, the best guess of the clean data at level $t$:
$$f^*(u) = \mathbb{E}_{t \mid u}\!\left[\frac{d(t)}{b(t)}\,u + \Big(c(t) - \frac{a(t)\,d(t)}{b(t)}\Big)\,\hat{x}_t(u)\right]$$
Every parameterization is some affine mix of "where you are" ($u$) and "where the clean data probably is" ($\hat{x}_t(u)$). So the blind model is Bayes-optimal averaging, and it matches the informed model exactly when the posterior over the noise level is sharp. The real question is no longer whether the average is reasonable; it is when the average is sharp. Geometry answers that.
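The affine-mix claim follows in two lines from the forward process. The derivation below is a sketch in assumed notation: $u = a\,x + b\,\varepsilon$ for the observation, $c\,x + d\,\varepsilon$ for the target, and $\hat{x}_t(u) = \mathbb{E}[x \mid u, t]$ for the denoiser. The symbols are chosen here for illustration and may not match the paper's.

```latex
\begin{aligned}
\varepsilon = \frac{u - a\,x}{b}
\quad &\Longrightarrow \quad
c\,x + d\,\varepsilon = \frac{d}{b}\,u + \Big(c - \frac{a\,d}{b}\Big)\,x, \\[4pt]
f^*(u, t) = \mathbb{E}\big[\,c\,x + d\,\varepsilon \mid u, t\,\big]
&= \frac{d}{b}\,u + \Big(c - \frac{a\,d}{b}\Big)\,\hat{x}_t(u). \\[8pt]
\text{noise } (c, d) = (0, 1): \quad & f^*(u, t) = \frac{u - a\,\hat{x}_t(u)}{b}, \\
\text{signal } (c, d) = (1, 0): \quad & f^*(u, t) = \hat{x}_t(u), \\
\text{velocity } (c, d) = (-1, 1),\ a = 1 - t,\ b = t: \quad & f^*(u, t) = \frac{u - \hat{x}_t(u)}{t}.
\end{aligned}
```

The first step is exact because $u$ is held fixed: once you condition on the observation, the only remaining unknown is $x$, and the expectation simply replaces it with the denoiser's output.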
Reading the clock off the geometry
In high dimensions, noise has a very predictable size, and that is what lets a blind model cope. Add Gaussian noise of level $\sigma$ to a point in $D$ dimensions and the noise vector's length is almost exactly $\sigma\sqrt{D}$, with vanishing relative wiggle. (This is concentration of measure: a high-dimensional Gaussian puts essentially all its mass on a thin spherical shell, not near its center.)
So suppose the data does not fill the space but sits on a thin $d$-dimensional manifold (a curved lower-dimensional sheet) inside a large $D$-dimensional ambient space. The part of a noisy observation that sticks out off the manifold is pure noise in the remaining $D - d$ directions, so its length is about $\sigma\sqrt{D - d}$. That length basically is the noise level. Measuring the distance $r$ from the observation to the manifold gives an estimate
$$\hat{\sigma} = \frac{r}{\sqrt{D - d}}$$
The bigger the gap between the ambient dimension and the manifold's intrinsic dimension, the sharper the estimate. Two different noise levels live on two shells of different radius; in low dimensions the shells are fat and overlap, so the level is ambiguous, and in high dimensions they are razor-thin and disjoint, so the level is read off for free. Drag the dimension below and watch the two shells separate.
The 1D plot above lays the bells on top of each other so the overlap shrinks visibly with $D$. The shells themselves are a 3D-shaped picture: two concentric spheres of points around the single data point, fatter for high noise, thinner for low. Figure 4 draws the drawable case, with a single data point at the origin ($d = 0$, $D = 3$), and lets you rotate in any direction to see both spheres at once. The real model lives in hundreds of dimensions where the same spheres are paper-thin and far apart, exactly the regime the 1D bells reach as you crank $D$ up.
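The shell-thinning effect is easy to reproduce numerically. The sketch below assumes the simplest possible manifold, a flat one occupying the first `d` coordinates, so the off-manifold residual is just the noise in the remaining `D - d` coordinates. It then estimates the noise level as the residual length divided by `sqrt(D - d)` and reports how much that estimate wobbles. Function names are hypothetical.

```python
import math
import random

def estimate_sigma(D, d, sigma, rng):
    """One noisy draw: residual length off a flat d-dim manifold, rescaled to a noise estimate."""
    # Only the D - d off-manifold coordinates matter for the distance to the manifold.
    residual_sq = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(D - d))
    r = math.sqrt(residual_sq)
    return r / math.sqrt(D - d)

def relative_spread(D, d, sigma, trials=200, seed=0):
    """Mean of the estimate and its standard deviation relative to that mean."""
    rng = random.Random(seed)
    est = [estimate_sigma(D, d, sigma, rng) for _ in range(trials)]
    mean = sum(est) / trials
    std = math.sqrt(sum((e - mean) ** 2 for e in est) / trials)
    return mean, std / mean

for D in (3, 12, 102, 1002):
    mean, spread = relative_spread(D, d=2, sigma=0.5)
    print(f"D={D:5d}  mean estimate={mean:.3f}  relative spread={spread:.3f}")
```

With one off-manifold direction the estimate is nearly useless; with a thousand it pins the noise level to within a few percent, which is the separation of shells described above.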
Two caveats. The geometry only makes the noise level recoverable from the observation; whether a trained network actually exploits the distance to recover it is an empirical fact, and the answer (from Sun et al.) is that it does. And there is a second, stronger reason the posterior is sharp, one that needs no high dimension at all. As an observation approaches the data manifold it becomes indistinguishable from clean data, so the smallest noise levels dominate and $p(t \mid u)$ concentrates at $t = 0$ by sheer proximity. That near-manifold collapse is what keeps generation stable at the finish line, regardless of dimension.
So the clock is not gone. It is encoded in the picture, and the model reads it off. Wherever the posterior is sharp (which is almost everywhere) the blind field equals the informed field. But "equals a good field" is a statement about the target. Whether you can safely follow that field all the way down to the data is a separate question, and it has a sharp geometric obstruction.
An infinitely deep well
The paper's reframing is that a blind model is not chasing a moving target at all. It is descending a single fixed landscape: the marginal energy. Define it as the negative log-likelihood of a noisy datapoint at some unknown noise level,
$$E_{\text{marg}}(u) = -\log p(u), \qquad p(u) = \int p(u \mid t)\, p(t)\, dt$$
where the integral averages the noisy-data density over a prior on the noise level. Generation is then rolling downhill on this one static potential toward where noisy data is most plausible, which is the clean data itself. That is the deep reason a static field can generate at all: it is the gradient field of a static potential, so it needs no clock, just a slope.
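For a discrete dataset the marginal energy can be evaluated directly, which makes the depth of the well at each datapoint visible. The sketch below is a toy under stated assumptions: three made-up points in 2D, the signal left unscaled (`a = 1`, `b = t`), and a uniform prior over a log-spaced grid of noise levels. Shrinking the smallest level on the grid deepens the well at a datapoint without limit.

```python
import math

DATA = [(-1.0, 0.0), (1.0, 0.0), (0.0, 1.5)]   # toy 2D dataset (assumption)

def log_gauss(u, mean, b):
    """log N(u; mean, b^2 I)."""
    sq = sum((ui - mi) ** 2 for ui, mi in zip(u, mean))
    k = len(u)
    return -0.5 * sq / b**2 - k * math.log(b) - 0.5 * k * math.log(2 * math.pi)

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def level_grid(t_min, n=40, t_max=2.0):
    """Log-spaced noise levels from t_min to t_max."""
    return [t_min * (t_max / t_min) ** (k / (n - 1)) for k in range(n)]

def marginal_energy(u, levels):
    """-log p(u) with p(u) = average over levels and datapoints of N(u; x_i, t^2 I)."""
    terms = [log_gauss(u, x, t) for t in levels for x in DATA]
    log_p = logsumexp(terms) - math.log(len(levels) * len(DATA))
    return -log_p

for t_min in (0.1, 0.01, 0.001):
    grid = level_grid(t_min)
    print(t_min, "energy at a datapoint:", round(marginal_energy((1.0, 0.0), grid), 3),
          " energy away from data:", round(marginal_energy((0.0, -2.0), grid), 3))
```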
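The slope is the posterior-averaged score. To write it in terms of the denoiser $\hat{x}_t(u) = \mathbb{E}[x \mid u, t]$ we use Tweedie's formula, the empirical-Bayes identity (Robbins 1956, Efron 2011) that the conditional score points from your noisy point toward your best guess of the clean one:
$$\nabla_u \log p(u \mid t) = \frac{a(t)\,\hat{x}_t(u) - u}{b(t)^2}$$
Averaging that over the posterior on $t$ gives the gradient of the marginal energy:
$$\nabla E_{\text{marg}}(u) = \mathbb{E}_{t \mid u}\!\left[\frac{u - a(t)\,\hat{x}_t(u)}{b(t)^2}\right]$$
That gradient hides the problem. It carries a $1/b(t)^2$, and $b(t) \to 0$ as you near the data, so the marginal energy has an infinitely deep, infinitely steep well at every datapoint:
$$E_{\text{marg}}(u) \to -\infty, \qquad \big\|\nabla E_{\text{marg}}(u)\big\| \to \infty \qquad \text{as } u \text{ approaches any datapoint } x_i$$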
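Tweedie's identity can be checked numerically on a toy problem. The sketch below assumes a three-point 1D dataset and an observation `u = a * x + b * eps`. It compares the score from the denoiser, `(a * x_hat - u) / b**2`, against a finite-difference derivative of the log density. The dataset and function names are made up for illustration.

```python
import math

DATA = [-1.0, 0.5, 2.0]   # toy 1D dataset (assumption)

def log_p(u, a, b):
    """log of the mixture density p(u | t): the mean over datapoints of N(u; a x_i, b^2)."""
    logs = [-0.5 * ((u - a * x) / b) ** 2 - math.log(b) - 0.5 * math.log(2 * math.pi)
            for x in DATA]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs)) - math.log(len(DATA))

def denoiser(u, a, b):
    """Posterior mean E[x | u, t]: a softmax-weighted average of the datapoints."""
    logs = [-0.5 * ((u - a * x) / b) ** 2 for x in DATA]
    m = max(logs)
    w = [math.exp(l - m) for l in logs]
    return sum(wi * x for wi, x in zip(w, DATA)) / sum(w)

def score_tweedie(u, a, b):
    """Score written through the denoiser: (a * x_hat - u) / b^2."""
    return (a * denoiser(u, a, b) - u) / b**2

def score_numeric(u, a, b, h=1e-5):
    """Central finite difference of log p, as an independent check."""
    return (log_p(u + h, a, b) - log_p(u - h, a, b)) / (2 * h)

for t in (0.2, 0.5, 0.8):
    a, b = 1.0 - t, t          # flow-matching schedule, assumed for the check
    print(t, score_tweedie(0.1, a, b), score_numeric(0.1, a, b))
```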
A neural network outputs finite vectors. A finite field cannot equal an infinite gradient. So either a blind model cannot represent the landscape near the data, exactly where generation finishes and matters most, or something is rescuing it. Picture the landscape first: lift the data plane and the marginal energy becomes a 3D surface with pits at every data point. Toggle between the energy itself, the gradient magnitude, and the preconditioned field, and rotate to see how each surface treats the data.
A 1D slice through the same picture pins down the math, with the three surfaces plotted as three curves along a line through the data. Drag the probe and watch the gain vanish at the same rate the gradient blows up: the red curve runs off the top of the chart while the bold teal product, the field the model actually follows, stays finite and crosses zero at every datum.
The rescue is a cancellation. The blind field decomposes into three pieces: a scaled copy of the energy gradient (the natural-gradient term), a transport correction, and a linear drift.
Read the three terms in order. The first is the energy gradient scaled by an effective gain. The second, the transport correction, measures how far the per-noise-level gradients sit from their own average; it is nonzero only when the posterior is spread across several levels, and it dies once the posterior concentrates on one. The third is a linear drift the noise schedule contributes. The gain is what defuses the singularity. Near the data $b \to 0$, so the $1/b^2$ in the gradient sends it to infinity, while the gain carries factors of $b$ that vanish at the matching rate, leaving the product finite. It is like walking down an infinitely steep cliff while shortening your stride to zero at the matching rate, so every step stays finite. The paper calls this a Riemannian gradient flow: the gain acts as a local metric, a position-dependent rescaling of distance, that turns the singular landscape into a smooth, finite-speed descent with a stable resting point at the data.[1] Because the transport correction dies wherever the posterior concentrates (high dimension, or near the manifold), what is left is a clean preconditioned gradient flow.
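A decomposition of this shape can be recovered from the definitions alone. The sketch below works in assumed notation: $u = a\,x + b\,\varepsilon$, target $c\,x + d\,\varepsilon$, per-level energy $E_t(u) = -\log p(u \mid t)$, and a per-level gain called $\lambda(t)$ here. The symbol names and the exact grouping are assumptions of this sketch and may differ from the paper's equation (14).

```latex
\begin{aligned}
\hat{x}_t(u) &= \frac{u - b^2\,\nabla E_t(u)}{a}
  \qquad \text{(Tweedie, with } E_t(u) := -\log p(u \mid t)\text{)} \\[4pt]
f^*(u, t) &= \frac{d}{b}\,u + \Big(c - \frac{a\,d}{b}\Big)\hat{x}_t(u)
  \;=\; \frac{c}{a}\,u + \lambda(t)\,\nabla E_t(u),
  \qquad \lambda(t) := \frac{b\,(a\,d - b\,c)}{a} \\[4pt]
f^*(u) &= \mathbb{E}_{t \mid u}\big[f^*(u, t)\big] \\
&= \underbrace{\bar{\lambda}(u)\,\nabla E_{\mathrm{marg}}(u)}_{\text{scaled energy gradient}}
 \;+\; \underbrace{\mathbb{E}_{t \mid u}\Big[\big(\lambda(t) - \bar{\lambda}(u)\big)\big(\nabla E_t(u) - \nabla E_{\mathrm{marg}}(u)\big)\Big]}_{\text{transport correction}}
 \;+\; \underbrace{\mathbb{E}_{t \mid u}\Big[\frac{c}{a}\Big]\,u}_{\text{linear drift}}
\end{aligned}
```

Here $\bar{\lambda}(u) = \mathbb{E}_{t \mid u}[\lambda(t)]$ and $\nabla E_{\mathrm{marg}}(u) = \mathbb{E}_{t \mid u}[\nabla E_t(u)]$, so the middle term is a posterior covariance between gain and gradient and vanishes when the posterior is a spike. In this notation the gain is $b$ for noise prediction, $-b^2/a$ for signal prediction, and $t/(1 - t)$ for flow-matching velocity: each carries at least one factor of $b$, which is what offsets the $1/b^2$ in the gradient.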
See it at every point on the plane. The figure samples a grid of locations and, at each, draws the three terms of equation (14) head to tail: a teal natural-gradient arrow, then an amber transport-correction arrow, then a small drift arrow, with a dashed line showing the total. The data points sit inside amber rings; the chains shrink to zero as you approach them. Slide the posterior sharpness and watch the amber transport correction die away as the posterior concentrates; what is left is the teal natural-gradient flow plus the drift.
So the infinite well is a mirage from the model's point of view. The field it learns is bounded, with the data as a stable attractor, which is what you saw the particles settle into in Figure 1. The target is fine. But following a fine target with a real numerical sampler is where blind models actually live or die.
Which parameterization survives
Keep the two questions apart, because conflating them is the easiest mistake here. The last section showed the learned target is bounded. This section is about the sampling dynamics: you do not get the field for free, you integrate an ODE with it, and a bounded target divided by a vanishing noise scale can still produce a stiff, explosive equation. The sampler integrates an ODE that wraps the learned field in two schedule-dependent coefficients: the schedule's drift and the effective gain of the parameterization. Note the subtlety the paper is careful about: even though the network has no clock, the sampler does, because the drift and the gain both depend on $t$. The architecture is autonomous; the integration schedule around it is not.
To judge stability, compare the blind sampler against an oracle that knows the true $t$ at every step. Subtracting the two cancels the shared drift term and leaves the error introduced purely by being blind: the gain multiplied by the estimation error, which is the gap between the blind field and the oracle's conditioned field.
Stability is a race as $t \to 0$. The estimation error falls (the posterior over the noise level concentrates, from the last section), but the gain may diverge. Whoever wins decides everything, and the three standard targets land in three different places.
Noise prediction (DDPM, DDIM) has a gain that grows like $1/b$, which diverges. Worse, its estimation error does not fall to zero. It floors at a positive "Jensen gap": the quantity being averaged is a convex function of the noise level, so by Jensen's inequality its posterior average exceeds its value at the posterior-mean level, and that gap stays nonzero unless the posterior is a perfect spike. A diverging gain times a floored error goes to infinity. Blind noise prediction blows up.
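Jensen's inequality is simple to see with numbers. The sketch below takes the convex function to be `1 / b`, which is an assumption made for illustration (the text above only says the gain grows like `1/b`), and compares the posterior average of `1 / b` with `1 / (posterior mean of b)`. The levels and weights are made up.

```python
def jensen_gap(levels, weights):
    """E[1/b] - 1/E[b] for a discrete posterior over the noise scale b."""
    mean_b = sum(w * b for w, b in zip(weights, levels))
    mean_inv_b = sum(w / b for w, b in zip(weights, levels))
    return mean_inv_b - 1.0 / mean_b

# A posterior spread over three noise scales (made-up numbers).
spread_gap = jensen_gap([0.1, 0.2, 0.4], [0.25, 0.5, 0.25])

# A perfect spike: all mass on one level, so the gap closes.
spike_gap = jensen_gap([0.2], [1.0])

print("spread posterior gap:", round(spread_gap, 4))   # about 1.18
print("spike posterior gap: ", spike_gap)
```

The gap is strictly positive whenever more than one level carries weight, and it is exactly this leftover that a diverging gain has to multiply.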
Signal prediction (EDM) has an even worse gain, one that diverges faster still. But near a discrete data manifold the denoising error decays exponentially in the inverse squared noise scale. Near a data point the denoiser is a softmax over the data points weighted by a Gaussian in distance, so once you sit close to one point every other point is suppressed by a Gaussian factor in its squared distance, and the error to the nearest point dies that fast. Exponential decay beats any polynomial blow-up, so the product goes to zero: a stronger gain singularity, and still stable, because the error crashes faster than the gain climbs.
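The race between an exponentially shrinking error and a polynomially growing gain can be run on a two-point toy. The sketch below assumes two datapoints on a line, an unscaled signal (`u = x + b * eps`), and a gain of the form `b ** -power`, with the power chosen by hand rather than taken from the paper. The softmax subtracts its largest logit first so that small `b` does not overflow.

```python
import math

DATA = [0.0, 1.0]      # two datapoints on a line (toy assumption)

def denoiser(u, b):
    """E[x | u] for u = x + b * eps: softmax over datapoints, Gaussian in distance."""
    logits = [-0.5 * ((u - x) / b) ** 2 for x in DATA]
    m = max(logits)                        # subtract the max for numerical stability
    w = [math.exp(l - m) for l in logits]
    return sum(wi * x for wi, x in zip(w, DATA)) / sum(w)

def amplified_error(u, b, power):
    """Distance from the denoiser output to the nearest datapoint, times an assumed gain b**-power."""
    x_hat = denoiser(u, b)
    err = min(abs(x_hat - x) for x in DATA)
    return err / b**power

u = 0.1   # an observation close to the datapoint at 0
for b in (0.4, 0.2, 0.1, 0.05):
    print(b, amplified_error(u, b, 2), amplified_error(u, b, 4))
```

Even with the gain raised to the fourth power, the product collapses toward zero as `b` shrinks, because the pull from the far datapoint is suppressed by a Gaussian factor.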
Velocity prediction (flow matching) has gain $1$, flat. There is nothing to amplify, so the error stays bounded and the dynamics are inherently stable. Toggle the target below and watch the drift error as the noise vanishes: only noise prediction runs to infinity.
That is the title's claim, made precise. You can drop the noise-level input, but only if your parameterization has a bounded gain (velocity) or a self-correcting one (signal). Velocity-based models are inherently safe; noise-prediction models are structurally broken when blind. The difference is one coefficient, the gain: bounded for velocity, divergent for noise.
What this actually says
On the CIFAR-10 numbers Sun et al. report, run blind, a noise-prediction DDIM collapses to an FID roughly eighteen times that of blind flow matching and uEDM (both velocity). FID is a distance between the real and generated image distributions where lower is better: the two velocity models land near the best published CIFAR-10 scores, while at the DDIM's score the samples are visibly broken. That gap is not a tuning artifact. It is the gain singularity turning ordinary estimation noise into garbage. The paper's own runs on CIFAR-10, SVHN, and Fashion-MNIST show the same pattern by eye: blind noise-prediction images are dominated by high-frequency artifacts, while blind velocity images are sharp and match the conditioned models.
The dimensionality experiment makes the geometry visible. Take a 2D ring of data, embed it in a $D$-dimensional space, and sweep $D$. At the lowest setting both blind models struggle: the shells overlap, the posterior is ambiguous, and the samples are diffuse. At the two intermediate settings velocity is already clean while noise prediction is scattered. At the highest setting the concentration is so sharp that even the unstable noise-prediction model converges, because its estimation error finally crashes faster than its gain diverges. The blessing of dimensionality, start to finish.
So do diffusion models need noise conditioning? The short answer is no. The precise answer is that a velocity- or signal-based model can drop the explicit noise input, because the geometry supplies the level and the dynamics stay bounded, while a noise-prediction model effectively still needs it, because run blind it is structurally unstable. The clean argument leans on data living on a low-dimensional manifold inside a high-dimensional space, and the exponential-stability case for signal prediction assumes a discrete manifold, so the result is sharpest exactly where real image data tends to live.
Step back and the whole thing is four facts. A blind model is the posterior-weighted average of the models that knew the noise level. High-dimensional geometry makes that posterior sharp, so the average is accurate. The landscape it descends has an infinite singularity at the data, but a vanishing gain preconditions it into a stable attractor. And whether the sampler survives the descent comes down to one gain coefficient. Put together, the clock that every diffusion model carries was, for the stable parameterizations, information the geometry already held. The model was never really using the clock. It was reading the time off the noise.
Questions you might still have
If the network never sees the noise level, how does it know how much to denoise?
In high dimensions the noise level is written into the geometry of the observation: the distance from the point to the data manifold pins it down (σ̂ = r/√(D−d)), so the posterior over the noise level collapses to a near-certain estimate. The model reads the clock off the picture.
Then why does a blind DDPM fail while blind flow matching works?
Stability is a race as the noise vanishes. Noise prediction has a gain that grows like 1/b and amplifies the residual uncertainty (a nonzero "Jensen gap") into a blow-up; velocity prediction has a gain of exactly 1, so nothing amplifies the error. Same blindness, opposite fate.
Does this mean noise conditioning is useless?
No. The honest claim is narrower: a velocity- or signal-based model can drop the explicit noise input because the geometry supplies it and the dynamics stay bounded. A noise-prediction model cannot. And the clean theory leans on data sitting on a low-dimensional manifold inside a high-dimensional space.
What is the marginal energy, exactly?
It is −log p(u), where p(u) averages the noisy-data densities over every noise level. It is the single static landscape a blind model descends, and it has an infinitely deep well at every data point.
Footnotes & further reading
- [1] The gain acts as a conformal metric, a scalar function times the identity that locally rescales distance. One bookkeeping subtlety: if the update is written as gain times gradient, the implied metric is the inverse gain, so "the gain is the metric" is right in spirit but worth pinning down in the algebra.
- The paper: Sahraee-Ardakan, Delbracio, Milanfar, The Geometry of Noise: Why Diffusion Models Don't Need Noise Conditioning (Google, 2026).
- The empirical result this explains: Sun, Jiang, Zhao, He, Is Noise Conditioning Necessary for Denoising Generative Models? (2025), which also supplies the unified affine schedule and the CIFAR-10 benchmark.
- The diffusion design space and the EDM conventions: Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models. Flow Matching: Lipman et al., Flow Matching for Generative Modeling.
- Tweedie's formula and the denoiser-score link: Robbins (1956), Efron (2011), and Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011).
- The concurrent, kindred analysis of blind denoising in high dimensions: Kadkhodaie et al., Blind Denoising Diffusion Models and the Blessings of Dimensionality. Natural gradient: Amari (1998).