DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation

A residual network is already a diffusion model in disguise.

Residual connections are, mathematically, the steps of a diffusion model's denoising process. A few ideas turn that into the complete method, at a fraction of the training memory.

Explaining the paper: DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation. Shing, Koyama, Akiba · Sakana AI · ICLR 2026 · arXiv:2506.14202

Train a deep network one slice at a time, holding a fraction of it in memory, with no backpropagation running end to end.

The last decade of AI has been shaped by one constraint: to train a network with backpropagation, you have to remember everything. Every layer's output, every intermediate activation, all of it kept alive in memory from the forward pass so the backward pass can use it to compute gradients. Train a 100-layer model and you pay for 100 layers' worth of activations at once. Memory grows linearly with depth, and that wall, not arithmetic and not data, often sets the largest model you can fit on your GPU.

People have tried to dodge this for years by chopping the network into pieces and training each piece on its own. If you only ever hold one piece in memory, depth stops being a memory problem. But nobody could make it work well. Each piece was trained without information about what the other pieces computed, so researchers made up local objectives, little stand-in goals for each block. Those stand-ins were always a bit arbitrary, and the resulting networks always came out worse than ones trained end-to-end at full memory cost. Block-wise training has been one of those ideas that looks obviously right and stubbornly refuses to pay off.

DiffusionBlocks, out of Sakana AI, is the best answer I've seen. Its claim is almost too tidy: the residual connections already sitting in every modern Transformer are, mathematically, the steps of a diffusion model's denoising process. And a diffusion model has a property nobody else's local objective had. Its training objective splits cleanly across noise levels. So each piece really can be trained on its own, against a principled target, and the pieces still add up to one network at the end.

The method rests on four ideas: what a diffusion model actually is, what the "score" is and why a denoiser computes it, how generation is an ODE you solve with Euler's method, and why a residual layer is the same shape as one Euler step. Each is straightforward alone; together they explain the method.

The data lives on a thin sheet

The rest of this post runs on one picture. All the "real" things a model could produce, every photograph of a face, say, can be placed as points in an enormous space. A modest image has hundreds of thousands of pixels, so each image is a point in a space of hundreds of thousands of dimensions. The crucial intuition is that real images occupy almost none of it. Scramble the pixels randomly and you get static, essentially never a face. The faces sit on a thin, crumpled, lower-dimensional sheet, the data manifold (a manifold just means a lower-dimensional surface that looks like ordinary flat space up close, even though it curves and folds when you stand back), floating in a vast emptiness.

Generation is the problem of landing on that sheet. You start somewhere random in the void and you need a way to walk back onto the manifold. Diffusion models learn that walk by first studying its reverse: how the sheet dissolves into the void when you add noise.

The noising is as simple as it sounds. Take a clean point $\mathbf{y}$ and add Gaussian noise of standard deviation $\sigma$ :

\mathbf{z}_\sigma = \mathbf{y} + \sigma\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})

(1)

That is the entire forward process in the variance-exploding (VE) convention this paper uses (the one Karras and collaborators settled on in their EDM paper). The name deserves unpacking, because it flags a convention you have to keep straight. The signal $\mathbf{y}$ is never touched. It isn't shrunk or rescaled, it just sits there while a bigger and bigger cloud of noise is piled on top. The variance of the noise is what explodes as $\sigma$ grows. (The other popular convention, the one from the original DDPM (denoising diffusion probabilistic models) line of work, instead shrinks the signal by $\sqrt{\alpha}$ as it adds noise, keeping the total variance pinned near one. Same family of ideas, different bookkeeping. The VE choice matters here because it's what keeps the algebra tractable later, when the residual connection emerges from the Euler step.)

So $\sigma$ is the single dial from clean data to pure noise. At $\sigma \approx 0$ you have clean data; crank $\sigma$ up and structure washes out until you're left with a featureless Gaussian blob, pure noise, the same blob no matter what image you started from. Every diffusion model begins generation from that blob, and has to end up on the manifold.
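Equation (1) is short enough to run directly. The sketch below is a minimal illustration, not the paper's code: `make_spiral` is a hypothetical toy dataset and the listed $\sigma$ values are arbitrary. It shows the VE bookkeeping, where the signal is left alone and the per-axis spread of $\mathbf{z}_\sigma$ approaches $\sigma$ once the noise dominates.

```python
import numpy as np

def make_spiral(n, rng):
    """Hypothetical toy dataset: n points along a 2-D spiral."""
    t = rng.uniform(0.5, 3.0 * np.pi, size=n)
    return np.stack([t * np.cos(t), t * np.sin(t)], axis=1) / (3.0 * np.pi)

def noise_ve(y, sigma, rng):
    """Eq. (1): z = y + sigma * eps. The signal y is never rescaled."""
    eps = rng.standard_normal(y.shape)
    return y + sigma * eps

rng = np.random.default_rng(0)
y = make_spiral(20_000, rng)
for sigma in (0.0, 0.6, 4.0, 80.0):
    z = noise_ve(y, sigma, rng)
    # Signal and noise are independent, so per-axis variances add:
    # Var[z] = Var[y] + sigma**2.
    print(f"sigma={sigma:5.1f}  std of z per axis = {z.std(axis=0).round(2)}")
```

At the largest $\sigma$ the printed spread is essentially $\sigma$ itself, whatever the spiral looked like.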

Drag the slider and watch a spiral, our stand-in for "structured data," come apart:

Figure 1 · forward noising (interactive; slider: noise σ)

Each clean point y is pushed off the data manifold by Gaussian noise. Drag σ up and watch the spiral dissolve into a featureless blob. That blob is what every diffusion model starts from and must climb back out of.

The spiral is gone by the time $\sigma$ is large, by design. The forward direction destroys information; it's easy and requires no learning at all. All the intelligence lives in running it backward. The sheet matters later in the method: at every noise level, denoising pulls points back toward the same low-dimensional sheet, so a block responsible for only its own band of noise still inherits a consistent global target, and that shared manifold keeps independently trained blocks compatible.

The score: a vector field pointing toward data

To walk back toward the manifold, you'd love to have, at every point in the space, an arrow telling you which way the data is. That arrow exists, and it has a name.

Write $p_\sigma$ for the distribution of the noised data $\mathbf{z}_\sigma$ at level $\sigma$ , the cloud you just watched the spiral dissolve into. The arrow we want is the gradient of its log-density, $\nabla_{\mathbf{z}} \log p_\sigma(\mathbf{z})$ , and it's called the score. (Specifically the Stein score, the gradient with respect to the point in data space, not with respect to any model parameters. Keep that straight; "score" is an overloaded word.) At any location $\mathbf{z}$ , it points in the direction in which the noised density increases fastest.

Two things make the score lovely to work with. First, it's invariant to normalizing constants. A probability density has an annoying $1/Z$ out front to make it integrate to one, and $Z$ is usually a hopeless integral over the entire space. But the gradient of the log kills any constant: $\nabla \log(p/Z) = \nabla \log p - \nabla \log Z$ , and that second term is zero.

Second, the score behaves exactly the way intuition demands as you change the noise level. At small $\sigma$ the density is sharp, concentrated tightly around the data, so the field points sharply toward the nearest clump. At large $\sigma$ everything has blurred together into one broad hill, and the field points gently toward the global center of mass. Play with $\sigma$ and watch the field stiffen and relax:

Figure 2 · the score field (interactive; slider: noise σ)

The score ∇log p_σ(z) is a vector field: at every point it aims where probability mass is densest. At small σ it points toward the nearest data cluster; at large σ the clusters blur together and it points gently to the global center.

If you had this vector field, generation would be easy: drop a particle in the noise and follow the score field down to the data. So the entire game reduces to estimating the score. Which sounds hard, until you realize you already know how to do it, under a different name.

Tweedie's formula: a denoiser is a score estimator

Suppose I hand you a noisy point $\mathbf{z}_\sigma$ and ask for your single best guess of the clean $\mathbf{y}$ it came from. "Best" in the least-squares sense means you should return the average of every clean point that could plausibly have been noised into this $\mathbf{z}_\sigma$ , the conditional mean $\mathbb{E}[\mathbf{y}\mid\mathbf{z}_\sigma]$ . Call that best guess $D(\mathbf{z}_\sigma,\sigma)$ ; it's a denoiser.

A result from the 1950s called Tweedie's formula says something unexpectedly convenient: for Gaussian noise, the arrow from the noisy point to that best guess and the score are the same arrow, up to a factor of $1/\sigma^2$. It is a strange equivalence. Denoising asks a supervised question, where is the clean data that produced this point, while the score asks an unsupervised one, which way is the blurred density increasing, and for Gaussian noise the two answers point the same way, so training one gives you the other. Figure 3 below lets you check it by hand: the two arrows are computed independently, and they coincide at every probe and every $\sigma$.

\nabla_{\mathbf{z}} \log p_\sigma(\mathbf{z}) = \frac{D(\mathbf{z},\sigma) - \mathbf{z}}{\sigma^2}

(2)

The left side is the score, the abstract, normalizer-free quantity at the core of the method. The right side is "your best guess of the clean image, minus where you currently are." Of course that difference points from your noisy point toward the clean data; dividing by $\sigma^2$ sets the length. The direction of increasing plausibility is the direction from noisy toward denoised. Estimating the score and denoising an image are literally the same task.

The derivation is three lines and you know every move in it. The noised density is the data blurred by a Gaussian, $p_\sigma(\mathbf{z}) = \int \mathcal{N}(\mathbf{z};\mathbf{y},\sigma^2\mathbf{I})\,p(\mathbf{y})\,d\mathbf{y}$ . Differentiate in $\mathbf{z}$ (the only place $\mathbf{z}$ appears is inside the Gaussian), divide by $p_\sigma(\mathbf{z})$ , and the integral collapses into a posterior average:

\nabla_{\mathbf{z}} \mathcal{N}(\mathbf{z};\mathbf{y},\sigma^2\mathbf{I}) = \mathcal{N}(\mathbf{z};\mathbf{y},\sigma^2\mathbf{I})\,\frac{\mathbf{y}-\mathbf{z}}{\sigma^2} \quad\Longrightarrow\quad \nabla_{\mathbf{z}} \log p_\sigma(\mathbf{z}) = \frac{\mathbb{E}[\mathbf{y}\mid\mathbf{z}] - \mathbf{z}}{\sigma^2}

That posterior mean $\mathbb{E}[\mathbf{y}\mid\mathbf{z}]$ is exactly the best-guess denoiser $D$, so you've recovered eq (2) with nothing fancier than the chain rule and Bayes' rule. The Gaussian produces the clean $(\mathbf{y}-\mathbf{z})/\sigma^2$ factor; other noise families have their own Tweedie identity, but they lose this exact form.

You can check eq (2) by hand on a toy density, at any point and any noise level:

Figure 3 · the denoiser is the score (interactive; slider: noise σ, draggable probe z)

A three-mode Gaussian mixture, blurred by σ (the amber heat is p_σ). Drag the probe z anywhere. The wide amber arrow is (D(z,σ) − z)/σ² with D the exact posterior mean; the thin teal arrow is the score ∇log p_σ(z). Both are computed independently from the same closed-form mixture, and they coincide at every point and every σ, cosine 1.000: that is Tweedie's formula.
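The same check can be run numerically. The sketch below is a 1-D stand-in for the figure, under stated assumptions: the mixture weights, means and widths are made up for illustration, the denoiser is the closed-form posterior mean for that mixture, and the score is taken by finite differences of $\log p_\sigma$ so that the two sides of eq (2) are computed independently.

```python
import numpy as np

# Hypothetical 1-D three-mode Gaussian mixture standing in for the data.
W = np.array([0.5, 0.3, 0.2])     # mixture weights
M = np.array([-2.0, 0.5, 3.0])    # component means
S = np.array([0.3, 0.2, 0.4])     # component standard deviations

def log_p_sigma(z, sigma):
    """Log-density of the mixture after adding N(0, sigma^2) noise."""
    var = S**2 + sigma**2          # Gaussian blur adds variances
    comp = W * np.exp(-0.5 * (z - M) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return np.log(comp.sum())

def denoiser(z, sigma):
    """Exact posterior mean E[y | z] for the mixture."""
    var = S**2 + sigma**2
    resp = W * np.exp(-0.5 * (z - M) ** 2 / var) / np.sqrt(var)
    resp /= resp.sum()                     # posterior probability of each mode
    mean_k = M + (S**2 / var) * (z - M)    # per-mode posterior mean of y
    return float((resp * mean_k).sum())

def score_fd(z, sigma, h=1e-5):
    """Score by central finite difference, independent of the denoiser."""
    return (log_p_sigma(z + h, sigma) - log_p_sigma(z - h, sigma)) / (2 * h)

for sigma in (0.3, 0.8, 2.0):
    for z in (-3.0, 0.0, 1.7, 4.0):
        tweedie = (denoiser(z, sigma) - z) / sigma**2
        print(f"sigma={sigma:.1f} z={z:+.1f}  "
              f"score={score_fd(z, sigma):+.5f}  tweedie={tweedie:+.5f}")
```

The two columns should agree up to finite-difference error at every probe and every $\sigma$.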

And this matters, because one of these tasks is hard to set up and one is trivial. You cannot directly supervise "the score," since you don't have ground-truth arrows. But you can absolutely supervise a denoiser: take clean data, add noise yourself (so you know the answer), and train a network to undo it. That's a plain regression problem:

\mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{y},\,\sigma,\,\boldsymbol{\epsilon}}\Big[\, w(\sigma)\,\big\lVert D_{\boldsymbol{\theta}}(\mathbf{y}+\sigma\boldsymbol{\epsilon},\,\sigma) - \mathbf{y} \big\rVert_2^2 \,\Big]

(3)

Show the network a noised sample, ask it to predict the clean original, penalize the squared error. The weight $w(\sigma)$ balances how much each noise level counts so no single $\sigma$ dominates the gradient. Train this and, by eq (2), you've trained a score estimator at the same time.
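To see the loss in eq (3) separate a good denoiser from a bad one, here is a minimal Monte-Carlo sketch. Everything in it is an assumption chosen so the answer is known in closed form: the data is 1-D Gaussian with scale `S_DATA`, the weight $w(\sigma)$ is set to one, and `ideal_denoiser` is the exact posterior mean for that data rather than a trained network.

```python
import numpy as np

S_DATA = 1.5  # hypothetical: data is y ~ N(0, S_DATA^2)

def ideal_denoiser(z, sigma):
    """E[y | z] for Gaussian data: shrink z toward the data mean (zero)."""
    return z * S_DATA**2 / (S_DATA**2 + sigma**2)

def identity_denoiser(z, sigma):
    """A deliberately bad baseline that returns its noisy input."""
    return z

def denoising_loss(denoiser, sigmas, rng, n=200_000, weight=lambda s: 1.0):
    """Monte-Carlo estimate of eq. (3), averaged over the given sigmas."""
    total = 0.0
    for sigma in sigmas:
        y = S_DATA * rng.standard_normal(n)       # clean samples
        z = y + sigma * rng.standard_normal(n)    # eq. (1)
        total += weight(sigma) * np.mean((denoiser(z, sigma) - y) ** 2)
    return total / len(sigmas)

rng = np.random.default_rng(0)
sigmas = [0.5, 2.0, 8.0]
print("ideal   :", denoising_loss(ideal_denoiser, sigmas, rng))
print("identity:", denoising_loss(identity_denoiser, sigmas, rng))
```

For this toy data the ideal denoiser's loss at a single $\sigma$ is $s^2\sigma^2/(s^2+\sigma^2)$ with $s$ the data scale, while the identity baseline pays the full $\sigma^2$.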

Tweedie's formula is an exact theorem about the ideal denoiser $\mathbb{E}[\mathbf{y}\mid\mathbf{z}]$ ; the " $\approx$ " you'll see in the paper is only because a real trained network $D_{\boldsymbol{\theta}}$ approximates that ideal, not because the math is loose. And notice $D_{\boldsymbol{\theta}}$ predicts the clean data $\mathbf{y}$ , not the noise $\boldsymbol{\epsilon}$ the way vanilla DDPM does. These are interchangeable reparameterizations, but it pays to know which one you're holding. The formula implies one more thing: because the denoiser returns the mean over all candidate clean images, its output at high noise looks blurry, an average of many faces rather than a face. That's not a bug; a conditional expectation is supposed to look like that.

Generation is an ODE you solve with Euler's method

We have the score field. Now we need to integrate along it.

The precise way to state the walk is as a differential equation. There's a famous result (Song and collaborators, in the paper that unified diffusion models under stochastic differential equations) that the noising process can be reversed in two equivalent ways: a noisy, random one (an SDE) and a smooth, deterministic one that produces the exact same distribution of outcomes at every noise level. The deterministic version is called the probability-flow ODE, and for the VE convention it's strikingly simple:

\frac{d\mathbf{z}_\sigma}{d\sigma} = -\sigma\,\nabla_{\mathbf{z}} \log p_\sigma(\mathbf{z}_\sigma)

(4)

An ODE like this is a rule that says "at this position and this noise level, here is the velocity." A vector field. To generate, you start at the top, a sample of pure noise at $\sigma_{\max}$ , and integrate this velocity downward as $\sigma$ shrinks toward zero, and you land on the manifold. (The deterministic ODE and the random SDE share the same marginal distributions at every $\sigma$ . That means if you ran either process many times and looked at the cloud of points at a given $\sigma$ , you'd see the identical distribution, even though any single deterministic run traces a different path than any single random one.)

That shared-marginals fact has a reason that takes a sentence to state. As $\sigma$ grows, the VE process is points doing a driftless random walk, so the density $p_\sigma$ obeys a diffusion (heat) equation. Any such spreading can be rewritten as a continuity equation, the accounting a physicist uses for a fluid that is neither created nor destroyed: the density changes only because probability flows somewhere, at some velocity. Solve for the velocity that reproduces the heat equation's spreading and it comes out to exactly $-\sigma\,\nabla_{\mathbf{z}}\log p_\sigma$ , which is eq (4). The random walk and the deterministic flow are two ways to shove the same density around, which is why they agree at every noise level. The minus sign and the $\sigma$ aren't free knobs; they're what makes the two descriptions match.
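Written out, and assuming $p_\sigma$ is smooth enough to differentiate, the argument takes three lines. Because $p_\sigma$ is the data density blurred by a Gaussian of variance $\sigma^2$, it obeys a heat equation in $\sigma$:

\frac{\partial p_\sigma}{\partial \sigma} = \sigma\,\nabla_{\mathbf{z}}^{2}\, p_\sigma

A deterministic flow with velocity $\mathbf{v}$ moves density according to the continuity equation:

\frac{\partial p_\sigma}{\partial \sigma} = -\nabla_{\mathbf{z}}\cdot\big(p_\sigma\,\mathbf{v}\big)

Try $\mathbf{v} = -\sigma\,\nabla_{\mathbf{z}}\log p_\sigma = -\sigma\,\nabla_{\mathbf{z}} p_\sigma / p_\sigma$. Then $p_\sigma\,\mathbf{v} = -\sigma\,\nabla_{\mathbf{z}} p_\sigma$, and the right-hand side becomes

-\nabla_{\mathbf{z}}\cdot\big(-\sigma\,\nabla_{\mathbf{z}} p_\sigma\big) = \sigma\,\nabla_{\mathbf{z}}^{2}\, p_\sigma

which is the heat equation again, so this velocity transports the density exactly as the random walk spreads it.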

Now substitute Tweedie's formula (2) into the ODE (4) to get rid of the abstract score and put our trainable denoiser front and center:

\frac{d\mathbf{z}_\sigma}{d\sigma} = -\sigma \cdot \frac{D - \mathbf{z}}{\sigma^2} = \frac{\mathbf{z} - D}{\sigma}

To actually solve an ODE on a computer you take small steps: Euler's method, the most basic scheme in numerical analysis. From a point $\mathbf{z}$ at noise level $\sigma_{\text{prev}}$, you step to the next, lower level $\sigma_{\text{next}}$ by moving along the velocity for the length of the step:

\mathbf{z}_{\text{next}} = \mathbf{z} + (\sigma_{\text{next}} - \sigma_{\text{prev}})\cdot\frac{\mathbf{z}-D}{\sigma_{\text{prev}}}

The paper itself writes the displacement in the opposite order. There are two minus signs to track. The ODE (4) has a minus in front: as $\sigma$ increases, $\mathbf{z}$ moves against the score, away from the data, which is just the noising direction. And you're walking the $\sigma$-axis backward, from large $\sigma$ down to small, so your step $(\sigma_{\text{next}} - \sigma_{\text{prev}})$ is itself negative. Two reversals compose into forward progress along the score. Let $\Delta\sigma = \sigma_{\text{prev}} - \sigma_{\text{next}} > 0$ be the positive size of the step, and the expression collapses to something you can read at a glance:

\mathbf{z}_{\text{next}} = \mathbf{z} + \frac{\Delta\sigma}{\sigma_{\text{prev}}}\big(D - \mathbf{z}\big)

(5)

In words: take a step a fraction of the way from where you are toward where the denoiser predicts the clean data to be. If your guess says the answer is at $D=2$ and you're sitting at $\mathbf{z}=5$ , you move to something like $4.25$ , closer to $2$ and never further. The denoiser determines the target direction; the noise schedule determines the step length.

Euler is a first-order method. Its error per step shrinks like $\Delta\sigma^2$, and the accumulated error over the full trajectory like $\Delta\sigma$. Take coarser steps and you track the curve less accurately; take more steps and you track it better. Fancier samplers add a second-order correction to reach the same accuracy in fewer steps. That trade, more steps for a better approximation, also governs how many Euler steps each block takes.
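Both claims, the single step of eq (5) and the first-order convergence, can be checked on a toy problem. The sketch below rests on assumptions made for illustration only: the data is 1-D Gaussian with scale `S_DATA`, so the ideal denoiser and the exact ODE solution are both known in closed form, and the linear $\sigma$ schedule is arbitrary. The values $\sigma_{\text{prev}}=4$, $\sigma_{\text{next}}=3$ in the first print are picked to give a step fraction of $0.25$.

```python
import numpy as np

S_DATA = 1.0  # hypothetical: data is y ~ N(0, S_DATA^2)

def denoiser(z, sigma):
    """Exact E[y | z] for Gaussian data, standing in for a trained network."""
    return z * S_DATA**2 / (S_DATA**2 + sigma**2)

def euler_step(z, d, sigma_prev, sigma_next):
    """Eq. (5): move a fraction delta_sigma / sigma_prev of the way to d."""
    delta_sigma = sigma_prev - sigma_next          # positive step size
    return z + (delta_sigma / sigma_prev) * (d - z)

def sample(z_start, sigmas):
    """Walk z down a decreasing noise schedule with Euler steps."""
    z = z_start
    for s_prev, s_next in zip(sigmas[:-1], sigmas[1:]):
        z = euler_step(z, denoiser(z, s_prev), s_prev, s_next)
    return z

def exact(z_start, s_start, s_end):
    """Closed-form solution of the ODE for this Gaussian toy problem."""
    return z_start * np.sqrt((S_DATA**2 + s_end**2) / (S_DATA**2 + s_start**2))

# The worked example from the text: z = 5, D = 2, step fraction 0.25.
print(euler_step(5.0, 2.0, sigma_prev=4.0, sigma_next=3.0))  # 4.25

for n_steps in (4, 8, 16, 32, 64):
    sigmas = np.linspace(4.0, 0.0, n_steps + 1)
    err = abs(sample(3.0, sigmas) - exact(3.0, 4.0, 0.0))
    print(f"{n_steps:3d} steps  |error| = {err:.5f}")
```

Doubling the step count should roughly halve the error, which is what first-order accuracy means in practice.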

Here is the reverse process running. Pure noise at the top, thirty-two Euler steps down the probability-flow ODE, and the particles settle onto the data:

Figure 4 · reverse generation

32 Euler steps · σ: 4 → 0

Generation walks the noise level down. Start from pure noise at σ=4, then take 32 Euler steps down the probability-flow ODE, nudging each particle along the score until it settles onto a data cluster. Every step is one residual update.

The arrows over a full run answer a question you might be sitting on: why does a particle keep changing direction on the way down? It falls straight out of the velocity, $(\mathbf{z}-D)/\sigma$ , and the fact that the denoiser $D$ is itself a function of $\sigma$ .

At high $\sigma$ the denoiser's guess is broad. With that much noise on its input, $D$ cannot distinguish which data point you came from, so its best guess is close to the average of all of them: one blurry blob sitting at the center of the data. Every particle, wherever it starts, gets pulled toward that same place, so early in the run the field is smooth and slow and barely varies from point to point. The big decisions haven't been made yet.

As $\sigma$ shrinks, $D$ sharpens. Now the noise is small enough that the nearest data point dominates the guess, so $D$ no longer outputs the global average and instead outputs which mode you actually belong to. The field reorganizes from "everything flows to the center" into "each region flows to its own data point," with sharp ridges along the borders between basins. A particle that was drifting toward the middle is now pulled toward a specific cluster. That is the direction change you see. It isn't randomness; it's the denoiser's output shifting as the falling noise level lets it resolve finer structure.

The same fact explains the other thing you notice, that the arrows are short out in the open and turn sharply near the data. Two effects stack. The $1/\sigma$ out front scales the field up as $\sigma \to 0$ . And near a data point, nudging $\mathbf{z}$ a little swings the direction toward it a lot (and crossing the border between two points flips $D$ from one to the other), so the field turns fastest exactly where the data is densest. Far from any data there's nothing to resolve, and the arrows just point home.

The compact way to say all of it: $p_\sigma$ is the data distribution blurred by a Gaussian of width $\sigma$ , and the score reflects only structure coarser than $\sigma$ . Large $\sigma$ blurs everything into a single hill, so the field is coarse and smooth. Small $\sigma$ resolves the individual data points, so the field grows sharp features right where they sit. This is why diffusion paints coarse-to-fine: the layout first, the detail last. It also explains why the middle noise levels, the ones reorganizing the field from "one center" into "specific modes," carry most of the work.

Every one of those steps is a single application of (5): state, plus a correction toward the denoiser's guess. That shape, state + correction, is exactly the form of a residual block.

A residual block is one Euler step

If you've looked at the guts of a ResNet or a Transformer, you know the residual connection. A block doesn't replace its input; it computes a correction and adds it:

\mathbf{z}_\ell = \mathbf{z}_{\ell-1} + f_{\theta_\ell}(\mathbf{z}_{\ell-1})

(6)

This single design decision is most of why deep networks train at all. It's the idea behind ResNets, and it's in every Transformer block ever shipped.

Unpacking the symbols: $\mathbf{z}_{\ell-1}$ is the activation flowing into layer $\ell$ (for a Transformer, the full sequence of token vectors, a tensor of shape $\text{tokens} \times d$ ), $\mathbf{z}_\ell$ is what flows out to the next layer, and $f_{\theta_\ell}$ is that layer's own learned transformation (its attention and MLP, with weights $\theta_\ell$ ). The residual connection says layer $\ell$ doesn't build $\mathbf{z}_\ell$ from scratch. It computes a correction $f_{\theta_\ell}(\mathbf{z}_{\ell-1})$ and adds it to the input $\mathbf{z}_{\ell-1}$ it received from the previous layer, so each layer only has to learn how to adjust the running state, not regenerate it. Now put it next to the Euler step (5):

\underbrace{\mathbf{z}_\ell = \mathbf{z}_{\ell-1} + f_{\theta_\ell}(\mathbf{z}_{\ell-1})}_{\text{a residual layer}} \quad\Longleftrightarrow\quad \underbrace{\mathbf{z}_{\text{next}} = \mathbf{z} + \tfrac{\Delta\sigma}{\sigma_{\text{prev}}}\big(D - \mathbf{z}\big)}_{\text{one Euler step}}

They are the same shape. "Old state plus a learned correction" is "old state plus a velocity step." This is the observation behind Neural ODEs: a residual network is Euler's method applied to some hidden ODE, with the step size fixed at one, and the network's depth playing the role of time. Each layer advances you one tick along a continuous trajectory; the function $f_\theta$ approximates that trajectory's velocity field.

The discretization is visible below. The smooth amber curve is a true continuous-depth ODE flow. The teal path is a residual network trying to follow it with a handful of discrete layers, each dot one layer, one Euler step. Few layers and it overshoots and cuts the corners; add layers (shrinking the step) and the staircase melts into the curve:

Figure 5 · depth is time (interactive; slider: layers L, step size h)

The true flow (the smooth continuous-depth ODE) is fixed. The residual network approximates it with L discrete jumps, one dot per layer. With few layers it overshoots and cuts corners; add layers (shrink the step) and the staircase melts into the curve. Depth is time, and a layer is one Euler step.

The match is structural, not exact. A vanilla residual block is a crude, step-size-one Euler step, and an off-the-shelf ResNet won't converge to a clean Neural ODE without some care. But the shape is real, and the shape is all DiffusionBlocks needs.
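The "depth is time" reading can be sketched in a few lines. This is an illustration under assumptions, not a trained network: the velocity field is a fixed 2-D rotation `A` (hypothetical, chosen because its exact flow is known), every layer shares that same field, and the step size is $h = T/L$ instead of the step-size-one of a vanilla block.

```python
import numpy as np

A = np.array([[0.0, -1.0], [1.0, 0.0]])   # hypothetical velocity field f(z) = A z
T = 3.0                                    # total "depth time" to cover

def residual_layer(z, h):
    """One residual update z + h * f(z): a single Euler step of size h."""
    return z + h * (A @ z)

def residual_stack(z, n_layers):
    """Apply n_layers residual layers, each covering T / n_layers of the flow."""
    h = T / n_layers
    for _ in range(n_layers):
        z = residual_layer(z, h)
    return z

def true_flow(z):
    """Exact solution of dz/dt = A z at time T: a rotation by T radians."""
    c, s = np.cos(T), np.sin(T)
    return np.array([[c, -s], [s, c]]) @ z

z0 = np.array([1.0, 0.0])
for n_layers in (3, 6, 12, 48, 192):
    err = np.linalg.norm(residual_stack(z0, n_layers) - true_flow(z0))
    print(f"L={n_layers:4d}  h={T / n_layers:.3f}  |error| = {err:.4f}")
```

With a handful of layers the stack overshoots the circle badly; with many thin layers it hugs the true rotation.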

There is still a coefficient that doesn't match. A vanilla residual block adds its correction with weight one; the exact Euler step (5) adds it with weight $\Delta\sigma/\sigma_{\text{prev}}$, which differs for every block and shrinks as you approach clean data. DiffusionBlocks doesn't make the network learn to fight that factor. The block predicts only the denoised guess $D$, a plain regression target, and the known noise schedule supplies the $\Delta\sigma/\sigma_{\text{prev}}$ multiplier when the Euler update is assembled around it. The learned part is only the denoising; the geometry of the step is fixed in advance.

Now for the question the paper is built on: what if we deliberately made the residual blocks into denoising Euler steps, and trained each one with the diffusion objective?

The synthesis: every block becomes its own denoiser

It comes down to one reinterpretation of the stack of residual blocks we already have: the network isn't computing some opaque feedforward function, it's running the reverse diffusion ODE, and each block is responsible for one stretch of the journey from noise to data.

The recipe is three steps, and none of them is heavy.

One, cut the stack into blocks. Take your $L$ layers and group them into $B$ contiguous blocks. (A "block" can be a single layer or a dozen; in the paper's experiments it's typically a few Transformer layers.)

Two, hand each block a noise range. Slice the range of noise levels $[\sigma_{\min},\sigma_{\max}]$ into $B$ consecutive intervals and give one to each block. The block that owns the high-noise end is doing coarse, broad-strokes work; the block that owns the near-clean end is doing fine detail. (How to cut the intervals fairly is its own neat problem.)

Three, tell each block it's a denoiser. Feed the block a noised input and condition it on the noise level $\sigma$ , meaning you feed $\sigma$ in as an extra input so the same block can adapt its behavior across its whole noise band. (The standard mechanism, AdaLN, nudges each layer's normalization based on $\sigma$ ; it's the same conditioning DiT (the Diffusion Transformer, a Transformer that replaces the U-Net as the backbone of a diffusion image model) uses.) The block now plays the role of $D$ in the Euler update (5), and its job is to predict the clean target $\mathbf{y}$ from a noisy input within its assigned noise range.

With that conditioning, the ODE reading stops being a metaphor. A plain residual layer is not told any noise level; the same $f_\theta$ runs no matter where in the stack you are. But the velocity $(\mathbf{z}-D)/\sigma$ depends explicitly on $\sigma$ , so a block standing in for it has to be told its noise level. Feeding in $\sigma$ is what turns a fixed layer into a time-dependent velocity field $v(\mathbf{z},\sigma)$ , a real right-hand side for the differential equation rather than just a statement about depth.

Making each block a denoiser with its own noise band is what removes the coupling between blocks. The training loss (3) is an expectation over noise levels, so it is a sum of per-$\sigma$ terms, and the term at $\sigma = 5$ never references the one at $\sigma = 0.1$. In ordinary diffusion those per-$\sigma$ problems still fight over one shared set of weights $\boldsymbol{\theta}$. DiffusionBlocks removes even that coupling: give each block its own parameters and a disjoint slice of $\sigma$, and now nothing links them, not the target and not the weights. Each block has a complete, self-contained objective:

\mathcal{L}_b(\boldsymbol{\theta}_b) = \mathbb{E}_{(\mathbf{x},\mathbf{y}),\,\sigma\sim p^{(b)},\,\boldsymbol{\epsilon}}\Big[\, w(\sigma)\,\big\lVert \bar{f}_{\boldsymbol{\theta}_b\mid\sigma}(\mathbf{x},\,\mathbf{y}+\sigma\boldsymbol{\epsilon}) - \mathbf{y} \big\rVert^2 \,\Big]

(7)

Here $\mathbf{x}$ is whatever the network is conditioned on (the class label, the text prompt, nothing at all for plain generation), and $p^{(b)}$ is the overall noise distribution restricted and renormalized to block $b$ 's slice. To train block $b$ , you take clean data, add noise at a level drawn from its range, ask it to denoise, and backpropagate through that one block only. No previous block's output is needed. No gradient flows between blocks. No backpropagation through the entire stack.
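A minimal sketch of the per-block objective follows. It makes several assumptions that the text above does not fix: the overall noise distribution is taken to be EDM's log-normal, $\ln\sigma \sim \mathcal{N}(-1.2,\ 1.2^2)$, on the range $[0.002, 80]$; the blocks are cut at its equal-probability quantiles (the negligible mass outside the range is ignored); the data is 1-D; there is no conditioning $\mathbf{x}$; and $w(\sigma)=1$. The names `block_boundaries`, `sample_sigma` and `block_loss` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

# Assumed for illustration: EDM's log-normal training distribution over sigma.
P_MEAN, P_STD = -1.2, 1.2
SIGMA_MIN, SIGMA_MAX = 0.002, 80.0
LOG_SIGMA = NormalDist(P_MEAN, P_STD)

def block_boundaries(n_blocks):
    """Cut [SIGMA_MIN, SIGMA_MAX] into n_blocks equal-probability slices."""
    inner = [np.exp(LOG_SIGMA.inv_cdf(k / n_blocks)) for k in range(1, n_blocks)]
    return [SIGMA_MIN, *inner, SIGMA_MAX]

def sample_sigma(lo, hi, n, rng):
    """Draw sigma from the log-normal restricted and renormalized to [lo, hi]."""
    u_lo, u_hi = LOG_SIGMA.cdf(np.log(lo)), LOG_SIGMA.cdf(np.log(hi))
    u = rng.uniform(u_lo, u_hi, size=n)           # inverse-CDF sampling
    return np.exp([LOG_SIGMA.inv_cdf(float(v)) for v in u])

def block_loss(block_fn, y, lo, hi, rng, weight=lambda s: 1.0):
    """Monte-Carlo estimate of eq. (7) for one block, without conditioning x."""
    sigma = sample_sigma(lo, hi, len(y), rng)
    z = y + sigma * rng.standard_normal(len(y))   # fresh noise on real data
    return float(np.mean(weight(sigma) * (block_fn(z, sigma) - y) ** 2))

bounds = block_boundaries(4)
print([round(float(b), 3) for b in bounds])

rng = np.random.default_rng(0)
y = rng.standard_normal(5_000)                    # hypothetical unit-Gaussian data
for b in range(4):
    # Each block's loss needs only clean data and its own sigma slice.
    ideal = lambda z, s: z / (1.0 + s**2)         # exact denoiser for this data
    print(b, block_loss(ideal, y, bounds[b], bounds[b + 1], rng))
```

Nothing in `block_loss` takes another block as an argument, which is the independence the objective promises.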

And the obvious worry, the one a careful reader feels immediately: if block $b+1$'s training never includes block $b$'s output, how do the handoffs line up at the end? No block is ever trained against another block's output. Every block is trained against the same fixed ground truth, real data with fresh noise added at its own $\sigma$. The reverse ODE only ever needs the correct denoiser at each noise level, and that target is identical whether one network learns all the levels or $B$ separate blocks split them up. Get each band right on its own and the global trajectory is automatically right. This is why earlier block-wise methods failed and this one succeeds: those methods invented a local objective and hoped the blocks would cohere, whereas here the local objective is derived. It's the denoising-score-matching loss for that block's noise band, and the diffusion theory guarantees that consistent local denoising rebuilds the global reverse process.

Training and running differ in one way that matters here. In training, no block ever sees another block's output; each gets fresh-noised real data at its own $\sigma$. At generation the blocks run in sequence, and block $b+1$ consumes whatever block $b$ actually produced, imperfections and all. So if a block denoises imperfectly its output drifts off $p_\sigma$, and that error feeds downstream. This is the diffusion cousin of exposure bias: train on truth, run on your own approximations. It's real, and it isn't peculiar to splitting the network. A single shared diffusion model has the identical problem, since it too is fed its own previous output while sampling, and it copes because each step's error is small (Euler's per-step error shrinks like $\Delta\sigma^2$) and the velocity field keeps contracting toward the data over most of the trip. Block-wise training adds no new source of mismatch, since the learning target never references another block. The paper does hedge the seam in one concrete way: it lets adjacent blocks overlap a little, training each on a noise range stretched past its own boundaries on each side by a small per-block factor (about 10% for the narrow middle blocks, more for the wide tails), so every block has already seen inputs from just outside its slice before it has to accept a handoff there.

What you buy is memory. You only ever hold one block's activations live, so activation memory during training drops from $L$ layers' worth to $L/B$, a $B$-fold reduction, and the model's parameter count doesn't change at all. The paper measures exactly that: a $3\times$ reduction for its $B{=}3$ diffusion model, $4\times$ for the $B{=}4$ autoregressive one. (Gradient checkpointing, the usual trick for trading compute to save activation memory, helps the constant factor but doesn't remove the coupling between blocks the way this does.) Both panels below train the same twelve-layer network; the left does it end-to-end with the memory meter pinned at 100%, the right does it block-wise with the meter near $1/B$:

Figure 6 · training memory

Both train the same 12-layer network. End-to-end stores every layer's activations, then backprops through all of them, so the meter pins at 100%. DiffusionBlocks trains one block at a time on its own noise range: only that block's activations live in memory, so the meter sits near 1/B.
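The arithmetic behind the meter is a one-liner. The sketch below is a back-of-the-envelope estimate under a loud simplification: it counts a single `(batch, n_tokens, d_model)` tensor per layer, whereas a real Transformer layer keeps several intermediates. The batch size, sequence length, width and $B=4$ are illustrative values, not figures from the paper.

```python
def activation_bytes(n_layers, n_tokens, d_model, batch=1, bytes_per_value=2):
    """Rough activation memory: one (batch, n_tokens, d_model) tensor per layer.

    A deliberate simplification: real layers store several intermediate
    tensors each, which scales both cases below by the same constant.
    """
    return n_layers * batch * n_tokens * d_model * bytes_per_value

L, B = 12, 4   # layers and blocks (illustrative values)
end_to_end = activation_bytes(L, n_tokens=1024, d_model=768, batch=32)
block_wise = activation_bytes(L // B, n_tokens=1024, d_model=768, batch=32)

print(f"end-to-end: {end_to_end / 2**20:.0f} MiB")
print(f"block-wise: {block_wise / 2**20:.0f} MiB "
      f"({end_to_end / block_wise:.0f}x less)")
```

The ratio is $B$ regardless of the per-layer constant, because the constant multiplies both sides.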

What about generation, once everything is trained? It is the forward pass you already know. You feed a noisy input in at the top and the blocks run in order. Each applies its Euler update over its own $\sigma$ -band, state plus a correction toward its denoised guess, then hands the result down to the next. The bottom of the stack is your sample. The network runs exactly like any residual network at inference; we changed how it's trained, not how it runs. (For a plain diffusion model you can be lazier still and call only the one block whose noise band a given denoising step falls in.)

A word on terminology, because three different things have all been called a "step." One residual layer is one Euler step of the hidden ODE. A block is several layers, so a block is a short run, not a single step. And the reverse demo earlier took thirty-two steps. So how do three or four blocks cover a fine trajectory? They don't have to: the number of Euler steps is a separate dial from the number of blocks. By default the paper takes one step per block, the coarse setting, but you can take more, and at each step the sampler just calls whichever block owns the current $\sigma$ and re-conditions it. Block count is a memory decision; step count is an accuracy decision; the two are independent, and more steps still track the true curve better.
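The separation between block count and step count can be sketched as a sampler that looks up the owning block at every step. This is an illustration under assumptions, not the paper's sampler: the boundaries in `BOUNDS` are illustrative, each "block" is the exact denoiser for 1-D unit-Gaussian data instead of a trained network, and the geometric $\sigma$ schedule is an arbitrary choice.

```python
import bisect
import numpy as np

S_DATA = 1.0                                   # hypothetical Gaussian data scale
BOUNDS = [0.002, 0.13, 0.30, 0.68, 80.0]       # illustrative block boundaries

def make_block(index):
    """Stand-in for a trained block: the exact denoiser for Gaussian data."""
    def block(z, sigma):
        return z * S_DATA**2 / (S_DATA**2 + sigma**2)
    block.index = index
    return block

BLOCKS = [make_block(b) for b in range(len(BOUNDS) - 1)]

def owner(sigma):
    """Index of the block whose noise band contains sigma."""
    i = bisect.bisect_right(BOUNDS, sigma) - 1
    return min(max(i, 0), len(BLOCKS) - 1)

def generate(z, n_steps):
    """Euler-sample from the top of the range to the bottom in n_steps steps."""
    sigmas = np.geomspace(BOUNDS[-1], BOUNDS[0], n_steps + 1)
    calls = []
    for s_prev, s_next in zip(sigmas[:-1], sigmas[1:]):
        block = BLOCKS[owner(s_prev)]          # pick the block by noise level
        d = block(z, s_prev)                   # its denoised guess
        z = z + (s_prev - s_next) / s_prev * (d - z)   # eq. (5)
        calls.append(block.index)
    return z, calls

for n_steps in (4, 32):
    z, calls = generate(40.0, n_steps)
    print(f"{n_steps:2d} steps -> blocks called: {calls}")
```

Changing `n_steps` changes how many times each block is called but never how many blocks exist, which is the sense in which the two dials are independent.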

What it looks like in practice

By now an objection has been building: the diffusion story put $\mathbf{z}$ in data space, the same shape as the image, but the state running between Transformer blocks is a hidden activation, nothing like a picture. How can a residual block denoise an image that is never fed to it? It doesn't. DiffusionBlocks runs the diffusion in the model's hidden space, not in pixel space. A shared read-in lifts the noised target up to hidden width once at the top, and a shared read-out maps the final hidden state back to whatever the task needs. Both sit outside the $B$ blocks and are the only pieces every block shares. In between, the $\mathbf{z}$ each block corrects is a hidden vector, and the clean target is the hidden representation of the answer. The dimension that must match, block in to block out, is the hidden width, which is exactly why a residual stack qualifies and a U-Net that resizes partway does not.

This is also how the framework swallows a classifier, which has no image to denoise. Take the vision Transformer. The thing being denoised isn't the picture, it's the label: the clean target $\mathbf{y}$ is the (normalized) embedding of the correct class, the image is fed in as the conditioning $\mathbf{x}$, the noised label-embedding rides through the blocks as an extra token, and the read-out turns the denoised embedding into class logits. The block is doing ordinary denoising regression on a label embedding while looking at the image. The next-token Llama works the same way: the noise lives in the continuous embedding space the model already has, never on the discrete token ids, and the read-out maps the denoised embedding back to a token. You're never adding Gaussian noise to a word or a pixel; you're denoising a continuous representation and decoding it at the end.

Let me make all of that concrete. Say the base network is a 12-layer Transformer over a sequence of $n$ tokens, each a $d$-dimensional vector, so an activation is a tensor of shape $n \times d$. Pick $B = 4$, so each block is 3 consecutive layers with its own parameters $\theta_1,\dots,\theta_4$. Take the noise range $[\sigma_{\min},\sigma_{\max}] = [0.002,\ 80]$ from EDM; the equi-probability boundaries land near $\{0.002,\ 0.13,\ 0.30,\ 0.68,\ 80\}$, so the blocks own:

block 1 → $\sigma \in [0.68,\ 80]$ (the heavy-noise end; coarse layout),
block 2 → $[0.30,\ 0.68]$,
block 3 → $[0.13,\ 0.30]$,
block 4 → $[0.002,\ 0.13]$ (nearly clean; fine detail),

and each band carries exactly a quarter of the training mass.
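The "quarter of the mass" claim is easy to check numerically, and the same machinery gives one way to draw a noise level from a single band. A sketch under stated assumptions: $P_{\text{mean}} = -1.2$ and $P_{\text{std}} = 1.2$ are EDM's usual defaults, the band edges are the rounded values listed above (so each mass comes out close to, not exactly, 0.25), and `band_mass` and `sample_noise_level` are illustrative names rather than the paper's code.

```python
import math
import random
from statistics import NormalDist

# EDM's usual log-normal settings; treated as assumptions here.
P_MEAN, P_STD = -1.2, 1.2
LOG_SIGMA = NormalDist(mu=P_MEAN, sigma=P_STD)

# Rounded band edges from the text, block 1 (high noise) to block 4.
BANDS = {1: (0.68, 80.0), 2: (0.30, 0.68), 3: (0.13, 0.30), 4: (0.002, 0.13)}


def band_mass(lo: float, hi: float) -> float:
    """Probability that a training draw of sigma lands in [lo, hi]."""
    return LOG_SIGMA.cdf(math.log(hi)) - LOG_SIGMA.cdf(math.log(lo))


def sample_noise_level(band, rng=random) -> float:
    """Draw sigma from the log-normal restricted to one band.

    Inverse-CDF sampling: pick a uniform probability between the band's
    two CDF values, then map it back to a noise level.
    """
    lo, hi = band
    q_lo = LOG_SIGMA.cdf(math.log(lo))
    q_hi = LOG_SIGMA.cdf(math.log(hi))
    u = rng.uniform(q_lo, q_hi)
    return math.exp(LOG_SIGMA.inv_cdf(u))


masses = {b: band_mass(lo, hi) for b, (lo, hi) in BANDS.items()}
sigma_for_block_2 = sample_noise_level(BANDS[2], random.Random(0))
```

Every entry of `masses` lands within a couple of percentage points of 0.25; the small differences come from rounding the edges to two digits.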

Now one training step, say for block 2. Draw a clean target $\mathbf{y}$ from the data (shape $n \times d$) and whatever conditioning $\mathbf{x}$ goes with it. Sample a noise level from block 2's band, say $\sigma = 0.45$. Build the noised input $\mathbf{z} = \mathbf{y} + 0.45\,\boldsymbol{\epsilon}$ with fresh Gaussian $\boldsymbol{\epsilon}$ (same $n \times d$ shape). Feed $(\mathbf{x}, \mathbf{z})$ into block 2, conditioned on $\sigma = 0.45$, and it returns a prediction $\hat{\mathbf{y}}$ of the clean target. The loss is a single number, $w(0.45)\,\lVert \hat{\mathbf{y}} - \mathbf{y}\rVert^2$. Backprop touches only $\theta_2$ (three layers' weights and activations); blocks 1, 3, and 4 are not in the graph at all. Step the optimizer and repeat. The other three blocks are trained the exact same way on their own bands, in any order, even on separate machines:

# one optimizer step for block b (others independent)
x, y  = sample_batch()                # y: clean target [n,d]
sigma = sample_noise_level(band[b])   # block b's band
z     = y + sigma * randn_like(y)     # noised input
y_hat = block[b](x, z, sigma)         # predict clean y
loss  = w(sigma) * mse(y_hat, y)
opt[b].zero_grad()                    # clear last step's grads
loss.backward()                       # grad -> block[b] only
opt[b].step()                         # ~L/B layers in memory

(The listing above omits one thing: in practice the bare network is wrapped in EDM's $\sigma$-dependent preconditioning, input and output scalings $c_\text{in}, c_\text{skip}, c_\text{out}$ plus a transformed $\sigma$ fed to the conditioning, which keeps the regression target unit-scaled across a huge range of noise levels. The plain MSE above is the idea; the preconditioned version is what actually trains well.)
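For reference, here is roughly what that wrapper looks like, using the scalings as EDM defines them. Treat it as a scalar toy, not the paper's implementation: $\sigma_{\text{data}} = 0.5$ is EDM's default and an assumption here, and `raw_net` is a hypothetical stand-in for one block's bare network.

```python
import math

SIGMA_DATA = 0.5  # EDM's default data scale; an assumption for this sketch


def edm_preconditioning(sigma: float, sigma_data: float = SIGMA_DATA):
    """Return (c_skip, c_out, c_in, c_noise) as defined in EDM."""
    total = sigma ** 2 + sigma_data ** 2
    c_skip = sigma_data ** 2 / total
    c_out = sigma * sigma_data / math.sqrt(total)
    c_in = 1.0 / math.sqrt(total)
    c_noise = math.log(sigma) / 4.0
    return c_skip, c_out, c_in, c_noise


def edm_loss_weight(sigma: float, sigma_data: float = SIGMA_DATA) -> float:
    """EDM's w(sigma), which equals 1 / c_out**2."""
    return (sigma ** 2 + sigma_data ** 2) / (sigma * sigma_data) ** 2


def denoise(raw_net, z: float, sigma: float) -> float:
    """Wrap a bare network F into the denoiser D (scalar toy version).

    raw_net(scaled_input, c_noise) is a hypothetical stand-in for a block.
    """
    c_skip, c_out, c_in, c_noise = edm_preconditioning(sigma)
    return c_skip * z + c_out * raw_net(c_in * z, c_noise)
```

The weight is the reciprocal of $c_\text{out}^2$, so the loss on the bare network's output stays on the same scale at every noise level. That is the "unit-scaled" property in code form.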

Generation runs the same blocks in sequence, walking the noise level down from $\sigma_{\max}$ to $\sigma_{\min}$. Each block takes the running state, asks its denoiser where the clean data is, and takes one Euler step of that size toward it:

# generation: one pass down the blocks (high -> low noise)
z = sigma_max * randn(n, d)        # start: pure noise
for b in [1, 2, 3, 4]:             # block 1 = highest noise
    s_prev, s_next = band[b]
    y_hat = block[b](x, z, s_prev) # the guess (= D)
    step  = (s_prev - s_next) / s_prev
    z = z + step * (y_hat - z)     # Euler step (5)
return z

That is the complete system: $B$ small denoisers, each trained alone against real data with fresh noise, chained at inference into one residual network. Same forward pass everyone already runs, a quarter of the training memory.
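If you want to watch the relay run, the smallest possible toy is a dataset with a single point, where the ideal denoiser is known in closed form: it always answers with that point. The sketch below runs the one-step-per-block loop on a scalar state. `TARGET`, `toy_denoiser`, and `generate` are invented for illustration, and the same stand-in plays all four blocks.

```python
# Band edges from the text as (s_prev, s_next) per block, high to low noise.
BANDS = [(80.0, 0.68), (0.68, 0.30), (0.30, 0.13), (0.13, 0.002)]
TARGET = 1.5  # toy dataset: every clean sample equals this one number


def toy_denoiser(z: float, sigma: float) -> float:
    """Ideal denoiser for a one-point dataset: always guess TARGET.

    A hypothetical stand-in for a trained block; it ignores z and sigma.
    """
    return TARGET


def generate(z_start: float, blocks) -> float:
    """One Euler step per block, from sigma_max down to sigma_min."""
    z = z_start
    for block, (s_prev, s_next) in zip(blocks, BANDS):
        y_hat = block(z, s_prev)             # the block's clean guess
        step = (s_prev - s_next) / s_prev    # fraction of the gap to close
        z = z + step * (y_hat - z)           # state plus correction
    return z


z0 = TARGET + 80.0 * 0.7    # a fixed "noisy" start, 0.7 * sigma_max away
sample = generate(z0, [toy_denoiser] * len(BANDS))
```

Each step shrinks the distance to the target by the factor `s_next / s_prev`, so after all four blocks the leftover offset is the starting offset times $\sigma_{\min}/\sigma_{\max} = 0.002/80$. In this degenerate case one coarse step per block is already exact; with real data the denoiser's guess moves as $\sigma$ falls, and that is where extra steps earn their keep.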

Cutting the noise range fairly

How do you slice $[\sigma_{\min},\sigma_{\max}]$ into $B$ intervals? The lazy answer is to cut $\sigma$ into equal-width pieces. That's a mistake, and seeing why draws on everything above.

Not all noise levels are equally important, and not all are equally active. During training you don't sample $\sigma$ uniformly. Following EDM, you sample $\log\sigma$ from a normal distribution, specifically $\log\sigma \sim \mathcal{N}(P_{\text{mean}}, P_{\text{std}}^2)$, so $P_{\text{mean}}$ and $P_{\text{std}}$ are nothing more than the mean and spread of that bell curve. It piles most of the probability mass on the middle noise levels, and there's a reason. The extremes are boring: at very high $\sigma$ everything is basically noise and the best you can do is predict the data mean, and at very low $\sigma$ the input is basically clean and there's little to fix. The interesting decisions, where the broad strokes of an image resolve into actual structure, happen in the middle.

So you want each block to shoulder an equal share of the work, which means an equal share of the probability mass, not an equal slice of the axis. The accounting is training examples: every gradient step draws its $\sigma$ from that bell curve, so a block's share of probability mass is literally its share of the data it will ever see. Equal-width slices would dump nearly all the samples on one block while the rest sit starved of gradient signal, guarding noise levels that are almost never drawn. DiffusionBlocks picks the boundaries so each of the $B$ blocks owns exactly $1/B$ of the distribution. Since $\log\sigma$ is Gaussian, "equal area under the bell curve" has a closed form via the inverse normal CDF $\Phi^{-1}$, which just reads off the noise level at evenly spaced probabilities $q_b$:

$$\sigma_b = \exp\!\Big(P_{\text{mean}} + P_{\text{std}}\,\Phi^{-1}(q_b)\Big), \qquad q_b = q_{\min} + \tfrac{b}{B}\,(q_{\max}-q_{\min}) \tag{8}$$
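Equation (8) translates almost line for line into code. A minimal sketch using only the standard library: the defaults $P_{\text{mean}} = -1.2$, $P_{\text{std}} = 1.2$ and the range $[0.002, 80]$ are EDM's usual settings, assumed here, and I take $q_{\min}$ and $q_{\max}$ to be the CDF values at $\sigma_{\min}$ and $\sigma_{\max}$, which is my reading rather than something the equation states. The function name is illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # provides Phi (cdf) and its inverse (inv_cdf)


def equi_probability_boundaries(num_blocks, p_mean=-1.2, p_std=1.2,
                                sigma_min=0.002, sigma_max=80.0):
    """Boundaries sigma_0 < ... < sigma_B with equal mass per band.

    The defaults are EDM's usual settings, assumed here for illustration.
    q_min and q_max are taken as the CDF values at sigma_min and sigma_max.
    """
    q_min = STD_NORMAL.cdf((math.log(sigma_min) - p_mean) / p_std)
    q_max = STD_NORMAL.cdf((math.log(sigma_max) - p_mean) / p_std)
    boundaries = []
    for b in range(num_blocks + 1):
        q_b = q_min + (b / num_blocks) * (q_max - q_min)
        boundaries.append(math.exp(p_mean + p_std * STD_NORMAL.inv_cdf(q_b)))
    return boundaries


edges = equi_probability_boundaries(4)
```

Note that the index $b$ here counts up from the low-noise end, whereas the block numbering in the walkthrough starts from the high-noise end.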

The consequence is the opposite of what you might first guess. Because the mass sits in the middle, equal-mass slices are narrow in the dense middle and wide out in the sparse tails. With the standard settings and the $\sigma$ range the paper uses, $B=4$ gives you boundaries near $\{0.002, 0.13, 0.30, 0.68, 80\}$: the two middle blocks each span only a factor of about $2.3\times$ in $\sigma$, while the extreme blocks span well over an order of magnitude (here roughly $67\times$ and $118\times$, the exact figures depending on where you clamp the ends), and yet every block carries exactly a quarter of the mass. So the crowded middle, where the broad strokes resolve into structure, gets several narrow, specialized blocks, and the quiet "barely started" and "almost done" ends get one wide block apiece.

Drag $B$ and watch the equal-area slices fall where they fall. The gray ticks show the naive equal-width-in-$\sigma$ cuts, which jam almost every boundary up at the high end and would leave most blocks fighting over noise nobody cares about:

Figure 7 · equi-probability partition

Noise levels are sampled from a log-normal (most mass at the middle σ). DiffusionBlocks slices it into B regions of equal area (equal probability mass) via the inverse normal CDF. The busy middle gets narrow, specialized blocks; the quiet tails get wide ones. The gray ticks are the naive uniform-in-σ cuts.
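How lopsided are the gray ticks, exactly? A quick calculation, again assuming EDM's usual $P_{\text{mean}} = -1.2$ and $P_{\text{std}} = 1.2$ and the range $[0.002, 80]$ (the function name is mine): cut $\sigma$ into equal-width pieces and ask what share of training draws each piece receives.

```python
import math
from statistics import NormalDist

# EDM's usual log-normal settings, assumed here for illustration.
LOG_SIGMA = NormalDist(mu=-1.2, sigma=1.2)


def uniform_width_masses(num_blocks, sigma_min=0.002, sigma_max=80.0):
    """Share of training draws per band when sigma is cut into equal widths.

    Shares are renormalized to the mass inside [sigma_min, sigma_max].
    The list runs from the lowest-noise band to the highest.
    """
    width = (sigma_max - sigma_min) / num_blocks
    cuts = [sigma_min + i * width for i in range(num_blocks + 1)]
    cdf = [LOG_SIGMA.cdf(math.log(c)) for c in cuts]
    total = cdf[-1] - cdf[0]
    return [(hi - lo) / total for lo, hi in zip(cdf, cdf[1:])]


shares = uniform_width_masses(4)
```

For $B=4$ the first cut sits near $\sigma = 20$, already far out in the upper tail of the log-normal, so the lowest band takes well over 99% of the draws and the other three split what is left.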

The paper calls this equi-probability partitioning, and it's what keeps the blocks balanced instead of overloading a few. On CIFAR-10 it takes the FID (Fréchet Inception Distance, the standard image-generation quality score, where lower is better) from $43.5$ with uniform slicing down to $38.0$ . The same balancing buys something subtler: by handing each block an equally hard, equally data-rich job, you've accidentally built a curriculum, in the curriculum-learning sense that each block faces a task of controlled, balanced difficulty rather than a mix of trivial and impossible cases, which tends to make the optimization smoother.

The boundaries the partition draws are also where one block hands off to the next at generation, and a hard seam there is exactly what would show up as an artifact, so the paper does not let the bands butt up cleanly. Each block trains on a noise range stretched about 10% past its own boundaries on each side (the factor grows for the wide tail blocks). A single noised sample shows what that buys. Take the $B=4$ boundaries near $\{0.002,\ 0.13,\ 0.30,\ 0.68,\ 80\}$ and suppose a training draw lands at $\sigma = 0.31$, a hair above the boundary between block 3 (which owns $[0.13,\ 0.30]$) and block 2 (which owns $[0.30,\ 0.68]$). Strictly, that sample belongs to block 2, and block 2 trains on it as usual. But 0.31 is also inside block 3's ~9% overhang above its own upper edge of 0.30, so block 3 draws and denoises samples at that level too. Both blocks have practiced on noise around 0.31 by the time they ship.

At generation, block 2 finishes its band and the running state arrives at block 3 carrying whatever imperfection block 2 left. Block 3 is not meeting that noise level for the first time: it has trained on inputs from just outside its slice, so the handoff lands inside the range it has already trained on and the relay shows no seam. The overlap is a small insurance premium, a modest amount of extra training coverage on each side of every boundary, against the one place the independently trained blocks have to meet.
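The text above gives the size of the overhang but not the rule that produces it, so the following is a guess at a rule that fits the numbers, not the paper's formula. It assumes each edge is pushed outward by a fixed fraction (0.1 here) of the band's own width in $\log\sigma$. That reproduces both observations: roughly 9% for a narrow middle band, and a much larger factor for a wide tail band. The function name and the `overlap` parameter are hypothetical, and clamping to $[\sigma_{\min}, \sigma_{\max}]$ is left out.

```python
import math


def stretched_band(lo: float, hi: float, overlap: float = 0.1):
    """Widen [lo, hi] by a fraction of its own width in log-sigma.

    Hypothetical rule for illustration: each edge moves outward by
    overlap * log(hi / lo), so wider bands get a larger stretch factor.
    """
    pad = overlap * math.log(hi / lo)
    factor = math.exp(pad)               # multiplicative stretch per side
    return lo / factor, hi * factor, factor


# Block 3 from the running example owns [0.13, 0.30].
lo3, hi3, factor3 = stretched_band(0.13, 0.30)
trains_on_031 = lo3 <= 0.31 <= hi3

# The high-noise tail block owns [0.68, 80] and stretches far more.
_, _, factor_tail = stretched_band(0.68, 80.0)
```

Under this rule block 3's upper edge moves from 0.30 to a little under 0.33, so the draw at $\sigma = 0.31$ falls inside its training range, as the walkthrough describes.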

Block-wise matches end-to-end

Given the history, the main result is the surprising one. Trained block-wise, with gradients flowing through only one block at a time, DiffusionBlocks matches end-to-end backpropagation. Not close-for-a-memory-starved-method. Matches. And it does so across a diverse spread of architectures: vision Transformers for classification, DiT-style models for image generation, masked diffusion language models, and even vanilla autoregressive language models, which were never designed to denoise anything.

That last one deserves attention. An autoregressive Llama-style Transformer predicts the next token; it has no notion of a noise level. DiffusionBlocks converts it anyway (augment the input, add noise conditioning, slice it into blocks, train each as a denoiser) and it works, reaching comparable quality while only ever training three layers at a time, since the model has twelve layers in $B=4$ blocks. The framework doesn't care that the architecture wasn't born for it. As long as there are residual connections, there's an ODE underneath, and an ODE can be sliced.

The clearest demonstration is on recurrent-depth models, networks like Huginn that apply the same block over and over, looping to "think longer." Training those leans on backpropagation through time, and even the affordable version truncates the loop (Huginn backprops through 8 of its 32 iterations), with you still paying for every step you keep. But a loop of $\mathbf{z}_k = \mathbf{z}_{k-1} + f_\theta(\mathbf{z}_{k-1})$ is exactly our state + correction shape; it's already a discretized ODE. DiffusionBlocks trains it with a single forward pass per step instead, roughly a 10× cut in training compute, and comes out ahead on the benchmark. The arithmetic is direct: backprop-through-time has to keep every looped iteration it trains through alive at once and run a backward pass over all of them, while DiffusionBlocks trains each iteration as its own denoiser against the fixed clean target, one forward pass and a local backward with nothing upstream held live. Many cheap independent steps instead of one expensive coupled chain, which is where the order of magnitude comes from.

There is one last result, and the paper itself admits it is not fully explained. Sometimes block-wise training doesn't just match end-to-end, it beats it (on ImageNet, a DiT-L gets FID $10.6$ block-wise versus $12.1$ end-to-end). It's the same architecture with the same parameter count, only the training changes, so the gain does not come from a larger model; "matches" and "beats" mean equal or better quality at equal capacity, not that block-wise training rediscovers the weights end-to-end would. Why would chopping a network into independently-trained pieces ever help? The authors don't claim to know; they offer a hypothesis, and I think it's the right one. Equi-probability partitioning hands each block a task of calibrated, balanced difficulty, that curriculum again, and ties each block directly to the clean target through its own denoising objective instead of a long, noisy chain of gradients from the output. That's a different optimization landscape, and apparently sometimes a friendlier one. Whether that intuition becomes a theorem is, as they say, future work.

The limits are real. The trick needs each block's input and output to have matching dimensions, which is less an extra rule than a direct echo of the Euler step itself: $\mathbf{z}_{\text{next}} = \mathbf{z} + \text{correction}$ only typechecks when $\mathbf{z}$ and $\mathbf{z}_{\text{next}}$ live in the same space. A classic U-Net deliberately changes resolution between stages, so its blocks break that identity and it doesn't fit yet. And everything here is trained from scratch, which leaves converting an already-trained large model by fine-tuning as the obvious, tantalizing next step.

Each residual connection is an Euler step of a diffusion ODE whose velocity comes from a denoiser trained by simple regression, independently at each noise level. Taken to its conclusion, a stubborn decade-old problem comes apart: you train a deep network one slice at a time, give each slice a principled target, and pay for only a fraction of the memory. The residual connections in your Transformer have been diffusion steps all along. This paper is the first to train them that way.

Provenance · Verified against primary literature

EDM (2022): VE noising convention, log-normal noise schedule, and loss weighting.

Song et al. (2021): Score / SDE / probability-flow ODE unification.

Neural ODEs (2018): Residual networks read as discretized continuous dynamics.

Tweedie / Vincent (1956 / 2011): A denoiser is a score estimator.

Correction: The paper writes the Euler-step displacement with the opposite sign. We teach the correct sign and explain the discrepancy.

Questions you might still have

Why does a particle keep changing direction on the way down?
The velocity is (z − D)/σ, and the denoiser D is itself a function of σ. As noise falls, D sharpens from "the global average" to "your specific mode," so the field reorganizes mid-flight.

If a block never sees the next block during training, how do the handoffs line up?
No block is trained against another block’s output. Every block is trained against the same fixed ground truth. Get each noise band right on its own and the global trajectory is automatically right.

Why would chopping a network into pieces ever beat end-to-end?
A hypothesis, not a theorem: equi-probability partitioning hands each block a task of balanced difficulty (a curriculum) and ties it directly to the clean target instead of a long, noisy gradient chain.

Footnotes & further reading

The paper: Shing, Koyama, Akiba, DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation (Sakana AI / University of Tokyo, ICLR 2026). Code.
The VE diffusion conventions, the log-normal noise schedule, and the weighting all come from Karras et al., Elucidating the Design Space of Diffusion-Based Generative Models (EDM).
The score / SDE / probability-flow ODE unification: Song et al., Score-Based Generative Modeling through SDEs.
Residual networks as discretized dynamics: Chen et al., Neural Ordinary Differential Equations, and Haber & Ruthotto, Stable Architectures for Deep Neural Networks.
Tweedie's formula and denoising score matching: Vincent, A Connection Between Score Matching and Denoising Autoencoders (2011), building on a 1956 result of Robbins.
Related block-wise methods: Hinton's Forward-Forward, the prior method that DiffusionBlocks clearly outperforms, and the concurrent, kindred-spirit NoProp, which DiffusionBlocks beats on NoProp's own architecture. The difference is that NoProp stays classification-only, while DiffusionBlocks is general and is the only method to combine continuous time with block-wise training.
