Denoising Diffusion Implicit Models

Reuse a trained diffusion model, and sample it ten to fifty times faster.

Nothing about training changes. DDIM rebuilds only the sampling process, as a family of samplers that all fit the same trained network, and its deterministic member reaches full quality in a few dozen steps instead of a thousand.

Explaining the paper: Denoising Diffusion Implicit Models. Song, Meng, Ermon · Stanford · ICLR 2021 · arXiv:2010.02502

Take a diffusion model that already works, and make it sample in twenty steps instead of a thousand, without touching a single weight.

A denoising diffusion probabilistic model (DDPM) makes beautiful images and is miserably slow. To draw one sample it starts from pure noise and denoises in small steps, and it takes about a thousand of them, each one a full pass through a large network, run strictly one after another. The paper puts a number on it: sampling 50,000 tiny $32\times32$ images from a DDPM takes roughly 20 hours on a single 2080 Ti, against under a minute for a GAN. At $256\times256$ the same batch could take close to a thousand hours. A model you have to babysit for a day to see a few images is hard to put in front of anyone.

The obvious fix is to take fewer, bigger steps. It fails for a plain DDPM, and the reason is exactly what DDIM works around. A DDPM's sampler is derived as the reverse of one specific noising process, a Markov chain of a thousand steps, where each step adds a little Gaussian noise and depends only on the step just before it. Reverse that chain and you get a thousand-step denoiser and nothing shorter, because the recipe is tied to that exact chain. Drop most of the steps and the sampler is no longer inverting the process it was trained on.

DDIM keeps the trained network exactly as is and rebuilds the sampler from a different starting point. The opening it exploits is that the training objective never involved the thousand-step chain in the first place. It only ever scored the network on undoing a single noise level at a time, image by image. That one fact opens up a whole family of samplers that fit the same network, and one of them is deterministic and happy to take enormous strides.

A few ideas get us there: the training loss depends only on the per-step marginals; that frees a whole family of noising processes with the same marginals; one member of the family is deterministic; and a deterministic sampler both skips steps cheaply and doubles as an encoder. Each is small on its own.

What a diffusion model already gives us

DDIM inherits its network and its training loss from DDPM, so a quick recap grounds everything that follows. A diffusion model has a fixed forward process that adds Gaussian noise to a clean image $\mathbf{x}_0$ in $T$ steps until nothing is left but static. The one property that makes the math tractable is that you can jump to any noise level in a single shot: the distribution of the noised image $\mathbf{x}_t$ given the clean one has a closed form,

$$
q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\big(\mathbf{x}_t;\ \sqrt{\alpha_t}\,\mathbf{x}_0,\ (1-\alpha_t)\mathbf{I}\big), \qquad \mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon} \tag{1}
$$

with $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. Read it as a fixed power budget: $\sqrt{\alpha_t}$ turns the signal down and $\sqrt{1-\alpha_t}$ turns the static up, geared so that, for data scaled to unit variance, the total variance stays at one. As $t$ grows, $\alpha_t$ falls from near one toward zero and the image dissolves into a standard Gaussian.
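As a minimal sketch of equation (1), assuming NumPy and an illustrative value of $\alpha_t$ (the helper name `q_sample` is ours, not the paper's), the one-shot jump and its variance bookkeeping might look like this:

```python
import numpy as np

def q_sample(x0, alpha_t, rng):
    """Jump straight from a clean x0 to one noise level, as in equation (1).

    alpha_t is the cumulative signal-retention factor in (0, 1].
    Returns the noised sample and the noise that was mixed in.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(200_000)        # stand-in "data" with unit variance
x_t, eps = q_sample(x0, alpha_t=0.3, rng=rng)

# Signal power alpha_t plus noise power (1 - alpha_t) should sum to about one.
print(round(float(x_t.var()), 2))
```

The stand-in data here is unit-variance Gaussian noise, chosen only so the budget is easy to check; real images would be rescaled first.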

One notation warning, and it is the single biggest source of confusion in this paper. Here $\alpha_t$ is the cumulative product of the per-step signal-retention factors, so it is what the original DDPM paper and most tutorials write as $\bar\alpha_t$ (alpha-bar). DDPM's per-step $\alpha_t = 1-\beta_t$ corresponds here to the ratio $\alpha_t/\alpha_{t-1}$. The paper only spells this out in an appendix, with no footnote near the equations, so a reader arriving from DDPM sees a formula that looks wrong until they catch the redefinition. Every equation below uses this cumulative convention: $\alpha_t$ shrinks from $\alpha_0 = 1$ down toward zero.
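To make the notation map concrete, here is a small sketch that builds the cumulative $\alpha_t$ from DDPM-style per-step $\beta_t$. The linear schedule from $10^{-4}$ to $0.02$ over 1000 steps is the one used in the DDPM paper and is an assumption here; the variable names are ours.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)          # DDPM per-step noise beta_1..beta_T (assumed schedule)
alpha_step = 1.0 - betas                    # what DDPM calls alpha_t
alpha_bar = np.cumprod(alpha_step)          # what DDPM calls alpha-bar_t

# This article's convention: alpha[t] is the cumulative product, with alpha[0] = 1.
alpha = np.concatenate(([1.0], alpha_bar))  # indices 0..T

# DDPM's per-step factor comes back as the ratio alpha_t / alpha_{t-1}.
ratio = alpha[1:] / alpha[:-1]
print(alpha[0], alpha[-1])
```

Indexing `alpha` from 0 to T, with `alpha[0] = 1`, keeps the clean end of the schedule addressable without a special case.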

The network is trained to look at a noised image and predict the noise that was added to it. Write it $\boldsymbol{\epsilon}_\theta^{(t)}(\mathbf{x}_t)$. The loss is the plainest thing imaginable, a squared error between the predicted and the actual noise, averaged over clean images, noise levels, and noise draws:

$$
L(\boldsymbol{\epsilon}_\theta) = \sum_{t=1}^{T} \mathbb{E}_{\mathbf{x}_0,\,\boldsymbol{\epsilon}}\Big[\ \big\lVert\, \boldsymbol{\epsilon}_\theta^{(t)}\!\big(\sqrt{\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}\big) - \boldsymbol{\epsilon}\,\big\rVert_2^2\ \Big] \tag{2}
$$

(The paper writes this as $L_{\mathbf{1}}$, where the subscript is the all-ones weight vector, one weight per noise level, all set to one. It does not mean an L1 or absolute-value loss; the error inside is squared. This is the same objective a noise-conditional score network minimizes, which is why score-based and diffusion models are two dialects of one idea.)

Predicting the noise is the same as predicting the clean image, because equation (1) is invertible: rearrange it and the network's guess of the clean $\mathbf{x}_0$ is

$$
\hat{\mathbf{x}}_0(\mathbf{x}_t) = \frac{\mathbf{x}_t - \sqrt{1-\alpha_t}\ \boldsymbol{\epsilon}_\theta^{(t)}(\mathbf{x}_t)}{\sqrt{\alpha_t}} \tag{3}
$$

Predicted noise and predicted image carry the same information, and we will use whichever is convenient. That is all DDIM inherits: a network that denoises at every level, and a loss (2) that trained it. Everything DDIM adds happens after training, in how you turn that network into a sampler.
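The two inherited pieces can be sketched in a few lines. This is an illustration under assumptions, not the paper's code: `predict_x0` and `loss_term` are our names, the "images" are random 8-pixel vectors, and `zero_model` is a deliberately bad stand-in for a trained network.

```python
import numpy as np

def predict_x0(x_t, eps_pred, alpha_t):
    """Equation (3): turn a noise prediction into a clean-image prediction."""
    return (x_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)

def loss_term(eps_model, x0, alpha_t, t, rng):
    """One Monte Carlo estimate of a single-level term of equation (2)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * eps
    return float(np.mean(np.sum((eps_model(x_t, t) - eps) ** 2, axis=-1)))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))            # four toy "images" of 8 pixels each
alpha_t = 0.5
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1.0 - alpha_t) * eps

# With the true noise in hand, equation (3) recovers the clean image exactly.
x0_hat = predict_x0(x_t, eps, alpha_t)

# A model that always predicts zero noise pays roughly one unit per pixel.
zero_model = lambda x, t: np.zeros_like(x)
print(loss_term(zero_model, x0, alpha_t, t=500, rng=rng))
```

Note that `loss_term` touches only one noise level at a time, which is the property the next section leans on.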

The loss only sees the marginals

Look hard at the loss (2). Every term takes a clean image, jumps straight to one noise level with equation (1), and penalizes the network's error in undoing that single jump. No term chains two noisy images together, and none involves how $\mathbf{x}_t$ and $\mathbf{x}_{t-1}$ relate along the way. In the language of probability, the loss depends only on the marginals $q(\mathbf{x}_t \mid \mathbf{x}_0)$, the distribution of a single noised image, never on the joint distribution of the entire noisy sequence.

Everything DDIM does starts from that distinction. Think of grading Polaroids against their subjects: the rubric compares each finished photo to the person in it, and never asks in what order the roll was shot. Many different shooting orders produce the same graded photos. In the same way, many different noising processes, with different step-to-step correlations, can share the exact same per-level marginals (1). The DDPM Markov chain is only one of them. Any other process with the same marginals produces the same loss, so it is served by the same trained network, with no retraining.

So the question becomes concrete: what other noising processes share DDPM's marginals, and do any of them reverse into a faster or nicer sampler? The next section builds exactly that family and finds that one member is deterministic.

A family of forward processes

The paper writes down a family of processes, indexed by a vector $\boldsymbol{\sigma} \ge \mathbf{0}$, engineered so that every member keeps the marginals (1) exactly. The design is easiest to read backward, as the reverse step: given a clean $\mathbf{x}_0$ and a noised $\mathbf{x}_t$, the previous, slightly cleaner $\mathbf{x}_{t-1}$ is drawn from

$$
q_{\boldsymbol{\sigma}}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\Big(\sqrt{\alpha_{t-1}}\,\mathbf{x}_0 + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\ \cdot\ \frac{\mathbf{x}_t - \sqrt{\alpha_t}\,\mathbf{x}_0}{\sqrt{1-\alpha_t}},\ \ \sigma_t^2\mathbf{I}\Big) \tag{4}
$$

The mean has two parts. The first, $\sqrt{\alpha_{t-1}}\,\mathbf{x}_0$, is the clean image scaled down to the next noise level. The second reuses the noise already present in $\mathbf{x}_t$: the fraction $(\mathbf{x}_t - \sqrt{\alpha_t}\mathbf{x}_0)/\sqrt{1-\alpha_t}$ is exactly the unit noise that was mixed into $\mathbf{x}_t$, and we lay some of it back down. The variance is a free knob $\sigma_t^2$.

The odd-looking coefficient $\sqrt{1-\alpha_{t-1}-\sigma_t^2}$ is the heart of the construction, and it is chosen for one reason: the two sources of spread have to add up to the right total. The noise carried over contributes variance $1-\alpha_{t-1}-\sigma_t^2$, and the fresh noise contributes $\sigma_t^2$. They sum by design:

$$
\underbrace{(1-\alpha_{t-1}-\sigma_t^2)}_{\text{carried-over noise}} + \underbrace{\sigma_t^2}_{\text{fresh noise}} = 1-\alpha_{t-1}
$$

and $1-\alpha_{t-1}$ is precisely the variance of the marginal (1) at level $t-1$. So whatever you set $\sigma_t$ to, the marginal comes out unchanged. (The one restriction is that $\sigma_t^2 \le 1-\alpha_{t-1}$, so the square root stays real. It is not printed as a rule in the paper; it is what the construction requires.) This is where $\boldsymbol{\sigma}$ earns its name as a stochasticity dial. Turn it up and each step re-rolls more of its noise fresh; turn it to zero and each step reuses only the noise already in hand, which makes $\mathbf{x}_{t-1}$ a deterministic function of $\mathbf{x}_t$ and $\mathbf{x}_0$.
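The claim that the marginal survives any choice of $\sigma_t$ is easy to check numerically. The sketch below draws $\mathbf{x}_t$ from the marginal (1), pushes it through equation (4), and measures the mean and variance at level $t-1$. The $\alpha$ values, the scalar "image" and the function name are illustrative assumptions.

```python
import numpy as np

def reverse_step_given_x0(x_t, x0, a_t, a_prev, sigma, rng):
    """Draw x_{t-1} from equation (4), given both x_t and the clean x0."""
    unit_noise = (x_t - np.sqrt(a_t) * x0) / np.sqrt(1.0 - a_t)
    mean = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev - sigma**2) * unit_noise
    return mean + sigma * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
a_t, a_prev = 0.30, 0.45                    # illustrative cumulative alphas
x0 = np.full(200_000, 2.0)                  # one fixed clean value, repeated
x_t = np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * rng.standard_normal(x0.shape)

stats = {}
for sigma in (0.0, 0.3, 0.7):               # all satisfy sigma^2 <= 1 - a_prev = 0.55
    x_prev = reverse_step_given_x0(x_t, x0, a_t, a_prev, sigma, rng)
    stats[sigma] = (float(x_prev.mean()), float(x_prev.var()))
    # Target marginal at t-1: mean sqrt(a_prev) * x0, variance 1 - a_prev.
    print(f"sigma={sigma:.1f}  mean={stats[sigma][0]:.3f}  var={stats[sigma][1]:.3f}")
```

Whatever the dial is set to, the printed mean and variance should sit near $\sqrt{0.45}\cdot 2$ and $0.55$, up to Monte Carlo error.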

The figure makes this checkable. The amber corridor is the marginal $q(\mathbf{x}_t\mid\mathbf{x}_0)$, fixed for every setting. The teal curves are sample paths of the family. Slide $\eta$, which scales $\boldsymbol{\sigma}$ (the exact definition comes in the next section): at $\eta=0$ the paths are smooth and pinned down by their noise endpoint, at $\eta=1$ they thrash like random walks, and at every setting they stay inside the same corridor and land on the same clean point.

Figure 1 · same marginals, different joints

The amber corridor is the marginal q(x_t | x₀): the only thing the training loss depends on, and it never moves. The teal paths are samples of the non-Markovian family. Drag η from 0 (smooth, deterministic given the endpoint) to 1 (a jagged random walk). Different joints, one shared marginal, so one trained network fits them all.

One question remains, and the paper answers it with a theorem. Is this really a valid family, or does changing $\boldsymbol{\sigma}$ actually require training a different network? The forward process here is no longer Markovian, since $\mathbf{x}_t$ is allowed to depend on both $\mathbf{x}_{t-1}$ and $\mathbf{x}_0$, so the naive worry is that its reverse needs its own training run. The theorem says no: for any $\boldsymbol{\sigma}$, the variational objective $J_{\boldsymbol{\sigma}}$ equals a reweighted version of the DDPM objective (2), with some positive weight $\gamma_t$ on each noise level, plus a constant.

$$
\forall\,\boldsymbol{\sigma}>\mathbf{0}\ \ \exists\,\boldsymbol{\gamma}>\mathbf{0},\ C: \quad J_{\boldsymbol{\sigma}}(\boldsymbol{\epsilon}_\theta) = L_{\boldsymbol{\gamma}}(\boldsymbol{\epsilon}_\theta) + C
$$

The constant does not move the minimizer. Neither do the weights, provided each noise level has its own parameters: every term of the sum is then minimized separately, so any positive weighting, including the all-ones weighting of (2), has the same optimum. The network that was best for DDPM is therefore best for every member of the family. One training run, and every $\boldsymbol{\sigma}$ is yours to pick at sampling time. (The real model shares one network across all levels, so in practice this is an excellent approximation rather than an exact identity, and the transfer across the family is borne out empirically.) With that permission slip in hand, we can go shopping for a sampler.

One sampling step, three pieces

To sample, replace the unknown clean image $\mathbf{x}_0$ in the reverse step (4) with the network's prediction $\hat{\mathbf{x}}_0$ from equation (3). Writing it out gives the master update, the one equation the method runs on:

$$
\mathbf{x}_{t-1} = \sqrt{\alpha_{t-1}}\ \underbrace{\hat{\mathbf{x}}_0(\mathbf{x}_t)}_{\text{predicted image}} + \underbrace{\sqrt{1-\alpha_{t-1}-\sigma_t^2}\ \boldsymbol{\epsilon}_\theta^{(t)}(\mathbf{x}_t)}_{\text{direction back toward }\mathbf{x}_t} + \underbrace{\sigma_t\,\boldsymbol{\epsilon}_t}_{\text{fresh noise}} \tag{5}
$$

Three pieces. Estimate where the clean image is, and step toward it. Add back part of the noise you came in with, pointing along the exact direction you arrived from. Then optionally sprinkle in fresh random noise. The middle term is often misread as random; it is not. It is built from the network's predicted noise, a fixed vector, so it is fully deterministic. Only the third term, $\sigma_t\boldsymbol{\epsilon}_t$, is a coin flip.

Set $\sigma_t = 0$ for every step and the coin flip disappears. What is left is a deterministic map from $\mathbf{x}_t$ to $\mathbf{x}_{t-1}$ , and running it from pure noise all the way down gives a sample with no randomness after the first draw. The paper names this deterministic sampler the denoising diffusion implicit model: implicit in the technical sense of generating samples through a fixed procedure from a latent variable, the way a GAN or a normalizing flow does, rather than through an explicit step-by-step probability density. The sampling formula is perfectly explicit; the name points at the deterministic, latent-to-image character, not at any hidden math.

The figure takes one step apart. The noise budget at level $t-1$ is fixed at $\sqrt{1-\alpha_{t-1}}$; the dial $\eta$ only splits it between the deterministic direction and fresh noise. Concretely, suppose $\alpha_t = 0.30$ and $\alpha_{t-1} = 0.45$. The budget is $\sqrt{1-0.45}\approx0.74$. At $\eta=0$ all of it goes into the deterministic direction. At $\eta=1$ the noise is $\sigma\approx0.51$ and the direction shrinks to $\approx0.54$, and indeed $0.54^2 + 0.51^2 \approx 0.55 = 1-\alpha_{t-1}$. Same budget, different split.
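The arithmetic above can be reproduced with a short sketch. It uses the paper's $\eta$-to-$\sigma$ mapping, equation (6) in this article; the function names are ours.

```python
import math

def sigma_eta(a_t, a_prev, eta):
    """Fresh-noise scale sigma_t(eta), following equation (6)."""
    return eta * math.sqrt((1 - a_prev) / (1 - a_t)) * math.sqrt(1 - a_t / a_prev)

def budget_split(a_t, a_prev, eta):
    """Return (deterministic coefficient, fresh-noise coefficient) at level t-1."""
    sig = sigma_eta(a_t, a_prev, eta)
    direction = math.sqrt(1 - a_prev - sig**2)
    return direction, sig

for eta in (0.0, 0.5, 1.0):
    d, s = budget_split(0.30, 0.45, eta)
    # The squared coefficients always add up to the fixed budget 1 - a_prev.
    print(f"eta={eta:.1f}  direction={d:.2f}  noise={s:.2f}  total={d*d + s*s:.2f}")
```

Only the split changes with `eta`; the `total` column stays at the budget $1-\alpha_{t-1}$.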

Figure 2 · the master step, taken apart

From the noisy xₜ the network predicts the clean x̂₀. The next state is a base plus a teal deterministic step toward xₜ, plus a violet cloud of fresh noise. Slide η: at 0 the cloud collapses to one point (DDIM); toward 1 the deterministic step shrinks and the cloud fills in (DDPM). The bar shows the fixed budget being split.

To compare deterministic and stochastic sampling on equal footing, the paper picks a one-parameter slice of the $\boldsymbol{\sigma}$ family, controlled by a single scalar $\eta \ge 0$:

$$
\sigma_{t}(\eta) = \eta\,\sqrt{\frac{1-\alpha_{t-1}}{1-\alpha_{t}}}\,\sqrt{1-\frac{\alpha_{t}}{\alpha_{t-1}}} \tag{6}
$$

At $\eta = 0$ the noise vanishes and you get DDIM. At $\eta = 1$ the variance is exactly DDPM's, and the forward process becomes Markovian again, so you recover the original stochastic sampler. Everything in between is a valid sampler off the same network. (The paper also tries one more variance it calls $\hat\sigma$, an even larger choice from the original DDPM code; it is a separate setting, not a point on the $\eta$ line, and it behaves very differently, as the results will show.) So a single dial runs from fully deterministic to fully stochastic, and we can ask which end samples better when steps are scarce.
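A quick numerical check of the two endpoints, as a sketch: at $\eta=1$, equation (6) should reproduce DDPM's posterior variance (often written $\tilde\beta_t$ in DDPM notation), and $\hat\sigma_t = \sqrt{1-\alpha_t/\alpha_{t-1}}$ should sit above it. The linear $\beta$ schedule is an assumption borrowed from the DDPM paper, and the helper name is ours.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)                      # assumed DDPM linear schedule
alpha = np.concatenate(([1.0], np.cumprod(1.0 - betas)))   # cumulative, alpha[0] = 1

def sigma_eta(alpha, t, s, eta):
    """Equation (6) for a jump from level t down to level s < t."""
    return eta * np.sqrt((1 - alpha[s]) / (1 - alpha[t])) * np.sqrt(1 - alpha[t] / alpha[s])

t = np.arange(2, 1001)                      # skip t = 1, where sigma is exactly zero
sig_ddpm = sigma_eta(alpha, t, t - 1, eta=1.0)

# DDPM's posterior variance, written with DDPM's own per-step beta_t.
beta_tilde = (1 - alpha[t - 1]) / (1 - alpha[t]) * betas[t - 1]

# The larger sigma-hat choice: the square root of the per-step beta_t.
sig_hat = np.sqrt(1 - alpha[t] / alpha[t - 1])

print(np.allclose(sig_ddpm**2, beta_tilde), bool(np.all(sig_hat > sig_ddpm)))
```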

Skipping most of the steps

Speed comes from the marginals argument again. Because the loss and the network only ever depended on the marginals (1), sampling does not have to visit all $T$ noise levels. Pick any increasing sub-sequence $\tau = (\tau_1, \dots, \tau_S)$ of the full schedule, run the very same update (5) using $\tau$'s levels, and traverse it in reverse, from the noisiest kept level down to the cleanest. With $S$ far smaller than $T$, sampling costs $S$ network calls instead of a thousand, and nothing was retrained.

How you space the kept steps matters a little. The paper tries two rules: linear, evenly spaced, and quadratic, $\tau_i \propto i^2$, which packs more steps toward the clean, low-noise end where fine detail is decided. Quadratic worked slightly better on CIFAR-10, linear on the larger datasets. Drag the step count and switch the spacing to see which levels get kept:

Figure 3 · keep a few steps of the thousand

An axis of 1000 training steps from clean to noise, with the noise level drawn as a curve. A chosen subset of S steps (teal dots) is what generation actually visits, connected by the reverse jumps it takes. Linear spreads them evenly; quadratic packs them at the clean end. Same trained model, a 1000/S-fold speed-up.
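The two spacing rules can be sketched as follows. This is one reading of $\tau_i = \lfloor c\,i \rfloor$ and $\tau_i = \lfloor c\,i^2 \rfloor$ with $c$ chosen so the last kept level lands on $T$; the function name, the 1-based indexing and the clamping are our assumptions, and the official code may pick its constants differently.

```python
import numpy as np

def make_tau(T, S, spacing="linear"):
    """Pick S increasing noise levels out of 1..T."""
    i = np.arange(1, S + 1)
    if spacing == "linear":
        tau = np.floor(T / S * i)           # tau_i = floor(c * i)
    elif spacing == "quadratic":
        tau = np.floor(T / S**2 * i**2)     # tau_i = floor(c * i^2)
    else:
        raise ValueError(f"unknown spacing: {spacing}")
    # Clamp to at least 1 and drop duplicates that flooring can create.
    return np.unique(np.maximum(tau, 1).astype(int))

print(make_tau(1000, 10, "linear"))
print(make_tau(1000, 10, "quadratic"))
```

With ten steps, the quadratic rule puts five of its levels below 300, against two for the linear rule, which is the crowding toward the clean end described above.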

The choice of $\eta$ decides how well skipping works. Skipping steps means each step is a bigger jump, and a bigger jump is a coarser approximation. A deterministic DDIM step reuses the noise it already has, so a large jump just tracks the true path a little less precisely. A stochastic DDPM step throws away structure and injects fresh noise every time, and with only a handful of steps that injected noise never gets a chance to settle. Determinism tolerates big strides; stochasticity does not. The results section shows exactly how far apart they end up.

The full sampler is short. It is the same loop for DDIM and DDPM; only $\tau$ (which steps) and $\eta$ (how much fresh noise) change:

# One DDIM/DDPM sampling run. Same trained eps_theta either way;
# only tau (which steps) and eta (how random) change. eta=0 -> DDIM.
import numpy as np

def sample(eps_theta, alpha, tau, eta, shape, rng=None):
    # alpha: cumulative products indexed 0..T, with alpha[0] = 1.0
    if rng is None:
        rng = np.random.default_rng()
    x = rng.standard_normal(shape)              # x_T: pure Gaussian noise
    for i in reversed(range(len(tau))):         # walk reversed(tau): noise -> data
        t = tau[i]                              # current step
        s = tau[i - 1] if i > 0 else 0          # previous kept step (0 = clean end)
        a_t, a_s = alpha[t], alpha[s]           # a_s = 1.0 when s == 0
        eps = eps_theta(x, t)                   # one network call
        x0 = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)        # predicted x_0
        sig = eta * np.sqrt((1 - a_s) / (1 - a_t)) * np.sqrt(1 - a_t / a_s)
        direction = np.sqrt(max(0.0, 1 - a_s - sig**2)) * eps   # toward x_t
        x = np.sqrt(a_s) * x0 + direction + sig * rng.standard_normal(shape)
    return x                                    # x_0: a sample

One network call per kept step, $\eta=0$ for deterministic sampling, and the same weights you already trained. The rule a_s = 1.0 when s == 0 encodes the convention $\alpha_0 := 1$, the clean end of the schedule.
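To see the loop run end to end without a trained network, here is a self-contained toy. Everything in it is an assumption made for illustration: the data is a one-dimensional Gaussian $\mathcal{N}(m, v)$, for which the ideal noise predictor has a closed form, the schedule is DDPM's linear one, and `ddim_sample` is our compact restatement of the update (5) that takes the starting noise as an argument.

```python
import numpy as np

def ddim_sample(eps_theta, alpha, tau, eta, x_T, rng):
    """Run update (5) over the kept levels tau, starting from the given x_T."""
    x = x_T.copy()
    for i in reversed(range(len(tau))):
        t = tau[i]
        s = tau[i - 1] if i > 0 else 0      # alpha[0] = 1.0 is the clean end
        a_t, a_s = alpha[t], alpha[s]
        eps = eps_theta(x, t)
        x0 = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
        sig = eta * np.sqrt((1 - a_s) / (1 - a_t)) * np.sqrt(1 - a_t / a_s)
        x = (np.sqrt(a_s) * x0 + np.sqrt(max(0.0, 1 - a_s - sig**2)) * eps
             + sig * rng.standard_normal(x.shape))
    return x

betas = np.linspace(1e-4, 0.02, 1000)       # assumed DDPM linear schedule
alpha = np.concatenate(([1.0], np.cumprod(1.0 - betas)))
m, v = 2.0, 0.25                            # toy data distribution N(m, v)

def eps_theta(x, t):
    """Closed-form E[eps | x_t] for Gaussian data; stands in for a trained network."""
    a = alpha[t]
    return np.sqrt(1 - a) * (x - np.sqrt(a) * m) / (a * v + 1 - a)

rng = np.random.default_rng(0)
x_T = rng.standard_normal(10_000)
tau = list(range(50, 1001, 50))             # S = 20 evenly spaced levels

a = ddim_sample(eps_theta, alpha, tau, eta=0.0, x_T=x_T, rng=rng)
b = ddim_sample(eps_theta, alpha, tau, eta=0.0, x_T=x_T, rng=rng)
c = ddim_sample(eps_theta, alpha, tau, eta=1.0, x_T=x_T, rng=rng)
print(np.array_equal(a, b), np.array_equal(a, c), round(float(a.mean()), 2))
```

With $\eta=0$ the same starting noise gives the same output on every run, while $\eta=1$ does not; the toy says nothing about image quality, only about the mechanics of the loop.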

How fast, and at what cost

The headline claim is that DDIM reaches the quality of a 1000-step model in 20 to 100 steps, a 10 to 50 times speed-up in wall-clock time, with no retraining. The figure plots FID (the Fréchet Inception Distance, a standard image-quality score where lower is better) against the number of steps, for the same trained model sampled three ways: DDIM ($\eta=0$), DDPM ($\eta=1$), and the larger-variance $\hat\sigma$. Drag the step count and read all three:

Figure 4 · quality versus number of steps

FID (log, lower is better) against sampling steps, one trained model. DDIM stays low and usable down to 10 steps; DDPM and the larger-variance σ̂ degrade fast when steps are few, though σ̂ is best of all at the full 1000. Drag S; toggle the dataset.

The numbers tell a clean story. On CIFAR-10 at just 10 steps, DDIM scores FID $13.36$ while DDPM at the same budget is $41.07$ and the larger-variance $\hat\sigma$ is a hopeless $367.43$, because with so few steps its extra injected noise never averages out, and the leftover grain drives its FID up. On CelebA the crossover is easy to feel: the 100-step DDPM scores $13.93$, essentially the same as the 20-step DDIM at $13.73$, so DDIM reaches that quality in a fifth of the steps.

One number in the same table cuts the other way. At the full 1000 steps, the extra-noise $\hat\sigma$ sampler edges ahead, FID $3.17$ against DDIM's $4.04$ on CIFAR-10. When you can afford every step, a dash of stochasticity smooths the last bit of quality; DDIM's win is specifically the few-step regime, which is the regime anyone waiting on a sample actually cares about.

A latent space you can navigate

Determinism buys more than speed. Once $\sigma_t = 0$, the entire generation is a fixed function of the starting noise $\mathbf{x}_T$: same starting noise, same image, always. The paper checks this and finds something striking. Fix $\mathbf{x}_T$ and generate with 10 steps, then 100, then 1000: the high-level content, the face, the pose, the layout, stays the same across all of them, with only fine detail sharpening as you add steps. The number of steps is a quality knob; the starting noise is the image.

The figure makes the contrast concrete. Eight fixed starting points are each decoded to an image; drag the step count and watch where they land. Under DDIM the landings barely move, so $\mathbf{x}_T$ alone fixes the result. Flip to DDPM and the fresh noise injected at every step throws the landings around as you change the step count, so there the starting noise decides very little:

Figure 5 · same start, same image

Eight fixed noise starts, each decoded with the same model. The ring marks where a start lands with a 48-step run; the bright dot marks where it lands with the chosen S. Under DDIM the dots sit on their rings for any S. Flip to DDPM and they scatter, because fresh noise, not the start, decides the image.

If $\mathbf{x}_T$ alone determines the image, it behaves like a latent code for that image, the way the latent of a GAN or a variational autoencoder does. Two consequences follow, and both are things a DDPM cannot do. You can interpolate: take two images' latent codes, blend them, and decode, and because the decode is smooth the result is a semantically meaningful blend rather than a ghostly pixel average. And you can, in principle, run the process backward to find the latent code of a given real image, which the next section turns into near-lossless reconstruction. A stochastic DDPM has no stable code to manipulate, because its output depends on a thousand fresh coin flips along the way.
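For the blending step, the paper reports using spherical linear interpolation (slerp) between two $\mathbf{x}_T$ codes rather than a straight line, which keeps the blend at roughly the norm a Gaussian latent is expected to have. A minimal sketch, with random arrays standing in for real latent codes and a linear fallback that is our own addition:

```python
import numpy as np

def slerp(z0, z1, w):
    """Spherical interpolation between two latents; w = 0 gives z0, w = 1 gives z1."""
    a, b = z0.ravel(), z1.ravel()
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if np.isclose(np.sin(theta), 0.0):      # (anti)parallel latents: fall back to a line
        return (1.0 - w) * z0 + w * z1
    return (np.sin((1.0 - w) * theta) * z0 + np.sin(w * theta) * z1) / np.sin(theta)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((3, 32, 32))       # two hypothetical x_T codes
z1 = rng.standard_normal((3, 32, 32))
path = [slerp(z0, z1, w) for w in np.linspace(0.0, 1.0, 5)]
# Each element of `path` would then be decoded with the eta = 0 sampler.
print([round(float(np.linalg.norm(z)), 1) for z in path])
```

A straight-line average of two independent Gaussian codes would shrink the norm by about a factor of $\sqrt{2}$ at the midpoint, handing the sampler an input it rarely sees.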

DDIM is really an ODE

A deterministic sampler tolerates big steps and stays invertible because it is a numerical solver for a differential equation. Rearrange the deterministic step (5) with $\sigma_t = 0$ and it lines up term for term with Euler's method, the simplest way to solve an ordinary differential equation (ODE): from the current point, take one small step along a velocity, repeat. Change variables to $\bar{\mathbf{x}} = \mathbf{x}/\sqrt{\alpha}$ and $\sigma = \sqrt{1-\alpha}/\sqrt{\alpha}$, and the update becomes Euler steps on

$$
\mathrm{d}\bar{\mathbf{x}}(t) = \boldsymbol{\epsilon}_\theta^{(t)}\!\Big(\frac{\bar{\mathbf{x}}(t)}{\sqrt{\sigma^2+1}}\Big)\,\mathrm{d}\sigma(t) \tag{7}
$$

The network's noise prediction is the velocity, and $\sigma$ plays the role of time. (The tidy identity $\sigma^2 + 1 = 1/\alpha$ means $\bar{\mathbf{x}}/\sqrt{\sigma^2+1} = \mathbf{x}_t$, so the network is always handed an ordinary noisy image, exactly what it was trained on.) This is the same observation behind Neural ODEs: a chain of residual-style updates is Euler's method applied to some hidden dynamics, and taking more steps just integrates the same curve more finely.
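For readers who want the rearrangement spelled out, here is the algebra in two lines. Note that $\sigma$ without a subscript on time is the ODE's noise-to-signal ratio $\sqrt{1-\alpha}/\sqrt{\alpha}$, not the stochasticity dial $\sigma_t$ of equation (5), which is set to zero here. Start from (5) with the prediction (3) substituted in:

$$
\mathbf{x}_{t-1} = \sqrt{\alpha_{t-1}}\,\frac{\mathbf{x}_t - \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_\theta^{(t)}(\mathbf{x}_t)}{\sqrt{\alpha_t}} + \sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_\theta^{(t)}(\mathbf{x}_t)
$$

Divide both sides by $\sqrt{\alpha_{t-1}}$ and collect the two noise terms:

$$
\frac{\mathbf{x}_{t-1}}{\sqrt{\alpha_{t-1}}} = \frac{\mathbf{x}_t}{\sqrt{\alpha_t}} + \left(\sqrt{\frac{1-\alpha_{t-1}}{\alpha_{t-1}}} - \sqrt{\frac{1-\alpha_t}{\alpha_t}}\right)\boldsymbol{\epsilon}_\theta^{(t)}(\mathbf{x}_t)
$$

In the new variables this reads $\bar{\mathbf{x}}_{t-1} = \bar{\mathbf{x}}_t + (\sigma_{t-1} - \sigma_t)\,\boldsymbol{\epsilon}_\theta^{(t)}(\mathbf{x}_t)$, with $\sigma_t$ here denoting the noise-to-signal ratio at level $t$: one Euler step of size $\sigma_{t-1}-\sigma_t$ along the velocity $\boldsymbol{\epsilon}_\theta$.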

The figure shows it directly. The faint curve is the true continuous path linking a clean image to its noise code; the bright polyline is the $S$-step DDIM approximation. Few steps cut the corners; more steps melt onto the curve. Because the map is an ODE, it runs both ways: encode an image to its code, decode the code back, and encode-then-decode returns to the start, with an error that shrinks as you add steps. On CIFAR-10 the per-pixel reconstruction error falls from $0.014$ at 10 steps to $0.0001$ at 1000, which is the numerical signature of a genuine, invertible flow:

Figure 6 · Euler steps on an invertible flow

The faint curve is the true ODE path from an image x₀ to its noise code x_T; the bright polyline is the S-step DDIM approximation. Few steps cut corners; more track the curve. Toggle encode (x₀→x_T) and decode (x_T→x₀); the round-trip error shrinks with S.
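The round trip can be tried on a toy problem. The sketch below is one common way to discretize the encoding direction: apply the same $\eta=0$ update, but toward a noisier level. It assumes unit-Gaussian data, for which the ideal noise predictor is $\sqrt{1-\alpha_t}\,\mathbf{x}_t$, a DDPM-style linear schedule, and helper names of our own. Its error values are properties of this toy and are not comparable to the paper's CIFAR-10 numbers.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)       # assumed DDPM linear schedule
alpha = np.concatenate(([1.0], np.cumprod(1.0 - betas)))

def eps_theta(x, t):
    """Closed-form E[eps | x_t] when the data is N(0, I); a stand-in for a network."""
    return np.sqrt(1 - alpha[t]) * x

def ddim_move(x, t_from, t_to):
    """One deterministic step (eta = 0) from level t_from to level t_to, either direction."""
    a_f, a_to = alpha[t_from], alpha[t_to]
    eps = eps_theta(x, t_from)
    x0 = (x - np.sqrt(1 - a_f) * eps) / np.sqrt(a_f)
    return np.sqrt(a_to) * x0 + np.sqrt(1 - a_to) * eps

def round_trip_error(x0, S):
    """Encode x_0 to x_T in S steps, decode back in S steps, return the mean squared error."""
    levels = [0] + list(range(1000 // S, 1001, 1000 // S))
    x = x0
    for t, u in zip(levels[:-1], levels[1:]):           # encode: x_0 -> x_T
        x = ddim_move(x, t, u)
    for t, u in zip(levels[:0:-1], levels[-2::-1]):     # decode: x_T -> x_0
        x = ddim_move(x, t, u)
    return float(np.mean((x - x0) ** 2))

x0 = np.random.default_rng(0).standard_normal(1000)
errors = {S: round_trip_error(x0, S) for S in (10, 100, 1000)}
print(errors)
```

The printed errors should fall as `S` grows, which is the behavior of an Euler solver converging on an invertible flow.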

One nuance for anyone reading alongside the score-SDE paper. DDIM's ODE (7) is written with respect to $\mathrm{d}\sigma$ and carries no factor of a half, while the probability-flow ODE of the variance-exploding SDE is usually written with respect to $\mathrm{d}t$ and carries a $\tfrac{1}{2}g^2$. These are the same continuous ODE: the change of variables $\mathrm{d}\sigma^2/\mathrm{d}t = 2\sigma\,\mathrm{d}\sigma/\mathrm{d}t$ cancels the half. What differs is the discretization: DDIM takes its Euler steps with respect to $\sigma$, the score-SDE sampler takes them with respect to $t$, and with few steps those choices land in different places. DDIM's spacing happens to track the curve better on a coarse budget.

One substitution carried all of this. Because the training loss only measured single-level denoising, the thousand-step Markov chain was never load-bearing, and swapping it for a deterministic ODE costs nothing at training time and buys a fast, invertible sampler. The same trained diffusion model you already have is, underneath, an ODE you can integrate as coarsely as you dare, and DDIM turns that reading into a sampler.

Provenance: verified against primary literature

DDPM (Ho et al., 2020): The forward process, ε-prediction network, and the L₁ objective DDIM reuses unchanged.

NCSN / score matching: Same denoising objective; the score view of the same network (Vincent 2011; Song & Ermon 2019).

Score-SDE (Song et al., 2021): The probability-flow ODE; DDIM’s ODE is its variance-exploding special case.

Neural ODEs (Chen et al., 2018): A chain of residual updates read as Euler integration of a hidden ODE.

Caveat: The paper's α_t is the cumulative product, what DDPM and most tutorials call ᾱ_t (alpha-bar). DDPM's per-step α = 1−β maps to the ratio α_t/α_{t-1} here. The paper only clarifies this in an appendix, so we front-load it; every equation is in the cumulative-α convention.

Questions you might still have

Do I need to retrain my diffusion model to use DDIM?
No. DDIM is built precisely so you do not have to. It uses the exact same trained network and the same weights; only the sampling loop changes. A theorem shows every member of the σ-family shares the DDPM objective up to per-level weights and a constant, so the model trained for DDPM is already optimal for DDIM.

Is DDIM just DDPM with fewer steps?
No, they are different samplers. DDPM (η=1) injects fresh noise at every step and degrades fast when steps are few. DDIM (η=0) is deterministic: it reuses the noise it already has, so big step-skips only cost a little accuracy. Fewer steps is a separate lever both can pull; determinism is what lets DDIM pull it hard.

Why is DDIM called "implicit"?
In the technical sense of an implicit probabilistic model: it generates samples through a fixed procedure from a latent variable (the starting noise x_T), like a GAN or a normalizing flow, rather than through an explicit step-by-step density. The sampling formula itself is perfectly explicit.

Is DDIM always better than DDPM?
Only when steps are scarce, which is the case that matters in practice. At the full 1000 steps a larger-variance DDPM (σ̂) actually edges DDIM out on FID (3.17 vs 4.04 on CIFAR-10). DDIM wins the few-step regime and gives you a stable latent code; the extra noise wins the last sliver of quality when compute is unlimited.

What does the deterministic version let me do that DDPM cannot?
Because the same x_T always decodes to the same image, x_T is a genuine latent code. You get consistency (10-step and 1000-step samples share high-level content), semantically meaningful interpolation between two images’ codes, and near-lossless encode-then-reconstruct, since the sampler is an invertible ODE.

Footnotes & further reading

The paper: Song, Meng, Ermon, Denoising Diffusion Implicit Models (Stanford, ICLR 2021), arXiv:2010.02502.
The model and objective DDIM reuses: Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (our explainer: DDPM). The α/ᾱ notation map is spelled out in DDIM's Appendix C.2.
The score / SDE / probability-flow ODE view DDIM connects to: Song et al., Score-Based Generative Modeling through SDEs (our explainer: Score-SDE), and the denoising-score-matching lineage in NCSN.
Residual updates as Euler integration of an ODE: Chen et al., Neural Ordinary Differential Equations (our explainer: Neural ODEs).
Table 1 (FID vs steps) and Table 2 (reconstruction error) are transcribed from the paper. Table 1 uses one model per dataset trained at T=1000; only the sampler (τ and η) changes. Reconstruction error at S steps: 0.014, 0.0065, 0.0023, 0.0009, 0.0004, 0.0001, 0.0001 for S = 10, 20, 50, 100, 200, 500, 1000.
For the broader map of how DDPM, score matching, and SDEs fit together, see our diffusion tutorial explainer, and flow matching for a related deterministic-generation idea.
