Verified · arXiv:2010.02502 · 26 min
Diffusion · Sampling

Denoising Diffusion Implicit Models

Reuse a trained diffusion model, and sample it ten to fifty times faster.

Nothing about training changes. DDIM rebuilds only the sampling process, as a family of samplers that all fit the same trained network, and its deterministic member reaches full quality in a few dozen steps instead of a thousand.

Explaining the paper: Denoising Diffusion Implicit Models · Song, Meng, Ermon · Stanford · ICLR 2021 · arXiv:2010.02502

Take a diffusion model that already works, and make it sample in twenty steps instead of a thousand, without touching a single weight.

A denoising diffusion probabilistic model (DDPM) makes beautiful images and is miserably slow. To draw one sample it starts from pure noise and denoises in small steps, and it takes about a thousand of them, each one a full pass through a large network, run strictly one after another. The paper puts a number on it: sampling 50,000 tiny $32\times32$ images from a DDPM takes roughly 20 hours on a single 2080 Ti, against under a minute for a GAN. At $256\times256$ the same batch could take close to a thousand hours. A model you have to babysit for a day to see a few images is hard to put in front of anyone.

The obvious fix is to take fewer, bigger steps. It fails for a plain DDPM, and the reason is exactly what DDIM works around. A DDPM's sampler is derived as the reverse of one specific noising process, a Markov chain of a thousand steps, where each step adds a little Gaussian noise and depends only on the step just before it. Reverse that chain and you get a thousand-step denoiser and nothing shorter, because the recipe is tied to that exact chain. Drop most of the steps and the sampler is no longer inverting the process it was trained on.

DDIM keeps the trained network exactly as is and rebuilds the sampler from a different starting point. The opening it exploits is that the training objective never involved the thousand-step chain in the first place. It only ever scored the network on undoing a single noise level at a time, image by image. That one fact opens up a whole family of samplers that fit the same network, and one of them is deterministic and happy to take enormous strides.

A few ideas get us there: the training loss depends only on the per-step marginals; that frees a whole family of noising processes with the same marginals; one member of the family is deterministic; and a deterministic sampler both skips steps cheaply and doubles as an encoder. Each is small on its own.

What a diffusion model already gives us

DDIM inherits its network and its training loss from DDPM, so a quick recap grounds everything that follows. A diffusion model has a fixed forward process that adds Gaussian noise to a clean image $\mathbf{x}_0$ in $T$ steps until nothing is left but static. The one property that makes the math tractable is that you can jump to any noise level in a single shot: the distribution of the noised image $\mathbf{x}_t$ given the clean one has a closed form,

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\!\big(\mathbf{x}_t;\ \sqrt{\alpha_t}\,\mathbf{x}_0,\ (1-\alpha_t)\mathbf{I}\big), \qquad \mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon} \tag{1}$$

with $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$. Read it as a fixed power budget: $\sqrt{\alpha_t}$ turns the signal down and $\sqrt{1-\alpha_t}$ turns the static up, geared so the two squared coefficients, $\alpha_t$ and $1-\alpha_t$, always sum to one. As $t$ grows, $\alpha_t$ falls from near one toward zero and the image dissolves into a standard Gaussian.

One notation warning, and it is the single biggest source of confusion in this paper. Here $\alpha_t$ is the cumulative product of the per-step signal-retention factors, so it is what the original DDPM paper and most tutorials write as $\bar\alpha_t$ (alpha-bar). DDPM's per-step $\alpha_t = 1-\beta_t$ corresponds here to the ratio $\alpha_t/\alpha_{t-1}$. The paper only spells this out in an appendix, with no footnote near the equations, so a reader arriving from DDPM sees a formula that looks wrong until they catch the redefinition. Every equation below uses this cumulative convention: $\alpha_t$ shrinks from $\alpha_0 = 1$ down toward zero.
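To make the notation map concrete, here is a minimal sketch. It assumes the linear $\beta$ schedule from the DDPM paper ($10^{-4}$ to $0.02$ over $T = 1000$ steps); the helper name `cumulative_alpha` is ours, not the paper's. It builds this article's cumulative $\alpha_t$ from per-step $\beta_t$ and recovers DDPM's per-step factor as a ratio of neighbours.

```python
import numpy as np

def cumulative_alpha(betas):
    """Return alpha[0..T] in the DDIM paper's convention (DDPM's alpha-bar).

    alpha[0] = 1 is the clean end; alpha[t] = prod_{k<=t} (1 - beta_k).
    """
    per_step = 1.0 - np.asarray(betas, dtype=np.float64)  # DDPM's per-step alpha_t
    return np.concatenate([[1.0], np.cumprod(per_step)])

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed schedule: the linear one from DDPM
alpha = cumulative_alpha(betas)

# DDPM's per-step factor is the ratio of consecutive cumulative values.
ratio = alpha[1:] / alpha[:-1]
```

Indexing the array from 0 with `alpha[0] = 1` keeps the code aligned with the convention $\alpha_0 = 1$ used in every equation here.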

The network is trained to look at a noised image and predict the noise that was added to it. Write it $\boldsymbol{\epsilon}_\theta^{(t)}(\mathbf{x}_t)$. The loss is the plainest thing imaginable, a squared error between the predicted and the actual noise, averaged over clean images, noise levels, and noise draws:

$$L(\boldsymbol{\epsilon}_\theta) = \sum_{t=1}^{T} \mathbb{E}_{\mathbf{x}_0,\,\boldsymbol{\epsilon}}\Big[\ \big\lVert\, \boldsymbol{\epsilon}_\theta^{(t)}\!\big(\sqrt{\alpha_t}\,\mathbf{x}_0 + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}\big) - \boldsymbol{\epsilon}\,\big\rVert_2^2\ \Big] \tag{2}$$

(The paper writes this as $L_{\mathbf{1}}$, where the subscript is the all-ones weight vector, one weight per noise level, all set to one. It does not mean an L1 or absolute-value loss; the error inside is squared. This is the same objective a noise-conditional score network minimizes, which is why score-based and diffusion models are two dialects of one idea.) Predicting the noise is the same as predicting the clean image, because equation (1) is invertible: rearrange it and the network's guess of the clean $\mathbf{x}_0$ is

$$\hat{\mathbf{x}}_0(\mathbf{x}_t) = \frac{\mathbf{x}_t - \sqrt{1-\alpha_t}\ \boldsymbol{\epsilon}_\theta^{(t)}(\mathbf{x}_t)}{\sqrt{\alpha_t}} \tag{3}$$

Predicted noise and predicted image carry the same information, and we will use whichever is convenient. That is all DDIM inherits: a network that denoises at every level, and a loss (2) that trained it. Everything DDIM adds happens after training, in how you turn that network into a sampler.
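The equivalence between predicting noise and predicting the image can be checked in a few lines. This is a sketch with a random array standing in for an image and an arbitrary level $\alpha_t = 0.30$; the function names are illustrative, not from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_to_level(x0, a_t, eps):
    """Equation (1): jump from a clean x0 straight to the level with cumulative alpha a_t."""
    return np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps

def predict_x0(x_t, a_t, eps_hat):
    """Equation (3): turn a noise estimate into a clean-image estimate."""
    return (x_t - np.sqrt(1.0 - a_t) * eps_hat) / np.sqrt(a_t)

x0 = rng.standard_normal((3, 8, 8))     # stand-in "image"
eps = rng.standard_normal(x0.shape)     # the noise actually added
a_t = 0.30                              # cumulative alpha at some level t

x_t = noise_to_level(x0, a_t, eps)
x0_perfect = predict_x0(x_t, a_t, eps)        # oracle noise gives back x0 exactly
x0_off = predict_x0(x_t, a_t, eps + 0.1)      # noise estimate off by 0.1 everywhere
```

A constant error in the noise estimate becomes an error of size $\sqrt{1-\alpha_t}/\sqrt{\alpha_t}$ times as large in the image estimate, so the same mistake costs more at noisier levels, where $\alpha_t$ is small.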

The loss only sees the marginals

Look hard at the loss (2). Every term takes a clean image, jumps straight to one noise level with equation (1), and penalizes the network's error in undoing that single jump. No term chains two noisy images together, and none involves how $\mathbf{x}_t$ and $\mathbf{x}_{t-1}$ relate along the way. In the language of probability, the loss depends only on the marginals $q(\mathbf{x}_t \mid \mathbf{x}_0)$, the distribution of a single noised image, never on the joint distribution of the entire noisy sequence.

Everything DDIM does starts from that distinction. Think of grading Polaroids against their subjects: the rubric compares each finished photo to the person in it, and never asks in what order the roll was shot. Many different shooting orders produce the same graded photos. In the same way, many different noising processes, with different step-to-step correlations, can share the exact same per-level marginals (1). The DDPM Markov chain is only one of them. Any other process with the same marginals produces the same loss, so it is served by the same trained network, with no retraining.
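A Monte Carlo estimate of one term of the loss (2) shows what the training signal actually touches. This is a sketch under stated assumptions: `zero_model` is a hypothetical placeholder for the network, and the "images" are 4-dimensional vectors. The function needs clean data, fresh noise, and the single-jump marginal (1), and it never forms a pair $(\mathbf{x}_t, \mathbf{x}_{t-1})$.

```python
import numpy as np

def loss_term(eps_model, x0_batch, a_t, t, rng):
    """Monte Carlo estimate of one term of the loss (2) at level t.

    Uses only clean samples, fresh noise, and the marginal (1).
    """
    eps = rng.standard_normal(x0_batch.shape)
    x_t = np.sqrt(a_t) * x0_batch + np.sqrt(1.0 - a_t) * eps   # single jump to level t
    err = eps_model(x_t, t) - eps
    return np.mean(np.sum(err**2, axis=-1))                    # squared L2 norm, averaged

def zero_model(x_t, t):
    """Hypothetical placeholder network that always predicts zero noise."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x0_batch = rng.standard_normal((20000, 4))    # toy 4-dimensional "images"
value = loss_term(zero_model, x0_batch, a_t=0.30, t=500, rng=rng)
```

With the zero predictor the estimate sits near 4, the expected squared norm of a 4-dimensional standard Gaussian. Any process with the same marginal at level $t$ would give the same value.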

So the question becomes concrete: what other noising processes share DDPM's marginals, and do any of them reverse into a faster or nicer sampler? The next section builds exactly that family and finds that one member is deterministic.

A family of forward processes

The paper writes down a family of processes, indexed by a vector $\boldsymbol{\sigma} \ge \mathbf{0}$, engineered so that every member keeps the marginals (1) exactly. The design is easiest to read backward, as the reverse step: given a clean $\mathbf{x}_0$ and a noised $\mathbf{x}_t$, the previous, slightly cleaner $\mathbf{x}_{t-1}$ is drawn from

$$q_{\boldsymbol{\sigma}}(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\!\Big(\sqrt{\alpha_{t-1}}\,\mathbf{x}_0 + \sqrt{1-\alpha_{t-1}-\sigma_t^2}\ \cdot\ \frac{\mathbf{x}_t - \sqrt{\alpha_t}\,\mathbf{x}_0}{\sqrt{1-\alpha_t}},\ \ \sigma_t^2\mathbf{I}\Big) \tag{4}$$

The mean has two parts. The first, $\sqrt{\alpha_{t-1}}\,\mathbf{x}_0$, is the clean image scaled down to the next noise level. The second reuses the noise already present in $\mathbf{x}_t$: the fraction $(\mathbf{x}_t - \sqrt{\alpha_t}\,\mathbf{x}_0)/\sqrt{1-\alpha_t}$ is exactly the unit noise that was mixed into $\mathbf{x}_t$, and we lay some of it back down. The variance is a free knob $\sigma_t^2$.

The odd-looking coefficient $\sqrt{1-\alpha_{t-1}-\sigma_t^2}$ is the heart of the construction, and it is chosen for one reason: the two sources of spread have to add up to the right total. The noise carried over contributes variance $1-\alpha_{t-1}-\sigma_t^2$, and the fresh noise contributes $\sigma_t^2$. They sum by design:

$$\underbrace{(1-\alpha_{t-1}-\sigma_t^2)}_{\text{carried-over noise}} + \underbrace{\sigma_t^2}_{\text{fresh noise}} = 1-\alpha_{t-1}$$

and $1-\alpha_{t-1}$ is precisely the variance of the marginal (1) at level $t-1$. So whatever you set $\sigma_t$ to, the marginal comes out unchanged. (The one restriction is that $\sigma_t^2 \le 1-\alpha_{t-1}$, so the square root stays real. It is not printed as a rule in the paper; it is what the construction requires.) This is where $\boldsymbol{\sigma}$ earns its name as a stochasticity dial. Turn it up and each step re-rolls more of its noise fresh; turn it to zero and each step reuses only the noise already in hand, which makes $\mathbf{x}_{t-1}$ a deterministic function of $\mathbf{x}_t$ and $\mathbf{x}_0$.
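The claim that the marginal survives any choice of $\sigma_t$ is easy to test numerically. The sketch below uses a scalar "image" $x_0 = 1.5$ and the illustrative levels $\alpha_t = 0.30$, $\alpha_{t-1} = 0.45$ (both our assumptions). It draws $\mathbf{x}_t$ from (1), pushes it through the reverse step (4) for three values of $\sigma_t$, and records the mean and variance of the result.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
a_t, a_prev = 0.30, 0.45     # illustrative cumulative alphas at levels t and t-1
x0 = 1.5                     # scalar stand-in for a clean image

eps = rng.standard_normal(n)
x_t = np.sqrt(a_t) * x0 + np.sqrt(1 - a_t) * eps      # marginal (1) at level t

def reverse_draw(x_t, x0, a_t, a_prev, sigma, rng):
    """Draw x_{t-1} from q_sigma(x_{t-1} | x_t, x0), equation (4)."""
    unit_noise = (x_t - np.sqrt(a_t) * x0) / np.sqrt(1 - a_t)
    mean = np.sqrt(a_prev) * x0 + np.sqrt(1 - a_prev - sigma**2) * unit_noise
    return mean + sigma * rng.standard_normal(np.shape(x_t))

stats = {}
for sigma in (0.0, 0.3, 0.7):    # all satisfy sigma^2 <= 1 - a_prev = 0.55
    x_prev = reverse_draw(x_t, x0, a_t, a_prev, sigma, rng)
    stats[sigma] = (x_prev.mean(), x_prev.var())
```

In every case the sample mean lands near $\sqrt{0.45}\cdot 1.5$ and the sample variance near $0.55$, which is the marginal (1) at level $t-1$; only the correlation between $\mathbf{x}_t$ and $\mathbf{x}_{t-1}$ changes.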

The figure makes this checkable. The amber corridor is the marginal $q(\mathbf{x}_t\mid\mathbf{x}_0)$, fixed for every setting. The teal curves are sample paths of the family. Slide $\eta$, which scales $\boldsymbol{\sigma}$ (the exact definition comes in the next section): at $\eta=0$ the paths are smooth and pinned down by their noise endpoint, at $\eta=1$ they thrash like random walks, and at every setting they stay inside the same corridor and land on the same clean point.

Figure 1 · same marginals, different joints
The amber corridor is the marginal q(x_t | x₀): the only thing the training loss depends on, and it never moves. The teal paths are samples of the non-Markovian family. Drag η from 0 (smooth, deterministic given the endpoint) to 1 (a jagged random walk). Different joints, one shared marginal, so one trained network fits them all.

One question remains, and the paper answers it with a theorem. Is this really a valid family, or does changing $\boldsymbol{\sigma}$ actually require training a different network? The forward process here is no longer Markovian, since $\mathbf{x}_t$ is allowed to depend on both $\mathbf{x}_{t-1}$ and $\mathbf{x}_0$, so the naive worry is that its reverse needs its own training run. The theorem says no: the variational objective $J_{\boldsymbol{\sigma}}$ for any $\boldsymbol{\sigma}$ equals the objective (2) with its terms reweighted by some positive per-level weights $\boldsymbol{\gamma}$, plus a constant.

$$\forall\,\boldsymbol{\sigma}>\mathbf{0}\ \ \exists\,\boldsymbol{\gamma}>\mathbf{0},\ C: \quad J_{\boldsymbol{\sigma}}(\boldsymbol{\epsilon}_\theta) = L_{\boldsymbol{\gamma}}(\boldsymbol{\epsilon}_\theta) + C$$

Here $L_{\boldsymbol{\gamma}}$ is the loss (2) with its $t$-th term multiplied by $\gamma_t$, so (2) itself is the case $\boldsymbol{\gamma}=\mathbf{1}$. A constant offset does not move the minimizer, and neither does a positive reweighting when every noise level can be minimized separately, so the network that was best for DDPM is best for every member of the family. One training run, and every $\boldsymbol{\sigma}$ is yours to pick at sampling time. (That last step assumes the network has separate parameters per noise level; the real model shares one network across all levels, so in practice this is an excellent approximation rather than an exact identity, and the transfer across the family is borne out empirically.) With that permission slip in hand, we can go shopping for a sampler.

One sampling step, three pieces

To sample, replace the unknown clean image $\mathbf{x}_0$ in the reverse step (4) with the network's prediction $\hat{\mathbf{x}}_0$ from equation (3). Writing it out gives the master update, the one equation the method runs on:

$$\mathbf{x}_{t-1} = \sqrt{\alpha_{t-1}}\ \underbrace{\hat{\mathbf{x}}_0(\mathbf{x}_t)}_{\text{predicted image}} + \underbrace{\sqrt{1-\alpha_{t-1}-\sigma_t^2}\ \boldsymbol{\epsilon}_\theta^{(t)}(\mathbf{x}_t)}_{\text{direction back toward }\mathbf{x}_t} + \underbrace{\sigma_t\,\boldsymbol{\epsilon}_t}_{\text{fresh noise}} \tag{5}$$

Three pieces. Estimate where the clean image is, and step toward it. Add back part of the noise you came in with, pointing along the exact direction you arrived from. Then optionally sprinkle in fresh random noise. The middle term is often misread as random; it is not. It is built from the network's predicted noise, a fixed vector, so it is fully deterministic. Only the third term, $\sigma_t\boldsymbol{\epsilon}_t$, is a coin flip.

Set $\sigma_t = 0$ for every step and the coin flip disappears. What is left is a deterministic map from $\mathbf{x}_t$ to $\mathbf{x}_{t-1}$, and running it from pure noise all the way down gives a sample with no randomness after the first draw. The paper names this deterministic sampler the denoising diffusion implicit model: implicit in the technical sense of generating samples through a fixed procedure from a latent variable, the way a GAN or a normalizing flow does, rather than through an explicit step-by-step probability density. The sampling formula is perfectly explicit; the name points at the deterministic, latent-to-image character, not at any hidden math.

The figure takes one step apart. The noise budget at level $t-1$ is fixed at $\sqrt{1-\alpha_{t-1}}$; the dial $\eta$ only splits it between the deterministic direction and fresh noise. Concretely, suppose $\alpha_t = 0.30$ and $\alpha_{t-1} = 0.45$. The budget is $\sqrt{1-0.45}\approx0.74$. At $\eta=0$ all of it goes into the deterministic direction. At $\eta=1$ the noise is $\sigma\approx0.51$ and the direction shrinks to $\approx0.54$, and indeed $0.54^2 + 0.51^2 \approx 0.55 = 1-\alpha_{t-1}$. Same budget, different split.

Figure 2 · the master step, taken apart
From the noisy xₜ the network predicts the clean x̂₀. The next state is a base plus a teal deterministic step toward xₜ, plus a violet cloud of fresh noise. Slide η: at 0 the cloud collapses to one point (DDIM); toward 1 the deterministic step shrinks and the cloud fills in (DDPM). The bar shows the fixed budget being split.

To compare deterministic and stochastic sampling on equal footing, the paper picks a one-parameter slice of the $\boldsymbol{\sigma}$ family, controlled by a single scalar $\eta \ge 0$:

$$\sigma_{t}(\eta) = \eta\,\sqrt{\frac{1-\alpha_{t-1}}{1-\alpha_{t}}}\,\sqrt{1-\frac{\alpha_{t}}{\alpha_{t-1}}} \tag{6}$$

At $\eta = 0$ the noise vanishes and you get DDIM. At $\eta = 1$ the variance is exactly DDPM's, and the forward process becomes Markovian again, so you recover the original stochastic sampler. Everything in between is a valid sampler off the same network. (The paper also tries one more variance it calls $\hat\sigma$, an even larger choice from the original DDPM code; it is a separate setting, not a point on the $\eta$ line, and it behaves very differently, as the results will show.) So a single dial runs from fully deterministic to fully stochastic, and we can ask which end samples better when steps are scarce.
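The budget split worked through above for $\alpha_t = 0.30$, $\alpha_{t-1} = 0.45$ can be reproduced directly from equation (6). A small sketch; the function names `sigma_eta` and `budget_split` are ours.

```python
import math

def sigma_eta(a_t, a_prev, eta):
    """Equation (6): the fresh-noise scale on the eta slice of the family."""
    return eta * math.sqrt((1 - a_prev) / (1 - a_t)) * math.sqrt(1 - a_t / a_prev)

def budget_split(a_t, a_prev, eta):
    """Return (direction coefficient, fresh-noise sigma) for the step to level t-1."""
    sig = sigma_eta(a_t, a_prev, eta)
    direction = math.sqrt(max(0.0, 1 - a_prev - sig**2))
    return direction, sig

for eta in (0.0, 0.5, 1.0):
    d, s = budget_split(0.30, 0.45, eta)
    print(f"eta={eta:.1f}  direction={d:.2f}  sigma={s:.2f}  total={d*d + s*s:.2f}")
```

The `total` column stays at 0.55 for every $\eta$: the dial moves variance between the two terms and never changes their sum.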

Skipping most of the steps

Speed comes from the marginals argument again. Because the loss and the network only ever depended on the marginals (1), sampling does not have to visit all $T$ noise levels. Pick any increasing sub-sequence $\tau = (\tau_1, \dots, \tau_S)$ of the full schedule, run the very same update (5) using $\tau$'s levels, and traverse it in reverse, from the noisiest kept level down to the cleanest. With $S$ far smaller than $T$, sampling costs $S$ network calls instead of a thousand, and nothing was retrained.

How you space the kept steps matters a little. The paper tries two rules: linear, evenly spaced, and quadratic, $\tau_i \propto i^2$, which packs more steps toward the clean, low-noise end where fine detail is decided. Quadratic worked slightly better on CIFAR-10, linear on the larger datasets. Drag the step count and switch the spacing to see which levels get kept:

Figure 3 · keep a few steps of the thousand
An axis of 1000 training steps from clean to noise, with the noise level drawn as a curve. A chosen subset of S steps (teal dots) is what generation actually visits, connected by the reverse jumps it takes. Linear spreads them evenly; quadratic packs them at the clean end. Same trained model, a 1000/S-fold speed-up.
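The two spacing rules can be sketched as follows. The helper `make_tau` and its rounding are assumptions of this example and may differ in detail from the paper's released code; the point is only how the kept levels are distributed.

```python
import numpy as np

def make_tau(T, S, spacing="linear"):
    """Pick S of the T training steps, in increasing order, ending at T.

    The flooring and clipping here are choices of this sketch.
    """
    i = np.arange(1, S + 1)
    if spacing == "linear":
        tau = np.floor(i * (T / S))             # evenly spaced levels
    elif spacing == "quadratic":
        tau = np.floor(i**2 * (T / S**2))       # tau_i proportional to i^2
    else:
        raise ValueError(f"unknown spacing: {spacing!r}")
    tau = np.clip(tau.astype(int), 1, T)        # keep indices inside 1..T
    return np.unique(tau)                       # drop duplicates created by rounding

tau_lin = make_tau(1000, 10, "linear")
tau_quad = make_tau(1000, 10, "quadratic")
```

With $T = 1000$ and $S = 10$, the quadratic rule places half of its ten levels at or below $t = 250$, whereas the linear rule places only two there.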

The choice of $\eta$ decides how well skipping works. Skipping steps means each step is a bigger jump, and a bigger jump is a coarser approximation. A deterministic DDIM step reuses the noise it already has, so a large jump just tracks the true path a little less precisely. A stochastic DDPM step throws away structure and injects fresh noise every time, and with only a handful of steps that injected noise never gets a chance to settle. Determinism tolerates big strides; stochasticity does not. The results section shows exactly how far apart they end up.

The full sampler is short. It is the same loop for DDIM and DDPM; only $\tau$ (which steps) and $\eta$ (how much fresh noise) change:

# One DDIM/DDPM sampling run. Same trained eps_theta either way;
# only tau (which steps) and eta (how random) change. eta=0 -> DDIM.
import numpy as np

def sample(eps_theta, alpha, tau, eta, shape, rng=None):
    # alpha[t] is the cumulative product for t = 0..T, with alpha[0] = 1.0.
    # tau is an increasing list of kept steps; shape is the sample shape.
    if rng is None:
        rng = np.random.default_rng()
    x = rng.standard_normal(shape)              # x_T: pure Gaussian noise
    for i in reversed(range(len(tau))):         # walk reversed(tau): noise -> data
        t = tau[i]                              # current step
        s = tau[i - 1] if i > 0 else 0          # previous kept step (0 = clean end)
        a_t, a_s = alpha[t], alpha[s]           # a_s = 1.0 when s == 0
        eps = eps_theta(x, t)                   # one network call
        x0 = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)       # predicted x_0, eq. (3)
        sig = eta * np.sqrt((1 - a_s) / (1 - a_t)) * np.sqrt(1 - a_t / a_s)  # eq. (6)
        direction = np.sqrt(max(0.0, 1 - a_s - sig**2)) * eps  # toward x_t
        x = np.sqrt(a_s) * x0 + direction + sig * rng.standard_normal(shape)  # eq. (5)
    return x                                    # x_0: a sample

One network call per kept step, $\eta=0$ for deterministic sampling, the same weights you already trained. The comment "a_s = 1.0 when s == 0" records the convention $\alpha_0 := 1$, the clean end of the schedule.

How fast, and at what cost

The headline claim is that DDIM reaches the quality of a 1000-step model in 20 to 100 steps, a 10 to 50 times speed-up in wall-clock time, with no retraining. The figure plots FID (the Fréchet Inception Distance, a standard image-quality score where lower is better) against the number of steps, for the same trained model sampled three ways: DDIM ($\eta=0$), DDPM ($\eta=1$), and the larger-variance $\hat\sigma$. Drag the step count and read all three:

Figure 4 · quality versus number of steps
FID (log, lower is better) against sampling steps, one trained model. DDIM stays low and usable down to 10 steps; DDPM and the larger-variance σ̂ degrade fast when steps are few, though σ̂ is best of all at the full 1000. Drag S; toggle the dataset.

The numbers tell a clean story. On CIFAR-10 at just 10 steps, DDIM scores FID 13.36 while DDPM at the same budget is 41.07 and the larger-variance $\hat\sigma$ is a hopeless 367.43, because with so few steps its extra injected noise never averages out, and the leftover grain drives its FID up. On CelebA the crossover is easy to feel: the 100-step DDPM scores 13.93, essentially the same as the 20-step DDIM at 13.73, so DDIM reaches that quality in a fifth of the steps.

One number in the same table cuts the other way. At the full 1000 steps, the extra-noise $\hat\sigma$ sampler edges ahead, FID 3.17 against DDIM's 4.04 on CIFAR-10. When you can afford every step, a dash of stochasticity smooths the last bit of quality; DDIM's win is specifically the few-step regime, which is the regime anyone waiting on a sample actually cares about.

A latent space you can navigate

Determinism buys more than speed. Once $\sigma_t = 0$, the entire generation is a fixed function of the starting noise $\mathbf{x}_T$: same starting noise, same image, always. The paper checks this and finds something striking. Fix $\mathbf{x}_T$ and generate with 10 steps, then 100, then 1000: the high-level content, the face, the pose, the layout, stays the same across all of them, with only fine detail sharpening as you add steps. The number of steps is a quality knob; the starting noise is the image.

The figure makes the contrast concrete. Eight fixed starting points are each decoded to an image; drag the step count and watch where they land. Under DDIM the landings barely move, so $\mathbf{x}_T$ alone fixes the result. Flip to DDPM and the fresh noise injected at every step throws the landings around as you change the step count, so there the starting noise decides very little:

Figure 5 · same start, same image
Eight fixed noise starts, each decoded with the same model. The ring marks where a start lands with a 48-step run; the bright dot marks where it lands with the chosen S. Under DDIM the dots sit on their rings for any S. Flip to DDPM and they scatter, because fresh noise, not the start, decides the image.
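The same contrast can be reproduced without a trained network. The sketch below is a toy under stated assumptions: one-dimensional Gaussian data, for which the optimal noise prediction has a closed form (`eps_exact`), and the linear DDPM $\beta$ schedule. It starts a 20-step and a 200-step run from the same $\mathbf{x}_T$ and measures how far apart they land, once with $\eta = 0$ and once with $\eta = 1$. All names and constants are illustrative.

```python
import numpy as np

MU, S_DATA = 2.0, 0.5     # toy data distribution: x0 ~ N(MU, S_DATA^2)

def make_alpha(T=1000):
    betas = np.linspace(1e-4, 0.02, T)                  # assumed linear schedule
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def eps_exact(x, a_t):
    """Closed-form optimal noise prediction E[eps | x_t] for the Gaussian toy data."""
    var_t = a_t * S_DATA**2 + 1.0 - a_t
    return np.sqrt(1.0 - a_t) * (x - np.sqrt(a_t) * MU) / var_t

def sample(x_T, alpha, tau, eta, rng):
    """Update (5) over the kept steps tau, starting from the given x_T."""
    x = x_T.copy()
    for i in reversed(range(len(tau))):
        a_t = alpha[tau[i]]
        a_s = alpha[tau[i - 1]] if i > 0 else 1.0       # alpha_0 = 1 at the clean end
        eps = eps_exact(x, a_t)
        x0 = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
        sig = eta * np.sqrt((1 - a_s) / (1 - a_t)) * np.sqrt(1 - a_t / a_s)
        x = (np.sqrt(a_s) * x0 + np.sqrt(max(0.0, 1 - a_s - sig**2)) * eps
             + sig * rng.standard_normal(x.shape))
    return x

def linear_tau(T, S):
    return np.unique(np.linspace(1, T, S).round().astype(int))

alpha = make_alpha()
rng = np.random.default_rng(0)
x_T = rng.standard_normal(1000)            # 1000 fixed starting points

def gap(eta):
    """Mean |difference| between a 20-step and a 200-step run from the same x_T."""
    few = sample(x_T, alpha, linear_tau(1000, 20), eta, rng)
    many = sample(x_T, alpha, linear_tau(1000, 200), eta, rng)
    return np.mean(np.abs(few - many))

ddim_gap, ddpm_gap = gap(0.0), gap(1.0)
repeat_a = sample(x_T, alpha, linear_tau(1000, 20), 0.0, rng)
repeat_b = sample(x_T, alpha, linear_tau(1000, 20), 0.0, rng)
```

In this toy, `repeat_a` and `repeat_b` are identical, and `ddim_gap` comes out well below `ddpm_gap`: with $\eta = 0$ the step count only refines where a given start lands, while with $\eta = 1$ the injected noise decides the landing. A one-dimensional Gaussian is far from an image model, so treat this as an illustration of the mechanism, not of the paper's numbers.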

If $\mathbf{x}_T$ alone determines the image, it behaves like a latent code for that image, the way the latent of a GAN or a variational autoencoder does. Two consequences follow, and both are things a DDPM cannot do. You can interpolate: take two images' latent codes, blend them, and decode, and because the decode is smooth the result is a semantically meaningful blend rather than a ghosty pixel average. And you can, in principle, run the process backward to find the latent code of a given real image, which the next section turns into near-lossless reconstruction. A stochastic DDPM has no stable code to manipulate, because its output depends on a thousand fresh coin flips along the way.
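For the blending step, the paper's interpolation experiments use spherical rather than straight-line interpolation between the two codes. A minimal sketch of that blend, with random arrays standing in for real latent codes and a hypothetical helper name `slerp`:

```python
import numpy as np

def slerp(z0, z1, w):
    """Spherical interpolation between two latent codes, with w in [0, 1]."""
    a, b = z0.ravel(), z1.ravel()
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))    # angle between the codes
    if np.isclose(np.sin(theta), 0.0):                  # (anti)parallel: fall back to lerp
        return (1.0 - w) * z0 + w * z1
    return (np.sin((1.0 - w) * theta) * z0 + np.sin(w * theta) * z1) / np.sin(theta)

rng = np.random.default_rng(0)
z_a = rng.standard_normal((3, 32, 32))     # stand-in for the code x_T of image A
z_b = rng.standard_normal((3, 32, 32))     # stand-in for the code x_T of image B
path = [slerp(z_a, z_b, w) for w in np.linspace(0.0, 1.0, 5)]
# Each element of `path` would then be decoded with the deterministic sampler.
```

The reason to prefer the spherical blend is that a high-dimensional Gaussian code has a norm close to $\sqrt{d}$. A straight average of two nearly orthogonal codes has a much smaller norm, which is not a typical input for the decoder; the spherical blend keeps the norm roughly constant along the path.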

DDIM is really an ODE

A deterministic sampler tolerates big steps and stays invertible because it is a numerical solver for a differential equation. Rearrange the deterministic step (5) with $\sigma_t = 0$ and it lines up term for term with Euler's method, the simplest way to solve an ordinary differential equation (ODE): from the current point, take one small step along a velocity, repeat. Change variables to $\bar{\mathbf{x}} = \mathbf{x}/\sqrt{\alpha}$ and $\sigma = \sqrt{1-\alpha}/\sqrt{\alpha}$, and the update becomes Euler steps on

$$\mathrm{d}\bar{\mathbf{x}}(t) = \boldsymbol{\epsilon}_\theta^{(t)}\!\Big(\frac{\bar{\mathbf{x}}(t)}{\sqrt{\sigma^2+1}}\Big)\,\mathrm{d}\sigma(t) \tag{7}$$

The network's noise prediction is the velocity, and $\sigma$ plays the role of time. (The tidy identity $\sigma^2 + 1 = 1/\alpha$ means $\bar{\mathbf{x}}/\sqrt{\sigma^2+1} = \mathbf{x}_t$, so the network is always handed an ordinary noisy image, exactly what it was trained on.) This is the same observation behind Neural ODEs: a chain of residual-style updates is Euler's method applied to some hidden dynamics, and taking more steps just integrates the same curve more finely.

The figure shows it directly. The faint curve is the true continuous path linking a clean image to its noise code; the bright polyline is the $S$-step DDIM approximation. Few steps cut the corners; more steps melt onto the curve. Because the map is an ODE, it runs both ways: encode an image to its code, decode the code back, and encode-then-decode returns to the start, with an error that shrinks as you add steps. On CIFAR-10 the per-pixel reconstruction error falls from 0.014 at 10 steps to 0.0001 at 1000, which is the numerical signature of a genuine, invertible flow:

Figure 6 · Euler steps on an invertible flow
The faint curve is the true ODE path from an image x₀ to its noise code x_T; the bright polyline is the S-step DDIM approximation. Few steps cut corners; more track the curve. Toggle encode (x₀→x_T) and decode (x_T→x₀); the round-trip error shrinks with S.
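The encode-then-decode round trip can be sketched on a toy problem. This is not the CIFAR-10 experiment: it assumes one-dimensional Gaussian data, for which the optimal noise prediction is known in closed form, and the linear DDPM $\beta$ schedule. The same deterministic step (5) with $\sigma_t = 0$ is applied toward noisier levels to encode and toward cleaner levels to decode. All names are illustrative.

```python
import numpy as np

MU, S_DATA = 2.0, 0.5     # toy data distribution: x0 ~ N(MU, S_DATA^2)

def eps_exact(x, a_t):
    """Closed-form optimal noise prediction for the Gaussian toy data."""
    var_t = a_t * S_DATA**2 + 1.0 - a_t
    return np.sqrt(1.0 - a_t) * (x - np.sqrt(a_t) * MU) / var_t

def ddim_move(x, a_from, a_to):
    """One deterministic step (5) with sigma = 0, usable in either direction."""
    eps = eps_exact(x, a_from)
    x0 = (x - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0 + np.sqrt(1.0 - a_to) * eps

def round_trip_error(x0, alpha, S):
    """Encode x_0 -> x_T in S steps, decode back, return mean absolute error."""
    levels = np.unique(np.linspace(0, len(alpha) - 1, S + 1).round().astype(int))
    pairs = list(zip(levels[:-1], levels[1:]))          # (cleaner, noisier) neighbours
    x = np.asarray(x0, dtype=np.float64)
    for lo, hi in pairs:                                # encode: toward noise
        x = ddim_move(x, alpha[lo], alpha[hi])
    for lo, hi in reversed(pairs):                      # decode: back toward data
        x = ddim_move(x, alpha[hi], alpha[lo])
    return float(np.mean(np.abs(x - x0)))

betas = np.linspace(1e-4, 0.02, 1000)                   # assumed linear schedule
alpha = np.concatenate([[1.0], np.cumprod(1.0 - betas)])
rng = np.random.default_rng(0)
x0 = MU + S_DATA * rng.standard_normal(100)
errors = {S: round_trip_error(x0, alpha, S) for S in (10, 100, 1000)}
```

The error falls as $S$ goes from 10 to 100 to 1000, the same qualitative trend as the paper's reconstruction table, though the toy's absolute values are not comparable to the CIFAR-10 figures.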

One nuance for anyone reading alongside the score-SDE paper. DDIM's ODE (7) is written with respect to $\mathrm{d}\sigma$ and carries no factor of a half, while the probability-flow ODE of the variance-exploding SDE is usually written with respect to $\mathrm{d}t$ and carries a $\tfrac{1}{2}g^2$. These are the same continuous ODE: the change of variables $\mathrm{d}\sigma^2/\mathrm{d}t = 2\sigma\,\mathrm{d}\sigma/\mathrm{d}t$ cancels the half. What differs is the discretization: DDIM takes its Euler steps in $\sigma$, the score-SDE sampler takes them in $t$, and with few steps those choices land in different places. DDIM's spacing happens to track the curve better on a coarse budget.
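The cancellation takes three lines. This is a sketch in the notation of (7), assuming the variance-exploding SDE in its usual form with $g(t)^2 = \mathrm{d}\sigma^2(t)/\mathrm{d}t$, and using the standard identification of the score with the noise prediction, $\nabla_{\bar{\mathbf{x}}}\log p_{\sigma}(\bar{\mathbf{x}}) \approx -\boldsymbol{\epsilon}_\theta/\sigma$, which follows from $\bar{\mathbf{x}} = \mathbf{x}_0 + \sigma\boldsymbol{\epsilon}$:

```latex
\begin{aligned}
\frac{\mathrm{d}\bar{\mathbf{x}}}{\mathrm{d}t}
  &= -\tfrac{1}{2}\,g(t)^2\,\nabla_{\bar{\mathbf{x}}}\log p_{\sigma(t)}(\bar{\mathbf{x}})
  && \text{(probability-flow ODE, } g(t)^2 = \mathrm{d}\sigma^2/\mathrm{d}t\text{)} \\
  &= -\tfrac{1}{2}\cdot 2\sigma\,\frac{\mathrm{d}\sigma}{\mathrm{d}t}\cdot\Big(-\frac{\boldsymbol{\epsilon}_\theta}{\sigma}\Big)
  && \text{(chain rule; score} \approx -\boldsymbol{\epsilon}_\theta/\sigma\text{)} \\
  &= \boldsymbol{\epsilon}_\theta\,\frac{\mathrm{d}\sigma}{\mathrm{d}t}
  \quad\Longrightarrow\quad
  \mathrm{d}\bar{\mathbf{x}} = \boldsymbol{\epsilon}_\theta\,\mathrm{d}\sigma .
\end{aligned}
```

The last line is equation (7): the factor $2\sigma$ from differentiating $\sigma^2$ absorbs both the $\tfrac{1}{2}$ and the $1/\sigma$ in the score.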

One substitution carried all of this. Because the training loss only measured single-level denoising, the thousand-step Markov chain was never load-bearing, and swapping it for a deterministic ODE costs nothing at training time and buys a fast, invertible sampler. The same trained diffusion model you already have is, underneath, an ODE you can integrate as coarsely as you dare, and DDIM turns that reading into a sampler.

Provenance: verified against primary literature
DDPM (Ho et al., 2020): The forward process, ε-prediction network, and the L₁ objective DDIM reuses unchanged.
NCSN / score matching: Same denoising objective; the score view of the same network (Vincent 2011; Song & Ermon 2019).
Score-SDE (Song et al., 2021): The probability-flow ODE; DDIM’s ODE is its variance-exploding special case.
Neural ODEs (Chen et al., 2018): A chain of residual updates read as Euler integration of a hidden ODE.
Caveat: The paper's α_t is the CUMULATIVE product, what DDPM and most tutorials call ᾱ_t (alpha-bar). DDPM's per-step α = 1−β maps to the ratio α_t/α_{t-1} here. The paper only clarifies this in an appendix, so we front-load it; every equation is in the cumulative-α convention.

Questions you might still have

Do I need to retrain my diffusion model to use DDIM?
No. DDIM is built precisely so you do not have to. It uses the exact same trained network and the same weights; only the sampling loop changes. A theorem shows every member of the σ-family shares the DDPM objective up to positive per-level loss weights and a constant, so the model trained for DDPM is already (to a very good approximation, given the shared network) optimal for DDIM.

Is DDIM just DDPM with fewer steps?
No, they are different samplers. DDPM (η=1) injects fresh noise at every step and degrades fast when steps are few. DDIM (η=0) is deterministic: it reuses the noise it already has, so big step-skips only cost a little accuracy. Fewer steps is a separate lever both can pull; determinism is what lets DDIM pull it hard.

Why is DDIM called "implicit"?
In the technical sense of an implicit probabilistic model: it generates samples through a fixed procedure from a latent variable (the starting noise x_T), like a GAN or a normalizing flow, rather than through an explicit step-by-step density. The sampling formula itself is perfectly explicit.

Is DDIM always better than DDPM?
Only when steps are scarce, which is the case that matters in practice. At the full 1000 steps a larger-variance DDPM (σ̂) actually edges DDIM out on FID (3.17 vs 4.04 on CIFAR-10). DDIM wins the few-step regime and gives you a stable latent code; the extra noise wins the last sliver of quality when compute is unlimited.

What does the deterministic version let me do that DDPM cannot?
Because the same x_T always decodes to the same image, x_T is a genuine latent code. You get consistency (10-step and 1000-step samples share high-level content), semantically meaningful interpolation between two images’ codes, and near-lossless encode-then-reconstruct, since the sampler is an invertible ODE.

Footnotes & further reading

  1. The paper: Song, Meng, Ermon, Denoising Diffusion Implicit Models (Stanford, ICLR 2021). Code.
  2. The model and objective DDIM reuses: Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (our explainer: DDPM). The α/ᾱ notation map is spelled out in DDIM's Appendix C.2.
  3. The score / SDE / probability-flow ODE view DDIM connects to: Song et al., Score-Based Generative Modeling through SDEs (our explainer: Score-SDE), and the denoising-score-matching lineage in NCSN.
  4. Residual updates as Euler integration of an ODE: Chen et al., Neural Ordinary Differential Equations (our explainer: Neural ODEs).
  5. Table 1 (FID vs steps) and Table 2 (reconstruction error) are transcribed from the paper. Table 1 uses one model per dataset trained at T=1000; only the sampler (τ and η) changes. Reconstruction error at S steps: 0.014, 0.0065, 0.0023, 0.0009, 0.0004, 0.0001, 0.0001 for S = 10, 20, 50, 100, 200, 500, 1000.
  6. For the broader map of how DDPM, score matching, and SDEs fit together, see our diffusion tutorial explainer, and flow matching for a related deterministic-generation idea.