Verified · arXiv:2207.12598 · 28 min
Diffusion · Conditional generation

Classifier-Free Diffusion Guidance

Run a diffusion model with the label and without it, then amplify the difference.

Classifier guidance let diffusion models trade diversity for fidelity, but it needed a second network trained on noisy images. Classifier-free guidance gets the same control from the generator alone, at the cost of one extra line of code.

Explaining the paper: Classifier-Free Diffusion Guidance · Ho, Salimans · Google Research · 2022 · arXiv:2207.12598

Every text-to-image model has a guidance slider. This is the paper that put it there, and it runs on two forward passes and a subtraction.

Some generative models come with a free dial for trading variety against quality. Sample near the center of what the model has learned and you get clean, typical, slightly boring outputs. Reach for the rare edges and you get surprising ones that are also more often malformed. The original GAN-era models exposed this dial at sampling time with no retraining: BigGAN resamples any latent coordinate that lands too far out, and Glow scales its noise down by a temperature. Turn the dial one way for a faithful album of stereotypical samples, the other way for a diverse but messier one.

Diffusion models seemed to have no such dial. The obvious moves do not work: scaling up the model's predicted score, or shrinking the noise added back during sampling, both just produce blurry, low-quality images (this is a finding of Dhariwal & Nichol, not something the present paper re-derives). So for a while a diffusion model gave you whatever diversity it gave you, with no dial to turn.

Dhariwal & Nichol's classifier guidance fixed that by steering each denoising step with the gradient of a separate image classifier, nudging every sample toward something the classifier recognizes more confidently. It worked, at the price of an awkward dependency: a whole second network, one that has to be trained on the half-noised images the diffusion model produces mid-generation, and whose gradient also makes the procedure look unsettlingly like an adversarial attack on that classifier.

Ho and Salimans show you can have the dial with no classifier at all. Run the diffusion model twice, once told the label and once not, and amplify the gap between the two predictions. One extra line at training time (sometimes hide the label), one extra line at sampling time (mix the two predictions), no second network, no extra parameters. This is the guidance_scale knob in Stable Diffusion and in every text-to-image model that followed it.

A few ideas explain it: what a diffusion model actually predicts, why that prediction is a score, what classifier guidance does with the score, and how Bayes' rule lets you throw the classifier away. Each is a short step on its own.

The missing fidelity knob

Pin down what the dial controls, because the rest of the paper is about rebuilding it. A generative model defines a probability distribution over outputs. Most of the probability mass sits on typical, unambiguous examples; the tails hold the unusual ones. Two qualities are in tension. Fidelity is how convincing and recognizable each individual sample is. Diversity is how much of the real variety the samples cover. Sampling honestly from the model gives you its natural mix of both. The dial lets you push the samples toward the high-probability core, buying fidelity by giving up the tails.

BigGAN and Glow could do this because both map a single noise vector straight to an output, so shrinking the range of that noise tightens the outputs around the typical ones. A diffusion model has no single noise vector to shrink. It builds a sample through a long chain of small denoising steps, and there is no one place to turn the dial. That is why the naive analogues fail, and why guidance had to be a genuinely different idea rather than a smaller version of truncation.

A diffusion model in one screen

Enough diffusion to read everything below. Start with a clean image $x$. The forward process gradually destroys it by mixing in Gaussian noise. This paper uses the variance-preserving convention (the one from the DDPM line and from Variational Diffusion Models), where the signal is shrunk by $\alpha$ as noise of scale $\sigma$ is added:

$$z_\lambda = \alpha_\lambda\, x + \sigma_\lambda\, \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}), \qquad \alpha_\lambda^2 = \tfrac{1}{1+e^{-\lambda}}, \quad \sigma_\lambda^2 = 1-\alpha_\lambda^2 \tag{1}$$

The two coefficients are coupled so that $\alpha_\lambda^2 + \sigma_\lambda^2 = 1$ exactly. The total variance stays pinned near one no matter how much noise you add, which is what "variance-preserving" means. (This is the opposite bookkeeping from the variance-exploding convention used on our geometry-of-noise and score-SDE pages, where the signal is left alone and the noise variance grows without bound. Same family of models, and you have to keep the two conventions apart, because the denoiser carries an $\alpha_\lambda$ factor here that is absent there.)

The single dial from clean to noise is $\lambda$, and it is worth slowing down on, because its direction is the reverse of what you might expect. It is the log signal-to-noise ratio, $\lambda = \log(\alpha_\lambda^2/\sigma_\lambda^2)$, the log of how much the surviving signal outweighs the noise. High $\lambda$ means clean; very negative $\lambda$ means almost pure noise. So the forward, destructive process runs toward decreasing $\lambda$, and generation runs the other way, from $\lambda_{\min}$ up to $\lambda_{\max}$ (the paper uses $-20$ to $20$). When a sampling loop below counts upward, the image is getting cleaner, not noisier.
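The log-SNR bookkeeping is easy to check numerically. The sketch below is a minimal illustration, not the paper's code: the helper name `alpha_sigma` is made up for this page, and it computes $\sigma_\lambda^2$ as a sigmoid of $-\lambda$ (algebraically the same as $1-\alpha_\lambda^2$) purely to avoid round-off at high $\lambda$.

```python
import math

def alpha_sigma(lam):
    """Variance-preserving coefficients (alpha, sigma) at log-SNR `lam`.

    Hypothetical helper for illustration; follows Eq. (1) of the text.
    """
    alpha_sq = 1.0 / (1.0 + math.exp(-lam))  # sigmoid(lam)
    sigma_sq = 1.0 / (1.0 + math.exp(lam))   # sigmoid(-lam), equal to 1 - alpha_sq
    return math.sqrt(alpha_sq), math.sqrt(sigma_sq)

# Walk from almost pure noise to almost clean, the direction sampling runs.
for lam in (-20.0, -5.0, 0.0, 5.0, 20.0):
    a, s = alpha_sigma(lam)
    recovered = math.log((a * a) / (s * s))  # should reproduce lam
    print(f"lam={lam:+6.1f}  alpha={a:.6f}  sigma={s:.6f}  "
          f"alpha^2+sigma^2={a*a + s*s:.6f}  log-SNR={recovered:+.3f}")
```

Reading the printout from top to bottom, $\alpha$ grows and $\sigma$ shrinks: increasing $\lambda$ is the cleaning direction.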

The network does not predict the clean image directly. It predicts the noise. Show it a noised $z_\lambda$, and it returns a guess $\boldsymbol{\epsilon}_\theta(z_\lambda)$ of the $\boldsymbol{\epsilon}$ that was mixed in. Because $z_\lambda = \alpha_\lambda x + \sigma_\lambda \boldsymbol{\epsilon}$ is a linear equation, knowing the noise is the same as knowing the clean image; you solve for one from the other:

$$x_\theta(z_\lambda) = \frac{z_\lambda - \sigma_\lambda\,\boldsymbol{\epsilon}_\theta(z_\lambda)}{\alpha_\lambda}$$

Predicting the noise and predicting the image are the same parameterization seen from two sides, related by that invertible map. (They are not identical as training targets, since each weights the noise levels differently and DDPM found noise-prediction gave better samples, but as objects they carry the same information.) One caveat to file away: that recovered $x_\theta$ is the average of every clean image that could have produced this noisy one, not the single true original, so at high noise it looks like a blur of many images. That is a correct conditional mean, not a failure.
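The invertible map between the two parameterizations can be spelled out in a few lines. This is a scalar sketch under the variance-preserving convention above; `x_from_eps` and `eps_from_x` are illustrative names, not functions from the paper or any library.

```python
import math

def alpha_sigma(lam):
    """(alpha, sigma) at log-SNR `lam`, variance-preserving convention."""
    return (math.sqrt(1.0 / (1.0 + math.exp(-lam))),
            math.sqrt(1.0 / (1.0 + math.exp(lam))))

def x_from_eps(z, eps, lam):
    """Clean-image estimate implied by a noise estimate: (z - sigma*eps)/alpha."""
    a, s = alpha_sigma(lam)
    return (z - s * eps) / a

def eps_from_x(z, x, lam):
    """Noise estimate implied by a clean-image estimate: (z - alpha*x)/sigma."""
    a, s = alpha_sigma(lam)
    return (z - a * x) / s

# Round trip on one scalar "pixel": noise a known x with a known eps,
# then recover each quantity from the other.
lam, x_true, eps_true = -2.0, 0.7, -1.3
a, s = alpha_sigma(lam)
z = a * x_true + s * eps_true
print("recovered x  :", x_from_eps(z, eps_true, lam))
print("recovered eps:", eps_from_x(z, x_true, lam))
```

With the true noise plugged in, the true image comes back, and vice versa. A trained network only supplies an estimate of the noise, so the recovered image is an estimate too.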

Training is then the plainest thing imaginable. Add known noise, ask the network to name it, penalize the squared error:

$$\mathbb{E}_{\boldsymbol{\epsilon},\,\lambda}\Big[\, \big\lVert \boldsymbol{\epsilon}_\theta(z_\lambda) - \boldsymbol{\epsilon} \big\rVert_2^2 \,\Big], \qquad \lambda \sim p(\lambda) \tag{5}$$

One mean-squared error on the noise, averaged over images and over noise levels drawn from a schedule $p(\lambda)$. (With a uniform $p(\lambda)$ this is proportional to a variational bound on likelihood; the schedule the paper actually uses is a deliberate reweighting tuned for sample quality, so it is a weighted bound, not the exact one.) For conditional generation, the only change is that the network also gets the conditioning $c$, a class label or a text prompt, written $\boldsymbol{\epsilon}_\theta(z_\lambda, c)$.

Noise prediction is the score

The noise the network predicts is, up to a known factor, the score of the noised data: the gradient of its log-density, $\nabla_{z}\log p(z_\lambda)$, the direction in which the noisy data becomes more probable. That identity is the hinge everything below turns on. The two are related by

$$\boldsymbol{\epsilon}_\theta(z_\lambda) \;\approx\; -\,\sigma_\lambda\, \nabla_{z}\log p(z_\lambda) \qquad\Longleftrightarrow\qquad \nabla_{z}\log p(z_\lambda) \;\approx\; -\,\frac{\boldsymbol{\epsilon}_\theta(z_\lambda)}{\sigma_\lambda}$$

Both halves of that relation matter. The minus sign says the score points the opposite way from the noise: the noise points away from the data, so undoing it (the score) points back toward the data. The single power of $\sigma_\lambda$ sets the length. This identity is denoising score matching, the same fact that the score-based view of diffusion is built on, and it comes straight from Tweedie's formula. For a Gaussian-noised point, the best least-squares guess of the clean signal is the posterior mean

$$\mathbb{E}[x \mid z_\lambda] = \frac{z_\lambda + \sigma_\lambda^2\, \nabla_{z}\log p(z_\lambda)}{\alpha_\lambda}$$

the noisy point nudged along the score by $\sigma_\lambda^2$, then rescaled to undo the signal shrink. That posterior mean is exactly the $x_\theta$ the network recovers, and feeding it back through $x_\theta = (z_\lambda - \sigma_\lambda\boldsymbol{\epsilon}_\theta)/\alpha_\lambda$ forces the noise prediction to be $-\sigma_\lambda\,\nabla_{z}\log p$. So "name the noise" and "point toward the data" are the same request in different units.
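For data that is itself Gaussian, every quantity in this section has a closed form, so the identity can be checked directly. The sketch below assumes one-dimensional data $x\sim\mathcal{N}(\mu, s^2)$; the toy setup and the function name `gaussian_toy` are illustrative choices, not something taken from the paper.

```python
import math

def alpha_sigma(lam):
    """(alpha, sigma) at log-SNR `lam`, variance-preserving convention."""
    return (math.sqrt(1.0 / (1.0 + math.exp(-lam))),
            math.sqrt(1.0 / (1.0 + math.exp(lam))))

def gaussian_toy(z, lam, mu=1.0, s=0.5):
    """Closed forms for 1-D data x ~ N(mu, s^2) noised to z = alpha*x + sigma*eps."""
    a, sig = alpha_sigma(lam)
    v = a * a * s * s + sig * sig                  # variance of z
    score = -(z - a * mu) / v                      # d/dz log p(z)
    post_mean = mu + a * s * s * (z - a * mu) / v  # E[x | z] by Gaussian conditioning
    eps_star = sig * (z - a * mu) / v              # E[eps | z], the ideal noise prediction
    return a, sig, score, post_mean, eps_star

z, lam = 0.3, -1.0
a, sig, score, post_mean, eps_star = gaussian_toy(z, lam)

tweedie = (z + sig * sig * score) / a    # posterior mean via the score
via_eps = (z - sig * eps_star) / a       # posterior mean via the noise prediction
print("E[x|z] direct      :", post_mean)
print("E[x|z] via Tweedie :", tweedie)
print("E[x|z] via eps     :", via_eps)
print("eps* vs -sigma*score:", eps_star, -sig * score)
```

The three posterior means agree, and the ideal noise prediction equals $-\sigma_\lambda$ times the score, which is the relation the rest of the page leans on.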

Why does this matter for guidance? Because it converts a hard problem about probability densities into easy arithmetic on the vectors the network already outputs. Anything you would want to do to the score (combine two of them, weight one against another, push along a difference) you can do directly to the $\boldsymbol{\epsilon}$ vectors, then convert back. Guidance is going to be exactly such an arithmetic move.

Plant one caveat now, because it returns as the paper's most honest point. The relation above holds exactly only for the ideal score. A real $\boldsymbol{\epsilon}_\theta$ is an unconstrained neural network, and there is no guarantee its output is the gradient of any function at all. A true gradient field cannot swirl; an arbitrary network's output can. So $\boldsymbol{\epsilon}_\theta$ only approximates a score, and that gap is what later separates classifier-free guidance from the thing it imitates.

Classifier guidance, and its cost

With the score in hand, classifier guidance is easy to state. You have a conditional diffusion model, whose score points toward plausible images of class $c$. You also train a classifier $p_\theta(c \mid z_\lambda)$ that reads a noised image and outputs how strongly it looks like class $c$. Then you add a multiple of the classifier's gradient to the score, so each step is pulled not just toward "a plausible image" but toward "an image this classifier is sure is a $c$":

$$\tilde{\boldsymbol{\epsilon}}_\theta(z_\lambda, c) = \boldsymbol{\epsilon}_\theta(z_\lambda, c) \;-\; w\,\sigma_\lambda\, \nabla_{z}\log p_\theta(c \mid z_\lambda)$$

The weight $w$ is the dial. Following the score-to-noise relation, this is sampling from a tilted distribution that up-weights images the classifier is confident about:

$$\tilde{p}_\theta(z_\lambda \mid c) \;\propto\; p_\theta(z_\lambda \mid c)\; p_\theta(c \mid z_\lambda)^{w}$$

The exponent $w$ is the strength, and it concentrates mass through what a power does to a probability. Where the classifier is unsure, in the territory two classes share, $p(c\mid z)$ is middling and $p(c\mid z)^{w}$ shrinks fast; where it is certain, $p(c\mid z)$ is near one and the factor barely moves. So raising $w$ thins each class out of the contested middle and leaves it on its confident core, away from the other classes. The figure shows this on the paper's own toy, three classes that are plain Gaussian blobs. Slide the weight up and watch each blob pull in and pull apart from the others. The leftmost setting, $w=0$, is the unguided mixture; the far right is the degenerate end where each class has collapsed to a tight spot and almost all the diversity is gone.

Figure 1 · what guidance does to the distribution
Three classes, each a Gaussian blob, tinted teal, amber, and violet. At $w=0$ they overlap into the unguided marginal. Raise the weight and each guided conditional $p(z\mid c)\,p(c\mid z)^w$ concentrates and pushes its mass away from the others. Fidelity per sample up, diversity down. This is the paper's Figure 2, made live.
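The tilt can be reproduced on a grid without any diffusion model at all. The sketch below is a simplified stand-in for the figure: it assumes two one-dimensional Gaussian classes at $\pm 1$ rather than the paper's three two-dimensional blobs, and `tilted_stats` is an illustrative name. It evaluates $p(z\mid c)\,p(c\mid z)^w$ on a grid, normalizes it, and reports the mean and standard deviation.

```python
import math

def normal_pdf(z, mu, s):
    return math.exp(-0.5 * ((z - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def tilted_stats(w, target=1.0, other=-1.0, s=1.0, lo=-8.0, hi=8.0, n=1601):
    """Mean and std of the density proportional to p(z|c) * p(c|z)^w on a grid.

    Toy assumption: two equally likely 1-D Gaussian classes with shared std `s`.
    """
    zs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    dens = []
    for z in zs:
        p_target = normal_pdf(z, target, s)
        p_other = normal_pdf(z, other, s)
        posterior = p_target / (p_target + p_other)  # p(c|z) by Bayes' rule
        dens.append(p_target * posterior ** w)       # tilted, unnormalized
    total = sum(dens)
    mean = sum(z * d for z, d in zip(zs, dens)) / total
    var = sum((z - mean) ** 2 * d for z, d in zip(zs, dens)) / total
    return mean, math.sqrt(var)

for w in (0.0, 1.0, 3.0, 8.0):
    mean, std = tilted_stats(w)
    print(f"w={w:4.1f}  mean={mean:.3f}  std={std:.3f}")
```

As $w$ grows the mean moves away from the competing class and the spread narrows, the one-dimensional version of each blob pulling in and pulling apart.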

By turning $w$ up, Dhariwal & Nichol could trade their two quality metrics against each other just like BigGAN's truncation. Everything awkward about the method lives in that extra classifier: it has to read noisy images, so it cannot be a standard pre-trained model; it is a second network to build and store; and steering the generator along its gradient is structurally an adversarial attack on it, which clouds whether the classifier-based metrics are really being earned. (You cannot dodge the noisy-data requirement by denoising first and classifying the recovered $x_\theta$: at high noise that is the blurred average from earlier, too washed out to label reliably, and its gradient would not match the actual noised point being sampled.) The paper's goal is to keep the dial and delete the classifier.

Bayes' rule deletes the classifier

The classifier only ever appears through its gradient, $\nabla_{z}\log p(c\mid z_\lambda)$, so that gradient is the only thing that has to be reproduced. Bayes' rule writes the classifier in terms of the generative model: $p(c\mid z) = p(z\mid c)\,p(c)/p(z)$. Take the log and the gradient in $z$. The prior $p(c)$ does not depend on $z$, so its gradient is zero and it drops out cleanly, leaving

$$\nabla_{z}\log p(c\mid z_\lambda) = \nabla_{z}\log p(z_\lambda \mid c) \;-\; \nabla_{z}\log p(z_\lambda)$$

Read that in plain terms: the classifier gradient is the conditional score minus the unconditional score. The direction that makes an image look more like class $c$ is precisely that difference, the class-c score minus the label-free one. You do not need a classifier to get it. You need two scores, and you already know how to get scores: they are noise predictions. Converting the difference of scores into a difference of noise predictions through $\nabla\log p = -\boldsymbol{\epsilon}/\sigma$ gives, for the ideal scores,

$$\nabla_{z}\log p^{\,i}(c\mid z_\lambda) = -\frac{1}{\sigma_\lambda}\big[\,\boldsymbol{\epsilon}^{*}(z_\lambda, c) - \boldsymbol{\epsilon}^{*}(z_\lambda)\,\big]$$

where $\boldsymbol{\epsilon}^{*}$ denotes the noise predictions that correspond to the exact scores and $p^{\,i}(c\mid z) \propto p(z\mid c)/p(z)$ is the classifier you get for free by inverting the generator. (Hold onto the star. It marks the place where this stops being exactly true once real networks are involved, a point taken up in the later section on why this is not really a classifier.)
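The Bayes identity is simple enough to verify with finite differences. The sketch below assumes a one-dimensional mixture of three Gaussian classes with deliberately unequal priors (toy numbers chosen for this page, not from the paper) and compares a numerical derivative of $\log p(c\mid z)$ with the conditional score minus the unconditional score.

```python
import math

# Toy assumption: three 1-D Gaussian classes with shared std and unequal priors.
MEANS = (-2.0, 0.0, 2.0)
PRIORS = (0.2, 0.5, 0.3)
STD = 1.0

def log_pdf(z, mu):
    return -0.5 * ((z - mu) / STD) ** 2 - math.log(STD * math.sqrt(2.0 * math.pi))

def log_posterior(c, z):
    """log p(c|z) = log p(z|c) + log p(c) - log p(z), via a stable log-sum-exp."""
    joint = [math.log(p) + log_pdf(z, m) for p, m in zip(PRIORS, MEANS)]
    top = max(joint)
    log_pz = top + math.log(sum(math.exp(j - top) for j in joint))
    return joint[c] - log_pz

def cond_score(c, z):
    """d/dz log p(z|c) for a Gaussian class."""
    return -(z - MEANS[c]) / STD ** 2

def uncond_score(z):
    """d/dz log p(z): posterior-weighted average of the class scores."""
    return sum(math.exp(log_posterior(k, z)) * cond_score(k, z)
               for k in range(len(MEANS)))

def classifier_grad_fd(c, z, h=1e-5):
    """Central finite difference of log p(c|z) with respect to z."""
    return (log_posterior(c, z + h) - log_posterior(c, z - h)) / (2.0 * h)

z, c = 0.7, 2
print("finite-difference classifier gradient:", classifier_grad_fd(c, z))
print("conditional minus unconditional score:", cond_score(c, z) - uncond_score(z))
```

The priors are unequal on purpose: they shape $p(z)$ and $p(c\mid z)$, yet the two printed numbers still match, because the $\log p(c)$ term has no $z$ dependence and vanishes under the gradient.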

Classifier-free guidance

Now put that implicit-classifier gradient back into classifier guidance. The classifier term was $-w\sigma_\lambda\nabla_{z}\log p(c\mid z)$; substituting the difference of scores, the $\sigma_\lambda$ cancels against the $-1/\sigma_\lambda$ and the classifier turns into a difference of noise predictions:

$$\tilde{\boldsymbol{\epsilon}}_\theta = \boldsymbol{\epsilon}_\theta(z_\lambda, c) - w\sigma_\lambda\Big(\!-\tfrac{1}{\sigma_\lambda}\big[\boldsymbol{\epsilon}_\theta(z_\lambda,c) - \boldsymbol{\epsilon}_\theta(z_\lambda)\big]\!\Big) = (1+w)\,\boldsymbol{\epsilon}_\theta(z_\lambda, c) \;-\; w\,\boldsymbol{\epsilon}_\theta(z_\lambda) \tag{6}$$

Both predictions, the conditional $\boldsymbol{\epsilon}_\theta(z_\lambda, c)$ and the unconditional $\boldsymbol{\epsilon}_\theta(z_\lambda)$, come from the same network, run with the label present or hidden. To see what the combination does, regroup it two ways:

$$\tilde{\boldsymbol{\epsilon}}_\theta = \underbrace{\boldsymbol{\epsilon}_\theta(z,c) + w\big[\boldsymbol{\epsilon}_\theta(z,c) - \boldsymbol{\epsilon}_\theta(z)\big]}_{\text{amplify the difference}} = \underbrace{\boldsymbol{\epsilon}_\theta(z) + (1+w)\big[\boldsymbol{\epsilon}_\theta(z,c) - \boldsymbol{\epsilon}_\theta(z)\big]}_{\text{start unconditional, overshoot}}$$

The first form is the one to remember: take the conditional prediction and push it further in the direction that the label changed it. The second form shows the geometry, an extrapolation. Start at the unconditional guess, draw the arrow to the conditional guess, then keep going past it, by a factor $1+w$. Guidance does not blend the two predictions; it overshoots beyond the conditional one. At $w=0$ the overshoot is zero and you are back to ordinary conditional sampling.

A number makes the overshoot concrete. Suppose at one coordinate the conditional prediction is $\boldsymbol{\epsilon}_\theta(z,c) = 0.20$ and the unconditional one is $\boldsymbol{\epsilon}_\theta(z) = 0.50$. The difference is $-0.30$. At a typical $w = 3$, the guided value is $0.20 + 3(-0.30) = -0.70$, well outside the interval between the two predictions. The guided sample is being pushed harder toward "clearly class $c$" than either model alone asked for.
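The same arithmetic in code, as a minimal sketch. The three function names are illustrative; each writes Eq. (6) in one of the groupings above so that their agreement can be checked on the worked numbers.

```python
def guide_eq6(e_c, e_u, w):
    """Eq. (6) as printed: (1 + w) * conditional - w * unconditional."""
    return (1.0 + w) * e_c - w * e_u

def guide_amplify(e_c, e_u, w):
    """Start at the conditional prediction and amplify the difference."""
    return e_c + w * (e_c - e_u)

def guide_extrapolate(e_c, e_u, w):
    """Start at the unconditional prediction and overshoot by a factor 1 + w."""
    return e_u + (1.0 + w) * (e_c - e_u)

e_c, e_u, w = 0.20, 0.50, 3.0
for name, fn in (("eq6", guide_eq6),
                 ("amplify", guide_amplify),
                 ("extrapolate", guide_extrapolate)):
    print(f"{name:12s} {fn(e_c, e_u, w):+.2f}")

# w = 0 recovers the plain conditional prediction.
print(f"{'w = 0':12s} {guide_amplify(e_c, e_u, 0.0):+.2f}")
```

All three groupings land on the same guided value, which lies outside the interval spanned by the two inputs.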

Drag the figure. The probe sits in the same three-class world; the gray arrow is the unconditional score, the teal arrow is the conditional score for the chosen class, and the bold bright arrow is the guided combination. The amber dashed segment is the guidance vector $\boldsymbol{\epsilon}_\theta(z,c) - \boldsymbol{\epsilon}_\theta(z)$, the part being amplified. At $w=0$ the bold arrow lies on the teal one; raise $w$ and it extends along the amber direction, away from the other classes. Switch the target class to watch the guidance vector swing to point at a different blob.

Figure 2 · guidance as extrapolation
At the draggable probe, three directions on one shared scale: the unconditional score, the conditional score, and the bold guided combination $(1{+}w)\,\boldsymbol{\epsilon}(z,c) - w\,\boldsymbol{\epsilon}(z)$. The amber dashed segment is the difference being amplified. Raising $w$ overshoots past the conditional arrow. We draw the score (the pull toward data); the noise prediction is its negative, so the same mix applies.

Getting both predictions out of one network is the part that costs almost nothing. During training you simply replace the label with a null token $\varnothing$ some fraction $p_\text{uncond}$ of the time. When the label is present the network learns $\boldsymbol{\epsilon}_\theta(z, c)$; when it is $\varnothing$ the same weights learn the unconditional $\boldsymbol{\epsilon}_\theta(z) = \boldsymbol{\epsilon}_\theta(z, c{=}\varnothing)$. No second model, no extra parameters, one shared set of weights doing both jobs. The figure shows a batch with the label dropped at rate $p_\text{uncond}$; slide it and watch the share of $\varnothing$ tiles track the dial, and watch the honest endpoints: at $0$ the network never learns the unconditional prediction and guidance is impossible, at $1$ it never sees a label.

Figure 3 · one network learns both predictions
Joint training (Algorithm 1). Each example keeps its label or, with probability $p_\text{uncond}$, is replaced by the $\varnothing$ token and trains the model unconditionally. One shared network learns both ε(z,c) and ε(z). The paper finds $p_\text{uncond}\in\{0.1, 0.2\}$ works best; only a small slice of capacity needs to go to the unconditional task.
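Conditioning dropout on its own is a one-liner, sketched here for a batch of integer labels. Using `None` as the null token and the name `drop_labels` are choices made for this illustration; a real model would typically reserve an extra embedding index or an empty prompt for $\varnothing$.

```python
import random

NULL = None  # stand-in for the null token (an assumption of this sketch)

def drop_labels(labels, p_uncond, rng):
    """Replace each label with NULL independently with probability p_uncond."""
    return [NULL if rng.random() < p_uncond else c for c in labels]

rng = random.Random(0)
labels = [i % 10 for i in range(10_000)]  # fake class labels 0..9

for p_uncond in (0.0, 0.1, 0.2, 1.0):
    batch = drop_labels(labels, p_uncond, rng)
    dropped = sum(c is NULL for c in batch) / len(batch)
    print(f"p_uncond={p_uncond:.1f}  fraction trained unconditionally={dropped:.3f}")
```

The two endpoints behave as described: at `p_uncond = 0.0` nothing is ever dropped, so the unconditional prediction is never trained, and at `1.0` every label is hidden.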

The two lines that change, then, look like this at training time:

# Algorithm 1: joint training, one step
x, c = sample_batch()               # an image and its label
if rand() < p_uncond:               # ~10-20% of the time...
    c = NULL                        # ...drop the label (the ∅ token)
lam  = sample_log_snr()             # noise level  λ ~ p(λ)
eps  = randn_like(x)                # the noise we will add
z    = alpha(lam)*x + sigma(lam)*eps   # the noised input
loss = mse(eps_theta(z, lam, c), eps)  # one plain MSE on the noise
loss.backward(); opt.step()         # SAME weights learn c and ∅

and at sampling time:

# Algorithm 2: sampling with guidance weight w
def sample(c, w, shape):
    z = randn(shape)                    # start from pure noise
    for lam in schedule(lam_min, lam_max):   # low SNR -> high SNR
        e_c = eps_theta(z, lam, c)      # conditional pass
        e_u = eps_theta(z, lam, NULL)   # unconditional pass
        e   = e_c + w*(e_c - e_u)       # Eq (6): amplify the difference
        x   = (z - sigma(lam)*e)/alpha(lam)  # implied clean image
        z   = reverse_step(z, x, lam)   # one ancestral (or DDIM) step
    return x                            # final clean-image estimate
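To watch the sampling loop run without training anything, the network can be swapped for the exact noise predictions of a toy problem. The sketch below makes several assumptions that are not in the paper: two one-dimensional Gaussian classes at $\pm 1$, a log-SNR grid that is linear in $\lambda$, and a deterministic DDIM-style update in place of the ancestral step. Every name in it is illustrative.

```python
import math
import random

MEANS = (-1.0, 1.0)   # toy class means (assumption)
DATA_STD = 0.5        # toy within-class std (assumption)

def alpha_sigma(lam):
    return (math.sqrt(1.0 / (1.0 + math.exp(-lam))),
            math.sqrt(1.0 / (1.0 + math.exp(lam))))

def eps_cond(z, lam, c):
    """Exact E[eps | z, c] for Gaussian class c (plays the conditional pass)."""
    a, s = alpha_sigma(lam)
    v = a * a * DATA_STD ** 2 + s * s
    return s * (z - a * MEANS[c]) / v

def eps_uncond(z, lam):
    """Exact E[eps | z] for the equal-weight mixture (plays the null-label pass)."""
    a, s = alpha_sigma(lam)
    v = a * a * DATA_STD ** 2 + s * s
    logits = [-(z - a * m) ** 2 / (2.0 * v) for m in MEANS]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return sum(wt / total * eps_cond(z, lam, k) for k, wt in enumerate(weights))

def sample(c, w, rng, steps=64, lam_min=-20.0, lam_max=20.0):
    lams = [lam_min + (lam_max - lam_min) * i / (steps - 1) for i in range(steps)]
    z = rng.gauss(0.0, 1.0)                      # start from pure noise
    for i, lam in enumerate(lams):               # low SNR -> high SNR
        a, s = alpha_sigma(lam)
        e_c = eps_cond(z, lam, c)                # conditional pass
        e_u = eps_uncond(z, lam)                 # unconditional pass
        e = e_c + w * (e_c - e_u)                # Eq. (6)
        x = (z - s * e) / a                      # implied clean value
        if i + 1 < steps:
            a_next, s_next = alpha_sigma(lams[i + 1])
            z = a_next * x + s_next * e          # deterministic DDIM-style step
        else:
            z = x                                # last step returns the clean estimate
    return z

for w in (0.0, 1.0, 3.0):
    rng = random.Random(0)                       # same starting noise for every w
    xs = [sample(1, w, rng) for _ in range(300)]
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    print(f"w={w:.1f}  mean={mean:.3f}  std={std:.3f}")
```

With the same starting noise, raising `w` moves the samples for the class at $+1$ further from the competing class at $-1$. Because the noise predictions here are exact, this toy sits in the ideal-score regime; a learned network only approximates it.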

The system end to end: train one network with conditioning dropout, then sample by running it twice per step and extrapolating between the two predictions. The paper is right to call the diversity-for-fidelity dial what it always was, a one-line change at each end.

Why it isn't really a classifier

Equation (6) was derived from an implicit classifier, so it is tempting to say classifier-free guidance is just classifier guidance with the classifier computed by Bayes' rule. The paper is careful to say it is not, and the reason is the star from two sections ago.

The implicit-classifier identity holds for the exact scores $\boldsymbol{\epsilon}^{*}$. The deployed method uses the learned $\boldsymbol{\epsilon}_\theta$, two outputs of an unconstrained neural network. Their difference $\boldsymbol{\epsilon}_\theta(z,c) - \boldsymbol{\epsilon}_\theta(z)$ is generally not the gradient of any function, because a gradient field cannot swirl and an arbitrary network's output can. So there is, in general, no classifier $p(c\mid z)$ whose gradient equals what classifier-free guidance follows. The method is inspired by classifier guidance and behaves like it, but it is not literally guiding along any classifier's gradient, which also means it cannot be dismissed as an adversarial attack on a classifier, since there is no classifier present to attack.
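"Cannot swirl" has a precise meaning: in two dimensions the curl of a gradient field is zero everywhere. The sketch below estimates the curl by finite differences for two hand-written fields, one a true gradient and one a pure rotation. Both fields and the helper `curl_2d` are illustrative; neither is a network output.

```python
def curl_2d(field, x, y, h=1e-5):
    """Finite-difference curl dFy/dx - dFx/dy of a 2-D vector field at (x, y)."""
    dfy_dx = (field(x + h, y)[1] - field(x - h, y)[1]) / (2.0 * h)
    dfx_dy = (field(x, y + h)[0] - field(x, y - h)[0]) / (2.0 * h)
    return dfy_dx - dfx_dy

def gradient_field(x, y):
    """Gradient of f(x, y) = -(x^2 + y^2)/2, the score of a standard Gaussian."""
    return (-x, -y)

def swirl_field(x, y):
    """A rotation about the origin; it is not the gradient of any function."""
    return (-y, x)

for name, field in (("gradient", gradient_field), ("swirl", swirl_field)):
    print(f"{name:9s} curl at (0.3, -1.2): {curl_2d(field, 0.3, -1.2):.6f}")
```

An unconstrained network is free to output something like the second field, or a mix of both, which is why its output need not be the gradient of any log-probability.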

This is not a small print disclaimer; it is why the approach rests on slightly informal footing. Inverting a generator with Bayes' rule is not guaranteed to give a useful classifier (Grandvalet & Bengio found discriminative classifiers usually beat ones derived this way), and under a misspecified model the derived classifier can be inconsistent. The justification for classifier-free guidance is ultimately that it works, not that the Bayes story is exact. The next sections are where "it works" gets pinned to numbers.

The off-by-one: w vs scale s

One practical snag will trip you the moment you compare the paper to code. The paper's weight $w$ starts at $0$ for no guidance. The guidance_scale in Stable Diffusion, Imagen, and the HuggingFace diffusers library starts at $1$ for no guidance, because they write the same equation in the equivalent form

$$\tilde{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(z) + s\,\big[\boldsymbol{\epsilon}_\theta(z, c) - \boldsymbol{\epsilon}_\theta(z)\big], \qquad s = w + 1$$

So the paper's $w=1$ is the code's $s=2$, the paper's $w=3$ is $s=4$, and Stable Diffusion's default of about 7.5 corresponds to a paper weight of about 6.5, far past anything the experiments measured. Any time you see a guidance number, check which convention it is in, or your comparison is silently off by one.
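A pair of converters keeps the two conventions straight. The sketch below uses illustrative function names and plain floats; it does not call any library, it only checks that the paper's form with weight $w$ and the scale form with $s = w + 1$ produce the same guided value.

```python
def w_to_scale(w):
    """Paper weight w (0 = no guidance) -> guidance scale s (1 = no guidance)."""
    return w + 1.0

def scale_to_w(s):
    """Guidance scale s -> paper weight w."""
    return s - 1.0

def guide_paper(e_c, e_u, w):
    """Paper convention, Eq. (6)."""
    return (1.0 + w) * e_c - w * e_u

def guide_scale(e_c, e_u, s):
    """Scale convention: unconditional plus s times the difference."""
    return e_u + s * (e_c - e_u)

e_c, e_u = 0.20, 0.50
for s in (1.0, 2.0, 4.0, 7.5):
    w = scale_to_w(s)
    print(f"s={s:4.1f}  w={w:4.1f}  "
          f"paper={guide_paper(e_c, e_u, w):+.3f}  scale={guide_scale(e_c, e_u, s):+.3f}")
```

Feeding a scale value into the paper's formula without converting, or the reverse, shifts the effective guidance by exactly one unit of the amplified difference.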

One more $+1$ floats around this paper and is a different statement, so do not merge them. Inside classifier guidance, applying weight $w+1$ to an unconditional model is equivalent to applying weight $w$ to a conditional one, because $p(z\mid c)\,p(c\mid z)^{w} \propto p(z)\,p(c\mid z)^{w+1}$. That is a fact about which base model you guide, not about the code convention above. Dhariwal & Nichol still got their best results guiding an already-conditional model, and this paper stays in that setup.

What guidance buys

The experiments are class-conditional ImageNet at 64×64 and 128×128, sweeping $w$ and reading two standard metrics. They point in opposite directions, which is exactly what the guidance sweep trades between. Inception Score (IS, higher is better) feeds samples to a fixed classifier and rewards images it labels confidently and a spread of labels across samples; it never looks at real data. FID (Fréchet Inception Distance, lower is better) models the generated and the real images as two clouds of classifier features and compares them on two counts: the distance between their means, and the distance between their covariances, which measure how spread out each cloud is. That second, covariance term is what punishes lost diversity: a model collapsed onto a few confident outputs has a much tighter feature cloud than real data, so the covariance gap (and the FID) grows. IS rises with fidelity per sample; FID rises when you drift off the real distribution. Both are computed from 50,000 samples.
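The role of the covariance term is easiest to see in one dimension, where the Fréchet distance between two Gaussians reduces to the squared gap between means plus the squared gap between standard deviations. The sketch below is that one-dimensional reduction only. Real FID is computed on Inception features with full covariance matrices, and the numbers here are made up for illustration.

```python
def frechet_1d(mu_a, std_a, mu_b, std_b):
    """Squared Frechet distance between the 1-D Gaussians N(mu_a, std_a^2) and N(mu_b, std_b^2)."""
    return (mu_a - mu_b) ** 2 + (std_a - std_b) ** 2

real = (0.0, 1.0)  # (mean, std) of a pretend "real" feature

cases = {
    "matches real data":       (0.0, 1.0),
    "right mean, collapsed":   (0.0, 0.2),
    "shifted mean, same std":  (0.8, 1.0),
}
for name, (mu, std) in cases.items():
    print(f"{name:24s} distance = {frechet_1d(mu, std, *real):.2f}")
```

The collapsed case has exactly the right mean and is still penalized, through the spread term alone. That is the mechanism by which heavy guidance raises FID even while each sample looks more convincing.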

Sweep $w$ in the figure and the trade is visible point by point. Inception Score climbs the entire way, from 53.7 unguided to 260 at $w=4$ on 64×64, the classifier labeling every sample with near-certainty. FID does something more interesting, and here the paper's prose and its own tables disagree. The text says FID decreases monotonically with $w$; the numbers say FID is U-shaped. A little guidance helps it (64×64 FID falls from 1.80 at $w=0$ to its best of 1.55 at $w=0.1$), and after that it climbs steadily, all the way to 26 at $w=4$ as the samples abandon the data's diversity. We teach the U-shape, because it is what the experiments report.

Figure 4 · the fidelity / diversity trade, measured
Each point is one guidance weight from the paper's sweep, plotted by Inception Score (right is more confident) against FID (down is closer to real data). Inception Score rises with $w$ across the entire sweep; FID dips to a minimum at a small $w$ (the amber point), then climbs. Toggle between ImageNet 64×64 and 128×128; the best-FID weight shifts with the dataset. Numbers are verbatim from Tables 1 and 2.

The best-FID point sits at a small weight, and where it sits depends on the dataset: $w=0.1$ on 64×64, $w=0.3$ on 128×128. Push past that and you are spending real diversity for ever more confident, ever-more-stereotyped samples, which at high $w$ also come out with visibly over-saturated colors, a concrete artifact the paper flags. At its best-FID setting on 128×128 the model reached an FID of 2.43 (at $w=0.3$), state of the art in the literature when the paper appeared and below the 2.97 of the classifier-guided ADM-G it set out to replace, with no extra classifier to train.

Two more findings round it out. The dropout rate $p_\text{uncond}$ barely matters in the working range: 0.1 and 0.2 perform about equally, and only 0.5 is clearly worse, so a small slice of the network's capacity spent on the unconditional task is enough. And the honest cost: every sampling step runs the network twice, once conditional and once unconditional. Matched for compute against classifier-guided ADM-G, which uses a single pass with a small classifier, classifier-free guidance at the comparable step budget actually trails on FID. You pay for the deleted classifier in forward passes, and the paper suggests injecting the conditioning late in the network so the two passes could share most of their work.

What remains is a method whose appeal is exactly its plainness. There is no auxiliary model, no adversarial gradient, no extra parameters, just a generator run twice and a subtraction with a knob on it. A diffusion model already contains its own classifier, in the difference between knowing the label and not, and you can lean on that difference as hard as you like. Every guidance slider you have ever moved is that subtraction, scaled by $w$.

Provenance · Verified against primary literature
Ho & Salimans (2022) · The paper: classifier-free guidance, ε̃ = (1+w)ε(z,c) − w·ε(z).
Dhariwal & Nichol (2021) · Classifier guidance, the baseline this replaces; needs a noise-aware classifier.
Ho et al. (2020), DDPM · The ε-prediction parameterization and the plain-MSE training objective.
Kingma et al. (2021), VDM · The variance-preserving, log-SNR (λ) convention used here.
Heusel 2017 / Salimans 2016 · FID and Inception Score, the diversity and fidelity metrics.
Correction · The paper's prose (Sec 4.1) says FID decreases monotonically with the weight w; its own tables show FID is U-shaped, best at a small w and then worsening. We teach the U-shape and flag the discrepancy. We also map the paper's w to the guidance scale s = w+1 you type into Stable Diffusion.

Questions you might still have

Is this the same as the guidance scale slider in Stable Diffusion?
Yes, with an off-by-one. The paper's weight w runs from 0 (no guidance), while the diffusers/Stable-Diffusion guidance_scale s runs from 1 (no guidance), and s = w + 1. Stable Diffusion's default of about 7.5 is the paper's w ≈ 6.5, far past anything the paper measured.

If there is no classifier, what is the guidance actually doing?
It amplifies the part of the noise prediction that the label changes: ε(z,c) − ε(z). At every step you take the conditional prediction and push it further in the direction that knowing the label moved it, away from the generic unconditional guess.

Why does cranking guidance up too far make images worse, not just less varied?
Both happen. Inception Score keeps rising with w, but FID bottoms out at a small w and then climbs, because the samples drift off the real data distribution. The paper also notes strongly guided samples come out with over-saturated colors.

Why run the model twice on every sampling step?
You need both ε(z,c) and ε(z), and they come from the same network with the label present or replaced by ∅. That is two forward passes per step, the main cost of the method. The paper suggests injecting the conditioning late in the network as a way to share most of the work.

Does guidance change training, or only sampling?
Mostly sampling. Training changes by one line: drop the label with probability p_uncond so one network learns both the conditional and unconditional predictions. The guidance weight w is chosen at sampling time and needs no retraining.

Footnotes & further reading

  1. The paper: Jonathan Ho & Tim Salimans, Classifier-Free Diffusion Guidance (Google Research, 2022; a short version appeared at the NeurIPS 2021 Workshop on Deep Generative Models).
  2. Classifier guidance, the baseline replaced here: Prafulla Dhariwal & Alex Nichol, Diffusion Models Beat GANs on Image Synthesis (2021).
  3. The ε-prediction parameterization and the plain-MSE objective: Ho, Jain & Abbeel, Denoising Diffusion Probabilistic Models (2020). The score / SDE view that $\boldsymbol{\epsilon}_\theta \approx -\sigma\,\nabla\log p$: Song et al., Score-Based Generative Modeling through SDEs (2021).
  4. The variance-preserving, log-SNR convention: Kingma, Salimans, Poole & Ho, Variational Diffusion Models (2021).
  5. The metrics: Salimans et al., Improved Techniques for Training GANs (Inception Score, 2016), and Heusel et al., GANs Trained by a Two Time-Scale Update Rule (FID, 2017).
  6. The free fidelity dials this set out to match: Brock et al., Large Scale GAN Training (BigGAN truncation, 2019), and Kingma & Dhariwal, Glow (temperature sampling, 2018). Classifier-free guidance is what powers the guidance_scale in Stable Diffusion and its descendants; the modern alternative, flow matching, guides the same way.
  7. The $s = w + 1$ convention: HuggingFace diffusers and the Imagen/Stable Diffusion samplers all use $\boldsymbol{\epsilon}_u + s(\boldsymbol{\epsilon}_c - \boldsymbol{\epsilon}_u)$.