
High-Resolution Image Synthesis with Latent Diffusion Models

Diffusion got expensive. Move it off the pixels.

A pretrained autoencoder strips the imperceptible detail out of an image and leaves a small latent. Run the diffusion there instead of on pixels, add a cross-attention port for text, and high-resolution synthesis stops needing a data center.

Explaining the paper: High-Resolution Image Synthesis with Latent Diffusion Models · Rombach, Blattmann, Lorenz, Esser, Ommer · LMU Munich & Runway · CVPR 2022 · arXiv:2112.10752

Most of the bits in an image are detail. The expensive model never needed to learn them.

By the end of 2021 the sharpest image generators were diffusion models. The recipe, the subject of our DDPM explainer, is plain: add Gaussian noise to an image over many small steps until it is static, train one network to undo a single step, then sample by running that network backward from noise. It makes beautiful pictures and trains stably, with none of the mode collapse that made GANs miserable. It also has a cost problem that had started to define the field.

That network runs on the image itself, a tensor with hundreds of thousands of numbers, and it has to run once per denoising step, hundreds of steps per image, for both training and sampling. Training the strongest pixel-space diffusion models took between 150 and 1000 V100-GPU-days (the V100 being NVIDIA's then-standard datacenter GPU). Producing 50,000 samples took about five days on a single A100. The quality was there. The price meant only a handful of labs could pay it.

Latent Diffusion, out of the CompVis group in Munich together with Runway, is the paper that cut the price without giving up the quality. The idea is simple: do not run the diffusion on the pixels. Run it on a small, compressed code of the image, produced once by a separate autoencoder, and decode the result back to a picture at the very end. The same paper added a clean way to steer the generator with text or a layout, and it became the architecture behind Stable Diffusion. To see why it works, and why the compression does not wreck the images, we have to look at where an image actually spends its bits.

Where an image spends its bits

In any natural image, the information is unevenly distributed. Most of it is high-frequency detail: the exact texture of grass, the grain in a photograph, the precise value of one pixel against its neighbor. Nudge those details and a person cannot tell. They carry bits, and the bits cost a model capacity to learn, but they carry little meaning.

The authors make this concrete with a rate-distortion picture (their Figure 2). It plots how much an image is distorted, in a perceptual sense, against the rate: the number of bits you keep when you compress it. The curve is steep, then flat. The first few bits buy enormous drops in distortion, because they encode the things that matter: the layout and the objects. After an elbow, each extra bit barely changes anything you would notice. Those late bits are the imperceptible detail.

That splits compression into two regimes. Up to the elbow is semantic compression: the bits that determine what the image is. Past it is perceptual compression: detail you can drop with no visible loss. A pixel-space diffusion model treats both regimes the same. It spends its capacity, and your GPU-days, modeling every bit to the same standard, even though the eye discards most of them. Figure 1 shows the trade:

Figure 1 · the rate-distortion split

Perceptual distortion against the rate kept in the latent. The first bits (left, steep) are semantic and the diffusion model must learn them; the later bits (right, flat) are imperceptible detail the encoder can drop. LDM cuts at the elbow: keep the meaning, discard what the eye does not see.

The plan follows directly. Put a cheap, dedicated model at the elbow to do the perceptual compression once, and let the expensive diffusion model work only on the bits that carry meaning. That cheap model is an autoencoder.

Stage one: a perceptual autoencoder

An autoencoder is two networks. An encoder $\mathcal{E}$ compresses an image to a smaller representation. A decoder $\mathcal{D}$ rebuilds the image from it. Train them together so the rebuild matches the original, and the representation in the middle has to capture whatever the decoder needs. Write the image as $x$ , a tensor of shape $H \times W \times 3$ (height, width, three color channels). The encoder produces a latent $z = \mathcal{E}(x)$ , and the decoder reconstructs $\tilde{x} = \mathcal{D}(z)$ :

x \in \mathbb{R}^{H \times W \times 3}, \qquad z = \mathcal{E}(x) \in \mathbb{R}^{h \times w \times c}, \qquad \tilde{x} = \mathcal{D}(\mathcal{E}(x))

The latent $z$ has shape $h \times w \times c$ . The number that controls everything is the downsampling factor $f = H/h = W/w$ : how many times smaller each spatial side gets. At $f = 4$ a 256 × 256 image becomes a 64 × 64 latent; at $f = 8$ , a 32 × 32 one. The channel count $c$ is small, 3 or 4 in the good configurations. So the latent is a small image. It keeps the two-dimensional grid, only shrunk. A bigger $f$ means a smaller latent and a cheaper diffusion loop, but past some point the autoencoder has to discard structure rather than detail, and much of the paper comes down to choosing where those two pressures cross; Figure 4 below plots that crossing directly.
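The arithmetic behind $f$ is small enough to check by hand. The sketch below computes the latent shape and the drop in grid cells for a given downsampling factor. The function names are made up for illustration, and $c = 4$ is one of the two channel counts the text mentions, chosen here as an assumption.

```python
def latent_shape(height, width, f, c):
    """Return (h, w, c) for an H x W image at downsampling factor f."""
    if height % f or width % f:
        raise ValueError("f must divide both image sides")
    return height // f, width // f, c


def spatial_reduction(f):
    """Factor by which the number of grid cells shrinks: f per side, so f squared."""
    return f * f


for f in (4, 8):
    h, w, c = latent_shape(256, 256, f, c=4)  # c = 4 is an assumed channel count
    print(f"f={f}: latent {h}x{w}x{c}, {spatial_reduction(f)}x fewer grid cells")
```

The cost of a denoising step scales with the number of grid cells, which is why the reduction is quoted as $f^2$ and not $f$.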

The latent stays laid out as a small spatial grid. It is not flattened into a one-dimensional string of code numbers. Earlier two-stage methods, VQGAN and DALL-E, also compressed images first, but to feed an autoregressive Transformer (one that generates the code one token at a time, each conditioned on the tokens before it) they flattened the latent into a 1D sequence and predicted it left to right. A 1D ordering throws away the fact that nearby pixels are related, and a Transformer over a long sequence is costly, so those methods needed aggressive compression (VQGAN at $f = 16$ , DALL-E at $f = 8$ ) and grew to billions of parameters. Latent Diffusion keeps the 2D grid because the diffusion model's generator, a convolutional U-Net, is built to work on grids. That lets it compress gently, at $f = 4$ or $8$ , and still reconstruct faithfully.

How faithful? At $f = 4$ the reconstruction has a PSNR of 27.4 dB and a reconstruction-FID of 0.58 (computed on ImageNet-val). (PSNR is pixel-level fidelity in decibels, higher is better; reconstruction-FID compares the distribution of reconstructed images to real ones, lower is better.) At $f = 8$ it is 23.1 dB and 1.14. Push to $f = 16$ and $32$ and the numbers fall off as real detail starts to vanish. Figure 2 shows the latent shrinking and the reconstruction degrading as $f$ rises:

Figure 2 · the autoencoder, and what f costs

An image is encoded to a small spatial latent and decoded back. As f rises the latent grid shrinks and the reconstruction loses first fine texture, then structure; the R-FID and PSNR readouts (the paper's VQ autoencoder zoo) worsen with it. At f = 4 or 8 the rebuild is accurate.
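For readers who have not met PSNR, here is a minimal sketch of the standard definition, $10 \log_{10}(\text{MAX}^2 / \text{MSE})$. It assumes 8-bit pixel values in $[0, 255]$ and flat lists of pixels; the paper's exact evaluation pipeline is not described here, so treat the value range as an assumption.

```python
import math


def psnr(original, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)
    if mse == 0:
        return math.inf  # identical images: no distortion at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: four pixels, each off by one or two grey levels.
x = [10.0, 200.0, 35.0, 90.0]
x_rec = [12.0, 198.0, 36.0, 91.0]
print(f"PSNR = {psnr(x, x_rec):.2f} dB")
```

Because the scale is logarithmic, the gap between 27.4 dB and 23.1 dB is larger than it looks: roughly a 2.7-fold increase in mean squared error.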

How the autoencoder is trained matters as much as its shape, and it is not the pixel-error objective you might expect.

Keeping the latent from drifting

Train an autoencoder with a pixel-wise $L_2$ loss and you get blurry reconstructions. Averaging is the safe bet when you are scored pixel by pixel, and the average of plausible textures is mush. Latent Diffusion borrows VQGAN's first-stage recipe instead, with two ingredients that keep reconstructions crisp.

The first is a perceptual loss. Instead of comparing pixels, compare the two images in the feature space of a pretrained network. This is LPIPS, from a paper aptly titled The Unreasonable Effectiveness of Deep Features as a Perceptual Metric, and it tracks human judgments of similarity far better than pixel distance. The second is a patch-based adversarial loss. A small discriminator (a PatchGAN, from the pix2pix paper) inspects local patches of the reconstruction and tries to tell them from real ones, and the autoencoder is trained to fool it. That forces local realism: textures that look right up close, not only on average.

Together these keep reconstructions on the image manifold, the paper's phrase for "looks like a real image," and dodge the blur of pixel losses. The full objective is a min-max between the autoencoder $(\mathcal{E}, \mathcal{D})$ and the patch discriminator $D_\psi$ :

L_{\text{Autoencoder}} = \min_{\mathcal{E},\mathcal{D}}\,\max_{\psi}\Big( L_{rec}\big(x, \mathcal{D}(\mathcal{E}(x))\big) - L_{adv}\big(\mathcal{D}(\mathcal{E}(x))\big) + \log D_\psi(x) + L_{reg}(x;\mathcal{E},\mathcal{D}) \Big)

(25)

Read the terms left to right. $L_{rec}$ is the reconstruction loss (the perceptual one plus a pixel term). The middle pair is the adversarial game: the discriminator $D_\psi$ learns to score real images high and reconstructions low, and the autoencoder pushes the other way. $L_{reg}$ is a regularizer, and it is doing something specific.

Without a regularizer the encoder could place its latents anywhere, at any scale, and an arbitrarily spread-out latent space is a poor place to train a second model. So the paper pins the latent down, with one of two light touches.

KL-regularization adds a small penalty pulling the latent toward a standard normal, the same term a variational autoencoder uses. The word "small" is load-bearing: the weight is about $10^{-6}$ , a thousandth of a thousandth. This is not a real VAE prior. A strong prior would buy a tidy Gaussian latent at the price of reconstruction fidelity, and no prior at all would let the latent drift to arbitrary scales; the tiny weight is the compromise that pins the latent scale while leaving reconstruction fidelity essentially unaffected.

VQ-regularization instead snaps each latent vector to the nearest entry in a learned codebook, the vector-quantization trick from VQ-VAE. The paper folds the quantization step into the decoder, so this variant is, in its words, a VQGAN with the quantization layer absorbed by the decoder. Either regularizer works, and the choice barely changes the downstream samples.¹
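Both regularizers fit in a few lines. The sketch below is illustrative only: it assumes the encoder emits a mean and a log-variance per latent component (the usual VAE parameterization, not something stated above), uses the closed-form KL divergence to a standard normal, and takes $10^{-6}$ as the weight. The function names are hypothetical.

```python
import math

KL_WEIGHT = 1e-6  # order of magnitude quoted in the text; exact value is an assumption


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent components."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, log_var))


def nearest_codebook_entry(vector, codebook):
    """VQ-style snap: return the codebook entry closest to `vector` in squared L2."""
    return min(codebook, key=lambda e: sum((a - b) ** 2 for a, b in zip(vector, e)))


# KL variant: a latent that has drifted far from the origin pays a (tiny) penalty.
drifted_mean, log_var = [30.0, -40.0], [0.0, 0.0]
print("weighted KL penalty:", KL_WEIGHT * kl_to_standard_normal(drifted_mean, log_var))

# VQ variant: each latent vector is replaced by its nearest codebook entry.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print("snapped:", nearest_codebook_entry([0.9, 0.1], codebook))
```

Even for the badly drifted latent in the example the weighted penalty is on the order of $10^{-3}$, so the term barely competes with the reconstruction loss while still discouraging extreme scales.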

Stage two: diffuse in the latent

With the autoencoder trained and then frozen, meaning its weights are no longer updated, the second stage is a diffusion model, and here Latent Diffusion changes exactly one thing about DDPM.

A diffusion model learns a data distribution by reversing a fixed noising process. (The mechanics are in our DDPM explainer; the short version follows.) You add Gaussian noise to a sample in many small steps until it is indistinguishable from noise, then train a single network $\epsilon_\theta$ to look at a noised sample and predict the noise that was added. Knowing the noise is the same as knowing the clean sample, so a trained $\epsilon_\theta$ can walk noise back into data. The training loss is a plain mean-squared error between the true noise and the prediction:

L_{DM} = \mathbb{E}_{x,\,\boldsymbol{\epsilon}\sim\mathcal{N}(0,1),\,t}\Big[\,\big\lVert \boldsymbol{\epsilon} - \epsilon_\theta(x_t, t) \big\rVert_2^2\,\Big]

(1)

In a pixel-space model $x_t$ is a noised image and $\epsilon_\theta$ runs on the full $H \times W \times 3$ tensor. Latent Diffusion runs the identical objective, but on the latent. Encode the image once to $z = \mathcal{E}(x)$ , noise the latent, and train $\epsilon_\theta$ to denoise $z_t$ :

L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\,\boldsymbol{\epsilon}\sim\mathcal{N}(0,1),\,t}\Big[\,\big\lVert \boldsymbol{\epsilon} - \epsilon_\theta(z_t, t) \big\rVert_2^2\,\Big]

(2)

The two equations differ by one symbol, $x_t$ becoming $z_t$ , and that one substitution is the only change. The denoiser $\epsilon_\theta$ is still a U-Net, the architecture DDPM uses, and because the latent is a 2D grid the U-Net's convolutions apply directly. (The U-Net is the subject of its own explainer.) Its attention layers carry over unchanged, because $z$ is still a small spatial grid rather than a flattened code; the only thing that moved is how many cells each denoising step has to touch. But now it runs on a tensor that is $f^2$ times smaller in area. At $f = 8$ that is a 64-fold cut in the spatial size the expensive network has to process, every step, for training and sampling alike. The training step is the one you already know, with the encode line bolted on the front:

# stage two, pixel-space DM: the U-Net runs on the whole image
x    = sample_image()              # [H, W, 3]   e.g. 256x256x3
t    = randint(1, T)
eps  = randn_like(x)               # the noise the net must predict
x_t  = noise(x, eps, t)            # forward-noise the image (DDPM)
loss = mse(eps, eps_theta(x_t, t)) # eq (1)

# latent diffusion: encode ONCE with the frozen E, then diffuse z
z    = E(x)                        # [h, w, c]   e.g. 32x32x4
eps  = randn_like(z)
z_t  = noise(z, eps, t)            # same forward process, on the latent
loss = mse(eps, eps_theta(z_t, t)) # eq (2)  -> f^2 fewer spatial cells
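The pseudocode leaves `noise(...)` undefined. One way to fill it in is the closed-form DDPM forward process, $z_t = \sqrt{\bar\alpha_t}\, z + \sqrt{1 - \bar\alpha_t}\, \epsilon$. The sketch below assumes the linear beta schedule and $T = 1000$ from the DDPM paper, which may differ from the schedule the LDM authors used, and it stands in a zero predictor for the U-Net so that it runs without a deep-learning library.

```python
import math
import random


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def noise(z, eps, t, alpha_bar):
    """Forward-noise z to step t (1-based) in one shot, without iterating."""
    a = alpha_bar[t - 1]
    return [math.sqrt(a) * zi + math.sqrt(1.0 - a) * ei for zi, ei in zip(z, eps)]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
T = 1000
alpha_bar = make_alpha_bar(T)

z = [rng.gauss(0.0, 1.0) for _ in range(32 * 32 * 4)]  # stand-in for E(x), flattened
eps = [rng.gauss(0.0, 1.0) for _ in z]                 # the noise the net must predict
t = rng.randint(1, T)
z_t = noise(z, eps, t, alpha_bar)

eps_pred = [0.0] * len(z_t)  # placeholder for eps_theta(z_t, t); a real model is a U-Net
loss = mse(eps, eps_pred)    # eq (2) for a single sample
print(f"t={t}, loss={loss:.3f}")
```

Nothing in this sketch knows whether `z` is a latent or an image, which is the point of the section: the objective is unchanged and only the tensor it runs on gets smaller.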

The two-stage shape helps twice over. Sampling runs the denoiser over the small latent for all its steps, then decodes once with $\mathcal{D}$ to get a full-resolution image, so the heavy sequential loop never touches pixels. And the autoencoder is universal: train it once, and reuse the same latent space for many different diffusion models and tasks. The system, assembled:

Figure 3 · the latent-diffusion architecture

On the left, pixel space: the encoder ℰ maps an image into the latent, the decoder 𝒟 maps it back. The big panel is the latent space, where a time-conditional U-Net $\epsilon_\theta$ denoises the small latent over T steps. Conditioning y enters through cross-attention. ℰ and 𝒟 are each used once; the loop never touches pixels. The strip below counts the per-step work: a step on pixels would touch $256^2 = 65{,}536$ cells, the same step in the latent touches $32^2 = 1{,}024$ , the 64-fold cut at $f = 8$ .

Because the forward noising is fixed, $z_t$ can be produced from $\mathcal{E}(x)$ on the fly during training, and a finished sample needs a single pass through $\mathcal{D}$ . The encoder and decoder are each used once; everything in between happens in the small latent.
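The control flow of sampling can be sketched as follows. This shows only the structure: `denoise_step` and `decode` are hypothetical callables standing in for one reverse-diffusion update (U-Net plus sampler rule) and for the decoder $\mathcal{D}$, and the fake versions below exist only so the sketch runs and the call counts can be inspected.

```python
import random


def sample(denoise_step, decode, latent_shape, num_steps, seed=0):
    """Run the whole reverse loop on the latent, then decode exactly once."""
    rng = random.Random(seed)
    h, w, c = latent_shape
    z = [rng.gauss(0.0, 1.0) for _ in range(h * w * c)]  # start from pure noise
    for t in range(num_steps, 0, -1):                    # t = T, T-1, ..., 1
        z = denoise_step(z, t)                           # always on the small latent
    return decode(z)                                     # single pass back to pixels


# Hypothetical stand-ins that only count how often they are called.
calls = {"denoise": 0, "decode": 0}


def fake_denoise_step(z, t):
    calls["denoise"] += 1
    return [0.9 * v for v in z]


def fake_decode(z):
    calls["decode"] += 1
    return z


image = sample(fake_denoise_step, fake_decode, latent_shape=(32, 32, 4), num_steps=50)
print(calls)
```

The decoder's cost is paid once per image regardless of the number of steps, so the savings from the smaller latent grow with the length of the sampling loop.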

How much to compress

$f$ is the dial that sets everything, and it has a sweet spot. At $f = 1$ there is no compression at all: this is pixel-space diffusion, the slow and expensive baseline the paper is trying to beat. At the other end, a large $f$ means a tiny latent, which sounds efficient until you remember the autoencoder had to throw away real content to get there. Too much compression and the reconstruction quality, the ceiling on everything downstream, drops.

In between is a basin. The paper trains class-conditional models at $f \in \{1,2,4,8,16,32\}$ on ImageNet under a fixed compute budget and watches the sample quality, scored by FID. (This is sample-quality FID, how close the generated images are to real ones; it is the stage-two cousin of the reconstruction-FID from before, same idea, lower is better.) Pixel diffusion ( $f = 1$ ) trains slowly and lands far behind; after two million steps the gap to $f = 8$ is about 38 FID. Over-compression ( $f = 16$ , $32$ ) stalls early at a worse ceiling. $f = 4$ and $f = 8$ sit at the bottom of the U, and the paper picks them for most of its models. Figure 4 shows quality falling off on both sides:

Figure 4 · the compression sweet spot

Sample quality (FID, lower is better) against the downsampling factor f at equal compute, following the paper's Figure 6. f = 1 (pixels) wastes capacity on detail and trains slowly; f = 16, 32 over-compress and hit a ceiling; f = 4 and 8 are the basin. The vertical axis is schematic, anchored to the paper's reported 38-FID gap between f = 1 and f = 8.

There is a second axis the figure does not show: speed. A larger $f$ always trains and samples faster, because the latent is smaller. So the real choice is $f = 4$ for the best quality, or $f = 8$ for a smaller, faster latent at a small cost in quality. Complex datasets like ImageNet want the gentler $f = 4$ ; the paper's text-to-image model uses $f = 8$ .

Steering it with cross-attention

The second half of the paper makes the diffusion model take direction, from text or a class label or a segmentation map, through one general mechanism.

A diffusion model is conditionable in principle: feed the denoiser the conditioning $y$ alongside the noised input and write $\epsilon_\theta(z_t, t, y)$ . The question is how to feed in something like a sentence, whose shape has nothing to do with the image grid: a latent is a grid, a prompt is a sequence. Cross-attention, the operation that lets a Transformer relate two different sequences, fits exactly here. It lets each grid location query the sequence for what matters there, which neither concatenation (the shapes do not line up) nor a single global summary vector (all spatial control lost) can do. First, run the conditioning through a domain-specific encoder $\tau_\theta$ that turns it into a set of vectors $\tau_\theta(y)$ . For text, $\tau_\theta$ is a Transformer over the tokenized prompt; for a class label it can be a lookup. Then, inside the U-Net, attention links the image to it:

\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{Q\,K^{\top}}{\sqrt{d}}\right) V, \qquad Q = W_Q\,\varphi(z_t),\quad K = W_K\,\tau_\theta(y),\quad V = W_V\,\tau_\theta(y)

The query $Q$ is built from the image side, a flattened intermediate feature map $\varphi(z_t)$ of the U-Net. The keys and values $K, V$ are built from the condition $\tau_\theta(y)$ . So each image location is matched against the condition, and the most relevant condition vectors weigh heaviest: a patch that is becoming a squirrel attends to the word "squirrel" and pulls in its value. Figure 5 shows which words a patch attends to:

Figure 5 · cross-attention conditioning

The query Q comes from the U-Net's image features $\varphi(z_t)$ ; the keys and values K, V come from the prompt through $\tau_\theta$ . Line weight is the softmax attention from the active patch to each word. A high patch leans on "squirrel," a low one on "burger."
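The attention formula above can be sketched in plain Python for a toy case. Rows are image locations or condition tokens, and the code uses the row-vector convention (features times weight matrix) where the equation is written with the weights on the left; the two are equivalent up to a transpose. The matrices, sizes, and names are illustrative assumptions, not values from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def cross_attention(image_feats, cond_feats, w_q, w_k, w_v):
    """Queries from the image side, keys and values from the condition."""
    q = matmul(image_feats, w_q)  # (N x d): one query per image location
    k = matmul(cond_feats, w_k)   # (M x d): one key per condition token
    v = matmul(cond_feats, w_v)   # (M x d): one value per condition token
    scale = math.sqrt(len(q[0]))
    weights = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        weights.append(softmax(scores))  # each location's weights sum to 1
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.0, 1.0]]  # two image locations (toy features)
tokens = [[5.0, 0.0], [0.0, 5.0]]   # two condition tokens (toy embeddings)
out, attn = cross_attention(patches, tokens, identity, identity, identity)
print(attn)
```

The output has one row per image location, so it can be reshaped back onto the grid; the number of condition tokens only affects the width of the attention matrix, which is how a grid and a sequence of different sizes can be combined.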

The conditioning encoder $\tau_\theta$ and the denoiser $\epsilon_\theta$ are trained together, end to end, on image-condition pairs. The loss is the latent objective from before with $y$ threaded through:

L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\,y,\,\boldsymbol{\epsilon}\sim\mathcal{N}(0,1),\,t}\Big[\,\big\lVert \boldsymbol{\epsilon} - \epsilon_\theta\big(z_t, t, \tau_\theta(y)\big) \big\rVert_2^2\,\Big]

(3)

For spatially-aligned conditions where the control is itself a grid (a low-resolution image for super-resolution, a mask for inpainting, a semantic map), the paper skips the attention and concatenates the condition onto the denoiser's input, which is cheaper and works because the grids line up.
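Concatenation is the simpler of the two conditioning routes. A minimal sketch, assuming both the noised latent and the condition are already at the same spatial size and stored as nested lists indexed `[row][col][channel]`; the helper name is made up for illustration.

```python
def concat_channels(z_t, cond):
    """Concatenate two h x w x * grids along the channel axis."""
    if len(z_t) != len(cond) or any(len(a) != len(b) for a, b in zip(z_t, cond)):
        raise ValueError("spatial sizes must match for concatenation")
    return [[px + cpx for px, cpx in zip(row, crow)] for row, crow in zip(z_t, cond)]


h, w = 2, 2
z_t = [[[0.0] * 4 for _ in range(w)] for _ in range(h)]   # 2 x 2 x 4 noised latent
mask = [[[1.0] for _ in range(w)] for _ in range(h)]      # 2 x 2 x 1 inpainting mask
denoiser_input = concat_channels(z_t, mask)               # 2 x 2 x 5
print(len(denoiser_input), len(denoiser_input[0]), len(denoiser_input[0][0]))
```

The only change to the denoiser is a wider first layer to accept the extra channels, which is why this route is cheaper than attention when the grids line up.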

Classifier-free guidance sharpens the conditional samples. During training the condition is randomly dropped, so the same network learns both a conditional and an unconditional denoiser. At sampling you extrapolate away from the unconditional prediction toward the conditional one,

\hat{\boldsymbol{\epsilon}} = \epsilon_\theta(z_t, t) + s\,\big(\epsilon_\theta(z_t, t, \tau_\theta(y)) - \epsilon_\theta(z_t, t)\big)

The formula combines two predictions. At every denoising step the network runs twice: once with the prompt, $\epsilon_\theta(z_t, t, \tau_\theta(y))$ , and once without it, $\epsilon_\theta(z_t, t)$ . Their difference is the part of the prediction that exists only because of the condition. The scale $s$ says how hard to follow that pull: $s = 1$ is the ordinary conditional prediction, and a larger $s$ pushes further along the prompt's direction than the model alone would go, sharpening how closely the sample obeys the text. Push it too far and the extrapolation overshoots into oversaturated, less varied images, which is the diversity-for-fidelity trade. (This is distinct from the older classifier guidance, which needs a separate image classifier and steers by its gradient.) The paper's reported benchmark numbers use a modest $s = 1.5$ .²
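Both halves of classifier-free guidance, the training-time dropout and the sampling-time extrapolation, can be sketched in a few lines. The drop probability and the use of `None` as the "no condition" marker are assumptions for illustration; the text above does not give either.

```python
import random


def drop_condition(cond, null_cond, p_drop, rng):
    """Training: replace the condition with a null one with probability p_drop."""
    return null_cond if rng.random() < p_drop else cond


def guided_eps(eps_uncond, eps_cond, s):
    """Sampling: start at the unconditional prediction, move s times the difference."""
    return [u + s * (c - u) for u, c in zip(eps_uncond, eps_cond)]


rng = random.Random(0)
cond = drop_condition("a squirrel eating a burger", None, p_drop=0.1, rng=rng)

eps_uncond = [1.0, 2.0]  # toy stand-in for eps_theta(z_t, t)
eps_cond = [3.0, 6.0]    # toy stand-in for eps_theta(z_t, t, tau_theta(y))
for s in (0.0, 1.0, 1.5):
    print(s, guided_eps(eps_uncond, eps_cond, s))
```

With these toy values, $s = 0$ returns the unconditional prediction, $s = 1$ returns the conditional one, and $s = 1.5$ lands beyond it at `[4.0, 8.0]`, which is the extrapolation the paragraph describes.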

Cheap, and still competitive

Latent Diffusion buys efficiency without a quality tax. Every model in the paper trains on a single A100, against the 150-to-1000 V100-days the pixel-space competition spent, and the samples are competitive or better.

On class-conditional ImageNet, the $f = 4$ model scores FID 10.56, and with guidance FID 3.60 at 400M parameters, ahead of the much larger ADM (the pixel-space "Diffusion Models Beat GANs" model from Dhariwal & Nichol; FID 10.94 at 554M, or 4.59 with its classifier guidance at 608M). On unconditional CelebA-HQ (a high-resolution face dataset) it sets a new state-of-the-art FID of 5.11. On text-to-image over MS-COCO the 1.45B-parameter model reaches FID 12.63 with guidance, on par with two much larger contemporary text-to-image models, GLIDE (6B parameters) and Make-A-Scene (4B), at a fraction of the size. On inpainting it sets a new state of the art and runs at least 2.7 times faster than the pixel-space equivalent while scoring better.

The text-to-image model made this paper famous. The paper trains a 1.45B-parameter latent diffusion model on LAION-400M, with text fed in through a Transformer $\tau_\theta$ trained from scratch on a BERT tokenizer. It is not Stable Diffusion. Stable Diffusion, released the following year, is a latent diffusion model in the same family, trained at much larger scale, with one swap: the from-scratch text Transformer is replaced by a frozen, pretrained CLIP text encoder (the ViT-L/14 variant). The architecture in this paper made Stable Diffusion possible; the household name is a later, scaled-up instance of it.

Two limits remain. Sampling is still a sequential loop and slower than a GAN's single forward pass. And the autoencoder's reconstruction sets a hard ceiling: for tasks that need exact pixels, that $f = 4$ reconstruction, good as it is, can become the bottleneck. The paper names super-resolution as the case where this bites.

The idea is small. Most of an image is imperceptible detail; an autoencoder removes it once, and the expensive diffusion model runs only on what remains, with a cross-attention port for text or layout. The diffusion math is unchanged, but moving it into the latent space put image generation on a consumer GPU.

Provenance · Verified against primary literature

DDPM (2020): The forward/reverse diffusion and the noise-prediction loss that LDM runs in latent space. See our DDPM explainer.

VQGAN / VQ-VAE: The perceptual + patch-adversarial first stage and the vector-quantization codebook (Esser et al. 2021; van den Oord et al. 2017).

LPIPS / PatchGAN: The perceptual loss (Zhang et al. 2018) and the patch discriminator (Isola et al. 2017) that keep reconstructions on the image manifold.

Classifier-free guidance: Ho & Salimans (2021), the guidance that sharpens conditional samples, at scale s = 1.5 for the reported numbers.

Correction: Latent Diffusion is not the same model as Stable Diffusion. This paper’s text-to-image LDM uses the BERT tokenizer and trains a Transformer τθ from scratch on LAION-400M; Stable Diffusion (the later release) is a larger LDM in the same family that swaps in a frozen, pretrained CLIP text encoder. Also: the reported FID/IS numbers use classifier-free guidance scale s = 1.5, while the qualitative samples in the paper’s Figure 5 use s = 10.0.

Questions you might still have

Is Latent Diffusion the same thing as Stable Diffusion?
No, though they are close kin. This paper’s text-to-image model is a 1.45B-parameter LDM trained on LAION-400M, with text encoded by a Transformer trained from scratch on a BERT tokenizer. Stable Diffusion, released the following year, is a latent diffusion model in the same family trained at much larger scale, with the text encoder swapped for a frozen, pretrained CLIP ViT-L/14. The architecture here made Stable Diffusion possible; the famous product is a scaled-up instance of it with a different text encoder.

If the autoencoder throws detail away, why aren’t the samples blurry?
Because the discarded bits are perceptually irrelevant by construction. The autoencoder is trained with a perceptual loss and a patch discriminator, not pixel error, so its reconstructions stay on the image manifold rather than averaging into mush. At f = 4 the reconstruction FID is 0.58 and PSNR 27.4 dB; the eye cannot tell. The loss only shows up for tasks that need exact pixels.

Why not compress harder to make the diffusion even cheaper?
Past the elbow of the rate-distortion curve you start deleting semantic content, not detail alone. The paper’s f = 16 and f = 32 models stall at a worse quality ceiling because the autoencoder threw away too much. f = 4 and f = 8 sit at the bottom of the U: cheap to run, but still large enough to keep the meaning.

In the cross-attention, does the query come from the image or the text?
From the image. The query Q is built from the U-Net’s own feature map φ(z_t); the keys and values K, V are built from the condition τθ(y). So the query for a patch that is turning into a squirrel attends most to the word “squirrel” and weights its value highest; the condition supplies only keys and values, never queries.

Footnotes & further reading

¹ The diffusion model assumes its input has a sensible scale, so for the KL-regularized latents the paper rescales $z$ by its measured component-wise standard deviation before training the diffusion model on it. The VQ latents already sit near unit variance and need no rescaling. The signal-to-noise ratio this sets matters when sampling images larger than the training resolution in a convolutional fashion.
² The benchmark numbers in the paper's tables use guidance scale $s = 1.5$ , but the eye-catching text-to-image samples in its Figure 5 use a much stronger $s = 10.0$ . Classifier-free guidance is Ho & Salimans, Classifier-Free Diffusion Guidance (NeurIPS 2021 workshop). Note that the explainer's scale $s$ corresponds to $1+w$ in that paper's notation.
The paper: Rombach, Blattmann, Lorenz, Esser, Ommer, High-Resolution Image Synthesis with Latent Diffusion Models (CVPR 2022).
The diffusion mechanics LDM runs in latent space: Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (2020), and our DDPM explainer.
The first-stage lineage: Esser, Rombach, Ommer, Taming Transformers for High-Resolution Image Synthesis (VQGAN, 2021), building on van den Oord, Vinyals, Kavukcuoglu, Neural Discrete Representation Learning (VQ-VAE, 2017).
The first-stage losses: Zhang, Isola, Efros, Shechtman, Wang, The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (LPIPS), and Isola, Zhu, Zhou, Efros, Image-to-Image Translation with Conditional Adversarial Networks (the PatchGAN discriminator).
Classifier guidance, the predecessor: Dhariwal, Nichol, Diffusion Models Beat GANs on Image Synthesis (ADM, 2021).
The scaled-up product built on this architecture: the Stable Diffusion release, a latent diffusion model conditioned on a frozen CLIP ViT-L/14 text encoder.
