High-Resolution Image Synthesis with Latent Diffusion Models
Diffusion got expensive. Move it off the pixels.
A pretrained autoencoder strips the imperceptible detail out of an image and leaves a small latent. Run the diffusion there instead of on pixels, add a cross-attention port for text, and high-resolution synthesis stops needing a data center.
What if most of the bits in an image are detail the model never needed to learn?
By the end of 2021 the sharpest image generators were diffusion models. The recipe, the subject of our DDPM explainer, is plain: add Gaussian noise to an image over many small steps until it is static, train one network to undo a single step, then sample by running that network backward from noise. It makes beautiful pictures and trains stably, with none of the mode collapse that made GANs miserable. It also has a cost problem that had started to define the field.
That network runs on the image itself, a tensor with hundreds of thousands of numbers, and it has to run once per denoising step, hundreds of steps per image, for both training and sampling. Training the strongest pixel-space diffusion models took between 150 and 1000 V100-GPU-days. Producing 50,000 samples took about five days on a single A100. The quality was there. The price meant only a handful of labs could pay it.
Latent Diffusion, out of the CompVis group in Munich together with Runway, is the paper that cut the price without giving up the quality. The move fits in one sentence: do not run the diffusion on the pixels. Run it on a small, compressed code of the image, produced once by a separate autoencoder, and decode the result back to a picture at the very end. The same paper added a clean way to steer the generator with text or a layout, and it became the architecture behind Stable Diffusion. To see why it works, and why the compression does not wreck the images, we have to look at where an image actually spends its bits.
Where an image spends its bits
Take any natural image and ask how its information is distributed. Most of it is high-frequency detail: the exact texture of grass, the grain in a photograph, the precise value of one pixel against its neighbor. Nudge those details and a person cannot tell. They carry bits, and the bits cost a model capacity to learn, but they carry little meaning.
The authors make this concrete with a rate-distortion picture (their Figure 2). Plot how much an image is distorted, in a perceptual sense, against the rate: the number of bits you keep when you compress it. The curve is steep, then flat. The first few bits buy enormous drops in distortion, because they encode the things that matter: the layout and the objects. After an elbow, each extra bit barely changes anything you would notice. Those late bits are the imperceptible detail.
That splits compression into two regimes. Up to the elbow is semantic compression: the bits that decide what the image is. Past it is perceptual compression: detail you can drop with no visible loss. A pixel-space diffusion model does not know the difference. It spends its capacity, and your GPU-days, modeling every bit to the same standard, even though the eye discards most of them.
The plan follows directly. Put a cheap, dedicated model at the elbow to do the perceptual compression once, and let the expensive diffusion model work only on the bits that carry meaning. That cheap model is an autoencoder.
Stage one: a perceptual autoencoder
An autoencoder is two networks. An encoder compresses an image to a smaller representation. A decoder rebuilds the image from it. Train them together so the rebuild matches the original, and the representation in the middle has to capture whatever the decoder needs. Write the image as $x$, a tensor of shape $H \times W \times 3$ (height, width, three color channels). The encoder $\mathcal{E}$ produces a latent $z$, and the decoder $\mathcal{D}$ reconstructs $\tilde{x}$:

$$z = \mathcal{E}(x), \qquad \tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))$$
The latent has shape $h \times w \times c$. The number that controls everything is the downsampling factor $f = H/h = W/w$: how many times smaller each spatial side gets. At $f = 4$ a 256 × 256 image becomes a 64 × 64 latent; at $f = 8$, a 32 × 32 one. The channel count $c$ is small, 3 or 4 in the good configurations. So the latent is a small image. It keeps the two-dimensional grid, just shrunk. A bigger $f$ means a smaller latent and a cheaper diffusion loop, but past some point the autoencoder has to discard structure rather than detail, and the whole paper is in some sense the choice of where those two pressures cross; Figure 4 below plots that crossing directly.
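The arithmetic behind the downsampling factor is easy to check by hand. The sketch below is illustrative only: the function names are made up for this example, and the default of 4 latent channels is an assumption taken from the "3 or 4" range quoted above.

```python
def latent_shape(height, width, f, channels=4):
    """Shape (h, w, c) of the latent for an H x W image at downsampling factor f.

    channels=4 is an illustrative default, not a value fixed by the text.
    """
    if height % f or width % f:
        raise ValueError("image sides must be divisible by f")
    return height // f, width // f, channels


def spatial_reduction(f):
    """How many times fewer spatial cells the latent has than the image: f squared."""
    return f * f


for f in (1, 2, 4, 8, 16, 32):
    h, w, c = latent_shape(256, 256, f)
    print(f"f={f:>2}: latent {h}x{w}x{c}, {spatial_reduction(f)}x fewer spatial cells")
```

Each doubling of the factor halves both sides of the grid, so the number of cells the denoiser has to touch falls by four.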
That grid is the point, and it is easy to miss. The latent stays laid out as a small spatial grid. It is not flattened into a one-dimensional string of code numbers. Earlier two-stage methods, VQGAN and DALL-E, also compressed images first, but to feed an autoregressive Transformer they flattened the latent into a 1D sequence and predicted it left to right. A 1D ordering throws away the fact that nearby pixels are related, and a Transformer over a long sequence is costly, so those methods needed aggressive compression (VQGAN at $f = 16$, DALL-E at $f = 8$) and grew to billions of parameters. Latent Diffusion keeps the 2D grid because the diffusion model's generator, a convolutional U-Net, is built to work on grids. That is what lets it compress gently, at $f = 4$ or $f = 8$, and still reconstruct faithfully.
How faithful? At $f = 4$ the reconstruction has a PSNR of 27.4 dB and a reconstruction-FID of 0.58 (computed on ImageNet-val). (PSNR is pixel-level fidelity in decibels, higher is better; reconstruction-FID compares the distribution of reconstructed images to real ones, lower is better.) At $f = 8$ it is 23.1 dB and 1.14. Push to $f = 16$ and $f = 32$ and the numbers fall off as real detail starts to vanish.
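For readers who have not met PSNR before, here is a minimal sketch of how it is computed, assuming 8-bit pixels (peak value 255) and images given as flat lists of numbers. The toy pixel values are invented for illustration and have nothing to do with the paper's measurements.

```python
import math


def psnr(original, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(original) != len(reconstruction):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstruction)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


pixels = [0, 64, 128, 255]
slightly_off = [2, 62, 130, 253]  # every pixel off by 2
badly_off = [20, 44, 148, 235]    # every pixel off by 20

print(round(psnr(pixels, slightly_off), 2))
print(round(psnr(pixels, badly_off), 2))
```

Because the scale is logarithmic, a tenfold increase in per-pixel error costs 20 dB.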
How the autoencoder is trained matters as much as its shape, and it is not the pixel-error objective you might expect.
Keeping the latent from drifting
Train an autoencoder with a pixel-wise loss and you get blurry reconstructions. Averaging is the safe bet when you are scored pixel by pixel, and the average of plausible textures is mush. Latent Diffusion borrows VQGAN's first-stage recipe instead, with two ingredients that keep reconstructions crisp.
The first is a perceptual loss. Instead of comparing pixels, compare the two images in the feature space of a pretrained network. This is LPIPS, from a paper aptly titled The Unreasonable Effectiveness of Deep Features as a Perceptual Metric, and it tracks human judgments of similarity far better than pixel distance. The second is a patch-based adversarial loss. A small discriminator (a PatchGAN, from the pix2pix paper) inspects local patches of the reconstruction and tries to tell them from real ones, and the autoencoder is trained to fool it. That forces local realism: textures that look right up close, not just on average.
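Together these keep reconstructions on the image manifold, the paper's phrase for "looks like a real image," and dodge the blur of pixel losses. The full objective is a min-max between the autoencoder $(\mathcal{E}, \mathcal{D})$ and the patch discriminator $D_\psi$:

$$L_{\text{Autoencoder}} = \min_{\mathcal{E}, \mathcal{D}} \max_{\psi} \Big( L_{\text{rec}}\big(x, \mathcal{D}(\mathcal{E}(x))\big) - L_{\text{adv}}\big(\mathcal{D}(\mathcal{E}(x))\big) + \log D_\psi(x) + L_{\text{reg}}(x; \mathcal{E}, \mathcal{D}) \Big)$$

Read the terms left to right. $L_{\text{rec}}$ is the reconstruction loss (the perceptual one plus a pixel term). The middle pair, $-L_{\text{adv}}$ and $\log D_\psi(x)$, is the adversarial game: the discriminator learns to score real images high and reconstructions low, and the autoencoder pushes the other way. $L_{\text{reg}}$ is a regularizer, and it is doing something specific.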
Without a regularizer the encoder could place its latents anywhere, at any scale, and an arbitrarily spread-out latent space is a poor place to train a second model. So the paper pins the latent down, with one of two light touches.
KL-regularization adds a small penalty pulling the latent toward a standard normal, the same term a variational autoencoder uses. The word "small" is load-bearing: the weight is about $10^{-6}$, a thousandth of a thousandth. This is not a real VAE prior. A strong prior would buy a tidy Gaussian latent at the price of reconstruction fidelity; no prior at all would let the latent drift to arbitrary scales. The tiny weight is the compromise: it pins the scale and lets reconstruction win every other argument, so the autoencoder goes on reconstructing faithfully.

VQ-regularization instead snaps each latent vector to the nearest entry in a learned codebook, the vector-quantization trick from VQ-VAE. The paper folds the quantization step into the decoder, so this variant is, in its words, a VQGAN with the quantization layer absorbed by the decoder. Either regularizer works, and the choice barely changes the downstream samples. [1]
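To see how light the KL touch is, here is a minimal sketch of the penalty. It assumes the encoder outputs a per-dimension mean and log-variance, as a standard VAE encoder does; the function names and the toy numbers are illustrative, not taken from the paper's code.

```python
import math

KL_WEIGHT = 1e-6  # the "about 10^-6" weight quoted in the text


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mean, log_var))


def regularized_loss(reconstruction_loss, mean, log_var, weight=KL_WEIGHT):
    """Reconstruction loss plus the lightly weighted KL penalty."""
    return reconstruction_loss + weight * kl_to_standard_normal(mean, log_var)


# A latent sitting three standard deviations from the origin in each of 4 dimensions.
mean, log_var = [3.0] * 4, [0.0] * 4
print(kl_to_standard_normal(mean, log_var))   # the raw penalty
print(regularized_loss(0.1, mean, log_var))   # barely moves the total loss
```

Even a latent far from the origin adds almost nothing to the total, which is why reconstruction quality dominates training while the scale still stays bounded.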
Stage two: diffuse in the latent
With the autoencoder trained and then frozen, meaning its weights are no longer updated, the second stage is a diffusion model, and here Latent Diffusion changes exactly one thing about DDPM.
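A diffusion model learns a data distribution by reversing a fixed noising process. (The mechanics are in our DDPM explainer; the short version follows.) You add Gaussian noise to a sample in many small steps until it is indistinguishable from noise, then train a single network $\epsilon_\theta$ to look at a noised sample $x_t$ and predict the noise $\epsilon$ that was added. Knowing the noise is the same as knowing the clean sample, so a trained $\epsilon_\theta$ can walk noise back into data. The training loss is a plain mean-squared error between the true noise and the prediction:

$$L_{DM} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0, 1),\, t} \Big[ \lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2 \Big] \tag{1}$$

In a pixel-space model $x_t$ is a noised image and $\epsilon_\theta$ runs on the full $H \times W \times 3$ tensor. Latent Diffusion runs the identical objective, but on the latent. Encode the image once to $z = \mathcal{E}(x)$, noise the latent, and train $\epsilon_\theta$ to denoise $z_t$:

$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0, 1),\, t} \Big[ \lVert \epsilon - \epsilon_\theta(z_t, t) \rVert_2^2 \Big] \tag{2}$$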
The two equations differ by one symbol, $x_t$ becoming $z_t$, and that symbol is the whole paper. The denoiser is still a U-Net, the architecture DDPM uses, and because the latent is a 2D grid the U-Net's convolutions apply directly. (The U-Net is the subject of its own explainer.) Its attention layers carry over unchanged too, because $z_t$ is still a small spatial grid rather than a flattened code; the only thing that moved is how many cells each denoising step has to touch. But now it runs on a tensor that is $f^2$ times smaller in area. At $f = 8$ that is a 64-fold cut in the spatial size the expensive network has to process, every step, for training and sampling alike. The training step is the one you already know, with the encode line bolted on the front:
# stage two, pixel-space DM: the U-Net runs on the whole image
x = sample_image() # [H, W, 3] e.g. 256x256x3
t = randint(1, T)
eps = randn_like(x) # the noise the net must predict
x_t = noise(x, eps, t) # forward-noise the image (DDPM)
loss = mse(eps, eps_theta(x_t, t)) # eq (1)
# latent diffusion: encode ONCE with the frozen E, then diffuse z
z = E(x) # [h, w, c] e.g. 32x32x4
eps = randn_like(z)
z_t = noise(z, eps, t) # same forward process, on the latent
loss = mse(eps, eps_theta(z_t, t)) # eq (2) -> f^2 fewer spatial cells

The two-stage shape pays off twice. Sampling runs the denoiser over the small latent for all its steps, then decodes once with $\mathcal{D}$ to get a full-resolution image, so the heavy sequential loop never touches pixels. And the autoencoder is universal: train it once, and reuse the same latent space for many different diffusion models and tasks.
Because the forward noising is fixed, $z_t$ can be produced from $\mathcal{E}(x)$ on the fly during training, and a finished sample needs a single pass through $\mathcal{D}$. The encoder and decoder are each used once; everything in between happens in the small latent.
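"Produced on the fly" has a concrete meaning: in DDPM the noised sample at any step has a closed form, so no step-by-step simulation is needed. The sketch below assumes the linear beta schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps); the schedule used for any particular latent model may differ, and the eight-number list is only a stand-in for a flattened latent.

```python
import math
import random


def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (an assumption)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(z0, eps, alpha_bar):
    """Closed-form forward process: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * e for z, e in zip(z0, eps)]


def recover_clean(z_t, eps, alpha_bar):
    """Invert add_noise: knowing the noise is the same as knowing the clean latent."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(zt - b * e) / a for zt, e in zip(z_t, eps)]


rng = random.Random(0)
alpha_bars = alpha_bar_schedule()
z0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # stand-in for a flattened latent
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]
t = 500                                           # timesteps are 1-indexed here
z_t = add_noise(z0, eps, alpha_bars[t - 1])
z0_hat = recover_clean(z_t, eps, alpha_bars[t - 1])
```

The round trip through `recover_clean` is why predicting the noise is enough: given `z_t` and the true `eps`, the clean latent follows by algebra.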
How much to compress
$f$ is the dial that sets everything, and it has a sweet spot. At $f = 1$ there is no compression at all: this is pixel-space diffusion, the slow and expensive baseline the paper is trying to beat. At the other end, a large $f$ means a tiny latent, which sounds efficient until you remember the autoencoder had to throw away real content to get there. Too much compression and the reconstruction quality, the ceiling on everything downstream, drops.
In between is a basin. The paper trains class-conditional models at $f \in \{1, 2, 4, 8, 16, 32\}$ on ImageNet under a fixed compute budget and watches the sample quality, scored by FID. (This is sample-quality FID, how close the generated images are to real ones; it is the stage-two cousin of the reconstruction-FID from before, same idea, lower is better.) Pixel diffusion ($f = 1$) trains slowly and lands far behind; after two million steps the gap to $f = 8$ is about 38 FID. Over-compression ($f = 16$, $f = 32$) stalls early at a worse ceiling. $f = 4$ and $f = 8$ sit at the bottom of the U, and the paper picks them for most of its models.
There is a second axis the figure does not show: speed. A larger $f$ always trains and samples faster, because the latent is smaller. So the real choice is $f = 4$ for the best quality, or $f = 8$ for a smaller, faster latent at a small cost in quality. The complex datasets like ImageNet want the gentler $f = 4$; the paper's text-to-image model uses $f = 8$.
Steering it with cross-attention
A generator you cannot steer is a curiosity. The second half of the paper makes the diffusion model take direction, from text or a class label or a segmentation map, through one general mechanism.
A diffusion model is conditionable in principle: feed the denoiser the conditioning $y$ alongside the noised input and write $\epsilon_\theta(z_t, t, y)$. The question is how to feed in something like a sentence, whose shape has nothing to do with the image grid. The answer is cross-attention, the operation that lets a Transformer relate two different sequences. The shape mismatch is the real problem: a latent is a grid, a prompt is a sequence, and cross-attention lets each grid location query the sequence for what matters there, which neither concatenation (the shapes do not line up) nor a single global summary vector (all spatial control lost) can do. First, run the conditioning $y$ through a domain-specific encoder $\tau_\theta$ that turns it into a set of vectors $\tau_\theta(y)$. For text, $\tau_\theta$ is a Transformer over the tokenized prompt; for a class label it can be a lookup. Then, inside the U-Net, attention links the image to it:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V, \qquad Q = W_Q^{(i)} \varphi_i(z_t), \quad K = W_K^{(i)} \tau_\theta(y), \quad V = W_V^{(i)} \tau_\theta(y)$$
The direction matters. The query $Q$ is built from the image side, a flattened intermediate feature map $\varphi_i(z_t)$ of the U-Net. The keys $K$ and values $V$ are built from the condition $\tau_\theta(y)$. So each location in the image asks a question and the condition answers: a patch that is becoming a squirrel attends to the word "squirrel" and pulls in its value.
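The query-from-image, keys-and-values-from-condition direction can be sketched in a few lines. This is a toy version under stated assumptions: it uses the standard scaled dot-product form, takes queries, keys and values as already projected (the learned projection matrices are omitted), and the small hand-written vectors are invented for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """queries: N x d (image locations); keys: M x d, values: M x d_v (condition tokens)."""
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)  # how much this image location attends to each token
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Two image locations query three prompt tokens.
queries = [[4.0, 0.0], [0.0, 4.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
values = [[10.0, 0.0, 0.0], [0.0, 10.0, 0.0], [0.0, 0.0, 10.0]]
outputs, weights = cross_attention(queries, keys, values)
```

The output has one row per image location, not per token, so the result stays on the grid and can be reshaped back into a feature map.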
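The conditioning encoder $\tau_\theta$ and the denoiser $\epsilon_\theta$ are trained together, end to end, on image-condition pairs. The loss is the latent objective from before with $y$ threaded through:

$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0, 1),\, t} \Big[ \lVert \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \rVert_2^2 \Big] \tag{3}$$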
One mechanism, many modalities. For spatially-aligned conditions where the control is itself a grid (a low-resolution image for super-resolution, a mask for inpainting, a semantic map), the paper skips the attention and concatenates the condition onto the denoiser's input, which is cheaper and works because the grids line up.
One more lever sharpens the conditional samples: classifier-free guidance. During training the condition is randomly dropped, so the same network learns both a conditional and an unconditional denoiser. At sampling you extrapolate away from the unconditional prediction toward the conditional one,

$$\hat{\epsilon} = \epsilon_\theta(z_t, t, \varnothing) + s \cdot \big( \epsilon_\theta(z_t, t, \tau_\theta(y)) - \epsilon_\theta(z_t, t, \varnothing) \big)$$

Read the formula as two predictions and a nudge. At every denoising step the network runs twice: once with the prompt, $\epsilon_\theta(z_t, t, \tau_\theta(y))$, and once without it, $\epsilon_\theta(z_t, t, \varnothing)$. Their difference is the direction the prompt is pulling the sample, the part of the prediction that exists only because of the condition. The scale $s$ says how hard to follow that pull: $s = 1$ is the ordinary conditional prediction, and a larger $s$ pushes further along the prompt's direction than the model alone would go, sharpening how closely the sample obeys the text. Push it too far and the extrapolation overshoots into oversaturated, less varied images, which is the diversity-for-fidelity trade. (This is distinct from the older classifier guidance, which needs a separate image classifier and steers by its gradient.) The paper's reported benchmark numbers use a modest $s = 1.5$. [2]
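The "two predictions and a nudge" reading translates almost literally into code. The sketch below is a minimal illustration: the noise predictions are plain lists standing in for tensors, the function names are invented for this example, and the drop probability passed to `maybe_drop_condition` is a free parameter, not a value taken from the paper.

```python
import random


def guided_noise(eps_cond, eps_uncond, scale):
    """Start at the unconditional prediction and move `scale` times toward the conditional one."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def maybe_drop_condition(condition, drop_prob, rng, null_condition=None):
    """Training-time trick: sometimes replace the condition with an empty one,
    so a single network learns both the conditional and unconditional denoiser."""
    return null_condition if rng.random() < drop_prob else condition


eps_cond = [1.0, -2.0]    # prediction with the prompt
eps_uncond = [0.5, 0.0]   # prediction without it

print(guided_noise(eps_cond, eps_uncond, 0.0))  # [0.5, 0.0]   purely unconditional
print(guided_noise(eps_cond, eps_uncond, 1.0))  # [1.0, -2.0]  ordinary conditional
print(guided_noise(eps_cond, eps_uncond, 2.0))  # [1.5, -4.0]  extrapolated past it
```

A scale between 0 and 1 interpolates between the two predictions; anything above 1 extrapolates, which is where the extra prompt adherence comes from.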
So what does it actually do
The headline is efficiency without a quality tax. Every model in the paper trains on a single A100, against the 150-to-1000 V100-days the pixel-space competition spent, and the samples are competitive or better.
On class-conditional ImageNet, the model scores FID 10.56, and with guidance FID 3.60 at 400M parameters, ahead of the much larger ADM (FID 10.94 at 554M, or 4.59 with its classifier guidance at 608M). On unconditional CelebA-HQ faces it sets a new state-of-the-art FID of 5.11. On text-to-image over MS-COCO the 1.45B-parameter model reaches FID 12.63 with guidance, on par with GLIDE (6B parameters) and Make-A-Scene (4B) at a fraction of the size. On inpainting it sets a new state of the art and runs at least 2.7 times faster than the pixel-space equivalent while scoring better.
The text-to-image model is where this paper became famous, and it is worth being precise about what it is and is not. The paper trains a 1.45B-parameter latent diffusion model on LAION-400M, with text fed in through a Transformer trained from scratch on a BERT tokenizer. It is not Stable Diffusion. Stable Diffusion, released the following year, is a latent diffusion model in the same family, trained at much larger scale, with one swap: the from-scratch text Transformer is replaced by a frozen, pretrained CLIP text encoder (the ViT-L/14 variant). The architecture in this paper is the thing that made Stable Diffusion possible; the household name is a later, scaled-up instance of it.
Two limits remain. Sampling is still a sequential loop and slower than a GAN's single forward pass. And the autoencoder's reconstruction sets a hard ceiling: for tasks that need exact pixels, that reconstruction, faithful as it is, can become the bottleneck. The paper names super-resolution as the case where this bites.
Step back and the idea is small. Most of an image is detail you cannot see. Compress that away once, with an autoencoder trained to keep reconstructions realistic, and run the expensive diffusion model only on what is left. Add a cross-attention port so text or a layout can steer it. Nothing about the diffusion math changed. The paper just moved it somewhere cheaper, and that one move is what put image generation on a consumer GPU.
Questions you might still have
Is Latent Diffusion the same thing as Stable Diffusion?
No, though they are close kin. This paper’s text-to-image model is a 1.45B-parameter LDM trained on LAION-400M, with text encoded by a Transformer trained from scratch on a BERT tokenizer. Stable Diffusion, released the following year, is a latent diffusion model in the same family trained at much larger scale, with the text encoder swapped for a frozen, pretrained CLIP ViT-L/14. The architecture here is what made Stable Diffusion possible; the famous product is a scaled-up instance of it with a different text encoder.
If the autoencoder throws detail away, why aren’t the samples blurry?
Because the discarded bits are perceptually irrelevant by construction. The autoencoder is trained with a perceptual loss and a patch discriminator, not pixel error, so its reconstructions stay on the image manifold rather than averaging into mush. At f = 4 the reconstruction FID is 0.58 and PSNR 27.4 dB; the eye cannot tell. The loss only shows up for tasks that need exact pixels.
Why not compress harder to make the diffusion even cheaper?
Past the elbow of the rate-distortion curve you start deleting semantic content, not just detail. The paper’s f = 16 and f = 32 models stall at a worse quality ceiling because the autoencoder threw away too much. f = 4 and f = 8 sit at the bottom of the U: cheap to run, but still large enough to keep the meaning.
In the cross-attention, does the query come from the image or the text?
From the image. The query $Q$ is built from the U-Net’s own feature map $\varphi_i(z_t)$; the keys and values $K$, $V$ are built from the condition $\tau_\theta(y)$. So a patch that is turning into a squirrel reads from the word “squirrel” and pulls in its value, while the prompt itself never queries the image.
Footnotes & further reading
1. The diffusion model assumes its input has a sensible scale, so for the KL-regularized latents the paper rescales $z$ by its measured component-wise standard deviation before training the diffusion model on it. The VQ latents already sit near unit variance and need no rescaling. The signal-to-noise ratio this sets matters when sampling images larger than the training resolution in a convolutional fashion.
2. The benchmark numbers in the paper's tables use guidance scale $s = 1.5$, but the eye-catching text-to-image samples in its Figure 5 use a much stronger $s = 10$. Classifier-free guidance is Ho & Salimans, Classifier-Free Diffusion Guidance (NeurIPS 2021 workshop). Note that the explainer's scale $s$ corresponds to $1 + w$ in that paper's notation.
- The paper: Rombach, Blattmann, Lorenz, Esser, Ommer, High-Resolution Image Synthesis with Latent Diffusion Models (CVPR 2022). Code.
- The diffusion mechanics LDM runs in latent space: Ho, Jain, Abbeel, Denoising Diffusion Probabilistic Models (2020), and our DDPM explainer.
- The first-stage lineage: Esser, Rombach, Ommer, Taming Transformers for High-Resolution Image Synthesis (VQGAN, 2021), building on van den Oord, Vinyals, Kavukcuoglu, Neural Discrete Representation Learning (VQ-VAE, 2017).
- The first-stage losses: Zhang, Isola, Efros, Shechtman, Wang, The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (LPIPS), and Isola, Zhu, Zhou, Efros, Image-to-Image Translation with Conditional Adversarial Networks (the PatchGAN discriminator).
- Classifier guidance, the predecessor: Dhariwal, Nichol, Diffusion Models Beat GANs on Image Synthesis (ADM, 2021).
- The scaled-up product built on this architecture: the Stable Diffusion release, a latent diffusion model conditioned on a frozen CLIP ViT-L/14 text encoder.