6.7 Latent Diffusion
High-Resolution Image Synthesis with Latent Diffusion Models
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining.
However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations.
To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows, for the first time, reaching a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity.
By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes, and high-resolution synthesis becomes possible in a convolutional manner.
Our latent diffusion models (LDMs) achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including text-to-image synthesis, unconditional image generation and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.
6.7.1 Introduction
Image synthesis is one of the computer vision fields with the most spectacular recent development, but also among those with the greatest computational demands. Especially high-resolution synthesis of complex, natural scenes is presently dominated by scaling up likelihood-based models, potentially containing billions of parameters in autoregressive (AR) transformers.
In contrast, the promising results of GANs have been revealed to be mostly confined to data with comparably limited variability, as their adversarial learning procedure does not easily scale to modeling complex, multi-modal distributions. Recently, diffusion models, which are built from a hierarchy of denoising autoencoders, have been shown to achieve impressive results in image synthesis and beyond, and define the state of the art in class-conditional image synthesis and super-resolution.
Moreover, even unconditional DMs can readily be applied to tasks such as inpainting, colorization, or stroke-based synthesis, in contrast to other types of generative models. Being likelihood-based models, they do not exhibit the mode collapse and training instabilities of GANs and, by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without involving billions of parameters as in AR models.
Democratizing High-Resolution Image Synthesis. DMs belong to the class of likelihood-based models, whose mode-covering behavior makes them prone to spend excessive amounts of capacity (and thus compute resources) on modeling imperceptible details of the data.
6.7.2 Related Work
6.7.3 Method
To lower the computational demands of training diffusion models towards high-resolution image synthesis, we observe that although diffusion models allow ignoring perceptually irrelevant details by undersampling the corresponding loss terms, they still require costly function evaluations in pixel space, which causes huge demands in computation time and energy resources.
We propose to circumvent this drawback by introducing an explicit separation of the compressive from the generative learning phase. To achieve this, we utilize an autoencoding model which learns a space that is perceptually equivalent to the image space, but offers significantly reduced computational complexity.
Such an approach offers several advantages: (i) By leaving the high-dimensional image space, we obtain DMs which are computationally much more efficient because sampling is performed on a low-dimensional space.
(ii) We exploit the inductive bias of DMs inherited from their UNet architecture, which makes them particularly effective for data with spatial structure and therefore alleviates the need for aggressive, quality-reducing compression levels as required by previous approaches.
(iii) Finally, we obtain general-purpose compression models whose latent space can be used to train multiple generative models and which can also be utilized for other downstream applications such as single-image CLIP-guided synthesis.
6.7.3.1 Perceptual Image Compression
Our perceptual compression model is based on previous work and consists of an autoencoder trained by a combination of a perceptual loss and a patch-based adversarial objective. This ensures that the reconstructions are confined to the image manifold by enforcing local realism and avoids the blurriness introduced by relying solely on pixel-space losses such as \(L_2\) or \(L_1\) objectives.
More precisely, given an image \(x \in \mathbb{R}^{H \times W \times 3}\) in RGB space, the encoder \(\mathcal{E}\) encodes \(x\) into a latent representation \(z = \mathcal{E}(x)\), and the decoder \(\mathcal{D}\) reconstructs the image from the latent, giving \(\tilde{x} = \mathcal{D}(z) = \mathcal{D}(\mathcal{E}(x))\), where \(z \in \mathbb{R}^{h \times w \times c}\). Importantly, the encoder downsamples the image by a factor \(f = \frac{H}{h} = \frac{W}{w}\), and we investigate different downsampling factors \(f = 2^m\), with \(m \in \mathbb{N}\).
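As a minimal sketch of this shape relationship only (not the paper's architecture; `ToyEncoder`, `ToyDecoder`, the channel widths, and the latent dimension \(c\) are illustrative placeholders), \(m\) stride-2 convolutions in the encoder realize the downsampling factor \(f = 2^m\), and the decoder mirrors them with upsampling stages:

```python
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Illustrative encoder: m stride-2 convolutions give a downsampling factor f = 2**m."""

    def __init__(self, m: int = 3, c_latent: int = 4, width: int = 64):
        super().__init__()
        layers = [nn.Conv2d(3, width, 3, padding=1)]
        for _ in range(m):
            layers += [nn.SiLU(), nn.Conv2d(width, width, 3, stride=2, padding=1)]
        layers += [nn.SiLU(), nn.Conv2d(width, c_latent, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


class ToyDecoder(nn.Module):
    """Illustrative decoder: m nearest-neighbour upsamplings restore the original H x W."""

    def __init__(self, m: int = 3, c_latent: int = 4, width: int = 64):
        super().__init__()
        layers = [nn.Conv2d(c_latent, width, 3, padding=1)]
        for _ in range(m):
            layers += [nn.SiLU(), nn.Upsample(scale_factor=2, mode="nearest"),
                       nn.Conv2d(width, width, 3, padding=1)]
        layers += [nn.SiLU(), nn.Conv2d(width, 3, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)


x = torch.randn(1, 3, 256, 256)      # image x (channels-first), H = W = 256
z = ToyEncoder(m=3)(x)               # latent z with h = w = 256 / 2**3 = 32, c = 4
x_rec = ToyDecoder(m=3)(z)           # reconstruction x_tilde = D(E(x))
print(z.shape, x_rec.shape)          # torch.Size([1, 4, 32, 32]) torch.Size([1, 3, 256, 256])
```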
In order to avoid arbitrarily high-variance latent spaces, we experiment with two different kinds of regularizations. The first variant, KL-reg., imposes a slight KL-penalty towards a standard normal on the learned latent, similar to a VAE, whereas VQ-reg. uses a vector quantization layer within the decoder.
This model can be interpreted as a VQGAN, but with the quantization layer absorbed by the decoder. Because our subsequent DM is designed to work with the two-dimensional structure of our learned latent space \(z = \mathcal{E}(x)\), we can use relatively mild compression rates and achieve very good reconstructions.
This is in contrast to previous works, which relied on an arbitrary 1D ordering of the learned space \(z\) to model its distribution autoregressively and thereby ignored much of the inherent structure of \(z\). Hence, our compression model preserves details of \(x\) better. The full objective and training details can be found in the supplement.
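Of the two regularization variants above, the KL-reg. penalty is the simpler one to sketch. The following is a hedged example, not the exact training objective: the encoder predicts per-location means and log-variances of the latent, and a slight KL term toward a standard normal is added to the autoencoder loss (the shapes and the weight `kl_weight` are illustrative placeholders).

```python
import torch


def kl_to_standard_normal(mean: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL( N(mean, exp(logvar)) || N(0, I) ), summed over latent dims, averaged over the batch."""
    kl = 0.5 * (mean.pow(2) + logvar.exp() - 1.0 - logvar)
    return kl.sum(dim=[1, 2, 3]).mean()


# Illustrative usage: the encoder outputs 2*c channels, split into mean / log-variance.
moments = torch.randn(1, 8, 32, 32)                          # e.g. c = 4 latent channels
mean, logvar = moments.chunk(2, dim=1)
z = mean + torch.exp(0.5 * logvar) * torch.randn_like(mean)  # reparameterized latent sample
kl_weight = 1e-6                                             # "slight" penalty (placeholder value)
reg_loss = kl_weight * kl_to_standard_normal(mean, logvar)   # added to the reconstruction/adversarial loss
```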
6.7.3.2 Latent Diffusion Models
Diffusion Models are probabilistic models designed to learn a data distribution \(p(x)\) by gradually denoising a normally distributed variable, which corresponds to learning the reverse process of a fixed Markov Chain of length \(T\).
For image synthesis, the most successful models rely on a reweighted variant of the variational lower bound on \(p(x)\), which mirrors denoising score-matching.
These models can be interpreted as an equally weighted sequence of denoising autoencoders \(\epsilon_{\theta}(x_t, t)\), \(t = 1, \cdots, T\), which are trained to predict a denoised variant of their input \(x_t\), where \(x_t\) is a noisy version of the input \(x\). The corresponding objective can be simplified to
\[
L_{DM} = \mathbb{E}_{x,\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[ \, \big\| \epsilon - \epsilon_{\theta}(x_t, t) \big\|_2^2 \, \Big],
\]
with \(t\) uniformly sampled from \(\{1, \cdots, T\}\).
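A minimal sketch of this objective under standard DDPM-style assumptions: a fixed noise schedule with cumulative products \(\bar{\alpha}_t\) and an \(\epsilon\)-prediction network passed in as `model`; the linear schedule values below are illustrative, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F


def dm_loss(model, x0: torch.Tensor, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """Simplified DM objective: predict the noise eps that was added to x0 at a random step t."""
    b, T = x0.shape[0], alphas_cumprod.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)          # t sampled uniformly
    eps = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps     # fixed forward (noising) process
    return F.mse_loss(model(x_t, t), eps)                    # || eps - eps_theta(x_t, t) ||_2^2


# Illustrative linear beta schedule; the actual schedule is a design choice.
betas = torch.linspace(1e-4, 2e-2, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
```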
Generative Modeling of Latent Representations With our trained perceptual compression models consisting of \(\mathcal{E}\) and \(\mathcal{D}\), we now have access to an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away. Compared to the high-dimensional pixel space, this space is more suitable for likelihood-based generative models, as they can now (i) focus on the important, semantic bits of the data and (ii) train in a lower dimensional, computationally much more efficient space.
Unlike previous work that relied on autoregressive, attention-based transformer models in a highly compressed, discrete latent space, we can take advantage of the image-specific inductive biases that our model offers. This includes the ability to build the underlying UNet primarily from 2D convolutional layers, and further focusing the objective on the perceptually most relevant bits using the reweighted bound, which now reads
\[
L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0,1),\, t}\Big[ \, \big\| \epsilon - \epsilon_{\theta}(z_t, t) \big\|_2^2 \, \Big].
\]
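A corresponding sketch for the latent-space objective, under the same assumptions as the previous example: the only change is that a frozen first-stage encoder \(\mathcal{E}\) first maps images to latents and the same \(\epsilon\)-prediction loss is computed on noised latents \(z_t\); `eps_model` and `encoder` are placeholders for the denoising UNet and the trained compression model.

```python
import torch
import torch.nn.functional as F


def ldm_loss(eps_model, encoder, x: torch.Tensor, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """L_LDM: the same eps-prediction loss, now computed on latents z = E(x) from the frozen first stage."""
    with torch.no_grad():                                    # the compression model E is not trained here
        z0 = encoder(x)
    b, T = z0.shape[0], alphas_cumprod.shape[0]
    t = torch.randint(0, T, (b,), device=z0.device)
    eps = torch.randn_like(z0)
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps     # noising now happens in latent space
    return F.mse_loss(eps_model(z_t, t), eps)                # || eps - eps_theta(z_t, t) ||_2^2
```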
6.7.3.3 Conditioning Mechanisms
Similar to other types of generative models, diffusion models are in principle capable of modeling conditional distributions of the form \(p(z|y)\). This can be implemented with a conditional denoising autoencoder \(\epsilon_{\theta}(z_t, t, y)\), which paves the way to controlling the synthesis process through inputs \(y\) such as text, semantic maps, or other image-to-image translation tasks.
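A minimal single-head sketch of cross-attention conditioning (the dimensions and the residual wiring are illustrative, not the exact UNet block of the paper): queries are computed from flattened intermediate feature maps, while keys and values come from an embedding \(\tau_\theta(y)\) of the conditioning input, here represented by a placeholder tensor `ctx`.

```python
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Attention(Q, K, V) with Q from UNet features and K, V from the conditioning embedding tau_theta(y)."""

    def __init__(self, dim_q: int, dim_ctx: int, dim_head: int = 64):
        super().__init__()
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(dim_q, dim_head, bias=False)
        self.to_k = nn.Linear(dim_ctx, dim_head, bias=False)
        self.to_v = nn.Linear(dim_ctx, dim_head, bias=False)
        self.to_out = nn.Linear(dim_head, dim_q)

    def forward(self, feats: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) intermediate UNet features; ctx: (B, M, dim_ctx) = tau_theta(y)
        B, C, H, W = feats.shape
        q = self.to_q(feats.flatten(2).transpose(1, 2))          # (B, H*W, dim_head)
        k, v = self.to_k(ctx), self.to_v(ctx)                    # (B, M, dim_head)
        attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=-1)
        out = self.to_out(attn @ v)                              # (B, H*W, C)
        return feats + out.transpose(1, 2).reshape(B, C, H, W)   # residual injection into the feature map


# Illustrative usage: 4 conditioning "tokens" attending into an 8x8 feature map.
feats = torch.randn(1, 128, 8, 8)
ctx = torch.randn(1, 4, 512)
out = CrossAttention(dim_q=128, dim_ctx=512)(feats, ctx)
print(out.shape)    # torch.Size([1, 128, 8, 8])
```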