Consider the following diagram from the paper "High-Resolution Image Synthesis with Latent Diffusion Models" by Robin Rombach et al.:

[Image: architecture diagram from the paper, showing the conditioning inputs and the "switch"]

In this diagram, I'm uncertain about the function of the component labeled "switch." As I understand it, the conditioning information always flows to the denoising step and is routed either to the cross-attention module or to concatenation with $z_{T}$, but not to both simultaneously. Is my understanding correct? If so, why can't the conditioning information be passed to both components at the same time? Does the paper describe a specific reason or mechanism for this design choice?

hanugm

1 Answer

Section 4.3.2 of the paper (on p. 7 in v2) answers this question:

By concatenating spatially aligned conditioning information to the input of $\epsilon_\theta$, LDMs can serve as efficient general purpose image-to-image translation models. We use this to train models for semantic synthesis, super-resolution (Sec. 4.4) and inpainting (Sec. 4.5).

In other words, the switch selects the conditioning path based on the application. If the conditioning input is a class label or text, they use the cross-attention mechanism. If it is an image, they use the concatenation mechanism.

This makes a lot of sense. Text and class guidance carry no spatial information, and thus cannot be concatenated to the spatial input of the denoising U-Net.

But semantic (class) maps and images with to-be-inpainted regions are already spatial, and thus can be concatenated to the input of the U-Net. Cross-attention would lose the spatial information these inputs provide.

This leads to a clear answer to your question: the two paths are not mutually exclusive. If you were guiding image generation with both text and an existing image, you would likely use both.
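To make the difference between the two paths concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the function names (`concat_condition`, `flatten_spatial`, `cross_attention`) are hypothetical, tensors are plain nested lists, and the learned query/key/value projections are omitted, so the token dimension is assumed to equal the channel count. It only illustrates that concatenation needs spatially aligned conditioning, while cross-attention accepts any number of conditioning tokens.

```python
import math


def concat_condition(z, c):
    """Concatenation path: stack conditioning channels onto the latent.

    z and c are nested lists shaped [channels][H][W]; both must share H and W.
    """
    h, w = len(z[0]), len(z[0][0])
    if (len(c[0]), len(c[0][0])) != (h, w):
        raise ValueError("concatenation needs spatially aligned conditioning")
    return z + c  # list concatenation along the channel axis


def flatten_spatial(z):
    """Reshape [C][H][W] into [H*W][C], one feature vector per spatial position."""
    n_ch, h, w = len(z), len(z[0]), len(z[0][0])
    return [[z[ch][i][j] for ch in range(n_ch)] for i in range(h) for j in range(w)]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Cross-attention path: each spatial position attends over all tokens.

    queries: [N][d] from the latent; keys/values: [M][d] from the conditioning.
    Assumption: learned projections are left out to keep the sketch short.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(wt * v[j] for wt, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Toy data: a 2-channel 2x2 latent, a 1-channel 2x2 mask, two 3-dim "text" tokens.
z = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
mask = [[[1.0, 0.0], [0.0, 1.0]]]
tokens = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]

x = concat_condition(z, mask)                                   # spatial path
attended = cross_attention(flatten_spatial(x), tokens, tokens)  # token path
print(len(x), len(attended), len(attended[0]))  # prints: 3 4 3
```

Here both paths are applied to the same input: the mask is concatenated channel-wise, and the result then attends over the tokens. The tokens have no height or width, so they could not go through `concat_condition`, which is the point made above about text and class guidance.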

Josiah Yoder