How do stable diffusion models take the data into account?

Question

I'm interested in how text-to-image models like Midjourney and DALL-E work, where you enter a text prompt and get some images as output. I started reading some papers on the topic and stumbled upon "Denoising Diffusion Probabilistic Models" - https://arxiv.org/pdf/2006.11239.pdf.

It describes the diffusion process as a sequence of random variables, $x_t$, with the transitions between them defined by a Markov chain, etc.

But neither in this paper nor in any others did I find an explanation of how the training data (which I assume consists of tuples of images and accompanying text) is actually used. How does it play a role in the likelihood/objective function?

score 2 · Answer 1 · answered Jan 02 '24 at 06:12

I recently took a seminar on a similar topic, so let me explain briefly.

At inference time, the complete end-to-end process has three steps:

Text Encoding (using the CLIP model)
Image Information Creator (U-Net + Scheduler)
Image Decoder (Autoencoder)

Text Encoding: These encoder models are trained on image-caption datasets or similar, so they are good at matching text with image data.

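Image Information Creator (U-Net + Scheduler): The U-Net is trained to predict the noise that was added to an image. This is where the training data comes in: during training, random noise is added to a training image, and the U-Net, given the noisy image, the timestep, and the embedding of the accompanying text, learns to predict that noise.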
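To make the role of the training data concrete, here is a minimal sketch of one training step in plain Python. It assumes the DDPM-style noise-prediction loss, $\| \epsilon - \epsilon_\theta(x_t, t, c) \|^2$ with $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$, where $x_0$ is a training image and $c$ is the embedding of its caption. The names `make_alpha_bars`, `add_noise`, `training_loss`, and `zero_model` are made up for illustration, and the "model" is a stand-in, not a real U-Net.

```python
import math
import random

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (DDPM-style)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, eps, alpha_bar):
    """Forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]

def training_loss(model, x0, text_emb, alpha_bars, rng):
    """One Monte Carlo sample of the loss for a single (image, caption) pair."""
    t = rng.randrange(len(alpha_bars))        # random timestep
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # the noise the model must recover
    x_t = add_noise(x0, eps, alpha_bars[t])   # noised version of the training image
    eps_pred = model(x_t, t, text_emb)        # prediction conditioned on the caption
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(x0)

def zero_model(x_t, t, text_emb):
    """Hypothetical stand-in for the U-Net: always predicts zero noise."""
    return [0.0] * len(x_t)

rng = random.Random(0)
alpha_bars = make_alpha_bars()
image = [0.5, -0.2, 0.1, 0.9]   # toy "image" (or latent) with 4 values
caption_emb = [1.0, 0.0]        # toy text embedding for the paired caption
loss = training_loss(zero_model, image, caption_emb, alpha_bars, rng)
print(loss)
```

In a real setup the loss would be averaged over a batch of (image, caption) pairs and minimised by gradient descent on the U-Net's weights; the image and its caption only ever enter through `x0` and `text_emb`.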

For end-to-end image generation, an attention module is added to the U-Net. It attends to the text embeddings and steers the image representation in the direction of the text prompt.
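The attention step can be sketched roughly as follows, assuming standard scaled dot-product attention in which the queries come from image positions and the keys and values come from text tokens. The learned projection matrices that a real model applies first are left out, and all names here are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(image_queries, text_keys, text_values):
    """Each image position receives a weighted average of the text values.

    The weights say how strongly that image position attends to each text token.
    """
    d = len(text_keys[0])
    out = []
    for q in image_queries:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in text_keys])
        out.append([
            sum(w * v[j] for w, v in zip(weights, text_values))
            for j in range(len(text_values[0]))
        ])
    return out

# Toy example: 2 image positions attending over 2 text tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[5.0, 0.0], [0.0, 5.0]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
attended = cross_attention(queries, keys, values)
print(attended)
```

Each image position ends up dominated by the text token whose key matches its query best, which is how the prompt influences different regions of the image.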

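Scheduler: The scheduler controls the noise level at each iteration and uses the U-Net's noise prediction to produce a slightly less noisy image. Some schedulers also add a small amount of fresh noise after each iteration for variation.
VAE: All the steps mentioned above run in latent space for faster convergence and speed. At the end, the VAE decoder converts the latent image into a pixel image.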
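As a rough sketch of the inference loop, the following plain-Python example uses the DDPM ancestral sampling update with $\sigma_t^2 = \beta_t$, which is one scheduler among many. The noise predictor is a hypothetical stand-in for the text-conditioned U-Net, and the VAE decoding step is left as a comment because it needs a trained decoder.

```python
import math
import random

def linear_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus the cumulative products alpha_bar_t."""
    betas, alpha_bars, running = [], [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta
        betas.append(beta)
        alpha_bars.append(running)
    return betas, alpha_bars

def ddpm_step(x_t, eps_pred, t, betas, alpha_bars, rng):
    """One reverse step: remove the predicted noise, then re-add a little fresh noise."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    coef = beta_t / math.sqrt(1.0 - alpha_bars[t])
    mean = [(x - coef * e) / math.sqrt(alpha_t) for x, e in zip(x_t, eps_pred)]
    if t == 0:
        return mean  # no noise is added on the final step
    sigma = math.sqrt(beta_t)
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]

def sample(noise_predictor, text_emb, size, betas, alpha_bars, rng):
    """Start from pure noise and denoise step by step, conditioned on the text."""
    x = [rng.gauss(0.0, 1.0) for _ in range(size)]
    for t in reversed(range(len(betas))):
        eps_pred = noise_predictor(x, t, text_emb)
        x = ddpm_step(x, eps_pred, t, betas, alpha_bars, rng)
    return x  # in Stable Diffusion this latent would then go through the VAE decoder

def toy_predictor(x_t, t, text_emb):
    """Hypothetical stand-in for the trained U-Net: always predicts zero noise."""
    return [0.0] * len(x_t)

rng = random.Random(0)
betas, alpha_bars = linear_schedule(50)
latent = sample(toy_predictor, [1.0, 0.0], 4, betas, alpha_bars, rng)
print(latent)
```

With a trained noise predictor in place of `toy_predictor`, the same loop gradually turns random noise into a latent that matches the prompt.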
