
Currently I'm trying to understand how diffusion models work for a thesis I'm writing. I have an intuition of how they work, but I'm still trying to deepen my understanding. Something I keep stumbling over is a formula (shown below) from this paper, which uses a Gaussian distribution: $$p(x_t|x_{t-1}) := \mathcal{N}(x_t;\sqrt{1-\beta_t}x_{t-1},\beta_tI)$$ This is Formula 6 on page 3 of the paper.
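To make the notation concrete for myself, this is how I currently read that formula as a sampling step (a minimal NumPy sketch; the variable names, the flattening of the image into a vector, and the value of `beta_t` are my own guesses, not from the paper):

```python
import numpy as np

def forward_step(x_prev, beta_t):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I) via reparameterization."""
    eps = np.random.randn(*x_prev.shape)                  # eps ~ N(0, I)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

x_prev = np.random.uniform(-1.0, 1.0, size=64 * 64 * 3)   # stand-in for x_{t-1}
x_t = forward_step(x_prev, beta_t=0.02)
```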

Now I know the following:

  • the meaning of the Gaussian distribution
  • the meaning of $p$ as a probability function
  • $x_t$ is a sample after $t$ diffusion steps

What I lack an understanding of is what exactly $x_t$ and $p(x_t)$ are. I know $x_t$ is $x$ at point $t$ in time, but what are $x_t$ and $p(x_t)$ in practice? Is $x_t$ the color values over an image as a distribution? Or is $x_t$ the pixel position and $p(x_t)$ the color value as a probability or density distribution? But if that were the case, why is $x$ indexed over time and not $p$?

1 Answer


Diffusion models' forward pass starts from an initial data sample $\mathbf{x}_0$, a fixed clean sample (such as an image) drawn from the real data distribution. As isotropic Gaussian noise is added at each forward step $t$, $\mathbf{x}_t$ represents the noisy sample at step $t$ as a random variable, and it becomes progressively noisier, approaching pure Gaussian noise.
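As a rough sketch of that forward chain (the linear $\beta_t$ schedule, the number of steps, and the toy $\mathbf{x}_0$ below are illustrative assumptions, not values taken from the paper you cite):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                   # illustrative linear noise schedule

x = np.random.uniform(-1.0, 1.0, size=64 * 64 * 3)   # stand-in for a clean image x_0
for t in range(T):
    eps = np.random.randn(*x.shape)                  # fresh isotropic Gaussian noise each step
    x = np.sqrt(1.0 - betas[t]) * x + np.sqrt(betas[t]) * eps
# After enough steps, x is (approximately) a sample from N(0, I).
```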

If your data is an image, $\mathbf{x}_t$ usually represents the image's entire set of pixel values, often as a high-dimensional real-valued vector flattened from the raw image format. For instance, for a $64 \times 64$-pixel RGB image with the standard 3 color channels, $\mathbf{x}_t$ is a vector (tensor) in $\mathbb{R}^{64 \times 64 \times 3}$.

Color digital images are made of pixels, and pixels are made of combinations of primary colors, each encoded as a numeric value. A channel in this context is a grayscale image of the same size as the color image, made of just one of these primary colors. For instance, an image from a standard digital camera will have a red, a green, and a blue channel. A grayscale image has just one channel. Another closely related concept is feature maps, which are used in convolutional neural networks.
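For example (a small NumPy illustration of the channel layout just described, with made-up values):

```python
import numpy as np

img = np.zeros((64, 64, 3), dtype=np.uint8)   # height x width x (R, G, B)
img[..., 0] = 255                             # set the red channel -> a pure red image
red, green, blue = img[..., 0], img[..., 1], img[..., 2]
print(red.shape)                              # (64, 64): one grayscale plane per channel
print(img.reshape(-1).shape)                  # (12288,): the flattened vector in R^{64*64*3}
```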

If the RGB image is 24-bit (the industry standard as of 2005), each channel has 8 bits for red, green, and blue respectively; in other words, the image is composed of three images (one per channel), each storing discrete pixel values with conventional brightness intensities between 0 and 255.
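In practice these discrete 8-bit intensities are converted to real numbers before the diffusion process is applied; a common convention (assumed here, not stated in the question's paper) is to rescale them to $[-1, 1]$:

```python
import numpy as np

img_uint8 = np.random.randint(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in for a real photo
x0 = img_uint8.astype(np.float32) / 127.5 - 1.0    # map [0, 255] -> [-1, 1]
# x0 is the real-valued x_0 that the forward process then perturbs with Gaussian noise.
```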

Therefore $\mathbf{x}_t$ represents neither pixel positions nor a probability distribution over color values. And since $\mathbf{x}_t$ changes progressively with each step $t$, its probability distribution $p(\mathbf{x}_t)$ changes accordingly. You don't need to index $p$ additionally, since $p(\mathbf{x}_t)$ depends on $t$ implicitly through $\mathbf{x}_t$.
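In fact, because every step only adds Gaussian noise, the distribution of $\mathbf{x}_t$ given the clean sample has a simple closed form (the standard DDPM identity, writing $\alpha_t := 1-\beta_t$ and $\bar{\alpha}_t := \prod_{s=1}^{t}\alpha_s$; this goes beyond the single formula quoted in the question):

$$p(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\,\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\,(1-\bar{\alpha}_t)\mathbf{I}\big)$$

which makes explicit how the distribution of $\mathbf{x}_t$ changes with $t$: the mean $\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0$ shrinks toward $\mathbf{0}$ and the covariance grows toward $\mathbf{I}$ as $t$ increases.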
