
I'm new to diffusion models, so I'm trying to familiarize myself with the theory.

In the article Score-Based Generative Modeling through Stochastic Differential Equations (Song et al.), it's explained that we need to solve the reverse-time SDE to obtain samples from the image distribution $p_{0}$:

$$\mathrm{d}\mathbf{x} = [\mathbf{f}(\mathbf{x},t) - g(t)^{2}\,\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})]\,\mathrm{d}t + g(t)\,\mathrm{d}\overline{\mathbf{w}}$$

Thus, we need to estimate the score $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})$ to solve the previous equation, so we train a neural network to predict it (with score matching, sliced score matching, etc.).
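(As I understand it, the objective actually optimized is the denoising form of score matching,

$$\mathbb{E}_{t}\,\mathbb{E}_{\mathbf{x}_0}\,\mathbb{E}_{\mathbf{x}_t \mid \mathbf{x}_0}\Big[\lambda(t)\,\big\|\mathbf{s}_\theta(\mathbf{x}_t,t) - \nabla_{\mathbf{x}_t}\log p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0)\big\|^{2}\Big],$$

where $p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0)$ is the perturbation kernel of the forward SDE and $\lambda(t)$ is a positive weighting; please correct me if I've misread the paper.)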

However, in practice, I've seen many codebases (even recent ones) train their neural networks to predict the noise $\varepsilon$ given $\mathbf{x}_{t}$ and $t$ (as in DDPM).

So I'm trying to understand the connection between the noise $\varepsilon$ and the score $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})$. I know that for Gaussian transition kernels the training objective is the same (up to a coefficient, see the first article), but this seems very restrictive, since the Gaussian transition kernel assumption is only valid for an affine drift coefficient $\mathbf{f}(\mathbf{x},t)$.

Thanks for your help,
Pepper08


1 Answer


The loss function for noise prediction in DDPMs is typically the mean squared error between the true noise and the predicted noise, formally expressed as $\mathbb{E}_{\mathbf{x}_0,\epsilon,t}[\|\epsilon - \hat{\epsilon}_\theta(\mathbf{x}_t,t)\|^2]$. As you rightly noted, this turns out to be equivalent to the score-matching objective only when the drift term $\mathbf{f}(\mathbf{x},t)$ of the forward SDE is affine, which is what implies a Gaussian transition kernel.
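To make the "up to a coefficient" part explicit: with a Gaussian perturbation kernel $p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t;\, \alpha_t \mathbf{x}_0,\, \sigma_t^{2}\mathbf{I})$, i.e. $\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \epsilon$ with $\epsilon \sim \mathcal{N}(\mathbf{0},\mathbf{I})$, the conditional score is

$$\nabla_{\mathbf{x}_t}\log p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0) = -\frac{\mathbf{x}_t - \alpha_t \mathbf{x}_0}{\sigma_t^{2}} = -\frac{\epsilon}{\sigma_t},$$

so a score model and a noise predictor are related by $\mathbf{s}_\theta(\mathbf{x}_t,t) = -\hat{\epsilon}_\theta(\mathbf{x}_t,t)/\sigma_t$, and the denoising score-matching loss and the DDPM noise-prediction loss differ only by the time-dependent weighting $\sigma_t^{2}$.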

Predicting the noise often provides a more stable training objective, since the noise is simpler to model than the score directly, especially when using Gaussian transitions with a fixed noise schedule. Moreover, predicting the noise corresponds directly to denoising, which makes it intuitive and aligns with the iterative denoising steps used in the reverse diffusion. Even if the Gaussian transition kernel assumption is somewhat restrictive, it has been found empirically that training with noise prediction yields high-quality, realistic samples, so it remains a popular choice in practice.
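As a rough sketch of how the two views coexist in practice (plain NumPy, with a hypothetical `eps_model` standing in for a trained noise-prediction network and an assumed VP-SDE linear $\beta(t)$ schedule; an illustration, not a reference implementation), you can train on noise prediction and still sample with the reverse-time SDE by converting the predicted noise into a score:

```python
import numpy as np

# Hypothetical linear beta(t) schedule for the VP-SDE (values follow common practice).
beta_min, beta_max = 0.1, 20.0

def beta(t):
    return beta_min + t * (beta_max - beta_min)

def marginal_sigma(t):
    # Std of the VP perturbation kernel: sigma_t^2 = 1 - exp(-int_0^t beta(s) ds).
    log_mean_coeff = -0.25 * t ** 2 * (beta_max - beta_min) - 0.5 * t * beta_min
    return np.sqrt(1.0 - np.exp(2.0 * log_mean_coeff))

def eps_model(x, t):
    # Placeholder for a trained noise-prediction network epsilon_theta(x_t, t).
    return np.zeros_like(x)

def score_from_eps(x, t):
    # The conversion discussed above: score(x_t, t) = -epsilon_theta(x_t, t) / sigma_t.
    return -eps_model(x, t) / marginal_sigma(t)

def reverse_sde_sample(shape, n_steps=1000, t_eps=1e-3):
    # Euler-Maruyama integration of the reverse-time VP-SDE from t = 1 down to t ~ 0.
    x = np.random.randn(*shape)                  # start from the prior N(0, I)
    dt = -(1.0 - t_eps) / n_steps                # negative step: integrating backwards in time
    for t in np.linspace(1.0, t_eps, n_steps):
        drift = -0.5 * beta(t) * x - beta(t) * score_from_eps(x, t)  # f(x,t) - g(t)^2 * score
        g = np.sqrt(beta(t))                                         # diffusion coefficient g(t)
        x = x + drift * dt + g * np.sqrt(-dt) * np.random.randn(*shape)
    return x

sample = reverse_sde_sample((1, 32, 32))
```

The only place the score enters the sampler is through the $-\hat{\epsilon}_\theta/\sigma_t$ conversion, which is why codebases can train with the DDPM noise-prediction loss and still plug the model into reverse-SDE or probability-flow ODE solvers.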

cinch