I'm new to diffusion models, so I'm trying to familiarize myself with the theory.
In the article Score-Based Generative Modeling through Stochastic Differential Equations (Song et al.), it's explained that we need to solve the reverse-time SDE to obtain samples from the image distribution $p_{0}$:
$$\mathrm{d}\mathbf{x} = \left[\mathbf{f}(\mathbf{x},t) - g(t)^{2}\,\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})\right]\mathrm{d}t + g(t)\,\mathrm{d}\overline{\mathbf{w}}$$
Thus, we need to estimate the score $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})$ to solve the previous equation, so we train a neural network to predict it (with score matching, sliced score matching, etc.).
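For my own reference, here is a minimal sketch of what I understand the denoising score matching objective to look like, assuming a VE-style kernel $\mathbf{x}_t = \mathbf{x}_0 + \sigma_t \varepsilon$; `score_net` and `marginal_std` are hypothetical placeholders, not any particular library's API:

```python
import torch

def dsm_loss(score_net, x0, marginal_std):
    """Denoising score matching loss, assuming a VE-style kernel
    x_t = x_0 + sigma_t * eps with sigma_t = marginal_std(t)."""
    t = torch.rand(x0.shape[0], device=x0.device)             # t ~ U(0, 1)
    sigma = marginal_std(t).view(-1, *([1] * (x0.dim() - 1)))
    eps = torch.randn_like(x0)                                # eps ~ N(0, I)
    x_t = x0 + sigma * eps                                    # perturbed sample
    score = score_net(x_t, t)                                 # estimate of grad_x log p_t(x)
    # The transition kernel's score is -eps / sigma; weighting the squared
    # error by lambda(t) = sigma_t^2 turns the residual into (sigma * score + eps).
    return ((sigma * score + eps) ** 2).flatten(1).sum(-1).mean()
```

If I swap the target $-\varepsilon/\sigma_t$ for $\varepsilon$ itself, I seem to recover the DDPM-style noise-prediction loss, which is what makes me suspect the two objectives only differ by the $\sigma_t$ factor.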
However, in practice, I've seen many codebases (even recent ones) train their neural networks to predict the noise $\varepsilon$ given $\mathbf{x}_{t}$ and $t$ (as in DDPM).
So I'm trying to understand the connection between the noise $\varepsilon$ and the score $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})$. I know that for Gaussian transition kernels the training objectives are the same (up to a coefficient, see the first article), but this seems very restrictive, as the Gaussian transition kernel assumption only holds for affine drift coefficients $\mathbf{f}(\mathbf{x},t)$.
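To make the "up to a coefficient" part concrete, here is the computation as I understand it, for a Gaussian kernel $p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t;\, \alpha_t \mathbf{x}_0,\, \sigma_t^{2}\mathbf{I})$, i.e. $\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \varepsilon$:

$$\nabla_{\mathbf{x}_t}\log p_{0t}(\mathbf{x}_t \mid \mathbf{x}_0) = -\frac{\mathbf{x}_t - \alpha_t \mathbf{x}_0}{\sigma_t^{2}} = -\frac{\varepsilon}{\sigma_t},$$

so a noise predictor $\varepsilon_\theta$ and a score network $s_\theta$ appear to be interchangeable via $s_\theta(\mathbf{x},t) = -\varepsilon_\theta(\mathbf{x},t)/\sigma_t$. My question is whether this identification survives when $\mathbf{f}(\mathbf{x},t)$ is not affine and the transition kernel is no longer Gaussian.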
Thanks for your help,
Pepper08