I've noticed the background of images from diffusion models has a "dithering" pattern, where neighboring pixels are slightly off from one another. This makes them annoying to edit later, so it'd be good to correct these slight differences. There are a couple of quick fixes that may work:
- Append a single-stride convolution at the end of the decoder (see the sketch after this list).
- Run a small diffusion model over 4x4 patches (similar to how upscaling is done?).
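A minimal sketch of the first fix, assuming a PyTorch-style pipeline (the `base_decoder` module, channel count, and kernel size are placeholders): wrap the existing decoder with a stride-1 convolution, initialised to the identity, and fine-tune only that layer to iron out the per-pixel jitter.

```python
import torch
import torch.nn as nn

class SmoothedDecoder(nn.Module):
    """Wrap an existing decoder with a stride-1 "cleanup" convolution.

    `base_decoder` is a placeholder for whatever decoder the pipeline
    already uses (e.g. a VAE decoder); only the extra conv is new.
    """

    def __init__(self, base_decoder: nn.Module, channels: int = 3, kernel_size: int = 3):
        super().__init__()
        self.base_decoder = base_decoder
        # Stride 1 and "same" padding keep the spatial size unchanged.
        self.cleanup = nn.Conv2d(channels, channels, kernel_size,
                                 stride=1, padding=kernel_size // 2)
        # Initialise to the identity so the wrapped model starts out
        # producing exactly the original decoder's output.
        nn.init.zeros_(self.cleanup.weight)
        nn.init.zeros_(self.cleanup.bias)
        with torch.no_grad():
            centre = kernel_size // 2
            for c in range(channels):
                self.cleanup.weight[c, c, centre, centre] = 1.0

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.cleanup(self.base_decoder(latents))
```

Freezing `base_decoder` and fine-tuning only `cleanup` (say, against a few hand-flattened backgrounds) keeps this cheap; whether a 3x3 receptive field is enough to kill the pattern is an open question.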
However, it seems there's a more fundamental problem. The universal function approximation theorems all require these networks to be contraction maps (or at least Lipschitz continuous), but for editing purposes being one unit off is just as bad as being 255 off. You'd want activation functions that are expansion maps, potentially even discontinuous ones! And while a residual network solves the vanishing gradient problem, it wouldn't cope with the exploding gradients those expansions bring, so deep learning goes out the window.
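To make the "one unit off is as bad as 255" point concrete, here's a toy patch (the numbers are invented): a Lipschitz-style shrink towards the local mean reduces the dithering but never makes neighbouring pixels exactly equal, whereas a discontinuous snap to a coarser grid does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "background" patch: nominally flat at 200/255, but each pixel is
# dithered by a unit or two, like the decoder outputs described above.
background = 200 + rng.integers(-2, 3, size=(8, 8)).astype(np.float64)

# A Lipschitz-style fix (gain < 1 around the local mean): the jitter
# shrinks, but neighbours still disagree, so selections/fills still break.
contracted = background.mean() + 0.25 * (background - background.mean())

# A discontinuous fix: snap every pixel to a coarse grid (posterise).
# Now neighbours agree exactly, at the cost of a non-Lipschitz step.
snapped = 4 * np.rint(background / 4)

for name, patch in [("original", background), ("contracted", contracted), ("snapped", snapped)]:
    gaps = np.abs(np.diff(patch, axis=1))
    print(f"{name:>10}: max neighbour gap = {gaps.max():.2f}")
```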
A different way to fix this is by discretizing every now and then. You could put a Hopfield network in the middle, use spiking neural networks, or anneal*. These are a lot more difficult to train because they're discontinuous, and it might even be best to just train them with reinforcement learning.
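On the training difficulty: one workaround worth naming (my assumption, not one of the options above) is a straight-through estimator, which keeps a hard snap-to-grid step in the forward pass while pretending it's the identity in the backward pass, so the discretization doesn't kill the gradients outright. A rough PyTorch sketch:

```python
import torch
import torch.nn as nn

class HardRound(nn.Module):
    """Discontinuous snap-to-grid layer with straight-through gradients.

    Forward: round activations to the nearest multiple of `step`.
    Backward: treat the rounding as the identity, so gradients still
    flow through what is otherwise a zero-gradient discretization.
    """

    def __init__(self, step: float = 1.0):
        super().__init__()
        self.step = step

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rounded = self.step * torch.round(x / self.step)
        # Equals `rounded` in the forward pass, but has the gradient of `x`.
        return x + (rounded - x).detach()

# Usage: drop it between ordinary layers to "discretize every now and then".
net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), HardRound(0.5), nn.Linear(64, 16))
net(torch.randn(4, 16)).sum().backward()  # gradients still reach the first Linear
```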
Is there a simpler way of fixing this?
*: E.g. if you have a text diffusion model, you can take the latents for each token, softmax the predicted outputs, and harden the tokens before putting them back into the latent space. Slowly increasing the inverse temperature (i.e. cooling things down) should let it converge, as long as earlier tokens are cooled faster.
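A rough sketch of the hardening step described in this footnote, assuming the model's predictions are available as per-token logits (the function name and schedule constants are invented): sharpen each token's distribution with a softmax at an inverse temperature that grows over denoising steps and is higher for earlier positions, then map the hardened distributions back into latent space (e.g. as a weighted mix of token embeddings).

```python
import torch
import torch.nn.functional as F

def harden_tokens(logits: torch.Tensor, step: int, total_steps: int,
                  base_beta: float = 1.0, growth: float = 4.0) -> torch.Tensor:
    """Anneal per-token distributions towards one-hot vectors.

    logits: (seq_len, vocab_size) predicted token logits.
    Returns sharpened distributions; earlier positions get a higher
    inverse temperature, i.e. they are cooled faster.
    """
    seq_len = logits.shape[0]
    # Global schedule: inverse temperature grows as denoising proceeds.
    progress = (step + 1) / total_steps
    # Positional schedule: earlier tokens get a head start on cooling.
    position_boost = torch.linspace(2.0, 1.0, seq_len).unsqueeze(1)
    beta = base_beta * (1.0 + growth * progress) * position_boost
    return F.softmax(beta * logits, dim=-1)

# Example: 5 tokens, vocabulary of 8, halfway through denoising.
hardened = harden_tokens(torch.randn(5, 8), step=10, total_steps=20)
# The hardened distributions can then be pushed back to latent space,
# e.g. `hardened @ embedding_table` (assumed setup, not shown here).
```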