
I want to know why diffusion models always use U-Net.

In my opinion, they use U-Net because you can see features of different resolutions and skip connection is good to add detail of images. But I am not sure if it is the main reason why they use U-Net.

Are there other reasons they choose U-Net rather than other architectures?

Penguin.jpg

3 Answers


I don't have a definitive answer, but I'll state my intuitions anyway:

Diffusion models are closely related to the idea of stacked denoising autoencoders [Kumar et al. (2014)]. Additionally, U-Net-like architectures are a very common choice for autoencoders on images. Here, I would agree with your intuition that the bottleneck and the skip connections help with denoising because they provide representations at different granularities.
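To make the "representations at different granularities" point concrete, here is a minimal NumPy sketch (not any real diffusion codebase) of the U-Net shape: a downsampling path to a coarse view, an upsampling path back, and a skip connection that merges full-resolution detail with the coarse context. The function names `down`, `up`, and `toy_unet` are illustrative, not from any library.

```python
import numpy as np

def down(x):
    # 2x2 average pooling: halves the spatial resolution (coarser features)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up(x):
    # nearest-neighbour upsampling: doubles the spatial resolution
    return x.repeat(2, axis=0).repeat(2, axis=1)

def toy_unet(x):
    skip = x                         # skip connection keeps fine detail
    coarse = down(x)                 # bottleneck sees a coarser view
    restored = up(coarse)            # decoder path back to full size
    return 0.5 * (restored + skip)   # merge coarse context with fine detail

img = np.arange(16.0).reshape(4, 4)
out = toy_unet(img)
assert out.shape == img.shape  # output matches input resolution
```

A real U-Net replaces the pooling and upsampling with learned convolutions and concatenates (rather than averages) the skip features, but the structural point is the same: without the skip, everything the decoder produces would have to pass through the coarse bottleneck.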

Another thought is that U-Nets are empirically well-suited to image segmentation problems. Even though pixels are classified individually, you want the segmentation output to form consistent regions over objects in the image. So, in a sense, you turn a very noisy segmentation mask (the original image input) into a mask with much less noise (the segmented output).

I think the latter is debatable, I'd be happy to hear your thoughts.

Chillston

They do not.

Even "the" seminal paper does not (it was put on arXiv before "U-Net" was a thing :).

Warpig

Generally, U-Nets are the default choice for functions whose inputs and outputs have the same size/shape (as in image segmentation), since they avoid a hard information bottleneck. That makes them a natural choice for diffusion models, where the network must predict the residual noise at the same resolution as its input.
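The same-shape requirement this answer mentions falls directly out of the standard forward process: the noisy input and the noise target are tensors of identical shape, so the denoiser must be a same-size map. A small NumPy sketch (toy shapes and a made-up schedule value, not any particular paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal((1, 3, 8, 8))   # clean image batch (toy shape)
eps = rng.standard_normal(x0.shape)      # noise the network must predict
alpha_bar = 0.7                          # cumulative noise-schedule term (illustrative)

# Forward (noising) process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

# The denoiser takes x_t and must output a tensor of eps's exact shape,
# which is why same-size architectures like U-Net are a natural fit.
assert x_t.shape == x0.shape == eps.shape
```

Any architecture that maps an image to a same-shaped output would satisfy this constraint; the U-Net is just the one that does so while still aggregating multi-scale context.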

akanai