The original paper for style transfer only supports one image as an input. However, I am interested in whether there is a model that uses multiple images, preferably in a way that doesn't require users to provide their own image files: the model would be trained on images of a single style and then imitate that style without needing the original image files at inference time. Can this be achieved with variants of the algorithms from that paper, or should other neural network types be used (e.g. GANs, diffusion models, ...)?
1 Answer
For text-to-image diffusion models there are several popular customization techniques, such as Textual Inversion, DreamBooth, and Custom Diffusion.
Textual Inversion performs few-shot fine-tuning of a text embedding $V^*$ in the model's text encoder on a small set of input images (usually 3 to 5). After fine-tuning, style transfer is performed by including the learned embedding $V^*$ in the prompt, for example:
Painting of X in the style of $V^*$ art
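As a concrete illustration, here is a minimal inference sketch using the Hugging Face diffusers library; the base model, embedding path, and `<my-style>` token are placeholders you would replace with your own trained embedding:

```python
# Minimal sketch: generating images with a learned Textual Inversion embedding
# via Hugging Face diffusers. Model name, embedding path and token below are
# placeholders / assumptions for illustration.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load the learned embedding V* (trained on 3-5 style images) and bind it to a token.
pipe.load_textual_inversion("path/to/learned_embeds.bin", token="<my-style>")

# Use the learned token in the prompt, as described above.
image = pipe("Painting of a cat in the style of <my-style> art").images[0]
image.save("styled_cat.png")
```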
DreamBooth uses a similar few-shot fine-tuning strategy, but it optimizes the whole model instead, obtaining better-quality samples at the cost of memory-intensive fine-tuning and a heavy checkpoint (since it includes the entire model).
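Because DreamBooth produces a full fine-tuned checkpoint, inference simply loads that checkpoint like any regular pipeline. A minimal sketch with diffusers, where the checkpoint path and the `sks` identifier token are assumptions for illustration:

```python
# Minimal sketch: inference with a DreamBooth-finetuned model using diffusers.
# "path/to/dreambooth-model" and the "sks" identifier token are placeholders.
import torch
from diffusers import StableDiffusionPipeline

# The whole model was fine-tuned, so the checkpoint is loaded as a full pipeline.
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/dreambooth-model", torch_dtype=torch.float16
).to("cuda")

image = pipe("Painting of a cat in the style of sks art").images[0]
image.save("dreambooth_styled_cat.png")
```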
Custom Diffusion fine-tunes only a small subset of attention layers, reducing fine-tuning time and cost compared to DreamBooth, and allows composing multiple learned concepts, a setting where previous approaches struggled.
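For completeness, a sketch of loading Custom Diffusion weights into a base pipeline with diffusers; the output directory, weight file names, and the `<new1>` modifier token follow the diffusers Custom Diffusion training example and are assumptions here:

```python
# Minimal sketch: loading Custom Diffusion weights (a small set of fine-tuned
# cross-attention layers plus a new token embedding) into a base pipeline.
# Paths, file names and the <new1> token are assumptions for illustration.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

# Load the fine-tuned attention weights (much smaller than a full checkpoint).
pipe.unet.load_attn_procs(
    "path/to/custom-diffusion-output",
    weight_name="pytorch_custom_diffusion_weights.bin",
)
# Load the modifier token embedding learned during fine-tuning.
pipe.load_textual_inversion("path/to/custom-diffusion-output", weight_name="<new1>.bin")

image = pipe("Painting of a cat in the style of <new1> art").images[0]
image.save("custom_diffusion_styled_cat.png")
```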