
I have been learning about Style Transfer recently. Style is defined as

The correlation of activations between channels.

I can't seem to understand why that would be true. Intuitively, style seems to be the patterns that exist within one particular channel/image rather than the patterns between channels. Since the filters of a CNN have different weights and therefore respond to different features across channels, why would we even expect two channels to be correlated? And further, why would we expect style to be conveyed by that correlation?

I expected a style function that would compare the activations within a single channel of some layer of a CNN, so that an algorithm could search for which activations occur simultaneously and hold style information.

I understand how we carry out the operations with the matrix and define the loss function; what I don't get is why we assume that style information lies in the correlations between channels in the first place.

nbro

1 Answer


When the original Neural Style Transfer paper was published, your question remained unanswered for a while: it was not entirely clear why the Gram matrix should represent artistic style. A satisfactory (in my opinion) explanation came with the paper "Demystifying Neural Style Transfer" (arXiv:1701.01036).

The basic idea is that you cannot directly compare the activations of two images: the spatial positions of the various features differ between the source and target image, so you should somehow get rid of the positional information and instead compare the distributions of the activations across the whole image. The goal of the style transfer task is thus to make the two distributions as close as possible. One possible measure of distance between two distributions $P$ and $Q$ is the Maximum Mean Discrepancy (MMD):

$$\text{MMD}^2(P,Q) = \left\Vert \mathbb{E}_{P}[\phi(X)] - \mathbb{E}_{Q}[\phi(Y)]\right\Vert^2$$
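As a toy numerical illustration (random NumPy arrays standing in for CNN activations, and $\phi$ taken to be the identity map), the empirical $\text{MMD}^2$ is just the squared distance between the mean feature vectors of the two samples:

```python
import numpy as np

rng = np.random.default_rng(0)
# Rows are spatial positions, columns are channels of a feature map.
X = rng.normal(loc=0.0, size=(100, 3))  # "activations" of image 1
Y = rng.normal(loc=0.5, size=(100, 3))  # image 2, drawn from a shifted distribution

# Empirical MMD^2 with phi = identity:
# squared distance between the mean activation vectors.
mmd2 = np.sum((X.mean(axis=0) - Y.mean(axis=0)) ** 2)
```

Here `mmd2` comes out positive because the two samples have different means; for two samples from the same distribution it would shrink toward zero as the sample size grows.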

Here $\phi(\cdot)$ is a feature map; in our case it maps an image to the NN activations in a particular layer. The next step is to apply the "kernel trick" to the MMD, expressing it through a kernel $k(x,y)$:

$$\text{MMD}^2(P,Q) = \mathbb{E}[k(x_i,x_j)] + \mathbb{E}[k(y_i,y_j)] - 2 \mathbb{E}[k(x_i,y_j)]$$

The Gram matrix of the original style transfer corresponds to the squared dot-product kernel $k(x,y) = (x^Ty)^2$: $$\text{MMD}^2(P,Q) = \mathbb{E}[(x^T_ix_j)^2] + \mathbb{E}[(y^T_iy_j)^2] - 2 \mathbb{E}[(x^T_iy_j)^2] = \left\Vert G^x - G^y\right\Vert_F^2 $$

where $G^x$ and $G^y$ are the Gram matrices of the activations and $\Vert\cdot\Vert_F$ is the Frobenius norm. The last equality tends to give trouble to those not used to the kernel trick, so I'll expand it (using the Einstein summation convention and dropping the expectation normalization):

$$\begin{array}{l} \mathbb{E}[(x^T_ix_j)^2] + \mathbb{E}[(y^T_iy_j)^2] - 2 \mathbb{E}[(x^T_iy_j)^2] = \\ = x_{ik}x_{jk}x_{im}x_{jm} + y_{ik}y_{jk}y_{im}y_{jm} - 2 x_{ik}x_{jk}y_{im}y_{jm} \\ = (x_{ik}x_{im} - y_{ik}y_{im})(x_{jk}x_{jm} - y_{jk}y_{jm}) \\ = (G^x_{km} - G^y_{km})(G^x_{km} - G^y_{km}) \\ = \left\Vert G^x - G^y\right\Vert_F^2 \end{array} $$
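You can also check this identity numerically. Below is a small NumPy sketch (random arrays in place of real activations; the Gram matrices are averaged over spatial positions so that they match the expectations above):

```python
import numpy as np

rng = np.random.default_rng(0)
C = 4                          # number of channels
Nx, Ny = 50, 60                # spatial positions (may differ between images)
X = rng.normal(size=(Nx, C))   # activations of the source image
Y = rng.normal(size=(Ny, C))   # activations of the target image

# Kernelized MMD^2 with the squared dot-product kernel k(x, y) = (x^T y)^2.
mmd2 = ((X @ X.T) ** 2).mean() + ((Y @ Y.T) ** 2).mean() \
     - 2 * ((X @ Y.T) ** 2).mean()

# Gram matrices, with the expectation taken as the mean over positions.
Gx = X.T @ X / Nx
Gy = Y.T @ Y / Ny
frob2 = np.sum((Gx - Gy) ** 2)  # squared Frobenius norm

assert np.isclose(mmd2, frob2)  # the two quantities agree up to float error
```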

The authors of arXiv:1701.01036 also tried different kernel functions $k(x,y)$, obtaining more interesting style transfer results.
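To sketch what swapping the kernel looks like in code (the `mmd2` helper below is hypothetical, and for simplicity it only covers kernels that are elementwise functions of the dot product):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))   # stand-ins for activations of two images
Y = rng.normal(size=(40, 3))

def mmd2(X, Y, k):
    """Empirical MMD^2 for a kernel k applied elementwise to dot-product matrices."""
    return k(X @ X.T).mean() + k(Y @ Y.T).mean() - 2 * k(X @ Y.T).mean()

linear = mmd2(X, Y, lambda D: D)       # k(x,y) = x^T y
gram   = mmd2(X, Y, lambda D: D ** 2)  # k(x,y) = (x^T y)^2, the Gram-matrix loss
```

Each choice of kernel gives a different notion of "style distance" between the two sets of activations; the squared dot-product kernel is the one that recovers the original Gram-matrix loss.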

Kostya