I'm trying to implement a variational auto-encoder (as seen in Section 3.1 here: https://arxiv.org/pdf/2004.06271.pdf).

It differs from a traditional VAE because it encodes its input images to three-dimensional latent feature maps. In other words, the latent feature maps have a width, height and channel dimension rather than just a channel dimension like a traditional VAE.

When calculating the Kullback–Leibler divergence as part of the loss function, I need the mean and covariance output by the encoder. However, since the latent feature maps are three-dimensional, the encoder's outputs are three-dimensional too, and each latent feature is therefore a 2D matrix.

How can I derive a mean and covariance from a 2D matrix to calculate the KL divergence?

magmacollaris

1 Answer

Your three-dimensional latent representation consists of two feature maps, one of mean pixels and one of covariance pixels, as shown in Fig. 3 of the paper. Together they represent a Gaussian distribution with a mean and a variance for each pixel in the latent representation; each pixel value is a random variable.

Now, have a close look at the KL loss in Eq. 3 and its corresponding description in the paper:

$$\mathcal{L}_{KL} = \frac{1}{2 \times (\frac{W}{16} \times \frac{H}{16}) } \sum^M_{m = 1}[\mu^2_m + \sigma^2_m - \log(\sigma^2_m) - 1]$$

Finally, $M$ is the dimensionality of the latent features $\theta \in \mathbb{R}^M$ with mean $\mu = [\mu_1,...,\mu_M]$ and covariance matrix $\Sigma = \text{diag}(\sigma_1^2,...,\sigma_M^2)$, [...].
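For reference, Eq. 3 is, up to the $\frac{1}{W/16 \times H/16}$ averaging factor over latent pixels, the standard closed form of the KL divergence between a diagonal Gaussian $\mathcal{N}(\mu, \Sigma)$ and the standard normal prior $\mathcal{N}(0, I_M)$:

$$D_{KL}\big(\mathcal{N}(\mu, \Sigma) \,\|\, \mathcal{N}(0, I_M)\big) = \frac{1}{2}\Big[\operatorname{tr}(\Sigma) + \mu^\top \mu - M - \log \det \Sigma\Big] = \frac{1}{2}\sum_{m=1}^{M}\big[\mu_m^2 + \sigma_m^2 - \log(\sigma_m^2) - 1\big]$$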

The covariance matrix is diagonal, so all pixel values are independent of each other. That is why the KL divergence has the nice analytical form given by Eq. 3. You can therefore treat your 2D random matrix simply as a random vector of size $M = \frac{W}{16} \times \frac{H}{16}$ ($\times 3$ if you want to include the color dimension). The third dimension (the RGB channel) can be considered independent as well, so it can likewise be flattened to a vector and appended. This is indeed what the paper does, as indicated by the second half of the sentence quoted above (see the sketch after the quote):

that are reparameterized via sampling from a standard multivariate Gaussian $\epsilon \sim \mathcal{N}(0,I_M)$, i.e. $\theta = \mu + \Sigma^{\frac{1}{2}}\epsilon$.
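Putting it together, here is a minimal PyTorch sketch of the flattening, the KL term of Eq. 3, and the reparameterization step. The names (`mu_map`, `logvar_map`, `theta`) and the convention of predicting the log-variance rather than $\sigma^2$ directly are my own assumptions; the paper does not prescribe an implementation.

```python
import torch

# Hypothetical shapes: assume the encoder emits a mean map and a
# log-variance map, each of shape (B, C, H/16, W/16). The names
# `mu_map` / `logvar_map` are my own; the paper fixes no API.
B, C, h, w = 8, 1, 16, 16
mu_map     = torch.randn(B, C, h, w)   # stand-in for the encoder's mean output
logvar_map = torch.randn(B, C, h, w)   # stand-in for its log-variance output

# Flatten spatial (and channel) dimensions into one vector per sample,
# so each latent is theta in R^M with M = C * (H/16) * (W/16).
mu     = mu_map.flatten(start_dim=1)      # (B, M)
logvar = logvar_map.flatten(start_dim=1)  # (B, M)
M = mu.shape[1]

# Eq. 3: KL divergence to N(0, I_M) for a diagonal Gaussian, averaged
# over the latent pixels; with C == 1 the 1 / (2 * M) factor matches the
# paper's 1 / (2 * (W/16 * H/16)) exactly.
kl = (mu.pow(2) + logvar.exp() - logvar - 1).sum(dim=1) / (2 * M)
kl_loss = kl.mean()  # average over the batch

# Reparameterization: theta = mu + Sigma^(1/2) * eps with eps ~ N(0, I_M),
# where Sigma^(1/2) = diag(sigma) and sigma = exp(logvar / 2).
eps   = torch.randn_like(mu)
theta = mu + (0.5 * logvar).exp() * eps

# Reshape back into a feature map if the decoder expects one.
theta_map = theta.view(B, C, h, w)
```

Predicting the log-variance instead of $\sigma^2$ keeps the variance positive without any constraint on the network output; if your encoder emits $\sigma^2$ directly, replace `logvar.exp()` with it and take its log in the loss.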

Tinu