
I am looking at Appendix C of the VAE paper.

It says:

C.1 Bernoulli MLP as decoder

In this case let $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z})$ be a multivariate Bernoulli whose probabilities are computed from $\mathbf{z}$ with a fully-connected neural network with a single hidden layer: $$ \begin{aligned} \log p(\mathbf{x} \mid \mathbf{z}) &=\sum_{i=1}^{D} x_{i} \log y_{i}+\left(1-x_{i}\right) \cdot \log \left(1-y_{i}\right) \\ \text { where } \mathbf{y} &=f_{\sigma}\left(\mathbf{W}_{2} \tanh \left(\mathbf{W}_{1} \mathbf{z}+\mathbf{b}_{1}\right)+\mathbf{b}_{2}\right) \end{aligned} $$ where $f_{\sigma}(\cdot)$ is the elementwise sigmoid activation function, and where $\boldsymbol{\theta}=\left\{\mathbf{W}_{1}, \mathbf{W}_{2}, \mathbf{b}_{1}, \mathbf{b}_{2}\right\}$ are the weights and biases of the MLP.

C.2 Gaussian MLP as encoder or decoder

In this case let encoder or decoder be a multivariate Gaussian with a diagonal covariance structure: $$ \begin{aligned} \log p(\mathbf{x} \mid \mathbf{z}) &=\log \mathcal{N}\left(\mathbf{x} ; \boldsymbol{\mu}, \boldsymbol{\sigma}^{2} \mathbf{I}\right) \\ \text { where } \boldsymbol{\mu} &=\mathbf{W}_{4} \mathbf{h}+\mathbf{b}_{4} \\ \log \boldsymbol{\sigma}^{2} &=\mathbf{W}_{5} \mathbf{h}+\mathbf{b}_{5} \\ \mathbf{h} &=\tanh \left(\mathbf{W}_{3} \mathbf{z}+\mathbf{b}_{3}\right) \end{aligned} $$ where $\left\{\mathbf{W}_{3}, \mathbf{W}_{4}, \mathbf{W}_{5}, \mathbf{b}_{3}, \mathbf{b}_{4}, \mathbf{b}_{5}\right\}$ are the weights and biases of the MLP and part of $\boldsymbol{\theta}$ when used as decoder. Note that when this network is used as an encoder $q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})$, then $\mathbf{z}$ and $\mathbf{x}$ are swapped, and the weights and biases are variational parameters $\boldsymbol{\phi}$.

So it seems that, for a Bernoulli decoder, the network only outputs a vector $\mathbf{y}$, which gets plugged into the log-likelihood formula. The Gaussian decoder, on the other hand, outputs both $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$. So is it like two parallel layers, one calculating $\boldsymbol{\mu}$ and one calculating $\boldsymbol{\sigma}$?

Similar to how we get the $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ of the encoder (which I am assuming are different from the decoder's)?

And do we then plug these into the log-likelihood formula I derived in this link to get the reconstruction loss?

This is the intuition I am getting, but I haven't seen it all stated explicitly in one place.

a12345

1 Answer

Yes, in the case of the Gaussian MLP, you have two distinct output layers (so two distinct sets of weights and biases), one for the mean and the other for the variance, as the equations tell us.

The mean is calculated with the weights $\mathbf{W}_{4}$ and bias $\mathbf{b}_{4}$ from $\mathbf{h}$ as follows

$$\boldsymbol{\mu} =\mathbf{W}_{4} \mathbf{h}+\mathbf{b}_{4},$$

while the variance (more precisely, the log of the variance, from which the standard deviation can be recovered as $\boldsymbol{\sigma}=\exp\left(\tfrac{1}{2} \log \boldsymbol{\sigma}^{2}\right)$) is calculated with the weights $\mathbf{W}_{5}$ and bias $\mathbf{b}_{5}$ from $\mathbf{h}$ as follows

$$\log \boldsymbol{\sigma}^{2} =\mathbf{W}_{5} \mathbf{h}+\mathbf{b}_{5}.$$
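
To make this concrete, here is a minimal PyTorch sketch of the Gaussian MLP of Appendix C.2 with its two parallel heads. The class name, argument names and dimensions are mine, not from the paper or from any particular library.

```python
import torch
import torch.nn as nn

class GaussianMLP(nn.Module):
    """Gaussian MLP of Appendix C.2: h = tanh(W3 inp + b3), mu = W4 h + b4, log sigma^2 = W5 h + b5."""
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.hidden = nn.Linear(in_dim, hidden_dim)    # W_3, b_3
        self.mu = nn.Linear(hidden_dim, out_dim)       # W_4, b_4 (mean head)
        self.logvar = nn.Linear(hidden_dim, out_dim)   # W_5, b_5 (log-variance head)

    def forward(self, inp):
        h = torch.tanh(self.hidden(inp))
        # two distinct linear layers applied in parallel to the same hidden vector h
        return self.mu(h), self.logvar(h)
```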

Here you have a PyTorch implementation that uses two distinct linear/dense layers to do this, but note that it does so only for the encoder, in order to produce the latent vector $\mathbf{z}$.
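
The snippet below is not the code from that implementation, just a sketch of how a module like the `GaussianMLP` above could be used as the encoder $q_{\boldsymbol{\phi}}(\mathbf{z} \mid \mathbf{x})$, together with the reparameterization $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ from the paper to produce the latent vector; the dimensions are purely illustrative.

```python
# Sketch only: the GaussianMLP above, reused as the encoder q_phi(z | x).
# The dimensions (784, 400, 20) are illustrative, not taken from the question or the linked code.
encoder = GaussianMLP(in_dim=784, hidden_dim=400, out_dim=20)

x = torch.rand(64, 784)          # a batch of 64 inputs
mu, logvar = encoder(x)          # the two distinct linear heads produce mu and log sigma^2 of z
std = torch.exp(0.5 * logvar)    # sigma = exp(log(sigma^2) / 2)
eps = torch.randn_like(std)      # epsilon ~ N(0, I)
z = mu + eps * std               # reparameterized sample of the latent vector z
```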

Yes, in general, these layers (so the mean and the variance layers) are not the same for the encoder and the decoder. However, it would not surprise me if someone has already tried to share some layers between the encoder and decoder for some specific task.
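
Regarding the last part of the question (plugging $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ into the log-likelihood to get the reconstruction loss): below is a sketch of the Gaussian reconstruction term $\log \mathcal{N}\left(\mathbf{x} ; \boldsymbol{\mu}, \boldsymbol{\sigma}^{2} \mathbf{I}\right)$, assuming the `GaussianMLP` module and the `x` and `z` tensors from the previous sketches.

```python
# Sketch: a decoder p_theta(x | z) with the same architecture, and the diagonal-Gaussian
# reconstruction log-likelihood log N(x; mu, sigma^2 I), summed over the data dimensions.
decoder = GaussianMLP(in_dim=20, hidden_dim=400, out_dim=784)

mu_x, logvar_x = decoder(z)             # the decoder's two heads, given the latent batch z above
sigma_x = torch.exp(0.5 * logvar_x)
log_px = torch.distributions.Normal(mu_x, sigma_x).log_prob(x).sum(dim=-1)
recon_loss = -log_px.mean()             # negative log-likelihood, averaged over the batch
```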

nbro