In VAEs, we try to maximize the ELBO $= \mathbb{E}_q [\log p(x \mid z)] - D_{KL}(q(z \mid x) \,\|\, p(z))$, but I see that many implementations use the MSE between the image and its reconstruction for the first term. Here's a paper (section 5) that seems to do that: "Don't Blame the ELBO! A Linear VAE Perspective on Posterior Collapse" (2019) by James Lucas et al. Is this mathematically sound?
2 Answers
If $p(x \mid z) = \mathcal{N}(x; f(z), I)$, then
\begin{align} \log p(x \mid z) &= \log\left[ (2\pi)^{-d/2} \exp\left( -\tfrac{1}{2}\|x - f(z)\|^2 \right) \right] \\ &= -\tfrac{1}{2}\|x - f(z)\|^2 + \text{const} \\ &= -\tfrac{1}{2}\|x - \hat{x}\|^2 + \text{const}, \end{align}
where $\hat{x}$, the reconstructed image, is just the distribution mean $f(z)$. Maximizing this term is therefore the same as minimizing the squared error between $x$ and $\hat{x}$.
It also makes sense to use the distribution mean at inference time (not just during training), since for a Gaussian the mean is also the mode, i.e. the point with the highest density. So the decoder produces a distribution, and we take its mean as the result.
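To make this concrete, here is a minimal PyTorch-style sketch of the resulting training objective (the negative ELBO), assuming a standard-normal prior and a diagonal-Gaussian encoder; the function and argument names (`vae_loss`, `mu`, `logvar`) are illustrative, not taken from the question or any particular codebase:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    """Negative ELBO with p(x|z) = N(x_hat, I) and q(z|x) = N(mu, diag(exp(logvar)))."""
    # Reconstruction term: -log p(x|z) = 0.5 * ||x - x_hat||^2 + const,
    # i.e. a sum-reduced squared error, as derived above.
    recon = 0.5 * F.mse_loss(x_hat, x, reduction='sum')
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal-Gaussian encoder.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Minimizing this sum maximizes the ELBO.
    return recon + kl
```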
On page 5 of the VAE paper (Auto-Encoding Variational Bayes, Kingma & Welling, 2013), it is clearly stated:
We let $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z})$ be a multivariate Gaussian (in case of real-valued data) or Bernoulli (in case of binary data) whose distribution parameters are computed from $\mathbf{z}$ with a MLP (a fully-connected neural network with a single hidden layer, see appendix $\mathrm{C}$ ).
...
As explained above and in appendix $\mathrm{C}$, the decoding term $\log p_{\boldsymbol{\theta}}\left(\mathbf{x}^{(i)} \mid \mathbf{z}^{(i, l)}\right)$ is a Bernoulli or Gaussian MLP, depending on the type of data we are modelling.
So, if you are trying to predict real numbers (in the case of images, these can be the RGB values in the range $[0, 1]$), then you can assume $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z})$ is a Gaussian.
It turns out that maximising the Gaussian likelihood is equivalent to minimising the MSE between the decoder's prediction and the real image. You can show this easily: replace $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z})$ with the Gaussian pdf, take the logarithm, and maximise it with respect to the parameters; what is left is (a scaled) negative MSE. G. Hinton shows this in this video lesson. See also this related answer.
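To spell this out (assuming, for illustration, a factorised Gaussian $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z}) = \mathcal{N}(\mathbf{x}; \hat{\mathbf{x}}, \sigma^2 I)$ with a fixed $\sigma$):
\begin{align} \log p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z}) &= \sum_{i=1}^{d} \log \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_i - \hat{x}_i)^2}{2\sigma^2} \right) \\ &= -\frac{1}{2\sigma^2} \sum_{i=1}^{d} (x_i - \hat{x}_i)^2 - \frac{d}{2} \log(2\pi\sigma^2), \end{align}
so maximising the log-likelihood with respect to the decoder parameters is exactly minimising the sum of squared errors; $\sigma^2$ only sets the weight of this term relative to the KL term.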
So, yes, minimising the MSE is theoretically justified, provided that you're trying to predict real-valued data.
When the binary cross-entropy is used instead of the MSE (e.g. here), the assumption is that you're maximising a Bernoulli likelihood rather than a Gaussian one - this can also be shown easily.
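Concretely, with $p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z}) = \prod_{i=1}^{d} \hat{x}_i^{x_i} (1 - \hat{x}_i)^{1 - x_i}$, where $\hat{x}_i \in (0, 1)$ is the decoder's output for pixel $i$,
$$\log p_{\boldsymbol{\theta}}(\mathbf{x} \mid \mathbf{z}) = \sum_{i=1}^{d} \left[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \right],$$
which is exactly the negative of the binary cross-entropy.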