
In the context of variational autoencoders (VAEs), even after the model is trained, people seem to shy away from trying to compute $$ p_{\theta}(x) = \int p_{\theta}(x|z)p(z)\,dz $$

I understand that this can pose significant problems, especially when $z$ is high-dimensional (though I have even seen people unwilling to compute this quantity when $z$ is two-dimensional). So when people need to do things like model comparison via log-likelihood ratios, they instead use upper and lower bounds (ELBO, EUBO, CUBO, etc.). What confuses me is why these are easier to calculate. Take, for instance, the ELBO; even the simplified form from the original paper looks like

$$ \mathrm{ELBO}(\theta, \phi) = \mathbb{E}_{q_{\phi}(\cdot|x)}\left[(x - D_{\theta}(z))^2\right] + \dots $$

which means it involves an integral under the conditional distribution $q_{\phi}(z|x)$. In fact, people generally compute the ELBO numerically via Monte Carlo integration with this conditional distribution, sampling $z = \mu_{\phi}(x) + \epsilon \sigma_{\phi}(x)$. But this is also how you would numerically compute $p_{\theta}(x)$: one can write it as the expectation $\mathbb{E}_{p(z)}[p_{\theta}(x|z)]$, and I cannot think of any reason why the ELBO's integrand should be simpler than this one. In fact, the ELBO's integrand involves evaluating the encoder part of the network to generate each sample, whereas for $p_{\theta}(x)$ you just sample from the normal distribution $p(z)$.
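
For concreteness, here is a minimal NumPy sketch of the two estimators I have in mind; the linear "decoder" `W` and the encoder outputs `mu`, `sigma` are just placeholders, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z = 5, 2                        # toy dimensions
W = rng.standard_normal((d_z, d_x))    # placeholder for a trained decoder D_theta

def log_p_x_given_z(x, z):
    """log N(x | D(z), I) with the toy linear 'decoder' D(z) = z @ W."""
    mean = z @ W
    return -0.5 * np.sum((x - mean) ** 2, axis=-1) - 0.5 * d_x * np.log(2 * np.pi)

x = rng.standard_normal(d_x)

# (a) Monte Carlo estimate of p(x) = E_{p(z)}[p(x|z)] with z ~ N(0, I)
z_prior = rng.standard_normal((10_000, d_z))
p_x_hat = np.mean(np.exp(log_p_x_given_z(x, z_prior)))

# (b) single-sample ELBO with z ~ q(z|x) = N(mu, diag(sigma^2));
#     mu and sigma stand in for encoder outputs
mu, sigma = np.zeros(d_z), np.ones(d_z)
eps = rng.standard_normal(d_z)
z_q = mu + sigma * eps                                  # reparameterization
kl = 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))
elbo_hat = log_p_x_given_z(x, z_q) - kl

print(p_x_hat, elbo_hat)
```

Both are just sample averages, only over different distributions of $z$, which is what makes me wonder why the first one is considered intractable.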

1- How is the Monte Carlo integration required to compute the ELBO simpler than the one required for $p_{\theta}(x)$?

2- Given that $z$ is generally a lower-dimensional representation, is it still intractable to compute $p_{\theta}(x)$ via Monte Carlo sampling when, say, $z$ is 2-dimensional?

I have seen some other topics on this, but none seems to have received an answer that explains why the expectation is easier to calculate:

Why is the variational lower bound easier to compute than the original marginal distribution?

How does using the ELBO in VAEs make the problem tractable?

Sina

1 Answer


The intractability of the evidence $p_{\theta}(x)$ isn't solely a matter of dimensionality. The main difficulty is that $p_{\theta}(x|z)$ is a complex, highly nonlinear function given by a DNN decoder that can vary sharply over $z$: for a given $x$ it is vanishingly small over almost all of the prior's mass, so a Monte Carlo (MC) estimate based on samples from $p(z)$ has very high variance even when $z$ is 2-dimensional, and an accurate estimate still requires an enormous number of samples. In addition, samples drawn from the fixed prior $p(z)$ have no dependence on the model parameters, so the sampling itself cannot be adapted during training to target the regions of latent space that actually explain $x$; gradient estimates built on such fixed, untargeted sampling therefore come with very high variance, even though the decoder remains differentiable w.r.t. $\theta$ for any fixed $z$.
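
To see the variance problem concretely, here is a toy 2-D illustration (a sharp linear map stands in for the decoder, so this is only a sketch of the effect, not a real VAE): almost every prior sample lands where $p_{\theta}(x|z)$ is numerically zero, so the integrand fluctuates wildly, whereas a proposal concentrated where the decoder explains $x$, which is the role $q_{\phi}(z|x)$ plays, gives a nearly constant integrand.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sharp conditional: x | z ~ N(10 z, 0.1^2 I) with a 2-D latent.
# (A stand-in for a decoder that varies quickly with z; not a real DNN.)
def log_p_x_given_z(x, z):
    return -0.5 * np.sum((x - 10.0 * z) ** 2, axis=-1) / 0.1**2 - np.log(2 * np.pi * 0.1**2)

z_star = np.array([1.5, -0.7])                     # the latent that actually generated x
x = 10.0 * z_star + 0.1 * rng.standard_normal(2)

# Prior sampling: almost every z ~ N(0, I) gives p(x|z) that underflows to zero,
# so the MC average is dominated by a handful of lucky samples -> huge variance.
z_prior = rng.standard_normal((200_000, 2))
w_prior = np.exp(log_p_x_given_z(x, z_prior))
print("fraction of prior samples with non-negligible p(x|z):",
      np.mean(w_prior > 1e-6 * w_prior.max()))
print("relative std of the integrand under the prior:", w_prior.std() / w_prior.mean())

# A proposal concentrated near z_star hits the high-likelihood region every time,
# so the integrand barely fluctuates from sample to sample.
z_q = z_star + 0.01 * rng.standard_normal((200_000, 2))
w_q = np.exp(log_p_x_given_z(x, z_q))
print("relative std of the integrand under the concentrated proposal:",
      w_q.std() / w_q.mean())
```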

In contrast, the variational posterior $q_{\phi}(z|x)$ in a VAE is usually chosen to be a simple diagonal Gaussian, so sampling from it is straightforward and evaluating its density is cheap. Because $q_{\phi}(z|x)$ concentrates on the latent codes that actually explain $x$, the reconstruction-error integrand in the ELBO has low variance, which is why the expectation in your ELBO formulation is tractable even in high-dimensional latent spaces: we only need to sample from $q_{\phi}(z|x)$, and the reparameterization trick reduces the variance further. Finally, the resulting ELBO estimator can be differentiated w.r.t. both the encoder and the decoder parameters, thanks to the reparameterization trick and the closed-form expression for the KL-divergence term of the ELBO.
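
Here is a minimal PyTorch sketch of that estimator (the architecture is made up for illustration): a single reparameterized sample from $q_{\phi}(z|x)$ plus the closed-form Gaussian KL gives an ELBO estimate whose gradients flow to both $\phi$ and $\theta$.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, d_x=784, d_z=2, d_h=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_x, d_h), nn.ReLU())
        self.enc_mu = nn.Linear(d_h, d_z)
        self.enc_logvar = nn.Linear(d_h, d_z)
        self.dec = nn.Sequential(nn.Linear(d_z, d_h), nn.ReLU(), nn.Linear(d_h, d_x))

    def elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        std = torch.exp(0.5 * logvar)
        z = mu + std * torch.randn_like(std)          # reparameterized sample from q(z|x)
        recon = self.dec(z)                           # decoder mean D_theta(z)
        # log N(x | D(z), I) up to an additive constant
        log_px_z = -0.5 * ((x - recon) ** 2).sum(dim=-1)
        # closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(dim=-1)
        return (log_px_z - kl).mean()                 # average over the batch

# One gradient step: the same estimator trains both phi (encoder) and theta (decoder).
model = TinyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(32, 784)                              # placeholder batch
loss = -model.elbo(x)                                 # maximize ELBO = minimize -ELBO
loss.backward()
opt.step()
```

With a single sample per data point this already gives a usable (if noisy) training signal, which is exactly what you cannot get from samples of the fixed prior.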

cinch