
I was reading an article on Medium where the author cites this equation for Variational Inference: \begin{align*} \text{KL}(q(z|x^{(i)})||p(z|x^{(i)})) &= \int_z q(z|x^{(i)})\log\frac{q(z|x^{(i)})}{p(z|x^{(i)})}\, dz \\ &= \mathbb{E}_{q}[\log q(z|x^{(i)})] - \mathbb{E}_{q}[\log p(z|x^{(i)})]\\ &= \mathbb{E}_{q}[\log q(z|x^{(i)})] - \mathbb{E}_{q}[\log p(x^{(i)}, z)] + \mathbb{E}_q[\log p(x^{(i)})]\\ &= \mathbb{E}_{q}[\log q(z|x^{(i)})] - \mathbb{E}_{q}[\log p(x^{(i)}, z)] + \log p(x^{(i)})\\ &= -\text{ELBO} + \log p(x^{(i)})\\ \end{align*}
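Rearranged, this says the log-evidence splits into the ELBO plus the KL gap:
$$ \log p(x^{(i)}) = \text{ELBO} + \text{KL}(q(z|x^{(i)})\,||\,p(z|x^{(i)})) \geq \text{ELBO}. $$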

I understand all of the math behind this equation, but I was wondering: what is the underlying intuition behind each of the terms in this equation (the KL divergence, the ELBO, and $\log p(x)$)?

The author claims that $\log p(x)$ is a constant in this equation, and I'm having a hard time understanding why. Is $p(x)$ considered to be the theoretical data-generating distribution which created our $x$'s, and not the model that we are training?

vl_knd

2 Answers


Variational methods are designed for situations where we need to avoid intractable integrals such as the evidence $p_{\theta}(x)$ and the posterior $p_{\theta}(z|x)$. They do so by transforming the Bayesian inference problem into an optimization problem over the model parameters $\theta$ and an additional set of parameters $\phi$ for the (amortized) variational posterior $q_{\phi}(z|x)$.

In your reference, the marginal evidence is evaluated at the $i$-th example of the training set (note your reference writes $p(x^{(i)})$ before simplifying the notation): it is obtained by integrating the joint $p_{\theta}(x^{(i)}|z)\,p(z)$ over all possible values of the latent variable $z$ that could have induced that example. Since the example is fixed, this is a (possibly very small) constant with respect to the variational distribution $q_{\phi}(z|x)$ and its parameters $\phi$, which is why the author can treat $\log p(x^{(i)})$ as a constant in that derivation. As for your confusion: the marginal likelihood $p(x)$ itself, the evidence, is essentially an (infinite) mixture of Gaussians, as explained in the references of the VAE Wikipedia article.
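To see the constancy concretely, here is a small self-contained toy example of mine (not from the linked article), assuming prior $z \sim \mathcal{N}(0,1)$ and likelihood $x|z \sim \mathcal{N}(z,1)$, so that the evidence $p(x) = \mathcal{N}(x; 0, 2)$ and the posterior $p(z|x) = \mathcal{N}(x/2, 1/2)$ are available in closed form. For any Gaussian $q$ the identity $\text{ELBO}(q) + \text{KL}(q\,\|\,p(z|x)) = \log p(x)$ holds, i.e. $\log p(x)$ does not move when $q$ does:

```python
# Toy check: log p(x) = ELBO(q) + KL(q || p(z|x)) for *any* q.
# Model (my assumption): prior z ~ N(0,1), likelihood x|z ~ N(z,1),
# so the exact evidence is p(x) = N(x; 0, 2) and p(z|x) = N(x/2, 1/2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3                                        # a fixed "training example" x^{(i)}
log_px = norm(0.0, np.sqrt(2.0)).logpdf(x)     # exact log evidence

def elbo_and_kl(mu_q, sigma_q, n_samples=200_000):
    """Monte Carlo ELBO and exact KL(q || p(z|x)) for q(z) = N(mu_q, sigma_q^2)."""
    z = rng.normal(mu_q, sigma_q, size=n_samples)
    log_joint = norm(0.0, 1.0).logpdf(z) + norm(z, 1.0).logpdf(x)  # log p(x, z)
    log_q = norm(mu_q, sigma_q).logpdf(z)
    elbo = np.mean(log_joint - log_q)
    # True posterior p(z|x) = N(x/2, 1/2); Gaussian-to-Gaussian KL in closed form.
    mu_p, var_p = x / 2.0, 0.5
    kl = (np.log(np.sqrt(var_p) / sigma_q)
          + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * var_p) - 0.5)
    return elbo, kl

for mu_q, sigma_q in [(0.0, 1.0), (0.65, 0.7), (2.0, 0.3)]:
    elbo, kl = elbo_and_kl(mu_q, sigma_q)
    # ELBO + KL matches log p(x) up to Monte Carlo error, for every q.
    print(f"q=N({mu_q}, {sigma_q}^2): ELBO={elbo:.4f}  KL={kl:.4f}  "
          f"ELBO+KL={elbo + kl:.4f}  log p(x)={log_px:.4f}")
```

Changing $q$ trades mass between the ELBO and the KL term, but their sum, $\log p(x)$, stays put; that is the whole reason maximizing the ELBO over $q$ is equivalent to minimizing the KL to the true posterior.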

To further avoid the intractable $p(x)$ when maximizing the ELBO, your reference arrives at the same convenient form of the ELBO as given on Wikipedia:

$${L_{\theta ,\phi }(x):=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln {\frac {p_{\theta }(x,z)}{q_{\phi }({z|x})}}\right]=\ln p_{\theta }(x)-D_{KL}(q_{\phi }({\cdot |x})\parallel p_{\theta }({\cdot |x}))}$$ The form given is not very convenient for maximization, but the following, equivalent form, is: $${L_{\theta ,\phi }(x)=\mathbb {E} _{z\sim q_{\phi }(\cdot |x)}\left[\ln p_{\theta }(x|z)\right]-D_{KL}(q_{\phi }({\cdot |x})\parallel p_{\theta }(\cdot ))}$$ where ${\ln p_{\theta }(x|z)}$ is implemented as ${-{\frac {1}{2}}\|x-D_{\theta }(z)\|_{2}^{2}}$, since that is, up to an additive constant, what ${x\sim {\mathcal {N}}(D_{\theta }(z),I)}$ yields. That is, we model the distribution of $x$ conditional on $z$ to be a Gaussian distribution centered on ${D_{\theta }(z)}$.
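To make that quoted form concrete, here is a rough sketch assuming a diagonal-Gaussian $q_{\phi}(z|x)$, a standard-normal prior, and the Gaussian decoder $\mathcal{N}(D_{\theta}(z), I)$ above; the functions `E_phi` and `D_theta` below are arbitrary placeholders of mine standing in for the actual neural networks:

```python
# Sketch of the per-example negative ELBO in the quoted form:
#   -L(x) = 0.5 * ||x - D_theta(z)||^2  +  KL(N(mu, diag(sigma^2)) || N(0, I)),
# with one reparameterized sample z = mu + sigma * eps.
import numpy as np

rng = np.random.default_rng(0)

def E_phi(x, latent_dim=2):
    """Placeholder encoder: returns (mu, log_var) of q_phi(z|x).
    A real VAE computes these with a neural network."""
    return np.tanh(x[:latent_dim]), -np.abs(x[:latent_dim])   # arbitrary stand-ins

def D_theta(z, data_dim=4):
    """Placeholder decoder: mean of p_theta(x|z) = N(D_theta(z), I)."""
    return np.repeat(z, data_dim // len(z))                   # arbitrary stand-in

def negative_elbo(x):
    mu, log_var = E_phi(x)
    sigma = np.exp(0.5 * log_var)
    z = mu + sigma * rng.standard_normal(mu.shape)            # reparameterization trick
    recon = 0.5 * np.sum((x - D_theta(z)) ** 2)               # -log p_theta(x|z), up to a constant
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)  # KL(q || N(0,I)), closed form
    return recon + kl

x = rng.standard_normal(4)
print(negative_elbo(x))
```

In an actual VAE the two placeholder functions are neural networks, and this quantity is minimized over both $\theta$ and $\phi$ with stochastic gradients; note that $\ln p_{\theta}(x)$ never needs to be evaluated.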

cinch

Let me try to give you a less mathematical, more intuitive explanation.

You assume a latent space $\mathcal{Z}$ and you approximate the true distribution $p(x)$ as: $$ p_\theta(x) = \sum_\mathcal{Z} p_\theta(x|z) p_Z(z) = \mathbb{E}_{z \sim p_Z} p_\theta(x|z) $$ Here you use a neural net with parameters $\theta$ to model $p(x|z)$.
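For instance, in a toy setting of mine where the prior is $p_Z = \mathcal{N}(0,1)$ and a stand-in "decoder" gives $p_\theta(x|z) = \mathcal{N}(x; z, 1)$, the marginal is exactly this expectation over the prior and can be estimated by Monte Carlo:

```python
# Monte Carlo estimate of p_theta(x) = E_{z ~ p_Z}[ p_theta(x|z) ] in a toy model
# (my assumption): prior p_Z = N(0,1), "decoder" p_theta(x|z) = N(x; z, 1).
# The exact marginal is N(x; 0, 2), so we can compare.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3
z = rng.standard_normal(100_000)            # samples z ~ p_Z
p_x_mc = np.mean(norm(z, 1.0).pdf(x))       # average of p_theta(x|z) over prior samples
p_x_exact = norm(0.0, np.sqrt(2.0)).pdf(x)  # closed-form marginal
print(f"Monte Carlo: {p_x_mc:.4f}   exact: {p_x_exact:.4f}")
```

With a neural-network decoder this integral has no closed form, and naive sampling from the prior becomes very inefficient, which is exactly the problem discussed next.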

However, there is a problem with this setting. In order to optimize the model we want to sample a $z$ that is a good match for the given data point $x$, but if we sample $z$ from the prior there is very little chance of that happening. If there really is an underlying latent space behind your data, then the best distribution to sample $z$ from is the posterior $p(z|x)$. So what you do is approximate this posterior with a second neural network, $q_\phi(z|x)$.

You want to maximize the log-probability of the training data by optimizing $\theta$:

\begin{aligned} \log p_\theta(x) &= \log \mathbb{E}_{z \sim q_\phi(\cdot|x)} \bigg[ p_\theta(x|z) \frac{p_Z(z)}{q_\phi(z|x)} \bigg] \\ & \geq \mathbb{E}_{z \sim q_\phi(\cdot|x)} \log \bigg[ p_\theta(x|z) \frac{p_Z(z)}{q_\phi(z|x)} \bigg] \\ & = \mathbb{E}_{z \sim q_\phi(\cdot|x)} \big[ \log p_\theta(x|z) \big] - KL(q_\phi(\cdot|x) \; || \; p_Z) \end{aligned}

The inequality is Jensen's inequality, and the right-hand side is exactly the ELBO.

And you also want to minimize the KL divergence between $q_\phi(z|x)$ and $p_\theta(z|x)$ by optimizing $\phi$:

\begin{aligned} KL(q_\phi(z|x) \; || \; p_\theta(z|x) ) &= \mathbb{E}_{z \sim q_\phi(\cdot|x)} \log \frac{q_\phi(z|x)}{p_\theta(z|x)} \\ &= \mathbb{E}_{z \sim q_\phi(\cdot|x)} \Big[ \log q_\phi(z|x) - \log p_Z(z) - \log p_\theta(x|z) \Big] + \log p_\theta(x) \end{aligned}

Note that the last term, $\log p_\theta(x)$, is constant with respect to $\phi$, so minimizing this KL divergence leaves us with the same lower bound as before, this time optimized over $\phi$.

Finally, the objective is:

$$ \max_{\theta, \phi} \; \mathbb{E}_{z \sim q_\phi(\cdot|x)} \big[ \log p_\theta(x|z) \big] - KL(q_\phi(\cdot|x) \; || \;p_Z) $$
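As a small sanity check of this objective (a toy example of mine, under the same kind of assumptions as above: prior $p_Z = \mathcal{N}(0,1)$, a fixed stand-in decoder $p_\theta(x|z) = \mathcal{N}(x; z, 1)$, and Gaussian $q_\phi = \mathcal{N}(m, s^2)$), maximizing it over $(m, s)$ by brute-force grid search recovers the true posterior $p(z|x) = \mathcal{N}(x/2, 1/2)$, which is exactly what minimizing the KL to the posterior should do:

```python
# Grid search over the variational parameters (m, s) of q = N(m, s^2)
# for the conjugate toy model z ~ N(0,1), x|z ~ N(z,1) with the decoder held fixed.
import numpy as np

x = 1.3
m_grid = np.linspace(-2.0, 2.0, 401)       # candidate means of q
s_grid = np.linspace(0.05, 2.0, 400)       # candidate std devs of q
M, S = np.meshgrid(m_grid, s_grid, indexing="ij")

# Closed-form ELBO for this toy model: E_q[log p(x|z)] + E_q[log p_Z(z)] + H(q)
elbo = (-0.5 * ((x - M) ** 2 + S**2)              # E_q[log N(x; z, 1)] + 0.5*log(2*pi)
        - 0.5 * (M**2 + S**2)                     # E_q[log N(z; 0, 1)] + 0.5*log(2*pi)
        + 0.5 * np.log(2 * np.pi * np.e * S**2)   # entropy of q = N(m, s^2)
        - np.log(2 * np.pi))                      # put the dropped constants back

i, j = np.unravel_index(np.argmax(elbo), elbo.shape)
print(f"best q: m={M[i, j]:.3f}, s={S[i, j]:.3f}; "
      f"true posterior: m={x/2:.3f}, s={np.sqrt(0.5):.3f}")
```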

pi-tau