
In the reparameterization trick of a Variational Autoencoder (VAE), instead of sampling the latent $z$ directly from $z \sim \mathcal{N}(\mu, \sigma^2)$, we can use a different method: $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0,1)$. I'm having trouble understanding why these two methods are equivalent.

cinch

3 Answers


I'll attempt a less formal explanation.

The distribution $\mathcal{N}(\mu, \sigma^2)$ is a normal distribution with mean $\mu$ and variance $\sigma^2$, i.e., standard deviation $\sigma$. When we sample from this distribution, we obtain a value $z$.

Now, consider the expression $z = \mu + \sigma \odot \epsilon$. Here, $\epsilon$ is a random value sampled from a standard normal distribution, which is a normal distribution with mean $0$ and standard deviation $1$. By multiplying $\epsilon$ by $\sigma$ and adding $\mu$, we effectively scale and shift the standard normal distribution to match the desired mean $\mu$ and standard deviation $\sigma$.

In essence, this expression generates values $z$ from a normal distribution with the specified mean and standard deviation, $\mathcal{N}(\mu, \sigma^2)$. By incorporating $\mu$ and $\sigma$ into the sampling process this way, we ensure that the resulting distribution has the desired characteristics.
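Here is a minimal NumPy sketch of this equivalence (the parameter values and sample count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5   # example mean and standard deviation (arbitrary)
n = 100_000

# Method 1: sample z directly from N(mu, sigma^2)
z_direct = rng.normal(mu, sigma, size=n)

# Method 2: reparameterize -- sample eps ~ N(0, 1), then scale and shift
eps = rng.normal(0.0, 1.0, size=n)
z_reparam = mu + sigma * eps

# Empirical means and standard deviations agree up to sampling noise
print(z_direct.mean(), z_reparam.mean())  # both ~ 2.0
print(z_direct.std(), z_reparam.std())    # both ~ 0.5
```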

Robin van Hoorn

I'll give my 2 cents.

The point is that a normal distribution can be shifted and rescaled:

if $X$ is a Gaussian random variable with mean $\mu$ and variance $\sigma^2$, then $X$ can be rescaled and shifted via the formula $Z = (X - \mu)/\sigma$ to convert it into a standard normal random variable.

Now, if you invert that equation to write $X$ as a function of $Z$, you get $X = \mu + \sigma Z$, which is exactly what you are looking for.

The formal proof can be found in many other answers; a sketch follows below.
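For completeness, here is the standard change-of-variables computation: if $\epsilon \sim \mathcal{N}(0,1)$ and $z = \mu + \sigma \epsilon$ with $\sigma > 0$, then

$$p_z(z) = p_\epsilon\!\left(\frac{z - \mu}{\sigma}\right) \left|\frac{d\epsilon}{dz}\right| = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(z - \mu)^2}{2\sigma^2}\right),$$

which is exactly the density of $\mathcal{N}(\mu, \sigma^2)$.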

Alberto

From your unanswered comment it seems you are still unclear about the root cause and mechanism of the VAE's reparameterization trick, so it may help to clarify the confusion with a formal look at the referenced ELBO surrogate objective, right before the reparameterization is applied:

A variational autoencoder is a generative model with a prior and noise distribution respectively... For example, a standard VAE task such as ImageNet is typically assumed to have Gaussian-distributed noise; however, tasks such as binarized MNIST require Bernoulli noise.

Maximizing the ELBO... The form given is not very convenient for maximization, but the following equivalent form is: $$L_{\theta,\phi}(x) = \mathbb{E}_{z \sim q_\phi(\cdot|x)}\left[\ln p_\theta(x|z)\right] - D_{KL}\big(q_\phi(\cdot|x) \parallel p_\theta(\cdot)\big)$$ ...The distributions $q_\phi(z|x)$ and $p_\theta(z)$ are often also chosen to be Gaussians, as $z|x \sim \mathcal{N}(E_\phi(x), \sigma_\phi(x)^2 I)$ and $z \sim \mathcal{N}(0, I)$.

Therefore the prior distribution of $z$ does not depend on any input data $x$. Its variational posterior $q_\phi(z|x)$ is the noise distribution mentioned above, and it is optimized to be close to the prior, as shown in the equivalent ELBO form above (derived via the chain rule), which is the form convenient for maximization. On the other hand, in the original ELBO form $\ln p_\theta(x) - D_{KL}(q_\phi(\cdot|x) \parallel p_\theta(\cdot|x))$ mentioned in my reference, the KL-divergence term contains only posterior distributions of $z$ conditioned on the input data $x$; thus $q_\phi(z|x)$ obviously can no longer simply be $\mathcal{N}(0, I)$, but is usually just assumed to be a variance-scaled and mean-shifted Gaussian, $\mathcal{N}(\mu, \sigma^2)$.
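For the Gaussian choices above, the KL term in the ELBO even has a well-known closed form (written here per latent dimension), which is what VAE implementations typically optimize alongside the reconstruction term:

$$D_{KL}\big(\mathcal{N}(\mu, \sigma^2) \parallel \mathcal{N}(0, 1)\big) = \frac{1}{2}\left(\mu^2 + \sigma^2 - \ln \sigma^2 - 1\right).$$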

In summary, your reparameterized $\epsilon \sim \mathcal{N}(0,1)$ is exactly the same as the unconditioned $z$, and the conditioned $z|x \sim \mathcal{N}(\mu, \sigma^2)$ is simply assumed to be a scaled and shifted version of it, without any deeper theoretical necessity.

Finally, of course, with basic probability knowledge of the Gaussian distribution, given the random-variable transformation from $\epsilon$ to $z$, you can simply substitute into the known probability density function (PDF) of $\epsilon$ and verify that $z$ has the desired distribution. But this pure math fact has nothing to do with AI as such and cannot by itself give the VAE-related equivalence insight above; one can always ask why we do not choose some other reparameterization, say $z = \mu^2 + (\sigma/2)\,\epsilon$, where $\epsilon$ follows some distribution other than $\mathcal{N}(0,1)$.
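As an aside, the practical payoff of the scale-and-shift form is that it lets gradients flow through the sampling step. Here is a minimal PyTorch sketch (the encoder outputs are stand-in example values, not from any real model):

```python
import torch

# Stand-in encoder outputs for one input x (arbitrary example values)
mu = torch.tensor([0.5, -1.0], requires_grad=True)
log_var = torch.tensor([0.0, 0.2], requires_grad=True)

# Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I).
# The randomness lives entirely in eps, so z is a differentiable
# function of mu and log_var.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * log_var) * eps

# Any downstream loss now backpropagates into mu and log_var
loss = (z ** 2).sum()
loss.backward()
print(mu.grad, log_var.grad)  # well-defined gradients through the sample
```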

cinch