Questions tagged [kl-divergence]

For questions related to the Kullback–Leibler (KL) divergence, a measure of how one probability distribution (density or mass function) diverges from another. It is not a metric, only a premetric: it is not symmetric and does not satisfy the triangle inequality. The KL divergence is widely used in machine learning, e.g. in the objective of variational autoencoders (VAEs).

30 questions
22
votes
1 answer

Why has the cross-entropy become the standard classification loss function and not the Kullback-Leibler divergence?

The cross-entropy is identical to the KL divergence plus the entropy of the target distribution. The KL divergence equals zero when the two distributions are the same, which seems more intuitive to me than the entropy of the target distribution,…
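
A quick numeric check of the identity in this excerpt, using two made-up discrete distributions:

```python
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])      # target distribution (arbitrary example)
q = np.array([0.25, 0.25, 0.25, 0.25])  # predicted distribution (arbitrary example)

entropy_p = -np.sum(p * np.log(p))      # H(p)
cross_entropy = -np.sum(p * np.log(q))  # H(p, q)
kl_pq = np.sum(p * np.log(p / q))       # D_KL(p || q)

# Cross-entropy = KL divergence + entropy of the target distribution,
# so minimizing either over q gives the same optimum.
assert np.isclose(cross_entropy, kl_pq + entropy_p)
```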
8
votes
2 answers

Why is KL divergence used so often in Machine Learning?

The KL divergence is quite easy to compute in closed form for simple distributions (such as Gaussians) but has some not-very-nice properties. For example, it is not symmetric (thus it is not a metric) and it does not respect the triangle…
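
For reference, the closed form alluded to here for two univariate Gaussians is $D_{KL}(\mathcal{N}(\mu_1,\sigma_1^2)\,\|\,\mathcal{N}(\mu_2,\sigma_2^2))=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac{1}{2}$; a minimal sketch (the parameter values are arbitrary) that also shows the asymmetry:

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form KL(N(mu1, s1^2) || N(mu2, s2^2)) for univariate Gaussians."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

# Swapping the arguments gives a different value: KL is not symmetric.
print(kl_gauss(0.0, 1.0, 1.0, 2.0))  # KL(p || q)
print(kl_gauss(1.0, 2.0, 0.0, 1.0))  # KL(q || p)
```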
7
votes
2 answers

How is this PyTorch expression equivalent to the KL divergence?

I found the following PyTorch code (from this link) -0.5 * torch.sum(1 + sigma - mu.pow(2) - sigma.exp()) where mu is the mean parameter that comes out of the model and sigma is the sigma parameter out of the encoder. This expression is apparently…
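
For context, that expression is the closed-form KL between the encoder's diagonal Gaussian and a standard normal prior, with the variable named sigma holding the log-variance $\log\sigma^2$ rather than $\sigma$ itself. A minimal sketch checking the snippet against a direct evaluation of $D_{KL}(\mathcal{N}(\mu,\sigma^2)\,\|\,\mathcal{N}(0,1))=\frac{1}{2}(\mu^2+\sigma^2-\log\sigma^2-1)$ (tensor values are arbitrary):

```python
import torch

mu = torch.tensor([0.3, -0.7])       # arbitrary example means
log_var = torch.tensor([0.1, -0.2])  # arbitrary log-variances (the snippet's "sigma")

# Expression from the question, with sigma read as log(sigma^2)
kl_snippet = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

# Direct evaluation of the closed form, summed over latent dimensions
sigma2 = log_var.exp()
kl_direct = torch.sum(0.5 * (mu.pow(2) + sigma2 - torch.log(sigma2) - 1))

assert torch.isclose(kl_snippet, kl_direct)
```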
6
votes
1 answer

What is the impact of scaling the KL divergence and reconstruction loss in the VAE objective function?

Variational autoencoders have two components in their loss function. The first component is the reconstruction loss, which, for image data, is the pixel-wise difference between the input image and the output image. The second component is the…
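
One common way this trade-off is exposed is a β-weighted objective, as in β-VAE; a minimal sketch of such a loss, where recon_x, x, mu and log_var are hypothetical tensors from a VAE forward pass:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon_x, x, mu, log_var, beta=1.0):
    """Reconstruction term plus a beta-scaled KL term (beta > 1 emphasizes the KL)."""
    recon = F.mse_loss(recon_x, x, reduction="sum")  # pixel-wise reconstruction loss
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl
```

Scaling one term against the other changes which solutions the optimizer favors: a large β pushes the approximate posterior toward the prior (typically blurrier reconstructions), while a small β does the opposite.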
6
votes
1 answer

Why is the evidence equal to the KL divergence plus the loss?

Why is the equation $$\log p_{\theta}(x^1,\dots,x^N)=D_{KL}(q_{\theta}(z|x^i)\,\|\,p_{\phi}(z|x^i))+\mathbb{L}(\phi,\theta;x^i)$$ true, where $x^i$ are data points and $z$ are latent variables? I was reading the original variational autoencoder paper and I…
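
The identity follows by inserting $q$ into an expectation and splitting the log-ratio; a sketch of the standard derivation for a single data point (subscripts dropped for brevity):

```latex
\begin{align}
\log p(x) &= \mathbb{E}_{q(z|x)}\left[\log p(x)\right]
           = \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{p(z|x)}\right] \\
          &= \mathbb{E}_{q(z|x)}\left[\log \frac{p(x,z)}{q(z|x)}\right]
           + \mathbb{E}_{q(z|x)}\left[\log \frac{q(z|x)}{p(z|x)}\right] \\
          &= \underbrace{\mathbb{L}(\phi,\theta;x)}_{\text{ELBO}}
           + \underbrace{D_{KL}\big(q(z|x)\,\|\,p(z|x)\big)}_{\geq\, 0}
\end{align}
```

Since the KL term is non-negative, the ELBO lower-bounds the log-evidence, which is why maximizing it is the VAE training objective.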
6
votes
2 answers

What are the advantages of the Kullback-Leibler divergence over the MSE/RMSE?

I've recently encountered several articles recommending the KL divergence instead of the MSE/RMSE as the loss function when trying to learn a probability distribution, but none of them gives a clear reason why…
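
One commonly cited reason is that KL compares distributions via probability ratios rather than coordinate-wise differences, so it punishes assigning near-zero mass where the target has substantial mass far more harshly than MSE does; a small numeric illustration with made-up distributions:

```python
import numpy as np

p  = np.array([0.5, 0.5, 0.0])      # target distribution (illustrative)
q1 = np.array([0.4, 0.4, 0.2])      # leaks a little mass to a dead outcome
q2 = np.array([0.5, 0.001, 0.499])  # nearly starves an outcome the target needs

def mse(a, b):
    return np.mean((a - b) ** 2)

def kl(a, b, eps=1e-12):
    return np.sum(a * np.log((a + eps) / (b + eps)))

# MSE grows moderately from q1 to q2, while KL grows much faster because
# p's second outcome is assigned almost no probability under q2.
print(mse(p, q1), mse(p, q2))  # ~0.020 vs ~0.166
print(kl(p, q1),  kl(p, q2))   # ~0.223 vs ~3.107
```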
5
votes
1 answer

Why is the Jensen-Shannon divergence preferred over the KL divergence in measuring the performance of a generative network?

I have read articles on how the Jensen-Shannon divergence is preferred over the Kullback-Leibler divergence for measuring how well a distribution mapping is learned in a generative network, because the JS divergence better measures distribution similarity…
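
For reference, the JS divergence is a symmetrized, smoothed construction on top of KL: $D_{JS}(p\,\|\,q)=\frac{1}{2}D_{KL}(p\,\|\,m)+\frac{1}{2}D_{KL}(q\,\|\,m)$ with the mixture $m=\frac{1}{2}(p+q)$. A minimal sketch for discrete distributions, using disjoint supports where KL would be infinite:

```python
import numpy as np

def kl(p, q):
    mask = p > 0  # outcomes with p(x) = 0 contribute nothing to the sum
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    m = 0.5 * (p + q)  # the mixture has support wherever p or q does
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([1.0, 0.0])  # disjoint supports: KL(p || q) is infinite,
q = np.array([0.0, 1.0])  # but JS stays finite and bounded by log 2
print(js(p, q), np.log(2))
```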
4
votes
2 answers

When should one prefer using the total variation divergence over the KL divergence in RL?

In RL, both the KL divergence ($D_{KL}$) and the total variation divergence ($D_{TV}$) are used to measure the distance between two policies. I'm most familiar with using $D_{KL}$ as an early-stopping metric during policy updates to ensure the new policy doesn't…
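
A minimal sketch of both quantities for two discrete action distributions at a single state (the probability vectors are made-up); Pinsker's inequality, $D_{TV}\leq\sqrt{D_{KL}/2}$, is one reason the two are often used interchangeably as trust-region measures:

```python
import numpy as np

pi_old = np.array([0.6, 0.3, 0.1])  # action probabilities under the old policy
pi_new = np.array([0.5, 0.4, 0.1])  # action probabilities under the new policy

d_kl = np.sum(pi_old * np.log(pi_old / pi_new))  # D_KL(pi_old || pi_new)
d_tv = 0.5 * np.sum(np.abs(pi_old - pi_new))     # total variation distance

# Pinsker's inequality: TV is controlled by KL, so a KL-based early stop
# also bounds the total variation between the policies.
assert d_tv <= np.sqrt(d_kl / 2)
print(d_kl, d_tv)
```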
4
votes
1 answer

What is the reason for mode collapse in GANs as opposed to WGANs?

In this article I am reading: $D_{KL}$ gives us infinity when two distributions are disjoint. The value of $D_{JS}$ has a sudden jump and is not differentiable at $\theta=0$. Only the Wasserstein metric provides a smooth measure, which is super helpful for a…
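
The excerpt refers to the textbook parallel-lines example from the WGAN literature: two distributions supported at $x=0$ and $x=\theta$. For them, $D_{KL}$ is infinite whenever $\theta\neq 0$, $D_{JS}$ jumps between $0$ and $\log 2$, and only the Wasserstein distance $|\theta|$ varies smoothly. A minimal sketch of those three values (this reproduces the standard example, not the article's code):

```python
import numpy as np

def divergences(theta):
    """KL, JS, and Wasserstein between unit masses at x=0 and x=theta."""
    if theta == 0:
        return 0.0, 0.0, 0.0               # identical distributions
    return np.inf, np.log(2), abs(theta)   # disjoint supports otherwise

for theta in (-1.0, -0.1, 0.0, 0.1, 1.0):
    kl, js, w = divergences(theta)
    print(f"theta={theta:+.1f}  KL={kl}  JS={js:.3f}  W={w:.1f}")

# Only the Wasserstein column changes smoothly with theta, which is why it
# still provides useful gradients when the supports do not overlap.
```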
3
votes
1 answer

How do you calculate KL divergence on a three-dimensional space for a Variational Autoencoder?

I'm trying to implement a variational auto-encoder (as seen in Section 3.1 here: https://arxiv.org/pdf/2004.06271.pdf). It differs from a traditional VAE because it encodes its input images to three-dimensional latent feature maps. In other words,…
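
Assuming the encoder outputs per-element Gaussian parameters, the standard closed-form KL term is simply summed over every element of the feature map, whatever its shape; a minimal PyTorch sketch (the shapes below are hypothetical):

```python
import torch

# Hypothetical 3-D latents: a batch of 8 feature maps of shape (channels, H, W)
mu = torch.randn(8, 16, 4, 4)
log_var = torch.randn(8, 16, 4, 4)

# Elementwise KL to a standard normal prior, summed over the latent axes
kl_per_sample = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(),
                                 dim=(1, 2, 3))
kl_loss = kl_per_sample.mean()  # average over the batch
print(kl_per_sample.shape, kl_loss.item())
```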
3
votes
1 answer

Are there some notions of distance between two policies?

I want to determine some distance between two policies $\pi_1 (a \mid s)$ and $\pi_2 (a \mid s)$, i.e. something like $\vert \vert \pi_1 (a \mid s) - \pi_2(a \mid s) \vert \vert$, where $\pi_i (a\mid s)$ is the vector $(\pi_i (a_1 \mid s), \dots,…
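
Common answers instantiate exactly this idea per state and then aggregate; a minimal sketch of two such notions for tabular policies (the probability tables are made-up examples):

```python
import numpy as np

# Hypothetical tabular policies: rows are states, columns are action probabilities
pi1 = np.array([[0.7, 0.3],
                [0.5, 0.5]])
pi2 = np.array([[0.6, 0.4],
                [0.1, 0.9]])

tv_per_state = 0.5 * np.sum(np.abs(pi1 - pi2), axis=1)  # total variation per state
kl_per_state = np.sum(pi1 * np.log(pi1 / pi2), axis=1)  # KL(pi1 || pi2) per state

print(tv_per_state.max())   # worst case over states, as used in trust-region results
print(kl_per_state.mean())  # average KL, e.g. under a uniform state distribution
```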
3
votes
1 answer

Why does the KL divergence not satisfy the triangle inequality?

The KL divergence is defined as $$D_{KL}(p\,\|\,q)=\sum_i p(x_i)\log\left(\frac{p(x_i)}{q(x_i)}\right)$$ Why does $D_{KL}$ not satisfy the triangle inequality? Also, can't you make it satisfy the triangle inequality by taking the absolute value of the…
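
A concrete counterexample with three Bernoulli distributions makes the failure easy to verify (the parameters below are arbitrary); note also that KL is already non-negative, so taking absolute values changes nothing:

```python
import numpy as np

def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

p, q, r = 0.5, 0.2, 0.05  # arbitrary Bernoulli parameters

direct = kl_bernoulli(p, r)                      # ~0.830
via_q = kl_bernoulli(p, q) + kl_bernoulli(q, r)  # ~0.363

# The triangle inequality would require direct <= via_q; here it fails.
assert direct > via_q
```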
2
votes
1 answer

How to use the KL divergence in practice

I have been studying the KL divergence from Bishop's PRML book, and I feel like I have grasped the intuition behind it. However, I don't understand how one could use it when approximating an unknown distribution $p(x)$. As the book goes in paragraph…
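
In Bishop's setting, minimizing $D_{KL}(p\,\|\,q_\theta)$ using samples from the unknown $p$ reduces to maximum likelihood, because the $\mathbb{E}_p[\log p]$ term does not depend on $\theta$; a minimal sketch fitting a Gaussian this way (the data is synthetic):

```python
import numpy as np

# Samples standing in for draws from the unknown distribution p(x)
x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=10_000)

# argmin_theta KL(p || q_theta) = argmax_theta E_p[log q_theta(x)], estimated by
# the sample average; for a Gaussian q the maximizer is available in closed form.
mu_hat, sigma_hat = x.mean(), x.std()
print(mu_hat, sigma_hat)  # close to the true (2.0, 1.5)
```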
2
votes
1 answer

Deriving ELBO for Diffusion Models

I am trying to read through the proof of ELBO for diffusion models on pg. 8 of this paper. However, I do not see how the author arrived at Eqn (45) from Eqn (44). Specifically, I do not know how they simplified the equation by rewriting it in terms…
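
Without reproducing the paper's exact equations, the usual move at that point in diffusion ELBO derivations (and a guess at the step in question) is to recognize each expected log-ratio as a KL divergence, e.g.:

```latex
\mathbb{E}_{q(x_{t-1} \mid x_t, x_0)}\!\left[
  \log \frac{q(x_{t-1} \mid x_t, x_0)}{p_\theta(x_{t-1} \mid x_t)}
\right]
= D_{KL}\!\big( q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t) \big)
```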
2
votes
1 answer

How is this statement from a TensorFlow implementation of a certain KL-divergence formula related to the corresponding formula?

I am trying to understand a certain KL-divergence formula (which can be found on page 6 of the paper Evidential Deep Learning to Quantify Classification Uncertainty) and found a TensorFlow implementation for it. I understand most parts of the…
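
For context, the formula on that page is the KL divergence between a Dirichlet $\mathrm{Dir}(\alpha)$ and the uniform Dirichlet $\mathrm{Dir}(\mathbf{1})$; a NumPy/SciPy sketch of the general closed form, which may help map the TensorFlow statement back to the math (the alpha values are arbitrary):

```python
import numpy as np
from scipy.special import gammaln, digamma

def kl_dirichlet(alpha, beta):
    """Closed-form KL( Dir(alpha) || Dir(beta) )."""
    a0, b0 = alpha.sum(), beta.sum()
    return (gammaln(a0) - gammaln(alpha).sum()
            - gammaln(b0) + gammaln(beta).sum()
            + np.sum((alpha - beta) * (digamma(alpha) - digamma(a0))))

alpha = np.array([2.0, 3.0, 1.5])  # arbitrary concentration parameters
uniform = np.ones_like(alpha)      # Dir(1, ..., 1), the uniform prior in the paper
print(kl_dirichlet(alpha, uniform))
```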