I have been studying the KL divergence from Bishop's PRML book, and I feel like I have grasped the intuition behind it. However, I don't understand how one could use it when approximating an unknown distribution $p(x)$. As the book says in Section 1.6.1:

Suppose we have observed a finite set of training points $x_n$ for $n=1,...,N$, drawn from $p(x)$. Then the expectation with respect to $p(x)$ can be approximated by a finite sum over these points, using $$\mathop{\mathbb{E}}[f] \approx \frac{1}{N}\sum_{n=1}^N f(x_n)$$ so that $$KL(p||q) \approx \frac{1}{N}\sum_{n=1}^N \left\{ -\ln q(x_n|\theta) + \ln p(x_n) \right\}$$
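
For concreteness, here is how I understand the finite-sum approximation of an expectation (a minimal sketch with an arbitrary choice of $p$ and $f$, not from the book):

```python
import numpy as np

# p(x) is taken to be a standard normal and f(x) = x^2, so the exact
# value is E[f] = Var(x) + E[x]^2 = 1 (both choices are illustrative).
rng = np.random.default_rng(0)

N = 10_000
x = rng.standard_normal(N)  # x_n ~ p(x), n = 1, ..., N

approx = np.mean(x ** 2)    # (1/N) * sum_n f(x_n)
print(approx)               # ~1.0, approaching E[f] = 1 as N grows
```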

My doubt is about how to use this in practice. In particular, if $p(x)$ is an unknown distribution, from which we can only obtain data points, how can we evaluate $p(x_n)$? The only approach I can think of is approximating it with the empirical distribution, but I am uncertain about how to do this for continuous distributions.

Thank you

pettepiero

1 Answer


The goal of minimizing the KL divergence here is to fit the model via the parameter $\theta$, not to approximate the unknown distribution $p(x)$ directly. Notice that in your quoted expression the second term $\ln p(x_n)$ does not depend on $\theta$ at all, so it is a constant with respect to the optimization and can simply be dropped; the finite sum itself already substitutes the empirical distribution for the true data distribution $p(x)$, which is why $p(x_n)$ never needs to be evaluated. What remains is the average negative log-likelihood $-\frac{1}{N}\sum_{n=1}^N \ln q(x_n|\theta)$, so minimizing the approximate KL divergence is exactly maximum likelihood estimation, which is consistent under the usual regularity conditions. Given enough data sampled independently from $p(x)$, and a model that is correctly specified (or at least flexible enough), the fitted distribution $q(x|\hat\theta)$ will converge to $p(x)$ in probability.
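
As a concrete illustration of that equivalence, here is a minimal sketch (my own example, assuming a Gaussian model $q(x|\theta)$ with $\theta = (\mu, \ln\sigma)$ and a Gaussian true $p$, so the model is correctly specified). Minimizing the average negative log-likelihood with a generic optimizer recovers the same estimates as the closed-form maximum likelihood solution:

```python
import numpy as np
from scipy import optimize, stats

# The ln p(x_n) term is constant in theta, so minimizing the approximate
# KL divergence reduces to minimizing the average negative log-likelihood
# -(1/N) * sum_n ln q(x_n | theta).
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=5_000)  # x_n ~ p(x)

def neg_log_likelihood(theta):
    mu, log_sigma = theta
    return -np.mean(stats.norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = optimize.minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# The optimizer's answer matches the closed-form Gaussian MLE
# (sample mean and standard deviation), illustrating the equivalence.
print(mu_hat, sigma_hat)
print(data.mean(), data.std())
```

Note that the sketch never evaluates $p(x_n)$: the samples themselves stand in for $p$, which is exactly the empirical-distribution substitution described above.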

cinch