I have been studying the KL divergence in Bishop's PRML book, and I feel I have grasped the intuition behind it. However, I don't understand how one could use it when approximating an unknown distribution $p(x)$. As the book says in Section 1.6.1:
Suppose we have observed a finite set of training points $x_n$ for $n=1,...,N$, drawn from $p(x)$. Then the expectation with respect to $p(x)$ can be approximated by a finite sum over these points, using $$\mathop{\mathbb{E}}[f] \approx \frac{1}{N}\sum_{n=1}^N f(x_n)$$ so that $$KL(p||q) \approx \frac{1}{N}\sum_{n=1}^N \left\{ -\ln q(x_n|\theta) + \ln p(x_n) \right\}$$
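To make sure I understand the formula itself, here is a minimal sketch of the finite-sum approximation in a toy case where $p$ happens to be known, so that both log densities can actually be evaluated (the two Gaussians, $N$, and all parameter values are my own illustrative choices, not from the book):

```python
# Minimal sketch: Monte Carlo estimate of KL(p||q) in a toy case where p is
# known, so ln p(x_n) can actually be evaluated. All choices are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 100_000

p = norm(loc=0.0, scale=1.0)   # "true" distribution p(x)
q = norm(loc=0.5, scale=1.2)   # model q(x | theta)

x = p.rvs(size=N, random_state=rng)   # samples x_n ~ p(x)

# KL(p||q) ≈ (1/N) sum_n [ ln p(x_n) - ln q(x_n) ]
kl_mc = np.mean(p.logpdf(x) - q.logpdf(x))

# Closed-form KL between two Gaussians, as a sanity check
kl_exact = np.log(1.2 / 1.0) + (1.0**2 + (0.0 - 0.5)**2) / (2 * 1.2**2) - 0.5

print(kl_mc, kl_exact)   # the two numbers should agree closely
```

This only works because I can evaluate $\ln p(x_n)$ directly; my doubt below is about the realistic case where $p$ is unknown.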
My question is how to use this in practice. In particular, if $p(x)$ is an unknown distribution from which we can only obtain data points, how can we evaluate $p(x_n)$? The only approach I can think of is approximating it with the empirical distribution, but I am uncertain about how to do this for continuous distributions.
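To make that idea concrete, this is the kind of thing I have in mind for the continuous case: replace the unknown $p(x_n)$ with a kernel density estimate fitted to the same samples (the data, the candidate $q$, and the default KDE bandwidth below are all illustrative, and I don't know whether this is a sound approach):

```python
# Sketch of the empirical-distribution idea for the continuous case: stand in
# a Gaussian KDE for the unknown p(x_n). All specific choices are illustrative.
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=2_000)   # observed points from the unknown p

p_hat = gaussian_kde(x)          # continuous density estimate standing in for p
q = norm(loc=0.5, scale=1.2)     # candidate model q(x | theta)

# KL(p||q) ≈ (1/N) sum_n [ ln p_hat(x_n) - ln q(x_n) ]
kl_est = np.mean(np.log(p_hat(x)) - q.logpdf(x))
print(kl_est)
```

Is something along these lines legitimate, or is there a standard way to handle the $\ln p(x_n)$ term?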
Thank you