
In RL, both the KL divergence ($D_{KL}$) and the total variation distance ($D_{TV}$) are used to measure the distance between two policies. I'm most familiar with using $D_{KL}$ as an early-stopping metric during policy updates, to ensure the new policy doesn't deviate much from the old policy.

I've mostly seen $D_{TV}$ used in papers on safe RL, where safety constraints are placed on action distributions, such as in Constrained Policy Optimization and the Lyapunov approach to safe RL.

I've also seen that they are related by this formula:

$$ D_{TV} = \sqrt{0.5 D_{KL}} $$

When you compute the $D_{KL}$ between two policies, what does that tell you about them, and how is it different from what a $D_{TV}$ between the same two policies tells you?

Based on that, are there specific situations in which one should be preferred over the other?

mugoh

2 Answers


To add to nbro's answer, I'd also say that much of the time the distance measure isn't simply a design decision; rather, it comes up naturally from the model of the problem. For instance, minimizing the KL divergence between your policy and the softmax of the Q values at a given state is equivalent to policy optimization where the optimality at a given state is Bernoulli with respect to the exponential of the reward (see maximum entropy RL algorithms). As another example, the KL divergence in the VAE loss is a result of the model and not just a blind decision.
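
To make that first example concrete, here is a minimal sketch (with made-up Q-values, not any particular algorithm's implementation): the policy that minimizes $D_{KL}(\pi(\cdot \mid s) \,\|\, \mathrm{softmax}(Q(s, \cdot)))$ is the softmax of the Q values itself, where the divergence is exactly zero.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())          # subtract the max for numerical stability
    return z / z.sum()

def kl(p, q):
    # KL(p || q) for discrete distributions with full support
    return float(np.sum(p * (np.log(p) - np.log(q))))

q_values = np.array([1.0, 2.0, 0.5])   # made-up Q(s, .) for a single state
target = softmax(q_values)             # soft-optimal policy, proportional to exp(Q)
policy = np.array([0.2, 0.6, 0.2])     # some current policy pi(. | s)

print(kl(policy, target))   # > 0: the current policy differs from the soft-optimal one
print(kl(target, target))   # 0.0: the divergence is minimized by the softmax itself
```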

I'm less familiar with the total variation distance, but I know there's a nice relationship between the total variation distance from the state probability vector to a Markov chain's stationary distribution, the timestep, and the mixing time of the chain.
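
Here is a small sketch of that relationship with a made-up 3-state chain: the total variation distance between the state distribution at time $t$ and the stationary distribution shrinks as the chain mixes.

```python
import numpy as np

# Made-up ergodic Markov chain; row i is P(s' | s = i)
P = np.array([[0.90, 0.05, 0.05],
              [0.10, 0.80, 0.10],
              [0.20, 0.20, 0.60]])

# Stationary distribution: left eigenvector of P for eigenvalue 1
vals, vecs = np.linalg.eig(P.T)
stationary = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
stationary /= stationary.sum()

mu = np.array([1.0, 0.0, 0.0])                  # start deterministically in state 0
for t in range(15):
    tv = 0.5 * np.abs(mu - stationary).sum()    # total variation distance at time t
    print(t, round(tv, 4))                      # decays as the chain mixes
    mu = mu @ P
```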

Finally, another thing to consider is the properties of the gradients of these divergence measures. Note that the gradient of the total variation distance might blow up as the distance tends to $0$. Additionally, one must consider whether unbiased estimators of the gradients can be obtained from samples. While this is generally the case with the KL divergence, I'm not sure about the total variation distance (as in, I literally don't know), and it is generally not the case with the Wasserstein metric (see Marc G. Bellemare et al.'s paper "The Cramér distance as a solution to biased Wasserstein gradients"). However, of course there are other scenarios where the tables are turned; for instance, the distributional Bellman operator is a contraction in the supremum Wasserstein metric but not in the KL divergence or the total variation distance.

TL;DR: Many times, mathematical/statistical constraints suggest particular metrics.

harwiltz

I have not read the two linked/cited papers and I am not currently familiar with the total variation distance, but I think I can answer some of your questions, given that I am reasonably familiar with the KL divergence.

When you compute the $D_{KL}$ between two policies, what does that tell you about them

The KL divergence is a measure of "distance" (or divergence, as the name suggests) between two probability distributions (i.e. probability measures) or probability densities. In reinforcement learning, (stochastic) policies are probability distributions. For example, in the case where your Markov decision process (MDP) has a discrete set of actions, your policy can be denoted as $$\pi(a \mid s),$$ which is the conditional probability distribution over all possible actions, given a specific state $s$. Hence, the KL divergence is a natural measure of how similar or different two policies are.
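
For concreteness, here is a minimal sketch (with made-up action probabilities) of computing this divergence between two discrete policies at a single state; `scipy.stats.entropy`, when given two distributions, returns their KL divergence.

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

# Made-up action distributions of two policies at the same state s
pi_old = np.array([0.5, 0.3, 0.2])   # pi_old(. | s)
pi_new = np.array([0.4, 0.4, 0.2])   # pi_new(. | s)

print(entropy(pi_new, pi_old))  # KL(pi_new || pi_old), e.g. as an early-stopping metric
```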

There are 4 properties of the KL divergence that you always need to keep in mind

  1. It is asymmetric, i.e., in general, $D_{KL}(q, p) \neq D_{KL}(p, q)$ (where $p$ and $q$ are probability distributions); consequently, the KL divergence cannot be a metric (because metrics are symmetric!)
  2. It is always non-negative.
  3. It is zero if and only if $p = q$.
  4. It is unbounded, i.e. it can be arbitrarily large; so, in other words, two probability distributions can be infinitely different, which may not be very intuitive: in fact, in the past, when I used the KL divergence, this property sometimes made it unclear how to interpret the values I was getting (but this may also be due to my not extremely solid understanding of this measure). The sketch after this list illustrates both the asymmetry (property 1) and the unboundedness (property 4).
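
A minimal numerical sketch of those two properties, with made-up distributions:

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

p = np.array([0.98, 0.01, 0.01])
q = np.array([1/3, 1/3, 1/3])

# Property 1: asymmetry -- the two directions give different values
print(kl(p, q), kl(q, p))

# Property 4: unboundedness -- KL(p || q) grows without bound as q puts
# vanishing mass on an action that p still takes with noticeable probability
for eps in [1e-2, 1e-4, 1e-8]:
    q_eps = np.array([1 - 2 * eps, eps, eps])
    print(kl(np.array([0.5, 0.25, 0.25]), q_eps))
```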

and how is it different from what a $D_{TV}$ between the same two policies tells you?

$D_{TV}$ is also a measure of the distance between two probability distributions, but it is bounded, specifically, in the range $[0, 1]$ [1]. This property may be useful in some circumstances (which ones?). In any case, the fact that it lies in the range $[0, 1]$ potentially makes its interpretation more intuitive. More precisely, if you know the maximum and minimum values that a measure can take, you have a better idea of the relative difference between probability distributions. For instance, imagine that you have probability distributions $q$, $p$ and $p'$. If you compute $D_{TV}(q, p)$ and $D_{TV}(q, p')$, you get a sense (as a fraction of the maximum possible distance) of how much $p$ and $p'$ each differ from $q$.
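
For discrete distributions, $D_{TV}$ is simply half the $L_1$ distance between the probability vectors, so (with made-up numbers) that comparison can look like this:

```python
import numpy as np

def tv(p, q):
    # Total variation distance between discrete distributions: half the L1 distance
    return 0.5 * float(np.abs(p - q).sum())

q       = np.array([0.5, 0.3, 0.2])
p       = np.array([0.4, 0.4, 0.2])
p_prime = np.array([0.1, 0.1, 0.8])

print(tv(q, p))        # 0.1: p is close to q
print(tv(q, p_prime))  # 0.6: p' is much further from q, but still at most 1
```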

The choice between $D_{TV}$ and $D_{KL}$ is probably motivated by their specific properties (it will probably depend on the case, and I would expect the authors of those research papers to motivate the usage of a specific measure/metric). However, keep in mind that there is not always a closed-form expression even for the KL divergence, so you may need to approximate it (e.g. by sampling: note that the KL divergence is defined as an expectation/integral, so you can approximate it with a sampling technique). So, this (computability and/or approximability) may also be a factor to take into account when choosing one over the other.
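
For instance, since $D_{KL}(p \,\|\, q) = \mathbb{E}_{a \sim p}[\log p(a) - \log q(a)]$, a simple Monte Carlo sketch (with made-up distributions) is to average that log-ratio over actions sampled from $p$:

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.3, 0.2])   # made-up "new policy" at some state
q = np.array([0.4, 0.4, 0.2])   # made-up "old policy" at the same state

# Exact KL(p || q) for reference
exact = np.sum(p * (np.log(p) - np.log(q)))

# Monte Carlo estimate: average of log p(a) - log q(a) over actions sampled from p
actions = rng.choice(len(p), size=100_000, p=p)
estimate = np.mean(np.log(p[actions]) - np.log(q[actions]))

print(exact, estimate)   # the two values should be close
```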

By the way, I think that your formula for the total variation distance is wrong: the $D_{TV}$ is indeed related to the $D_{KL}$, but by an inequality (Pinsker's inequality) rather than an equality, specifically [1]

\begin{align} D_{TV} \leq \sqrt{\frac{1}{2} D_{KL}} \end{align}

So the $D_{TV}$ is bounded from above by a function of the KL divergence. Given that the KL divergence is unbounded (e.g. it can take very big values, such as 600k), while $D_{TV}$ never exceeds $1$, this bound can be very loose.
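
As a quick numerical sanity check of this (again with made-up distributions): the bound always holds, but it becomes uninformative as soon as $\sqrt{D_{KL}/2}$ exceeds $1$, since $D_{TV}$ never can.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * (np.log(p) - np.log(q))))

def tv(p, q):
    return 0.5 * float(np.abs(p - q).sum())

p = np.array([0.5, 0.25, 0.25])
for eps in [0.3, 0.05, 1e-3, 1e-6]:
    q = np.array([1 - 2 * eps, eps, eps])
    print(tv(p, q), np.sqrt(0.5 * kl(p, q)))  # D_TV stays <= 1; the bound keeps growing
```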

Take a look at the paper On choosing and bounding probability metrics (2002, by Alison L. Gibbs and Francis Edward Su) or this book for information about $D_{TV}$ (and other measures/metrics).

nbro