
In the book Reinforcement Learning: An Introduction (2nd edition), Sutton and Barto define the importance sampling ratio, $\rho_{t:T-1}$, on page 104 (p. 126 of the pdf), in equation (5.3), as follows:

$$\rho _{t:T-1}=\prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$

for a target policy $\pi$ and a behavior policy $b$.

However, on page 103, they state:

The target policy $\pi$ [...] may be deterministic [...].

When $\pi$ is deterministic and greedy, it assigns probability $1$ to the greedy action and $0$ to all other possible actions.

So, how can the above formula give anything other than zero, except in the case where policy $b$ takes exactly the path that $\pi$ would have taken? If any action selected by $b$ differs from $\pi$'s choice, then the corresponding numerator is zero, and so is the whole product.
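
For concreteness, here is a small Python sketch (my own illustration, not from the book; the states, actions and behaviour probabilities are made up) that computes $\rho_{t:T-1}$ for a deterministic greedy $\pi$. The ratio is nonzero only while $b$ happens to pick the greedy action, and it collapses to zero at the first deviation:

```python
def rho(trajectory, pi_greedy, b_probs):
    """Compute rho_{t:T-1} = prod_k pi(A_k|S_k) / b(A_k|S_k) for a greedy pi.

    trajectory: list of (state, action) pairs (S_k, A_k)
    pi_greedy:  dict state -> greedy action (pi(a|s) is 1 for it, 0 otherwise)
    b_probs:    dict (state, action) -> b(a|s)
    """
    ratio = 1.0
    for s, a in trajectory:
        pi_prob = 1.0 if pi_greedy[s] == a else 0.0  # deterministic target policy
        ratio *= pi_prob / b_probs[(s, a)]
        if ratio == 0.0:
            break  # a single non-greedy action by b zeroes the whole product
    return ratio

pi_greedy = {"s0": "left", "s1": "left"}
b_probs = {("s0", "left"): 0.5, ("s0", "right"): 0.5,
           ("s1", "left"): 0.5, ("s1", "right"): 0.5}

print(rho([("s0", "left"), ("s1", "left")], pi_greedy, b_probs))   # 4.0 (b followed pi)
print(rho([("s0", "left"), ("s1", "right")], pi_greedy, b_probs))  # 0.0 (b deviated once)
```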

F.M.F.

2 Answers


You're correct: when the target policy $\pi$ is deterministic, the importance sampling ratio will be $\geq 1$ along any trajectory where the behaviour policy $b$ happened to take the same actions that $\pi$ would have taken, and it drops to $0$ as soon as $b$ makes one "mistake" (selects an action that $\pi$ would not have selected).

Before importance sampling is introduced in the book, I believe the only off-policy method you will have seen is one-step $Q$-learning, which can only propagate observations back along exactly one step. With the importance sampling ratio, you can often do a bit better. You're right that there is a risk the ratio turns to $0$ rather quickly (especially when $\pi$ and $b$ are very different from each other), at which point it essentially "truncates" your trajectory and ignores all subsequent experience... but that can still be better than one-step, because there is a chance the ratio will remain nonzero for at least a few steps. It will occasionally still only permit $1$-step returns, but also sometimes $2$-step returns, sometimes $3$-step returns, etc., which is often better than only ever having $1$-step returns.
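
As a rough illustration (my own sketch, not from the book; the probabilities are made up), you can see this "truncation" by looking at the running products $\rho_{t:t}, \rho_{t:t+1}, \dots$: they stay nonzero up to the first step where $b$ deviates from the greedy $\pi$, so that prefix of the trajectory still contributes:

```python
def running_ratios(actions_b, actions_pi, b_prob=0.5):
    """Cumulative products pi(A_k|S_k)/b(A_k|S_k), assuming a deterministic pi
    and a behaviour policy b that picks each of its actions with probability b_prob."""
    ratios, prod = [], 1.0
    for a_b, a_pi in zip(actions_b, actions_pi):
        prod *= (1.0 if a_b == a_pi else 0.0) / b_prob
        ratios.append(prod)
    return ratios

# b agrees with the greedy pi for 3 steps, then deviates at step 4:
print(running_ratios(list("LLLR"), list("LLLL")))
# [2.0, 4.0, 8.0, 0.0] -> the first 3 steps still carry (growing) weight,
# everything from step 4 onwards is ignored.
```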

Whenever the importance sampling ratio is not $0$, it can also give more emphasis to the observations resulting from trajectories that would be common under $\pi$, but are uncommon under $b$. Such trajectories will have a ratio $> 1$. Emphasizing such trajectories more can be beneficial, because they don't get experienced often under $b$, so without the extra emphasis it can be difficult to properly learn what would have happened under $\pi$.


Of course, it is also worth noting that your quote says (emphasis mine):

The target policy $\pi$ [...] *may* be deterministic [...]

It says that $\pi$ may be deterministic (and in practice it very often is, because we very often take $\pi$ to be the greedy policy)... but sometimes it won't be. The entire approach using the importance sampling ratio is also well-defined for cases where we choose $\pi$ not to be deterministic. In such situations, we'll often be able to propagate observations over significantly longer trajectories (although there is also a risk of excessive variance and/or numerical instability when $b$ selects actions that are highly unlikely according to $b$, but highly likely according to $\pi$).
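
To get a feel for that variance issue with a stochastic $\pi$, here is a quick Monte Carlo sketch (the two-action policies and probabilities are made up by me): under $b$ the ratio has expectation $1$, but its distribution is heavily skewed, with occasional very large values when $b$ happens to pick actions it considers unlikely but $\pi$ considers likely:

```python
import random

random.seed(0)
pi_probs = {"a": 0.9, "b": 0.1}   # stochastic target policy
b_probs  = {"a": 0.1, "b": 0.9}   # behaviour policy, very different from pi

def sample_ratio(steps=5):
    """Sample a trajectory of length `steps` under b and return the product
    of the per-step ratios pi(A_k|S_k) / b(A_k|S_k)."""
    prod = 1.0
    for _ in range(steps):
        a = "a" if random.random() < b_probs["a"] else "b"  # act according to b
        prod *= pi_probs[a] / b_probs[a]
    return prod

samples = [sample_ratio() for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(mean, var, max(samples))
# The true expectation of the ratio under b is 1, but the empirical mean is
# noisy, the variance is huge, and rare samples can reach 9**5 = 59049.
```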

Dennis Soemers

Good question. I think this part of the book is not well explained.

Off-policy evaluation of $V$ by itself doesn't make sense, IMO.

I think there are two cases here:

  1. $\pi$ is deterministic, as we probably want in the case of "control", i.e. we will make $\pi$ deterministic and, in every state, choose the action that is most likely to maximize the rewards/returns. In that case, evaluating $V$ from a different distribution might not be so useful, as $W$ becomes $0$ with high likelihood. I don't see much sense in it.

  2. $\pi$ is not deterministic. Then it's a good question why we would want to estimate $V_\pi$ from episodes generated by $b$, instead of just estimating it from episodes generated by $\pi$ directly.

So, IMO, off-policy evaluation of $V_\pi$ doesn't make any sense.

However, I think the goal here is actually the control algorithm given in the book (using $q(s,a)$, p. 111 of the book [133 of the pdf]). The idea there is to use some arbitrary behavior/exploratory policy and, while it runs, update ("control") the policy $\pi$. There, you use the update rule for $W$, which uses the idea of importance sampling, i.e. how to update the expected value under $\pi$ based on samples from $b$. And there it ACTUALLY makes sense.
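
For concreteness, here is a rough Python sketch of that control loop with weighted importance sampling and the incremental $W$ update, following the structure of the algorithm on p. 111; the environment interface (env.reset() / env.step()), the action list and the $\epsilon$-soft behaviour policy are my own assumptions, not from the book:

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, actions, episodes=10_000, gamma=1.0, eps=0.3):
    Q = defaultdict(float)      # action values Q(s, a)
    C = defaultdict(float)      # cumulative weights C(s, a)
    greedy = {}                 # target policy pi: deterministic, greedy in Q

    def behaviour(s):           # epsilon-soft behaviour policy b (assumed)
        g = greedy.get(s, actions[0])
        a = random.choice(actions) if random.random() < eps else g
        prob = eps / len(actions) + (1 - eps if a == g else 0.0)  # b(a|s)
        return a, prob

    for _ in range(episodes):
        # Generate an episode with b, remembering b's probability of each action taken.
        episode, s, done = [], env.reset(), False
        while not done:
            a, prob_a = behaviour(s)
            s_next, r, done = env.step(a)  # assumed to return (next_state, reward, done)
            episode.append((s, a, r, prob_a))
            s = s_next

        # Backward pass with the incremental weighted importance sampling update.
        G, W = 0.0, 1.0
        for s, a, r, prob_a in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            greedy[s] = max(actions, key=lambda x: Q[(s, x)])
            if a != greedy[s]:
                break              # pi(a|s) = 0, so W would become 0: stop here
            W *= 1.0 / prob_a      # pi(a|s) = 1 for the greedy action

    return Q, greedy
```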

So, I suspect the evaluation was given by itself just so the reader can better understand how to do the evaluation, though it really doesn't make sense outside the control algorithm.

Maverick Meerkat