
In the book Reinforcement Learning: An Introduction (2nd edition), Sutton and Barto define the importance sampling ratio, $\rho_{t:T-1}$, on page 104 (p. 126 of the pdf), in equation (5.3), as follows:

$$\rho _{t:T-1}=\prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$$

for a target policy $\pi$ and a behavior policy $b$.

However, on page 103, they state:

The target policy $\pi$ [...] may be deterministic [...].

When $\pi$ is deterministic and greedy, it assigns probability $1$ to the greedy action and $0$ to all other possible actions.

So, how can the above formula give anything other than zero, except in the case where policy $b$ takes exactly the path that $\pi$ would have taken? If any action selected by $b$ differs from $\pi$'s choice, then the corresponding numerator is zero, and so is the whole product.
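
For concreteness, here is a small Python sketch (my own illustration, not from the book; the states, actions and behaviour probabilities are made up) that computes $\rho_{t:T-1}$ for a deterministic greedy $\pi$. The ratio is nonzero only while $b$ happens to pick the greedy action, and it collapses to zero at the first deviation:

```python
def rho(trajectory, pi_greedy, b_probs):
    """Compute rho_{t:T-1} = prod_k pi(A_k|S_k) / b(A_k|S_k) for a greedy pi.

    trajectory: list of (state, action) pairs (S_k, A_k)
    pi_greedy:  dict state -> greedy action (pi(a|s) is 1 for it, 0 otherwise)
    b_probs:    dict (state, action) -> b(a|s)
    """
    ratio = 1.0
    for s, a in trajectory:
        pi_prob = 1.0 if pi_greedy[s] == a else 0.0  # deterministic target policy
        ratio *= pi_prob / b_probs[(s, a)]
        if ratio == 0.0:
            break  # a single non-greedy action by b zeroes the whole product
    return ratio

pi_greedy = {"s0": "left", "s1": "left"}
b_probs = {("s0", "left"): 0.5, ("s0", "right"): 0.5,
           ("s1", "left"): 0.5, ("s1", "right"): 0.5}

print(rho([("s0", "left"), ("s1", "left")], pi_greedy, b_probs))   # 4.0 (b followed pi)
print(rho([("s0", "left"), ("s1", "right")], pi_greedy, b_probs))  # 0.0 (b deviated once)
```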

F.M.F.

2 Answers


You're correct: when the target policy $\pi$ is deterministic, the importance sampling ratio will be $\geq 1$ along any trajectory where the behaviour policy $b$ happened to take the same actions that $\pi$ would have taken, and it drops to $0$ as soon as $b$ makes one "mistake" (selects an action that $\pi$ would not have selected).

Before importance sampling is introduced in the book, I believe the only off-policy method you will have seen is one-step $Q$-learning, which can only propagate observations back along exactly one step. With the importance sampling ratio, you can often do a bit better. You're right that there is a risk the ratio turns to $0$ rather quickly (especially when $\pi$ and $b$ are very different from each other), at which point it essentially "truncates" your trajectory and ignores all subsequent experience... but that can still be better than one-step, because there is a chance the ratio will remain nonzero for at least a few steps. It will occasionally still only permit $1$-step returns, but also sometimes $2$-step returns, sometimes $3$-step returns, etc., which is often better than only ever having $1$-step returns.
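
As a rough illustration (my own sketch, not from the book; the probabilities are made up), you can see this "truncation" by looking at the running products $\rho_{t:t}, \rho_{t:t+1}, \dots$: they stay nonzero up to the first step where $b$ deviates from the greedy $\pi$, so that prefix of the trajectory still contributes:

```python
def running_ratios(actions_b, actions_pi, b_prob=0.5):
    """Cumulative products pi(A_k|S_k)/b(A_k|S_k), assuming a deterministic pi
    and a behaviour policy b that picks each of its actions with probability b_prob."""
    ratios, prod = [], 1.0
    for a_b, a_pi in zip(actions_b, actions_pi):
        prod *= (1.0 if a_b == a_pi else 0.0) / b_prob
        ratios.append(prod)
    return ratios

# b agrees with the greedy pi for 3 steps, then deviates at step 4:
print(running_ratios(list("LLLR"), list("LLLL")))
# [2.0, 4.0, 8.0, 0.0] -> the first 3 steps still carry (growing) weight,
# everything from step 4 onwards is ignored.
```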

Whenever the importance sampling ratio is not $0$, it can also give more emphasis to the observations resulting from trajectories that would be common under $\pi$, but are uncommon under $b$. Such trajectories will have a ratio $> 1$. Emphasizing such trajectories more can be beneficial, because they don't get experienced often under $b$, so without the extra emphasis it can be difficult to properly learn what would have happened under $\pi$.


Of course, it is also worth noting that your quote says (emphasis mine):

The target policy $\pi$ [...] *may* be deterministic [...]

It says that $\pi$ may be deterministic (and in practice it very often is, because we very often take $\pi$ to be the greedy policy)... but sometimes it won't be. The entire approach using the importance sampling ratio is also well-defined for cases where we choose $\pi$ not to be deterministic. In such situations, we'll often be able to propagate observations over significantly longer trajectories (although there is also a risk of excessive variance and/or numerical instability when $b$ selects actions that are highly unlikely according to $b$, but highly likely according to $\pi$).
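
To get a feel for that variance issue with a stochastic $\pi$, here is a quick Monte Carlo sketch (the two-action policies and probabilities are made up by me): under $b$ the ratio has expectation $1$, but its distribution is heavily skewed, with occasional very large values when $b$ happens to pick actions it considers unlikely but $\pi$ considers likely:

```python
import random

random.seed(0)
pi_probs = {"a": 0.9, "b": 0.1}   # stochastic target policy
b_probs  = {"a": 0.1, "b": 0.9}   # behaviour policy, very different from pi

def sample_ratio(steps=5):
    """Sample a trajectory of length `steps` under b and return the product
    of the per-step ratios pi(A_k|S_k) / b(A_k|S_k)."""
    prod = 1.0
    for _ in range(steps):
        a = "a" if random.random() < b_probs["a"] else "b"  # act according to b
        prod *= pi_probs[a] / b_probs[a]
    return prod

samples = [sample_ratio() for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(mean, var, max(samples))
# The true expectation of the ratio under b is 1, but the empirical mean is
# noisy, the variance is huge, and rare samples can reach 9**5 = 59049.
```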

Dennis Soemers

Good question. I think this part of the book is not well explained.

Off-policy evaluation of $V$ by itself doesn't make sense, IMO.

I think there are two cases here:

  1. $\pi$ is deterministic, as we probably want in the case of "control", i.e. we will make $\pi$ deterministic and, in every state, choose the action that is most likely to maximize the rewards/returns. In that case, evaluating $V$ from a different distribution might not be so useful, as $W$ becomes $0$ with high likelihood. I don't see much sense in it.

  2. $\pi$ is not deterministic. Then it's a good question why we would want to estimate $V_\pi$ from episodes generated by $b$, instead of just estimating it from episodes generated by $\pi$ directly.

So, IMO, off-policy evaluation of $V_\pi$ doesn't make any sense.

However, I think the goal here is actually the control algorithm given in the book (using $q(s,a)$, p. 111 of the book [133 of the pdf]). The idea there is to use some arbitrary behavior/exploratory policy and, while it runs, update ("control") the policy $\pi$. There, you use the update rule for $W$, which uses the idea of importance sampling, i.e. how to update the expected value under $\pi$ based on samples from $b$. And there it ACTUALLY makes sense.
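
For concreteness, here is a rough Python sketch of that control loop with weighted importance sampling and the incremental $W$ update, following the structure of the algorithm on p. 111; the environment interface (env.reset() / env.step()), the action list and the $\epsilon$-soft behaviour policy are my own assumptions, not from the book:

```python
import random
from collections import defaultdict

def off_policy_mc_control(env, actions, episodes=10_000, gamma=1.0, eps=0.3):
    Q = defaultdict(float)      # action values Q(s, a)
    C = defaultdict(float)      # cumulative weights C(s, a)
    greedy = {}                 # target policy pi: deterministic, greedy in Q

    def behaviour(s):           # epsilon-soft behaviour policy b (assumed)
        g = greedy.get(s, actions[0])
        a = random.choice(actions) if random.random() < eps else g
        prob = eps / len(actions) + (1 - eps if a == g else 0.0)  # b(a|s)
        return a, prob

    for _ in range(episodes):
        # Generate an episode with b, remembering b's probability of each action taken.
        episode, s, done = [], env.reset(), False
        while not done:
            a, prob_a = behaviour(s)
            s_next, r, done = env.step(a)  # assumed to return (next_state, reward, done)
            episode.append((s, a, r, prob_a))
            s = s_next

        # Backward pass with the incremental weighted importance sampling update.
        G, W = 0.0, 1.0
        for s, a, r, prob_a in reversed(episode):
            G = gamma * G + r
            C[(s, a)] += W
            Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
            greedy[s] = max(actions, key=lambda x: Q[(s, x)])
            if a != greedy[s]:
                break              # pi(a|s) = 0, so W would become 0: stop here
            W *= 1.0 / prob_a      # pi(a|s) = 1 for the greedy action

    return Q, greedy
```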

So, I suspect the evaluation was given by itself just so the reader can better understand how to do the evaluation, though it really doesn't make sense outside the control algorithm.

Maverick Meerkat