
In per-decision importance sampling, as given in Sutton & Barto's book:

Eq 5.12 $\rho_{t:T-1}R_{t+k} = \frac{\pi(A_{t}|S_{t})}{b(A_{t}|S_{t})}\frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})}\frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})}\cdots\frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}R_{t+k}$

Eq 5.13 $\mathbb{E}\left[\frac{\pi(A_{k}|S_{k})}{b(A_{k}|S_{k})}\right] = \displaystyle\sum_ab(a|S_k)\frac{\pi(a|S_{k})}{b(a|S_{k})} = \displaystyle\sum_a\pi(a|S_k) = 1$
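(Eq 5.13 itself is easy to check numerically; here is a minimal sketch, with made-up target and behaviour policies `pi` and `b` over three actions for one fixed state:

```python
import numpy as np

# Made-up policies over three actions for a single fixed state s.
pi = np.array([0.7, 0.2, 0.1])  # target policy pi(a|s)
b  = np.array([0.3, 0.3, 0.4])  # behaviour policy b(a|s); covers pi

# E_b[pi(A|s)/b(A|s)] = sum_a b(a|s) * pi(a|s)/b(a|s) = sum_a pi(a|s)
expected_ratio = np.sum(b * (pi / b))
print(expected_ratio)  # prints 1.0, up to floating point, as Eq 5.13 states
```
)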

Eq 5.14 $\mathbb{E}[\rho_{t:T-1}R_{t+k}] = \mathbb{E}[\rho_{t:t+k-1}R_{t+k}]$

As the full derivation is not given, how do we arrive at Eq 5.14 from Eq 5.12?

From what I understand:

1) $R_{t+k}$ depends only on the action taken at time $t+k-1$, given the state at that time, i.e. only on $\frac{\pi(A_{t+k-1}|S_{t+k-1})}{b(A_{t+k-1}|S_{t+k-1})}$

2) $\frac{\pi(A_{k}|S_{k})}{b(A_{k}|S_{k})}$ is independent of $\frac{\pi(A_{k+1}|S_{k+1})}{b(A_{k+1}|S_{k+1})}$, so $\mathbb{E}\left[\frac{\pi(A_{k}|S_{k})}{b(A_{k}|S_{k})}\frac{\pi(A_{k+1}|S_{k+1})}{b(A_{k+1}|S_{k+1})}\right] = \mathbb{E}\left[\frac{\pi(A_{k}|S_{k})}{b(A_{k}|S_{k})}\right]\mathbb{E}\left[\frac{\pi(A_{k+1}|S_{k+1})}{b(A_{k+1}|S_{k+1})}\right], \forall \, k\in [t,T-2]$

Hence, $\mathbb{E}[\rho_{t:T-1}R_{t+k}]= \mathbb{E}\left[\frac{\pi(A_{t}|S_{t})}{b(A_{t}|S_{t})}\frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})}\frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})}\cdots\frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}R_{t+k}\right] \\= \mathbb{E}\left[\frac{\pi(A_{t}|S_{t})}{b(A_{t}|S_{t})}\frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})}\frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})}\cdots\frac{\pi(A_{t+k-2}|S_{t+k-2})}{b(A_{t+k-2}|S_{t+k-2})}\frac{\pi(A_{t+k}|S_{t+k})}{b(A_{t+k}|S_{t+k})}\cdots\frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}\right]\mathbb{E}\left[\frac{\pi(A_{t+k-1}|S_{t+k-1})}{b(A_{t+k-1}|S_{t+k-1})}R_{t+k}\right] \\= \mathbb{E}\left[\frac{\pi(A_{t}|S_{t})}{b(A_{t}|S_{t})}\right]\mathbb{E}\left[\frac{\pi(A_{t+1}|S_{t+1})}{b(A_{t+1}|S_{t+1})}\right]\mathbb{E}\left[\frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})}\right]\cdots\mathbb{E}\left[\frac{\pi(A_{t+k-2}|S_{t+k-2})}{b(A_{t+k-2}|S_{t+k-2})}\right]\mathbb{E}\left[\frac{\pi(A_{t+k}|S_{t+k})}{b(A_{t+k}|S_{t+k})}\right]\cdots\mathbb{E}\left[\frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}\right]\mathbb{E}\left[\frac{\pi(A_{t+k-1}|S_{t+k-1})}{b(A_{t+k-1}|S_{t+k-1})}R_{t+k}\right] \\= \mathbb{E}\left[\frac{\pi(A_{t+k-1}|S_{t+k-1})}{b(A_{t+k-1}|S_{t+k-1})}R_{t+k}\right]\\=\mathbb{E}[\rho_{t+k-1:t+k-1}R_{t+k}]$

which is not equal to Eq 5.14. What is the mistake in the above calculation? Are assumptions 1) and 2) correct?


2 Answers


As mentioned in the comments, your assumption about independence is wrong. Here's why. To prove independence we need to show that the following holds:

$$P(X=x, Y=y) = P(X=x)P(Y=y)$$

In the case of RL, taking $X = A_t$ and $Y = A_{t+1}$, this becomes:

$$P(X=a, Y=a') = P(X=a)P(Y=a')$$

The left hand side has the value:

$$P(X=a, Y=a') = b(A_t = a| S_t = s)\, p(s'|a,s)\, b(A_{t+1} = a'| S_{t+1} = s')$$

while the right hand side has the value:

$$P(X=a)P(Y=a') = b(A_t = a| S_t = s)b(A_{t+1} = a'| S_{t+1} = s')$$

The left-hand side contains the transition factor $p(s'|a,s)$ while the right-hand side does not, so the two sides are not equal, and hence the actions are not independent.
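To see this dependence concretely, here is a minimal Monte Carlo sketch on a made-up two-state, two-action MDP (the policies and dynamics are invented purely for illustration): the first action steers the next state, the next state steers the second action, and the estimated joint distribution of $(A_t, A_{t+1})$ visibly fails to factor into the product of its marginals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Behaviour policy b(a|s) for a made-up two-state, two-action MDP.
b = np.array([[0.9, 0.1],   # b(.|s=0)
              [0.1, 0.9]])  # b(.|s=1)

def step(a):
    """Made-up dynamics: action a leads to state a with probability 0.95."""
    return a if rng.random() < 0.95 else 1 - a

n = 200_000
joint = np.zeros((2, 2))
for _ in range(n):
    a0 = rng.choice(2, p=b[0])     # start in state 0
    s1 = step(a0)
    a1 = rng.choice(2, p=b[s1])
    joint[a0, a1] += 1
joint /= n

print("P(A_t, A_t+1):\n", joint.round(3))
print("P(A_t) P(A_t+1):\n",
      np.outer(joint.sum(axis=1), joint.sum(axis=0)).round(3))
# The two matrices differ, so A_t and A_{t+1} (and the importance
# ratios, which are functions of them) are not independent.
```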

Now let us look at why the following expression holds:

Eq 5.14: $\mathbb{E}[\rho_{t:T-1}R_{t+k}] = \mathbb{E}[\rho_{t:t+k-1}R_{t+k}]$

I will not derive the exact expressions, but I hope you can follow the reasoning I provide. By the rules of probability we know that a joint probability summed over all outcomes is equal to 1, i.e.:

$$\sum_{a_1,\dots,a_n} P(X_1=a_1, X_2=a_2,\dots,X_n = a_n) = 1$$

I have already shown above that the trajectory is not independent. So $R_{t+k}$ will depend on the trajectory $S_{t:t+k-1}$, where $S_{t:t+k-1}$ is a particular trajectory. At the end of this trajectory we get a reward $R_{t+k}$, and thus $R_{t+k}$ is exclusively a function of $S_{t:t+k-1}$, i.e. $R_{t+k} = f(S_{t:t+k-1})$. The trajectory after this, $S_{t+k:T-1}$, is irrelevant, since it will always sum to 1: once you have reached a particular state at time step $t+k-1$, you are conditioning on that, $P(S_{t+k:T-1}|S_{t:t+k-1})$, and taking the expected value over all trajectories possible from there on, i.e. $\sum_{S_{t+k:T-1}} P(S_{t+k:T-1}|S_{t:t+k-1})$, which is 1 by the rules of probability. Thus, what you are really computing is:

$$\sum_{S_{t:t+k-1}} P(S_{t:t+k-1})\,R_{t+k}\Big(\sum_{S_{t+k:T-1}} P(S_{t+k:T-1}|S_{t:t+k-1})\Big)$$

and hence the remaining trajectory makes no contribution.

Another way of thinking about this: you are taking trajectories up to time step $t+k-1$ weighted by the reward $R_{t+k}$, and hence these terms do not sum to 1; the rest of the trajectory after $t+k-1$ is unweighted and does sum to 1.

I hope this qualitative description suffices. You can do the maths, but you must be careful with the notation and the assumptions you make.
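If you would rather check it numerically than symbolically, here is a minimal Monte Carlo sketch on a made-up two-state episodic MDP (policies, dynamics, and rewards all invented): the full-product and truncated-product estimators agree up to sampling noise, exactly as Eq 5.14 claims.

```python
import numpy as np

rng = np.random.default_rng(1)
T, k = 5, 3                                # horizon and reward index (t = 0)
pi = np.array([[0.8, 0.2], [0.3, 0.7]])    # made-up target policy pi(a|s)
b  = np.array([[0.5, 0.5], [0.5, 0.5]])    # made-up behaviour policy b(a|s)

def episode():
    """One episode under b; returns per-step importance ratios and rewards."""
    s, ratios, rewards = 0, [], []
    for _ in range(T):
        a = rng.choice(2, p=b[s])
        ratios.append(pi[s, a] / b[s, a])
        s = a if rng.random() < 0.9 else 1 - a   # made-up dynamics
        rewards.append(float(s))                 # reward = next state index
    return np.array(ratios), rewards

full, trunc = [], []
for _ in range(200_000):
    rho, r = episode()
    full.append(np.prod(rho) * r[k - 1])         # rho_{0:T-1} * R_k
    trunc.append(np.prod(rho[:k]) * r[k - 1])    # rho_{0:k-1} * R_k

print(np.mean(full), np.mean(trunc))   # agree up to Monte Carlo noise
```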

Also, all the equations are correct; I hope you can see this indirectly from my reasoning.


First Part

We can reduce variance in off-policy importance sampling, even in the absence of discounting ($\gamma = 1$). Notice that the off-policy estimators are made up of terms like $$\rho_{t:T-1}G_t = \rho_{t:T-1} (R_{t+1} + \gamma R_{t+2} + \dots+ \gamma^{T-t-1}R_{T})$$

and consider its second term, taking $\gamma = 1$: $$\rho_{t:T-1}R_{t+2} = \frac{\pi(A_t|S_t)\, \pi(A_{t+1}|S_{t+1})\cdots\pi(A_{T-1}|S_{T-1})}{b(A_t|S_t)\, b(A_{t+1}|S_{t+1})\cdots b(A_{T-1}|S_{T-1})} R_{t+2}$$ In the above equation the terms $\pi(A_t|S_t)$, $\pi(A_{t+1}|S_{t+1})$ and $R_{t+2}$ are correlated; all the other terms are independent of them.

Notice the very important property of expectation: $E[ab] = E[a] E[b]$ if $a$ and $b$ are independent random variables.

Now:
$$ E\left[\frac{\pi(A_t|S_t)\, \pi(A_{t+1}|S_{t+1})\cdots\pi(A_{T-1}|S_{T-1})}{b(A_t|S_t)\, b(A_{t+1}|S_{t+1})\cdots b(A_{T-1}|S_{T-1})} R_{t+2}\right]$$ $$ = E\left[\frac{\pi(A_t|S_t)\, \pi(A_{t+1}|S_{t+1})}{b(A_t|S_t)\, b(A_{t+1}|S_{t+1})} R_{t+2}\right] E\left[\frac{\pi(A_{t+2}|S_{t+2})}{b(A_{t+2}|S_{t+2})}\right] \cdots E\left[\frac{\pi(A_{T-1}|S_{T-1})}{b(A_{T-1}|S_{T-1})}\right]$$ $$ = E\left[\frac{\pi(A_t|S_t)\, \pi(A_{t+1}|S_{t+1})}{b(A_t|S_t)\, b(A_{t+1}|S_{t+1})} R_{t+2}\right] \sum_a b(a|s_{t+2}) \frac{\pi(a|s_{t+2})}{b(a|s_{t+2})}\cdots\sum_a b(a|s_{T-1}) \frac{\pi(a|s_{T-1})}{b(a|s_{T-1})} $$ $$ = E\left[\frac{\pi(A_t|S_t)\, \pi(A_{t+1}|S_{t+1})}{b(A_t|S_t)\, b(A_{t+1}|S_{t+1})} R_{t+2}\right] \sum_a \pi(a|s_{t+2})\cdots\sum_a \pi(a|s_{T-1})$$
$$ = E\left[\frac{\pi(A_t|S_t)\, \pi(A_{t+1}|S_{t+1})}{b(A_t|S_t)\, b(A_{t+1}|S_{t+1})} R_{t+2}\right] \cdot 1 \cdots 1 $$ $$ = E\left[\frac{\pi(A_t|S_t)\, \pi(A_{t+1}|S_{t+1})}{b(A_t|S_t)\, b(A_{t+1}|S_{t+1})} R_{t+2}\right] $$ therefore $$ E[\rho_{t:T-1}R_{t+2}] = E[\rho_{t:t+1} R_{t+2}]$$ If we repeat this analysis for the $k$th term, we get: $$E[\rho_{t:T-1}R_{t+k}] = E[\rho_{t:t+k-1} R_{t+k}]$$ It follows that the expectation of our original term can be written: $$E[\rho_{t:T-1}G_{t}] = E[\tilde{G}_{t}]$$ where $$\tilde{G}_t \doteq \rho_{t:t}R_{t+1} + \gamma \rho_{t:t+1}R_{t+2} + \gamma^{2} \rho_{t:t+2}R_{t+3} + \cdots + \gamma^{T-t-1} \rho_{t:T-1}R_{T}$$ We call this idea per-decision importance sampling. It follows immediately that there is an alternative importance-sampling estimate, with the same unbiased expectation as the ordinary importance-sampling estimate: $$V(s) \doteq \frac{\sum_{t\in\mathcal{T}(s)} \tilde{G}_t}{|\mathcal{T}(s)|}$$ which we might expect to sometimes be of lower variance.
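As a small illustration of the estimator above, here is a minimal sketch (not the book's code; the episode layout with per-step `ratios` and `rewards` arrays is my own assumption) computing $\tilde{G}_t$ for one episode with $t = 0$, alongside the ordinary importance-sampling return for comparison:

```python
import numpy as np

def per_decision_return(ratios, rewards, gamma=1.0):
    """G~_0 = sum_k gamma^k * rho_{0:k} * R_{k+1}.

    Assumed layout: ratios[i] = pi(A_i|S_i) / b(A_i|S_i), rewards[i] = R_{i+1}.
    """
    rho_cum = np.cumprod(ratios)                 # rho_{0:0}, rho_{0:1}, ...
    discounts = gamma ** np.arange(len(rewards))
    return np.sum(discounts * rho_cum * np.asarray(rewards))

def ordinary_return(ratios, rewards, gamma=1.0):
    """rho_{0:T-1} * G_0: every reward weighted by the full product."""
    discounts = gamma ** np.arange(len(rewards))
    return np.prod(ratios) * np.sum(discounts * np.asarray(rewards))
```

Averaging `per_decision_return` over the time steps in $\mathcal{T}(s)$ gives the $V(s)$ estimate above; because later rewards are no longer multiplied by ratios from steps after they occur, its variance is typically no larger than that of the ordinary estimate.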

Second Part

The reward $R_{k+1}$ depends on the preceding action probabilities, $\pi(a_1|s_1)$ up to $\pi(a_k|s_k)$. So you cannot separate them out and treat them as independent variables the way it was done in the example above.
