
In section 7.4 of the RL textbook by Sutton & Barto, the authors discuss "true online TD($\lambda$)". The figure below (7.10 in the book) shows the algorithm.

At the end of each step, $V_{old} \leftarrow V(S')$ and also $S \leftarrow S'$. So when we move to the next step, $\Delta \leftarrow V(S) - V_{old} = V(S') - V(S')$, which is 0. It seems that $\Delta$ is always going to be 0 after step 1. If that were true, the algorithm would not make any sense to me. Can you please elaborate on how $\Delta$ is actually updated?

[Figure 7.10: pseudocode for true online TD($\lambda$)]

nbro
roy

1 Answer


Let us denote the state we are in at time $t$ by $S_t$. At iteration $t$ we store a placeholder $V_{old} = V(S_{t+1})$ for the state we are about to transition into. We then update the value function $V(s) \; \forall s \in \mathcal{S}$, i.e. for every state in the state space. Let us denote this updated value function by $V'$.

At iteration $t+1$ we calculate $\Delta = V'(S_{t+1}) - V_{old} = V'(S_{t+1}) - V(S_{t+1})$, which is not necessarily 0, because the placeholder $V_{old}$ was computed with the value function as it was *before* the last update.
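You can see this numerically with a short simulation. The sketch below is a tabular true online TD($\lambda$) loop with Dutch traces (following the van Seijen & Sutton formulation with one-hot features, which may differ cosmetically from the book's figure 7.10); the two-state cyclic MRP ($0 \to 1 \to 0 \to \dots$, reward 1, $\gamma = 0.9$), and all variable names, are invented for illustration. Note how $V_{old}$ is snapshotted from $V$ *before* the update touches every state with a nonzero trace:

```python
# Toy tabular true online TD(lambda) with Dutch traces (one-hot features).
# The two-state cyclic MRP and all constants are invented for illustration.
alpha, gamma, lam = 0.1, 0.9, 0.8

V = [0.0, 0.0]        # value estimates for states 0 and 1
e = [0.0, 0.0]        # Dutch eligibility traces
V_old = 0.0           # placeholder: V(S') as it was before the update
deltas = []           # the quantity the question asks about: V(S) - V_old

s = 0
for step in range(3):
    s_next, r = 1 - s, 1.0            # deterministic cycle, reward 1
    V_s, V_s_next = V[s], V[s_next]   # both read from the pre-update V

    corr = V_s - V_old                # "Delta" at the start of this step
    deltas.append(corr)
    td_error = r + gamma * V_s_next - V_s

    # Dutch-trace update for the one-hot feature on state s
    e_s = e[s]
    e = [gamma * lam * x for x in e]
    e[s] += 1.0 - alpha * gamma * lam * e_s

    # Update V for *all* states, plus the extra correction on state s
    for i in range(2):
        V[i] += alpha * (td_error + corr) * e[i]
    V[s] -= alpha * corr

    V_old = V_s_next                  # snapshot taken BEFORE this update
    s = s_next

print(deltas)  # 0 on the first two steps, nonzero once state 0 is revisited
```

$\Delta$ is 0 on the first two steps only because the values start at 0 and the successor state has not yet been touched by the trace update; as soon as a state with a nonzero trace is revisited, $V(S)$ has moved away from the snapshot $V_{old}$ and $\Delta \neq 0$.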

David