
In section 7.4 of the RL textbook by Sutton & Barto, the authors discuss "true online TD($\lambda$)". The figure below (7.10 in the book) shows the algorithm.

At the end of each step, $V_{old} \leftarrow V(S')$ and also $S \leftarrow S'$. So when we move to the next step, $\Delta \leftarrow V(S) - V_{old} = V(S') - V(S')$, which is 0. It seems that $\Delta$ is always going to be 0 after step 1. If that were true, the algorithm would not make any sense to me. Can you please elaborate on how $\Delta$ is actually updated?

[Figure 7.10: pseudocode for true online TD($\lambda$)]

nbro
roy

1 Answer


Let us denote the state we are in at time $t$ by $S_t$. At iteration $t$ we store a placeholder $V_{old} = V(S_{t+1})$ for the state we are about to transition into. We then update the value function $V(s) \; \forall s \in \mathcal{S}$, i.e. for every state in the state space. Let us denote this updated value function by $V'$.

At iteration $t+1$ we calculate $\Delta = V'(S_{t+1}) - V_{old} = V'(S_{t+1}) - V(S_{t+1})$, which is not necessarily 0, because the placeholder $V_{old}$ was computed with the value function as it was *before* the last update.
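You can see this numerically with a short simulation. The sketch below is a tabular true online TD($\lambda$) loop with Dutch traces (following the van Seijen & Sutton formulation with one-hot features, which may differ cosmetically from the book's figure 7.10); the two-state cyclic MRP ($0 \to 1 \to 0 \to \dots$, reward 1, $\gamma = 0.9$), and all variable names, are invented for illustration. Note how $V_{old}$ is snapshotted from $V$ *before* the update touches every state with a nonzero trace:

```python
# Toy tabular true online TD(lambda) with Dutch traces (one-hot features).
# The two-state cyclic MRP and all constants are invented for illustration.
alpha, gamma, lam = 0.1, 0.9, 0.8

V = [0.0, 0.0]        # value estimates for states 0 and 1
e = [0.0, 0.0]        # Dutch eligibility traces
V_old = 0.0           # placeholder: V(S') as it was before the update
deltas = []           # the quantity the question asks about: V(S) - V_old

s = 0
for step in range(3):
    s_next, r = 1 - s, 1.0            # deterministic cycle, reward 1
    V_s, V_s_next = V[s], V[s_next]   # both read from the pre-update V

    corr = V_s - V_old                # "Delta" at the start of this step
    deltas.append(corr)
    td_error = r + gamma * V_s_next - V_s

    # Dutch-trace update for the one-hot feature on state s
    e_s = e[s]
    e = [gamma * lam * x for x in e]
    e[s] += 1.0 - alpha * gamma * lam * e_s

    # Update V for *all* states, plus the extra correction on state s
    for i in range(2):
        V[i] += alpha * (td_error + corr) * e[i]
    V[s] -= alpha * corr

    V_old = V_s_next                  # snapshot taken BEFORE this update
    s = s_next

print(deltas)  # 0 on the first two steps, nonzero once state 0 is revisited
```

$\Delta$ is 0 on the first two steps only because the values start at 0 and the successor state has not yet been touched by the trace update; as soon as a state with a nonzero trace is revisited, $V(S)$ has moved away from the snapshot $V_{old}$ and $\Delta \neq 0$.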

David