
In the paper Learning to Predict by the Methods of Temporal Differences (p. 15), the weights in temporal-difference learning are updated according to the equation $$ \Delta w_t = \alpha \left(P_{t+1} - P_t\right) \sum_{k=1}^{t}{\lambda^{t-k} \nabla_w P_k} \tag{4} \,.$$ When $\lambda = 0$, as in TD(0), how does the method learn? It appears that, with $\lambda = 0$, there will never be a change in the weights and hence no learning.

Am I missing anything?

Amanda

1 Answer


When $\lambda = 0$, as in TD(0), how does the method learn? It appears that, with $\lambda = 0$, there will never be a change in the weights and hence no learning.

I think the detail that you're missing is that the final term of the sum (the case where $k = t$) has $\lambda$ raised to the power $0$, and anything raised to the power $0$ (including $0$ itself, by the usual convention $0^0 = 1$) is equal to $1$. So, for $\lambda = 0$, your update equation becomes

$$\Delta w_t = \alpha \left( P_{t+1} - P_t \right) \nabla_w P_t,$$

which is a one-step update that uses only the gradient of the most recent prediction (just like the one-step update in Sarsa).
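To see this concretely, here is a minimal sketch (a toy example of mine, not code from the paper) for a linear predictor $P_t = w^\top x_t$, so that $\nabla_w P_k = x_k$ and the sum in equation (4) can be maintained incrementally as an eligibility trace $e_t = \lambda e_{t-1} + \nabla_w P_t$. With $\lambda = 0$ the trace is just $\nabla_w P_t$, and each step performs exactly the one-step update above.

```python
import numpy as np

def td_lambda_updates(states, w, alpha, lam):
    """Apply the weight updates of equation (4) over one episode.

    Assumes a linear predictor P(x) = w . x, so grad_w P_k = x_k and the
    sum in (4) can be kept incrementally: e_t = lam * e_{t-1} + x_t.
    (For brevity, the terminal outcome z from the paper is omitted;
    the episode is just a sequence of feature vectors.)
    """
    e = np.zeros_like(w)               # eligibility trace: sum_k lam^{t-k} grad_w P_k
    w = w.copy()
    for t in range(len(states) - 1):
        x_t, x_next = states[t], states[t + 1]
        e = lam * e + x_t              # with lam = 0 this is just x_t (the k = t term)
        td_error = w @ x_next - w @ x_t    # P_{t+1} - P_t
        w += alpha * td_error * e      # Delta w_t from equation (4)
    return w

# With lam = 0 the trace e equals grad_w P_t = x_t, so each step performs the
# one-step update  Delta w_t = alpha * (P_{t+1} - P_t) * grad_w P_t.
rng = np.random.default_rng(0)
states = rng.normal(size=(5, 3))       # a toy episode of 5 feature vectors
w0 = np.zeros(3)
print(td_lambda_updates(states, w0, alpha=0.1, lam=0.0))
```

The incremental trace is what makes the update computable online: you never need to store all past gradients, and setting $\lambda = 0$ simply discards everything except the most recent one.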

Dennis Soemers