
In the paper Learning to Predict by the Methods of Temporal Differences (p. 15), the weights in temporal-difference learning are updated according to the equation $$ \Delta w_t = \alpha \left(P_{t+1} - P_t\right) \sum_{k=1}^{t}{\lambda^{t-k} \nabla_w P_k} \tag{4} \,.$$ When $\lambda = 0$, as in TD(0), how does the method learn? It appears that, with $\lambda = 0$, there will never be a change in the weights and hence no learning.

Am I missing anything?

Amanda

1 Answer


When $\lambda = 0$, as in TD(0), how does the method learn? It appears that, with $\lambda = 0$, there will never be a change in the weights and hence no learning.

I think the detail that you're missing is that the final term of the sum (the case where $k = t$) has $\lambda$ raised to the power $0$, and anything raised to the power $0$ (including $0$ itself, by the usual convention $0^0 = 1$) is equal to $1$. So, for $\lambda = 0$, your update equation becomes

$$\Delta w_t = \alpha \left( P_{t+1} - P_t \right) \nabla_w P_t,$$

which is a one-step update that uses only the gradient of the most recent prediction (just like the one-step update in Sarsa).
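To see this concretely, here is a minimal sketch (a toy example of mine, not code from the paper) for a linear predictor $P_t = w^\top x_t$, so that $\nabla_w P_k = x_k$ and the sum in equation (4) can be maintained incrementally as an eligibility trace $e_t = \lambda e_{t-1} + \nabla_w P_t$. With $\lambda = 0$ the trace is just $\nabla_w P_t$, and each step performs exactly the one-step update above.

```python
import numpy as np

def td_lambda_updates(states, w, alpha, lam):
    """Apply the weight updates of equation (4) over one episode.

    Assumes a linear predictor P(x) = w . x, so grad_w P_k = x_k and the
    sum in (4) can be kept incrementally: e_t = lam * e_{t-1} + x_t.
    (For brevity, the terminal outcome z from the paper is omitted;
    the episode is just a sequence of feature vectors.)
    """
    e = np.zeros_like(w)               # eligibility trace: sum_k lam^{t-k} grad_w P_k
    w = w.copy()
    for t in range(len(states) - 1):
        x_t, x_next = states[t], states[t + 1]
        e = lam * e + x_t              # with lam = 0 this is just x_t (the k = t term)
        td_error = w @ x_next - w @ x_t    # P_{t+1} - P_t
        w += alpha * td_error * e      # Delta w_t from equation (4)
    return w

# With lam = 0 the trace e equals grad_w P_t = x_t, so each step performs the
# one-step update  Delta w_t = alpha * (P_{t+1} - P_t) * grad_w P_t.
rng = np.random.default_rng(0)
states = rng.normal(size=(5, 3))       # a toy episode of 5 feature vectors
w0 = np.zeros(3)
print(td_lambda_updates(states, w0, alpha=0.1, lam=0.0))
```

The incremental trace is what makes the update computable online: you never need to store all past gradients, and setting $\lambda = 0$ simply discards everything except the most recent one.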

Dennis Soemers