
Notice that, in the following formula, at the very right, the term multiplied by $\lambda^{j-i}$ is $d_i$:

$$ w := w + \alpha \sum_{i=1}^{N-1} \nabla r(x_i^l, w) \Big[ \sum_{j=i}^{N-1} \lambda^{j-i} d_i \Big] $$

Note that $i$ is the index of the first (outer) sum, so $d_i$ does not vary with the inner sum's index $j$.

However, in the following formula, the term is $d_m$, which was $d_i$ in the first equation.

$$ w_j := w_j + \alpha \sum_{i=1}^{N-1} \frac{\partial r(x_i^l, w)}{\partial w_j} \Big[ \sum_{m=i}^{N-1} \lambda^{m-i} d_m \Big] $$

Note that $m$ is the index of the second (inner) sum, so $d_m$ varies from term to term.

In my opinion, the first equation seems reasonable. The $\lambda$ in the second one actually seems to work like a discount factor.
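
To make the difference concrete, here is a small, purely illustrative Python sketch of the two inner (bracketed) sums; the numbers in `d` are made up:

```python
# Purely illustrative: d[t] stands for the temporal difference at step t
# (0-indexed here), and lam is the lambda parameter.

def inner_sum_first(d, i, lam):
    # First equation: the same d[i] appears in every term of the sum.
    return sum(lam ** (j - i) * d[i] for j in range(i, len(d)))

def inner_sum_second(d, i, lam):
    # Second equation: each term uses its own d[m].
    return sum(lam ** (m - i) * d[m] for m in range(i, len(d)))

d = [0.5, -0.2, 0.1, 0.3]           # made-up temporal differences
print(inner_sum_first(d, 0, 0.7))   # 0.5 * (1 + 0.7 + 0.7**2 + 0.7**3)
print(inner_sum_second(d, 0, 0.7))  # 0.5 - 0.2*0.7 + 0.1*0.7**2 + 0.3*0.7**3
```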

In another chess engine, Giraffe, the author cited the same paper and noted the first equation, but then proceeded to implement the second one. See also the KnightCap paper.

nbro

2 Answers


The second equation is correct. In TD($\lambda$), the $\lambda$ parameter can be tuned to smoothly vary between single-step updates (essentially what Sarsa does) in the case of $\lambda = 0$, and Monte-Carlo returns (using the full episode's returns) in the case of $\lambda = 1$.
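
Written out for the inner sum of the second equation (with the usual convention $0^0 = 1$), the two extremes are:

$$ \lambda = 0: \quad \sum_{m=i}^{N-1} \lambda^{m-i} d_m = d_i, \qquad \qquad \lambda = 1: \quad \sum_{m=i}^{N-1} \lambda^{m-i} d_m = \sum_{m=i}^{N-1} d_m. $$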

In the first equation, $\sum_{j = i}^{N - 1} \lambda^{j - i} d_i$ amounts to summing up exactly the same temporal-difference term $d_i$ a number of times (specifically, $N - i$ times), but multiplied by a different scalar every time. I'm not sure how that could be useful in any way.
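
Concretely, since $d_i$ does not depend on the summation index $j$, it factors out as a plain geometric series (for $\lambda \neq 1$):

$$ \sum_{j=i}^{N-1} \lambda^{j-i} d_i = d_i \sum_{k=0}^{N-1-i} \lambda^{k} = d_i \, \frac{1 - \lambda^{N-i}}{1 - \lambda}. $$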

In the second equation, $\sum_{m = i}^{N - 1} \lambda^{m - i} d_m$ can be interpreted as a weighted combination (see the sketch after this list):

  • $1 \times d_i$: this is the difference between what we predict our returns will be at time $i + 1$, and what we previously predicted our returns would be at time $i$.
  • $+ \lambda^1 \times d_{i + 1}$: a very similar difference between two of our own predictions, now the predictions at time $i + 2$ and $i + 1$. This time the temporal-difference term is weighted by the parameter $\lambda$ (completely ignored if $\lambda = 0$, full weight if $\lambda = 1$, somewhere in between if $0 < \lambda < 1$).
  • $+ \lambda^2 \times d_{i + 2}$: again a similar temporal-difference term, now again one step further into the future. Downweighted a bit more than the previous term in cases where $0 < \lambda < 1$.
  • etc.
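
Here is a minimal Python sketch of this weighted combination and of the full weight update from the second equation. It is an illustration under my own naming, not either engine's actual code: `grads` stands for whatever gradients of the evaluation function the engine computes, and the example numbers are made up.

```python
import numpy as np

def td_lambda_update(w, grads, d, alpha, lam):
    """One sweep of the second equation's weight update (0-indexed sketch).

    w     : weight vector, shape (n_weights,)
    grads : grads[i] = gradient of r(x_i^l, w) w.r.t. w, shape (N-1, n_weights)
    d     : d[m] = temporal difference at step m, shape (N-1,)
    """
    w = np.asarray(w, dtype=float).copy()
    for i in range(len(d)):
        # lambda-weighted combination of the current and all future TD terms
        weighted_td = sum(lam ** (m - i) * d[m] for m in range(i, len(d)))
        w += alpha * grads[i] * weighted_td
    return w

# Made-up example: 4 leaf evaluations -> 3 temporal differences, 2 weights
w = np.array([0.1, -0.2])
grads = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
d = np.array([0.3, -0.1, 0.2])
print(td_lambda_update(w, grads, d, alpha=0.01, lam=0.7))
```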

For people who are familiar with temporal-difference algorithms like TD($\lambda$), Sarsa($\lambda$), eligibility traces, etc. from the Reinforcement Learning literature, this makes a lot more sense. The notation is still a bit different from the standard literature on algorithms like TD($\lambda$), but it becomes equivalent once you note that, in this paper, they discuss domains where rewards are only associated with terminal states, with no intermediate rewards.

Intuitively, what they're doing with the $\lambda$ parameter is assigning more weight (or "credit" or "importance") to short-term predictions / short-term "expectations" (in the English sense of the word, rather than the mathematical sense of the word) or observations of rewards, over long-term predictions/observations. In the extreme case of $\lambda = 0$, you completely ignore long-term predictions/observations and only propagate observed rewards very slowly, one-by-one in single steps. In the other extreme case of $\lambda = 1$, you propagate rewards observed at the end of episodes with equal weight all the way to the beginning of the episodes, through all states that you went to, giving them all equal weight for that observed reward. With $0 < \lambda < 1$, you choose a balance between those two extremes.
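
To make the $\lambda = 1$ extreme explicit: if we assume, as in the KnightCap paper, that each temporal difference is the change between successive leaf evaluations, $d_m = r(x_{m+1}^l, w) - r(x_m^l, w)$, then the inner sum of the second equation telescopes:

$$ \sum_{m=i}^{N-1} d_m = \sum_{m=i}^{N-1} \big( r(x_{m+1}^l, w) - r(x_m^l, w) \big) = r(x_N^l, w) - r(x_i^l, w), $$

so every position $x_i^l$ in the episode is pulled towards the final evaluation/outcome at $x_N^l$ with equal weight, exactly as described above.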


Also note that Equation (5) in the KnightCap paper (where they similarly discuss the extreme case of $\lambda = 1$, like I did above) is incorrect if we take the first equation from your question, but is correct if we take the second equation.

Dennis Soemers

Of course, it's the first equation that is correct. $d$ is the temporal difference: the difference between the evaluations of the current state and the next state. The current state is $i$, and this difference needs to stay constant inside the summation loop.

The second equation is simply a typo.

SmallChess