I'm working through Sutton & Barto's *Reinforcement Learning: An Introduction* (2nd edition) and trying to understand the derivation of Equation 12.7, the semi-gradient TD(λ) weight update with eligibility traces in Chapter 12. Here's the update equation in question:
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t$$
In this equation:
- $\mathbf{w}_{t+1}$ is the updated weight vector.
- $\alpha$ is the step-size (learning rate).
- $\delta_t$ is the TD error: $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$
- $\mathbf{z}_t$ is the eligibility trace vector, defined recursively (with $\mathbf{z}_{-1} = \mathbf{0}$) as $\mathbf{z}_{t} = \gamma \lambda \mathbf{z}_{t-1} + \mathbf{x}(S_t)$, where $\mathbf{x}(S_t)$ is the feature vector of $S_t$ (i.e., the gradient $\nabla \hat{v}(S_t, \mathbf{w}_t)$ in the linear case); see the sketch below.
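To make my reading of these definitions concrete, here is a minimal sketch of a single update step as I currently understand it, assuming linear function approximation so that $\nabla \hat{v}(S_t, \mathbf{w}) = \mathbf{x}(S_t)$. The function name and signature are my own scaffolding, not anything from the book:

```python
import numpy as np

def td_lambda_step(w, z, x_t, x_tp1, reward, alpha, gamma, lam, terminal=False):
    """One semi-gradient TD(lambda) update, following the definitions above.

    w      -- weight vector
    z      -- eligibility trace carried over from the previous step (z_{-1} = 0)
    x_t    -- feature vector x(S_t)
    x_tp1  -- feature vector x(S_{t+1})
    """
    # Decay the old trace and accumulate the gradient at S_t,
    # which for a linear v_hat is just x(S_t).
    z = gamma * lam * z + x_t

    # TD error: delta_t = R_{t+1} + gamma * v_hat(S_{t+1}) - v_hat(S_t),
    # with v_hat at a terminal state defined as 0.
    v_t = w @ x_t
    v_tp1 = 0.0 if terminal else w @ x_tp1
    delta = reward + gamma * v_tp1 - v_t

    # Weight update (Eq. 12.7): the entire per-step change is alpha * delta_t * z_t.
    w = w + alpha * delta * z
    return w, z
```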
My Question:
I attempted to derive Equation 12.7 from first principles, but encountered an issue. Using the definitions provided, my derivation included an additional term: $$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t (\mathbf{z}_{t} - \gamma \lambda \mathbf{z}_{t-1})$$
Expanding this leads to: $$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t - \alpha \delta_t \gamma \lambda \mathbf{z}_{t-1}$$
This extra term $- \alpha \delta_t \gamma \lambda \mathbf{z}_{t-1}$ does not appear in Sutton & Barto's formulation. I suspect that the recursion within $\mathbf{z}_t$ might implicitly account for this term, but I would like to confirm that understanding and clarify why it can be omitted.
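For what it's worth, my current reading is that unrolling the recursion (with $\mathbf{z}_{-1} = \mathbf{0}$) gives

$$\mathbf{z}_t = \sum_{k=0}^{t} (\gamma \lambda)^{t-k}\, \mathbf{x}(S_k),$$

i.e., the trace at time $t$ already carries the exponentially decayed feature vectors of every state visited so far. I'd like confirmation that this is what makes a separate $\mathbf{z}_{t-1}$ correction term unnecessary.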
Key Points I'm Looking For:
- A step-by-step derivation, starting from the TD-error and eligibility-trace definitions, showing how the final weight-update form is obtained without needing the extra term.
- An explanation, if possible, of how $\mathbf{z}_t$ itself encapsulates the historical contributions from prior time steps.
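In case it's useful context for an answer, here is the quick numerical check I ran to convince myself that the recursive and unrolled forms of the trace agree (random toy features, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, lam, T, d = 0.9, 0.8, 10, 4
xs = rng.normal(size=(T, d))        # stand-in feature vectors x(S_0), ..., x(S_{T-1})

# Recursive form: z_t = gamma * lam * z_{t-1} + x(S_t), with z_{-1} = 0.
z = np.zeros(d)
for t in range(T):
    z = gamma * lam * z + xs[t]

# Unrolled form: z_{T-1} = sum_k (gamma * lam)^{T-1-k} * x(S_k).
z_unrolled = sum((gamma * lam) ** (T - 1 - k) * xs[k] for k in range(T))

print(np.allclose(z, z_unrolled))   # True
```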
Any references to additional literature that validate this would also be very helpful. Thank you!