I'm working through Sutton & Barto's *Reinforcement Learning: An Introduction* (2nd edition) and trying to understand the derivation of Equation 12.7, the semi-gradient TD(λ) weight update with eligibility traces in Chapter 12. Here's the update equation in question:
$$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t$$
In this equation:
- $\mathbf{w}_{t+1}$ is the updated weight vector.
- $\alpha$ is the step-size (learning rate).
- $\delta_t$ is the TD error: $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)$
- $\mathbf{z}_t$ is the eligibility trace vector, defined recursively (with $\mathbf{z}_{-1} = \mathbf{0}$) as $\mathbf{z}_{t} = \gamma \lambda \mathbf{z}_{t-1} + \mathbf{x}(S_t)$, where $\mathbf{x}(S_t)$ is the feature vector of $S_t$ (i.e., the gradient $\nabla \hat{v}(S_t, \mathbf{w}_t)$ in the linear case); see the sketch below.
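To make my reading of these definitions concrete, here is a minimal sketch of a single update step as I currently understand it, assuming linear function approximation so that $\nabla \hat{v}(S_t, \mathbf{w}) = \mathbf{x}(S_t)$. The function name and signature are my own scaffolding, not anything from the book:

```python
import numpy as np

def td_lambda_step(w, z, x_t, x_tp1, reward, alpha, gamma, lam, terminal=False):
    """One semi-gradient TD(lambda) update, following the definitions above.

    w      -- weight vector
    z      -- eligibility trace carried over from the previous step (z_{-1} = 0)
    x_t    -- feature vector x(S_t)
    x_tp1  -- feature vector x(S_{t+1})
    """
    # Decay the old trace and accumulate the gradient at S_t,
    # which for a linear v_hat is just x(S_t).
    z = gamma * lam * z + x_t

    # TD error: delta_t = R_{t+1} + gamma * v_hat(S_{t+1}) - v_hat(S_t),
    # with v_hat at a terminal state defined as 0.
    v_t = w @ x_t
    v_tp1 = 0.0 if terminal else w @ x_tp1
    delta = reward + gamma * v_tp1 - v_t

    # Weight update (Eq. 12.7): the entire per-step change is alpha * delta_t * z_t.
    w = w + alpha * delta * z
    return w, z
```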
My Question:
I attempted to derive Equation 12.7 from first principles, but encountered an issue. Using the definitions provided, my derivation included an additional term: $$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t (\mathbf{z}_{t} - \gamma \lambda \mathbf{z}_{t-1})$$
Expanding this leads to: $$\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \delta_t \mathbf{z}_t - \alpha \delta_t \gamma \lambda \mathbf{z}_{t-1}$$
This extra term $- \alpha \delta_t \gamma \lambda \mathbf{z}_{t-1}$ does not appear in Sutton & Barto's formulation. I suspect that the recursion within $\mathbf{z}_t$ might implicitly account for this term, but I would like to confirm that understanding and clarify why it can be omitted.
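For what it's worth, my current reading is that unrolling the recursion (with $\mathbf{z}_{-1} = \mathbf{0}$) gives

$$\mathbf{z}_t = \sum_{k=0}^{t} (\gamma \lambda)^{t-k}\, \mathbf{x}(S_k),$$

i.e., the trace at time $t$ already carries the exponentially decayed feature vectors of every state visited so far. I'd like confirmation that this is what makes a separate $\mathbf{z}_{t-1}$ correction term unnecessary.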
Key Points I'm Looking For:
- A step-by-step derivation, starting from the TD-error and eligibility-trace definitions, showing how the final weight-update form is obtained without needing the extra term.
- An explanation, if possible, of how $\mathbf{z}_t$ itself encapsulates the historical contributions from prior time steps.
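In case it's useful context for an answer, here is the quick numerical check I ran to convince myself that the recursive and unrolled forms of the trace agree (random toy features, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, lam, T, d = 0.9, 0.8, 10, 4
xs = rng.normal(size=(T, d))        # stand-in feature vectors x(S_0), ..., x(S_{T-1})

# Recursive form: z_t = gamma * lam * z_{t-1} + x(S_t), with z_{-1} = 0.
z = np.zeros(d)
for t in range(T):
    z = gamma * lam * z + xs[t]

# Unrolled form: z_{T-1} = sum_k (gamma * lam)^{T-1-k} * x(S_k).
z_unrolled = sum((gamma * lam) ** (T - 1 - k) * xs[k] for k in range(T))

print(np.allclose(z, z_unrolled))   # True
```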
Any references to additional literature that validate this would also be very helpful. Thank you!