I'm implementing Watkins' Q(λ) algorithm with function approximation (from the 2nd edition of Sutton & Barto). I am very confused about updating the eligibility traces because, at the beginning of chapter 9.3, "Control with Function Approximation", they are updated using the gradient, $ e_t = \gamma \lambda e_{t-1} + \nabla \widehat{q}(S_t, A_t, w_t) $, as shown below.

[Image: Update of the eligibility traces]

Nevertheless, in Figure 9.9, in the exploitation phase the eligibility traces are updated without the gradient term: $ e \leftarrow \gamma \lambda e $.

[Image: Update of the eligibility traces in the exploitation phase of Watkins' Q(λ)]

Furthermore, by searching online, I found that each component of the gradient reduces to the value of the corresponding feature: $ [\nabla \widehat{q}(S_t, A_t, w_t)]_i = f_i(S_t, A_t) $.

I thought that, in Figure 9.9, the gradient is not shown because in the next step the eligibility traces are incremented by 1 for the active features. Since the features are binary, that +1 can be seen as the value of the gradient component, matching what I found online. But I'm not sure.
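To sanity-check that reading, here is a tiny numeric sketch I put together (toy numbers of my own, not from the book), comparing the general update against the decay-then-increment version:

```python
import numpy as np

# x(S, A): binary feature vector, features 1 and 3 are active
x = np.array([0., 1., 0., 1., 0.])
gamma, lam = 0.9, 0.8
e = np.array([0.2, 0.0, 0.5, 0.1, 0.0])   # some previous trace values

# General rule: e <- gamma * lambda * e + grad, with grad = x(S, A)
e_general = gamma * lam * e + x

# Figure 9.9 style: decay first, then e_i <- e_i + 1 for active features
e_pseudo = gamma * lam * e
for i in np.flatnonzero(x):
    e_pseudo[i] += 1.0

print(np.allclose(e_general, e_pseudo))   # True
```

The two agree, which seems to support my reading, but I'd like to be sure.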

So, what is the right rule for updating the eligibility traces, and why?

1 Answer

In the theoretical formula above, the update of the eligibility trace includes the gradient term $\nabla \hat{q}(S_t, A_t, w_t)$. In the pseudocode, however, the update does not explicitly include this gradient term; instead, it uses a simplified feature-based representation.

This is because the pseudocode handles the special case of linear function approximation with binary features, where each state-action pair is represented by a set of active features.

With linear function approximation, $\hat{q}(s, a, w) = w^\top x(s,a)$, where $x(s,a)$ is a feature vector that is 1 for each active feature and 0 for the rest.

In this case, the gradient is simply the feature vector itself: $\nabla_w \hat{q}(s, a, w) = x(s,a)$.
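Concretely, writing the approximation out component-wise gives $\hat{q}(s, a, w) = \sum_j w_j \, x_j(s, a)$, so $\partial \hat{q} / \partial w_i = x_i(s, a)$: the $i$-th component of the gradient is 1 if feature $i$ is active and 0 otherwise.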

In the pseudocode, $F(S,A)$ gives the set of active feature indices for the current state-action pair, and the trace components at those indices are incremented directly ($e_i \leftarrow e_i + 1$ for accumulating traces, or $e_i \leftarrow 1$ for replacing traces). This is exactly the addition of $\nabla \hat{q}(S_t, A_t, w_t)$.
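As a minimal sketch of that bookkeeping (my own illustration in Python; `active_indices` stands in for the book's $F(S, A)$):

```python
import numpy as np

def update_traces(e, active_indices, gamma, lam, replacing=False):
    """Decay all traces, then bump the traces of the active binary features.

    Adding 1 at each index in active_indices is exactly adding the gradient
    x(s, a): the gradient of w^T x(s, a) with respect to w is x(s, a),
    which is 1 on active features and 0 everywhere else.
    """
    e = gamma * lam * e                   # e <- gamma * lambda * e
    for i in active_indices:              # F(S, A): active feature indices
        if replacing:
            e[i] = 1.0                    # replacing traces: e_i <- 1
        else:
            e[i] += 1.0                   # accumulating traces: e_i <- e_i + 1
    return e
```

Only the indices in $F(S,A)$ are touched, so the increment costs $O(|F(S,A)|)$ rather than $O(n)$ over all features.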

Therefore, the operation $e_i \leftarrow e_i + 1$ (or $e_i \leftarrow 1$) on the active features corresponds to the term $\nabla \hat{q}(S_t, A_t, w_t)$ in the general formula; the gradient is not missing from the exploitation-phase step $e \leftarrow \gamma \lambda e$, it is simply added at a different point in the loop, when the traces of the newly chosen action's features are incremented. The simplification exploits the form of the gradient under linear function approximation.

This simplification is standard when implementing linear function approximation: only the active features need to be touched, which makes the algorithm more efficient in practice.
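Putting the pieces together, here is a rough sketch of one step of the algorithm (my own reconstruction, not the book's pseudocode verbatim; `q_hat`, `n_features`, and the argument names are assumptions, and terminal-state handling is omitted). It also shows why the exploitation-phase update has no gradient term: the +1 increment (the gradient) is applied when the action's features become active, while $e \leftarrow \gamma \lambda e$ is only the decay half, applied after a greedy action; after an exploratory action, Watkins' variant resets the traces to zero.

```python
import numpy as np

n_features = 100            # hypothetical number of binary features
gamma, lam, alpha = 0.99, 0.9, 0.1
w = np.zeros(n_features)    # weight vector
e = np.zeros(n_features)    # eligibility traces

def q_hat(active):
    # q_hat(s, a, w) = w^T x(s, a) = sum of the weights of the active features
    return w[list(active)].sum()

def watkins_step(active_sa, reward, next_active_sets, next_action_greedy):
    """One semi-gradient Watkins' Q(lambda) update with binary features.

    active_sa          -- F(S, A): active feature indices of the pair just taken
    reward             -- R observed after taking A
    next_active_sets   -- [F(S', a') for every action a' available in S']
    next_action_greedy -- True if the action chosen in S' is greedy
    """
    global w, e
    for i in active_sa:                  # e_i <- e_i + 1: this is the gradient,
        e[i] += 1.0                      # since x(S, A) is 1 on active features
    delta = (reward
             + gamma * max(q_hat(f) for f in next_active_sets)
             - q_hat(active_sa))
    w = w + alpha * delta * e            # w <- w + alpha * delta * e
    if next_action_greedy:
        e = gamma * lam * e              # exploitation: e <- gamma * lambda * e
    else:
        e = np.zeros(n_features)         # exploration: cut the traces to zero
```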

haha_kimi