This is Exercise 6.13 in Sutton & Barto, page 136.
What are the update equations for Double Expected Sarsa with an $\epsilon$-greedy target policy?
The answer is given as follows:
Let $Q_1$ and $Q_2$ be the two action-value functions, and let $\pi_1$ and $\pi_2$ be the $\epsilon$-greedy policies derived from them, respectively. Then the updates are:
$Q_1(S_t, A_t) \leftarrow Q_1(S_t, A_t)+\alpha\Big[R_{t+1}+\gamma \displaystyle \sum_{a}\pi_2(a|S_{t+1})Q_1(S_{t+1},a)-Q_1(S_t, A_t)\Big] $
$Q_2(S_t, A_t) \leftarrow Q_2(S_t, A_t)+\alpha\Big[R_{t+1}+\gamma \displaystyle \sum_{a}\pi_1(a|S_{t+1})Q_2(S_{t+1},a)-Q_2(S_t, A_t)\Big] $
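To make sure I'm reading the equations correctly, here is a minimal tabular sketch of a single update step as I understand it (the function names, the `epsilon_greedy_probs` helper, and the `rng` argument are my own, and I assume `Q1` and `Q2` are NumPy arrays of shape `[num_states, num_actions]`):

```python
import numpy as np

def epsilon_greedy_probs(q_row, epsilon):
    """Action probabilities of an epsilon-greedy policy w.r.t. one row of Q."""
    n_actions = len(q_row)
    probs = np.full(n_actions, epsilon / n_actions)
    probs[np.argmax(q_row)] += 1.0 - epsilon
    return probs

def double_expected_sarsa_update(Q1, Q2, s, a, r, s_next,
                                 alpha, gamma, epsilon, rng):
    """One step of the updates quoted above: with probability 0.5,
    update Q1 toward the expectation of Q1 under pi_2 (epsilon-greedy
    w.r.t. Q2); otherwise update Q2 symmetrically."""
    if rng.random() < 0.5:
        pi2 = epsilon_greedy_probs(Q2[s_next], epsilon)
        target = r + gamma * np.dot(pi2, Q1[s_next])
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        pi1 = epsilon_greedy_probs(Q1[s_next], epsilon)
        target = r + gamma * np.dot(pi1, Q2[s_next])
        Q2[s, a] += alpha * (target - Q2[s, a])
```

As in Double Q-learning, a fair coin flip decides which of the two tables is updated on each step, so only one update fires per transition.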
Could someone explain how these updates are obtained, using reasoning analogous to that behind Double Q-learning?