I've already read the original Double DQN paper, but I couldn't find a clear and practical explanation of how the target $y$ is computed, so here is how I interpreted the method (say I have 3 possible actions: 1, 2, 3):

  1. For each experience $e_{j}=(s_{j},a_{j},r_{j},s_{j+1})$ in the mini-batch (consider an experience where $a_{j}=2$), I compute the output of the main network for the state $s_{j+1}$, so I obtain 3 values.

  2. I look at which of the three is the highest, so $a^*=\arg\max_{a}Q(s_{j+1},a)$; let's say $a^*=1$.

  3. I use the target network to compute the value of $a^*=1$, i.e. $Q_{target}(s_{j+1},1)$.

  4. I use the value from point 3 to substitute the value in the target vector associated with the known action $a_{j}=2$, so: $Q_{target}(s_{j+1},2)\leftarrow r_{j}+\gamma Q_{target}(s_{j+1},1)$, while $Q_{target}(s_{j+1},1)$ and $Q_{target}(s_{j+1},3)$, which complete the target vector $y$, remain the same (see the sketch below).
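
In code, my interpretation of a single experience would look something like this sketch (in Python; `main_network` and `target_network` are placeholder callables that return the 3 Q-values for a given state, and `gamma` is the discount factor):

```python
import numpy as np

def build_target(experience, main_network, target_network, gamma=0.99):
    """Target vector y for one experience, following steps 1-4 above."""
    s_j, a_j, r_j, s_next = experience
    # Step 1: Q-values of s_{j+1} from the main network (3 values).
    q_main_next = main_network(s_next)
    # Step 2: greedy action according to the main network.
    a_star = np.argmax(q_main_next)
    # Step 3: evaluate that action with the target network.
    q_target_next = target_network(s_next)
    # Step 4: keep the target network's values and replace only the entry
    # of the action actually taken, a_j.
    y = np.array(q_target_next, dtype=float)
    y[a_j] = r_j + gamma * q_target_next[a_star]
    return y
```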

Is there anything wrong?

1 Answer

$$Y_{t}^{\text {DoubleDQN }} \equiv R_{t+1}+\gamma Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}\right), \boldsymbol{\theta}_{t}^{-}\right)$$

The only difference between the "original" DQN and Double DQN is that you use the estimate (online) network $Q(\cdot\,;\boldsymbol{\theta}_t)$ on the next state to *select* the action (by choosing the action with the highest Q-value).

Afterwards, you *evaluate* that action with the target network, i.e. you take the Q-value that the target network assigns to the selected action (instead of applying the argmax directly to the target network's Q-values).

About the formula

  • $\boldsymbol{\theta}_{t}^{-}$ denotes the frozen weights, so it represents the target Q network.

  • the other $\boldsymbol{\theta}_{t}$ denotes the learnable weights, so it represents the estimate (online) Q network.
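
For concreteness, here is a minimal sketch of this target computation for a batch of transitions (PyTorch-style; `q_online` corresponds to $\boldsymbol{\theta}_t$, `q_target` to $\boldsymbol{\theta}_t^-$, and the tensor names are just illustrative assumptions):

```python
import torch

def double_dqn_targets(rewards, next_states, dones, q_online, q_target, gamma=0.99):
    """Y_t = r + gamma * Q_target(s', argmax_a Q_online(s', a)) for non-terminal s'."""
    with torch.no_grad():
        # Action *selection* with the online/estimate network (theta_t).
        a_star = q_online(next_states).argmax(dim=1, keepdim=True)   # shape (B, 1)
        # Action *evaluation* with the frozen target network (theta_t^-).
        q_next = q_target(next_states).gather(1, a_star).squeeze(1)  # shape (B,)
        # No bootstrap term for terminal transitions (dones is a 0/1 float tensor).
        return rewards + gamma * (1.0 - dones) * q_next
```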
