I've already read the original Double DQN paper, but I couldn't find a clear and practical explanation of how the target $y$ is computed, so here is how I interpreted the method (say I have 3 possible actions: 1, 2, 3):

  1. For each experience $e_{j}=(s_{j},a_{j},r_{j},s_{j+1})$ in the mini-batch (consider an experience where $a_{j}=2$), I compute the output of the main network for the state $s_{j+1}$, so I obtain 3 values.

  2. I look at which of the three is the highest, so $a^*=\arg\max_{a}Q(s_{j+1},a)$; let's say $a^*=1$.

  3. I use the target network to compute the value of $a^*=1$, i.e. $Q_{target}(s_{j+1},1)$.

  4. I use the value from point 3 to substitute the value in the target vector associated with the known action $a_{j}=2$, so: $Q_{target}(s_{j+1},2)\leftarrow r_{j}+\gamma Q_{target}(s_{j+1},1)$, while $Q_{target}(s_{j+1},1)$ and $Q_{target}(s_{j+1},3)$, which complete the target vector $y$, remain the same (see the sketch below).
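
In code, my interpretation of a single experience would look something like this sketch (in Python; `main_network` and `target_network` are placeholder callables that return the 3 Q-values for a given state, and `gamma` is the discount factor):

```python
import numpy as np

def build_target(experience, main_network, target_network, gamma=0.99):
    """Target vector y for one experience, following steps 1-4 above."""
    s_j, a_j, r_j, s_next = experience
    # Step 1: Q-values of s_{j+1} from the main network (3 values).
    q_main_next = main_network(s_next)
    # Step 2: greedy action according to the main network.
    a_star = np.argmax(q_main_next)
    # Step 3: evaluate that action with the target network.
    q_target_next = target_network(s_next)
    # Step 4: keep the target network's values and replace only the entry
    # of the action actually taken, a_j.
    y = np.array(q_target_next, dtype=float)
    y[a_j] = r_j + gamma * q_target_next[a_star]
    return y
```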

Is there anything wrong?

1 Answer

$$Y_{t}^{\text {DoubleDQN }} \equiv R_{t+1}+\gamma Q\left(S_{t+1}, \underset{a}{\operatorname{argmax}} Q\left(S_{t+1}, a ; \boldsymbol{\theta}_{t}\right), \boldsymbol{\theta}_{t}^{-}\right)$$

The only difference between the "original" DQN and Double DQN is that you use the estimate (online) network $Q(\cdot\,;\boldsymbol{\theta}_t)$ on the next state to *select* the action (by choosing the action with the highest Q-value).

Afterwards, you *evaluate* that action with the target network, i.e. you take the Q-value that the target network assigns to the selected action (instead of applying the argmax directly to the target network's Q-values).

About the formula

  • $\boldsymbol{\theta}_{t}^{-}$ denotes the frozen weights, so it represents the target Q network.

  • the other $\boldsymbol{\theta}_{t}$ denotes the learnable weights, so it represents the estimate (online) Q network.
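
For concreteness, here is a minimal sketch of this target computation for a batch of transitions (PyTorch-style; `q_online` corresponds to $\boldsymbol{\theta}_t$, `q_target` to $\boldsymbol{\theta}_t^-$, and the tensor names are just illustrative assumptions):

```python
import torch

def double_dqn_targets(rewards, next_states, dones, q_online, q_target, gamma=0.99):
    """Y_t = r + gamma * Q_target(s', argmax_a Q_online(s', a)) for non-terminal s'."""
    with torch.no_grad():
        # Action *selection* with the online/estimate network (theta_t).
        a_star = q_online(next_states).argmax(dim=1, keepdim=True)   # shape (B, 1)
        # Action *evaluation* with the frozen target network (theta_t^-).
        q_next = q_target(next_states).gather(1, a_star).squeeze(1)  # shape (B,)
        # No bootstrap term for terminal transitions (dones is a 0/1 float tensor).
        return rewards + gamma * (1.0 - dones) * q_next
```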
