
I was going through this implementation of DQN and I see that on lines 124 and 125 two different Q networks are initialized. From my understanding, one network predicts the appropriate action and the second network predicts the target Q values used to compute the Bellman error.

Why can we not just make one single network that predicts the Q value and use it for both cases? My best guess is that this has been done to reduce the computation time, because otherwise we would have to find the Q value for each action and then select the best one. Is this the only reason? Am I missing something?

amitection

2 Answers


My best guess is that this has been done to reduce the computation time, because otherwise we would have to find the Q value for each action and then select the best one.

It has no real impact on computation time, other than a slight overhead (due to the extra memory used by the two networks). You could cache the results of the target network, I suppose, but it probably would not be worth it for most environments, and I have not seen an implementation that does so.

Am I missing something?

It has to do with the stability of the Q-learning algorithm when using function approximation (i.e. the neural network). Using a separate target network, updated every so many steps with a copy of the latest learned parameters, helps keep the runaway bias caused by bootstrapping from dominating the system numerically, which would cause the estimated Q values to diverge.

Imagine one of the data points (at $s, a, r, s'$) causes a currently poor over-estimate of $q(s', a')$ to get worse. Maybe $s', a'$ has not even been visited yet, or the values of $r$ seen so far are higher than average, just by chance. If a sample of $(s, a)$ cropped up multiple times in experience replay, it would get worse again each time, because the update to $q(s,a)$ is based on the TD target $r + \max_{a'} q(s',a')$. Fixing the target network limits the damage that such over-estimates can do, giving the learning network time to converge and lose more of its initial bias.

In this respect, using a separate target network has a very similar purpose to experience replay. It stabilises an algorithm that otherwise has problems converging.
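To make the mechanism concrete, here is a minimal sketch, not the linked implementation: PyTorch is assumed, and names such as `q_net`, `target_net` and `target_update_every` are made up for illustration. The learned network is trained on targets produced by a frozen copy, and that copy is refreshed every so many steps.

```python
# Minimal sketch, assuming PyTorch; q_net, target_net and the hyperparameters
# below are illustrative, not taken from the linked implementation.
import copy

import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # learned network
target_net = copy.deepcopy(q_net)                                      # frozen copy
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, target_update_every, step = 0.99, 1000, 0

def train_step(s, a, r, s_next, done):
    """One DQN update on a batch of transitions."""
    global step
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a) from the learned net
    with torch.no_grad():                                   # targets come from the fixed copy
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    step += 1
    if step % target_update_every == 0:                     # every so many steps, copy the
        target_net.load_state_dict(q_net.state_dict())      # latest parameters into the target
```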

It is also possible to have DQN with "double learning" to address a separate issue: Maximisation bias. In that case you may see DQN implementations with 4 neural networks.
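As an aside, the most common "double" variant (Double DQN) reuses the two networks already present: the online network selects the greedy action and the target network evaluates it. A hedged sketch of that target computation (PyTorch assumed, the function name is made up):

```python
# Illustrative Double DQN target (assumes PyTorch); not from any specific implementation.
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Online net picks the greedy action, target net evaluates it."""
    with torch.no_grad():
        best_a = q_net(s_next).argmax(dim=1, keepdim=True)        # action selection
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)  # action evaluation
        return r + gamma * (1 - done) * q_eval
```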

Neil Slater

Here's another explanation, from Sergey Levine's CS285 lecture slides.

The update of DQN looks very much like SGD (stochastic gradient descent):

  1. take some action $\mathbf{a}_i$ and observe $(\mathbf{s}_i, \mathbf{a}_i, \mathbf{s}_i', r_i)$
  2. $\mathbf{y}_i = r(\mathbf{s}_i, \mathbf{a}_i) + \gamma \max_{\mathbf{a}'} Q_\phi(\mathbf{s}_i', \mathbf{a}')$
  3. $\phi \leftarrow \phi - \alpha \frac{d Q_\phi}{d \phi}\left(\mathbf{s}_i, \mathbf{a}_i\right)\left(Q_\phi\left(\mathbf{s}_i, \mathbf{a}_i\right) - \mathbf{y}_i\right)$
    • This step just looks like an SGD step on the objective $\frac{1}{2}\left\|Q_\phi\left(\mathbf{s}_i, \mathbf{a}_i\right) - \mathbf{y}_i\right\|^2$, right? And SGD usually converges, right?

However, if you look carefully at the third step, you will find that $\phi$ appears not only in $Q_\phi$, but also in $\mathbf{y}_i$. Hence, the correct SGD update would instead be:

$$ \phi \leftarrow \phi - \alpha \frac{d \left(Q_\phi\left(\mathbf{s}_i, \mathbf{a}_i\right) - \mathbf{y}_i\right)}{d \phi}\left(Q_\phi\left(\mathbf{s}_i, \mathbf{a}_i\right) - \mathbf{y}_i\right) $$

  • This is called the residual gradient method, but it has poor numerical stability, and in practice it works much worse than even vanilla Q-learning.
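The difference is easy to see in code. A toy sketch (PyTorch assumed; the network and tensors are placeholders): the residual-gradient loss differentiates through $\mathbf{y}_i$ as well, whereas the usual DQN update treats $\mathbf{y}_i$ as a constant.

```python
# Toy contrast between the residual gradient and the usual semi-gradient update
# (PyTorch assumed; the network and tensors are placeholders for illustration).
import torch
import torch.nn as nn

q = nn.Linear(4, 2)                                  # toy Q network: 4-dim states, 2 actions
s = torch.randn(8, 4)
a = torch.randint(0, 2, (8, 1))
r = torch.randn(8)
s_next = torch.randn(8, 4)
gamma = 0.99

q_sa = q(s).gather(1, a).squeeze(1)                  # Q_phi(s_i, a_i)
y = r + gamma * q(s_next).max(dim=1).values          # y_i still depends on phi here

residual_loss = 0.5 * (q_sa - y).pow(2).mean()                # gradient also flows through y_i
semi_gradient_loss = 0.5 * (q_sa - y.detach()).pow(2).mean()  # y_i treated as a fixed target
```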

So, since vanilla Q-learning is not gradient descent and also lacks other properties that would guarantee convergence (see "Non-tabular value function learning" in the "Value Functions in Theory" part of another lecture's slides), it can diverge.


So, how do we fix this? What we want is to make the third step an actual gradient step, but without using the residual gradient.

How to do this? Just replace every $\phi$ in $\mathbf{y}_i$ with something else, so that $\mathbf{y}_i$ is not involved when taking the gradient with respect to $\phi$.

What to replace it with? We use an old set of parameters $\phi'$ in $\mathbf{y}_i$.

Thus, the new update of DQN looks like this:

  1. $\phi' \leftarrow \phi$
  2. take some action $\mathbf{a}_i$ and observe $(\mathbf{s}_i, \mathbf{a}_i, \mathbf{s}_i', r_i)$
  3. $\mathbf{y}_i = r(\mathbf{s}_i, \mathbf{a}_i) + \gamma \max_{\mathbf{a}'} Q_{\phi'}(\mathbf{s}_i', \mathbf{a}')$
    • Note we have replaced the $\phi$ in $y_i$ with $\phi'$
  4. $\phi \leftarrow \phi-\alpha \frac{d Q_\phi}{d \phi}\left(\mathbf{s}_i, \mathbf{a}_i\right)\left(Q_\phi\left(\mathbf{s}_i, \mathbf{a}_i\right)-\mathbf{y}_i\right)$
  5. Repeat steps (2)-(4) $K$ times, then go back to (1) and update $\phi'$. A sketch of this loop structure follows below.
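A compact sketch of that loop structure (PyTorch assumed; the toy network, `sample_transition` and the constants are placeholders for illustration): $\phi'$ is refreshed only in the outer loop, and inside the inner loop $\mathbf{y}_i$ is computed from $\phi'$ alone.

```python
# Sketch of the update scheme above (assumes PyTorch); sample_transition, the toy
# network and the constants are placeholders for illustration only.
import copy

import torch
import torch.nn as nn

phi = nn.Linear(4, 2)                                # Q_phi over 4-dim states, 2 actions
optimizer = torch.optim.SGD(phi.parameters(), lr=1e-3)
gamma, K, outer_iters = 0.99, 100, 10

def sample_transition():
    """Placeholder for interacting with an environment."""
    return torch.randn(1, 4), torch.randint(0, 2, (1, 1)), torch.randn(1), torch.randn(1, 4)

for _ in range(outer_iters):
    phi_prime = copy.deepcopy(phi)                   # step (1): phi' <- phi
    for _ in range(K):                               # steps (2)-(4), repeated K times
        s, a, r, s_next = sample_transition()        # step (2)
        with torch.no_grad():
            y = r + gamma * phi_prime(s_next).max(dim=1).values  # step (3): y_i uses phi'
        loss = 0.5 * (phi(s).gather(1, a).squeeze(1) - y).pow(2).sum()
        optimizer.zero_grad()
        loss.backward()                              # step (4): gradient only through Q_phi
        optimizer.step()
```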