
I've been reading Google's DeepMind Atari paper and I'm trying to understand how to implement experience replay.

Do we update the parameters $\theta$ of function $Q$ once for all the samples of the minibatch, or do we do that for each sample of the minibatch separately?

According to the following pseudocode from the paper, the gradient descent step is performed on the loss term for the $j$-th sample. However, I have seen other papers (which refer to this paper) say that we first compute the sum of the loss terms over all samples in the minibatch and then perform gradient descent on this sum.

[Pseudocode of the deep Q-learning algorithm from the paper.]


1 Answer


Gradient descent should be performed using the sum (or average) of the losses in the minibatch.

This is in fact also how I read the pseudocode in your question, though I understand it can be confusing. Note that, in the pseudocode, $j$ is not specified in detail. They do not, for example, have $j$ ranging from $0$ to the size of the minibatch.

When they say:

Sample random minibatch of transitions $\left(\phi_{j}, a_{j}, r_{j}, \phi_{j+1}\right)$ from $D$

they mean multiple transitions in the minibatch (a minibatch size of $1$ being a special case), and they use the index $j$ to collectively refer to the entire set of indices in that randomly sampled minibatch. It's not one particular number or index; $j$ refers to a set of indices. When subsequent lines of the pseudocode do something with index $j$, they do it for all indices in that set.
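
As a concrete illustration, here is a minimal sketch of what such a minibatch update typically looks like. This is written in PyTorch, which the paper does not use; the names `q_net`, `target_net`, `optimizer`, and the layout of `batch` are assumptions for illustration, not the paper's actual code.

```python
import torch
import torch.nn as nn


def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    # batch is an assumed tuple of tensors for the sampled minibatch:
    # states [B, ...], actions [B], rewards [B], next_states [B, ...], dones [B]
    states, actions, rewards, next_states, dones = batch

    # Q(phi_j, a_j; theta) for every transition j in the minibatch
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # y_j = r_j + gamma * max_a' Q(phi_{j+1}, a'; theta^-),
    # with the bootstrap term zeroed out for terminal transitions
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    # One scalar loss over the whole minibatch: the mean of the
    # per-sample squared errors. A single gradient step is taken on it.
    loss = nn.functional.mse_loss(q_values, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Whether you take the sum or the mean over the minibatch only rescales the gradient by the minibatch size, which in practice is absorbed into the learning rate; the important point is that there is one gradient step per minibatch, not one per sample.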
