Deep Q-Learning: why don't we use mini-batches during experience replay?

Question

In examples and tutorials about DQN, I've often noticed that during the experience replay (training) phase people tend to use per-sample stochastic gradient descent / online learning (e.g. link1, link2).

# Sample a minibatch from the memory
minibatch = random.sample(self.memory, batch_size)
# Extract information from each memory
for state, action, reward, next_state, done in minibatch:
    # If done, the target is just the reward
    target = reward
    if not done:
        # Predict the future discounted reward
        target = reward + self.gamma * \
                 np.amax(self.model.predict(next_state)[0])
    # Make the agent approximately map
    # the current state to the future discounted reward.
    # We'll call that target_f
    target_f = self.model.predict(state)
    target_f[0][action] = target

Why can't they use mini-batches instead? I'm new to RL, but in deep learning people tend to use mini-batches, as they result in a more stable gradient. Doesn't the same principle apply to RL problems? Is the randomness/noise introduced actually beneficial to the learning process? Am I missing something, or are these sources all wrong?

Note:

Not all sources rely on per-sample stochastic gradient descent: keras-rl, for example, seems to use mini-batches (https://github.com/keras-rl/keras-rl/blob/master/rl/agents/dqn.py)

score 5 · Accepted Answer · answered Nov 05 '18 at 10:09

DQN "library" implementations that I have seen do use mini-batches to train, and I would generally recommend this, as it usually strikes a reasonable balance between number of weight updates and accuracy of the gradients.
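As a rough illustration of the "accuracy of the gradients" half of that trade-off, here is a toy sketch. It is not DQN: the one-parameter regression problem and the names `grad_estimate` and `grad_std` are made up for this example. It only measures how much the gradient estimate varies between random mini-batches of size 1 and size 32.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (an assumption for illustration): y = 3*x + noise.
x = rng.normal(size=10_000)
y = 3.0 * x + rng.normal(scale=0.5, size=10_000)
w = 0.0  # current weight of the model y_hat = w * x


def grad_estimate(batch_size):
    """Gradient of the mean squared error w.r.t. w on one random mini-batch."""
    idx = rng.integers(0, len(x), size=batch_size)
    err = w * x[idx] - y[idx]
    return np.mean(2.0 * err * x[idx])


def grad_std(batch_size, trials=2000):
    """Spread of the gradient estimate across many random mini-batches."""
    return np.std([grad_estimate(batch_size) for _ in range(trials)])


std_1 = grad_std(1)
std_32 = grad_std(32)
print("std with batch size 1: ", std_1)
print("std with batch size 32:", std_32)
```

The size-32 estimate should be noticeably less noisy, at the price of one weight update per 32 samples instead of 32 updates.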

In your first link, and the code excerpt, the sample list is literally called minibatch. However, the developer then goes on to make a series of individual SGD steps per item within that sample.

Why can't they use mini-batches instead?

They can, and it works just fine. In my DQN explorations I tend to use a single mini-batch per time step, with the Adam optimiser to accelerate learning. I also calculate the TD target values for the whole mini-batch at once.
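A minimal sketch of what a batched version of the loop in the question could look like, using NumPy only. The function name `batch_td_targets` and the stand-in arrays are assumptions for illustration; in the question's setup, `q_current` and `q_next` would presumably come from `self.model.predict(...)` called on stacked states and next states.

```python
import numpy as np


def batch_td_targets(q_current, q_next, actions, rewards, dones, gamma=0.99):
    """Build Q-learning regression targets for a whole mini-batch at once.

    q_current: (B, n_actions) network output for the sampled states
    q_next:    (B, n_actions) network output for the sampled next states
    actions:   (B,) integer actions that were taken
    rewards:   (B,) rewards received
    dones:     (B,) booleans, True where the episode ended
    """
    # Bootstrap from the best next action, except on terminal transitions.
    td_target = rewards + gamma * np.max(q_next, axis=1) * (~dones)
    # Only the output for the action taken is changed; the other outputs
    # keep their predicted values and so contribute zero error.
    targets = q_current.copy()
    targets[np.arange(len(actions)), actions] = td_target
    return targets


# Stand-in values in place of real network predictions.
q_current = np.array([[1.0, 2.0], [0.5, 0.1], [0.0, 0.0]])
q_next = np.array([[0.0, 4.0], [1.0, 3.0], [2.0, 2.0]])
actions = np.array([0, 1, 1])
rewards = np.array([1.0, 0.0, -1.0])
dones = np.array([False, False, True])

targets = batch_td_targets(q_current, q_next, actions, rewards, dones, gamma=0.5)
print(targets)
```

With a Keras model like the one in the question, the resulting `targets` array could then be passed to a single `model.fit(states, targets, epochs=1, verbose=0)` or `model.train_on_batch(states, targets)` call, which gives one gradient step for the whole mini-batch instead of one per item.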

I'm new to RL, but in deep learning people tend to use mini-batches, as they result in a more stable gradient. Doesn't the same principle apply to RL problems?

It does, but I don't know whether mini-batch training is as sample-efficient as it could be. I prefer to use mini-batches and more complex optimisers because this runs faster (for a given number of replay training samples per time step), and I can run more time steps of the environment in less time.

It is possible, for instance, that a replay sample size of 32 per time step, split into 4 mini-batch updates of 8 each, would be the most sample-efficient approach for a specific problem (because it makes more updates from the given samples), but it might not be the most CPU-efficient.
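The split described above could be sketched as follows. The helper name `replay_minibatches` and the integer stand-in memory are hypothetical; a real replay memory would hold (state, action, reward, next_state, done) tuples.

```python
import random


def replay_minibatches(memory, sample_size=32, minibatch_size=8):
    """Draw one replay sample and split it into equal-sized mini-batches."""
    sample = random.sample(memory, sample_size)
    return [sample[i:i + minibatch_size]
            for i in range(0, sample_size, minibatch_size)]


# Stand-in replay memory: integers instead of transition tuples.
memory = list(range(1000))
batches = replay_minibatches(memory)

# Each mini-batch would get its own gradient step,
# i.e. 4 weight updates per environment time step here.
for batch in batches:
    print(len(batch), batch[:3])
```

Whether 4 updates of 8 beat 1 update of 32 is something to measure on the specific problem; the sketch only shows the bookkeeping.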

Is the randomness/noise introduced actually beneficial to the learning process?

I don't think so, but there are plenty of other tuning issues to consider in RL.

Am I missing something, or are these sources all wrong?

They are not necessarily wrong, but may be CPU inefficient.

More likely in my opinion, your sources are explaining algorithms in principle, and did not want to add too many layers of optimisation, or too many components that may need tuning to work. There is a lot going on in a typical DQN agent, and having many performance optimisations can hide the details that the person is trying to teach.
