
I recently started working on a control problem using a Deep Q Network as proposed by DeepMind (https://arxiv.org/abs/1312.5602). Initially I implemented it without Experience Replay (ER), and the results were very satisfying. After implementing ER, however, the results were noticeably worse, so I started experimenting with BATCH SIZE and MEMORY CAPACITY.

  • (1) I noticed that if I set BATCH SIZE = 1 and MEMORY CAPACITY = 1, i.e. effectively the same as the normal online learning I did before, the results are (almost) the same as initially.

  • (2) If I increase CAPACITY and BATCH SIZE, e.g. CAPACITY = 2000 and BATCH SIZE = 128, the Q values for all actions tend to converge to very similar negative values.

A small negative reward of -1 is received for every state transition, except for transitions into the desired state, which receive a +10 reward. My gamma is 0.7. Every state is discrete, and after taking action a the environment can transition into any of X possible states, each with significant probability.

Receiving a positive reward is very rare, since reaching a desired state can take a long time. Thus, when sampling 128 experiences, only a small number of them (if we are 'lucky') will carry a positive reward.

Since in mini-batch training we average the loss over all the samples and then update the DQN, I was wondering whether the positive rewards can effectively become meaningless because they are 'dominated' by the negative ones. Would this result in much slower convergence to the actual values, and would it also explain the convergence to similar negative values described in (2)? Is this something to be expected? I am looking to implement Prioritised ER as a potential solution to this, but is there something wrong in the above logic?
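As a rough back-of-the-envelope illustration of this concern (the count of one positive sample per batch is an assumption for the sake of the example, not something I measured in my environment):

```python
# Illustration of the 'dilution' effect when the loss is averaged over a
# mini-batch; assumes a single lucky +10 sample among 128 transitions.
BATCH_SIZE = 128

rewards = [10.0] + [-1.0] * (BATCH_SIZE - 1)   # one +10, the rest -1
mean_reward = sum(rewards) / BATCH_SIZE
print(mean_reward)                              # (10 - 127) / 128 ≈ -0.91

# With a mean-reduced loss, each sample's gradient is weighted by
# 1 / BATCH_SIZE, so the rare +10 target has correspondingly little
# influence on a single update.
```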

I hope this makes sense. Please forgive me if I have made a wrong assumption above, as I am new to the field.

Edit: The problem was indeed that rewards were found so rarely that they were almost never sampled, especially at the beginning of training, which in turn resulted in very slow convergence to the actual Q values. The problem was successfully solved using Prioritised ER, but I believe any form of careful stratified sampling would give good results.
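For reference, here is a minimal sketch of proportional prioritized sampling along the lines of Schaul et al.'s Prioritized Experience Replay; it is not my exact implementation, the class and parameter names are my own, and the importance-sampling correction is omitted for brevity:

```python
import numpy as np

class PrioritizedReplay:
    """Simplified proportional prioritized replay (no importance-sampling weights)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        # Drop the oldest transition once the buffer is full.
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        # New transitions get a non-zero priority so they are sampled at least once.
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        # Sample indices with probability proportional to priority.
        p = np.asarray(self.priorities)
        p = p / p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        return idx, [self.data[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        # After a learning step, refresh priorities with the new |TD errors|.
        for i, e in zip(idx, td_errors):
            self.priorities[i] = (abs(e) + self.eps) ** self.alpha
```

Transitions with a large TD error (for example the rare +10 ones early in training) are then sampled far more often than under uniform sampling.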

1 Answer


What you describe sounds to me like a problem inherent to off-policy learning, and your interpretation of what may be happening seems reasonable to me.

When you implemented experience replay with capacity = 1 and batch_size = 1, you said you got “almost” the same results as before. There are probably two reasons for this. One is simply the random initialization of the networks: as you train, you will potentially converge around the same point but not exactly to it (plus the stochastic nature of generating the training samples). The other may be what has already occurred each time you update the target network weights, so your error terms may differ slightly at each point in time but converge asymptotically.

This setting essentially follows on-policy training: every training sample follows a trajectory (the states are all sequential). Eventually, given enough time, this trajectory will reach a goal state and be rewarded. The reward will be propagated through the network, and the backed-up values of other states will be updated as well. So essentially each episode ends with a reward (I presume), and the average reward per episode is proportional to the average length of an episode.

When increasing capacity and batch size above 1, we move to true off-policy training. When sampling, the updates no longer follow a trajectory, and as such there is no guarantee that we EVER sample any positive reward (although I’m sure you will at some point). So, if we are averaging over the rewards in the updates, the average reward per “episode” is no longer proportional to the average episode length (and the idea of an episode starts to lose some of its relevance, since we aren’t following a trajectory). Thus the effect of the reward on all other states is no longer in proportion to what it was when following an on-policy trajectory.
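To put a rough number on that, assume (purely for illustration) that only $k = 5$ of the $N = 2000$ stored transitions carry the +10 reward and you sample batches of $B = 128$ uniformly without replacement:

$$P(\text{no positive reward in a batch}) = \frac{\binom{N-k}{B}}{\binom{N}{B}} \approx \left(1 - \frac{k}{N}\right)^{B} = \left(1 - \frac{5}{2000}\right)^{128} \approx 0.72$$

so roughly 7 out of 10 mini-batches would contain no positive reward at all, and the batches that do usually contain just one.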

You could try some hacks to investigate: for example, try making your positive reward greater than batch_size. Or, if you have some statistics on how often your goal state is being sampled, scale the reward up by that factor (or, if you know the size of your state space, make your reward greater than that).
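For concreteness, here is a tiny sketch of those hacks; the batch size is taken from your question, but the goal-sampling frequency below is an assumed value:

```python
# Hypothetical reward-scaling hacks; goal_sample_rate is an assumed number.
BATCH_SIZE = 128
raw_goal_reward = 10.0

# Hack 1: make the positive reward larger than the batch size, so a single
# occurrence still stands out in the averaged loss.
scaled_goal_reward = float(BATCH_SIZE + 1)

# Hack 2: scale the reward by the inverse of how often goal transitions
# appear in the replay memory, if you can estimate that frequency.
goal_sample_rate = 0.0025          # e.g. 5 goal transitions per 2000 stored
scaled_goal_reward = raw_goal_reward / goal_sample_rate
```

Keep in mind that scaling the reward also scales the magnitude of the TD targets, so you may need to adjust the learning rate accordingly.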

This blog post offers some more elegant refinements, like the prioritized ER you mentioned. But it would be interesting to see whether scaling up your reward can overcome the effect of averaging over many negative rewards.

Hanzy