
I am training a model using A2C with Stable Baselines 2. When I increased the number of timesteps, I noticed that the episode rewards seem to reset (see the attached plot). I don't understand where these sudden decays or resets could come from, and I am looking for practical experience or pointers to theory on what these resets could imply.

Plot of episode rewards

qwertz
  • Could you please add some more information, e.g. what environment you are using, what the reward space for that env is, and what exactly the plot is showing? – pi-tau Aug 14 '23 at 09:31

2 Answers


I've seen this when training with a deep Q-network (DQN). Some tips that may help you get past this problem:

  • Use a memory mechanism such as a replay buffer. Sometimes the agent forgets what it has already learned; a replay buffer keeps transitions from several episodes ago so the agent can keep training on them (see the sketch after this list).
  • Something else that worked for me was changing the optimizer. As the DQN paper notes, RMSprop works well for training neural-network-based agents.
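
Purely as an illustration of the replay-buffer idea (not code from your setup), a minimal sketch in Python; the capacity and batch size are arbitrary placeholders:

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size memory of past transitions, so the agent keeps
        training on experience from earlier episodes instead of only
        the most recent ones."""

        def __init__(self, capacity=100_000):
            # Oldest transitions are dropped automatically once full.
            self.buffer = deque(maxlen=capacity)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # Random mini-batch mixing old and recent experience.
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)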

If these tips don't help, please give more information about your agent.


I ran into this myself and learned that observation normalization is just as important as reward normalization.

The y-axis of your plot shows that this environment's reward is much higher than 1.0.

Try rescaling your environment's reward so that it always falls within the range -1 to 1, for example with a reward wrapper like the sketch below.
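
A rough sketch of that rescaling, assuming a Gym environment; the wrapper name, the clipping bounds, and the CartPole-v1 env id are placeholders, not taken from your question:

    import gym
    import numpy as np

    class ClipReward(gym.RewardWrapper):
        """Clip the environment reward into [low, high] before the agent sees it."""

        def __init__(self, env, low=-1.0, high=1.0):
            super().__init__(env)
            self.low, self.high = low, high

        def reward(self, reward):
            return float(np.clip(reward, self.low, self.high))

    env = ClipReward(gym.make("CartPole-v1"))  # substitute your own env id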

Reward normalization matters because the reward scale directly affects the scale of the policy loss. If you look at your policy loss you will notice that it is also much higher than 1.0, and a policy loss that large makes training the neural network unstable.
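
Since you are on Stable Baselines 2, its VecNormalize wrapper handles both observation and reward normalization; a minimal sketch (the env id and hyperparameters here are placeholders):

    import gym
    from stable_baselines import A2C
    from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

    env = DummyVecEnv([lambda: gym.make("CartPole-v1")])  # substitute your env
    # Running-mean/std normalization of observations and rewards, plus clipping,
    # keeps the policy loss on a sane scale.
    env = VecNormalize(env, norm_obs=True, norm_reward=True,
                       clip_obs=10.0, clip_reward=10.0)

    model = A2C("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=100_000)

Note that the VecNormalize running statistics need to be saved and loaded along with the model if you later evaluate or resume training.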

Pyrolistical