
I am training a model using A2C with Stable Baselines 2. When I increased the number of timesteps, I noticed that the episode rewards seem to reset (see the attached plot). I don't understand where these sudden decays or resets could come from, and I am looking for practical experience or pointers to theory on what these resets could imply.

Plot of episode rewards

qwertz
  • Could you please add some more information, e.g. what environment you are using, what the reward space for that env is, and what exactly the plot is showing? – pi-tau Aug 14 '23 at 09:31

2 Answers


I've seen this when training with a deep Q-network (DQN). Some tips that may help you get past this problem:

  • Use a memory mechanism such as a replay buffer. Sometimes the agent forgets what it has already learned; a replay buffer keeps transitions from several episodes ago so the agent can keep training on them (see the sketch after this list).
  • Something else that worked for me was changing the optimizer. As the DQN paper notes, RMSprop works well for training neural-network-based agents.
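
Purely as an illustration of the replay-buffer idea (not code from your setup), a minimal sketch in Python; the capacity and batch size are arbitrary placeholders:

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size memory of past transitions, so the agent keeps
        training on experience from earlier episodes instead of only
        the most recent ones."""

        def __init__(self, capacity=100_000):
            # Oldest transitions are dropped automatically once full.
            self.buffer = deque(maxlen=capacity)

        def add(self, state, action, reward, next_state, done):
            self.buffer.append((state, action, reward, next_state, done))

        def sample(self, batch_size=32):
            # Random mini-batch mixing old and recent experience.
            return random.sample(self.buffer, batch_size)

        def __len__(self):
            return len(self.buffer)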

If these tips don't help, please give more information about your agent.


I ran into this myself and learned that observation normalization is just as important as reward normalization.

The y-axis of your plot shows that this environment's reward is much higher than 1.0.

Try rescaling your environment's reward so that it always falls within the range -1 to 1, for example with a reward wrapper like the sketch below.
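
A rough sketch of that rescaling, assuming a Gym environment; the wrapper name, the clipping bounds, and the CartPole-v1 env id are placeholders, not taken from your question:

    import gym
    import numpy as np

    class ClipReward(gym.RewardWrapper):
        """Clip the environment reward into [low, high] before the agent sees it."""

        def __init__(self, env, low=-1.0, high=1.0):
            super().__init__(env)
            self.low, self.high = low, high

        def reward(self, reward):
            return float(np.clip(reward, self.low, self.high))

    env = ClipReward(gym.make("CartPole-v1"))  # substitute your own env id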

Reward normalization matters because the reward scale directly affects the scale of the policy loss. If you look at your policy loss you will notice that it is also much higher than 1.0, and a policy loss that large makes training the neural network unstable.
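
Since you are on Stable Baselines 2, its VecNormalize wrapper handles both observation and reward normalization; a minimal sketch (the env id and hyperparameters here are placeholders):

    import gym
    from stable_baselines import A2C
    from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize

    env = DummyVecEnv([lambda: gym.make("CartPole-v1")])  # substitute your env
    # Running-mean/std normalization of observations and rewards, plus clipping,
    # keeps the policy loss on a sane scale.
    env = VecNormalize(env, norm_obs=True, norm_reward=True,
                       clip_obs=10.0, clip_reward=10.0)

    model = A2C("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=100_000)

Note that the VecNormalize running statistics need to be saved and loaded along with the model if you later evaluate or resume training.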

Pyrolistical