I am training an agent with a Stable-Baselines3 algorithm on a custom environment. During training, I want a callback that, every $N$ steps of the learning process, takes the current model, runs it on my environment for $M$ episodes, and logs the average reward received, using SB3's evaluate_policy with deterministic set to True.

I have read that you should evaluate the agent on a separate validation environment, not on the same environment instance that the agent trains on. For example, the SB3 documentation (https://stable-baselines3.readthedocs.io/en/v1.0/guide/rl_tips.html#how-to-evaluate-an-rl-algorithm) says:

"Because most algorithms use exploration noise during training, you need a separate test environment to evaluate the performance of your agent at a given time."

I don't understand the reasoning for this. I don't see why exploration noise affects which environment I should use, and I can't figure out why it's not possible to use the same instance of the environment class when evaluating the model. No variable in my environment is specific to training or testing.

Is it fine to evaluate on the same environment instance that the model is trained on, meaning I'm misinterpreting the general advice, or will my conclusions be wrong if I do it?

I am asking because I can only have one environment instance running at any one time.

nbro

1 Answer

I assume this is because you don't want to modify the state of your training environment while evaluating. For example, if an evaluation triggers halfway through a training episode, the evaluation has to call env.reset(), which discards the in-progress training episode and leaves the training loop holding an observation that no longer matches the environment's state. It might be possible to save the state of your training environment before evaluation and restore it afterwards, but this is much more complicated than just having separate training and evaluation environments.
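
To make the problem concrete, here is a minimal sketch in plain Python. It does not use Stable-Baselines3. CounterEnv, evaluate and run are hypothetical names invented for this illustration, and the environment is a toy whose whole state is a step counter. The sketch interrupts a training episode after three steps, runs an evaluation, and then compares the observation the training loop is still holding with the actual state of the training environment.

```python
class CounterEnv:
    """Hypothetical toy environment: the state is a step counter and an
    episode ends after `horizon` steps. Not an SB3 or Gym class."""

    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 1.0, done  # observation, reward, done


def evaluate(env, n_episodes):
    """Run `n_episodes` full episodes from reset and return the mean return."""
    returns = []
    for _ in range(n_episodes):
        env.reset()  # this is the call that clobbers a shared training env
        done, total = False, 0.0
        while not done:
            _, reward, done = env.step(action=0)
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)


def run(shared):
    """Interrupt a training episode after 3 steps, evaluate, and return
    (observation held by the training loop, actual training env state)."""
    train_env = CounterEnv()
    eval_env = train_env if shared else CounterEnv()

    last_obs = train_env.reset()
    for _ in range(3):  # training is now halfway through an episode
        last_obs, _, _ = train_env.step(action=0)

    evaluate(eval_env, n_episodes=2)  # evaluation triggers mid-episode
    return last_obs, train_env.t


print("shared instance:  ", run(shared=True))   # held obs and env state differ
print("separate instance:", run(shared=False))  # held obs and env state agree
```

With a shared instance, the training loop would continue from an observation that the environment is no longer in. With a separate instance, the training episode is untouched. If only one instance can exist, the evaluation would have to run at an episode boundary, or the training loop would have to reset the environment and refresh its stored observation afterwards. In Stable-Baselines3 itself, EvalCallback takes a separate eval_env argument for this purpose.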