I am training an agent with a Stable Baselines3 algorithm on a custom environment. During training, I want a callback so that every $N$ steps of the learning process, I take the current model, run it on my environment $M$ times, and log the average reward received, using SB3's `evaluate_policy` with `deterministic=True`.
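To make the setup concrete, the evaluation step I have in mind boils down to roughly the following (a self-contained toy sketch of what `evaluate_policy` computes; `ToyEnv` and the lambda policy are placeholders, not my actual environment or model):

```python
class ToyEnv:
    """Minimal Gym-like stand-in environment (hypothetical, for illustration)."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # observation

    def step(self, action):
        self.t += 1
        done = self.t >= 3  # fixed-length 3-step episodes
        return 0, 1.0, done, {}  # obs, reward, done, info


def evaluate(policy, env, n_eval_episodes):
    """Run the policy for n_eval_episodes and return the mean episode return,
    roughly what evaluate_policy(..., deterministic=True) reports."""
    returns = []
    for _ in range(n_eval_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            action = policy(obs)  # deterministic action, no exploration noise
            obs, reward, done, _ = env.step(action)
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)


mean_reward = evaluate(lambda obs: 0, ToyEnv(), n_eval_episodes=5)
print(mean_reward)  # each episode lasts 3 steps with reward 1.0, so 3.0
```

The question below is about which environment *instance* should be passed into this loop.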
I have read that you should run the agent on a separate validation environment rather than on the same instance of the environment that the agent trains on, e.g. here: https://stable-baselines3.readthedocs.io/en/v1.0/guide/rl_tips.html#how-to-evaluate-an-rl-algorithm
> Because most algorithms use exploration noise during training, you need a separate test environment to evaluate the performance of your agent at a given time.
I don't understand the reasoning here. I don't see why exploration noise should affect which environment instance I evaluate on, and I can't figure out why it's not possible to use the same instance of the environment class when evaluating the model. My environment has no variable that distinguishes training from testing.
Is evaluating on the same environment instance the model is trained on fine, and am I misinterpreting the general advice, or will my conclusions be wrong if I do it?
I'm asking because I can only have one environment instance running at any one time.