I trained a PPO model while tracking its performance every 20k steps using the EvalCallback wrapper:
import torch as t
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env

vec_env = make_vec_env(env_id=env_id, n_envs=1)
policy_kwargs = dict(activation_fn=t.nn.Tanh, net_arch=dict(pi=[64, 64], vf=[64, 64]))

eval_env = make_vec_env(env_id=env_id)
eval_callback = EvalCallback(eval_env, best_model_save_path=f"./{check_root_dir}/{run}/{mod}",
                             log_path=f"./{check_root_dir}/{run}/{mod}", eval_freq=eval_freq,
                             deterministic=True, render=True, n_eval_episodes=10)

device = "cuda" if t.cuda.is_available() else "cpu"
model = PPO("MultiInputPolicy", vec_env, verbose=0, tensorboard_log=dir, policy_kwargs=policy_kwargs,
            ent_coef=0.0055,
            vf_coef=0.6,
            gamma=0.8,
            learning_rate=0.001,
            device=device)  # "MultiInputPolicy" because the env has a Dict observation space
model.learn(total_timesteps=2_000_000, callback=eval_callback)
After n evaluation rounds (and the corresponding EvalCallback calls) I get a best mean_reward of -6.44 (according to TensorBoard and evaluations.npz), and the best model is automatically saved to the specified path.
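The -6.44 can be cross-checked directly from evaluations.npz (a minimal sketch; I'm assuming the default EvalCallback log format with timesteps, results and ep_lengths arrays, and the log path from the training script above):

import numpy as np

# Load the evaluation log written by EvalCallback (path assumed from the training script).
data = np.load(f"./{check_root_dir}/{run}/{mod}/evaluations.npz")

# "results" has shape (n_evaluations, n_eval_episodes): average over episodes per evaluation round.
mean_rewards = data["results"].mean(axis=1)
best_idx = mean_rewards.argmax()
print(f"Best mean reward: {mean_rewards[best_idx]:.2f} at timestep {data['timesteps'][best_idx]}")
print(f"Mean episode length at that point: {data['ep_lengths'][best_idx].mean():.1f}")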
To reproduce this evaluation score, I tried the following:
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

render_m = False
num_eval_eps = 10

vec_env = make_vec_env(env_id=env_id, n_envs=1)
model = PPO("MultiInputPolicy", env=vec_env)
model.load(model_path, deterministic=True)

episode_rewards, _ = evaluate_policy(model, vec_env, n_eval_episodes=num_eval_eps,
                                     render=render_m, deterministic=True,
                                     return_episode_rewards=True)
mean_reward, std_reward = np.mean(episode_rewards), np.std(episode_rewards)
print(f"Mean reward: {mean_reward}, Standard deviation of reward: {std_reward}")
This outputs a mean_reward of -538.81, not -6.44 as expected (the mean episode length is also far off).
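For completeness, the per-episode lengths can be pulled from the same call (a small sketch, reusing model and vec_env from the snippet above):

episode_rewards, episode_lengths = evaluate_policy(model, vec_env, n_eval_episodes=10,
                                                   render=False, deterministic=True,
                                                   return_episode_rewards=True)
# With return_episode_rewards=True, evaluate_policy returns per-episode rewards and lengths.
print(f"Mean episode length: {np.mean(episode_lengths):.1f}")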
For context: evaluate_policy is exactly what EvalCallback uses inside its _on_step method to compute the mean rewards and episode lengths.
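Simplified, the evaluation step of the callback does essentially the following (a sketch based on my reading of the SB3 source, not the exact code; the helper name is mine):

import os
import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy

def eval_like_callback(model, eval_env, best_model_save_path, best_mean_reward,
                       n_eval_episodes=10, deterministic=True, render=True):
    # Same evaluate_policy call the callback makes on the model that is currently training.
    episode_rewards, episode_lengths = evaluate_policy(model, eval_env,
                                                       n_eval_episodes=n_eval_episodes,
                                                       render=render,
                                                       deterministic=deterministic,
                                                       return_episode_rewards=True)
    mean_reward = float(np.mean(episode_rewards))
    if mean_reward > best_mean_reward:
        # The callback writes the current model as best_model.zip under best_model_save_path.
        model.save(os.path.join(best_model_save_path, "best_model"))
        best_mean_reward = mean_reward
    return mean_reward, best_mean_reward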
I made sure the deterministic flag, the number of evaluation episodes, the model path, any wrappers, etc. are identical, so the EvalCallback evaluation should be reproducible, yet the results still differ significantly. Why could that be?