I trained a PPO model while tracking its performance every 20k steps using the EvalCallback wrapper:
import torch as t
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.env_util import make_vec_env

vec_env = make_vec_env(env_id=env_id, n_envs=1)
policy_kwargs = dict(activation_fn=t.nn.Tanh, net_arch=dict(pi=[64, 64], vf=[64, 64]))

eval_env = make_vec_env(env_id=env_id)
eval_callback = EvalCallback(eval_env, best_model_save_path=f"./{check_root_dir}/{run}/{mod}",
                             log_path=f"./{check_root_dir}/{run}/{mod}", eval_freq=eval_freq,
                             deterministic=True, render=True, n_eval_episodes=10)

device = "cuda" if t.cuda.is_available() else "cpu"
model = PPO("MultiInputPolicy", vec_env, verbose=0, tensorboard_log=dir, policy_kwargs=policy_kwargs,
            ent_coef=0.0055,
            vf_coef=0.6,
            gamma=0.8,
            learning_rate=0.001,
            device=device)  # "MultiInputPolicy" because the env has a Dict observation space
model.learn(total_timesteps=2_000_000, callback=eval_callback)
After n evaluation rounds (and the corresponding EvalCallback calls) I get a best mean_reward of -6.44 (according to TensorBoard and evaluations.npz), and the best model is automatically saved to the specified path.
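The -6.44 can be cross-checked directly from evaluations.npz (a minimal sketch; I'm assuming the default EvalCallback log format with timesteps, results and ep_lengths arrays, and the log path from the training script above):

import numpy as np

# Load the evaluation log written by EvalCallback (path assumed from the training script).
data = np.load(f"./{check_root_dir}/{run}/{mod}/evaluations.npz")

# "results" has shape (n_evaluations, n_eval_episodes): average over episodes per evaluation round.
mean_rewards = data["results"].mean(axis=1)
best_idx = mean_rewards.argmax()
print(f"Best mean reward: {mean_rewards[best_idx]:.2f} at timestep {data['timesteps'][best_idx]}")
print(f"Mean episode length at that point: {data['ep_lengths'][best_idx].mean():.1f}")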
To reproduce this evaluation score, I tried the following:
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

render_m = False
num_eval_eps = 10

vec_env = make_vec_env(env_id=env_id, n_envs=1)
model = PPO("MultiInputPolicy", env=vec_env)
model.load(model_path, deterministic=True)

episode_rewards, _ = evaluate_policy(model, vec_env, n_eval_episodes=num_eval_eps,
                                     render=render_m, deterministic=True,
                                     return_episode_rewards=True)
mean_reward, std_reward = np.mean(episode_rewards), np.std(episode_rewards)
print(f"Mean reward: {mean_reward}, Standard deviation of reward: {std_reward}")
This outputs a mean_reward of -538.81, not -6.44 as expected (the mean episode length is also far off).
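For completeness, the per-episode lengths can be pulled from the same call (a small sketch, reusing model and vec_env from the snippet above):

episode_rewards, episode_lengths = evaluate_policy(model, vec_env, n_eval_episodes=10,
                                                   render=False, deterministic=True,
                                                   return_episode_rewards=True)
# With return_episode_rewards=True, evaluate_policy returns per-episode rewards and lengths.
print(f"Mean episode length: {np.mean(episode_lengths):.1f}")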
For context: evaluate_policy is exactly what EvalCallback uses inside its _on_step method to compute the mean rewards and episode lengths.
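Simplified, the evaluation step of the callback does essentially the following (a sketch based on my reading of the SB3 source, not the exact code; the helper name is mine):

import os
import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy

def eval_like_callback(model, eval_env, best_model_save_path, best_mean_reward,
                       n_eval_episodes=10, deterministic=True, render=True):
    # Same evaluate_policy call the callback makes on the model that is currently training.
    episode_rewards, episode_lengths = evaluate_policy(model, eval_env,
                                                       n_eval_episodes=n_eval_episodes,
                                                       render=render,
                                                       deterministic=deterministic,
                                                       return_episode_rewards=True)
    mean_reward = float(np.mean(episode_rewards))
    if mean_reward > best_mean_reward:
        # The callback writes the current model as best_model.zip under best_model_save_path.
        model.save(os.path.join(best_model_save_path, "best_model"))
        best_mean_reward = mean_reward
    return mean_reward, best_mean_reward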
I made sure the deterministic flag, the number of evaluation episodes, the model path, any wrappers, etc. are identical, so the EvalCallback evaluation should be reproducible, yet the results still differ significantly. Why could that be?