
I trained my PPO model while tracking its performance every 20k steps using the EvalCallback wrapper:

import torch as t
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.callbacks import EvalCallback

vec_env = make_vec_env(env_id=env_id, n_envs=1)
policy_kwargs = dict(activation_fn=t.nn.Tanh, net_arch=dict(pi=[64, 64], vf=[64, 64]))

eval_env = make_vec_env(env_id=env_id)
eval_callback = EvalCallback(eval_env, best_model_save_path=f"./{check_root_dir}/{run}/{mod}", log_path=f"./{check_root_dir}/{run}/{mod}", eval_freq=eval_freq, deterministic=True, render=True, n_eval_episodes=10)
device = "cuda" if t.cuda.is_available() else "cpu"

model = PPO("MultiInputPolicy", vec_env, verbose=0, tensorboard_log=dir, policy_kwargs=policy_kwargs, ent_coef=0.0055, vf_coef=0.6, gamma=0.8, learning_rate=0.001, device=device)  # For non-dict observation space
model.learn(total_timesteps=2_000_000, callback=eval_callback)

After $n$ evaluation steps (i.e. $n$ evaluations triggered by the EvalCallback), I get a best mean_reward of -6.44 (according to my TensorBoard logs and evaluations.npz), and the best model is automatically saved to the specified path.
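For reference, this is roughly how I confirm the best mean reward from the logged evaluations.npz (a minimal sketch; the key names are the ones EvalCallback writes as far as I know, and the file sits in the log_path given above):

import numpy as np

# evaluations.npz is written by EvalCallback to log_path;
# "results" has shape (n_evaluations, n_eval_episodes)
data = np.load(f"./{check_root_dir}/{run}/{mod}/evaluations.npz")
mean_rewards = data["results"].mean(axis=1)  # mean reward per evaluation
print("Best mean reward:", mean_rewards.max())
print("At timestep:", data["timesteps"][mean_rewards.argmax()])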

To reproduce this evaluation score, I tried the following:

import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy

render = None
render_m = False
num_eval_eps = 10

vec_env = make_vec_env(env_id=env_id, n_envs=1)
model = PPO("MultiInputPolicy", env=vec_env)
model.load(model_path, deterministic=True)

episode_rewards, _ = evaluate_policy(model, vec_env, n_eval_episodes=10, render=False, deterministic=True, return_episode_rewards=True)
mean_reward, std_reward = np.mean(episode_rewards), np.std(episode_rewards)

print(f'Mean reward: {mean_reward}, Standard deviation of reward: {std_reward}')

This outputs a mean_reward of -538.81, not -6.44 as expected (the mean episode length is also far off).
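The mean episode length I refer to comes from the second return value of evaluate_policy when return_episode_rewards=True; a minimal sketch of how I compute it:

# episode_lengths is the second return value of evaluate_policy
episode_rewards, episode_lengths = evaluate_policy(model, vec_env, n_eval_episodes=10, deterministic=True, return_episode_rewards=True)
print(f'Mean episode length: {np.mean(episode_lengths)}')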

For context: evaluate_policy is what the _on_step method of the EvalCallback class uses internally to compute the mean rewards and episode lengths.
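Simplified, my understanding is that each evaluation step boils down to something like the sketch below (eval_step_sketch is my own name, not the actual SB3 implementation):

import os
import numpy as np
from stable_baselines3.common.evaluation import evaluate_policy

def eval_step_sketch(model, eval_env, n_eval_episodes, deterministic, best_mean_reward, save_path):
    """Rough sketch of what EvalCallback does at each evaluation (simplified)."""
    episode_rewards, episode_lengths = evaluate_policy(
        model, eval_env,
        n_eval_episodes=n_eval_episodes,
        deterministic=deterministic,
        return_episode_rewards=True,
    )
    mean_reward = np.mean(episode_rewards)
    mean_ep_length = np.mean(episode_lengths)
    if mean_reward > best_mean_reward:
        best_mean_reward = mean_reward
        model.save(os.path.join(save_path, "best_model"))  # saved as best_model.zip
    return mean_reward, mean_ep_length, best_mean_reward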

I made sure the deterministic flag, the number of evaluation episodes, the model path, any additional wrappers, etc. are identical, so that I should be reproducing the evaluation done by EvalCallback, yet the results still differ significantly. Why could that be?
