
I am applying PPO to a custom environment, and it struggles when the optimal action lies near the boundary of the action space. I replicated the issue in the following simple environment. There are two state variables, time and a variable x, and at each step the reward is -(x+a)^2 + 1, where a is a parameter. At each step the agent chooses x for the next period, and x must lie in a given interval. The episode runs for 10 periods in total.
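
To make the benchmark explicit, here is the theoretical optimum for the boundary case (a = 0.9, the value used in the script below); the arithmetic is trivial, but this is the number I compare mean_reward against:

# Theoretical benchmark for the boundary case: reward -(x + 0.9)**2 + 1, x in [-1, 1]
a = 0.9
optimal_x = -a                               # -(x + a)**2 + 1 is maximized at x = -a
per_step_reward = -(optimal_x + a) ** 2 + 1  # = 1.0
episode_return = 10 * per_step_reward        # = 10.0 over the 10 periods
print(optimal_x, per_step_reward, episode_return)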

The theoretical optimal policy is simply to choose -a at each step. What I have found, however, is that when the optimal solution is close to the boundary, for example when the period reward is -(x+0.9)^2 + 1 and x is constrained to [-1,1], the best PPO can give me is a policy that picks -1 at every period. PPO does solve the problem when the solution is "interior" enough (e.g., with reward -x^2 + 1), and in that case it converges to a very deterministic policy.
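
This is how I checked what the trained deterministic policy actually does (using model and eval_env from the script below); in the boundary case the action stays pinned at the boundary:

# Roll out one episode with the trained deterministic policy (VecEnv API)
obs = eval_env.reset()
for t in range(10):
    action, _ = model.predict(obs, deterministic=True)
    print(t, action)  # in my boundary-case runs this prints the clipped boundary action -1 every step
    obs, rewards, dones, infos = eval_env.step(action)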

I looked into what is going on during training, and it seems that when the solution is near the boundary, the policy never becomes very deterministic (the std of the Gaussian policy stays at a relatively high level, ~0.3-0.4), and hence it cannot settle down. In fact, mean_reward never fully converges, as it should in such a simple case.
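
For reference, this is how I read the policy std off the trained model (this assumes the default MlpPolicy, which keeps a single state-independent log_std parameter); the train/std curve in TensorBoard shows the same quantity:

# Inspect the learned Gaussian std directly after training
import torch

with torch.no_grad():
    policy_std = torch.exp(model.policy.log_std)
print("policy std:", policy_std.cpu().numpy())  # stays around 0.3-0.4 in the boundary case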

I can roughly see why this happens, but I do not fully understand which feature of PPO leads to it, or how to fix it. I am also surprised because the optimal solution is not even that close to the boundary, yet my PPO agent cannot find it. Can this be fixed via hyperparameter tuning (if so, how?), or is there another algorithm better suited to this kind of task? (I wanted to stick with PPO because it works for the more complicated task I am actually trying to solve.)
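
For what it is worth, one variant I have been considering, though I have not verified that it helps, is to have the policy squash its Gaussian output with tanh instead of relying on clipping; as far as I can tell, in stable-baselines3 squash_output=True is only available together with gSDE:

# Possible variant (untested on my side): tanh-squash actions instead of clipping them.
# Uses the same env as in the script below.
model_squashed = PPO(
    "MlpPolicy",
    env,
    use_sde=True,                            # gSDE: state-dependent exploration noise
    policy_kwargs=dict(squash_output=True),  # tanh squashes actions into [-1, 1]
    learning_rate=1e-3,
    verbose=0,
)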

I am using stable-baselines3. Here is a minimal script that reproduces the problem:

import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.callbacks import EvalCallback
from stable_baselines3.common.utils import set_random_seed
from stable_baselines3.common.env_util import make_vec_env
import os
import torch

Define the environment class

home_dir = os.path.expanduser("~")
dir = home_dir + ''
GLOBAL_SEED = 0
set_random_seed(GLOBAL_SEED)  # This is to ensure reproducibility

#%%
class SimpleOptEnv(gym.Env):
    def __init__(self, max_steps=10):
        super(SimpleOptEnv, self).__init__()
        self.action_space = spaces.Box(low=-1, high=1, shape=(1,), dtype=np.float32)
        self.observation_space = spaces.Box(low=-1, high=1, shape=(2,), dtype=np.float32)
        self.max_steps = max_steps
        self.current_step = 0
        self.state = None

    def reset(self, seed=GLOBAL_SEED, options=None):
        super().reset(seed=seed)
        self.current_step = 0
        self.state = -0.9
        return np.array([self.state, self.current_step / self.max_steps], dtype=np.float32), {}

    def step(self, action):
        self.current_step += 1
        reward = -1 * (self.state + 0.9) ** 2 + 1
        self.state = np.clip(action, -1, 1)[0]
        done = self.current_step >= self.max_steps
        return np.array([self.state, self.current_step / self.max_steps], dtype=np.float32), reward, done, False, {}

Set up the environment

env = make_vec_env(SimpleOptEnv, n_envs=4)

Set up logging

log_dir = os.path.join(dir, 'simple_opt_tensorboard')
os.makedirs(log_dir, exist_ok=True)

Define the learning rate schedule

def linear_schedule(initial_value: float):
    def func(progress_remaining: float) -> float:
        return progress_remaining * initial_value
    return func

Create the PPO agent with improved hyperparameters

model = PPO("MlpPolicy", env,
            learning_rate=1e-3,
            n_steps=512,
            batch_size=64,
            n_epochs=20,
            gamma=0.99,
            gae_lambda=0.95,
            clip_range=0.1,
            ent_coef=0,
            vf_coef=0.5,
            max_grad_norm=0.5,
            verbose=0,
            tensorboard_log=log_dir)

Set up evaluation callback

eval_env = make_vec_env(SimpleOptEnv, n_envs=1)
eval_callback = EvalCallback(eval_env, eval_freq=1000, verbose=0,
                             deterministic=True, render=False, n_eval_episodes=1)

Train the agent

print(f'TensorBoard command: tensorboard serve --logdir="{log_dir}"')
model.learn(total_timesteps=100000, callback=[eval_callback], tb_log_name='PPO')
print("Training completed. You can now run TensorBoard to view the results.")

Evaluate the trained agent

mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10)
print(f"Mean reward: {mean_reward:.2f} +/- {std_reward:.2f}")
