Questions tagged [stable-baselines]

For questions that involve the Stable Baselines libraries for reinforcement learning. Note, however, that purely programming questions are off-topic here; use this tag only to contextualize your problem and solution.

23 questions
5
votes
1 answer

Why is it recommended to use a "separate test environment" when evaluating a model?

I am training an agent (stable baselines3 algorithm) on a custom environment. During training, I want to have a callback so that for every $N$ steps of the learning process, I get the current model and run it on my environment $M$ times and log the…
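A minimal sketch of the setup being described, using stable-baselines3's EvalCallback with a separate evaluation environment; Pendulum-v1 stands in for the custom environment, and eval_freq / n_eval_episodes play the roles of $N$ and $M$:

    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.callbacks import EvalCallback
    from stable_baselines3.common.monitor import Monitor

    train_env = Monitor(gym.make("Pendulum-v1"))
    eval_env = Monitor(gym.make("Pendulum-v1"))  # separate instance used only for evaluation

    eval_callback = EvalCallback(
        eval_env,
        eval_freq=5_000,      # N: evaluate every 5,000 training steps
        n_eval_episodes=10,   # M: run the current policy for 10 episodes
        deterministic=True,
        log_path="./eval_logs",
    )

    model = PPO("MlpPolicy", train_env)
    model.learn(total_timesteps=50_000, callback=eval_callback)

The second gym.make call is what "separate test environment" refers to here: evaluation rollouts run on their own instance rather than on the environment used for training.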
4
votes
1 answer

Training an RL model with an environment where some of the variables do not change as a result of the agent actions

Typically training an RL model requires an action and an observation space, and the agent learns how its actions affect the observations. Even though there are cases where the observation space contains variables that do not change as a result of…
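For illustration only (the names and dynamics below are not from the question), a minimal gymnasium environment whose observation mixes an action-dependent state with an exogenous variable the agent cannot influence:

    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces

    class MixedObsEnv(gym.Env):
        def __init__(self):
            self.observation_space = spaces.Box(-np.inf, np.inf, shape=(2,), dtype=np.float32)
            self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            self.state = 0.0                          # controlled by the agent
            self.exogenous = self.np_random.normal()  # not affected by actions
            self.t = 0
            return np.array([self.state, self.exogenous], dtype=np.float32), {}

        def step(self, action):
            self.state += float(action[0])            # action-dependent part
            self.exogenous = self.np_random.normal()  # evolves independently of the action
            self.t += 1
            reward = -abs(self.state - self.exogenous)
            obs = np.array([self.state, self.exogenous], dtype=np.float32)
            return obs, reward, False, self.t >= 100, {}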
3
votes
0 answers

How to deal with variable action ranges in RL for continuous action spaces

I am reading this paper on battery management using RL. The action consists of the charging/discharging power of the battery at timestep $t$. For instance, in the case of the charging power, the maximum of this action can be given by the maximum…
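One common pattern for this situation, sketched below under the assumption that the policy outputs actions in a fixed [-1, 1] box: rescale inside the environment to the time-varying feasible range. p_max_charge and p_max_discharge are illustrative placeholders for the battery limits, not names from the paper:

    import numpy as np

    def rescale_action(a, low_t, high_t):
        """Map a policy action a in [-1, 1] to the current feasible range [low_t, high_t]."""
        a = float(np.clip(a, -1.0, 1.0))
        return low_t + 0.5 * (a + 1.0) * (high_t - low_t)

    # e.g. inside step(), with limits that change at every timestep:
    # power = rescale_action(action[0], low_t=-p_max_discharge, high_t=p_max_charge)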
2
votes
1 answer

PPO with GNN Actor-Critic Ignores Optimal Action Sequence with Delayed Reward

I am using Stable Baselines3’s implementation of Proximal Policy Optimisation (PPO) with a custom Graph Neural Network (GNN) architecture for both the actor and critic. My discrete action space consists of two actions, and the agent selects a…
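A sketch of how a custom feature extractor is typically wired into SB3's PPO via policy_kwargs; the extractor body here is a plain MLP placeholder rather than the asker's GNN, and all sizes are illustrative:

    import torch.nn as nn
    from stable_baselines3 import PPO
    from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

    class GraphFeatures(BaseFeaturesExtractor):
        def __init__(self, observation_space, features_dim=64):
            super().__init__(observation_space, features_dim)
            n_in = int(observation_space.shape[0])
            self.net = nn.Sequential(
                nn.Linear(n_in, 64), nn.ReLU(),
                nn.Linear(64, features_dim), nn.ReLU(),
            )

        def forward(self, obs):
            return self.net(obs)

    policy_kwargs = dict(
        features_extractor_class=GraphFeatures,
        features_extractor_kwargs=dict(features_dim=64),
        share_features_extractor=False,  # separate extractors for actor and critic
    )
    # model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs)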
2
votes
0 answers

PPO has hard time finding solutions at the boundary of action space

I am applying PPO to a custom environment and it struggles when the optimal action seems to be at the boundary of the action space. I replicated it in the following simple environment. There are two state variables: time and a variable x, and at…
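An illustrative toy environment (not the one from the question) that reproduces the same difficulty: the reward equals the chosen continuous action, so the optimum always sits exactly on the boundary a = +1 of the action space:

    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces

    class BoundaryOptimumEnv(gym.Env):
        def __init__(self):
            self.observation_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
            self.action_space = spaces.Box(-1.0, 1.0, shape=(1,), dtype=np.float32)

        def reset(self, seed=None, options=None):
            super().reset(seed=seed)
            self.t = 0
            return np.zeros(1, dtype=np.float32), {}

        def step(self, action):
            self.t += 1
            reward = float(np.clip(action[0], -1.0, 1.0))  # best possible action is +1
            return np.zeros(1, dtype=np.float32), reward, False, self.t >= 50, {}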
2
votes
0 answers

How does PPO with advantage normalization learn in MountainCar-v0 before first reaching the goal state?

I'm trying to figure out how PPO ever learns anything in a sparse environment like gymnasium's MountainCar-v0 before it first ever reaches the goal state. Specifically, I was looking at stable_baselines3's implementation of PPO: env =…
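The kind of setup the question refers to, as a runnable sketch with illustrative hyperparameters; normalize_advantage is the SB3 PPO flag that standardizes advantages per mini-batch:

    import gymnasium as gym
    from stable_baselines3 import PPO

    env = gym.make("MountainCar-v0")
    model = PPO(
        "MlpPolicy",
        env,
        n_steps=2048,
        normalize_advantage=True,  # standardize advantages within each mini-batch
        verbose=1,
    )
    model.learn(total_timesteps=200_000)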
2
votes
0 answers

Compare Stable-Baselines3 vs. Tianshou

What would you recommend between Stable-Baselines3 and Tianshou for applied research in Reinforcement Learning? Can anyone provide a comparison of the strengths and weaknesses of each library? Or at least some criteria for choosing one over the…
2
votes
2 answers

A2C: Why do episode rewards reset?

I am training a model using A2C with stable baselines 2. When I increased the timesteps I noticed that episode rewards seem to reset (see attached plot). I don't understand where these sudden decays or resets could come from and I am looking for…
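Not the asker's stable baselines 2 code, but a stable-baselines3 sketch of one way to check whether such drops are real: wrapping the environment in Monitor writes per-episode returns to a CSV that can be compared against the smoothed training curve:

    import gymnasium as gym
    from stable_baselines3 import A2C
    from stable_baselines3.common.monitor import Monitor

    # Writes ./a2c_monitor.monitor.csv with one row per finished episode
    env = Monitor(gym.make("CartPole-v1"), filename="./a2c_monitor")
    model = A2C("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=100_000)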
1
vote
1 answer

Always getting the same action from an A2C in stable_baselines3

I'm quite new to RL and have been trying to train an A2C model from stable_baselines3 to derive an integer sequence based on 3 other input sequences of floats. I have a custom gym environment that computes the agent rewards based on the actions…
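A quick diagnostic sketch (CartPole-v1 and the short training run are placeholders, not the asker's setup): compare greedy and sampled predictions and inspect the action distribution to see whether the policy has collapsed onto a single action:

    import gymnasium as gym
    import torch as th
    from stable_baselines3 import A2C

    env = gym.make("CartPole-v1")
    model = A2C("MlpPolicy", env).learn(total_timesteps=5_000)

    obs, _ = env.reset()
    print(model.predict(obs, deterministic=True)[0])   # greedy action
    print(model.predict(obs, deterministic=False)[0])  # sampled action

    obs_tensor, _ = model.policy.obs_to_tensor(obs)
    with th.no_grad():
        dist = model.policy.get_distribution(obs_tensor)
    print(dist.distribution.probs)  # near one-hot probabilities => collapsed policy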
1
vote
0 answers

How to normalize input data for a reinforcement learning platform (Gym and stable-baselines)

I created a custom environment with Gym and trained it with Stable Baselines3 algorithms. The observation and action spaces are both continuous. The observation space includes 10 values and the action space has 2. The action space is [0,1] and I know it's…
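A common way to do this in SB3 is to wrap the vectorized environment in VecNormalize, sketched below with Pendulum-v1 standing in for the custom 10-observation / 2-action environment:

    import gymnasium as gym
    from stable_baselines3 import PPO
    from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

    venv = DummyVecEnv([lambda: gym.make("Pendulum-v1")])
    venv = VecNormalize(venv, norm_obs=True, norm_reward=True, clip_obs=10.0)

    model = PPO("MlpPolicy", venv)
    model.learn(total_timesteps=50_000)

    venv.save("vecnormalize.pkl")  # keep the running statistics for evaluation/deployment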
1
vote
0 answers

What method is better to use for a two-player reinforcement learning environment?

I want to create an RL agent for a mancala-type two-player game as my first actual project in the field. I've already completed the game itself and coded a minimax algorithm. The question is: how should I proceed? Which is the better way: to create…
0
votes
1 answer

Almost no FPS improvement comparing sbx (sb3 + jax) PPO with sb3 PPO

Background: sbx is a JAX implementation of stable-baselines3. As claimed here, it can accelerate RL training through JIT compilation compared to sb3 + PyTorch. Question: I tested sbx PPO and sb3 PPO on the gym env Hopper-v5. The results show that neither sbx on CPU nor sbx on…
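A rough benchmarking sketch along the lines of the comparison described, assuming both stable_baselines3 and sbx are installed and Hopper-v5 is available through gymnasium's MuJoCo extras; it times a short learn() call for each implementation rather than reading the logged fps:

    import time
    import gymnasium as gym
    from stable_baselines3 import PPO as TorchPPO
    from sbx import PPO as JaxPPO

    for name, algo in [("sb3 (PyTorch)", TorchPPO), ("sbx (JAX)", JaxPPO)]:
        env = gym.make("Hopper-v5")
        model = algo("MlpPolicy", env)
        start = time.perf_counter()
        model.learn(total_timesteps=50_000)
        elapsed = time.perf_counter() - start
        print(f"{name}: {50_000 / elapsed:.0f} steps/s")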
0
votes
1 answer

Reward not improving for a custom environment using PPO

I've been trying to train an agent on a custom environment I implemented with gym where the goal is to resolve voltage violations in a power grid by adjusting the active power (loads) at each node. I tried mainly two algorithms, via…
0
votes
0 answers

Cannot reproduce evaluation scores of "EvalCallback" - stable_baselines3

I trained my PPO model while tracking the model's performance every 20k steps using the EvalCallback wrapper: vec_env = make_vec_env(env_id=env_id, n_envs=1) policy_kwargs = dict(activation_fn=t.nn.Tanh, net_arch=dict(pi=[64, 64], vf=[64,…
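A sketch of the kind of post-hoc check involved: re-evaluating a saved model with evaluate_policy on a freshly built, seeded evaluation environment (the env id and file name below are placeholders for the asker's setup):

    from stable_baselines3 import PPO
    from stable_baselines3.common.env_util import make_vec_env
    from stable_baselines3.common.evaluation import evaluate_policy

    eval_env = make_vec_env("Pendulum-v1", n_envs=1, seed=0)
    model = PPO.load("best_model.zip")

    mean_reward, std_reward = evaluate_policy(
        model, eval_env, n_eval_episodes=10, deterministic=True
    )
    print(mean_reward, std_reward)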
0
votes
0 answers

Stable Baselines 3 multiple custom networks in one agent

I'm working with an environment that can easily be subdivided into two parts, with part 1 having an indirect effect on part 2, but I can't simulate either part alone in a realistic way. Also, both parts of the environment are, on their own,…
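An illustrative sketch (not the asker's architecture): a single SB3 features extractor containing two separate sub-networks, each processing its own slice of the observation, with the outputs concatenated before the shared policy and value heads; split_index and the layer sizes are arbitrary placeholders:

    import torch as th
    import torch.nn as nn
    from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

    class TwoPartExtractor(BaseFeaturesExtractor):
        def __init__(self, observation_space, split_index=4, features_dim=64):
            super().__init__(observation_space, features_dim)
            n_total = int(observation_space.shape[0])
            self.split_index = split_index
            self.net_a = nn.Sequential(nn.Linear(split_index, 32), nn.ReLU())
            self.net_b = nn.Sequential(nn.Linear(n_total - split_index, 32), nn.ReLU())
            self.combine = nn.Linear(64, features_dim)

        def forward(self, obs):
            part_a = self.net_a(obs[:, : self.split_index])
            part_b = self.net_b(obs[:, self.split_index :])
            return self.combine(th.cat([part_a, part_b], dim=1))

    # policy_kwargs = dict(features_extractor_class=TwoPartExtractor,
    #                      features_extractor_kwargs=dict(split_index=4, features_dim=64))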