
I am using Stable Baselines3’s implementation of Proximal Policy Optimisation (PPO) with a custom Graph Neural Network (GNN) architecture for both the actor and critic. My discrete action space consists of two actions, and the agent selects a sequence of two actions per data point (resulting in four possible combinations).

Problem Setup

•   My dataset consists of two graph classes: Class 1 and Class 2.
•   The optimal action sequence for each class is:
    •   Class 1: (Action 2 → Action 1) – this sequence provides an immediate reward for Action 2, and a maximum total reward when followed by Action 1.
    •   Class 2: (Action 1 → Action 2) – Action 1 initially gives no reward, but when followed by Action 2, it results in the maximum total reward (a toy sketch of this reward structure follows the list).
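
To make the reward structure concrete, here is a toy sketch of it (the reward values are made up and only the ordering of the totals matters; the real observations are graphs and are omitted):

```python
def episode_reward(graph_class: int, a1: int, a2: int) -> float:
    """Toy stand-in for my reward structure; the numbers are invented."""
    if graph_class == 1:
        r1 = 1.0 if a1 == 2 else 0.0                # Action 2 pays immediately
        r2 = 2.0 if (a1 == 2 and a2 == 1) else 0.0  # (2 -> 1) gives the maximum total
    else:  # Class 2
        r1 = 1.0 if a1 == 2 else 0.0                # Action 2 still pays a little up front
        r2 = 3.0 if (a1 == 1 and a2 == 2) else 0.0  # (1 -> 2) gives the maximum total
    return r1 + r2
```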

Observed Issue

•   The model appears to be learning something, as the loss profile looks reasonable, and it correctly learns the optimal sequence for Class 1.
•   However, for Class 2, it never learns the optimal sequence (Action 1 → Action 2).
•   Instead, the agent always picks (Action 2 → Action 2), which provides some reward immediately, but results in a suboptimal total reward.
•   My hypothesis is that the model is ignoring the delayed reward from Action 1 and instead favours sequences that give immediate reward, leading to a local optimum.

Question

How can I encourage the agent to properly learn the optimal action sequence for Class 2, where the first action provides no immediate reward, but leads to a higher total reward?

Some potential ideas I’ve considered but would love feedback on:

1.  Discount factor (γ) tuning – I have tried increasing gae_lambda to 0.97 or 0.98, but this doesn't help.
2.  Increasing the entropy coefficient to encourage exploration – I have tried ent_coef values in the range 0.01–0.05, but this also doesn't help (see the config sketch after this list).
3.  GNN architecture modifications – Could the critic be failing to properly estimate future rewards?
4.  Intrinsic motivation techniques – Would something like curiosity-driven learning help here?
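
For reference, this is roughly how I am setting these hyperparameters (a minimal sketch; CartPole-v1 stands in for my actual graph environment and custom GNN policy, which I have omitted):

```python
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")  # stand-in for my graph environment

model = PPO(
    "MlpPolicy",      # my real setup uses a custom GNN features extractor instead
    env,
    gamma=0.99,       # discount factor: the step-2 reward is barely discounted
    gae_lambda=0.98,  # GAE lambda: bias/variance of the advantage, not the discount itself
    ent_coef=0.02,    # entropy bonus: larger values keep exploring (Action 1 -> Action 2) longer
    n_steps=2048,
    verbose=1,
)
model.learn(total_timesteps=100_000)
```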

Any advice or alternative suggestions would be greatly appreciated, as I feel that this is a fairly simple problem, one that the agent should be able to deal with!

Pronitron

1 Answer


You will probably need n-step bootstrapping, eligibility traces, or a state encoding that includes a history of actions.

Consider it from the agent's point of view. Door 1 (Action 1) sometimes gives you nothing. Door 2 (Action 2) always gives something.

If you consider it from a Q-table perspective, with no action history in the state, door 1 is worth zero plus gamma times door 2’s reward, while door 2 is worth door 2’s reward plus gamma times door 2’s reward. There is a clear “winner”.
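
In numbers (made-up values, purely to illustrate the comparison):

```python
gamma = 0.99
r2 = 1.0  # assumed immediate reward for door 2

# Without action history, both first choices land in the "same" state, so the
# follow-up value is identical and only the immediate reward differs:
q_door1 = 0.0 + gamma * r2  # 0.99
q_door2 = r2 + gamma * r2   # 1.99 -> door 2 wins every time
```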

When you include a history of actions, door 1 leads to a state where door 1 was the last action, and in that state door 2 pays the maximum reward. So door 1’s update can include the discounted maximum reward, rather than the average of all interactions with door 2. As long as gamma times the maximum reward is greater than door 2’s reward plus gamma times door 2’s reward, it should get the gist.
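
One way to give the agent that history in a Stable Baselines3 / Gymnasium setup is an observation wrapper. This is only a sketch that assumes a flat Box observation and a Discrete action space; with a graph observation you would instead add the one-hot as an extra input and concatenate it inside your custom GNN features extractor:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class LastActionWrapper(gym.Wrapper):
    """Append a one-hot encoding of the previous action to the observation."""

    def __init__(self, env: gym.Env):
        super().__init__(env)
        self.n_actions = env.action_space.n
        low = np.concatenate([env.observation_space.low, np.zeros(self.n_actions)])
        high = np.concatenate([env.observation_space.high, np.ones(self.n_actions)])
        self.observation_space = spaces.Box(low=low, high=high, dtype=np.float32)

    def _augment(self, obs, last_action):
        one_hot = np.zeros(self.n_actions, dtype=np.float32)
        if last_action is not None:
            one_hot[last_action] = 1.0
        return np.concatenate([obs, one_hot]).astype(np.float32)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._augment(obs, None), info  # no previous action at episode start

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self._augment(obs, action), reward, terminated, truncated, info
```

With the last action in the state, door 1 and door 2 lead to genuinely different successor states, which is what lets the bootstrapped update pick up the delayed reward.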

foreverska