I am using Stable Baselines3’s implementation of Proximal Policy Optimisation (PPO) with a custom Graph Neural Network (GNN) architecture for both the actor and critic. My discrete action space consists of two actions, and the agent selects a sequence of two actions per data point (resulting in four possible combinations).
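For context, the policy is wired up roughly like this (a minimal sketch, not my actual code: the extractor below is a stand-in MLP on a flattened observation and the env is a placeholder, but it shows how SB3 routes the extractor output into both the actor and critic heads):

```python
# Minimal sketch of the setup (names and the placeholder env are
# illustrative, not my actual code). SB3 feeds the features extractor's
# output to both the actor and the critic heads.
import torch
import torch.nn as nn
from gymnasium import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor

class GNNExtractor(BaseFeaturesExtractor):
    """Stand-in for my custom GNN; assumes the graph observation is
    flattened into a fixed-size Box space."""
    def __init__(self, observation_space: spaces.Box, features_dim: int = 64):
        super().__init__(observation_space, features_dim)
        n_in = int(observation_space.shape[0])
        # The real extractor does message passing; an MLP keeps the sketch short.
        self.net = nn.Sequential(nn.Linear(n_in, features_dim), nn.ReLU())

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

model = PPO(
    "MlpPolicy",
    "CartPole-v1",  # placeholder env; my real env serves graph observations
    policy_kwargs=dict(features_extractor_class=GNNExtractor),
)
```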
Problem Setup
• My dataset consists of two graph classes: Class 1 and Class 2.
• The optimal action sequence for each class is:
• Class 1: (Action 2 → Action 1) – Action 2 provides an immediate reward, and following it with Action 1 yields the maximum total reward.
• Class 2: (Action 1 → Action 2) – Action 1 initially gives no reward, but following it with Action 2 yields the maximum total reward.
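For concreteness, here is a toy environment that reproduces this reward structure (the reward magnitudes and the observation encoding are illustrative placeholders; my real environment serves graphs):

```python
# Toy reproduction of the reward structure above; the reward values
# (0.5, 1.0) and the observation encoding are illustrative placeholders.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class TwoStepEnv(gym.Env):
    """Each episode is one data point: two steps, two discrete actions.
    Observation = [is_class_1, is_class_2, step_index]."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(0.0, 1.0, shape=(3,), dtype=np.float32)
        self.action_space = spaces.Discrete(2)  # 0 = Action 1, 1 = Action 2

    def _obs(self):
        return np.array([self.cls == 1, self.cls == 2, self.t], dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.cls = int(self.np_random.integers(1, 3))  # sample Class 1 or 2
        self.t = 0
        self.first = None
        return self._obs(), {}

    def step(self, action):
        if self.t == 0:
            self.first = int(action)
            self.t = 1
            # Action 2 pays immediately; Action 1 pays nothing up front.
            reward = 0.5 if self.first == 1 else 0.0
            return self._obs(), reward, False, False, {}
        # Second step: a bonus only for the class's optimal pair.
        optimal = (1, 0) if self.cls == 1 else (0, 1)  # (A2, A1) vs (A1, A2)
        reward = 1.0 if (self.first, int(action)) == optimal else 0.0
        return self._obs(), reward, True, False, {}
```

With these placeholder values, (Action 1 → Action 2) on Class 2 totals 1.0, while the greedy (Action 2 → Action 2) totals only 0.5.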
Observed Issue
• The model does appear to be learning: the loss profile looks reasonable, and it correctly learns the optimal sequence for Class 1.
• However, for Class 2, it never learns the optimal sequence (Action 1 → Action 2).
• Instead, the agent always picks (Action 2 → Action 2), which provides some reward immediately, but results in a suboptimal total reward.
• My hypothesis is that the model is ignoring the delayed reward from Action 1 and instead favours sequences that give immediate reward, leading to a local optimum.
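For what it's worth, this is the kind of diagnostic I use to confirm the Class 2 behaviour (a sketch: `model` is the trained PPO instance, and the observation follows the toy encoding sketched above):

```python
# Diagnostic sketch: action probabilities and the critic's value estimate
# at the first step of a Class 2 episode. `model` is the trained PPO
# instance; the observation encoding follows the toy env above.
import numpy as np
import torch

obs = np.array([0.0, 1.0, 0.0], dtype=np.float32)  # Class 2, step 0
obs_tensor, _ = model.policy.obs_to_tensor(obs)
with torch.no_grad():
    dist = model.policy.get_distribution(obs_tensor)
    probs = dist.distribution.probs[0]  # [P(Action 1), P(Action 2)]
    value = model.policy.predict_values(obs_tensor)
print("action probabilities:", probs.cpu().numpy())
print(f"critic value estimate: {value.item():.3f}")
```

If P(Action 1) is near zero and the value estimate roughly matches the (Action 2 → Action 2) return, that would support the local-optimum hypothesis.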
Question
How can I encourage the agent to properly learn the optimal action sequence for Class 2, where the first action provides no immediate reward, but leads to a higher total reward?
Some potential ideas I’ve considered but would love feedback on:
1. Discount factor (γ) / GAE tuning – I have tried increasing gae_lambda (the GAE smoothing parameter, distinct from γ itself) to 0.97 or 0.98, but this doesn't help (see the config sketch after this list).
2. Increasing the entropy coefficient to encourage exploration – I have tried ent_coef values from 0.01 to 0.05, but this also doesn't help.
3. GNN architecture modifications – Could the critic be failing to properly estimate future rewards?
4. Intrinsic motivation techniques – Would something like curiosity-driven learning help here?
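For reference, the configuration for points 1 and 2 looks roughly like this (a sketch: everything not shown is left at the SB3 defaults, and I am assuming γ stays at its default of 0.99 while only gae_lambda and ent_coef are varied):

```python
# Rough configuration for points 1 and 2 above; everything not shown is
# left at the Stable Baselines3 defaults.
from stable_baselines3 import PPO

env = TwoStepEnv()  # the toy environment sketched above (stand-in for my graph env)
model = PPO(
    "MlpPolicy",      # stand-in; the real run uses the custom GNN extractor
    env,
    gamma=0.99,       # discount factor, assumed left at the SB3 default
    gae_lambda=0.98,  # raised from the default 0.95 (point 1)
    ent_coef=0.05,    # raised from the default 0.0 (point 2)
    verbose=1,
)
model.learn(total_timesteps=100_000)
```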
Any advice or alternative suggestions would be greatly appreciated, as this feels like a fairly simple problem that the agent should be able to solve!