I've been trying to train agents to solve a non-Markovian task in a modified version of PettingZoo's Waterworld. In the task, I have two pursuers (the agents I'm training) and three evaders. Without going into too much detail, the task consists of the following: in the first "stage", one pursuer has to catch a specific evader; in the second stage, the second pursuer has to catch a(nother) specific evader; and in the final stage, the two pursuers have to catch the last evader together (i.e., at the same time). I give the agents a small reward every time they complete a stage, plus a larger reward at the end of the task. To make the task non-Markovian, the agents don't have access to the current stage of the task.
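For concreteness, the reward logic is roughly like the sketch below. This is purely illustrative: the stage bookkeeping, the evader/pursuer identities, and the reward values are placeholders, not the actual code of my modified environment.

```python
# Illustrative sketch of the staged reward logic (values and IDs are placeholders).
STAGE_REWARD = 1.0    # small reward per completed stage
FINAL_REWARD = 10.0   # larger reward when the whole task is done


class StagedTaskTracker:
    """Tracks task progress; the stage is never exposed in the agents' observations."""

    def __init__(self):
        self.stage = 0  # 0, 1, or 2

    def step_rewards(self, caught):
        """`caught` maps pursuer id -> evader id caught this step (or None)."""
        rewards = {"pursuer_0": 0.0, "pursuer_1": 0.0}
        if self.stage == 0 and caught.get("pursuer_0") == "evader_0":
            rewards["pursuer_0"] += STAGE_REWARD
            self.stage = 1
        elif self.stage == 1 and caught.get("pursuer_1") == "evader_1":
            rewards["pursuer_1"] += STAGE_REWARD
            self.stage = 2
        elif self.stage == 2 and (
            caught.get("pursuer_0") == "evader_2"
            and caught.get("pursuer_1") == "evader_2"
        ):
            # both pursuers catch the last evader at the same time
            rewards = {a: r + STAGE_REWARD + FINAL_REWARD for a, r in rewards.items()}
        return rewards
```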
I've been training the agents using the Ray RLlib implementation of PPO and, to my surprise, they eventually learn to complete the task. I'll be honest: I wasn't really expecting this. Admittedly, their performance isn't great, but they still manage to complete the task, which I thought wouldn't be possible with (Ray RLlib's) vanilla PPO, since the default policy is a memoryless feed-forward network. I've also tried disabling GAE (after finding a Reddit comment suggesting it might help with non-Markovianity), but the agents still learned to complete the task.
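For reference, the relevant part of the setup looks roughly like the sketch below (RLlib 2.x-style config; the environment name, policy IDs, and hyperparameter values are placeholders rather than my exact settings):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # "staged_waterworld" stands in for my registered multi-agent env
    .environment("staged_waterworld")
    .training(
        use_gae=False,  # the GAE toggle I experimented with (True by default)
        gamma=0.99,     # placeholder hyperparameters
        lr=5e-5,
    )
    .multi_agent(
        # one policy per pursuer; agent IDs follow PettingZoo's naming
        policies={"pursuer_0", "pursuer_1"},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: agent_id,
    )
)

algo = config.build()
for _ in range(100):
    algo.train()
```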
My question is the following: assuming there are no bugs in my implementation, is there an intuitive explanation for why PPO is able to solve such a task? Is it simply that the task is too easy?